MKT 566
Homework 3 - Customer Segmentation with Clustering
Context
You are acting as analytics consultants for a retail company. The company has collected data on its customers, including demographics, spending patterns, and behavior. Management wants to better understand its customer base and design targeted marketing campaigns aimed at increasing sales. Your task is to analyze the dataset, segment customers using clustering techniques, and provide recommendations on how the company should act on your insights. For example, which types of products should be recommended to each customer segment? How should marketing campaigns be tailored to different segments based on their shopping behavior and demographics?
Dataset
For this assignment, use the dataset provided here. Read the data directly into R or Python using the dataset's URL.
The dataset contains information on 2240 customers and their purchasing behavior. The dataset contains 29 variables, including customer demographics (Year_Birth, Education, Marital_Status, Income, Kidhome, Teenhome), purchase history (MntWines, MntMeatProducts, etc.), marketing responses (AcceptedCmp1–AcceptedCmp5), and other behavioral indicators.
The full set of variables and their descriptions is the following:
. Id: Unique identifier for each individual in the dataset.
. Year_Birth: The birth year of the individual.
. Education: The highest level of education attained by the individual.
. Marital_Status: The marital status of the individual.
. Income: The annual income of the individual.
. Kidhome: The number of young children in the household.
. Teenhome: The number of teenagers in the household.
. Dt_Customer: The date when the customer was first enrolled or became a part of the company's database.
. Recency: The number of days since the last purchase or interaction.
. MntWines: The amount spent on wines.
. MntFruits: The amount spent on fruits.
. MntMeatProducts: The amount spent on meat products.
. MntFishProducts: The amount spent on fish products.
. MntSweetProducts: The amount spent on sweet products.
. MntGoldProds: The amount spent on gold products.
. NumDealsPurchases: The number of purchases made with a discount or as part of a deal.
. NumWebPurchases: The number of purchases made through the company's website.
. NumCatalogPurchases: The number of purchases made through catalogs.
. NumStorePurchases: The number of purchases made in physical stores.
. NumWebVisitsMonth: The number of visits to the company's website in a month.
. AcceptedCmp3: Binary indicator (1 or 0) whether the individual accepted the third marketing campaign.
. AcceptedCmp4: Binary indicator (1 or 0) whether the individual accepted the fourth marketing campaign.
. AcceptedCmp5: Binary indicator (1 or 0) whether the individual accepted the fifth marketing campaign.
. AcceptedCmp1: Binary indicator (1 or 0) whether the individual accepted the first marketing campaign.
. AcceptedCmp2: Binary indicator (1 or 0) whether the individual accepted the second marketing campaign.
. Complain: Binary indicator (1 or 0) whether the individual has made a complaint.
. Z_CostContact: A constant cost associated with contacting a customer.
. Z_Revenue: A constant revenue associated with a successful campaign response.
. Response: Binary indicator (1 or 0) whether the individual responded to the marketing campaign.
Tasks
1. Data Cleaning & Preparation
. Inspect the dataset for missing values, duplicates, and outliers.
. Handle missing Income values appropriately (drop or impute).
. Remove unrealistic birth years (e.g., customers older than 100).
. Drop the columns Z_CostContact and Z_Revenue, as they are constants and not useful for analysis.
. Convert Dt_Customer to datetime format. Create a Tenure variable (number of years as a customer).
. Consolidate rare categories in Marital_Status and ensure Education is properly encoded.
. Engineer new features:
Age = current year – Year_Birth
Children = Kidhome + Teenhome
TotalSpend = sum of all product spending variables
TotalPurchases = sum of all purchase channel variables
. Standardize/scale all numerical features before clustering. Explain why scaling is important.
. Provide a summary table of the cleaned dataset (rows, columns, and new variables).
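The preparation steps above can be sketched in Python with pandas. This is a minimal illustration under stated assumptions, not a required solution: the five-row `raw` frame is made-up sample data standing in for the real dataset, `REFERENCE_DATE` is an arbitrary fixed date chosen so results are reproducible, median imputation is only one of the allowed options for Income, and treating the web, catalog, and store counts as the "purchase channel" variables is an interpretation.

```python
import pandas as pd

# Hypothetical sample rows; the real data has 2240 customers and 29 columns.
raw = pd.DataFrame({
    "Year_Birth": [1985, 1970, 1899, 1992, 1960],
    "Income": [52000.0, None, 41000.0, 78000.0, 63000.0],
    "Kidhome": [1, 0, 0, 2, 0],
    "Teenhome": [0, 1, 0, 0, 1],
    "Dt_Customer": ["2013-04-10", "2012-09-01", "2014-01-15",
                    "2013-11-30", "2012-12-24"],  # assumed ISO date strings
    "MntWines": [120, 300, 50, 10, 400],
    "MntFruits": [10, 20, 5, 2, 30],
    "MntMeatProducts": [80, 150, 20, 15, 200],
    "MntFishProducts": [15, 30, 5, 3, 40],
    "MntSweetProducts": [5, 10, 2, 1, 20],
    "MntGoldProds": [20, 40, 8, 4, 60],
    "NumWebPurchases": [4, 6, 1, 2, 7],
    "NumCatalogPurchases": [1, 3, 0, 0, 5],
    "NumStorePurchases": [5, 8, 2, 3, 9],
    "Z_CostContact": [3] * 5,
    "Z_Revenue": [11] * 5,
})

# Assumption: a fixed reference date instead of "today", for reproducibility.
REFERENCE_DATE = pd.Timestamp("2025-01-01")

# Drop the two constant columns.
df = raw.drop(columns=["Z_CostContact", "Z_Revenue"])

# Age, then remove unrealistic birth years (older than 100).
df["Age"] = REFERENCE_DATE.year - df["Year_Birth"]
df = df[df["Age"] <= 100].copy()

# Impute missing Income with the median (dropping the rows is the alternative).
df["Income"] = df["Income"].fillna(df["Income"].median())

# Tenure in years since enrollment.
df["Dt_Customer"] = pd.to_datetime(df["Dt_Customer"])
df["Tenure"] = (REFERENCE_DATE - df["Dt_Customer"]).dt.days / 365.25

# Engineered features.
spend_cols = ["MntWines", "MntFruits", "MntMeatProducts",
              "MntFishProducts", "MntSweetProducts", "MntGoldProds"]
channel_cols = ["NumWebPurchases", "NumCatalogPurchases", "NumStorePurchases"]
df["Children"] = df["Kidhome"] + df["Teenhome"]
df["TotalSpend"] = df[spend_cols].sum(axis=1)
df["TotalPurchases"] = df[channel_cols].sum(axis=1)

# Standardize to mean 0 and standard deviation 1 (z-scores).
numeric = ["Income", "Age", "Tenure", "Children", "TotalSpend", "TotalPurchases"]
scaled = (df[numeric] - df[numeric].mean()) / df[numeric].std(ddof=0)

print(df[numeric].describe())
```

K-means assigns points by Euclidean distance, so without scaling a variable measured in large units (such as Income) would dominate variables measured in small units (such as Children).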
2. Exploratory Data Analysis (EDA)
. Report descriptive statistics for key variables.
. Explore correlations between numerical variables (e.g., income, age, total spend). HINT: You don't want to include highly correlated variables when doing clustering.
. Create plots to explore distributions (e.g., income, age, total spend).
. Comment on notable trends or outliers.
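One possible way to screen for highly correlated variables is sketched below. The data are synthetic (generated with a fixed seed so that TotalSpend is deliberately tied to Income), and the 0.8 cutoff stored in `THRESHOLD` is an illustrative assumption, not a rule from the assignment.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in data; TotalSpend is built to depend strongly on Income.
rng = np.random.default_rng(0)
n = 200
income = rng.normal(50000, 15000, n)
df = pd.DataFrame({
    "Income": income,
    "Age": rng.integers(20, 80, n),
    "TotalSpend": 0.02 * income + rng.normal(0, 100, n),
})

# Descriptive statistics for the key variables.
print(df.describe())

# Pairwise Pearson correlations.
corr = df.corr()
print(corr.round(2))

# Assumption: flag pairs whose absolute correlation exceeds this cutoff.
THRESHOLD = 0.8
cols = list(corr.columns)
high = [
    (a, b, corr.loc[a, b])
    for i, a in enumerate(cols)
    for b in cols[i + 1:]
    if abs(corr.loc[a, b]) > THRESHOLD
]
for a, b, r in high:
    print(f"{a} and {b} are highly correlated (r = {r:.2f})")
```

When a pair is flagged, keeping only one of the two variables for clustering avoids counting the same information twice in the distance calculation.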
3. Clustering
. Choose a set of variables for clustering (e.g., you can focus on variables like income and demographics). HINT: Note that variables need to be numeric, so if you plan to use categorical variables, you will need to convert them to numeric first (e.g., one-hot encoding or dummies).
. Apply K-means clustering on the set of variables you chose.
. Use the elbow curve method to determine the optimal number of clusters.
. Report the final number of clusters chosen and explain your reasoning.
. Report the size of each cluster (i.e., number of customers in each cluster).
. Compute and report the cluster mean values of each variable used in the clustering analysis (HINT: This will help with clustering interpretation).
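The clustering steps above might look like the following sketch using scikit-learn. The data are three synthetic customer groups with made-up Income and Age values, so the "elbow" at k = 3 is built in by construction; with the real data the choice of k needs to be read off the curve and justified. The range of k values tried (1 to 8) is also an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic data: three hypothetical groups of 50 customers each.
rng = np.random.default_rng(42)
groups = [(30000, 30), (60000, 55), (95000, 40)]  # (mean Income, mean Age)
X = np.vstack([
    np.column_stack([rng.normal(inc, 4000, 50), rng.normal(age, 3, 50)])
    for inc, age in groups
])
df = pd.DataFrame(X, columns=["Income", "Age"])

# Scale before clustering so both variables contribute comparably.
X_scaled = StandardScaler().fit_transform(df)

# Elbow curve: within-cluster sum of squares (inertia) for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_
for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")

# Fit the final model with the chosen k (3 here, by construction of the data).
final = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_scaled)
df["cluster"] = final.labels_

# Cluster sizes and cluster means in the original (unscaled) units.
sizes = df["cluster"].value_counts().sort_index()
means = df.groupby("cluster")[["Income", "Age"]].mean()
print(sizes)
print(means.round(1))
```

Plotting `inertias` against k (for example with matplotlib) gives the elbow curve; reporting means in original units rather than z-scores tends to make the clusters easier to interpret.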
4. Dimensionality Reduction (PCA) and cluster visualization
. Run PCA for visualization of your clusters.
. Report the proportion of variance explained by the first two principal components.
. Explore the loadings of the first two PCA components: which original variables contribute most?
. Plot the clusters in two dimensions using PC1 and PC2.
. Comment on how well-separated the clusters appear in the PCA plot.
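A hedged sketch of the PCA steps with scikit-learn follows. The four variables and their values are synthetic (TotalSpend is generated to track Income), so the numbers printed here say nothing about the real dataset.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with one strongly correlated pair.
rng = np.random.default_rng(1)
n = 300
income = rng.normal(50000, 15000, n)
df = pd.DataFrame({
    "Income": income,
    "TotalSpend": 0.02 * income + rng.normal(0, 100, n),
    "Age": rng.integers(20, 80, n),
    "Children": rng.integers(0, 4, n),
})

# PCA is sensitive to scale, so standardize first.
X_scaled = StandardScaler().fit_transform(df)

# Keep the first two principal components for plotting.
pca = PCA(n_components=2)
scores = pca.fit_transform(X_scaled)  # shape (n, 2): PC1 and PC2 coordinates

# Proportion of variance explained by PC1 and PC2.
explained = pca.explained_variance_ratio_
print(f"PC1: {explained[0]:.2%}, PC2: {explained[1]:.2%}, "
      f"total: {explained.sum():.2%}")

# Loadings: one row per original variable, one column per component.
loadings = pd.DataFrame(
    pca.components_.T, index=df.columns, columns=["PC1", "PC2"]
)
print(loadings.round(2))
```

To visualize clusters, `scores[:, 0]` and `scores[:, 1]` can be passed to a scatter plot and colored by the K-means labels. Variables with the largest absolute loadings on a component contribute most to it.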
5. Cluster Profiling & Interpretation
. Interpret the clusters based on the original variables.
. Assign managerial labels to each cluster (e.g., “Young Budget Families,” “High Income – Low Spending”).
. Describe the key characteristics of each cluster beyond the variables used for clustering. Do you observe specific spending behavior or marketing response?
6. Regression Analysis
. Choose one or more outcome variables of interest (e.g., TotalSpend, NumWebPurchases).
. Run a regression model where the dependent variable is your chosen outcome, and the main independent variable is the cluster assignment (i.e., include cluster dummies).
. Interpret the regression coefficients: how does customer shopping behavior differ across clusters?
. Discuss how regression results complement your clustering analysis.
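The regression on cluster dummies can be illustrated as below. This sketch fits ordinary least squares with NumPy on synthetic data (three hypothetical clusters with made-up spending levels); a dedicated regression package would additionally report standard errors and p-values, which this example does not compute.

```python
import numpy as np
import pandas as pd

# Synthetic data: 100 customers per cluster with different mean spending.
rng = np.random.default_rng(7)
cluster = np.repeat([0, 1, 2], 100)
assumed_means = np.array([200.0, 600.0, 1200.0])  # hypothetical values
total_spend = assumed_means[cluster] + rng.normal(0, 50, size=cluster.size)
df = pd.DataFrame({"cluster": cluster, "TotalSpend": total_spend})

# Cluster dummies; cluster 0 is dropped and serves as the baseline.
dummies = pd.get_dummies(df["cluster"], prefix="cluster", drop_first=True)
dummies = dummies.astype(float)

# Design matrix: intercept column plus the dummies.
X = np.column_stack([np.ones(len(df)), dummies.to_numpy()])
y = df["TotalSpend"].to_numpy()

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
intercept, b1, b2 = coef

print(f"Intercept (mean TotalSpend in cluster 0): {intercept:.1f}")
print(f"cluster_1 (difference from cluster 0):    {b1:.1f}")
print(f"cluster_2 (difference from cluster 0):    {b2:.1f}")
```

With only cluster dummies on the right-hand side, the intercept equals the baseline cluster's mean outcome and each dummy coefficient is that cluster's mean difference from the baseline.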
7. Managerial Recommendations
. Based on the cluster and regression analyses, suggest at least three actionable marketing strategies (e.g., targeted promotions, differentiated campaigns, product bundling).
. Connect your recommendations directly to the analysis you performed.
8. AI Log
. Briefly describe any use of AI tools (e.g., ChatGPT, Copilot) in completing this assignment in the R Markdown or Python Notebook.
. Submit an additional file (text/Word file) or URL pointing to the conversation with any AI tools (e.g., ChatGPT, Copilot) used to complete this homework.
. Summaries or screenshots are not sufficient — the entire conversation must be included to receive credit.
Deliverables
. Submit your work as either an R Markdown file (.Rmd) or Python Notebook (.ipynb).
. Include all code, plots, and explanations.
. Ensure your work is reproducible.
Rubric (100 points total)
. Data Cleaning & Preparation (20 points)
. Exploratory Data Analysis (15 points)
. Clustering (15 points)
. PCA & Visualization (10 points)
. Cluster Interpretation (10 points)
. Regression Analysis (10 points)
. Managerial Recommendations (5 points)
. AI Log (5 points)
. Reproducibility & Code Quality (10 points)
Total: 100 points