Unlock Customer Insights: A Guide to K-Means Clustering for Segmentation in Python
Understanding your customers is fundamental to business success. Whether you are refining product offerings or tailoring marketing campaigns, identifying distinct customer groups is key. K-Means clustering, a powerful machine learning technique, offers an effective way to segment customers by grouping those with similar characteristics.
This guide provides a practical walkthrough on using K-Means clustering for customer segmentation using Python. We’ll explore this technique with two different datasets, demonstrating the steps involved from data preparation to interpreting the resulting clusters.
What is K-Means Clustering?
K-Means is an unsupervised learning algorithm that aims to partition a dataset into a predefined number of clusters (K). It works by:
1. Initializing K cluster centers (centroids), often randomly.
2. Assigning each data point to the nearest centroid.
3. Recalculating the position of each centroid based on the mean of the data points assigned to it.
4. Repeating steps 2 and 3 until the centroids stabilize or a maximum number of iterations is reached.
The goal is to minimize the within-cluster variance, meaning data points within the same cluster are as similar as possible, while clusters themselves are distinct.
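To make these steps concrete, here is a minimal sketch of one full K-Means loop in plain NumPy. The toy data and variable names are purely illustrative; in practice you would rely on scikit-learn’s implementation, as we do below.
import numpy as np
# Toy data: six 2D points, K=2
rng = np.random.default_rng(0)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.0], [8.0, 8.0], [9.0, 9.0], [8.0, 9.5]])
centroids = X[rng.choice(len(X), size=2, replace=False)]  # step 1: random initialization
for _ in range(100):  # cap the number of iterations
    # Step 2: assign each point to its nearest centroid
    distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = distances.argmin(axis=1)
    # Step 3: move each centroid to the mean of its assigned points
    new_centroids = np.array([X[labels == k].mean(axis=0) for k in range(2)])
    # Step 4: stop once the centroids no longer move
    if np.allclose(new_centroids, centroids):
        break
    centroids = new_centroids
print(labels)     # cluster assignment per point
print(centroids)  # final centroid positions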
Prerequisites
To follow along, you’ll need Python and the following libraries:
- pandas for data manipulation.
- numpy for numerical operations.
- scikit-learn for the K-Means algorithm and scaling.
- matplotlib and seaborn for data visualization.
We’ll use code examples suitable for environments like Google Colab or a local Jupyter Notebook setup.
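If any of these libraries are missing, they can be installed from a notebook cell; the %pip magic works in both Colab and Jupyter:
%pip install pandas numpy scikit-learn matplotlib seaborn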
Case Study 1: Mall Customer Segmentation
Let’s start with a common example using the Mall_Customers.csv dataset, which contains basic information about mall shoppers, including their age, gender, annual income, and a spending score (1-100). The objective is to segment customers based on their ‘Annual Income’ and ‘Spending Score’.
Dataset source: originally from Kaggle: `https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python/version/1`
Step 1: Load and Prepare Data
First, import the necessary libraries and load the dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
# Load the dataset (assuming Mall_Customers.csv is in the working directory or uploaded)
# In Google Colab, you might use:
# from google.colab import files
# uploaded = files.upload()
# df = pd.read_csv('Mall_Customers.csv')
# If running locally:
df = pd.read_csv('Mall_Customers.csv')
# Check the shape of the data
print(df.shape)
# Rename columns for clarity
df.rename(columns={'Annual Income (k$)': 'Income', 'Spending Score (1-100)': 'Spending_Score'}, inplace=True)
# Display the first few rows
print(df.head())
This loads the data and renames ‘Annual Income (k$)’ to ‘Income’ and ‘Spending Score (1-100)’ to ‘Spending_Score’ for easier use. df.head() would typically show columns like CustomerID, Gender, Age, Income, and Spending_Score.
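Before clustering, it’s also worth a quick sanity check for missing values and data types, since K-Means needs complete numeric input. A minimal check, assuming the standard Kaggle version of the file:
# Quick data-quality check before clustering
df.info()                 # column dtypes and non-null counts
print(df.isna().sum())    # missing values per column (should be zero for this dataset)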
Step 2: Find the Optimal Number of Clusters (K) using the Elbow Method
Before applying K-Means, we need to determine the most appropriate number of clusters (K). The Elbow Method is a common technique for this. It involves running K-Means for a range of K values and plotting the Sum of Squared Distances (SSD) or inertia for each K. The “elbow” point on the plot—where the rate of decrease in inertia sharply slows down—suggests the optimal K.
# Select features for clustering
X = df[['Income', 'Spending_Score']] # Using Income and Spending Score
ssd = []
K_range = range(2, 10) # Check for K from 2 to 9
for k in K_range:
    kmeans_model = KMeans(n_clusters=k, random_state=42, n_init=10)  # n_init set explicitly for consistent results
    kmeans_model.fit(X)
    ssd.append(kmeans_model.inertia_)  # inertia_ gives the SSD
# Create a DataFrame for plotting
ssd_df = pd.DataFrame({'k': K_range, 'ssd': ssd})
# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(ssd_df['k'], ssd_df['ssd'], linestyle='--', marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Sum of Squared Distances (SSD) / Inertia')
plt.title('Elbow Method for Optimal K')
plt.grid(True)
plt.show()
# Optionally calculate percentage change to help identify the elbow
ssd_df['pct_chg'] = ssd_df['ssd'].pct_change() * 100
print(ssd_df)
The plot generated will show SSD decreasing as K increases. We look for the point where the line bends like an elbow. For the Mall Customers dataset, focusing on Income and Spending Score, this elbow typically occurs at K=5. The printed DataFrame with pct_chg helps quantify this change; the percentage decrease becomes significantly smaller after K=5.
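Because the elbow can be ambiguous, a complementary check such as the silhouette score is sometimes useful; higher values indicate better-separated clusters. A short sketch using scikit-learn:
from sklearn.metrics import silhouette_score
# Silhouette score for each candidate K (values closer to 1 are better)
for k in K_range:
    labels = KMeans(n_clusters=k, random_state=42, n_init=10).fit_predict(X)
    print(f"K={k}: silhouette = {silhouette_score(X, labels):.3f}")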
Step 3: Apply K-Means with the Optimal K
Now that we’ve identified K=5 as optimal, let’s train the K-Means model.
optimal_k = 5
kmeans = KMeans(n_clusters=optimal_k, init='k-means++', random_state=42, n_init=10) # Use k-means++ initialization
kmeans.fit(X) # Fit the model to the selected features
# Get the cluster labels for each customer
df['Clusters'] = kmeans.labels_
# Get the coordinates of the cluster centers (centroids)
centroids = kmeans.cluster_centers_
print("Cluster Centroids:\n", centroids)
# Display the first few rows with the assigned cluster
print(df.head())
This code trains the model with 5 clusters and adds a new ‘Clusters’ column to the DataFrame, indicating which cluster each customer belongs to. It also prints the coordinates (Income, Spending_Score) of the five cluster centers.
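A quick way to see how customers are distributed across the five segments:
# Number of customers in each cluster
print(df['Clusters'].value_counts().sort_index())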
Step 4: Visualize and Interpret the Clusters
Visualizing the clusters helps in understanding the segmentation. A scatter plot is ideal here.
plt.figure(figsize=(10, 6))
# Scatter plot of customers colored by cluster
sns.scatterplot(x='Income', y='Spending_Score', hue='Clusters', data=df, palette='viridis', s=100, alpha=0.7)
# Plot the centroids
plt.scatter(centroids[:, 0], centroids[:, 1], marker='X', s=200, c='red', label='Centroids')
plt.title('Customer Segmentation using K-Means (K=5)')
plt.xlabel('Annual Income (k$)')
plt.ylabel('Spending Score (1-100)')
plt.legend()
plt.grid(True)
plt.show()
The resulting scatter plot clearly shows five distinct groups:
- Low Income, High Spending Score
- Low Income, Low Spending Score
- Medium Income, Medium Spending Score
- High Income, High Spending Score (Target Customers)
- High Income, Low Spending Score
This segmentation provides valuable insights for targeted marketing strategies.
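These interpretations can be checked numerically by profiling each cluster, for example by averaging the two features per segment:
# Average income and spending score per cluster, with segment sizes
profile = df.groupby('Clusters')[['Income', 'Spending_Score']].mean()
profile['Count'] = df['Clusters'].value_counts().sort_index()
print(profile.round(1))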
Case Study 2: Food Order Segmentation
Let’s apply the same principles to a different scenario: segmenting customers based on their food ordering behaviour using an order dataset. Assume this dataset contains user_id, order_count, order_price, and age.
Step 1: Load and Prepare Data
Since a user might have multiple orders, we first need to aggregate the data per user.
# Load the dataset (Example: loading from a URL)
# url = "https://raw.githubusercontent.com/Yaowamanymc/food-order-data/refs/heads/main/food_order_data.csv"
# order_df = pd.read_csv(url) # Assuming index_col=0 is not needed or handled differently
# Let's assume order_df has columns: user_id, order_count, order_price, age
# Aggregate data per user
# user_order_df = order_df.groupby('user_id').agg({
#     'order_count': 'sum',   # Total orders per user
#     'order_price': 'sum',   # Total spending per user
#     'age': 'mean'           # Average age (if consistent per user) or first value
# }).reset_index()
# For demonstration, let's create a sample aggregated DataFrame structure.
# Replace this with actual data loading and aggregation.
np.random.seed(42)  # make the sample data reproducible
data = {
    'user_id': range(1, 101),
    'order_count': np.random.randint(1, 50, 100),
    'order_price': np.random.uniform(10, 500, 100),
    'age': np.random.randint(18, 65, 100)
}
user_order_df = pd.DataFrame(data)
print(user_order_df.shape)
print(user_order_df.head())
Step 2: Data Scaling and Finding Optimal K
K-Means is sensitive to the scale of features. Features like order_price might have much larger values than order_count or age, potentially dominating the distance calculations. Therefore, scaling is crucial.
from sklearn.preprocessing import StandardScaler
# Select features for clustering
features = ['order_count', 'order_price', 'age']
X_order = user_order_df[features]
# Scale the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_order)
# Find optimal K using the Elbow Method on scaled data
ssd_order = []
K_range_order = range(2, 10)
for k in K_range_order:
    kmeans_order = KMeans(n_clusters=k, random_state=42, n_init=10)
    kmeans_order.fit(X_scaled)
    ssd_order.append(kmeans_order.inertia_)
# Plot the Elbow Curve
plt.figure(figsize=(8, 5))
plt.plot(K_range_order, ssd_order, linestyle='--', marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Sum of Squared Distances (SSD) / Inertia')
plt.title('Elbow Method for Food Order Data')
plt.grid(True)
plt.show()
# Determine the elbow point (let's assume it's K=4 for this example)
optimal_k_order = 4 # Adjust based on the actual plot
Step 3: Apply K-Means with Optimal K
Train the model using the scaled data and the chosen K.
kmeans_final_order = KMeans(n_clusters=optimal_k_order, init='k-means++', random_state=42, n_init=10)
kmeans_final_order.fit(X_scaled)
# Add cluster labels to the original (non-scaled) DataFrame for interpretation
user_order_df['Clusters'] = kmeans_final_order.labels_
# Get centroids (these are in the scaled space)
centroids_scaled = kmeans_final_order.cluster_centers_
# To interpret centroids, inverse transform them back to the original scale
centroids_original = scaler.inverse_transform(centroids_scaled)
print("Cluster Centroids (Original Scale):\n", pd.DataFrame(centroids_original, columns=features))
print(user_order_df.head())
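Once the model is fitted, new users can be assigned to an existing segment by pushing their features through the same scaler and calling predict. The example values below are made up for illustration:
# Hypothetical new user: 12 orders, 150 total spend, age 30
new_user = pd.DataFrame([[12, 150.0, 30]], columns=features)
new_user_scaled = scaler.transform(new_user)  # reuse the already-fitted scaler
print("Assigned cluster:", kmeans_final_order.predict(new_user_scaled)[0])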
Step 4: Visualize and Interpret Clusters
Visualize using two prominent features, like order_count and order_price.
plt.figure(figsize=(10, 6))
sns.scatterplot(x='order_count', y='order_price', hue='Clusters', data=user_order_df, palette='Set2', s=100, alpha=0.8)
# Plot centroids (already transformed back to the original scale above)
plt.scatter(centroids_original[:, 0], centroids_original[:, 1], marker='X', s=300, c='red', label='Centroids')
plt.title(f'Food Order Customer Segmentation (K={optimal_k_order})')
plt.xlabel('Total Order Count')
plt.ylabel('Total Order Price')
plt.legend()
plt.grid(True)
plt.show()
The plot would reveal distinct customer segments based on their ordering frequency and total spending. For example:
- Low Frequency, Low Spenders: Customers who order infrequently and spend little.
- High Frequency, High Spenders: Loyal customers who order often and spend significantly (VIPs).
- High Frequency, Low Spenders: Customers who order often but typically small amounts.
- Low Frequency, High Spenders: Customers who order rarely but make large purchases when they do.
(Note: the interpretation depends on the actual data and the chosen K. Age also contributes to the clustering but isn’t directly visualized in this 2D plot.)
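Since age is not visible in the 2D plot, a per-cluster summary table is a handy complement for interpreting all three features at once:
# Mean of each feature per cluster, plus cluster sizes
summary = user_order_df.groupby('Clusters')[features].mean()
summary['Count'] = user_order_df['Clusters'].value_counts().sort_index()
print(summary.round(1))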
Conclusion
K-Means clustering provides a straightforward yet powerful method for customer segmentation. By grouping customers based on similarities in behaviour or characteristics (like spending habits, purchase frequency, or demographics), businesses can gain actionable insights. Applying the Elbow Method helps determine an appropriate number of clusters, and visualization aids in interpreting the segments discovered. This data-driven approach allows for more personalized marketing, optimized product strategies, and ultimately, better business outcomes. Feel free to adapt these steps and explore different feature combinations with your own datasets.
Unlock the full potential of your customer data with Innovative Software Technology. Leveraging advanced techniques like K-Means clustering and machine learning, we transform raw data into actionable customer segments. Our experts help you gain deep insights into customer behaviour, enabling highly targeted marketing campaigns, personalized user experiences, and optimized product development strategies. Partner with us to implement data-driven decision-making, enhance customer engagement, and achieve significant business growth through effective customer segmentation analysis.