Table of Contents

Hierarchical Clustering for Customer Segmentation

Hierarchical clustering is a machine learning method that helps businesses group customers based on shared traits, without needing to predefine the number of groups. It creates a tree-like structure (dendrogram) to visualize relationships and identify natural clusters. This technique is especially useful for customer segmentation, enabling companies to tailor marketing strategies and improve outcomes like revenue growth, cost efficiency, and customer retention.

Key points:

Two approaches: Agglomerative (bottom-up) and divisive (top-down), with agglomerative being more common due to practicality.
Steps: Calculate distances between data points, merge or split clusters, create a dendrogram, and decide where to cut it for final segments.
Metrics: Distance metrics (e.g., Euclidean, Manhattan) and linkage methods (e.g., Ward’s, average) shape how clusters form.
Applications: Combine behavioral and demographic data for actionable insights, like identifying high-value customer groups.
Comparison: Hierarchical clustering is ideal for small-to-medium datasets, while K-means suits larger datasets due to speed and scalability.

Hierarchical clustering is a reliable way to analyze customer data and create personalized strategies, but it requires careful selection of metrics and methods to ensure meaningful results.

Customer Segmentation Using Hierarchical Clustering Machine Learning Algorithm in Python – Sklearn

How Hierarchical Clustering Works

Let’s break down how hierarchical clustering works when it comes to segmenting customers.

The process involves four main steps. First, the algorithm calculates the distance between every pair of customers in your dataset, creating a distance matrix. This means it measures how similar or different each pair of customers is. Next, based on the chosen approach (which we’ll get into shortly), the algorithm either merges smaller clusters or splits larger ones. Third, it keeps track of these merges or splits in a dendrogram – a tree-like diagram where the height of each branch reflects the distance at which clusters are joined. Finally, you decide where to cut the dendrogram horizontally to form your final customer segments.

One important tip: if your data includes features with varying scales, standardize it first (for example, using z-score normalization). Without this step, features with larger numeric ranges could overpower the distance calculations, leading to skewed results.

Agglomerative vs. Divisive Clustering

There are two main approaches to hierarchical clustering: agglomerative and divisive.

Agglomerative clustering works from the bottom up. It starts with each customer as its own cluster and then merges the most similar clusters step by step until everything is grouped into one hierarchy. This method is popular because it’s efficient at finding smaller, detailed customer segments.
Divisive clustering, on the other hand, takes a top-down approach. It begins with all customers in a single group and divides them into smaller, more distinct clusters. While this method can better capture the overall structure of your data, it’s computationally expensive, which is why it’s less commonly used.

Understanding Dendrograms

A dendrogram is the visual output of hierarchical clustering, resembling a branching tree. On the horizontal axis, you’ll find individual customers (or data points), while the vertical axis measures the distance or dissimilarity between clusters. When two clusters merge, they are connected by a horizontal “U”-shaped line. The height of this line tells you how different the clusters are – taller branches mean greater dissimilarity.

To finalize your customer segments, you’ll need to cut the dendrogram at a certain level. The number of vertical lines that your cut intersects determines the number of clusters. Large vertical gaps in the dendrogram, unconnected by horizontal lines, often reveal natural divisions in the data. To ensure your segmentation is solid, you can pair this visual analysis with statistical tools like Silhouette Analysis to validate that your clusters are distinct and meaningful.

Understanding these steps is key, but mastering the distance metrics and linkage methods behind the clustering process will take your customer segmentation to the next level.

Distance Metrics and Linkage Methods

Now that we’ve covered how hierarchical clustering creates its tree structure, it’s time to dive into the technical details that shape customer segments: distance metrics and linkage methods. These elements are the backbone of clustering, influencing how customers are grouped together.

Distance Metrics

Distance metrics are used to measure how alike or different two customers are based on their data. The right metric depends on the kind of data you’re analyzing.

Euclidean distance: This is the go-to metric for numerical data, like age or income. It calculates the straight-line distance between two points, much like measuring the physical distance between two spots on a map. Its simplicity and effectiveness make it the default for continuous variables.
Manhattan distance: Also known as city block distance, this metric sums up the absolute differences between coordinates, mimicking movement along a grid (think city streets). It’s particularly useful when your data reflects grid-like patterns or constrained movement.
Correlation distance: Instead of focusing on absolute values, this metric looks at patterns. It measures whether two customers’ behaviors rise and fall together, regardless of the actual amounts. For example, one customer might spend $500 monthly and another $50, but if their spending follows the same seasonal trends, correlation distance will group them. It does this by calculating 1 minus the Pearson correlation coefficient.

For categorical data, there are specialized metrics:

Hamming distance: Counts mismatches between labels, making it ideal for "Yes/No" survey responses.
Jaccard distance: Works well with binary attributes.
Cosine similarity: Often used in text analysis, this measures the angle between vectors rather than their magnitude.

One important note: distance metrics are sensitive to scale. If you’re clustering based on age (18 to 80) and income ($20,000 to $200,000), income will dominate unless you standardize your data. Using techniques like z-score normalization can help level the playing field.

Once you’ve defined how to measure distances, the next step is determining how clusters should connect.

Linkage Methods

Linkage methods decide how to measure the distance between clusters, shaping the overall structure of your segments. This choice impacts whether your clusters are compact, evenly sized, or stretched out.

"The sensible default"

Ward’s method, described by Tim Bock, Founder of Displayr, as the "sensible default", minimizes within-cluster variance. In other words, it creates tight, compact groups where customers are as similar as possible to others in their segment. This method often results in clusters of similar sizes, making them easier for marketing teams to target.

Using Hierarchical Clustering for Customer Segmentation

This section dives into how businesses use hierarchical clustering to turn raw customer data into actionable marketing strategies that drive results.

Grouping Customers by Behavior and Demographics

Hierarchical clustering is particularly effective at combining varied data types to build rich, multi-dimensional customer profiles. A great example of this is a project led by data scientist Emine Bozkus in December 2022. Working with Turkish retailer FLO, she analyzed data from 19,945 OmniChannel customers collected between 2020 and 2021. Using hierarchical clustering on 12 variables – including tenure, recency, frequency, and monetary value – she leveraged the Scipy library to generate dendrograms. These visual tools helped FLO identify distinct behavioral clusters and design targeted marketing strategies.

This method is versatile, allowing businesses to integrate RFM analysis (Recency, Frequency, Monetary) with demographic details like age and income, along with behavioral patterns such as purchasing habits and brand loyalty. The dendrogram structure visually highlights natural groupings, making it easier to align clusters with actual customer behaviors.

"Hierarchical clustering has the advantage of not requiring a predefined number of clusters. Instead, analysts can use the dendrogram to determine the best clustering solution based on the height at which distinct clusters emerge." – Anita Rostami

In another example, researcher Anita Rostami used Agglomerative Hierarchical Clustering in a 2024 retail analytics project. Analyzing a Mall Customer dataset of 200 observations, she segmented customers into five groups based on Annual Income and Spending Score. Using Euclidean distance and Ward linkage, the analysis achieved a silhouette score of 0.5530 and a Davies-Bouldin Index of 0.5782, both indicators of strong cluster quality. These insights enabled more precise marketing efforts tailored to mall shoppers.

Once customer clusters are clearly defined, businesses can move seamlessly into creating personalized strategies.

Creating Personalized Marketing Strategies

After identifying distinct customer groups, businesses can craft marketing strategies tailored to each segment’s unique needs and preferences. For instance, a study titled "Momentive study: credit card users weigh costs and perks" segmented credit card users into four personas using hierarchical clustering:

"Economical End-Users": Older, risk-averse customers who prefer no annual fees.
"Untapped Utilizers": Lower-income users unfamiliar with card benefits.
"Strapped Spenders": Younger customers with high usage but limited disposable income.
"Card Champions": High-income Millennials maximizing travel rewards.

This segmentation allowed credit card companies to address specific pain points – offering cash-back incentives for "Untapped Utilizers" and travel bonuses for "Card Champions" – while customizing content and outreach frequency for each group.

The same principle applies across industries. For example, a beer brand segmentation study identified three key customer segments: Premium/Imported brands like Heineken and Becks (high cost and alcohol content), Mainstream brands like Budweiser and Coors (moderate cost and alcohol), and Light/Health-conscious brands like Miller Lite (low calories and sodium).

Measurable Business Results

The true value of customer segmentation lies in its ability to deliver measurable business outcomes. Hierarchical clustering helps companies focus their efforts where it matters most – driving revenue, cutting costs, and improving retention.

Business Outcome	Benefit	Example Application
Revenue Growth	Identify high-value segments for premium offerings	Target "Card Champions" with travel rewards programs
Cost Optimization	Focus marketing spend on high-potential customers	Reduce wasted ad spend on low-engagement segments
Retention Improvement	Detect at-risk customers through RFM patterns	Launch win-back campaigns for declining spenders
Product Development	Understand unmet needs within specific segments	Develop products for "Untapped Utilizers" unfamiliar with benefits

This approach can scale to accommodate customer bases ranging from 10,000 to over a million. However, it’s important to note that segmentation isn’t a one-and-done process. Models need to be updated regularly to reflect new data and shifting trends, ensuring they remain effective.

Hierarchical Clustering vs. K-Means Clustering

Hierarchical Clustering vs K-Means: Key Differences for Customer Segmentation

When deciding between hierarchical clustering and K-means clustering, the choice often hinges on factors like dataset size, available computing power, and whether you already know how many clusters you need.

K-means clustering requires you to specify the number of clusters upfront. It’s known for being fast, thanks to its linear complexity (O(n)), making it a go-to option for very large datasets. However, its reliance on random initialization can lead to different results across multiple runs, which may be a drawback in some scenarios.

On the other hand, hierarchical clustering doesn’t require you to predefine the number of clusters. Instead, it creates a dendrogram, which visually represents the natural groupings within the data. This makes it particularly useful during the exploratory phase, such as when analyzing customer data for the first time. That said, hierarchical clustering comes with a trade-off: it’s computationally intensive, with complexity ranging from O(n²) to O(n³) due to the need to compute and store a full n×n distance matrix. As a result, it’s not practical for very large datasets.

Here’s a quick comparison of the two methods:

Key Differences Between Methods

Factor	K-Means Clustering	Hierarchical Clustering
Number of Clusters	Must be specified in advance	Determined by examining the dendrogram
Dataset Size	Ideal for very large datasets	Best for small to medium datasets
Processing Speed	Fast, with linear complexity (O(n))	Slower, with quadratic or cubic complexity
Cluster Shape	Works best with spherical, equally sized clusters	Handles clusters of various shapes and sizes
Reproducibility	Results vary due to random initialization	Consistent and reproducible
Visualization	Uses scatter plots to display final clusters	Shows hierarchical relationships via dendrograms
Memory Requirements	Low – only needs to store centroids	High – requires storing the full distance matrix

For businesses managing massive customer datasets, K-means is often the practical choice due to its speed and low memory requirements. But if the goal is to explore hierarchical relationships – such as identifying sub-groups within a niche customer segment – hierarchical clustering offers a deeper look into the data.

Interestingly, a hybrid approach can combine the strengths of both methods. For example, you could use hierarchical clustering on a smaller sample to determine the optimal number of clusters, then apply K-means to the entire dataset. This way, you get the exploratory benefits of hierarchical clustering with the efficiency of K-means.

Choosing the right clustering method can significantly influence how effectively businesses segment their customers and create personalized marketing strategies.

Best Practices for Hierarchical Clustering

Let’s build on the methods discussed earlier and refine your approach to hierarchical clustering, particularly for customer segmentation.

Reading Dendrograms and Determining the Right Number of Clusters

A dendrogram is like a visual guide that shows how data points or clusters combine step by step. To decide the number of clusters, look for the largest vertical gap in the dendrogram – this marks the most distinct groupings. Drawing a horizontal line across the dendrogram, count how many vertical lines it intersects; this gives you the number of clusters. For example, in a mall customer segmentation study using Annual Income and Spending Scores, the analysis revealed five clusters with a Silhouette Score of 0.5530 and a Davies-Bouldin Index of 0.5782.

To validate your results, use metrics like the Silhouette Score, which measures how well-separated clusters are, and the Davies-Bouldin Index, which evaluates cluster similarity. It’s worth noting that the linkage method, such as Ward’s method, often has a bigger influence on the outcome than the choice of distance metric. Experimenting with different combinations can help refine your results.

Handling Large Datasets

Hierarchical clustering can become unwieldy with large datasets because it requires storing a full distance matrix in memory. For instance, clustering a dataset with 1 million points might demand an enormous 7.45 terabytes of memory. The computational complexity – ranging from $O(n^2 \log n)$ to $O(n^3)$ – means that processing time increases dramatically as the dataset grows.

One way to manage this is through sampling. Start by clustering a smaller, representative subset of your data (e.g., 10,000 points out of 1 million), and then assign the remaining data points to the nearest clusters. For even larger datasets, a hybrid approach can be effective. For example, use K-means to create 1,000 flat clusters and then apply hierarchical clustering to the centroids instead of the entire dataset. Additionally, dimensionality reduction techniques like PCA can ease computational demands while retaining the core structure of the data.

If you’re working with very large datasets, specialized tools can make a big difference. Libraries like fastcluster in Python, GPU-accelerated frameworks such as RAPIDS cuML, or distributed systems like Apache Spark MLlib can significantly improve efficiency. These tools help scale hierarchical clustering to meet the demands of larger datasets.

When to Use Hierarchical Clustering

Hierarchical clustering is particularly useful for exploratory analysis, especially when you don’t know the number of customer segments beforehand. As highlighted by DataCamp, this method lets analysts visually explore data structures through dendrograms, making it ideal for early-stage market research or uncovering complex relationships among customer groups.

This approach works best with small to medium-sized datasets – typically a few thousand customers or fewer. It’s especially effective when customer segments have irregular shapes, which centroid-based methods like K-means might miss. Studies have shown that hierarchical clustering can lead to more targeted marketing strategies by identifying distinct customer groups.

Conclusion

Hierarchical clustering, with its use of dendrogram analysis and tailored linkage methods, offers a dynamic way to uncover customer relationships without needing to predefine the number of groups. The dendrogram itself reveals the natural structure of your data, helping you pinpoint clusters based on inherent similarities.

This method provides consistent and reliable results, giving businesses a solid base for strategic decision-making. Techniques like Ward’s method enable the creation of compact, homogeneous clusters, which can lead to more predictable and effective marketing outcomes.

One of the standout features of hierarchical clustering is its adaptability. By cutting the dendrogram at various heights, you can create either broad market categories or highly specific micro-segments, depending on your business goals. This adaptability ensures that segmentation efforts can align with diverse business needs.

The true power of hierarchical clustering lies in combining algorithmic precision with human insights. While the algorithm excels at identifying patterns and removing bias during the grouping process, human expertise is crucial for giving meaningful context to these segments and determining how resources should be allocated. Together, this approach provides the clarity and depth needed to build personalized marketing strategies that deliver measurable results. In the end, these insights help marketers turn raw data into actionable strategies that drive business growth.

FAQs

How can I select the best distance metric and linkage method for hierarchical clustering?

Choosing the right distance metric and linkage method is all about understanding your data and the patterns you’re trying to identify. The distance metric measures how similar (or different) data points are. Some of the most common options include:

Euclidean distance: Ideal for continuous, numerical data where geometric relationships matter. Think of it as the straight-line distance between points.
Manhattan distance: Works well for grid-like data, where movement happens along axes (like navigating city blocks).
Cosine distance: Great for high-dimensional data, such as text analysis or user preferences, where comparing angles in multi-dimensional space makes more sense than measuring straight-line distance.

For instance, if you’re analyzing customer preferences in a recommendation system, Cosine distance might be your go-to. On the other hand, Euclidean distance is better suited for datasets where spatial relationships are important.

The linkage method, on the other hand, decides how clusters are combined. Here are some popular options:

Single linkage: Good for detecting elongated clusters but can lead to "chaining", where clusters stretch out.
Complete linkage: Produces compact, well-separated clusters, making it a solid choice for many applications.
Average linkage: Balances the extremes of single and complete linkage by considering average distances.
Ward’s method: Focuses on minimizing variance within clusters, resulting in tighter, more cohesive groups. It’s often a strong default choice for most datasets.

When deciding on the best combination, think about your data’s scale, distribution, and the cluster shapes you’re aiming for. Testing different metrics and linkage methods, paired with visual tools like dendrograms, can help you fine-tune your approach and get better results – especially for tasks like customer segmentation.

What makes hierarchical clustering a better choice than K-means for customer segmentation?

Hierarchical clustering is a solid choice for customer segmentation because it doesn’t force you to pick the number of clusters right from the start. This approach lets businesses uncover natural groupings in their data without predefined limitations. Plus, it generates a dendrogram – a visual map that shows how customers are grouped – making it easier to spot relationships and patterns.

Unlike K-means, hierarchical clustering shines when you’re looking for a more in-depth and intuitive grasp of customer behavior. It’s especially handy for smaller datasets or when you want to understand the step-by-step process of how clusters are formed.

What are the best ways to apply hierarchical clustering to large datasets efficiently?

Hierarchical clustering, while powerful, can be computationally demanding, especially with large datasets. To tackle this challenge, several strategies can help streamline the process.

One approach is dimensionality reduction, such as using Principal Component Analysis (PCA). By reducing the number of features, you can significantly cut down the number of distance calculations required. Another effective method involves spatial indices like kd-trees or ball trees, which can speed up distance computations by organizing data in a way that makes searching faster.

For extremely large datasets, parallel processing or GPU acceleration can distribute the workload, making computations more manageable. Additionally, some techniques use approximate nearest neighbor searches or apply early stopping criteria to limit the number of merges, striking a balance between efficiency and accuracy. These optimizations make hierarchical clustering feasible for large-scale applications, such as customer segmentation.

Metric/Method	Ideal Use	Cluster Effect
Euclidean	Numerical data (Age, Income)	Groups based on magnitude differences
Correlation	Behavioral patterns (Spending cycles)	Groups by trend similarity
Hamming	Categorical data (Yes/No responses)	Groups based on matching labels
Ward’s Linkage	General segmentation	Creates compact, evenly sized groups
Single Linkage	Identifying outliers or chains	Can lead to trailing, poorly defined clusters

Hierarchical Clustering for Customer Segmentation

Hierarchical Clustering for Customer Segmentation

Customer Segmentation Using Hierarchical Clustering Machine Learning Algorithm in Python – Sklearn

How Hierarchical Clustering Works

Agglomerative vs. Divisive Clustering

Understanding Dendrograms

Distance Metrics and Linkage Methods

Distance Metrics

Linkage Methods

Using Hierarchical Clustering for Customer Segmentation

Grouping Customers by Behavior and Demographics

Creating Personalized Marketing Strategies

Measurable Business Results

sbb-itb-2ec70df

Hierarchical Clustering vs. K-Means Clustering

Key Differences Between Methods

Best Practices for Hierarchical Clustering

Reading Dendrograms and Determining the Right Number of Clusters

Handling Large Datasets

When to Use Hierarchical Clustering

Conclusion

FAQs

How can I select the best distance metric and linkage method for hierarchical clustering?

What makes hierarchical clustering a better choice than K-means for customer segmentation?

What are the best ways to apply hierarchical clustering to large datasets efficiently?

Related Blog Posts