This is a fictional project for study purposes. The business context and the insights are not real. The dataset is available on Kaggle.
The business leaders of an e-commerce company concluded that a good strategy to leverage sales is to create a loyalty program for their customers. So, the business team asked the data scientists to select the most valuable customers for the company. Recency, frequency and monetary aspects were considered by the business team as the main characteristics for grouping the customers into clusters.
Machine Learning Clustering Model: Using the dataset from Kaggle, a machine learning clustering model was created to cluster the current clients and to assign future clients to clusters.
The notebook used to create the model is available here.

Attribute | Description |
---|---|
InvoiceNo | Number of purchase invoice. |
StockCode | Code of the stock the object comes from. |
Description | Description of the item purchased. |
Quantity | Quantity of the item purchased. |
InvoiceDate | Date of the invoice. |
UnitPrice | Price of one item of the object purchased. |
CustomerID | Identification number of the client responsible for the purchase. |
Country | The country the purchase comes from. |
- Stock codes made up of letters, like POST, D and PADS, were discarded because it is not possible to know exactly what they mean.
- Unit prices lower than 0.04 were discarded because they appear to be data-entry errors.
- Customers who return almost every purchase they make were not considered.
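The cleaning rules above can be sketched in pandas roughly as follows. This is a minimal sketch with a hypothetical sample; the real data comes from the Kaggle CSV with the schema described above.

```python
import pandas as pd

# Hypothetical sample rows (the real data is the Kaggle CSV).
df = pd.DataFrame({
    "StockCode": ["85123A", "POST", "22423", "D"],
    "UnitPrice": [2.55, 18.00, 0.01, 4.25],
    "Quantity": [6, 1, 2, -1],
    "CustomerID": [17850, 17850, 13047, 13047],
})

# Drop stock codes made up entirely of letters (POST, D, PADS, ...),
# whose meaning is unknown.
df = df[~df["StockCode"].str.isalpha()]

# Drop unit prices below 0.04, which look like data-entry errors.
df = df[df["UnitPrice"] >= 0.04]
```

Negative quantities mark returns; the same table can then be aggregated per customer to drop those who return almost everything they buy.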
- Understand the business problem.
- Download the dataset.
- Clean the dataset removing outliers, NA values and unnecessary features.
- Prepare the data to be used by the modeling algorithms: encode variables, split train and test datasets, and perform other necessary operations.
- Create the models using machine learning algorithms.
- Evaluate the created models to find the one that best fits to the problem.
- Tune the model to achieve a better performance.
- Explore the data to create hypotheses, derive a few insights and validate them.
- Deploy the model in production so that it is available to the user.
- Find possible improvements to be explored in the future.
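As a sketch of the data-preparation step, the recency, frequency and monetary features can be built from the invoice lines roughly as below. The column names follow the dataset schema; the values are hypothetical, and the exact recency definition (days since the last purchase, measured from the day after the newest invoice) is an assumption.

```python
import pandas as pd

# Hypothetical invoice lines in the Kaggle schema.
tx = pd.DataFrame({
    "CustomerID": [1, 1, 2, 2, 2],
    "InvoiceNo": ["536365", "536366", "536367", "536367", "536370"],
    "InvoiceDate": pd.to_datetime(
        ["2011-12-01", "2011-12-05", "2011-11-20", "2011-11-20", "2011-12-08"]
    ),
    "Quantity": [6, 2, 3, 4, 1],
    "UnitPrice": [2.55, 3.39, 4.25, 1.85, 7.95],
})
tx["Revenue"] = tx["Quantity"] * tx["UnitPrice"]

# Recency is measured from the day after the last invoice in the data.
snapshot = tx["InvoiceDate"].max() + pd.Timedelta(days=1)

rfm = tx.groupby("CustomerID").agg(
    recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
    frequency=("InvoiceNo", "nunique"),
    monetary=("Revenue", "sum"),
)
```

The resulting `rfm` table, one row per customer, is what the clustering algorithms receive (directly or through an embedding).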
I1: The customers of the loyalty program have a purchase volume (products) above 10% of the total purchases.
True: The loyalty program cluster has 34% of the total products purchased.
I2: The customers of the loyalty program have a volume (revenue) of purchases above 10% of the total purchases.
True: The loyalty program cluster has 46% of the total profit.
I3: Loyalty program customers have a lower number of returns than the average of the other customers.
False: The loyalty program cluster has an average quantity of returns above the average of the other customers.
I4: The median billing by loyalty program customers is 10% higher than the median billing overall.
True: The median profit of the loyalty program cluster is 215% above the overall median.
I5: Loyalty program customers are on the third quantile.
False: They are mostly in the first quantile.
The final result of this project is a clustering model. Some dimensionality reduction algorithms, like PCA (Principal Component Analysis), UMAP (Uniform Manifold Approximation and Projection) and t-SNE (t-Distributed Stochastic Neighbor Embedding), were used to create embedding spaces as alternatives to the feature space. Several machine learning modeling algorithms were also tried in order to find the best possible model. In all, three types of models were created: k-Means, GMM (Gaussian Mixture Model) and HC (Hierarchical Clustering). The table below presents some of the models created, the embedding algorithm used to create each model, the number of clusters and the silhouette score.
Model Name | Space Creation | Nº Clusters | Silhouette Score |
---|---|---|---|
k-Means | Features | 2 | 0.69 |
GMM | Features | 2 | -0.01 |
HC | Features | 2 | 0.65 |
k-Means | UMAP | 15 | 0.56 |
GMM | UMAP | 14 | 0.47 |
HC | UMAP | 15 | 0.54 |
k-Means | t-SNE | 13 | 0.45 |
GMM | t-SNE | 13 | 0.36 |
HC | t-SNE | 12 | 0.42 |
k-Means | Tree Embedding Space | 15 | 0.48 |
GMM | Tree Embedding Space | 2 | 0.43 |
HC | Tree Embedding Space | 15 | 0.48 |
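The model comparison above boils down to fitting each algorithm on each space and scoring it with the silhouette. A minimal scikit-learn sketch of that loop, using synthetic blobs as a stand-in for the customer features (the real project runs this on the RFM features and on the UMAP/t-SNE embeddings):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the customer feature space.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=0.8, random_state=42)

scores = {}
for k in (2, 4, 11):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)
    print(f"k={k}: silhouette={scores[k]:.2f}")
```

The silhouette score ranges from -1 to 1; higher means the clusters are denser and better separated, which is why the GMM-on-features score of -0.01 in the table above rules that model out.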
The final model was chosen based on the number of clusters the business team requested, considering the silhouette scores. The characteristics of the final model are presented in the table below.
Model Name | Space Creation | Nº Clusters | Silhouette Score |
---|---|---|---|
k-Means | UMAP | 11 | 0.52 |
The number of clusters the business team believes to be the best is eleven. It is a good choice because its silhouette score is among the highest found across all the models created, and the number of clusters is not too high. The clusters' profiles, with their average metrics, are presented in the table below.
Cluster Number | Number of Customers | Customers (%) | Gross Revenue | Recency (days) | Products Purchased | Frequency | Returns |
---|---|---|---|---|---|---|---|
0 | 755 | 13.3 | 6260.09 | 11.7 | 241.9 | 0.05 | 76.6 |
1 | 383 | 6.7 | 2663.62 | 4.2 | 175.5 | 0.14 | 17.5 |
2 | 836 | 14.7 | 1705.62 | 36.6 | 98.1 | 0.04 | 16.7 |
3 | 392 | 6.9 | 1164.39 | 100.2 | 61.8 | 0.19 | 8.4 |
4 | 429 | 7.5 | 1028.46 | 290.7 | 59.7 | 0.63 | 202.2 |
5 | 277 | 4.9 | 906.62 | 362.6 | 65.1 | 1.05 | 2.5 |
6 | 586 | 10.3 | 861.55 | 35.1 | 44.7 | 0.71 | 3.5 |
7 | 595 | 10.4 | 774.43 | 135.1 | 65.1 | 0.77 | 3.7 |
8 | 391 | 6.9 | 647.62 | 199.1 | 47.2 | 1.02 | 2.4 |
9 | 408 | 7.2 | 606.15 | 56.2 | 46.1 | 1.07 | 6.6 |
10 | 643 | 11.3 | 492.88 | 246.8 | 39.6 | 1.02 | 1.6 |
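A profile table like the one above can be produced with a single `groupby` over the per-customer table once the cluster labels are assigned. A minimal sketch with hypothetical values (column names are illustrative, not the project's actual ones):

```python
import pandas as pd

# Hypothetical per-customer table after assigning the k-Means labels.
customers = pd.DataFrame({
    "cluster": [0, 0, 1, 1, 1],
    "gross_revenue": [6100.0, 6420.0, 2600.0, 2700.0, 2690.0],
    "recency_days": [12, 11, 4, 5, 4],
    "products": [240, 244, 170, 180, 176],
})

profile = customers.groupby("cluster").agg(
    n_customers=("gross_revenue", "size"),
    avg_gross_revenue=("gross_revenue", "mean"),
    avg_recency=("recency_days", "mean"),
    avg_products=("products", "mean"),
)
profile["pct_customers"] = 100 * profile["n_customers"] / len(customers)
```

Sorting `profile` by average gross revenue is what singles out the loyalty-program cluster (cluster 0 in the table above).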
Several models were created to meet the demand of the business team. Finally, it was possible to find a model that satisfied the data and business teams simultaneously. The features created at the beginning of the modeling process were effective at separating the customers into clusters and finding the cluster with the most valuable customers. The model can now be used by the business team to find the right marketing strategy for each customer according to the group they belong to and achieve higher profit.
- Try other clustering modeling algorithms.
- Try other embedding spaces with more than 2 components.