This blog post is part of my final project for Udacity's Data Scientist Nanodegree.
Problem Statement
This project focuses on analyzing demographics data for customers of a mail-order sales company in Germany, comparing it against demographics information for the general population. Unsupervised learning techniques are used to perform customer segmentation, identifying the parts of the population that best describe the core customer base of the company. Then, a third dataset with demographics information for targets of a marketing campaign is used to predict which individuals are most likely to convert into customers of the company. The data has been provided by Bertelsmann Arvato Analytics and represents a real-life data science task.
Metrics
To decide which machine learning algorithm is the most suitable, the ROC AUC score is used. This metric measures how well the model separates the two labels: the higher the value, the better the predictions distinguish class 0 from class 1. It is also a reasonable choice because only about 1.23 percent of the training dataset belongs to class 1, so plain accuracy would be misleading.
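As a small illustration of the metric (with made-up labels and scores, not project data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# toy labels and predicted probabilities for class 1 (illustrative values only)
y_true = np.array([0, 0, 0, 1, 0, 1, 0, 0])
y_prob = np.array([0.1, 0.3, 0.2, 0.8, 0.4, 0.6, 0.2, 0.1])

# ROC AUC measures how well the scores rank class 1 above class 0:
# 0.5 corresponds to random guessing, 1.0 to a perfect ranking
print(roc_auc_score(y_true, y_prob))  # -> 1.0 here, since every 1 outranks every 0
```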
Get to Know the Data
There are four data files associated with this project:
- `Udacity_AZDIAS_052018.csv`: Demographics data for the general population of Germany; 891 211 persons (rows) x 366 features (columns).
- `Udacity_CUSTOMERS_052018.csv`: Demographics data for customers of a mail-order company; 191 652 persons (rows) x 369 features (columns).
- `Udacity_MAILOUT_052018_TRAIN.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 982 persons (rows) x 367 features (columns).
- `Udacity_MAILOUT_052018_TEST.csv`: Demographics data for individuals who were targets of a marketing campaign; 42 833 persons (rows) x 366 features (columns).
Data Cleaning
Missing Values
The first step is to analyze the missing values in the individual columns. The following two charts show the columns with the most missing values and their proportion.
Based on the visualizations, 20 percent is chosen as a suitable limit for missing values. This means that columns with more than 20 percent missing values are removed from the dataset. As a result, the 16 affected columns are deleted.
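A minimal sketch of this step, assuming the raw file is read into a pandas DataFrame called `azdias` (the variable name and the semicolon separator are assumptions):

```python
import pandas as pd

azdias = pd.read_csv('Udacity_AZDIAS_052018.csv', sep=';')  # separator may need adjusting

# share of missing values per column, sorted for the charts above
missing_share = azdias.isnull().mean().sort_values(ascending=False)

# drop every column with more than 20 percent missing values
cols_to_drop = missing_share[missing_share > 0.20].index
azdias = azdias.drop(columns=cols_to_drop)
print(f"Dropped {len(cols_to_drop)} columns")
```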
Differences between Azdias & Customers
The customers dataset contains three additional attributes that are not available in azdias. To enable a cleaning procedure that works for both datasets, the three columns concerned, ‘CUSTOMER_GROUP’, ‘ONLINE_PURCHASE’ and ‘PRODUCT_GROUP’, are deleted.
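In code this is a single drop, assuming the customer data lives in a DataFrame called `customers`:

```python
# remove the customer-only columns so both datasets share the same schema
customers = customers.drop(columns=['CUSTOMER_GROUP', 'ONLINE_PURCHASE', 'PRODUCT_GROUP'])
```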
Dropping columns that are not described in DIAS Information Levels & Values
The two attribute description files (DIAS Information Levels & Values) explain the meaning of each attribute in the customers and azdias datasets. However, some attributes are not described there, so these columns are dropped from both initial datasets.
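A sketch of this filter, assuming the attribute descriptions have been loaded into a DataFrame `attr_info` with an `Attribute` column (both names are assumptions):

```python
# attributes that actually have a description
described = set(attr_info['Attribute'].dropna().unique())

# drop every column that is not documented
undocumented = [col for col in azdias.columns if col not in described]
azdias = azdias.drop(columns=undocumented)
customers = customers.drop(columns=[col for col in undocumented if col in customers.columns])
```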
Replacing unknown Values with NaN
The attribute descriptions also explain which values stand for ‘unknown’. Based on these explanations, the corresponding values are replaced with NaN so that they are not misinterpreted as meaningful categories.
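A sketch of the replacement, assuming a dictionary `unknown_values` has been built from the value descriptions, mapping each attribute to the codes documented as unknown (the dictionary name and the example codes are illustrative):

```python
import numpy as np

# e.g. unknown_values = {'AGER_TYP': [-1], 'CJT_GESAMTTYP': [0], ...}  (illustrative)
for column, codes in unknown_values.items():
    if column in azdias.columns:
        azdias[column] = azdias[column].replace(codes, np.nan)
    if column in customers.columns:
        customers[column] = customers[column].replace(codes, np.nan)
```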
Re-encode Features
Attributes of type object must be re-encoded numerically so that they can be used in the subsequent analysis. For example, the column ‘EAST_WEST_KZ’ is encoded via the mapping ‘W’: 1 and ‘O’: 2, and dummy columns are created from ‘CAMEO_DEU_2015’.
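A sketch of both re-encodings (the column names follow the wording in this post):

```python
import pandas as pd

# map the East/West indicator to numeric codes: 'W' -> 1, 'O' -> 2
mapping = {'W': 1, 'O': 2}
azdias['EAST_WEST_KZ'] = azdias['EAST_WEST_KZ'].map(mapping)
customers['EAST_WEST_KZ'] = customers['EAST_WEST_KZ'].map(mapping)

# create dummy columns for the multi-level CAMEO_DEU_2015 categories
azdias = pd.get_dummies(azdias, columns=['CAMEO_DEU_2015'])
customers = pd.get_dummies(customers, columns=['CAMEO_DEU_2015'])
```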
Impute NaN & Scale the Data
The remaining missing values in the dataset are then imputed with the mode of each column, which keeps the data as close to its original distribution as possible. In addition, the dataset is scaled; scaling is a common feature pre-processing technique that brings all features onto a comparable range. In this case the StandardScaler is used, which standardizes each feature to zero mean and unit variance.
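A sketch of imputation and scaling with scikit-learn; reusing the objects fitted on azdias for the customer data is my assumption and requires both frames to share the same columns:

```python
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# replace the remaining NaNs with the most frequent value (mode) of each column
imputer = SimpleImputer(strategy='most_frequent')
azdias_imputed = imputer.fit_transform(azdias)

# standardize every feature to zero mean and unit variance
scaler = StandardScaler()
azdias_scaled = scaler.fit_transform(azdias_imputed)

# transform the customer data with the same fitted imputer and scaler
customers_scaled = scaler.transform(imputer.transform(customers))
```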
Customer Segmentation Report
The PCA Method
Principal Component Analysis (PCA) is a statistical technique that combines many variables into a few principal components. The goal is to bundle the information from many individual variables into a few components in order to make the data easier to handle. In this case, 150 components already explain more than 80 percent of the cumulative variance, so the dataset is reduced to 150 components accordingly.
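A sketch of the reduction to 150 components:

```python
import numpy as np
from sklearn.decomposition import PCA

# fit PCA on the scaled general-population data and keep 150 components
pca = PCA(n_components=150)
azdias_pca = pca.fit_transform(azdias_scaled)

# cumulative explained variance; with 150 components it lies above 80 percent
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(f"Cumulative explained variance: {cum_var[-1]:.2%}")
```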
The Elbow Method
The elbow method is used to determine the number of clusters in a dataset. It consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. As shown in the figure above, the curve has no pronounced kink, so 12 clusters appear to be a good compromise.
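A sketch of how such an elbow curve can be produced; the range of k and the random_state are assumptions, and on the full dataset a subsample or MiniBatchKMeans keeps this affordable:

```python
from sklearn.cluster import KMeans

# within-cluster sum of squares (inertia) for a range of cluster counts
inertias = []
for k in range(2, 21):
    km = KMeans(n_clusters=k, random_state=42)
    km.fit(azdias_pca)
    inertias.append(km.inertia_)

# plotting inertias against k yields the elbow curve shown above
```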
K-Means Clustering
The previously determined number of clusters is used next for the k-means method. k-means is a vector quantization method that is also used for cluster analysis: a previously chosen number of k groups is formed from a set of similar objects, in this case 12.
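A sketch of the final clustering and of the comparison behind the following figure:

```python
import pandas as pd
from sklearn.cluster import KMeans

# cluster the general population into 12 groups and assign the customers to the same clusters
kmeans = KMeans(n_clusters=12, random_state=42)
azdias_clusters = kmeans.fit_predict(azdias_pca)
customer_clusters = kmeans.predict(pca.transform(customers_scaled))

# share of each cluster in both datasets, the basis for the comparison figure
comparison = pd.DataFrame({
    'general': pd.Series(azdias_clusters).value_counts(normalize=True),
    'customers': pd.Series(customer_clusters).value_counts(normalize=True),
}).sort_index()
print(comparison)
```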
The figure shows the percentage distribution across the individual clusters. Cluster 2 stands out clearly for the customers dataset, while clusters 3 & 6 are the least represented.
Deep Dive in most affected Cluster
The following ten attributes appear to play the largest role in the clustering result and are therefore decisive for identifying a potential customer:
Supervised Learning Model
Starting point
In this part of the project, the dataset mailout_train serves as the starting point. This dataset contains the attribute “RESPONSE”, which states whether or not a person became a customer of the company following the campaign.
The dataset is therefore split into X, containing all features without the target column “RESPONSE”, and y, containing only the target column. Based on this split, different algorithms for predicting a customer are compared.
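In code (assuming mailout_train has been run through the same cleaning pipeline as above):

```python
X = mailout_train.drop(columns=['RESPONSE'])
y = mailout_train['RESPONSE']

# the positive class is rare (roughly 1.2 percent), which is why ROC AUC is used
print(y.value_counts(normalize=True))
```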
Model Evaluation and Validation
Scikit-learn offers a convenient template for plotting learning curves and comparing them with each other. Based on these plots, the candidate algorithms can be compared by their training score and their corresponding cross-validation score. The XGBClassifier, which achieves the highest AUC training score and cross-validation score, is chosen.
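A sketch of such a comparison with scikit-learn's learning_curve; the candidate set and the cross-validation settings are assumptions:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from xgboost import XGBClassifier

candidates = {
    'LogisticRegression': LogisticRegression(max_iter=1000),
    'XGBClassifier': XGBClassifier(eval_metric='auc'),
}

for name, model in candidates.items():
    sizes, train_scores, cv_scores = learning_curve(
        model, X, y, scoring='roc_auc', cv=5,
        train_sizes=[0.2, 0.4, 0.6, 0.8, 1.0],
    )
    # mean scores per training size; plotting these gives the learning curves
    print(name, train_scores.mean(axis=1), cv_scores.mean(axis=1))
```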
However, since the cross-validation score does not change, this could be a sign of overfitting.
Refinement
Grid search is used for tuning the parameters of the model. For this purpose, the parameter combinations are evaluated for the highest roc_auc score over the following grid:
    param_grid = {
        'n_estimators': [25, 50, 100],
        'colsample_bytree': [0.5, 0.7, 0.8],
        'learning_rate': [0.1, 0.2, 0.3],
        'max_depth': [5, 10, 15],
        'reg_alpha': [1.1, 1.2, 1.3],
        'reg_lambda': [1.1, 1.2, 1.3],
    }
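A minimal sketch of running this search with scikit-learn's GridSearchCV, reusing the `param_grid` defined above (cv=5 is an assumption):

```python
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

# exhaustive search over the grid above, scored by ROC AUC
grid = GridSearchCV(
    XGBClassifier(eval_metric='auc'),
    param_grid=param_grid,
    scoring='roc_auc',
    cv=5,
)
grid.fit(X, y)
print(grid.best_params_, grid.best_score_)
```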
The best model can therefore be optimized with the following parameters:
eval_metric='auc', n_estimators=50, colsample_bytree=0.8, learning_rate=0.3, max_depth=5, reg_alpha=1.3, reg_lambda=1.2
The grid score thus improves slightly, from 0.552 to 0.572. As already suspected, the relatively low value suggests that the input data is not yet optimal. A similar value is also achieved in the Kaggle competition, so there is definitely still room for improvement.
Reflection
All in all, it was a very exciting and instructive project that I thoroughly enjoyed and that kept presenting me with new challenges.
Thanks to Udacity and Arvato for this experience.
However, more time should be spent on preparing the data and thinking it through again, as the score is not yet optimal. With more time, I would have liked to look at the individual features in more detail and optimize them accordingly. The model also appears to overfit, so it would make sense, for example, to remove strongly correlated attributes via a correlation analysis in order to have a better starting basis.