This is a Udacity Unsupervised Learning Project using Real Demographic & Customer Data
In this project, I used unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that was used was provided by Udacity's partners at Bertelsmann Arvato Analytics.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
import time
import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
There are four files associated with this project (not including this one):
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for the demographics data; 85 features (rows) x 4 columns.

We will examine, explore, clean, and cluster the German demographics data Udacity_AZDIAS_Subset.csv, then apply all the same transformations and cleaning to the customer demographics data Udacity_CUSTOMERS_Subset.csv. Each row represents a single person and includes information beyond the individual, such as details about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those clusters.
Afterwards, we will compare the cluster proportions between the general German demographic data and the customer data to see whether any clusters are over-represented in our customer data. Such clusters can be interpreted as the company's core user base and could be targeted in future marketing campaigns. Conversely, under-represented clusters may be viewed as a potential opportunity for expansion for the mail-order company.
Let's Begin...
Loading in our general demographics data & feature summary file...
# Setting display to show all rows
# Apologies in advance for cluttering the notebook, but this makes for more convenient analysis
pd.options.display.max_rows = 260
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv',';')
# Load in the feature summary file.
# Feature summary file contains very useful information about the data types for each feature
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv',';')
Taking a look at our general demographics...
azdias.shape
azdias.head()
Skimming the numerical data summary statistics...
azdias.describe().transpose()
Now taking a look at our feature summary file AZDIAS_Feature_Summary.csv
...
feat_info.shape
# As you can see, we have convenient information on each feature's data type and a code telling us which values denote NA
feat_info.head()
Converting Missing Value Codes to NaNs
The fourth column of the feature attributes summary (loaded in above as feat_info) tells us which codes indicate missing/unknown values (e.g. [-1,0]). However, these codes are stored as the string representation of a list, because when a DataFrame is written to csv, any columns containing Python list values are converted to strings. So, before we encode our NA values using the feature summary's missing_or_unknown column, we first need to convert this column back into actual lists.
Just for fun, let's see how many NA values are already recognized by Pandas.
# Total number of known missing values (Pandas recognizes as NA)
# the first .sum() is used to sum across all columns. The second .sum() is used to aggregate all NAs
azdias.isna().sum().sum()
# Using this function to convert the strings of a list to actual lists
def string_to_list(x):
'''
This function takes a string of a list and returns a list containing integers and strings.
Note the list can only contain alphanumeric values for it to work.
Input: String of a list (i.e. '[1,2,XX]')
Returns: List data type
'''
new_list=[]
# The below line removes the '[]' and splits on ',', creating a list of strings
x = x[1:-1].split(',')
# For each value in the list of strings, try turning it into an integer if possible and append to new_list
# Otherwise, append it to new_list
for i in x:
try:
new_list.append(int(i))
except:
new_list.append(i)
return new_list
# Applying the function
feat_info['NA_tags'] = feat_info['missing_or_unknown'].apply(string_to_list)
# Example of an entry in this new column
type(feat_info['NA_tags'][0])
As you can see, we've successfully converted these entries into Python lists
feat_info.head()
Now that that's out of the way, let's encode our NA values.
Inserting NA values into our DataFrame if they are tagged as NA in the "NA_tags" column
# Using a nearly identical copy of the feat_info DF with the attributes as the index
att_index = feat_info.set_index('attribute')
# This is an example of what I just created. You'll see why I did this in a moment
att_index.head(1)
# Creating a copy of the demographics data to work with to be the version with NA's inserted
na_azdias = azdias[:]
Below I'm looping through each column, using the att_index DataFrame as a reference to encode our NA values. I needed the features to be the index of the DataFrame in order to use .loc to retrieve each column's respective list of NA codes.
# Looping through our column names
for column in na_azdias.columns:
# For every column in our DF, if a value is in the 'NA_tags' list, we replace it inplace with np.NaN
na_azdias[column].replace(att_index.loc[column].loc['NA_tags'],np.NaN,inplace=True)
# Now we have about 3,500,000 more NA values as a result of this replacement
na_azdias.isna().sum().sum()
Below we will assess how much data is missing in each column and in each row.
Below is a simple textual representation of how many NA values are in each column. As you can see, columns starting with FINANZ... and SEMIO... don't have any missing values. This must be standard person-level data, mandatory for each observation.
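For reference, a minimal sketch of how to pull out just the fully observed columns programmatically:
# Columns with zero missing values after NA encoding
complete_cols = na_azdias.columns[na_azdias.isna().sum() == 0].tolist()
print(len(complete_cols), complete_cols[:10])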
na_azdias.isna().sum()
Below I'm making a temporary DataFrame to help me understand the question: how much data is missing in each column?
# Creating a copy of the DF containing the counts of NAs with the names of the columns in the original DF as a column
na_azdias_df = na_azdias.isna().sum().reset_index()
# Filtering for only columns with more than 50000 NA values to declutter the plot
temp_view = na_azdias_df[na_azdias_df[0]>50000]
# FYI - 'index' is column name and '0' is count of NA values
temp_view.head(1)
Below is a horizontal bar plot showing each column's NA count as its bar width. Here I've noted (and highlighted in red) the columns that have many more NA values than the others...
# Sorting by count of NA values
temp_view.sort_values(by=0,ascending=True,inplace=True)
plt.figure(figsize=(6,12))
barh = plt.barh(temp_view['index'],temp_view[0],color = 'darkturquoise', alpha = 0.85)
# Highlighting the 6 columns with the most NA values (the last 6 bars after the ascending sort)
for i in range(1, 7):
    barh[-i].set_color('darkred')
plt.tight_layout()
So now I'll remove those 6 outlier columns...
outlier_columns = ['KBA05_BAUMAX','KK_KUNDENTYP','ALTER_HH','TITEL_KZ','GEBURTSJAHR','AGER_TYP']
na_azdias.drop(labels=outlier_columns,axis=1,inplace=True)
Printing the shape shows that the number of columns decreased from 85 to 79.
na_azdias.shape
Recap
Per my analysis of missing data above, I've found that the entire dataset contained 4,896,838 known missing values (recognized by Pandas) before I manually encoded NA values and contained 8,912,221 known missing values after I encoded NA values per the feat_info file and dropped "outlier" columns with many NA values.
I plotted the count of NA values per column and noted that the following 6 columns were outliers in terms of the number of missing values and were subsequently removed: KBA05_BAUMAX, KK_KUNDENTYP, ALTER_HH, TITEL_KZ, GEBURTSJAHR, and AGER_TYP.
Below I will assess how much data is missing in each row and evaluate whether I should drop any rows.
This analysis is going to be a bit more work than assessing missing data in each column, because dropping rows has more implications than dropping columns. By dropping rows, we're dropping entire observations (one observation being one person), which can skew/misrepresent the demographic data.
Spoiler alert: I will be dropping rows for the sake of compatibility with Principal Component Analysis and KMeans Clustering. Let's understand the consequences of dropping these rows.
# Creating a temp column that contains the count of NA values in each row
na_azdias['no_na_row'] = na_azdias.isna().sum(axis=1)
This histogram plots a distribution of non-zero NA values. Looks like most of our rows with NA values are missing between 1-10 of their values.
plt.hist(na_azdias[na_azdias['no_na_row']>0]['no_na_row'],bins=10)
Because we have a significant number of rows with 30-50 NA values, I'll split the data into two subsets: one with rows containing 30 or fewer NA values (na_azdias_b30) and one with rows containing more than 30 NA values (na_azdias_g30).
na_azdias_b30 = na_azdias[na_azdias['no_na_row']<=30]
na_azdias_g30 = na_azdias[na_azdias['no_na_row']>30]
Below is a quick comparison of the proportion of NA values in each row.
There's an oddly large number of rows with 49 and 53 missing values.
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(13,4))
sns.countplot(x='no_na_row',data=na_azdias_b30,ax=ax1,orient='v')
sns.countplot(x='no_na_row',data=na_azdias_g30,ax=ax2,orient='v')
plt.tight_layout()
Now I'll compare the distributions of particular columns' values between the two subsets of data.
Note this is different from what we did in the cell above, which was to plot the distribution of rows over the number of NA values they contain.
def compare_columns(df1,df2,column,df1_desc=None,df2_desc=None):
'''
Takes two DataFrames and a column name as input and plots two countplots of the given column for both DataFrames.
'''
# Taking max value count between the 2 columns to set as ylim for both plots
df1_max = df1[column].value_counts().iloc[0]
df2_max = df2[column].value_counts().iloc[0]
if df1_max > df2_max:
top = df1_max * 1.05
else:
top = df2_max * 1.05
# Plotting
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
sns.countplot(df1[column],ax=ax1)
ax1.set_ylabel('Count')
ax1.set_ylim(top=top)
ax1.set_title(df1_desc)
sns.countplot(df2[column],ax=ax2)
ax2.set_ylabel('Count')
ax2.set_ylim(top=top)
ax2.set_title(df2_desc)
plt.tight_layout()
Below are plots for five columns that have no missing values. Plots on the left are from the subset of data with at most 30 NA values per row, and plots on the right are from the subset of data with more than 30 NA values per row.
for n in ['ANREDE_KZ','FINANZ_MINIMALIST','GREEN_AVANTGARDE','SEMIO_SOZ','ZABEOTYP']:
compare_columns(na_azdias_b30,na_azdias_g30,n,"Below 30 NA Values","Above 30 NA Values")
Clearly, the subset of data with more than 30 NA values in each row (plots on the right) is much smaller than the other subset. Additionally, the distributions of values between the two subsets are different for all of the columns above, with the exception of column "ANREDE_KZ" (Gender).
Note that we're looking at columns with zero NA values and comparing rows with many NA values against rows with few NA values.
This suggests a possible bias introduced in the way the data was collected and/or entered. Whatever the case may be, these rows with many NA values should be investigated, since they should hypothetically be distributed similarly to the subset of data with 30 or fewer NA values.
Below is an additional, supplementary visualization that overlays the distribution plots of the two subsets for the first 10 columns. It doesn't add much more information, so feel free to skim over it.
plt.figure(figsize=(15,12))
for i, col in enumerate(na_azdias_b30.columns[:10]):
plt.subplot(5, 2, i+1)
sns.distplot(na_azdias_b30[col][na_azdias_b30[col].notnull()], label='<=30 NA',hist=False)
sns.distplot(na_azdias_g30[col][na_azdias_g30[col].notnull()], label='>30 NA',hist=False)
plt.title('Distribution for column: {}'.format(col))
plt.legend();
plt.tight_layout()
plt.show()
Below I'll use the KS test to compare each column's distribution across our two datasets (rows with at most 30 NA values & rows with more than 30 NA values).
Purpose: To evaluate how different these two groups of data are distributed. If we see that these distributions are very different (low p-values as a result of our KS-Test), then we have statistical evidence that we can use to evaluate whether we should drop the >30 NA values-per row subset.
TL;DR? Scroll down below for my interpretation of the KS test and a quick note on how it works.
def KS_col_test(df1,df2):
'''
Used to evaluate whether distributions of data between two DataFrames with the same columns are different.
Input: 2 DataFrames with identical columns.
Output: D (KS Stat | Max Distance between the two cumulative probability functions for each Dataset) and p-values associated to each D
'''
# Lists for the KS Stat (D) and p-values to return
stats = []
p_values = []
# Grabbing an array of column names
cols = df1.columns.values
for col in cols:
try:
D,p = ks_2samp(df1[col],df2[col])
stats.append(D)
p_values.append(p)
except:
stats.append(np.NaN)
p_values.append(np.NaN)
return stats,p_values
# Calculating stats and p-values for each column
stats,p_values = KS_col_test(na_azdias_b30,na_azdias_g30)
# Creating a DataFrame to store the results. 80 rows (1/feature)
ks_df = pd.DataFrame(na_azdias_b30.columns.values,columns=['feature'])
ks_df['stat'] = stats
ks_df['p_value'] = p_values
# Plotting results. See below for my interpretation
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,4))
sns.distplot(ks_df[ks_df['stat'].notnull()]['stat'],ax=ax1,kde=False)
ax1.set_title('Distribution of D KS Stats')
ax1.set_xlabel('D | KS Stat')
ax1.set_ylim(top=70)
sns.distplot(ks_df[ks_df['p_value'].notnull()]['p_value'],ax=ax2,kde=False)
ax2.set_title('Distribution of P-Values')
ax2.set_xlabel('P Values')
plt.tight_layout()
plt.show()
So let's recap what we just did. We have two DataFrames:
- na_azdias_b30: DataFrame with <= 30 NA values per row
- na_azdias_g30: DataFrame with > 30 NA values per row

We used SciPy's implementation of the two-sample Kolmogorov-Smirnov test on each of the 80 columns to evaluate the similarity of each column's distributions.
Brief Aside on the KS-Test
The KS test compares two samples to check whether they came from the same distribution. (In our case, we're checking whether the values in each column are distributed similarly across our two DataFrames, to evaluate the consequence of dropping one of them; namely, the DataFrame with more NA values.) The KS test builds an empirical cumulative distribution function for each sample and compares them. The maximum difference between these two ECDFs is our KS statistic, D. We then compute the p-value (null hypothesis: the two samples came from the same distribution). If our p-value is low, we are more confident that the samples come from different distributions.
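To make the mechanics concrete, here's a tiny synthetic example (a sketch; the data is made up) of how ks_2samp behaves for similar vs. different samples:
# Sketch: ks_2samp on synthetic data - same distribution vs. shifted distribution
rng = np.random.RandomState(0)
same_a, same_b = rng.normal(0, 1, 1000), rng.normal(0, 1, 1000)
shifted = rng.normal(0.5, 1, 1000)
print(ks_2samp(same_a, same_b))   # p-value should typically be large: no evidence of different distributions
print(ks_2samp(same_a, shifted))  # p-value should be tiny: strong evidence of different distributions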
KS Test Conclusion
So we see that our p-values are extremely low, telling us that the two DataFrames (30 or fewer NA values vs. more than 30 NA values) are highly likely to come from different distributions. I interpret this as a warning that if we drop our na_azdias_g30 DataFrame, we're going to lose valuable information that could affect our algorithm.
Below I investigated the remaining columns containing NA values and found that many of the rows with NA values have consecutive NA values in related columns, such as columns that begin with "KBA05..." and "PLZ8...". This suggests that these missing values are likely not due to random erroneous data entry, but are rather caused by some external factor, such as the process/methodology by which the data was collected.
TL;DR? Key finding: most of the rows returned have many NA values in related columns. For example, the cell immediately below returns rows with NA values in column 'KBA05_ANTG3', and by scrolling right to the end of the DataFrame, we see that many of the columns starting with 'PLZ8_' are missing in the same rows.
na_azdias[pd.isna(na_azdias['KBA05_ANTG3'])]
Recap
I split the data into two subsets: rows containing 0-30 NA values and rows containing more than 30 NA values. When looking at the distribution of data for columns that do not contain NA values, I noticed that the distribution of data between these two subsets look very different, which is contrary to what you'd expect. When you split data and look at distributions on columns with no NA values, you'd expect the distributions of these columns to look the same. This suggested to me that there is some sort of bias introduced in the way the data was collected or entered for rows with many NA values.
Additionally, I looked at these rows with many NA values, and they had consecutive NA values in related columns, such as columns that begin with "PLZ8..." or "KBA05...". This further suggests that these missing values are probably not caused by chance alone.
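One quick way to quantify this co-missingness (a sketch) is to correlate the missing-value indicators of related columns; values near 1 mean the columns tend to be missing together:
# Correlation between missingness indicators for the PLZ8_* columns
plz8_cols = [c for c in na_azdias.columns if c.startswith('PLZ8')]
na_azdias[plz8_cols].isna().corr()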
Let's see how much of our data has more than 30 missing values.
prop_g_30 = len(na_azdias_g30)/len(na_azdias)
print(f'{prop_g_30*100:.2f}% of our data has more than 30 missing values in a row.')
10.46% of the data is a huge chunk that could be important for our ML algorithm. We also saw through the plots and the KS test that the distributions of values between the two subsets are very different. However, considering the number of missing values per row (more than 30 out of 79 total features) that we'd have to impute, I'm dropping these rows (also for simplicity's sake...and laziness) and moving forward with a DataFrame called azdias_1.
na_azdias_b30.drop('no_na_row',inplace=True,axis=1)
azdias_1 = na_azdias_b30[:]
In this section, we will examine the different types of data that we have and evaluate next steps to prepare the data for our PCA and KMeans clustering algorithms.
Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.
Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!
Let's use the feat_info file to get counts of how many features of each data type we have.
# Removing from feat_info the 6 outlier columns that were dropped from the data
feat_info = feat_info.set_index('attribute').drop(outlier_columns).reset_index()
feat_info['type'].value_counts()
# Creating a dictionary where the 4 data types above are the keys and the corresponding DF columns are the values
# This will help with breaking up each group of features for analysis
df_dtype = {}
dtypes = ['ordinal','categorical','numeric','mixed']
for d in dtypes:
df_dtype[d]=feat_info[feat_info['type']==d]['attribute']
Our approach will be the following...
def check_unique_vals(df):
    '''Prints the unique values found in each column of the given DataFrame.'''
    for col in df.columns:
        values = list(df[col].unique())
        print(f'{col}: {values}')
check_unique_vals(azdias_1[df_dtype['ordinal']])
The ordinals look okay (no strings; all of them are numeric).
# Columns that are tagged as "categorical" are either binary or multi_level
# Printing and tracking binary and multi-level categorical values
binary_cats = []
multi_level_cats = []
for col in azdias_1[df_dtype['categorical']]:
values = azdias_1[df_dtype['categorical']][col].nunique()
if values == 2:
binary_cats.append(col)
else:
multi_level_cats.append(col)
print(f'{col}: {values}')
# Examining our binary categorical values
azdias_1[binary_cats].head()
So it looks like we have to encode "OST_WEST_KZ" to be numerical.
# Looking at the values in this column
azdias_1['OST_WEST_KZ'].value_counts()
# I'll one-hot encode this column with the rest of our multi-level categorical variables
multi_level_cats.append('OST_WEST_KZ')
Now examining our multi-level categorical variables.
# These have to be one-hot encoded
azdias_1[multi_level_cats].head()
azdias_1[df_dtype['numeric']].describe()
Looks like all of the values in the numeric columns are indeed numeric. These look good.
azdias_1[df_dtype['mixed']].head()
# Printing the number of unique values within each of the mixed columns
for column in azdias_1[df_dtype['mixed']].columns:
print(column,azdias_1[df_dtype['mixed']][column].nunique())
Taking a look at the Data_Dictionary.md, I noticed these are all categorical variables. I'm treating them as multi-level categorical variables (i.e. one-hot encoding them), with the exception of "PRAEGENDE_JUGENDJAHRE" and "CAMEO_INTL_2015". I'll deal with those two in the next section.
for col in df_dtype['mixed']:
if col not in ['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE']:
multi_level_cats.append(col)
Recap
So we've taken a look at the ordinal, categorical, numeric, and mixed variable columns. We're fine with the ordinal and numeric columns. The categorical column can be further split between binary and multi-level columns, and we've already taken care of the binary categorical columns.
So now we must encode the multi-level categorical and mixed variable columns.
azdias_1 = pd.get_dummies(azdias_1,columns=multi_level_cats)
azdias_1.shape
In sum, after removing outlier columns and rows with over 30 NA values, I kept ordinal and numeric columns the same, and one-hot encoded multi-level categorical, one binary categorical, and some mixed variable columns.
Below I'm working on converting the "PRAEGENDE_JUGENDJAHRE" (combines information on three dimensions: generation by decade, movement, and nation) and "CAMEO_INTL_2015" (combines information on two axes: wealth and life stage).
I will break these codes into new ordinal variables: a decade and a movement variable for "PRAEGENDE_JUGENDJAHRE", and a wealth variable derived from the 'tens'-place digit of "CAMEO_INTL_2015" (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
Below is the mapping for "PRAEGENDE_JUGENDJAHRE" from the Data_Dictionary.md
movement = []
decade = []
for val in azdias_1['PRAEGENDE_JUGENDJAHRE']:
# Mapping for movement
if val in [1,3,5,8,10,12,14]:
movement.append(0)
elif val in [2,4,6,7,9,11,13,15]:
movement.append(1)
else:
movement.append(np.NaN)
# Mapping for decade
if val in [1,2]:
decade.append(40)
elif val in [3,4]:
decade.append(50)
elif val in [5,6,7]:
decade.append(60)
elif val in [8,9]:
decade.append(70)
elif val in [10,11,12]:
decade.append(80)
elif val in [13,14,15]:
decade.append(90)
else:
decade.append(np.NaN)
# Quick sanity check
len(movement)==len(decade)==azdias_1.shape[0]
# Assigning with azdias_1's index to avoid misalignment, since the index is no longer contiguous after dropping rows
azdias_1['movement'] = pd.Series(movement, index=azdias_1.index)
azdias_1['decade'] = pd.Series(decade, index=azdias_1.index)
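For reference, an equivalent and more concise way to build these two columns (a sketch; it reproduces the same mapping, so it isn't needed in addition to the loop above) is pandas Series.map with dictionaries:
# Sketch: dict-based mapping; unmapped codes (and NaN) automatically become NaN
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 90, 14: 90, 15: 90}
movement_map = {c: 0 for c in [1, 3, 5, 8, 10, 12, 14]}
movement_map.update({c: 1 for c in [2, 4, 6, 7, 9, 11, 13, 15]})
azdias_1['decade'] = azdias_1['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
azdias_1['movement'] = azdias_1['PRAEGENDE_JUGENDJAHRE'].map(movement_map)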
Below is a mapping for "CAMEO_INTL_2015" from the Data_Dictionary.md
cameo_intl = []
for val in azdias_1['CAMEO_INTL_2015']:
try:
val = int(val)
if 10 < val < 19:
cameo_intl.append(5)
elif 20 < val < 29:
cameo_intl.append(4)
elif 30 < val < 39:
cameo_intl.append(3)
elif 40 < val < 49:
cameo_intl.append(2)
elif 50 < val < 59:
cameo_intl.append(1)
else:
cameo_intl.append(np.nan)
except:
cameo_intl.append(np.nan)
# Again using azdias_1's index to keep the values aligned with the correct rows
azdias_1['cameo_intl'] = pd.Series(cameo_intl, index=azdias_1.index)
azdias_1.head()
azdias_1.drop(['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE'],axis=1,inplace=True)
Recap
I performed the following steps above...

PRAEGENDE_JUGENDJAHRE
I replaced "PRAEGENDE_JUGENDJAHRE" with two new columns, movement and decade, according to the mapping from the Data_Dictionary.md.

CAMEO_INTL_2015
I replaced "CAMEO_INTL_2015" with a column called cameo_intl according to the mapping from the Data_Dictionary.md.
Below is a huge function that replicates all the steps I performed above on any given DataFrame with the same structure as the demographics data.
def clean_data(df):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv',';')
feat_info['NA_tags'] = feat_info['missing_or_unknown'].apply(string_to_list)
att_index = feat_info.set_index('attribute')
# Looping through our column names
for column in df.columns:
# For every column in our DF, if a value is in the 'NA_tags' list, we replace it inplace with np.NaN
df[column].replace(att_index.loc[column].loc['NA_tags'],np.NaN,inplace=True)
# remove outlier columns
outlier_columns = ['KBA05_BAUMAX',
'KK_KUNDENTYP',
'ALTER_HH',
'TITEL_KZ',
'GEBURTSJAHR',
'AGER_TYP']
df.drop(labels=outlier_columns,axis=1,inplace=True)
# remove rows with > 30 NA values
df['no_na_row'] = df.isna().sum(axis=1)
df = df[df['no_na_row']<=30]
df.drop('no_na_row',inplace=True,axis=1)
# select, re-encode, and engineer column values.
feat_info = feat_info.set_index('attribute').drop(outlier_columns).reset_index()
df_dtype = {}
dtypes = ['ordinal','categorical','numeric','mixed']
for d in dtypes:
df_dtype[d]=feat_info[feat_info['type']==d]['attribute']
# Separating binary from multi-level categorical columns
binary_cats = []
multi_level_cats = []
for col in df[df_dtype['categorical']]:
values = df[df_dtype['categorical']][col].nunique()
if values == 2:
binary_cats.append(col)
else:
multi_level_cats.append(col)
# Including OST_WEST_KZ in the list to be one-hot encoded
multi_level_cats.append('OST_WEST_KZ')
for col in df_dtype['mixed']:
if col not in ['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE']:
multi_level_cats.append(col)
# One-hot encoding
df = pd.get_dummies(df,columns=multi_level_cats)
# Dealing with the 2 particular mixed vars...
movement = []
decade = []
for val in df['PRAEGENDE_JUGENDJAHRE']:
# Mapping for movement
if val in [1,3,5,8,10,12,14]:
movement.append(0)
elif val in [2,4,6,7,9,11,13,15]:
movement.append(1)
else:
movement.append(np.NaN)
# Mapping for decade
if val in [1,2]:
decade.append(40)
elif val in [3,4]:
decade.append(50)
elif val in [5,6,7]:
decade.append(60)
elif val in [8,9]:
decade.append(70)
elif val in [10,11,12]:
decade.append(80)
elif val in [13,14,15]:
decade.append(90)
else:
decade.append(np.NaN)
    # Use df's index to keep the new columns aligned, since rows were dropped above
    df['movement'] = pd.Series(movement, index=df.index)
    df['decade'] = pd.Series(decade, index=df.index)
cameo_intl = []
for val in df['CAMEO_INTL_2015']:
try:
val = int(val)
if 10 < val < 19:
cameo_intl.append(5)
elif 20 < val < 29:
cameo_intl.append(4)
elif 30 < val < 39:
cameo_intl.append(3)
elif 40 < val < 49:
cameo_intl.append(2)
elif 50 < val < 59:
cameo_intl.append(1)
else:
cameo_intl.append(np.nan)
except:
cameo_intl.append(np.nan)
    df['cameo_intl'] = pd.Series(cameo_intl, index=df.index)
df.drop(['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE'],axis=1,inplace=True)
return df
Testing the cleaning function on a copy of our data...
test_func_az = azdias[:]
test_df = clean_data(test_func_az)
azdias_1.shape==test_df.shape
So we're going to use principal component analysis (PCA) to reduce the number of dimensions (features) of our data. However, in order to do so, our data needs to be standardized (mean = 0, standard deviation = 1). We also need to take care of all the NA values that we kept.
My approach is the following:
- Impute missing values using the median of each column.
- Use StandardScaler to standardize our data.

# Imputing missing values with the median
med_imp = SimpleImputer(strategy='median')
azdias_2 = med_imp.fit_transform(azdias_1)
# Standardizing
scaler = StandardScaler()
azdias_2 = scaler.fit_transform(azdias_2)
For simplicity's sake, I chose to impute the data with the median value of each column. Then, I standardized our data using a StandardScaler so that PCA computes the principal components on comparably scaled features and so that K-Means groups observations based on their relative spread rather than their absolute values.
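As a side note, these two steps could be bundled into a single scikit-learn Pipeline so that the exact same fitted transformations can be reapplied to new data later (a minimal sketch, assuming the same strategy choices as above):
# Sketch: imputation + scaling as one reusable Pipeline object
from sklearn.pipeline import Pipeline
preprocess = Pipeline([('impute', SimpleImputer(strategy='median')),
                       ('scale', StandardScaler())])
# preprocess.fit_transform(azdias_1) would reproduce azdias_2, and
# preprocess.transform(...) would later apply the same fitted steps to the customer data.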
Principal Component Analysis, crudely simplified, transforms your data to make it more succinct. It takes columns that are strongly related to each other and engineers new columns that better describe the relationships among the original columns. These new columns are principal components, and we can use any given number of these components for analysis.
The more principal components (PC) we use, the more information we have, but computing KMeans clustering is slower, and we increase risk of overfitting. So ideally, we want to choose a number of PC's that best represent our dataset without losing too much information.
I'd like to try several n component values for PCA Dimensionality Reduction. I'll start with 40 to see how much variance is explained. Ideally, I'd like to get to the number of components with 85% variance explained.
Quick Aside: PC's describe relationships between the columns of our data. The most relevant (most useful) PC's are said to explain the most variance. So, we want to choose a number of the PC's that explain the most variance while reducing the number of features we use.
This variance can be calculated by calling the explained_variance_ratio_ attribute of a fitted pca object.
azdias_2.shape
PCA Test Run for 40.
pca = PCA(40)
X_pca = pca.fit_transform(azdias_2)
pca.explained_variance_ratio_.sum()
So about 50% is explained with 40 components. Below I'll loop through several values for No. Principal Components until we get 85% variance explained.
n_components = [80,110,150,180,210]
for n in n_components:
pca = PCA(n)
X_pca = pca.fit_transform(azdias_2)
if pca.explained_variance_ratio_.sum() >= .85:
break
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
Looks good, but 150 components feels a bit high. So I'm going to re-run this loop in smaller increments to just get around 85%.
n_components = [120,125,130,135,140]
for n in n_components:
pca = PCA(n)
X_pca = pca.fit_transform(azdias_2)
if pca.explained_variance_ratio_.sum() >= .85:
break
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
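As an aside, scikit-learn can also pick the number of components for a target explained-variance fraction directly (a sketch, not used above):
# Sketch: passing a float between 0 and 1 makes PCA keep the smallest number of
# components whose cumulative explained variance reaches that fraction
pca_auto = PCA(n_components=0.85, svd_solver='full')
X_pca_auto = pca_auto.fit_transform(azdias_2)
print(pca_auto.n_components_, pca_auto.explained_variance_ratio_.sum())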
Now plotting total explained variance for our principal components.
ticks = np.arange(pca.n_components)
values = pca.explained_variance_ratio_
plt.figure(figsize = (13,5))
ax = plt.subplot(111)
cumvals = np.cumsum(values)
ax.bar(ticks,values)
ax.plot(ticks,cumvals)
ax.xaxis.set_tick_params(width=1)
ax.yaxis.set_tick_params(width=2,length=12)
ax.set_xlabel("No. Principal Components")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component\nScree Plot')
plt.show()
Recap
So above we performed PCA many times until we found a number of principal components that explains 85% variance. The scree plot above plots principal components on the x-axis, variance explained on the y-axis, bars representing amount of variance explained, and a line that cumulatively sums up the variance explained across our principal components. I've decided to go with 130 principal components because it explains a good amount of variance (again, 85%) while greatly reducing number of dimensions from 257 to 130.
Below I'll take a swing at interpreting what some principal components can tell us about our data.
def plot_pca_dim(pca,feat_names,dim,features):
'''
    For a given fitted PCA object, component index (dim), and number of features, this plots
    the features with the largest absolute weights in that component.
    '''
sort_abs = lambda x: abs(x[0])
names_to_weights = sorted(zip(pca.components_[dim],feat_names),key=sort_abs,reverse=True)
names_to_weights = np.array(names_to_weights)
nm_to_wt_df = pd.DataFrame(names_to_weights,columns=['var','feature'])
nm_to_wt_df['var'] = nm_to_wt_df['var'].astype(float)
return px.bar(nm_to_wt_df[:features],x='feature',y='var',
color='var',color_continuous_scale=px.colors.sequential.Sunsetdark,
labels = {'var': 'Variance (%)','feature':'Feature Names'},
height=400,width=600)
feat_names = azdias_1.columns.values
Skip ahead past all the plots for my interpretation of the PCs.
plot_pca_dim(pca,feat_names,0,10)
Field Name | Interpretation | Pos/Neg |
---|---|---|
MOBI_REGIO | Region: Movement Patterns | Neg |
LP_STATUS_GROB_1.0 | Social Status | Pos |
KBA05_ANTG1 | No. 1-2 family houses in microcell | Neg |
PLZ8_ANTG3 | No. 6-10 family houses in region | Pos |
PLZ8_ANTG1 | No. 1-2 family houses in region | Neg |
FINANZ_MINIMALIST | Low Financial Interest | Neg |
KBA05_GBZ | No. Buildings in microcell | Neg |
HH_EINKOMMEN_SCORE | Est. Household net income | Pos |
PLZ8_ANTG4 | No. 10+ family homes in region | Pos |
PLZ8_BAUMAX_1.0 | Most common building type | Neg |
Observations: Component 1
plot_pca_dim(pca,feat_names,1,10)
Field Name | Interpretation | Pos/Neg |
---|---|---|
ALTERSKATEGORIE_GROB | Estimated age | Pos |
FINANZ_SPARER | Money Saver | Neg |
FINANZ_VORSORGER | Being financially prepared | Pos |
FINANZ_UNAUFFAELLIGER | Financially inconspicuous | Neg |
SEMIO_REL | Personality: Religious | Neg |
SEMIO_TRADV | Personality: Traditional-minded | Neg |
SEMIO_PFLICHT | Personality: Dutiful | Neg |
FINANZ_ANLEGER | Investor | Neg |
ZABEOTYP_3 | Fair energy consumer | Pos |
SEMIO_ERL | Personality: Event-oriented | Pos |
Observations: Component 2
plot_pca_dim(pca,feat_names,2,10)
Field Name | Interpretation | Pos/Neg |
---|---|---|
ANREDE_KZ | Gender (1-M, 2-F) | Neg |
SEMIO_VERT | Personality: Dreamful | Pos |
SEMIO_KAEM | Personality: Combative Attitude | Neg |
SEMIO_DOM | Personality: Dominant-minded | Neg |
SEMIO_KRIT | Personality: Critical-minded | Neg |
SEMIO_FAM | Personality: Family-minded | Pos |
SEMIO_SOZ | Personality: Socially-minded | Pos |
SEMIO_KULT | Personality: Cultural-minded | Pos |
SEMIO_ERL | Personality: Event-oriented | Neg |
FINANZ_ANLEGER | Investor | Neg |
Observations: Component 3
pca.explained_variance_ratio_[:3].sum()
Cumulatively, they only explain approx. 13.9% of variance from our features, so they don't explain very much variance.
PCA Summary Observations
Below I will use sklearn's KMeans clustering algorithm to identify different segments of the demographics data.
One thing that's very tricky when it comes to using KMeans is the fact that you have to choose the number of clusters that the algorithm will find. This can be difficult since we don't know how many clusters exist!
So, here's what we're going to do...
Intuitively, this makes sense. If there are 5 clear-cut clusters in our population, adding clusters 1 through 5 will each significantly reduce the average distance between points and their assigned centroid. Once we add a 6th cluster, it will situate itself close to one of the already-defined 5 clusters, which will reduce that average distance, but not by much.
scores = []
First trying 3-10 clusters to see how KMeans performs.
# Over a number of different cluster counts...
clusters = np.arange(3,11)
for i in clusters:
    start = time.time()
    model = KMeans(i)
    model.fit(X_pca)
    # Score is the sum of squared distances from each sample to its assigned centroid
    score = abs(model.score(X_pca))
    scores.append(score)
    end = time.time()
    print(f'Running Score for {i} Clusters: {score:.3f}')
    print(f'Time taken: {end-start:.2f} seconds.')
Plotting the scores below...
plt.plot(clusters,scores,marker="o")
plt.show()
The scores I'm plotting are the sum of the squared distances from each point to its corresponding centroid. The trend looks linear. A possible sign of having too many clusters is diminished reduction in these distances for each additional cluster. This is known as the "elbow" method. Given the trend looks linear, I'm going to try a few more clusters below.
clusters = np.arange(11,16)
for i in clusters:
    start = time.time()
    model = KMeans(i)
    model.fit(X_pca)
    # Score is the sum of squared distances from each sample to its assigned centroid
    score = abs(model.score(X_pca))
    scores.append(score)
    end = time.time()
    print(f'Running Score for {i} Clusters: {score:.3f}')
    print(f'Time taken: {end-start:.2f} seconds.')
Replotting the scores.
plt.plot(np.arange(3,16),scores,marker="o")
plt.show()
Although it's not very clear in the plot, it looks like our distance reduction began diminishing from 13 to 14 clusters, from reading out the running score printed 3 cells above. I'll stick with 13 clusters and proceed with a final model fitted on the PCA'd version of our demographics data.
final_model = KMeans(13)
final_model.fit(X_pca)
gen_preds = final_model.predict(X_pca)
Recap
I fit and scored KMeans on our PCA-transformed data for 3 to 15 centroids. The score is calculated as the sum of squared distances between samples and their assigned centroid. As we add centroids, the score decreases by less and less. 10 and 13 centroids seemed like good candidates, but I chose to proceed with 13 clusters because the sum of squared distances was still being reduced noticeably between 11 and 13 centroids.
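A complementary diagnostic (a sketch, not used above) is the silhouette score; since it is expensive to compute on roughly 800k rows, a random subsample is assumed here:
# Sketch: silhouette score of the final model on a random subsample of the PCA-transformed data
from sklearn.metrics import silhouette_score
rng = np.random.RandomState(42)
idx = rng.choice(X_pca.shape[0], size=10000, replace=False)
print(silhouette_score(X_pca[idx], final_model.predict(X_pca[idx])))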
Below I'm cleaning, scaling, PCA-ing, and clustering our customer data using all the objects and steps we used on the demographics data.
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv',';')
customers.head()
Cleaning
cleaned_cust = clean_data(customers)
cleaned_cust.shape
The imputer and scaler aren't going to work yet because "GEBAEUDETYP_5.0" isn't in our cleaned_cust DataFrame. This column wasn't created because the "GEBAEUDETYP" column didn't have any values of 5 in our customer set, and we one-hot encoded it using pd.get_dummies instead of OneHotEncoder. Adding a column of only zeros for this missing dummy is equivalent to having no values of 5 for this feature.
for i in azdias_1.columns:
if i not in cleaned_cust.columns:
print(i)
# Adding the missing dummy column as all zeros (avoids hard-coding the row count and any index misalignment)
cleaned_cust['GEBAEUDETYP_5.0'] = 0
# Reordering columns to align with original DataFrame
cols = azdias_1.columns.tolist()
cleaned_cust = cleaned_cust[cols]
cleaned_cust.shape
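For future reference, this column alignment could also be done in one step with reindex (a sketch, equivalent to the manual fix above):
# Sketch: add any missing dummy columns as zeros and order columns like the general-population data
cleaned_cust = cleaned_cust.reindex(columns=azdias_1.columns, fill_value=0)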
Imputing & Standardization
imputed = med_imp.transform(cleaned_cust)
standardized = scaler.transform(imputed)
standardized.shape
PCA
X_pca_cust = pca.transform(standardized)
X_pca_cust.shape
Clustering
preds = final_model.predict(X_pca_cust)
Now that we have both our demographics data and our customer data clustered, let's compare them!
df_preds_cust = pd.DataFrame(preds,columns=['cust_preds'])
df_preds_gen = pd.DataFrame(gen_preds,columns=['gen_preds'])
prop_cust = df_preds_cust['cust_preds'].value_counts().sort_index()/ \
sum(df_preds_cust['cust_preds'].value_counts())
prop_gen = df_preds_gen['gen_preds'].value_counts().sort_index()/ \
sum(df_preds_gen['gen_preds'].value_counts())
fig,ax = plt.subplots(figsize=(15,5))
width = 0.35
cust = ax.bar(prop_cust.index-.51*width, prop_cust, width, color='darkolivegreen')
gen = ax.bar(prop_gen.index+.51*width, prop_gen, width, color='steelblue')
ax.set_ylabel('Proportion (%)')
ax.set_xlabel('Labels')
ax.set_title('General Population Clusters Vs Customer Clusters')
ax.set_xticks(prop_cust.index)
ax.set_ylim(top=max(prop_cust.append(prop_gen))*1.15)
ax.legend(('Customer','General Pop'))
def autolabel(rects):
"""
Attach a text label above each bar displaying its height
"""
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
'%.2f' % float(height),
ha='center', va='bottom')
autolabel(cust)
autolabel(gen)
The two relative distributions are slightly different. The most significant differences between them are the following...
So we can see that some clusters are over-represented in our customer data and some are under-represented in our customer data. Let's take a look at what characteristics describe these clusters to help us understand who our customers are.
Below I'm extracting cluster 7's top 5 most significant principal components.
centroid = final_model.cluster_centers_[7]
centroid_df = pd.DataFrame(centroid,columns=['pca_weights'])
# Sorting the centroid in descending absolute value order
# Note indexes correspond to principal components
centroid_df.reindex(centroid_df.abs().sort_values('pca_weights',ascending=False).index).head()
Relatively speaking, it looks like PC 0 was highly negatively influencing this cluster, and PC 3 was highly positively influencing this cluster. Let's take a look at these components...
plot_pca_dim(pca,feat_names,0,10)
# Provided by Udacity Reviewer - Inserted for future reference
# This function is a (less pretty but) more concise way of showing the features that are
# Most negatively correlated with the cluster
# Most positively correlated with the cluster
# However, it is computed in a slightly different way
# (by inverse transforming our PCA data back into orig features)
def plot_scaled_comparison(df_sample, kmeans, cluster):
'''
    Performs the PCA inverse_transform of a selected cluster centroid from our trained model to obtain
the scaled feature weights. This helps us interpret who belongs in a given cluster.
'''
X = pd.DataFrame.from_dict(dict(zip(df_sample.columns,
pca.inverse_transform(kmeans.cluster_centers_[cluster]))), orient='index').rename(
columns={0: 'feature_values'}).sort_values('feature_values', ascending=False)
X['feature_values_abs'] = abs(X['feature_values'])
pd.concat((X['feature_values'][:10], X['feature_values'][-10:]), axis=0).plot(kind='barh');
plot_scaled_comparison(azdias_1, final_model, 5)
plot_pca_dim(pca,feat_names,3,5)
PC 0 (Negatively weighted by cluster 7)
PC 3 (Positively weighted by cluster 7)
My intuition suggests to me that cluster 7 reflects older and wealthier households that likely live in bigger and wealthier areas in Germany.
Below I'm extracting cluster 5's top 5 most significant principal components.
centroid = final_model.cluster_centers_[5]
centroid_df = pd.DataFrame(centroid,columns=['pca_weights'])
centroid_df.reindex(centroid_df.abs().sort_values(by='pca_weights',ascending=False).index).head()
Relatively speaking, it looks like PC 0 was highly positively influencing this cluster, and PC 1 was highly negatively influencing this cluster. Let's take a look at these components...
plot_pca_dim(pca,feat_names,0,5)
plot_pca_dim(pca,feat_names,1,5)
PC 0 (Positively weighted by cluster 5. Note this was Negatively weighted by cluster 7.)
PC 1 (Negatively weighted by cluster 5)
My intuition tells me that cluster 5 represents a lower-income and younger part of our population that tends to save money and isn't very religious.
Limitations
Before I conclude, allow me to go over some limitations that you should be aware of...
Conclusion
Our customers over-represent cluster 7 and under-represent cluster 5. By looking closer at the principal components that most strongly distinguish these clusters, I noted that cluster 7 generally comprises older & wealthier people who live in larger houses (1-2 family homes vs 6-10) compared to cluster 5. On the other hand, cluster 5 generally looks younger and less wealthy.
We can deduce that most of these customers are older and wealthier. Cluster 7 makes up about 32% of our customer population per our KMeans clustering. So if my intuition is correct, I recommend that this mail-order company target older and wealthier households in its marketing efforts if it wants to reinforce or improve its standing with this set of customers. Alternatively, this disparity may be caused by some other confounding factor, such as an existing bias that already favors this subset of our population. If that is the case, it also seems reasonable to target younger money-savers, since that subset of the population is under-represented among our customers.
The call to action is situational. What's more important is the awareness that the existing customer population is older and wealthier, and that there is an opportunity for growth in a younger and less-wealthy subset of the German population.
Below I'll address some of the things I see as weaknesses/points of improvement in this analysis that would be worth working on if I continued along with this project.
Pipeline
feature.