Project: Identify Customer Segments

This is a Udacity Unsupervised Learning Project using Real Demographic & Customer Data

Context

In this project, I used unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that was used was provided by Udacity's partners at Bertelsmann Arvato Analytics.

Importing Libraries
In [113]:
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
import time

import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
%matplotlib inline

from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans

Loading the Data

There are four files associated with this project (not including this one):

  • Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
  • Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
  • Data_Dictionary.md: Detailed information file about the features in the provided datasets.
  • AZDIAS_Feature_Summary.csv: Summary of feature attributes for demographics data; 85 features (rows) x 4 columns

We will examine, explore, clean, and cluster the Germany demographics data Udacity_AZDIAS_Subset.csv, then apply all the same transformations & cleaning to the customer demographics data Udacity_CUSTOMERS_Subset.csv. Each row represents a singple person, and includes information outside of individuals, including information about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those created clusters.

After, we will compare the clusters between the general Germany demographic data with the customer data to see whether there are any over-represented clusters in our customer data. This cluster can be interpretted as the company's core user base, and could be targetted for future marketing campaigns. Alternatively, under-represented clusters may be viewed as a potential opportunity of expansion for the mail order company.

Let's Begin...

Loading in our general demographics data & feature summary file...

In [2]:
# Setting display to show all rows
    # Apologies in advance for cluttering the notebook, but this makes for more convenient analysis
pd.options.display.max_rows = 260
In [3]:
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv',';')

# Load in the feature summary file.
    # Feature summary file contains very useful information about the data types for each feature
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv',';')

Taking a look at our general demographics...

In [4]:
azdias.shape
Out[4]:
(891221, 85)
In [5]:
azdias.head()
Out[5]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 -1 2 1 2.0 3 4 3 5 5 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 -1 1 2 5.0 1 5 2 5 4 5 ... 2.0 3.0 2.0 1.0 1.0 5.0 4.0 3.0 5.0 4.0
2 -1 3 2 3.0 1 4 1 2 3 5 ... 3.0 3.0 1.0 0.0 1.0 4.0 4.0 3.0 5.0 2.0
3 2 4 2 2.0 4 2 5 2 1 2 ... 2.0 2.0 2.0 0.0 1.0 3.0 4.0 2.0 3.0 3.0
4 -1 3 1 5.0 4 3 4 1 3 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 4.0 6.0 5.0

5 rows × 85 columns

Skimming the numerical data summary statistics...

In [6]:
azdias.describe().transpose()
Out[6]:
count mean std min 25% 50% 75% max
AGER_TYP 891221.0 -0.358435 1.198724 -1.0 -1.0 -1.0 -1.0 3.0
ALTERSKATEGORIE_GROB 891221.0 2.777398 1.068775 1.0 2.0 3.0 4.0 9.0
ANREDE_KZ 891221.0 1.522098 0.499512 1.0 1.0 2.0 2.0 2.0
CJT_GESAMTTYP 886367.0 3.632838 1.595021 1.0 2.0 4.0 5.0 6.0
FINANZ_MINIMALIST 891221.0 3.074528 1.321055 1.0 2.0 3.0 4.0 5.0
FINANZ_SPARER 891221.0 2.821039 1.464749 1.0 1.0 3.0 4.0 5.0
FINANZ_VORSORGER 891221.0 3.401106 1.322134 1.0 3.0 3.0 5.0 5.0
FINANZ_ANLEGER 891221.0 3.033328 1.529603 1.0 2.0 3.0 5.0 5.0
FINANZ_UNAUFFAELLIGER 891221.0 2.874167 1.486731 1.0 2.0 3.0 4.0 5.0
FINANZ_HAUSBAUER 891221.0 3.075121 1.353248 1.0 2.0 3.0 4.0 5.0
FINANZTYP 891221.0 3.790586 1.987876 1.0 2.0 4.0 6.0 6.0
GEBURTSJAHR 891221.0 1101.178533 976.583551 0.0 0.0 1943.0 1970.0 2017.0
GFK_URLAUBERTYP 886367.0 7.350304 3.525723 1.0 5.0 8.0 10.0 12.0
GREEN_AVANTGARDE 891221.0 0.196612 0.397437 0.0 0.0 0.0 0.0 1.0
HEALTH_TYP 891221.0 1.792102 1.269062 -1.0 1.0 2.0 3.0 3.0
LP_LEBENSPHASE_FEIN 886367.0 14.622637 12.616883 0.0 4.0 11.0 27.0 40.0
LP_LEBENSPHASE_GROB 886367.0 4.453621 3.855639 0.0 1.0 3.0 8.0 12.0
LP_FAMILIE_FEIN 886367.0 3.599574 3.926486 0.0 1.0 1.0 8.0 11.0
LP_FAMILIE_GROB 886367.0 2.185966 1.756537 0.0 1.0 1.0 4.0 5.0
LP_STATUS_FEIN 886367.0 4.791151 3.425305 1.0 2.0 4.0 9.0 10.0
LP_STATUS_GROB 886367.0 2.432575 1.474315 1.0 1.0 2.0 4.0 5.0
NATIONALITAET_KZ 891221.0 1.026827 0.586634 0.0 1.0 1.0 1.0 3.0
PRAEGENDE_JUGENDJAHRE 891221.0 8.154346 4.844532 0.0 5.0 8.0 14.0 15.0
RETOURTYP_BK_S 886367.0 3.419630 1.417741 1.0 2.0 3.0 5.0 5.0
SEMIO_SOZ 891221.0 3.945860 1.946564 1.0 2.0 4.0 6.0 7.0
SEMIO_FAM 891221.0 4.272729 1.915885 1.0 3.0 4.0 6.0 7.0
SEMIO_REL 891221.0 4.240609 2.007373 1.0 3.0 4.0 6.0 7.0
SEMIO_MAT 891221.0 4.001597 1.857540 1.0 2.0 4.0 5.0 7.0
SEMIO_VERT 891221.0 4.023709 2.077746 1.0 2.0 4.0 6.0 7.0
SEMIO_LUST 891221.0 4.359086 2.022829 1.0 2.0 5.0 6.0 7.0
SEMIO_ERL 891221.0 4.481405 1.807552 1.0 3.0 4.0 6.0 7.0
SEMIO_KULT 891221.0 4.025014 1.903816 1.0 3.0 4.0 5.0 7.0
SEMIO_RAT 891221.0 3.910139 1.580306 1.0 3.0 4.0 5.0 7.0
SEMIO_KRIT 891221.0 4.763223 1.830789 1.0 3.0 5.0 6.0 7.0
SEMIO_DOM 891221.0 4.667550 1.795712 1.0 3.0 5.0 6.0 7.0
SEMIO_KAEM 891221.0 4.445007 1.852412 1.0 3.0 5.0 6.0 7.0
SEMIO_PFLICHT 891221.0 4.256076 1.770137 1.0 3.0 4.0 6.0 7.0
SEMIO_TRADV 891221.0 3.661784 1.707637 1.0 2.0 3.0 5.0 7.0
SHOPPER_TYP 891221.0 1.266967 1.287435 -1.0 0.0 1.0 2.0 3.0
SOHO_KZ 817722.0 0.008423 0.091392 0.0 0.0 0.0 0.0 1.0
TITEL_KZ 817722.0 0.003483 0.084957 0.0 0.0 0.0 0.0 5.0
VERS_TYP 891221.0 1.197852 0.952532 -1.0 1.0 1.0 2.0 2.0
ZABEOTYP 891221.0 3.362438 1.352704 1.0 3.0 3.0 4.0 6.0
ALTER_HH 817722.0 10.864126 7.639683 0.0 0.0 13.0 17.0 21.0
ANZ_PERSONEN 817722.0 1.727637 1.155849 0.0 1.0 1.0 2.0 45.0
ANZ_TITEL 817722.0 0.004162 0.068855 0.0 0.0 0.0 0.0 6.0
HH_EINKOMMEN_SCORE 872873.0 4.207243 1.624057 1.0 3.0 5.0 6.0 6.0
KK_KUNDENTYP 306609.0 3.410640 1.628844 1.0 2.0 3.0 5.0 6.0
W_KEIT_KIND_HH 783619.0 3.933406 1.964701 0.0 2.0 4.0 6.0 6.0
WOHNDAUER_2008 817722.0 7.908791 1.923137 1.0 8.0 9.0 9.0 9.0
ANZ_HAUSHALTE_AKTIV 798073.0 8.287263 15.628087 0.0 1.0 4.0 9.0 595.0
ANZ_HH_TITEL 794213.0 0.040647 0.324028 0.0 0.0 0.0 0.0 23.0
GEBAEUDETYP 798073.0 2.798641 2.656713 1.0 1.0 1.0 3.0 8.0
KONSUMNAEHE 817252.0 3.018452 1.550312 1.0 2.0 3.0 4.0 7.0
MIN_GEBAEUDEJAHR 798073.0 1993.277011 3.332739 1985.0 1992.0 1992.0 1993.0 2016.0
WOHNLAGE 798073.0 4.052836 1.949539 0.0 3.0 3.0 5.0 8.0
KBA05_ANTG1 757897.0 1.494277 1.403961 0.0 0.0 1.0 3.0 4.0
KBA05_ANTG2 757897.0 1.265584 1.245178 0.0 0.0 1.0 2.0 4.0
KBA05_ANTG3 757897.0 0.624525 1.013443 0.0 0.0 0.0 1.0 3.0
KBA05_ANTG4 757897.0 0.305927 0.638725 0.0 0.0 0.0 0.0 2.0
KBA05_BAUMAX 757897.0 1.389552 1.779483 0.0 0.0 1.0 3.0 5.0
KBA05_GBZ 757897.0 3.158580 1.329537 1.0 2.0 3.0 4.0 5.0
BALLRAUM 797481.0 4.153043 2.183710 1.0 2.0 5.0 6.0 7.0
EWDICHTE 797481.0 3.939172 1.718996 1.0 2.0 4.0 6.0 6.0
INNENSTADT 797481.0 4.549491 2.028919 1.0 3.0 5.0 6.0 8.0
GEBAEUDETYP_RASTER 798066.0 3.738306 0.923193 1.0 3.0 4.0 4.0 5.0
KKK 770025.0 2.592991 1.119052 0.0 2.0 3.0 3.0 4.0
MOBI_REGIO 757897.0 2.963540 1.428882 1.0 2.0 3.0 4.0 6.0
ONLINE_AFFINITAET 886367.0 2.698691 1.521524 0.0 1.0 3.0 4.0 5.0
REGIOTYP 770025.0 4.257967 2.030385 0.0 3.0 5.0 6.0 7.0
KBA13_ANZAHL_PKW 785421.0 619.701439 340.034318 0.0 384.0 549.0 778.0 2300.0
PLZ8_ANTG1 774706.0 2.253330 0.972008 0.0 1.0 2.0 3.0 4.0
PLZ8_ANTG2 774706.0 2.801858 0.920309 0.0 2.0 3.0 3.0 4.0
PLZ8_ANTG3 774706.0 1.595426 0.986736 0.0 1.0 2.0 2.0 3.0
PLZ8_ANTG4 774706.0 0.699166 0.727137 0.0 0.0 1.0 1.0 2.0
PLZ8_BAUMAX 774706.0 1.943913 1.459654 1.0 1.0 1.0 3.0 5.0
PLZ8_HHZ 774706.0 3.612821 0.973967 1.0 3.0 4.0 4.0 5.0
PLZ8_GBZ 774706.0 3.381087 1.111598 1.0 3.0 3.0 4.0 5.0
ARBEIT 794005.0 3.167854 1.002376 1.0 3.0 3.0 4.0 9.0
ORTSGR_KLS9 794005.0 5.293002 2.303739 0.0 4.0 5.0 7.0 9.0
RELAT_AB 794005.0 3.072220 1.362980 1.0 2.0 3.0 4.0 9.0

Now taking a look at our feature summary file AZDIAS_Feature_Summary.csv...

In [7]:
feat_info.shape
Out[7]:
(85, 4)
In [8]:
# As you can see, we have convenient information on type of data and a code telling us what values denote NA values
feat_info.head()
Out[8]:
attribute information_level type missing_or_unknown
0 AGER_TYP person categorical [-1,0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9]
2 ANREDE_KZ person categorical [-1,0]
3 CJT_GESAMTTYP person categorical [0]
4 FINANZ_MINIMALIST person ordinal [-1]

Preprocessing

Assessing Missing Data

Converting Missing Value Codes to NaNs

The fourth column of the feature attributes summary (loaded in above as feat_info) tells us what codes indicate missing/NA values (e.g. [-1,0]). However, we have to convert these codes as a list because it's encoded as a string of a list.

When converting a DataFrame to csv, any columns containing Python list datatypes are converted to strings.

So, before we encode our NA values using the feature summary missing_or_unknown column, we need to first convert this column into a list.


Just for fun, let's see how many NA values are already recognized by Pandas.

In [10]:
# Total number of known missing values (Pandas recognizes as NA)
    # the first .sum() is used to sum across all columns. The second .sum() is used to aggregate all NAs
azdias.isna().sum().sum()
Out[10]:
4896838
In [11]:
# Using this function to convert the strings of a list to actual lists
def string_to_list(x):
    '''
    This function takes a string of a list and returns a list containing integers and strings.
    Note the list can only contain alphanumeric values for it to work.
    
    Input: String of a list (i.e. '[1,2,XX]')
    Returns: List data type
    '''
    new_list=[]
    # The below line removes the '[]' and splits on ',', creating a list of strings
    x = x[1:-1].split(',')
    # For each value in the list of strings, try turning it into an integer if possible and append to new_list
        # Otherwise, append it to new_list
    for i in x:
        try:
            new_list.append(int(i))
        except:
            new_list.append(i)
    return new_list
In [12]:
# Applying the function
feat_info['NA_tags'] = feat_info['missing_or_unknown'].apply(string_to_list)
In [162]:
# Example of an entry in this new column
type(feat_info['NA_tags'][0])
Out[162]:
list

As you can see, we've successfully converted these entries into Python lists

In [14]:
feat_info.head()
Out[14]:
attribute information_level type missing_or_unknown NA_tags
0 AGER_TYP person categorical [-1,0] [-1, 0]
1 ALTERSKATEGORIE_GROB person ordinal [-1,0,9] [-1, 0, 9]
2 ANREDE_KZ person categorical [-1,0] [-1, 0]
3 CJT_GESAMTTYP person categorical [0] [0]
4 FINANZ_MINIMALIST person ordinal [-1] [-1]

Now that that's out of the way, let's encode our NA values.

Inserting NA values into our DataFrame if they are tagged as NA in the "NA_tags" column

In [15]:
# Using a nearly identical copy of the feat_info DF with the attributes as the index
att_index = feat_info.set_index('attribute')
In [163]:
# This is an example of what I just created. You'll see what I did this in a moment
att_index.head(1)
Out[163]:
information_level type missing_or_unknown NA_tags
attribute
AGER_TYP person categorical [-1,0] [-1, 0]
In [16]:
# Creating a copy of the demographics data to work with to be the version with NA's inserted
na_azdias = azdias[:]

Below I'm looking through each column, using the att_index DataFrame as a reference to encode our NA values. I needed the features to be the index of the dataframe in order to use .loc to retrieve each column's respective list of NA codes.

In [17]:
# Looping through our column names
for column in na_azdias.columns:
    # For every column in our DF, if a value is in the 'NA_tags' list, we replace it inplace with np.NaN
    na_azdias[column].replace(att_index.loc[column].loc['NA_tags'],np.NaN,inplace=True)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py:6586: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [18]:
# Now we have about 3,500,000 more NA values as a result of this replacement
na_azdias.isna().sum().sum()
Out[18]:
8373929

Assessing Missing Data in Each Column

Below we will asses how much data is missing in each column and row to understand...

  1. How much data should we drop, if any?
  2. If we choose to drop some data, what are the implications of doing so?

Below is a simple textual representation of how many NA values are in each column. As you can see, columns starting with FINANZ... and SEMIO... don't have any missing values. This must be standard person-level data, mandatory for each observation.

In [19]:
na_azdias.isna().sum()
Out[19]:
AGER_TYP                 685843
ALTERSKATEGORIE_GROB       2881
ANREDE_KZ                     0
CJT_GESAMTTYP              4854
FINANZ_MINIMALIST             0
FINANZ_SPARER                 0
FINANZ_VORSORGER              0
FINANZ_ANLEGER                0
FINANZ_UNAUFFAELLIGER         0
FINANZ_HAUSBAUER              0
FINANZTYP                     0
GEBURTSJAHR              392318
GFK_URLAUBERTYP            4854
GREEN_AVANTGARDE              0
HEALTH_TYP               111196
LP_LEBENSPHASE_FEIN       97632
LP_LEBENSPHASE_GROB       94572
LP_FAMILIE_FEIN           77792
LP_FAMILIE_GROB           77792
LP_STATUS_FEIN             4854
LP_STATUS_GROB             4854
NATIONALITAET_KZ         108315
PRAEGENDE_JUGENDJAHRE    108164
RETOURTYP_BK_S             4854
SEMIO_SOZ                     0
SEMIO_FAM                     0
SEMIO_REL                     0
SEMIO_MAT                     0
SEMIO_VERT                    0
SEMIO_LUST                    0
SEMIO_ERL                     0
SEMIO_KULT                    0
SEMIO_RAT                     0
SEMIO_KRIT                    0
SEMIO_DOM                     0
SEMIO_KAEM                    0
SEMIO_PFLICHT                 0
SEMIO_TRADV                   0
SHOPPER_TYP              111196
SOHO_KZ                   73499
TITEL_KZ                 889061
VERS_TYP                 111196
ZABEOTYP                      0
ALTER_HH                 310267
ANZ_PERSONEN              73499
ANZ_TITEL                 73499
HH_EINKOMMEN_SCORE        18348
KK_KUNDENTYP             584612
W_KEIT_KIND_HH           147988
WOHNDAUER_2008            73499
ANZ_HAUSHALTE_AKTIV       99611
ANZ_HH_TITEL              97008
GEBAEUDETYP               93148
KONSUMNAEHE               73969
MIN_GEBAEUDEJAHR          93148
OST_WEST_KZ               93148
WOHNLAGE                  93148
CAMEO_DEUG_2015           99352
CAMEO_DEU_2015            99352
CAMEO_INTL_2015           99352
KBA05_ANTG1              133324
KBA05_ANTG2              133324
KBA05_ANTG3              133324
KBA05_ANTG4              133324
KBA05_BAUMAX             476524
KBA05_GBZ                133324
BALLRAUM                  93740
EWDICHTE                  93740
INNENSTADT                93740
GEBAEUDETYP_RASTER        93155
KKK                      158064
MOBI_REGIO               133324
ONLINE_AFFINITAET          4854
REGIOTYP                 158064
KBA13_ANZAHL_PKW         105800
PLZ8_ANTG1               116515
PLZ8_ANTG2               116515
PLZ8_ANTG3               116515
PLZ8_ANTG4               116515
PLZ8_BAUMAX              116515
PLZ8_HHZ                 116515
PLZ8_GBZ                 116515
ARBEIT                    97375
ORTSGR_KLS9               97274
RELAT_AB                  97375
dtype: int64

Below I'm making a temporary DataFrame to help me understand the question: how much data is missing in each column?

In [20]:
# Creating a copy of the DF containing the counts of NAs with the names of the columns in the original DF as a column
na_azdias_df = na_azdias.isna().sum().reset_index()
# Filtering for only columns with more than 50000 NA values to declutter the plot
temp_view = na_azdias_df[na_azdias_df[0]>50000]
In [164]:
# FYI - 'index' is column name and '0' is count of NA values
temp_view.head(1)
Out[164]:
index 0
0 AGER_TYP 685843

Below is a horizontal bar plot showing columns and their NA counts as their bar width. Here I've noted that the following columns have much more NA values than the others...

  • KBA05_BAUMAX
  • KK_KUNDENTYP
  • ALTER_HH
  • TITEL_KZ
  • GEBURTSJAHR
  • AGER_TYP
In [182]:
# Sorting by count of NA values
temp_view.sort_values(by=0,ascending=True,inplace=True)
In [181]:
plt.figure(figsize=(6,12))
barh = plt.barh(temp_view['index'],temp_view[0],color = 'darkturquoise', alpha = 0.85)
for i in range(7):    
    barh[-i].set_color('darkred')
plt.tight_layout()

So now I'll remove those 6 outlier columns...

In [22]:
outlier_columns = ['KBA05_BAUMAX','KK_KUNDENTYP','ALTER_HH','TITEL_KZ','GEBURTSJAHR','AGER_TYP']
In [23]:
na_azdias.drop(labels=outlier_columns,axis=1,inplace=True)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3940: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Printing the shape shows that the no. columns decreased from 85 -> 79

In [24]:
na_azdias.shape
Out[24]:
(891221, 79)

Recap

Per my analysis of missing data above, I've found that the entire dataset contained 4,896,838 known missing values (recognized by Pandas) before I manually encoded NA values and contained 8,912,221 known missing values after I encoded NA values per the feat_info file and dropped "outlier" columns with many NA values.

I plotted the count of NA values per column and noted that the following 6 columns were outliers in terms of number of missing values, and were subsequently removed.

  • KBA05_BAUMAX
  • KK_KUNDENTYP
  • ALTER_HH
  • TITEL_KZ
  • GEBURTSJAHR
  • AGER_TYP

Assessing Missing Data in Each Row

Below I will asses how much data is missing in each row, and evaluate whether I should drop any rows.

This analysis is going to be a bit more work that assessing missing data in each column, because dropping rows has more implications than dropping columns. By dropping rows, we're dropping entire observations (one observation being one person), which can skew/misrepresent the demographic data.

Spoiler alert: I will be dropping rows for the sake of compatibility with Principal Component Analysis and KMeans Clustering. Let's understand the consequences of dropping these rows.

In [25]:
# Creating a temp column that contains the count of NA values in each row
na_azdias['no_na_row'] = na_azdias.isna().sum(axis=1)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:2: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

This histogram plots a distribution of non-zero NA values. Looks like most of our rows with NA values are missing between 1-10 of their values.

In [26]:
plt.hist(na_azdias[na_azdias['no_na_row']>0]['no_na_row'],bins=10)
Out[26]:
(array([96415., 60529., 12134.,  5139.,   480.,   155., 14016.,  5031.,
        28038., 46075.]),
 array([ 1. ,  5.8, 10.6, 15.4, 20.2, 25. , 29.8, 34.6, 39.4, 44.2, 49. ]),
 <a list of 10 Patch objects>)

Because we have a significant number of rows with 30-50 NA values, I'll split the data into two subsets: one with rows with less than or equal to 30 NA values na_azdias_b30 and one with rows greater than 30 NA values na_azdias_g30.

In [120]:
na_azdias_b30 = na_azdias[na_azdias['no_na_row']<=30]
na_azdias_g30 = na_azdias[na_azdias['no_na_row']>30]

Below is a quick comparison of the proportion of NA values in each row.

There's an oddly large amount of rows with 49 and 53 missing values.

In [28]:
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(13,4))
sns.countplot(x='no_na_row',data=na_azdias_b30,ax=ax1,orient='v')
sns.countplot(x='no_na_row',data=na_azdias_g30,ax=ax2,orient='v')
plt.tight_layout()

Now I'll compare the distributions of particular columns' values between the two subsets of data.

Note this is different than what we did in the cell above, which is plot a distribution of rows for each number of NA values.

In [29]:
def compare_columns(df1,df2,column,df1_desc=None,df2_desc=None):
    '''
    Takes two DataFrames and a column name as input and plots two countplots of the given column for both DataFrames.
    '''
    # Taking max value count between the 2 columns to set as ylim for both plots
    df1_max = df1[column].value_counts().iloc[0]
    df2_max = df2[column].value_counts().iloc[0]
    if df1_max > df2_max:
        top = df1_max * 1.05
    else:
        top = df2_max * 1.05
    
    # Plotting
    fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
    sns.countplot(df1[column],ax=ax1)
    ax1.set_ylabel('Count')
    ax1.set_ylim(top=top)
    ax1.set_title(df1_desc)
    sns.countplot(df2[column],ax=ax2)
    ax2.set_ylabel('Count')
    ax2.set_ylim(top=top)
    ax2.set_title(df2_desc)
    plt.tight_layout()

Below are plots for six columns that have no missing values. Plots on the left are from the subset of data with less than 30 NA values in each row and plots on the right are from the subset of data with more than 30 NA values in each row.

In [31]:
for n in ['ANREDE_KZ','FINANZ_MINIMALIST','GREEN_AVANTGARDE','SEMIO_SOZ','ZABEOTYP']:
    compare_columns(na_azdias_b30,na_azdias_g30,n,"Below 30 NA Values","Above 30 NA Values")

Clearly, the subset of data with more than 30 NA values in each row (plots on the right) is much smaller than the other subset. Additionally, the distribution of values between the two subsets of data are different for all of the columns above wish the exception of column "ANREDE_KZ" (Gender).

Note that we're looking at columns with zero NA values and comparing rows with many NA values against rows with little NA values.

This suggests a possible bias introduced in the way the data was collected and/or entered. Whatever the case may be, these rows with many NA values should be investigated, since they should hypothetically be similarly distributed relative to the other subset of data with less than 30 NA values.

Additional supplementary visualization. Plots Distribution plots for first 10 columns on top of one another.

This does't add much more information. Feel free to skim over this viz.

In [154]:
plt.figure(figsize=(15,12))
for i, col in enumerate(na_azdias_b30.columns[:10]):
    plt.subplot(5, 2, i+1)
    sns.distplot(na_azdias_b30[col][na_azdias_b30[col].notnull()], label='<=30 NA',hist=False)
    sns.distplot(na_azdias_g30[col][na_azdias_g30[col].notnull()], label='>30 NA',hist=False)
    plt.title('Distribution for column: {}'.format(col))
    plt.legend();
    
plt.tight_layout()
plt.show()

Kolmogorov-Smirnov Test

Below I'll use the KS test to compare each column's distributions from our two datasets (rows with <30 NA values & rows with >30 NA values).

Purpose: To evaluate how different these two groups of data are distributed. If we see that these distributions are very different (low p-values as a result of our KS-Test), then we have statistical evidence that we can use to evaluate whether we should drop the >30 NA values-per row subset.

TL;DR? Scroll down below for my interpretation of the KS test and a quick note on how it works.

In [124]:
def KS_col_test(df1,df2):
    '''
    Used to evaluate whether distributions of data between two DataFrames with the same columns are different.
    
    Input: 2 DataFrames with identical columns.
    Output: D (KS Stat | Max Distance between the two cumulative probability functions for each Dataset) and p-values associated to each D 
    '''
    # Lists for the KS Stat (D) and p-values to return
    stats = []
    p_values = []
    
    # Grabbing an array of column names
    cols = df1.columns.values
    
    for col in cols:
        try:
            D,p = ks_2samp(df1[col],df2[col])
            stats.append(D)
            p_values.append(p)
        except:
            stats.append(np.NaN)
            p_values.append(np.NaN)
    
    return stats,p_values
In [125]:
# Calculating stats and p-values for each column
stats,p_values = KS_col_test(na_azdias_b30,na_azdias_g30)
In [130]:
# Creating a DataFrame to store the results. 80 rows (1/feature)
ks_df = pd.DataFrame(na_azdias_b30.columns.values,columns=['feature'])
ks_df['stat'] = stats
ks_df['p_value'] = p_values
In [145]:
# Plotting results. See below for my interpretation
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,4))

sns.distplot(ks_df[ks_df['stat'].notnull()]['stat'],ax=ax1,kde=False)
ax1.set_title('Distribution of D KS Stats')
ax1.set_xlabel('D | KS Stat')
ax1.set_ylim(top=70)

sns.distplot(ks_df[ks_df['p_value'].notnull()]['p_value'],ax=ax2,kde=False)
ax2.set_title('Distribution of P-Values')
ax2.set_xlabel('P Values')

plt.tight_layout()
plt.show()

So let's recap what we just did. We have two DataFrames:

  1. na_azdiasb30 DataFrame with <= 30 NA values per row
  2. na_azdiasg30 DataFrame with > 30 NA values per row

We used SciPy's implementation of the 2-way Kolmogorov-Smirnov test on each of the 80 columns to evaluate the similarity of each column's distributions.


Brief Aside on the KS-Test

The KS test measures similarity between two samples to check whether these two samples came from the same distribution. (In our case, we're checking if the values in each of our DataFrames for each column are distributed similarly to evaluate the consequence of us dropping one of these DataFrames; namely, the DataFrame with more NA values). The KS test creates empirical cumulative distribution functions (one for each sample/group of observations) and compares them. We take the difference of these two distributions, and the maximum difference (D) is our D/KS Stat. Then, we measure the p-value (Null Hypothesis: these two samples came from the same disttribution). If our p-value is low, then we are more confident that the samples come from different distributions.


KS Test Conclusion

So we see that our p-values are extremely low, telling us that our distributions between the two DataFrames (30 or less vals and greater than 30 NA vals) are highly likely to come from different distributions. I interpret this as a warning that if we drop our na_azdias_g30 DataFrame, we're going to lose valuable information that could affect our algorithm.


Consecutive NA Values

Below I investigated the remaining columns containing NA values and found that many of the rows containing NA values have consecutive NA values in related columns, such as columns that begin with "KBA05..." and "PLZ8...", suggesting that these missing values are likely not due to the chance of random erroneous data entry, and are rather caused by some external factor, such as the process/methodology of how the data was collected.

TL;DR? Key Finding: It looks like most rows I got back have many NA values in the same row. For example, the cell immediately below returns rows with NA values in column 'KBA05ANTG3', and by scrolling right to the end of the DataFrame, we see that the columns that start with 'PLZ8__' have many rows that have consecutive NA values.

In [32]:
na_azdias[pd.isna(na_azdias['KBA05_ANTG3'])]
Out[32]:
ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER FINANZTYP ... PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB no_na_row
0 2.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
11 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
12 3.0 1 6.0 5 3 4 2 4 1 3 ... 3.0 1.0 0.0 1.0 5.0 5.0 3.0 6.0 4.0 6
13 1.0 2 5.0 1 4 3 5 5 2 1 ... 1.0 1.0 1.0 1.0 3.0 3.0 3.0 6.0 4.0 8
14 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
17 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
24 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
26 3.0 1 3.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN 4.0 3.0 5.0 19
30 3.0 2 3.0 4 3 4 4 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
35 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
46 NaN 2 3.0 2 4 3 5 5 4 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 40
48 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
53 2.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
54 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
61 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
62 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
69 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
75 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
76 2.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
81 3.0 2 5.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 45
83 3.0 2 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
88 1.0 2 4.0 3 3 1 4 5 1 3 ... 2.0 1.0 0.0 1.0 5.0 5.0 3.0 5.0 2.0 8
90 1.0 2 4.0 2 5 3 5 5 4 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
97 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
99 2.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
103 2.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
106 2.0 2 1.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
108 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
109 3.0 2 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
112 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
133 1.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
137 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
138 3.0 1 NaN 5 3 4 2 4 1 3 ... NaN NaN NaN NaN NaN NaN 4.0 6.0 3.0 29
142 2.0 2 6.0 3 3 2 2 5 1 3 ... NaN NaN NaN NaN NaN NaN 2.0 6.0 2.0 17
143 3.0 2 4.0 3 3 4 3 3 1 1 ... NaN NaN NaN NaN NaN NaN 2.0 4.0 1.0 19
144 1.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
153 3.0 1 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
154 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
157 2.0 1 6.0 3 4 2 5 5 1 3 ... 2.0 1.0 0.0 1.0 4.0 4.0 4.0 5.0 3.0 8
163 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
891002 3.0 1 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891012 3.0 1 1.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891014 2.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891021 2.0 2 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891027 2.0 1 1.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891044 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891049 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891055 3.0 2 6.0 4 2 4 4 3 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 33
891061 3.0 1 3.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891062 3.0 2 3.0 1 5 2 3 5 2 1 ... NaN NaN NaN NaN NaN NaN 3.0 7.0 3.0 19
891063 1.0 2 2.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891064 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891069 4.0 2 6.0 3 1 5 1 3 5 2 ... NaN NaN NaN NaN NaN NaN 3.0 7.0 3.0 15
891072 2.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891074 1.0 1 4.0 2 5 3 5 5 2 1 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
891076 1.0 2 6.0 1 4 3 5 5 3 1 ... 3.0 1.0 0.0 1.0 4.0 5.0 1.0 2.0 1.0 8
891083 3.0 1 1.0 5 2 3 4 5 1 2 ... 2.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0 6
891092 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891095 1.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891100 4.0 2 3.0 5 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
891115 3.0 2 6.0 4 2 3 4 3 1 3 ... 1.0 0.0 0.0 1.0 1.0 3.0 4.0 1.0 5.0 6
891117 4.0 1 1.0 4 1 5 1 3 2 5 ... NaN NaN NaN NaN NaN NaN 3.0 5.0 3.0 19
891119 1.0 2 3.0 1 5 1 5 5 4 1 ... NaN NaN NaN NaN NaN NaN 3.0 8.0 3.0 17
891120 1.0 1 4.0 3 4 3 5 5 1 1 ... NaN NaN NaN NaN NaN NaN 2.0 3.0 1.0 17
891130 3.0 1 5.0 5 3 4 3 4 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
891135 3.0 2 2.0 3 2 4 2 3 2 2 ... 3.0 3.0 2.0 5.0 4.0 2.0 4.0 8.0 5.0 6
891137 4.0 2 4.0 4 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
891140 4.0 2 1.0 4 1 5 2 3 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
891159 2.0 1 1.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891162 4.0 2 4.0 4 1 5 1 2 5 2 ... 4.0 2.0 1.0 2.0 4.0 3.0 3.0 4.0 4.0 6
891164 3.0 2 4.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 43
891169 3.0 1 1.0 5 2 4 2 3 1 3 ... NaN NaN NaN NaN NaN NaN 1.0 2.0 1.0 18
891170 3.0 1 4.0 4 5 2 4 5 1 3 ... NaN NaN NaN NaN NaN NaN 3.0 4.0 1.0 19
891171 3.0 2 5.0 4 2 3 5 5 1 3 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 34
891172 1.0 2 1.0 1 4 3 5 5 2 1 ... NaN NaN NaN NaN NaN NaN 3.0 4.0 4.0 16
891173 4.0 1 1.0 4 1 5 1 2 2 5 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 19
891175 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891185 3.0 1 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891187 3.0 2 6.0 3 4 3 5 5 3 4 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN 47
891203 4.0 2 1.0 4 1 5 1 3 4 5 ... 3.0 2.0 1.0 1.0 3.0 3.0 4.0 8.0 5.0 14

133324 rows × 80 columns


Recap

I split the data into two subsets: rows containing 0-30 NA values and rows containing more than 30 NA values. When looking at the distribution of data for columns that do not contain NA values, I noticed that the distribution of data between these two subsets look very different, which is contrary to what you'd expect. When you split data and look at distributions on columns with no NA values, you'd expect the distributions of these columns to look the same. This suggested to me that there is some sort of bias introduced in the way the data was collected or entered for rows with many NA values.

Additionally, I looked at these rows with many NA values, and it looked like they had many consecutive NA values in related rows, such as rows that begin with "PLZ8..." or "KBA05...". This further suggests that these missing values are probably not caused by chance alone.

Let's see how much of our data has more than 30 missing values.

In [34]:
prop_g_30 = len(na_azdias_g30)/len(na_azdias)
print(f'{prop_g_30*100:.2f}% of our data has more than 30 missing values in a row.')
10.45% of our data has more than 30 missing values in a row.

10.46% of data is a huge chunk that could be important for our ML algorithm. We also saw through the plots and the KS Test that the distribution of values between the two subsets are very different. However, considering the amount of missing values in a row (30 missing values out of 79 total features) that we'd have to impute, I'm dropping these rows (also for simplicity's sake...and laziness) and moving forward with a DF called azdias_1.

In [35]:
na_azdias_b30.drop('no_na_row',inplace=True,axis=1)
azdias_1 = na_azdias_b30[:]

Selecting and Re-Encode Features

In this section, we will examine the different types of data that we have and evaluate next steps to prepare the data for our PCA and KMeans clustering algorithms.

Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.

  • For numeric and interval data, these features can be kept without changes.
  • Most of the variables in the dataset are ordinal in nature. While ordinal values may technically be non-linear in spacing, make the simplifying assumption that the ordinal variables can be treated as being interval in nature (that is, kept without any changes).
  • Special handling may be necessary for the remaining two variable types: categorical, and 'mixed'.

In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.

Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!

Let's use the feat_info file to get counts of how much of each data type we have.

In [184]:
# Removing the 6 outlier columns that were removed from feat_info
feat_info = feat_info.set_index('attribute').drop(outlier_columns).reset_index()
In [37]:
feat_info['type'].value_counts()
Out[37]:
ordinal        49
categorical    18
mixed           6
numeric         6
Name: type, dtype: int64
In [38]:
# Creating a dictionary where the 4 data types above are the keys and the corresponding DF columns are the values
    # This will help with breaking up each group of features for analysis
df_dtype = {}
dtypes = ['ordinal','categorical','numeric','mixed']
for d in dtypes:
    df_dtype[d]=feat_info[feat_info['type']==d]['attribute']

Re-Encoding

Our approach will be the following...

  • Categorical...
    • Ordinal Categoricals: Convert to numeric ordinals if necessary and keep the same for simplicity
    • Multi-Level Categoricals: One-Hot encode these values
    • Binary Categoricals: Convert to numeric binaries ([0,1])
  • Numeric: Keep the same
  • Mixed: Investigate first, then decide on an approach
In [39]:
def check_unique_vals(df):
    for col in df.columns:
        values = list(df[col].unique())
        print(f'{col}: {values}')
Ordinal
In [40]:
check_unique_vals(azdias_1[df_dtype['ordinal']])
ALTERSKATEGORIE_GROB: [1.0, 3.0, 4.0, 2.0, nan]
FINANZ_MINIMALIST: [1, 4, 3, 2, 5]
FINANZ_SPARER: [5, 4, 2, 3, 1]
FINANZ_VORSORGER: [2, 1, 5, 4, 3]
FINANZ_ANLEGER: [5, 2, 1, 4, 3]
FINANZ_UNAUFFAELLIGER: [4, 3, 1, 2, 5]
FINANZ_HAUSBAUER: [5, 2, 3, 4, 1]
HEALTH_TYP: [3.0, 2.0, 1.0, nan]
RETOURTYP_BK_S: [1.0, 3.0, 2.0, 5.0, 4.0, nan]
SEMIO_SOZ: [5, 4, 6, 2, 7, 3, 1]
SEMIO_FAM: [4, 1, 5, 7, 2, 6, 3]
SEMIO_REL: [4, 3, 2, 7, 5, 1, 6]
SEMIO_MAT: [3, 1, 2, 4, 7, 5, 6]
SEMIO_VERT: [1, 4, 7, 2, 6, 5, 3]
SEMIO_LUST: [2, 4, 6, 7, 3, 1, 5]
SEMIO_ERL: [2, 6, 7, 4, 5, 1, 3]
SEMIO_KULT: [3, 4, 6, 5, 7, 1, 2]
SEMIO_RAT: [6, 4, 3, 2, 7, 5, 1]
SEMIO_KRIT: [4, 7, 3, 1, 5, 6, 2]
SEMIO_DOM: [7, 4, 2, 1, 5, 6, 3]
SEMIO_KAEM: [4, 7, 5, 2, 3, 6, 1]
SEMIO_PFLICHT: [7, 3, 4, 5, 1, 6, 2]
SEMIO_TRADV: [6, 3, 4, 2, 7, 5, 1]
HH_EINKOMMEN_SCORE: [6.0, 4.0, 1.0, 5.0, 3.0, 2.0]
W_KEIT_KIND_HH: [3.0, nan, 2.0, 6.0, 5.0, 4.0, 1.0]
WOHNDAUER_2008: [9.0, 8.0, 3.0, 4.0, 5.0, 6.0, 2.0, 7.0, 1.0]
KONSUMNAEHE: [1.0, 5.0, 4.0, 3.0, 2.0, 6.0, 7.0, nan]
KBA05_ANTG1: [0.0, 1.0, 4.0, 2.0, 3.0, nan]
KBA05_ANTG2: [0.0, 3.0, 1.0, 4.0, 2.0, nan]
KBA05_ANTG3: [0.0, 1.0, nan, 2.0, 3.0]
KBA05_ANTG4: [2.0, 0.0, nan, 1.0]
KBA05_GBZ: [1.0, 3.0, 4.0, 5.0, 2.0, nan]
BALLRAUM: [6.0, 2.0, 4.0, 3.0, 7.0, 1.0, 5.0, nan]
EWDICHTE: [3.0, 4.0, 2.0, 5.0, 6.0, 1.0, nan]
INNENSTADT: [8.0, 4.0, 6.0, 1.0, 7.0, 3.0, 2.0, 5.0, nan]
GEBAEUDETYP_RASTER: [3.0, 4.0, 5.0, 1.0, 2.0, nan]
KKK: [2.0, nan, 3.0, 4.0, 1.0]
MOBI_REGIO: [1.0, 3.0, 4.0, 5.0, nan, 2.0, 6.0]
ONLINE_AFFINITAET: [3.0, 2.0, 1.0, 5.0, 4.0, 0.0, nan]
REGIOTYP: [3.0, 2.0, nan, 5.0, 1.0, 7.0, 6.0, 4.0]
PLZ8_ANTG1: [2.0, 3.0, nan, 1.0, 4.0, 0.0]
PLZ8_ANTG2: [3.0, 2.0, 4.0, 1.0, nan, 0.0]
PLZ8_ANTG3: [2.0, 1.0, nan, 3.0, 0.0]
PLZ8_ANTG4: [1.0, 0.0, nan, 2.0]
PLZ8_HHZ: [5.0, 4.0, 3.0, nan, 2.0, 1.0]
PLZ8_GBZ: [4.0, 3.0, 5.0, nan, 2.0, 1.0]
ARBEIT: [3.0, 2.0, 4.0, 1.0, 5.0, nan]
ORTSGR_KLS9: [5.0, 3.0, 6.0, 4.0, 8.0, 2.0, 7.0, 9.0, 1.0, nan]
RELAT_AB: [4.0, 2.0, 3.0, 5.0, 1.0, nan]

The ordinals look okay (no strings; all of them are numeric).

Categorical
In [41]:
# Columns that are tagged as "categorical" are either binary or multi_level
# Printing and tracking binary and multi-level categorical values 
binary_cats = []
multi_level_cats = []
for col in azdias_1[df_dtype['categorical']]:
    values = azdias_1[df_dtype['categorical']][col].nunique()
    if values == 2:
        binary_cats.append(col)
    else:
        multi_level_cats.append(col)
    print(f'{col}: {values}')
ANREDE_KZ: 2
CJT_GESAMTTYP: 6
FINANZTYP: 6
GFK_URLAUBERTYP: 12
GREEN_AVANTGARDE: 2
LP_FAMILIE_FEIN: 11
LP_FAMILIE_GROB: 5
LP_STATUS_FEIN: 10
LP_STATUS_GROB: 5
NATIONALITAET_KZ: 3
SHOPPER_TYP: 4
SOHO_KZ: 2
VERS_TYP: 2
ZABEOTYP: 6
GEBAEUDETYP: 7
OST_WEST_KZ: 2
CAMEO_DEUG_2015: 9
CAMEO_DEU_2015: 44
In [42]:
# Examining our binary categorical values
azdias_1[binary_cats].head()
Out[42]:
ANREDE_KZ GREEN_AVANTGARDE SOHO_KZ VERS_TYP OST_WEST_KZ
1 2 0 1.0 2.0 W
2 2 1 0.0 1.0 W
3 2 0 0.0 1.0 W
4 1 0 0.0 2.0 W
5 2 0 0.0 2.0 W

So it looks like we have to encode "OST_WEST_KZ" to be numerical.

In [43]:
# Looking at the values in this column
azdias_1['OST_WEST_KZ'].value_counts()
Out[43]:
W    629525
O    168542
Name: OST_WEST_KZ, dtype: int64
In [44]:
# I'll one-hot encode this column with the rest of our multi-level categorical variables
multi_level_cats.append('OST_WEST_KZ')

Now examining our multi-level categorical variables.

In [46]:
# These have to be one-hot encoded
azdias_1[multi_level_cats].head()
Out[46]:
CJT_GESAMTTYP FINANZTYP GFK_URLAUBERTYP LP_FAMILIE_FEIN LP_FAMILIE_GROB LP_STATUS_FEIN LP_STATUS_GROB NATIONALITAET_KZ SHOPPER_TYP ZABEOTYP GEBAEUDETYP CAMEO_DEUG_2015 CAMEO_DEU_2015 OST_WEST_KZ
1 5.0 1 10.0 5.0 3.0 2.0 1.0 1.0 3.0 5 8.0 8 8A W
2 3.0 1 10.0 1.0 1.0 3.0 2.0 1.0 2.0 5 1.0 4 4C W
3 2.0 6 1.0 NaN NaN 9.0 4.0 1.0 1.0 3 1.0 2 2A W
4 5.0 5 5.0 10.0 5.0 3.0 2.0 1.0 2.0 4 1.0 6 6B W
5 2.0 2 1.0 1.0 1.0 4.0 2.0 1.0 0.0 4 1.0 8 8C W
Numeric
In [47]:
azdias_1[df_dtype['numeric']].describe()
Out[47]:
ANZ_PERSONEN ANZ_TITEL ANZ_HAUSHALTE_AKTIV ANZ_HH_TITEL MIN_GEBAEUDEJAHR KBA13_ANZAHL_PKW
count 798067.000000 798067.000000 791605.000000 794208.000000 798067.000000 785420.000000
mean 1.728842 0.004161 8.354955 0.040647 1993.276914 619.701328
std 1.156529 0.068887 15.673773 0.324029 3.332524 340.034520
min 0.000000 0.000000 1.000000 0.000000 1985.000000 0.000000
25% 1.000000 0.000000 2.000000 0.000000 1992.000000 384.000000
50% 1.000000 0.000000 4.000000 0.000000 1992.000000 549.000000
75% 2.000000 0.000000 10.000000 0.000000 1993.000000 778.000000
max 45.000000 6.000000 595.000000 23.000000 2016.000000 2300.000000

Looks like all of the values in the numeric columns are indeed numeric. These look good.

Mixed
In [48]:
azdias_1[df_dtype['mixed']].head()
Out[48]:
LP_LEBENSPHASE_FEIN LP_LEBENSPHASE_GROB PRAEGENDE_JUGENDJAHRE WOHNLAGE CAMEO_INTL_2015 PLZ8_BAUMAX
1 21.0 6.0 14.0 4.0 51 1.0
2 3.0 1.0 15.0 2.0 24 1.0
3 NaN NaN 8.0 7.0 12 1.0
4 32.0 10.0 8.0 3.0 43 2.0
5 8.0 2.0 3.0 7.0 54 1.0
In [49]:
# Printing the number of unique values within each of the mixed columns
for column in azdias_1[df_dtype['mixed']].columns:
    print(column,azdias_1[df_dtype['mixed']][column].nunique())
LP_LEBENSPHASE_FEIN 40
LP_LEBENSPHASE_GROB 12
PRAEGENDE_JUGENDJAHRE 15
WOHNLAGE 8
CAMEO_INTL_2015 21
PLZ8_BAUMAX 5

Taking a look at the Data_Dictionary.md, I noticed these are all categorical variables. I'm treating them as multi-level categorical variables (i.e. one-hot encoding them) with the exception of "PRAEGENDE_JUGENDJAHRE" and "CAMEO_INTL_2015". I'll deal with these two below in the next section.

In [50]:
for col in df_dtype['mixed']:
    if col not in ['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE']:
        multi_level_cats.append(col)

Recap

So we've taken a look at the ordinal, categorical, numeric, and mixed variable columns. We're fine with the ordinal and numeric columns. The categorical column can be further split between binary and multi-level columns, and we've already taken care of the binary categorical columns.

So now we must encode the multi-level categorical and mixed variable columes.

In [51]:
azdias_1 = pd.get_dummies(azdias_1,columns=multi_level_cats)
In [52]:
azdias_1.shape
Out[52]:
(798067, 256)

In sum, after removing outlier columns and rows with over 30 NA values, I kept ordinal and numeric columns the same, and one-hot encoded multi-level categorical, one binary categorical, and some mixed variable columns.

Engineering Mixed-Type Features

Below I'm working on converting the "PRAEGENDE_JUGENDJAHRE" (combines information on three dimensions: generation by decade, movement, and nation) and "CAMEO_INTL_2015" (combines information on two axes: wealth and life stage).

I will break up the two-digit codes by their 'tens'-place and 'ones'-place digits into two new ordinal variables (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).


Below is the mapping for "PRAEGENDE_JUGENDJAHRE" from the Data_Dictionary.md

  • 1: 40s - war years (Mainstream, E+W)
  • 2: 40s - reconstruction years (Avantgarde, E+W)
  • 3: 50s - economic miracle (Mainstream, E+W)
  • 4: 50s - milk bar / Individualisation (Avantgarde, E+W)
  • 5: 60s - economic miracle (Mainstream, E+W)
  • 6: 60s - generation 68 / student protestors (Avantgarde, W)
  • 7: 60s - opponents to the building of the Wall (Avantgarde, E)
  • 8: 70s - family orientation (Mainstream, E+W)
  • 9: 70s - peace movement (Avantgarde, E+W)
  • 10: 80s - Generation Golf (Mainstream, W)
  • 11: 80s - ecological awareness (Avantgarde, W)
  • 12: 80s - FDJ / communist party youth organisation (Mainstream, E)
  • 13: 80s - Swords into ploughshares (Avantgarde, E)
  • 14: 90s - digital media kids (Mainstream, E+W)
  • 15: 90s - ecological awareness (Avantgarde, E+W)
In [53]:
movement = []
decade = []
for val in azdias_1['PRAEGENDE_JUGENDJAHRE']:
    # Mapping for movement
    if val in [1,3,5,8,10,12,14]:
        movement.append(0) 
    elif val in [2,4,6,7,9,11,13,15]:
        movement.append(1)
    else:
        movement.append(np.NaN)
    # Mapping for decade  
    if val in [1,2]:
        decade.append(40)
    elif val in [3,4]:
        decade.append(50)
    elif val in [5,6,7]:
        decade.append(60)
    elif val in [8,9]:
        decade.append(70)
    elif val in [10,11,12]:
        decade.append(80)
    elif val in [13,14,15]:
        decade.append(90)
    else:
        decade.append(np.NaN)
In [54]:
# Quick sanity check
len(movement)==len(decade)==azdias_1.shape[0]
Out[54]:
True
In [55]:
movement = pd.Series(movement)
decade = pd.Series(decade)
In [56]:
azdias_1['movement']=movement
azdias_1['decade']=decade

Below is a mapping for "CAMEO_INTL_2015" from the Data_Dictionary.md

  • 11: Wealthy Households - Pre-Family Couples & Singles
  • 12: Wealthy Households - Young Couples With Children
  • 13: Wealthy Households - Families With School Age Children
  • 14: Wealthy Households - Older Families & Mature Couples
  • 15: Wealthy Households - Elders In Retirement
  • 21: Prosperous Households - Pre-Family Couples & Singles
  • 22: Prosperous Households - Young Couples With Children
  • 23: Prosperous Households - Families With School Age Children
  • 24: Prosperous Households - Older Families & Mature Couples
  • 25: Prosperous Households - Elders In Retirement
  • 31: Comfortable Households - Pre-Family Couples & Singles
  • 32: Comfortable Households - Young Couples With Children
  • 33: Comfortable Households - Families With School Age Children
  • 34: Comfortable Households - Older Families & Mature Couples
  • 35: Comfortable Households - Elders In Retirement
  • 41: Less Affluent Households - Pre-Family Couples & Singles
  • 42: Less Affluent Households - Young Couples With Children
  • 43: Less Affluent Households - Families With School Age Children
  • 44: Less Affluent Households - Older Families & Mature Couples
  • 45: Less Affluent Households - Elders In Retirement
  • 51: Poorer Households - Pre-Family Couples & Singles
  • 52: Poorer Households - Young Couples With Children
  • 53: Poorer Households - Families With School Age Children
  • 54: Poorer Households - Older Families & Mature Couples
  • 55: Poorer Households - Elders In Retirement
In [58]:
cameo_intl = []
for val in azdias_1['CAMEO_INTL_2015']:
    try:
        val = int(val)
        if 10 < val < 19:
            cameo_intl.append(5)
        elif 20 < val < 29:
            cameo_intl.append(4)
        elif 30 < val < 39:
            cameo_intl.append(3)
        elif 40 < val < 49:
            cameo_intl.append(2)
        elif 50 < val < 59:
            cameo_intl.append(1)
        else:
            cameo_intl.append(np.nan)
    except:
        cameo_intl.append(np.nan)
In [59]:
cameo_intl = pd.Series(cameo_intl)
azdias_1['cameo_intl'] = cameo_intl
azdias_1.head()
Out[59]:
ALTERSKATEGORIE_GROB ANREDE_KZ FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER GREEN_AVANTGARDE HEALTH_TYP ... WOHNLAGE_7.0 WOHNLAGE_8.0 PLZ8_BAUMAX_1.0 PLZ8_BAUMAX_2.0 PLZ8_BAUMAX_3.0 PLZ8_BAUMAX_4.0 PLZ8_BAUMAX_5.0 movement decade cameo_intl
1 1.0 2 1 5 2 5 4 5 0 3.0 ... 0 0 1 0 0 0 0 1.0 90.0 4.0
2 3.0 2 1 4 1 2 3 5 1 3.0 ... 0 0 1 0 0 0 0 0.0 70.0 5.0
3 4.0 2 4 2 5 2 1 2 0 2.0 ... 1 0 1 0 0 0 0 0.0 70.0 2.0
4 3.0 1 4 3 4 1 3 2 0 3.0 ... 0 0 0 1 0 0 0 0.0 50.0 1.0
5 1.0 2 3 1 5 2 2 5 0 3.0 ... 1 0 1 0 0 0 0 0.0 80.0 4.0

5 rows × 259 columns

In [60]:
azdias_1.drop(['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE'],axis=1,inplace=True)

Recap

I performed the following steps above...

PRAEGENDE_JUGENDJAHRE I replaced "PRAEGENDE_JUGENDJAHRE" with two new columns: movement and decade according to the mapping from the Data_Dictionary.md.

CAMEO_INTL_2015 I replaced "CAMEO_INTL_2015" with a column called cameo_intl according to the mapping from the Data_Dictionary.md.

Creating a Cleaning Function

Below is a huge function that replicates all the steps I performed above on any given DataFrame with the same structure as the demographics data.

In [62]:
def clean_data(df):
    """
    Perform feature trimming, re-encoding, and engineering for demographics
    data
    
    INPUT: Demographics DataFrame
    OUTPUT: Trimmed and cleaned demographics DataFrame
    """
    
    feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv',';')
    feat_info['NA_tags'] = feat_info['missing_or_unknown'].apply(string_to_list)
    att_index = feat_info.set_index('attribute')
    # Looping through our column names
    for column in df.columns:
        # For every column in our DF, if a value is in the 'NA_tags' list, we replace it inplace with np.NaN
        df[column].replace(att_index.loc[column].loc['NA_tags'],np.NaN,inplace=True)    
    att_index = feat_info.set_index('attribute')
    
    # remove outlier columns
    outlier_columns = ['KBA05_BAUMAX',
                   'KK_KUNDENTYP',
                   'ALTER_HH',
                   'TITEL_KZ',
                   'GEBURTSJAHR',
                   'AGER_TYP']
    df.drop(labels=outlier_columns,axis=1,inplace=True)
    # remove rows with > 30 NA values
    df['no_na_row'] = df.isna().sum(axis=1)
    df = df[df['no_na_row']<=30]
    df.drop('no_na_row',inplace=True,axis=1)
    
    # select, re-encode, and engineer column values.
    feat_info = feat_info.set_index('attribute').drop(outlier_columns).reset_index()
    df_dtype = {}
    dtypes = ['ordinal','categorical','numeric','mixed']
    for d in dtypes:
        df_dtype[d]=feat_info[feat_info['type']==d]['attribute']    
    
    # Separating binary from multi-level categorical columns
    binary_cats = []
    multi_level_cats = []
    for col in df[df_dtype['categorical']]:
        values = df[df_dtype['categorical']][col].nunique()
        if values == 2:
            binary_cats.append(col)
        else:
            multi_level_cats.append(col)
    
    # Including OST_WEST_KZ in the list to be one-hot encoded
    multi_level_cats.append('OST_WEST_KZ')
    for col in df_dtype['mixed']:
        if col not in ['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE']:
            multi_level_cats.append(col)
    
    # One-hot encoding
    df = pd.get_dummies(df,columns=multi_level_cats)
    
    # Dealing with the 2 particular mixed vars...
    movement = []
    decade = []
    for val in df['PRAEGENDE_JUGENDJAHRE']:
        # Mapping for movement
        if val in [1,3,5,8,10,12,14]:
            movement.append(0) 
        elif val in [2,4,6,7,9,11,13,15]:
            movement.append(1)
        else:
            movement.append(np.NaN)
        # Mapping for decade  
        if val in [1,2]:
            decade.append(40)
        elif val in [3,4]:
            decade.append(50)
        elif val in [5,6,7]:
            decade.append(60)
        elif val in [8,9]:
            decade.append(70)
        elif val in [10,11,12]:
            decade.append(80)
        elif val in [13,14,15]:
            decade.append(90)
        else:
            decade.append(np.NaN)
    movement = pd.Series(movement)
    decade = pd.Series(decade)
    df['movement']=movement
    df['decade']=decade
            
    cameo_intl = []
    for val in df['CAMEO_INTL_2015']:
        try:
            val = int(val)
            if 10 < val < 19:
                cameo_intl.append(5)
            elif 20 < val < 29:
                cameo_intl.append(4)
            elif 30 < val < 39:
                cameo_intl.append(3)
            elif 40 < val < 49:
                cameo_intl.append(2)
            elif 50 < val < 59:
                cameo_intl.append(1)
            else:
                cameo_intl.append(np.nan)
        except:
            cameo_intl.append(np.nan)
    cameo_intl = pd.Series(cameo_intl)
    df['cameo_intl'] = cameo_intl
    df.drop(['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE'],axis=1,inplace=True)            
            
    return df
    
    

Testing the cleaning function on a copy of our data...

In [63]:
test_func_az = azdias[:]
In [64]:
test_df = clean_data(test_func_az)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:28: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [65]:
azdias_1.shape==test_df.shape
Out[65]:
True

Feature Transformation

Feature Scaling

So we're going to use principal component analysis (PCA) to reduce the number of dimensions (features) of our data. However, in order to do so, our data needs to be standardized (mean=0 and bonus if std dev = 0). We also need to take care of all the NA values that we kept.

My approach is the following...

  1. Impute NA values using each column's median
  2. Use the StandardScaler to standardize our data.
In [66]:
# Imputing missing values with the median
med_imp = SimpleImputer(strategy='median')
azdias_2 = med_imp.fit_transform(azdias_1)
In [67]:
# Standardizing
scaler = StandardScaler()
azdias_2 = scaler.fit_transform(azdias_2)

For simplicity's sake, I chose to impute the data with the median value in each column. Then, I've standardized our data using a StandardScaler so PCA calculates the principal components using similar values and so K-Means clustering groups similar values based on their relative spread rather than absolute value.

Performing Dimensionality Reduction | PCA

Principal Component Analysis, crudely simplified, is transforming your data to make it more succint. It takes columns that are very related to each other and re-engineers new columns that better describe the relationships of our columns with one another. These new columns are Principal Components, and we can use any given number of these components for analysis.

The more principal components (PC) we use, the more information we have, but computing KMeans clustering is slower, and we increase risk of overfitting. So ideally, we want to choose a number of PC's that best represent our dataset without losing too much information.

I'd like to try several n component values for PCA Dimensionality Reduction. I'll start with 40 to see how much variance is explained. Ideally, I'd like to get to the number of components with 85% variance explained.

Quick Aside: PC's describe relationships between the columns of our data. The most relevant (most useful) PC's are said to explain the most variance. So, we want to choose a number of the PC's that explain the most variance while reducing the number of features we use.

This variance can be calculated by calling the explained_variance_ratio_ attribute from a fitted pca object.

In [68]:
azdias_2.shape
Out[68]:
(798067, 257)

PCA Test Run for 40.

In [69]:
pca = PCA(40)
X_pca = pca.fit_transform(azdias_2)
In [70]:
pca.explained_variance_ratio_.sum()
Out[70]:
0.4945068139984037

So about 50% is explained with 40 components. Below I'll loop through several values for No. Principal Components until we get 85% variance explained.

In [71]:
n_components = [80,110,150,180,210]
for n in n_components:
    pca = PCA(n)
    X_pca = pca.fit_transform(azdias_2)
    if pca.explained_variance_ratio_.sum() >= .85:
        break
In [72]:
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
0.9306053772038634
150

Looks good, but 150 components feels a bit high. So I'm going to re-run this loop in smaller increments to just get around 85%.

In [73]:
n_components = [120,125,130,135,140]
for n in n_components:
    pca = PCA(n)
    X_pca = pca.fit_transform(azdias_2)
    if pca.explained_variance_ratio_.sum() >= .85:
        break
In [74]:
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
0.8648442564392442
130

Now plotting total explained variance for our principal components.

In [75]:
ticks = np.arange(pca.n_components)
values = pca.explained_variance_ratio_

plt.figure(figsize = (13,5))
ax = plt.subplot(111)
cumvals = np.cumsum(values)
ax.bar(ticks,values)
ax.plot(ticks,cumvals)

ax.xaxis.set_tick_params(width=1)
ax.yaxis.set_tick_params(width=2,length=12)

ax.set_xlabel("No. Principal Components")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component\nScree Plot')
plt.show()

Recap

So above we performed PCA many times until we found a number of principal components that explains 85% variance. The scree plot above plots principal components on the x-axis, variance explained on the y-axis, bars representing amount of variance explained, and a line that cumulatively sums up the variance explained across our principal components. I've decided to go with 130 principal components because it explains a good amount of variance (again, 85%) while greatly reducing number of dimensions from 257 to 130.

Interpreting Principal Components

Below I'll have a swing at interpretting what some principal components can tell us about our data.

In [76]:
def plot_pca_dim(pca,feat_names,dim,features):
    '''
    For a given fitted PCA object, the given component, and a given number of features, this plots
    the variance that the particular component captures.
    '''
    fmt = lambda x: "{:.2f}%".format(x)
    sort_abs = lambda x: abs(x[0])

    names_to_weights = sorted(zip(pca.components_[dim],feat_names),key=sort_abs,reverse=True)
    names_to_weights = np.array(names_to_weights)

    nm_to_wt_df = pd.DataFrame(names_to_weights,columns=['var','feature'])
    nm_to_wt_df['var'] = nm_to_wt_df['var'].astype(float)

    return px.bar(nm_to_wt_df[:features],x='feature',y='var',
           color='var',color_continuous_scale=px.colors.sequential.Sunsetdark,
           labels = {'var': 'Variance (%)','feature':'Feature Names'},
           height=400,width=600)
In [77]:
feat_names = azdias_1.columns.values

Component 1 (The component that explains the most variance)

Skip ahead below all the plots for my interpretation of the PC's

In [78]:
plot_pca_dim(pca,feat_names,0,10)
Field Name Interpretation Pos/Neg
MOBI_REGIO Region: Movement Patterns Neg
LP_STATUS_GROB_1.0 Social Status Pos
KBA05_ANTG1 No. 1-2 family houses in microcell Neg
PLZ8_ANTG3 No. 6-10 family houses in region Pos
PLZ8_ANTG1 No. 1-2 family houses in region Neg
FINANZ_MINIMALIST Low Financial Interest Neg
KBA05_GBZ No. Buildings in microcell Neg
HH_EINKOMMEN_SCORE Est. Household net income Pos
PLZ8_ANTG4 No. 10+ family homes in region Pos
PLZ8_BAUMAX_1.0 Most common building type Neg

Observations: Component 1

  • Positive correlation between social status, household net income, and relatively larger family homes.
  • Contrast this to a negative correlation with movement patterns smaller family homes, and low financial interest.

Component 2

In [79]:
plot_pca_dim(pca,feat_names,1,10)
Field Name Interpretation Pos/Neg
ALTERSKATEGORIE_GROB Estimated age Pos
FINANZ_SPARER Money Saver Neg
FINANZ_VORSORGER Being financially prepared Pos
FINANZ_UNAUFFAELLIGER Financially inconspicuous Neg
SEMIO_REL Personality: Religious Neg
SEMIO_TRADV Personality: Traditional-minded Neg
SEMIO_PFLICHT Personality: Dufitul Neg
FINANZ_ANLEGER Investor Neg
ZABEOTYP_3 Fair energy consumer Pos
SEMIO_ERL Personality: Event-oriented Pos

Observations: Component 2

  • Positive correlation between age, being financially prepared, being a fair energy consumer, and being event-oriented.
  • Negative correlations with money saving, being financially inconspicuous, religions, traditional, dutiful, and an investor.

Component 3

In [80]:
plot_pca_dim(pca,feat_names,2,10)
Field Name Interpretation Pos/Neg
ANREDE_KZ Gender (1-M, 2-F) Neg
SEMIO_VERT Personality: Dreamful Pos
SEMIO_KAEM Personality: Combative Attitude Neg
SEMIO_DOM Personality: Dominant-minded Neg
SEMIO_KRIT Personality: Critical-minded Neg
SEMIO_FAM Personality: Family-minded Pos
SEMIO_SOZ Personality: Socially-minded Pos
SEMIO_KULT Personality: Cultural-minded Pos
SEMIO_ERL Personality: Event-oriented Neg
FINANZ_ANLEGER Investor Neg

Observations: Component 3

  • Positive correlation between dreamful personality, family mindedness, social-mindedness, and cultural mindedness
  • Negative correlations with gender, combative-minded, dominant-minded, critical-minded, and investing.

Discussion: Interpretting Principal Components

In [81]:
pca.explained_variance_ratio_[:3].sum()
Out[81]:
0.13896057942704898

Cumulatively, they only explain approx. 13.9% of variance from our features, so they don't explain very much variance.

PCA Summary Observations

  1. The first component seems financially-driven. It most strongly correlates social status, household income, and larger homes (positive) against smaller family homes and low financial interest.
  2. The second component seems personality-driven. It shows positive correlation between age and financial preparedness, and shows negative correlation against similar, more conservative personality types (financially inconspicuous, religions, traditional, dutiful, and investing).
  3. The third component also seems personality-driven. It shows positive correlation between dreamful, family, social, and cultural mindedness. It also shows negative correlations between gender (note value for male is lower than female = 0 vs 1), combatative, dominant, and critical mindedness.

Clustering

Applying Clustering to General Population

Below I will use sklearn's KMeans clustering algorithm to identify different segments of the demographics data.

One thing that's very tricky when it comes to using KMeans is the fact that you have to choose the number of clusters that the algorithm will find. This can be difficult since we don't know how many clusters exist!

So, here's what we're going to do...

  1. Run KMeans on a few different values of k-clusters 1.5. Calculate Average Distance (Euclidean Distance) for each value of k-clusters
  2. Analyze how much reduced distance we get from each additional cluster
  3. Pick the value for k-clusters immediately before the value where we see a clear drop in reduction of average distance

Intuitively, this makes sense. If we have 5 clear-cut clusters in our population, adding clusters 1-5 will significantly reduce average distance between clusters. Once we add a 6th cluster, that 6th cluster will situate itself close to one of the already-defined 5 clusters, which will reduce average distance from each cluster, but not by that much.

In [185]:
scores = []

First trying 3-10 clusters to see how KMeans performs.

In [381]:
# Over a number of different cluster counts...
clusters = np.arange(3,11)
for i in clusters:
    start=time.time()
    model = KMeans(i)
    model.fit(X_pca)
    # Score is sum of squared distances to a sample's assigned centroid
    scores.append(abs(model.score(X_pca)))
    end = time.time()
    print(f'Running Score for {i} Clusters: {abs(model.score(X_pca)):.3f}')
    print(f'Time taken: {end-start:.2f} seconds.')
Running Score for 3 Clusters: 161597023.276
Time taken: 73.19 seconds.
Running Score for 4 Clusters: 158004058.975
Time taken: 85.80 seconds.
Running Score for 5 Clusters: 155379812.132
Time taken: 109.79 seconds.
Running Score for 6 Clusters: 153092423.744
Time taken: 110.24 seconds.
Running Score for 7 Clusters: 150649488.845
Time taken: 144.54 seconds.
Running Score for 8 Clusters: 148389831.896
Time taken: 148.45 seconds.
Running Score for 9 Clusters: 145435904.875
Time taken: 152.77 seconds.
Running Score for 10 Clusters: 143467713.793
Time taken: 180.77 seconds.

Plotting the scores below...

In [382]:
plt.plot(clusters,scores,marker="o")
plt.show()

The scores I'm plotting are the sum of the squared distances from each point to its corresponding centroid. The trend looks linear. A possible sign of having too many clusters is diminished reduction in these distances for each additional cluster. This is known as the "elbow" method. Given the trend looks linear, I'm going to try a few more clusters below.

In [383]:
clusters = np.arange(11,16)
for i in clusters:
    start=time.time()
    model = KMeans(i)
    model.fit(X_pca)
    # Score is sum of squared distances to a sample's assigned centroid
    scores.append(abs(model.score(X_pca)))
    end = time.time()
    print(f'Running Score for {i} Clusters: {abs(model.score(X_pca)):.3f}')
    print(f'Time taken: {end-start:.2f} seconds.')
Running Score for 11 Clusters: 142549766.933
Time taken: 190.92 seconds.
Running Score for 12 Clusters: 139869652.130
Time taken: 199.48 seconds.
Running Score for 13 Clusters: 138564471.742
Time taken: 212.23 seconds.
Running Score for 14 Clusters: 136929997.809
Time taken: 205.36 seconds.
Running Score for 15 Clusters: 136202833.209
Time taken: 214.22 seconds.

Replotting the scores.

In [385]:
plt.plot(np.arange(3,16),scores,marker="o")
plt.show()

Although it's not very clear in the plot, it looks like our distance reduction began diminishing from 13 to 14 clusters, from reading out the running score printed 3 cells above. I'll stick with 13 clusters and proceed with a final model fitted on the PCA'd version of our demographics data.

In [82]:
final_model = KMeans(13)
final_model.fit(X_pca)
Out[82]:
KMeans(algorithm='auto', copy_x=True, init='k-means++', max_iter=300,
       n_clusters=13, n_init=10, n_jobs=None, precompute_distances='auto',
       random_state=None, tol=0.0001, verbose=0)
In [83]:
gen_preds = final_model.predict(X_pca)

Recap

I fit and scored KMeans on our PCA-transformed data for n-centroids 3-15. Score is calculated as the distance between samples and their corresponding centroid. As we add centroids to the algorithm, the score will decrease (i.e. distance between points) less and less. 10 and 13 centroids seem like good candidates for our algorithm, but I chose to proceed with 13 clusters because it appears sum of squared distance was significantly reduced from 11-13 centroids.

Applying All Steps to the Customer Data

Below I'm cleaning, scaling, PCA-ing, and clustering our customer data using all the objects and steps we used on the demographics data.

In [84]:
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv',';')
In [85]:
customers.head()
Out[85]:
AGER_TYP ALTERSKATEGORIE_GROB ANREDE_KZ CJT_GESAMTTYP FINANZ_MINIMALIST FINANZ_SPARER FINANZ_VORSORGER FINANZ_ANLEGER FINANZ_UNAUFFAELLIGER FINANZ_HAUSBAUER ... PLZ8_ANTG1 PLZ8_ANTG2 PLZ8_ANTG3 PLZ8_ANTG4 PLZ8_BAUMAX PLZ8_HHZ PLZ8_GBZ ARBEIT ORTSGR_KLS9 RELAT_AB
0 2 4 1 5.0 5 1 5 1 2 2 ... 3.0 3.0 1.0 0.0 1.0 5.0 5.0 1.0 2.0 1.0
1 -1 4 1 NaN 5 1 5 1 3 2 ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 -1 4 2 2.0 5 1 5 1 4 4 ... 2.0 3.0 3.0 1.0 3.0 3.0 2.0 3.0 5.0 3.0
3 1 4 1 2.0 5 1 5 2 1 2 ... 3.0 2.0 1.0 0.0 1.0 3.0 4.0 1.0 3.0 1.0
4 -1 3 1 6.0 3 1 4 4 5 2 ... 2.0 4.0 2.0 1.0 2.0 3.0 3.0 3.0 5.0 1.0

5 rows × 85 columns

Cleaning

In [86]:
cleaned_cust = clean_data(customers)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py:3940: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

In [87]:
cleaned_cust.shape
Out[87]:
(141725, 256)

The imputer and scaler aren't going to work because "GEBAEUDETYP_5.0" isn't in our cleaned_cust DataFrame. This wasn't created because the "GEBAEUDETYP" column didn't have any values of 5 in our customer set since we one-hot encoded it using pd.get_dummies instead of OneHotEncoder. Adding in a column of only zeros for this missing column is an equivalent operation to not having any values of 5 for this feature.

In [88]:
for i in azdias_1.columns:
    if i not in cleaned_cust.columns:
        print(i)
GEBAEUDETYP_5.0
In [89]:
GEBAEUDETYP_5 = pd.Series(np.zeros(141725))
cleaned_cust['GEBAEUDETYP_5.0'] = GEBAEUDETYP_5
In [90]:
# Reordering columns to align with original DataFrame
cols = azdias_1.columns.tolist()
cleaned_cust = cleaned_cust[cols]
In [91]:
cleaned_cust.shape
Out[91]:
(141725, 257)

Imputing & Standardization

In [92]:
imputed = med_imp.transform(cleaned_cust)
standardized = scaler.transform(imputed)
In [93]:
standardized.shape
Out[93]:
(141725, 257)

PCA

In [94]:
X_pca_cust = pca.transform(standardized)
In [95]:
X_pca_cust.shape
Out[95]:
(141725, 130)

Clustering

In [96]:
preds = final_model.predict(X_pca_cust)

Comparing Customer Data to Demographics Data

Now that we have both our demographics data and our customer data clustered let's compare them!

In [97]:
df_preds_cust = pd.DataFrame(preds,columns=['cust_preds'])
df_preds_gen =  pd.DataFrame(gen_preds,columns=['gen_preds'])
In [98]:
prop_cust = df_preds_cust['cust_preds'].value_counts().sort_index()/ \
                sum(df_preds_cust['cust_preds'].value_counts())
prop_gen = df_preds_gen['gen_preds'].value_counts().sort_index()/ \
                sum(df_preds_gen['gen_preds'].value_counts())
In [99]:
fig,ax = plt.subplots(figsize=(15,5))
width = 0.35

cust = ax.bar(prop_cust.index-.51*width, prop_cust, width, color='darkolivegreen')
gen = ax.bar(prop_gen.index+.51*width, prop_gen, width, color='steelblue')


ax.set_ylabel('Proportion (%)')
ax.set_xlabel('Labels')
ax.set_title('General Population Clusters Vs Customer Clusters')
ax.set_xticks(prop_cust.index)
ax.set_ylim(top=max(prop_cust.append(prop_gen))*1.15)
ax.legend(('Customer','General Pop'))

def autolabel(rects):
    """
    Attach a text label above each bar displaying its height
    """
    for rect in rects:
        height = rect.get_height()
        ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
                '%.2f' % float(height),
                ha='center', va='bottom')

autolabel(cust)
autolabel(gen)

The two relative distributions are slightly different. The most significant differences between them are the folliwing...

  • General Pop
    • 7x label 0
    • 6x label 3
    • 15x label 5
  • Customer
    • 2x label 8
    • 3x label 7

So we can see that some clusters are over-represented in our customer data and some are under-represented in our customer data. Let's take a look at what characteristics describe these clusters to help us understand who our customers are.

Customer Over-represented: Cluster 7

Below I'm extracting cluster 7's top 5 most significant principal components.

In [102]:
centroid = final_model.cluster_centers_[7]
centroid_df = pd.DataFrame(centroid,columns=['pca_weights'])

# Sorting the centroid in descending absolute value order
    # Note indexes correspond to principal components
centroid_df.reindex(centroid_df.abs().sort_values('pca_weights',ascending=False).index).head()
Out[102]:
pca_weights
0 -4.998779
3 2.523828
5 -1.964884
15 -1.604093
6 -1.428213

Relatively speaking, it looks like PC 0 was highly negatively influencing this cluster, and PC 3 was highly positively influencing this cluster. Let's take a look at these components...

In [161]:
plot_pca_dim(pca,feat_names,0,10)
In [158]:
# Provided by Udacity Reviewer - Inserted for future reference
    # This function is a (less pretty but) more concise way of showing the features that are
        # Most negatively correlated with the cluster
        # Most positively correlated with the cluster
    # However, it is computed in a slightly different way 
        # (by inverse transforming our PCA data back into orig features)
def plot_scaled_comparison(df_sample, kmeans, cluster):
    '''
    Takes performs the PCA inverse_transform of a selected cluster from our trained model to obtain 
        the scaled feature weights. This helps us interpret who belongs in a given cluster.
    '''
    X = pd.DataFrame.from_dict(dict(zip(df_sample.columns,
        pca.inverse_transform(kmeans.cluster_centers_[cluster]))), orient='index').rename(
        columns={0: 'feature_values'}).sort_values('feature_values', ascending=False)
        X['feature_values_abs'] = abs(X['feature_values'])
        pd.concat((X['feature_values'][:10], X['feature_values'][-10:]), axis=0).plot(kind='barh');
In [160]:
plot_scaled_comparison(azdias_1, final_model, 5)
In [104]:
plot_pca_dim(pca,feat_names,3,5)

PC 0 (Negatively weighted by cluster 7)

  • Positively explains low-income earners (LP_STATUS_GROB) & no. 6-10 family owned houses (PLZ8_ANTG3)
  • Negatively explains movement patterns (MOBI_REGIO), no. 1-2 family owned houses (KBA05_ANTG1 & PLZ8_ANTG1)

PC 3 (Positively weighted by cluster 7)

  • Positively explains no. adults in household (ANZ_PERSONEN) & multi-person households (LP_FAMILIE_GROB)
  • Negatively explains single households (LP_FAMILIE_FEIN & LP_FAMILIE_GROB) and likelihood of children (1- most likely & 6 - very unlikely)

My intuition suggests to me that cluster 7 reflects older and wealthier households that likely live in bigger and wealthier areas in Germany.

Customer Under-represented: Cluster 5

Below I'm extracting cluster 5's top 5 most significant PCAs.

In [108]:
centroid = final_model.cluster_centers_[5]
centroid_df = pd.DataFrame(centroid,columns=['pca_weights'])
In [109]:
centroid_df.reindex(centroid_df.abs().sort_values(by='pca_weights',ascending=False).index).head()
Out[109]:
pca_weights
0 5.144780
1 -2.480333
7 -1.065089
8 -0.874759
3 -0.851585

Relatively speaking, it looks like PCA 0 was highly positively influencing this cluster, and PCA 1 was highly negatively influencing this cluster. Let's take a look at these components...

In [111]:
plot_pca_dim(pca,feat_names,0,5)
In [112]:
plot_pca_dim(pca,feat_names,1,5)

PC 0 (Positively weighted by cluster 5. Note this was Negatively weighted by cluster 7.)

  • Positively explains low-income earners (LP_STATUS_GROB) & no. 6-10 family owned houses (PLZ8_ANTG3)
  • Negatively explains movement patterns (MOBI_REGIO), no. 1-2 family owned houses (KBA05_ANTG1 & PLZ8_ANTG1)

PC 1 (Negatively weighted by cluster 5)

  • Positively explains estimated age (ALTERSKATEGORIE_GROB) and people that prioritize financial preparedness (FINANZ_VORSORGER)
  • Negatively explains money savers (FINANZ_SPARER), people financially inconspicuous (FINANZ_UNAUFFAELLIGER), and religious mindedness (SEMIO_REL)

My intuition tells me that cluster 5 represents a lower-income and younger part of our population that tends to save money and aren't very religious.

Conclusions

Limitation
Before I conclude, allow me to go over some limitations that you should be aware of...

  • I only looked at 2 clusters (one negative and one positive), so my scope is limited to analyzing two classes of customers
  • For each cluster, I only considered their top 2 most weighted principal components, and used their explained variance ratios to describe the customers that are represented by this data. More information about these clusters can be drawn by taking a closer look at the principal components.

Conclusion
These customers highly represented cluster 7 and under-represented cluster 5. By looking closer at the principle components that most greatly distinguish these clusters, I noted that cluster 7 generally looks older & wealthier people that live in larger houses (1-2 family homes vs 6-10) than cluster 5. On the other hand, cluster 5 generally looks younger and less wealthy.

We can deduce that most of these customers are older and wealthier. Cluster 7 makes up about 32% of our customer population per our KMeans clustering. So if my intuition is correct, I recommend that this mail-order company target older and wealthier households in their marketing efforts if they want to enforce or improve their standing with this set of customers. On another hand, this disparity may be caused by some other confounding factor, such as existing bias that already favors this subset of our population. If that is the case, it also seems reasonable to target younger money-savers since that subset of our population is underrepresented in our customer population.

The call to action is situational. What's more important is the awareness that the existing customer population is older and wealthier, and that there is an opportunity for growth in a younger and less-wealthy subset of the German population.

Opportunities for Improvement

Below I'll address some of the things I see as weaknesses/points of improvement in this analysis that would be worth working on if I continued along with this project.

  1. I dropped all rows with more than 30 missing values without considering alternative ways of imputing/inferring these values.
    • Sometimes, we can infer missing values based off of other columns. I didn't look into this.
    • Additionally, there are more advanced methods of imputement that are currently way over my head.
  2. When preparing our data, I didn't check for outliers for numeric continuous data. This might have improved my results.
  3. The Cleaning function could have been much more clean, efficient, and effective using sklearn's Pipeline feature.
  4. As stated previously, it would be great to continue to analyze other clusters and extrapolate more features that characterize these clusters.