This is a Udacity Unsupervised Learning Project using Real Demographic & Customer Data
In this project, I used unsupervised learning techniques to identify segments of the population that form the core customer base for a mail-order sales company in Germany. These segments can then be used to direct marketing campaigns towards audiences that will have the highest expected rate of returns. The data that was used was provided by Udacity's partners at Bertelsmann Arvato Analytics.
import numpy as np
import pandas as pd
from scipy.stats import ks_2samp
import time
import matplotlib.pyplot as plt
import seaborn as sns
import plotly_express as px
%matplotlib inline
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
There are four files associated with this project (not including this one):
- Udacity_AZDIAS_Subset.csv: Demographics data for the general population of Germany; 891211 persons (rows) x 85 features (columns).
- Udacity_CUSTOMERS_Subset.csv: Demographics data for customers of a mail-order company; 191652 persons (rows) x 85 features (columns).
- Data_Dictionary.md: Detailed information about the features in the provided datasets.
- AZDIAS_Feature_Summary.csv: Summary of feature attributes for the demographics data; 85 features (rows) x 4 columns.

We will examine, explore, clean, and cluster the German demographics data Udacity_AZDIAS_Subset.csv, then apply all the same transformations and cleaning to the customer demographics data Udacity_CUSTOMERS_Subset.csv. Each row represents a single person and includes information beyond the individual, such as details about their household, building, and neighborhood. I will use this information to cluster the general population into groups with similar demographic properties. Then, I will see how the people in the customers dataset fit into those clusters.
Afterwards, we will compare the cluster proportions between the general German demographic data and the customer data to see whether any clusters are over-represented in our customer data. Such clusters can be interpreted as the company's core user base and could be targeted in future marketing campaigns. Conversely, under-represented clusters may be viewed as a potential opportunity for expansion for the mail-order company.
Let's Begin...
Loading in our general demographics data & feature summary file...
# Setting display to show all rows
# Apologies in advance for cluttering the notebook, but this makes for more convenient analysis
pd.options.display.max_rows = 260
# Load in the general demographics data.
azdias = pd.read_csv('Udacity_AZDIAS_Subset.csv',';')
# Load in the feature summary file.
# Feature summary file contains very useful information about the data types for each feature
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv',';')
Taking a look at our general demographics...
azdias.shape
azdias.head()
Skimming the numerical data summary statistics...
azdias.describe().transpose()
Now taking a look at our feature summary file AZDIAS_Feature_Summary.csv
...
feat_info.shape
# As you can see, we have convenient information on each feature's data type and a code telling us which values denote NA
feat_info.head()
Converting Missing Value Codes to NaNs
The fourth column of the feature attributes summary (loaded in above as feat_info) tells us which codes indicate missing/unknown values (e.g. [-1,0]). However, these codes are stored as the string representation of a list, because when a DataFrame is written to csv, any columns containing Python list values are converted to strings. So, before we encode our NA values using the feature summary's missing_or_unknown column, we first need to convert this column back into actual lists.
Just for fun, let's see how many NA values are already recognized by Pandas.
# Total number of known missing values (Pandas recognizes as NA)
# the first .sum() is used to sum across all columns. The second .sum() is used to aggregate all NAs
azdias.isna().sum().sum()
# Using this function to convert the strings of a list to actual lists
def string_to_list(x):
'''
This function takes a string of a list and returns a list containing integers and strings.
Note the list can only contain alphanumeric values for it to work.
Input: String of a list (i.e. '[1,2,XX]')
Returns: List data type
'''
new_list=[]
# The below line removes the '[]' and splits on ',', creating a list of strings
x = x[1:-1].split(',')
# For each value in the list of strings, try turning it into an integer if possible and append to new_list
# Otherwise, append it to new_list
for i in x:
try:
new_list.append(int(i))
except:
new_list.append(i)
return new_list
# Applying the function
feat_info['NA_tags'] = feat_info['missing_or_unknown'].apply(string_to_list)
# Example of an entry in this new column
type(feat_info['NA_tags'][0])
As you can see, we've successfully converted these entries into Python lists
feat_info.head()
Now that that's out of the way, let's encode our NA values.
Inserting NA values into our DataFrame if they are tagged as NA in the "NA_tags" column
# Using a nearly identical copy of the feat_info DF with the attributes as the index
att_index = feat_info.set_index('attribute')
# This is an example of what I just created. You'll see why I did this in a moment
att_index.head(1)
# Creating a copy of the demographics data to work with to be the version with NA's inserted
na_azdias = azdias[:]
Below I'm looping through each column, using the att_index DataFrame as a reference to encode our NA values. I needed the features to be the index of the DataFrame in order to use .loc to retrieve each column's respective list of NA codes.
# Looping through our column names
for column in na_azdias.columns:
# For every column in our DF, if a value is in the 'NA_tags' list, we replace it inplace with np.NaN
na_azdias[column].replace(att_index.loc[column].loc['NA_tags'],np.NaN,inplace=True)
# Now we have about 3,500,000 more NA values as a result of this replacement
na_azdias.isna().sum().sum()
Below we will assess how much data is missing in each column and in each row.
Below is a simple textual representation of how many NA values are in each column. As you can see, columns starting with FINANZ... and SEMIO... don't have any missing values. This must be standard person-level data, mandatory for each observation.
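For reference, a minimal sketch of how to pull out just the fully observed columns programmatically:
# Columns with zero missing values after NA encoding
complete_cols = na_azdias.columns[na_azdias.isna().sum() == 0].tolist()
print(len(complete_cols), complete_cols[:10])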
na_azdias.isna().sum()
Below I'm making a temporary DataFrame to help me understand the question: how much data is missing in each column?
# Creating a copy of the DF containing the counts of NAs with the names of the columns in the original DF as a column
na_azdias_df = na_azdias.isna().sum().reset_index()
# Filtering for only columns with more than 50000 NA values to declutter the plot
temp_view = na_azdias_df[na_azdias_df[0]>50000]
# FYI - 'index' is column name and '0' is count of NA values
temp_view.head(1)
Below is a horizontal bar plot showing each column's NA count as its bar width. Here I've noted (and highlighted in red) the columns that have many more NA values than the others...
# Sorting by count of NA values
temp_view.sort_values(by=0,ascending=True,inplace=True)
plt.figure(figsize=(6,12))
barh = plt.barh(temp_view['index'],temp_view[0],color = 'darkturquoise', alpha = 0.85)
# Highlighting the 6 columns with the most NA values (the last 6 bars after the ascending sort)
for i in range(1, 7):
    barh[-i].set_color('darkred')
plt.tight_layout()
So now I'll remove those 6 outlier columns...
outlier_columns = ['KBA05_BAUMAX','KK_KUNDENTYP','ALTER_HH','TITEL_KZ','GEBURTSJAHR','AGER_TYP']
na_azdias.drop(labels=outlier_columns,axis=1,inplace=True)
Printing the shape shows that the number of columns decreased from 85 to 79.
na_azdias.shape
Recap
Per my analysis of missing data above, I've found that the entire dataset contained 4,896,838 known missing values (recognized by Pandas) before I manually encoded NA values and contained 8,912,221 known missing values after I encoded NA values per the feat_info file and dropped "outlier" columns with many NA values.
I plotted the count of NA values per column and noted that the following 6 columns were outliers in terms of the number of missing values and were subsequently removed: KBA05_BAUMAX, KK_KUNDENTYP, ALTER_HH, TITEL_KZ, GEBURTSJAHR, and AGER_TYP.
Below I will assess how much data is missing in each row and evaluate whether I should drop any rows.
This analysis is going to be a bit more work than assessing missing data in each column, because dropping rows has more implications than dropping columns. By dropping rows, we're dropping entire observations (one observation being one person), which can skew/misrepresent the demographic data.
Spoiler alert: I will be dropping rows for the sake of compatibility with Principal Component Analysis and KMeans Clustering. Let's understand the consequences of dropping these rows.
# Creating a temp column that contains the count of NA values in each row
na_azdias['no_na_row'] = na_azdias.isna().sum(axis=1)
This histogram plots a distribution of non-zero NA values. Looks like most of our rows with NA values are missing between 1-10 of their values.
plt.hist(na_azdias[na_azdias['no_na_row']>0]['no_na_row'],bins=10)
Because we have a significant number of rows with 30-50 NA values, I'll split the data into two subsets: one with rows containing 30 or fewer NA values (na_azdias_b30) and one with rows containing more than 30 NA values (na_azdias_g30).
na_azdias_b30 = na_azdias[na_azdias['no_na_row']<=30]
na_azdias_g30 = na_azdias[na_azdias['no_na_row']>30]
Below is a quick comparison of the proportion of NA values in each row.
There's an oddly large number of rows with 49 and 53 missing values.
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(13,4))
sns.countplot(x='no_na_row',data=na_azdias_b30,ax=ax1,orient='v')
sns.countplot(x='no_na_row',data=na_azdias_g30,ax=ax2,orient='v')
plt.tight_layout()
Now I'll compare the distributions of particular columns' values between the two subsets of data.
Note this is different from what we did in the cell above, which was to plot the distribution of rows over the number of NA values they contain.
def compare_columns(df1,df2,column,df1_desc=None,df2_desc=None):
'''
Takes two DataFrames and a column name as input and plots two countplots of the given column for both DataFrames.
'''
# Taking max value count between the 2 columns to set as ylim for both plots
df1_max = df1[column].value_counts().iloc[0]
df2_max = df2[column].value_counts().iloc[0]
if df1_max > df2_max:
top = df1_max * 1.05
else:
top = df2_max * 1.05
# Plotting
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,3))
sns.countplot(df1[column],ax=ax1)
ax1.set_ylabel('Count')
ax1.set_ylim(top=top)
ax1.set_title(df1_desc)
sns.countplot(df2[column],ax=ax2)
ax2.set_ylabel('Count')
ax2.set_ylim(top=top)
ax2.set_title(df2_desc)
plt.tight_layout()
Below are plots for five columns that have no missing values. Plots on the left are from the subset of data with at most 30 NA values per row, and plots on the right are from the subset of data with more than 30 NA values per row.
for n in ['ANREDE_KZ','FINANZ_MINIMALIST','GREEN_AVANTGARDE','SEMIO_SOZ','ZABEOTYP']:
compare_columns(na_azdias_b30,na_azdias_g30,n,"Below 30 NA Values","Above 30 NA Values")
Clearly, the subset of data with more than 30 NA values in each row (plots on the right) is much smaller than the other subset. Additionally, the distributions of values between the two subsets are different for all of the columns above, with the exception of column "ANREDE_KZ" (Gender).
Note that we're looking at columns with zero NA values and comparing rows with many NA values against rows with few NA values.
This suggests a possible bias introduced in the way the data was collected and/or entered. Whatever the case may be, these rows with many NA values should be investigated, since they should hypothetically be distributed similarly to the subset of data with 30 or fewer NA values.
Below is an additional, supplementary visualization that overlays the distribution plots of the two subsets for the first 10 columns. It doesn't add much more information, so feel free to skim over it.
plt.figure(figsize=(15,12))
for i, col in enumerate(na_azdias_b30.columns[:10]):
plt.subplot(5, 2, i+1)
sns.distplot(na_azdias_b30[col][na_azdias_b30[col].notnull()], label='<=30 NA',hist=False)
sns.distplot(na_azdias_g30[col][na_azdias_g30[col].notnull()], label='>30 NA',hist=False)
plt.title('Distribution for column: {}'.format(col))
plt.legend();
plt.tight_layout()
plt.show()
Below I'll use the KS test to compare each column's distribution across our two datasets (rows with at most 30 NA values & rows with more than 30 NA values).
Purpose: To evaluate how different these two groups of data are distributed. If we see that these distributions are very different (low p-values as a result of our KS-Test), then we have statistical evidence that we can use to evaluate whether we should drop the >30 NA values-per row subset.
TL;DR? Scroll down below for my interpretation of the KS test and a quick note on how it works.
def KS_col_test(df1,df2):
'''
Used to evaluate whether distributions of data between two DataFrames with the same columns are different.
Input: 2 DataFrames with identical columns.
Output: D (KS Stat | Max Distance between the two cumulative probability functions for each Dataset) and p-values associated to each D
'''
# Lists for the KS Stat (D) and p-values to return
stats = []
p_values = []
# Grabbing an array of column names
cols = df1.columns.values
for col in cols:
try:
D,p = ks_2samp(df1[col],df2[col])
stats.append(D)
p_values.append(p)
except:
stats.append(np.NaN)
p_values.append(np.NaN)
return stats,p_values
# Calculating stats and p-values for each column
stats,p_values = KS_col_test(na_azdias_b30,na_azdias_g30)
# Creating a DataFrame to store the results. 80 rows (1/feature)
ks_df = pd.DataFrame(na_azdias_b30.columns.values,columns=['feature'])
ks_df['stat'] = stats
ks_df['p_value'] = p_values
# Plotting results. See below for my interpretation
fig,(ax1,ax2) = plt.subplots(1,2,figsize=(10,4))
sns.distplot(ks_df[ks_df['stat'].notnull()]['stat'],ax=ax1,kde=False)
ax1.set_title('Distribution of D KS Stats')
ax1.set_xlabel('D | KS Stat')
ax1.set_ylim(top=70)
sns.distplot(ks_df[ks_df['p_value'].notnull()]['p_value'],ax=ax2,kde=False)
ax2.set_title('Distribution of P-Values')
ax2.set_xlabel('P Values')
plt.tight_layout()
plt.show()
So let's recap what we just did. We have two DataFrames:
- na_azdias_b30: DataFrame with <= 30 NA values per row
- na_azdias_g30: DataFrame with > 30 NA values per row

We used SciPy's implementation of the two-sample Kolmogorov-Smirnov test on each of the 80 columns to evaluate the similarity of each column's distributions.
Brief Aside on the KS-Test
The KS test compares two samples to check whether they came from the same distribution. (In our case, we're checking whether the values in each column are distributed similarly across our two DataFrames, to evaluate the consequence of dropping one of them; namely, the DataFrame with more NA values.) The KS test builds an empirical cumulative distribution function for each sample and compares them. The maximum difference between these two ECDFs is our KS statistic, D. We then compute the p-value (null hypothesis: the two samples came from the same distribution). If our p-value is low, we are more confident that the samples come from different distributions.
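To make the mechanics concrete, here's a tiny synthetic example (a sketch; the data is made up) of how ks_2samp behaves for similar vs. different samples:
# Sketch: ks_2samp on synthetic data - same distribution vs. shifted distribution
rng = np.random.RandomState(0)
same_a, same_b = rng.normal(0, 1, 1000), rng.normal(0, 1, 1000)
shifted = rng.normal(0.5, 1, 1000)
print(ks_2samp(same_a, same_b))   # p-value should typically be large: no evidence of different distributions
print(ks_2samp(same_a, shifted))  # p-value should be tiny: strong evidence of different distributions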
KS Test Conclusion
So we see that our p-values are extremely low, telling us that the two DataFrames (30 or fewer NA values vs. more than 30 NA values) are highly likely to come from different distributions. I interpret this as a warning that if we drop our na_azdias_g30 DataFrame, we're going to lose valuable information that could affect our algorithm.
Below I investigated the remaining columns containing NA values and found that many of the rows with NA values have consecutive NA values in related columns, such as columns that begin with "KBA05..." and "PLZ8...". This suggests that these missing values are likely not due to random erroneous data entry, but are rather caused by some external factor, such as the process/methodology by which the data was collected.
TL;DR? Key finding: most of the rows returned have many NA values in related columns. For example, the cell immediately below returns rows with NA values in column 'KBA05_ANTG3', and by scrolling right to the end of the DataFrame, we see that many of the columns starting with 'PLZ8_' are missing in the same rows.
na_azdias[pd.isna(na_azdias['KBA05_ANTG3'])]
Recap
I split the data into two subsets: rows containing 0-30 NA values and rows containing more than 30 NA values. When looking at the distribution of data for columns that do not contain NA values, I noticed that the distribution of data between these two subsets look very different, which is contrary to what you'd expect. When you split data and look at distributions on columns with no NA values, you'd expect the distributions of these columns to look the same. This suggested to me that there is some sort of bias introduced in the way the data was collected or entered for rows with many NA values.
Additionally, I looked at these rows with many NA values, and they had consecutive NA values in related columns, such as columns that begin with "PLZ8..." or "KBA05...". This further suggests that these missing values are probably not caused by chance alone.
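One quick way to quantify this co-missingness (a sketch) is to correlate the missing-value indicators of related columns; values near 1 mean the columns tend to be missing together:
# Correlation between missingness indicators for the PLZ8_* columns
plz8_cols = [c for c in na_azdias.columns if c.startswith('PLZ8')]
na_azdias[plz8_cols].isna().corr()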
Let's see how much of our data has more than 30 missing values.
prop_g_30 = len(na_azdias_g30)/len(na_azdias)
print(f'{prop_g_30*100:.2f}% of our data has more than 30 missing values in a row.')
10.46% of the data is a huge chunk that could be important for our ML algorithm. We also saw through the plots and the KS test that the distributions of values between the two subsets are very different. However, considering the number of missing values per row (more than 30 out of 79 total features) that we'd have to impute, I'm dropping these rows (also for simplicity's sake...and laziness) and moving forward with a DataFrame called azdias_1.
na_azdias_b30.drop('no_na_row',inplace=True,axis=1)
azdias_1 = na_azdias_b30[:]
In this section, we will examine the different types of data that we have and evaluate next steps to prepare the data for our PCA and KMeans clustering algorithms.
Checking for missing data isn't the only way in which you can prepare a dataset for analysis. Since the unsupervised learning techniques to be used will only work on data that is encoded numerically, you need to make a few encoding changes or additional assumptions to be able to make progress. In addition, while almost all of the values in the dataset are encoded using numbers, not all of them represent numeric values. Check the third column of the feature summary (feat_info) for a summary of types of measurement.
In the first two parts of this sub-step, you will perform an investigation of the categorical and mixed-type features and make a decision on each of them, whether you will keep, drop, or re-encode each. Then, in the last part, you will create a new data frame with only the selected and engineered columns.
Data wrangling is often the trickiest part of the data analysis process, and there's a lot of it to be done here. But stick with it: once you're done with this step, you'll be ready to get to the machine learning parts of the project!
Let's use the feat_info file to get counts of how many features of each data type we have.
# Removing from feat_info the 6 outlier columns that were dropped from the data
feat_info = feat_info.set_index('attribute').drop(outlier_columns).reset_index()
feat_info['type'].value_counts()
# Creating a dictionary where the 4 data types above are the keys and the corresponding DF columns are the values
# This will help with breaking up each group of features for analysis
df_dtype = {}
dtypes = ['ordinal','categorical','numeric','mixed']
for d in dtypes:
df_dtype[d]=feat_info[feat_info['type']==d]['attribute']
Our approach will be the following...
def check_unique_vals(df):
    '''Prints the unique values found in each column of the given DataFrame.'''
    for col in df.columns:
        values = list(df[col].unique())
        print(f'{col}: {values}')
check_unique_vals(azdias_1[df_dtype['ordinal']])
The ordinals look okay (no strings; all of them are numeric).
# Columns that are tagged as "categorical" are either binary or multi_level
# Printing and tracking binary and multi-level categorical values
binary_cats = []
multi_level_cats = []
for col in azdias_1[df_dtype['categorical']]:
values = azdias_1[df_dtype['categorical']][col].nunique()
if values == 2:
binary_cats.append(col)
else:
multi_level_cats.append(col)
print(f'{col}: {values}')
# Examining our binary categorical values
azdias_1[binary_cats].head()
So it looks like we have to encode "OST_WEST_KZ" to be numerical.
# Looking at the values in this column
azdias_1['OST_WEST_KZ'].value_counts()
# I'll one-hot encode this column with the rest of our multi-level categorical variables
multi_level_cats.append('OST_WEST_KZ')
Now examining our multi-level categorical variables.
# These have to be one-hot encoded
azdias_1[multi_level_cats].head()
azdias_1[df_dtype['numeric']].describe()
Looks like all of the values in the numeric columns are indeed numeric. These look good.
azdias_1[df_dtype['mixed']].head()
# Printing the number of unique values within each of the mixed columns
for column in azdias_1[df_dtype['mixed']].columns:
print(column,azdias_1[df_dtype['mixed']][column].nunique())
Taking a look at the Data_Dictionary.md, I noticed these are all categorical variables. I'm treating them as multi-level categorical variables (i.e. one-hot encoding them), with the exception of "PRAEGENDE_JUGENDJAHRE" and "CAMEO_INTL_2015". I'll deal with those two in the next section.
for col in df_dtype['mixed']:
if col not in ['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE']:
multi_level_cats.append(col)
Recap
So we've taken a look at the ordinal, categorical, numeric, and mixed variable columns. We're fine with the ordinal and numeric columns. The categorical column can be further split between binary and multi-level columns, and we've already taken care of the binary categorical columns.
So now we must encode the multi-level categorical and mixed variable columns.
azdias_1 = pd.get_dummies(azdias_1,columns=multi_level_cats)
azdias_1.shape
In sum, after removing outlier columns and rows with over 30 NA values, I kept ordinal and numeric columns the same, and one-hot encoded multi-level categorical, one binary categorical, and some mixed variable columns.
Below I'm working on converting the "PRAEGENDE_JUGENDJAHRE" (combines information on three dimensions: generation by decade, movement, and nation) and "CAMEO_INTL_2015" (combines information on two axes: wealth and life stage).
I will break these codes into new ordinal variables: a decade and a movement variable for "PRAEGENDE_JUGENDJAHRE", and a wealth variable derived from the 'tens'-place digit of "CAMEO_INTL_2015" (which, for the purposes of this project, is equivalent to just treating them as their raw numeric values).
Below is the mapping for "PRAEGENDE_JUGENDJAHRE" from the Data_Dictionary.md
movement = []
decade = []
for val in azdias_1['PRAEGENDE_JUGENDJAHRE']:
# Mapping for movement
if val in [1,3,5,8,10,12,14]:
movement.append(0)
elif val in [2,4,6,7,9,11,13,15]:
movement.append(1)
else:
movement.append(np.NaN)
# Mapping for decade
if val in [1,2]:
decade.append(40)
elif val in [3,4]:
decade.append(50)
elif val in [5,6,7]:
decade.append(60)
elif val in [8,9]:
decade.append(70)
elif val in [10,11,12]:
decade.append(80)
elif val in [13,14,15]:
decade.append(90)
else:
decade.append(np.NaN)
# Quick sanity check
len(movement)==len(decade)==azdias_1.shape[0]
# Assigning with azdias_1's index to avoid misalignment, since the index is no longer contiguous after dropping rows
azdias_1['movement'] = pd.Series(movement, index=azdias_1.index)
azdias_1['decade'] = pd.Series(decade, index=azdias_1.index)
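For reference, an equivalent and more concise way to build these two columns (a sketch; it reproduces the same mapping, so it isn't needed in addition to the loop above) is pandas Series.map with dictionaries:
# Sketch: dict-based mapping; unmapped codes (and NaN) automatically become NaN
decade_map = {1: 40, 2: 40, 3: 50, 4: 50, 5: 60, 6: 60, 7: 60,
              8: 70, 9: 70, 10: 80, 11: 80, 12: 80, 13: 90, 14: 90, 15: 90}
movement_map = {c: 0 for c in [1, 3, 5, 8, 10, 12, 14]}
movement_map.update({c: 1 for c in [2, 4, 6, 7, 9, 11, 13, 15]})
azdias_1['decade'] = azdias_1['PRAEGENDE_JUGENDJAHRE'].map(decade_map)
azdias_1['movement'] = azdias_1['PRAEGENDE_JUGENDJAHRE'].map(movement_map)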
Below is a mapping for "CAMEO_INTL_2015" from the Data_Dictionary.md
cameo_intl = []
for val in azdias_1['CAMEO_INTL_2015']:
try:
val = int(val)
if 10 < val < 19:
cameo_intl.append(5)
elif 20 < val < 29:
cameo_intl.append(4)
elif 30 < val < 39:
cameo_intl.append(3)
elif 40 < val < 49:
cameo_intl.append(2)
elif 50 < val < 59:
cameo_intl.append(1)
else:
cameo_intl.append(np.nan)
except:
cameo_intl.append(np.nan)
# Again using azdias_1's index to keep the values aligned with the correct rows
azdias_1['cameo_intl'] = pd.Series(cameo_intl, index=azdias_1.index)
azdias_1.head()
azdias_1.drop(['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE'],axis=1,inplace=True)
Recap
I performed the following steps above...

PRAEGENDE_JUGENDJAHRE
I replaced "PRAEGENDE_JUGENDJAHRE" with two new columns, movement and decade, according to the mapping from the Data_Dictionary.md.

CAMEO_INTL_2015
I replaced "CAMEO_INTL_2015" with a column called cameo_intl according to the mapping from the Data_Dictionary.md.
Below is a huge function that replicates all the steps I performed above on any given DataFrame with the same structure as the demographics data.
def clean_data(df):
"""
Perform feature trimming, re-encoding, and engineering for demographics
data
INPUT: Demographics DataFrame
OUTPUT: Trimmed and cleaned demographics DataFrame
"""
feat_info = pd.read_csv('AZDIAS_Feature_Summary.csv',';')
feat_info['NA_tags'] = feat_info['missing_or_unknown'].apply(string_to_list)
att_index = feat_info.set_index('attribute')
# Looping through our column names
for column in df.columns:
# For every column in our DF, if a value is in the 'NA_tags' list, we replace it inplace with np.NaN
df[column].replace(att_index.loc[column].loc['NA_tags'],np.NaN,inplace=True)
# remove outlier columns
outlier_columns = ['KBA05_BAUMAX',
'KK_KUNDENTYP',
'ALTER_HH',
'TITEL_KZ',
'GEBURTSJAHR',
'AGER_TYP']
df.drop(labels=outlier_columns,axis=1,inplace=True)
# remove rows with > 30 NA values
df['no_na_row'] = df.isna().sum(axis=1)
df = df[df['no_na_row']<=30]
df.drop('no_na_row',inplace=True,axis=1)
# select, re-encode, and engineer column values.
feat_info = feat_info.set_index('attribute').drop(outlier_columns).reset_index()
df_dtype = {}
dtypes = ['ordinal','categorical','numeric','mixed']
for d in dtypes:
df_dtype[d]=feat_info[feat_info['type']==d]['attribute']
# Separating binary from multi-level categorical columns
binary_cats = []
multi_level_cats = []
for col in df[df_dtype['categorical']]:
values = df[df_dtype['categorical']][col].nunique()
if values == 2:
binary_cats.append(col)
else:
multi_level_cats.append(col)
# Including OST_WEST_KZ in the list to be one-hot encoded
multi_level_cats.append('OST_WEST_KZ')
for col in df_dtype['mixed']:
if col not in ['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE']:
multi_level_cats.append(col)
# One-hot encoding
df = pd.get_dummies(df,columns=multi_level_cats)
# Dealing with the 2 particular mixed vars...
movement = []
decade = []
for val in df['PRAEGENDE_JUGENDJAHRE']:
# Mapping for movement
if val in [1,3,5,8,10,12,14]:
movement.append(0)
elif val in [2,4,6,7,9,11,13,15]:
movement.append(1)
else:
movement.append(np.NaN)
# Mapping for decade
if val in [1,2]:
decade.append(40)
elif val in [3,4]:
decade.append(50)
elif val in [5,6,7]:
decade.append(60)
elif val in [8,9]:
decade.append(70)
elif val in [10,11,12]:
decade.append(80)
elif val in [13,14,15]:
decade.append(90)
else:
decade.append(np.NaN)
    # Use df's index to keep the new columns aligned, since rows were dropped above
    df['movement'] = pd.Series(movement, index=df.index)
    df['decade'] = pd.Series(decade, index=df.index)
cameo_intl = []
for val in df['CAMEO_INTL_2015']:
try:
val = int(val)
if 10 < val < 19:
cameo_intl.append(5)
elif 20 < val < 29:
cameo_intl.append(4)
elif 30 < val < 39:
cameo_intl.append(3)
elif 40 < val < 49:
cameo_intl.append(2)
elif 50 < val < 59:
cameo_intl.append(1)
else:
cameo_intl.append(np.nan)
except:
cameo_intl.append(np.nan)
    df['cameo_intl'] = pd.Series(cameo_intl, index=df.index)
df.drop(['CAMEO_INTL_2015','PRAEGENDE_JUGENDJAHRE'],axis=1,inplace=True)
return df
Testing the cleaning function on a copy of our data...
test_func_az = azdias[:]
test_df = clean_data(test_func_az)
azdias_1.shape==test_df.shape
So we're going to use principal component analysis (PCA) to reduce the number of dimensions (features) of our data. However, in order to do so, our data needs to be standardized (mean = 0, standard deviation = 1). We also need to take care of all the NA values that we kept.
My approach is the following:
- Impute missing values using the median of each column.
- Use StandardScaler to standardize our data.

# Imputing missing values with the median
med_imp = SimpleImputer(strategy='median')
azdias_2 = med_imp.fit_transform(azdias_1)
# Standardizing
scaler = StandardScaler()
azdias_2 = scaler.fit_transform(azdias_2)
For simplicity's sake, I chose to impute the data with the median value of each column. Then, I standardized our data using a StandardScaler so that PCA computes the principal components on comparably scaled features and so that K-Means groups observations based on their relative spread rather than their absolute values.
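As a side note, these two steps could be bundled into a single scikit-learn Pipeline so that the exact same fitted transformations can be reapplied to new data later (a minimal sketch, assuming the same strategy choices as above):
# Sketch: imputation + scaling as one reusable Pipeline object
from sklearn.pipeline import Pipeline
preprocess = Pipeline([('impute', SimpleImputer(strategy='median')),
                       ('scale', StandardScaler())])
# preprocess.fit_transform(azdias_1) would reproduce azdias_2, and
# preprocess.transform(...) would later apply the same fitted steps to the customer data.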
Principal Component Analysis, crudely simplified, transforms your data to make it more succinct. It takes columns that are strongly related to each other and engineers new columns that better describe the relationships among the original columns. These new columns are principal components, and we can use any given number of these components for analysis.
The more principal components (PC) we use, the more information we have, but computing KMeans clustering is slower, and we increase risk of overfitting. So ideally, we want to choose a number of PC's that best represent our dataset without losing too much information.
I'd like to try several n component values for PCA Dimensionality Reduction. I'll start with 40 to see how much variance is explained. Ideally, I'd like to get to the number of components with 85% variance explained.
Quick Aside: PC's describe relationships between the columns of our data. The most relevant (most useful) PC's are said to explain the most variance. So, we want to choose a number of the PC's that explain the most variance while reducing the number of features we use.
This variance can be calculated by calling the explained_variance_ratio_ attribute of a fitted pca object.
azdias_2.shape
PCA Test Run for 40.
pca = PCA(40)
X_pca = pca.fit_transform(azdias_2)
pca.explained_variance_ratio_.sum()
So about 50% is explained with 40 components. Below I'll loop through several values for No. Principal Components until we get 85% variance explained.
n_components = [80,110,150,180,210]
for n in n_components:
pca = PCA(n)
X_pca = pca.fit_transform(azdias_2)
if pca.explained_variance_ratio_.sum() >= .85:
break
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
Looks good, but 150 components feels a bit high. So I'm going to re-run this loop in smaller increments to just get around 85%.
n_components = [120,125,130,135,140]
for n in n_components:
pca = PCA(n)
X_pca = pca.fit_transform(azdias_2)
if pca.explained_variance_ratio_.sum() >= .85:
break
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
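As an aside, scikit-learn can also pick the number of components for a target explained-variance fraction directly (a sketch, not used above):
# Sketch: passing a float between 0 and 1 makes PCA keep the smallest number of
# components whose cumulative explained variance reaches that fraction
pca_auto = PCA(n_components=0.85, svd_solver='full')
X_pca_auto = pca_auto.fit_transform(azdias_2)
print(pca_auto.n_components_, pca_auto.explained_variance_ratio_.sum())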
Now plotting total explained variance for our principal components.
ticks = np.arange(pca.n_components)
values = pca.explained_variance_ratio_
plt.figure(figsize = (13,5))
ax = plt.subplot(111)
cumvals = np.cumsum(values)
ax.bar(ticks,values)
ax.plot(ticks,cumvals)
ax.xaxis.set_tick_params(width=1)
ax.yaxis.set_tick_params(width=2,length=12)
ax.set_xlabel("No. Principal Components")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component\nScree Plot')
plt.show()
Recap
So above we performed PCA many times until we found a number of principal components that explains 85% variance. The scree plot above plots principal components on the x-axis, variance explained on the y-axis, bars representing amount of variance explained, and a line that cumulatively sums up the variance explained across our principal components. I've decided to go with 130 principal components because it explains a good amount of variance (again, 85%) while greatly reducing number of dimensions from 257 to 130.
Below I'll take a swing at interpreting what some principal components can tell us about our data.
def plot_pca_dim(pca,feat_names,dim,features):
'''
    For a given fitted PCA object, component index (dim), and number of features, this plots
    the features with the largest absolute weights in that component.
    '''
sort_abs = lambda x: abs(x[0])
names_to_weights = sorted(zip(pca.components_[dim],feat_names),key=sort_abs,reverse=True)
names_to_weights = np.array(names_to_weights)
nm_to_wt_df = pd.DataFrame(names_to_weights,columns=['var','feature'])
nm_to_wt_df['var'] = nm_to_wt_df['var'].astype(float)
return px.bar(nm_to_wt_df[:features],x='feature',y='var',
color='var',color_continuous_scale=px.colors.sequential.Sunsetdark,
labels = {'var': 'Variance (%)','feature':'Feature Names'},
height=400,width=600)
feat_names = azdias_1.columns.values
Skip ahead past all the plots for my interpretation of the PCs.
plot_pca_dim(pca,feat_names,0,10)
Field Name | Interpretation | Pos/Neg |
---|---|---|
MOBI_REGIO | Region: Movement Patterns | Neg |
LP_STATUS_GROB_1.0 | Social Status | Pos |
KBA05_ANTG1 | No. 1-2 family houses in microcell | Neg |
PLZ8_ANTG3 | No. 6-10 family houses in region | Pos |
PLZ8_ANTG1 | No. 1-2 family houses in region | Neg |
FINANZ_MINIMALIST | Low Financial Interest | Neg |
KBA05_GBZ | No. Buildings in microcell | Neg |
HH_EINKOMMEN_SCORE | Est. Household net income | Pos |
PLZ8_ANTG4 | No. 10+ family homes in region | Pos |
PLZ8_BAUMAX_1.0 | Most common building type | Neg |
Observations: Component 1
plot_pca_dim(pca,feat_names,1,10)
Field Name | Interpretation | Pos/Neg |
---|---|---|
ALTERSKATEGORIE_GROB | Estimated age | Pos |
FINANZ_SPARER | Money Saver | Neg |
FINANZ_VORSORGER | Being financially prepared | Pos |
FINANZ_UNAUFFAELLIGER | Financially inconspicuous | Neg |
SEMIO_REL | Personality: Religious | Neg |
SEMIO_TRADV | Personality: Traditional-minded | Neg |
SEMIO_PFLICHT | Personality: Dutiful | Neg |
FINANZ_ANLEGER | Investor | Neg |
ZABEOTYP_3 | Fair energy consumer | Pos |
SEMIO_ERL | Personality: Event-oriented | Pos |
Observations: Component 2
plot_pca_dim(pca,feat_names,2,10)
Field Name | Interpretation | Pos/Neg |
---|---|---|
ANREDE_KZ | Gender (1-M, 2-F) | Neg |
SEMIO_VERT | Personality: Dreamful | Pos |
SEMIO_KAEM | Personality: Combative Attitude | Neg |
SEMIO_DOM | Personality: Dominant-minded | Neg |
SEMIO_KRIT | Personality: Critical-minded | Neg |
SEMIO_FAM | Personality: Family-minded | Pos |
SEMIO_SOZ | Personality: Socially-minded | Pos |
SEMIO_KULT | Personality: Cultural-minded | Pos |
SEMIO_ERL | Personality: Event-oriented | Neg |
FINANZ_ANLEGER | Investor | Neg |
Observations: Component 3
pca.explained_variance_ratio_[:3].sum()
Cumulatively, they only explain approx. 13.9% of variance from our features, so they don't explain very much variance.
PCA Summary Observations
Below I will use sklearn's KMeans clustering algorithm to identify different segments of the demographics data.
One thing that's very tricky when it comes to using KMeans is the fact that you have to choose the number of clusters that the algorithm will find. This can be difficult since we don't know how many clusters exist!
So, here's what we're going to do...
Intuitively, this makes sense. If there are 5 clear-cut clusters in our population, adding clusters 1 through 5 will each significantly reduce the average distance between points and their assigned centroid. Once we add a 6th cluster, it will situate itself close to one of the already-defined 5 clusters, which will reduce that average distance, but not by much.
scores = []
First trying 3-10 clusters to see how KMeans performs.
# Over a number of different cluster counts...
clusters = np.arange(3,11)
for i in clusters:
    start = time.time()
    model = KMeans(i)
    model.fit(X_pca)
    # Score is the sum of squared distances from each sample to its assigned centroid
    score = abs(model.score(X_pca))
    scores.append(score)
    end = time.time()
    print(f'Running Score for {i} Clusters: {score:.3f}')
    print(f'Time taken: {end-start:.2f} seconds.')
Plotting the scores below...
plt.plot(clusters,scores,marker="o")
plt.show()
The scores I'm plotting are the sum of the squared distances from each point to its corresponding centroid. The trend looks linear. A possible sign of having too many clusters is diminished reduction in these distances for each additional cluster. This is known as the "elbow" method. Given the trend looks linear, I'm going to try a few more clusters below.
clusters = np.arange(11,16)
for i in clusters:
    start = time.time()
    model = KMeans(i)
    model.fit(X_pca)
    # Score is the sum of squared distances from each sample to its assigned centroid
    score = abs(model.score(X_pca))
    scores.append(score)
    end = time.time()
    print(f'Running Score for {i} Clusters: {score:.3f}')
    print(f'Time taken: {end-start:.2f} seconds.')
Replotting the scores.
plt.plot(np.arange(3,16),scores,marker="o")
plt.show()
Although it's not very clear in the plot, it looks like our distance reduction began diminishing from 13 to 14 clusters, from reading out the running score printed 3 cells above. I'll stick with 13 clusters and proceed with a final model fitted on the PCA'd version of our demographics data.
final_model = KMeans(13)
final_model.fit(X_pca)
gen_preds = final_model.predict(X_pca)
Recap
I fit and scored KMeans on our PCA-transformed data for 3 to 15 centroids. The score is calculated as the sum of squared distances between samples and their assigned centroid. As we add centroids, the score decreases by less and less. 10 and 13 centroids seemed like good candidates, but I chose to proceed with 13 clusters because the sum of squared distances was still being reduced noticeably between 11 and 13 centroids.
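A complementary diagnostic (a sketch, not used above) is the silhouette score; since it is expensive to compute on roughly 800k rows, a random subsample is assumed here:
# Sketch: silhouette score of the final model on a random subsample of the PCA-transformed data
from sklearn.metrics import silhouette_score
rng = np.random.RandomState(42)
idx = rng.choice(X_pca.shape[0], size=10000, replace=False)
print(silhouette_score(X_pca[idx], final_model.predict(X_pca[idx])))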
Below I'm cleaning, scaling, PCA-ing, and clustering our customer data using all the objects and steps we used on the demographics data.
# Load in the customer demographics data.
customers = pd.read_csv('Udacity_CUSTOMERS_Subset.csv',';')
customers.head()
Cleaning
cleaned_cust = clean_data(customers)
cleaned_cust.shape
The imputer and scaler aren't going to work yet because "GEBAEUDETYP_5.0" isn't in our cleaned_cust DataFrame. This column wasn't created because the "GEBAEUDETYP" column didn't have any values of 5 in our customer set, and we one-hot encoded it using pd.get_dummies instead of OneHotEncoder. Adding a column of only zeros for this missing dummy is equivalent to having no values of 5 for this feature.
for i in azdias_1.columns:
if i not in cleaned_cust.columns:
print(i)
# Adding the missing dummy column as all zeros (avoids hard-coding the row count and any index misalignment)
cleaned_cust['GEBAEUDETYP_5.0'] = 0
# Reordering columns to align with original DataFrame
cols = azdias_1.columns.tolist()
cleaned_cust = cleaned_cust[cols]
cleaned_cust.shape
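For future reference, this column alignment could also be done in one step with reindex (a sketch, equivalent to the manual fix above):
# Sketch: add any missing dummy columns as zeros and order columns like the general-population data
cleaned_cust = cleaned_cust.reindex(columns=azdias_1.columns, fill_value=0)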
Imputing & Standardization
imputed = med_imp.transform(cleaned_cust)
standardized = scaler.transform(imputed)
standardized.shape
PCA
X_pca_cust = pca.transform(standardized)
X_pca_cust.shape
Clustering
preds = final_model.predict(X_pca_cust)
Now that we have both our demographics data and our customer data clustered, let's compare them!
df_preds_cust = pd.DataFrame(preds,columns=['cust_preds'])
df_preds_gen = pd.DataFrame(gen_preds,columns=['gen_preds'])
prop_cust = df_preds_cust['cust_preds'].value_counts().sort_index()/ \
sum(df_preds_cust['cust_preds'].value_counts())
prop_gen = df_preds_gen['gen_preds'].value_counts().sort_index()/ \
sum(df_preds_gen['gen_preds'].value_counts())
fig,ax = plt.subplots(figsize=(15,5))
width = 0.35
cust = ax.bar(prop_cust.index-.51*width, prop_cust, width, color='darkolivegreen')
gen = ax.bar(prop_gen.index+.51*width, prop_gen, width, color='steelblue')
ax.set_ylabel('Proportion (%)')
ax.set_xlabel('Labels')
ax.set_title('General Population Clusters Vs Customer Clusters')
ax.set_xticks(prop_cust.index)
ax.set_ylim(top=max(prop_cust.append(prop_gen))*1.15)
ax.legend(('Customer','General Pop'))
def autolabel(rects):
"""
Attach a text label above each bar displaying its height
"""
for rect in rects:
height = rect.get_height()
ax.text(rect.get_x() + rect.get_width()/2., 1.05*height,
'%.2f' % float(height),
ha='center', va='bottom')
autolabel(cust)
autolabel(gen)
The two relative distributions are slightly different. The most significant differences between them are the following...
So we can see that some clusters are over-represented in our customer data and some are under-represented in our customer data. Let's take a look at what characteristics describe these clusters to help us understand who our customers are.
Below I'm extracting cluster 7's top 5 most significant principal components.
centroid = final_model.cluster_centers_[7]
centroid_df = pd.DataFrame(centroid,columns=['pca_weights'])
# Sorting the centroid in descending absolute value order
# Note indexes correspond to principal components
centroid_df.reindex(centroid_df.abs().sort_values('pca_weights',ascending=False).index).head()
Relatively speaking, it looks like PC 0 was highly negatively influencing this cluster, and PC 3 was highly positively influencing this cluster. Let's take a look at these components...
plot_pca_dim(pca,feat_names,0,10)
# Provided by Udacity Reviewer - Inserted for future reference
# This function is a (less pretty but) more concise way of showing the features that are
# Most negatively correlated with the cluster
# Most positively correlated with the cluster
# However, it is computed in a slightly different way
# (by inverse transforming our PCA data back into orig features)
def plot_scaled_comparison(df_sample, kmeans, cluster):
'''
    Performs the PCA inverse_transform of a selected cluster centroid from our trained model to obtain
the scaled feature weights. This helps us interpret who belongs in a given cluster.
'''
X = pd.DataFrame.from_dict(dict(zip(df_sample.columns,
pca.inverse_transform(kmeans.cluster_centers_[cluster]))), orient='index').rename(
columns={0: 'feature_values'}).sort_values('feature_values', ascending=False)
X['feature_values_abs'] = abs(X['feature_values'])
pd.concat((X['feature_values'][:10], X['feature_values'][-10:]), axis=0).plot(kind='barh');
plot_scaled_comparison(azdias_1, final_model, 5)
plot_pca_dim(pca,feat_names,3,5)
PC 0 (Negatively weighted by cluster 7)
PC 3 (Positively weighted by cluster 7)
My intuition suggests to me that cluster 7 reflects older and wealthier households that likely live in bigger and wealthier areas in Germany.
Below I'm extracting cluster 5's top 5 most significant principal components.
centroid = final_model.cluster_centers_[5]
centroid_df = pd.DataFrame(centroid,columns=['pca_weights'])
centroid_df.reindex(centroid_df.abs().sort_values(by='pca_weights',ascending=False).index).head()
Relatively speaking, it looks like PC 0 was highly positively influencing this cluster, and PC 1 was highly negatively influencing this cluster. Let's take a look at these components...
plot_pca_dim(pca,feat_names,0,5)
plot_pca_dim(pca,feat_names,1,5)
PC 0 (Positively weighted by cluster 5. Note this was Negatively weighted by cluster 7.)
PC 1 (Negatively weighted by cluster 5)
My intuition tells me that cluster 5 represents a lower-income and younger part of our population that tends to save money and isn't very religious.
Limitations
Before I conclude, allow me to go over some limitations that you should be aware of...
Conclusion
Our customers over-represent cluster 7 and under-represent cluster 5. By looking closer at the principal components that most strongly distinguish these clusters, I noted that cluster 7 generally comprises older & wealthier people who live in larger houses (1-2 family homes vs 6-10) compared to cluster 5. On the other hand, cluster 5 generally looks younger and less wealthy.
We can deduce that most of these customers are older and wealthier. Cluster 7 makes up about 32% of our customer population per our KMeans clustering. So if my intuition is correct, I recommend that this mail-order company target older and wealthier households in its marketing efforts if it wants to reinforce or improve its standing with this set of customers. Alternatively, this disparity may be caused by some other confounding factor, such as an existing bias that already favors this subset of our population. If that is the case, it also seems reasonable to target younger money-savers, since that subset of the population is under-represented among our customers.
The call to action is situational. What's more important is the awareness that the existing customer population is older and wealthier, and that there is an opportunity for growth in a younger and less-wealthy subset of the German population.
Below I'll address some of the things I see as weaknesses/points of improvement in this analysis that would be worth working on if I continued along with this project.
Pipeline
feature.