Hello! Thanks for opening my notebook. In this notebook you will find data analysis and use of shallow machine learning algorithms to predict possible donors for a fictional company. Our "target" is people who make above $50,000 per year, because this group of people are most likely to be willing to donate to my non-profit. So, the goal of the algorithm is to predict level of income based on features that we will explore below.

Effective exploratory data analysis, evaluation of potentially useful algorithms, and hyperparameter tuning are all crucial tools that can improve analysis results and prediction accuracy.

In this notebook, you will find...

  1. Exploratory Data Analysis
  2. Data Preparation via Pipelines
  3. Initial model evaluation for multiple shallow ML algorithms
  4. Hyperparameter tuning using Randomized Search Cross Validation

Importing Libraries & Data

In [1]:
import numpy as np
import pandas as pd

from time import time

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, roc_auc_score,make_scorer
from sklearn.model_selection import RandomizedSearchCV,cross_val_score

from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
In [2]:
import plotly_express as px
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

%matplotlib inline

from IPython.display import display
In [3]:
training_data = pd.read_csv('./udacity-mlcharity-competition/census.csv')
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')

Exploratory Data Analysis

In [4]:
explore_training = training_data.copy()
In [5]:
explore_training.head(1)
Out[5]:
age workclass education_level education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov Bachelors 13.0 Never-married Adm-clerical Not-in-family White Male 2174.0 0.0 40.0 United-States <=50K

Looks like the training data is complete. We have a near split between our numerical features and our categorical features.

In [6]:
explore_training.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
age                45222 non-null int64
workclass          45222 non-null object
education_level    45222 non-null object
education-num      45222 non-null float64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
sex                45222 non-null object
capital-gain       45222 non-null float64
capital-loss       45222 non-null float64
hours-per-week     45222 non-null float64
native-country     45222 non-null object
income             45222 non-null object
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB

Numerical Data

Our capital-gain and capital-loss features have some irregular distributions.

The modes of education-num and hours-per-week make sense.

  • For education-num, most people graduated from high-school, then we have some-college, then a finished bachelors mode (according to 'education-level' feature explored below)
  • For hours-per-week, most people work a standard 40-hour work week
In [7]:
explore_training.hist(figsize=(13,7),bins=20,layout=(2,3))
plt.tight_layout()
plt.show()

Capital-gain and loss show a huge standard deviation that we should fix later.

In [8]:
explore_training.describe()
Out[8]:
age education-num capital-gain capital-loss hours-per-week
count 45222.000000 45222.000000 45222.000000 45222.000000 45222.000000
mean 38.547941 10.118460 1101.430344 88.595418 40.938017
std 13.217870 2.552881 7506.430084 404.956092 12.007508
min 17.000000 1.000000 0.000000 0.000000 1.000000
25% 28.000000 9.000000 0.000000 0.000000 40.000000
50% 37.000000 10.000000 0.000000 0.000000 40.000000
75% 47.000000 13.000000 0.000000 0.000000 45.000000
max 90.000000 16.000000 99999.000000 4356.000000 99.000000

Looking at linear correlations...

  • looks like none of the quantitative features have strongly correlated relationships with one another except maybe hours-per-week and education-num (0.1462)
In [9]:
explore_training.corr()
Out[9]:
age education-num capital-gain capital-loss hours-per-week
age 1.000000 0.037623 0.079683 0.059351 0.101992
education-num 0.037623 1.000000 0.126907 0.081711 0.146206
capital-gain 0.079683 0.126907 1.000000 -0.032102 0.083880
capital-loss 0.059351 0.081711 -0.032102 1.000000 0.054195
hours-per-week 0.101992 0.146206 0.083880 0.054195 1.000000

Wondered if there could be a relationship between age and hours-per-week. I realized in the graph below that it appears many people are working 100-hour work weeks?? Let's look into it.

In [10]:
sns.jointplot(x='age',y='hours-per-week',data= explore_training)
Out[10]:
<seaborn.axisgrid.JointGrid at 0x1a24f31e10>

From the histogram below, 150 people are working between 95-99 hours per week, which doesn't seem like a very significant amount. I'm not sure if this is valid, since some people may very well be working 99 hour work weeks. It doesn't feel accurate, but I'll leave these untouched since there's so few.

In [11]:
explore_training['hours-per-week'].sort_values(ascending=False)
Out[11]:
31683    99.0
4979     99.0
12603    99.0
24643    99.0
11831    99.0
20676    99.0
14062    99.0
7444     99.0
21058    99.0
27793    99.0
22082    99.0
38656    99.0
32079    99.0
7991     99.0
32642    99.0
23494    99.0
3285     99.0
36796    99.0
24682    99.0
40828    99.0
39640    99.0
28497    99.0
12548    99.0
17661    99.0
34993    99.0
21526    99.0
18288    99.0
27839    99.0
41721    99.0
41379    99.0
         ... 
36169     2.0
27992     2.0
35564     2.0
18006     2.0
145       2.0
10018     2.0
17933     2.0
12951     2.0
7779      2.0
43051     2.0
36406     2.0
42691     2.0
13559     2.0
31314     2.0
32806     2.0
21544     2.0
5680      2.0
16342     2.0
175       1.0
10583     1.0
38556     1.0
19372     1.0
39810     1.0
23240     1.0
40494     1.0
21277     1.0
18307     1.0
30967     1.0
36702     1.0
22508     1.0
Name: hours-per-week, Length: 45222, dtype: float64
In [12]:
px.histogram(explore_training,x='hours-per-week',nbins=25,width=1000,height=400)

Categorical Data

Taking a look at the values within each categorical feature.

In [13]:
explore_training['workclass'].value_counts()
Out[13]:
 Private             33307
 Self-emp-not-inc     3796
 Local-gov            3100
 State-gov            1946
 Self-emp-inc         1646
 Federal-gov          1406
 Without-pay            21
Name: workclass, dtype: int64

Exploring hours-per-week between working class and education level.

In [14]:
px.histogram(explore_training,x='workclass',y='hours-per-week',histfunc='avg',width=700,height=400)

Looks like what I'm interpreting as the highest levels of education ('Prof-school', 'Masters', 'Doctorate', and 'Bachelors') are also assocaited with the highest amount of hours worked in a week. These may be good indicators of level of income.

In [15]:
px.histogram(explore_training,x='education_level',y='hours-per-week',histfunc='avg',width=700,height=400)
In [16]:
explore_training['education_level'].value_counts()
Out[16]:
 HS-grad         14783
 Some-college     9899
 Bachelors        7570
 Masters          2514
 Assoc-voc        1959
 11th             1619
 Assoc-acdm       1507
 10th             1223
 7th-8th           823
 Prof-school       785
 9th               676
 12th              577
 Doctorate         544
 5th-6th           449
 1st-4th           222
 Preschool          72
Name: education_level, dtype: int64
In [17]:
explore_training['marital-status'].value_counts()
Out[17]:
 Married-civ-spouse       21055
 Never-married            14598
 Divorced                  6297
 Separated                 1411
 Widowed                   1277
 Married-spouse-absent      552
 Married-AF-spouse           32
Name: marital-status, dtype: int64
In [18]:
explore_training['occupation'].value_counts()
Out[18]:
 Craft-repair         6020
 Prof-specialty       6008
 Exec-managerial      5984
 Adm-clerical         5540
 Sales                5408
 Other-service        4808
 Machine-op-inspct    2970
 Transport-moving     2316
 Handlers-cleaners    2046
 Farming-fishing      1480
 Tech-support         1420
 Protective-serv       976
 Priv-house-serv       232
 Armed-Forces           14
Name: occupation, dtype: int64

Strange - about 40% of our observations are labeled as "Husband"

In [19]:
explore_training['relationship'].value_counts()
Out[19]:
 Husband           18666
 Not-in-family     11702
 Own-child          6626
 Unmarried          4788
 Wife               2091
 Other-relative     1349
Name: relationship, dtype: int64
In [20]:
p = sum(explore_training['relationship']==' Husband')/len(explore_training['relationship'])
print('Percent of our dataset that is labeled "Husband": %.3f'% p)
Percent of our dataset that is labeled "Husband": 0.413
In [21]:
explore_training['race'].value_counts()
Out[21]:
 White                 38903
 Black                  4228
 Asian-Pac-Islander     1303
 Amer-Indian-Eskimo      435
 Other                   353
Name: race, dtype: int64

Looks like most of our dataset observations are Male as well - would be good to check if our testing data is distributed like this as well.

In [22]:
explore_training['sex'].value_counts()
Out[22]:
 Male      30527
 Female    14695
Name: sex, dtype: int64

Yup.

In [23]:
testing_data['sex'].value_counts()
Out[23]:
 Male      30512
 Female    14691
Name: sex, dtype: int64

Native-country doesn't seem like a very strong feature given more than 90% of our observations are from the US.

In [25]:
p = sum(explore_training['native-country']==' United-States')/len(explore_training['native-country'])
print('Percent of our dataset that is labeled "United-States": %.3f'% p)
Percent of our dataset that is labeled "United-States": 0.913
In [26]:
explore_training['native-country'].value_counts()
Out[26]:
 United-States                 41292
 Mexico                          903
 Philippines                     283
 Germany                         193
 Puerto-Rico                     175
 Canada                          163
 India                           147
 El-Salvador                     147
 Cuba                            133
 England                         119
 China                           113
 Jamaica                         103
 South                           101
 Italy                           100
 Dominican-Republic               97
 Japan                            89
 Guatemala                        86
 Vietnam                          83
 Columbia                         82
 Poland                           81
 Haiti                            69
 Portugal                         62
 Iran                             56
 Taiwan                           55
 Greece                           49
 Nicaragua                        48
 Peru                             45
 Ecuador                          43
 Ireland                          36
 France                           36
 Thailand                         29
 Hong                             28
 Cambodia                         26
 Trinadad&Tobago                  26
 Yugoslavia                       23
 Outlying-US(Guam-USVI-etc)       22
 Laos                             21
 Scotland                         20
 Honduras                         19
 Hungary                          18
 Holand-Netherlands                1
Name: native-country, dtype: int64

Data Preparation

First I'll separate our labels and update them to binary numbers

In [27]:
label_dict = {'<=50K':0,'>50K':1}
training_labels = training_data['income'].map(label_dict)
training_data.drop('income',axis=1,inplace=True)
In [28]:
num_features = ['age','education-num','capital-gain','capital-loss','hours-per-week']
cat_features = ['workclass','education_level','marital-status','occupation',
                'relationship','race','sex','native-country']

num_training_data=training_data[num_features].copy()
cat_training_data=training_data[cat_features].copy()
full_training_data = training_data.copy()

*PLEASE READ BEFORE CONTINUING*
Rather than cleaning my data step by step, note that I've built several transformers to practice 1) building transformers and 2) assembling them into a pipeline. The two following sections show me building one pipeline for categorical data, one pipeline for numerical data, then combining them into one full pipeline that uses both of the previous pipelines.

Categorical Data

Using a custom imputer to fill NaNs with median or mode.

In [29]:
# Creating a custom imputer because SimpleImputer() returns an array. 
    # I want it to return a DataFrame because I'm not comfortable with Pipelines yet and I want to inspect it easily after.
class CustomImputer(BaseEstimator,TransformerMixin):
    def __init__(self,method):
        self.method = method
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        if self.method == 'median':
            for column in X.columns:
                X[column] = X[column].fillna(X[column].median())
        elif self.method == 'mode':
            for column in X.columns:
                X[column] = X[column].fillna(X[column].mode()[0])            
        return X

Most of the categorical features are nominal (no order) with the exception of education_level (ordinal data), so here I will manually map the ordered categories to the numerical values in the dictionary below within the EduLevelMapper custom transformer, which will also standardize or normalize these values depending on the scale attribute.

In [30]:
# The ordered category mapping that is passed into the transformer below. 
    # Note this is subjective ordering and is not based on any metadata provided.
ordered_categories = {' Preschool':0,' 1st-4th':1,' 5th-6th':2,' 7th-8th':3,
                      ' 9th':4,' 10th':5,' 11th':6,' 12th':7,' HS-grad':8,
                      ' Assoc-acdm':9,' Assoc-voc':10,' Some-college':11,
                      ' Bachelors':12,' Prof-school':13,' Masters':14,' Doctorate':15}
In [31]:
class EduLevelMapper(BaseEstimator,TransformerMixin):
    def __init__(self,ordered_categories,scale=None):
        self.ordered_categories = ordered_categories
        self.scale=scale
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        # Mapping ordinal values
        X['education_level'] = X['education_level'].map(self.ordered_categories)
        # Normalizing or standardizing the values
        if self.scale == 'normalization':
            x_min = X['education_level'].min()
            x_max = X['education_level'].max()
            x_range = x_max-x_min
            X['education_level']=(X['education_level']-x_min)/x_range
        elif self.scale == 'standardization':
            x_mean = X['education_level'].mean()
            x_std = X['education_level'].std()
            X['education_level'] = (X['education_level']-x_mean)/x_std     
        return X  

For the rest of the categorical data, this transformer mimics pd.get_dummies to One Hot Encode the rest of our features. pd.get_dummies was selected over OneHotEncoder() because OneHotEncoder encodes ALL features (including the "education_level" feature that was encoded in EduLevelMapper, and I'm not sure how to force OneHotEncoder to exclude this feature), but pd.get_dummies only encodes categorical features, ignoring "education_level".

In [32]:
class GetDummies(BaseEstimator,TransformerMixin):
    def __init__(self):
        None
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        return pd.get_dummies(X)

Assembling the two-step categorical data pipeline, and an example of how it transforms the categorical data below.

In [33]:
# Note I selected "standardization" rather than "normalization"
cat_pipeline = Pipeline([
    ('imputer',CustomImputer('mode')),
    ('edu_mapper',EduLevelMapper(ordered_categories,'standardization')),
    ('get_dummies',GetDummies())])
In [34]:
# Fitting our categorical data and then transforming it
cat_pipeline_output = cat_pipeline.fit_transform(cat_training_data)
In [35]:
# Category Pipeline Output - note education_level is standardized 
    # and the rest of our categorical features are one-hot encoded
cat_pipeline_output.head()
Out[35]:
education_level workclass_ Federal-gov workclass_ Local-gov workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay marital-status_ Divorced marital-status_ Married-AF-spouse ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.912551 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
1 0.912551 0 0 0 0 1 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
2 -0.563973 0 0 1 0 0 0 0 1 0 ... 0 0 0 0 0 0 0 1 0 0
3 -1.302235 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0
4 0.912551 0 0 1 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

5 rows × 83 columns

Numerical Data

Now moving onto our numerical data...

  • In case our dataset has missing values, I'm using an imputer to fill our missing values with the median value
    • Note that I'm using a custom imputer that does the same thing as SimpleImputer(strategy='median')
      • I'm using a custom one so that our numerical pipeline outputs a DataFrame that my other custom transformers can build on top of. It's also easier to inspect.
      • Note that would have been much better for me to have built my transformers so that they take arrays as input and give arrays as output, but I've chosen to build them out using DataFrames because they're easier for me to read and make sure everything makes sense.
  • Because our capital-gains and capital-loss features are highly skewed, I will apply a logarithmic transformation to reduce the severity of the skew
  • After, I will use the a scaler on all of our numerical features so they're comparable in value

Before numerical pipeline transformations...

In [36]:
num_training_data.hist(figsize=(8,5),layout=(2,3))
plt.tight_layout()
plt.show()
In [37]:
# This is the custom transformer that will apply a logarithmic transformation to the skewed features
skewed = ['capital-gain','capital-loss']
class SkewedDistTransformer(BaseEstimator,TransformerMixin):
    def __init__(self,skewed_features):
        self.skewed_features = skewed_features
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        X[skewed] = X[skewed].apply(lambda x: np.log(x+1))
        return X
In [38]:
# Again, creating a custom scaler (rather than using StandardScaler or MinMaxScaler) 
    # because I want to return a DataFrame
class CustomScaler(BaseEstimator,TransformerMixin):
    def __init__(self,method):
        self.method = method
        None
    def fit(self,X,y=None):
        return self
    def transform(self,X,y=None):
        if self.method == 'normalization':
            for column in X.columns:
                x_min = X[column].min()
                x_max = X[column].max()
                x_range = x_max-x_min
                X[column]=(X[column]-x_min)/x_range
        else:
            for column in X.columns:
                x_mean = X[column].mean()
                x_std = X[column].std()
                X[column] = (X[column]-x_mean)/x_std
        return X

Assembling the three-step numerical pipeline and a sample output below.

In [39]:
num_pipeline = Pipeline([
    ('imputer',CustomImputer('median')),
    ('transform_skew',SkewedDistTransformer(skewed)),
    ('scaler',CustomScaler('normalization'))
])
In [40]:
# Fitting our numerical data and then transforming it
num_pipeline_output = num_pipeline.fit_transform(num_training_data)
In [41]:
# Numerical Pipeline Output
    # Note capital-gain and loss features were transformed and all features were scaled down
num_pipeline_output.head()
Out[41]:
age education-num capital-gain capital-loss hours-per-week
0 0.301370 0.800000 0.667492 0.0 0.397959
1 0.452055 0.800000 0.000000 0.0 0.122449
2 0.287671 0.533333 0.000000 0.0 0.397959
3 0.493151 0.400000 0.000000 0.0 0.397959
4 0.150685 0.800000 0.000000 0.0 0.397959

After the transformation.

  • Notice the data is around 0 on the x-axis
  • Also note that the height of the histograms for capital-gain and loss are slightly different
    • This is a result of our logarithmic transformation
In [42]:
num_training_data.hist(figsize=(8,5),layout=(2,3))
plt.tight_layout()
plt.show()

Full Categorical and Numerical Data Pipeline

The ColumnTransformer allows you to combine both pipelines.

- It accepts a list of three-tuples with:
    1) Name of the pipeline (can be any arbitrary name)
    2) The pipeline object
    3) A list of the columns that you want it to transform
In [43]:
full_pipeline = ColumnTransformer([
    ("num",num_pipeline,num_features),
    ("cat",cat_pipeline,cat_features),
])
In [44]:
# Fitting and Transforming our full training set
full_pipeline_output = full_pipeline.fit_transform(full_training_data)
In [45]:
# Taking a look at the output
full_pipeline_output
Out[45]:
array([[0.30136986, 0.8       , 0.66749185, ..., 1.        , 0.        ,
        0.        ],
       [0.45205479, 0.8       , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.28767123, 0.53333333, 0.        , ..., 1.        , 0.        ,
        0.        ],
       ...,
       [0.28767123, 0.8       , 0.        , ..., 1.        , 0.        ,
        0.        ],
       [0.36986301, 0.8       , 0.74737487, ..., 1.        , 0.        ,
        0.        ],
       [0.24657534, 0.8       , 0.        , ..., 1.        , 0.        ,
        0.        ]])

The combined pipelines gave us a dense matrix that has the same rows and coumns as our original dataset.

Just to confirm all of this looks good, let's compare the rows and columns of our individual pipelines to our full pipeline output.

In [46]:
# Sanity Check - Here's our categorical data pipeline output
cat_pipeline_output.head(1)
Out[46]:
education_level workclass_ Federal-gov workclass_ Local-gov workclass_ Private workclass_ Self-emp-inc workclass_ Self-emp-not-inc workclass_ State-gov workclass_ Without-pay marital-status_ Divorced marital-status_ Married-AF-spouse ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.912551 0 0 0 0 0 1 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

1 rows × 83 columns

In [47]:
# Sanity Check - Here's our numerical data pipeline output
num_pipeline_output.head(1)
Out[47]:
age education-num capital-gain capital-loss hours-per-week
0 0.30137 0.8 0.667492 0.0 0.397959
In [48]:
# Concatenating our pipelines along the second axis (the columns)
num_cat_pipeline_combined = pd.concat([num_pipeline_output,cat_pipeline_output],axis=1)
In [49]:
# Sanity Check - Here's our combined data pipeline output
num_cat_pipeline_combined.head(1)
Out[49]:
age education-num capital-gain capital-loss hours-per-week education_level workclass_ Federal-gov workclass_ Local-gov workclass_ Private workclass_ Self-emp-inc ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 0.30137 0.8 0.667492 0.0 0.397959 0.912551 0 0 0 0 ... 0 0 0 0 0 0 0 1 0 0

1 rows × 88 columns

In [50]:
# Comparing our combined pipelines with our full pipeline output
num_cat_pipeline_combined==full_pipeline_output[:,:]
Out[50]:
age education-num capital-gain capital-loss hours-per-week education_level workclass_ Federal-gov workclass_ Local-gov workclass_ Private workclass_ Self-emp-inc ... native-country_ Portugal native-country_ Puerto-Rico native-country_ Scotland native-country_ South native-country_ Taiwan native-country_ Thailand native-country_ Trinadad&Tobago native-country_ United-States native-country_ Vietnam native-country_ Yugoslavia
0 True True True True True True True True True True ... True True True True True True True True True True
1 True True True True True True True True True True ... True True True True True True True True True True
2 True True True True True True True True True True ... True True True True True True True True True True
3 True True True True True True True True True True ... True True True True True True True True True True
4 True True True True True True True True True True ... True True True True True True True True True True
5 True True True True True True True True True True ... True True True True True True True True True True
6 True True True True True True True True True True ... True True True True True True True True True True
7 True True True True True True True True True True ... True True True True True True True True True True
8 True True True True True True True True True True ... True True True True True True True True True True
9 True True True True True True True True True True ... True True True True True True True True True True
10 True True True True True True True True True True ... True True True True True True True True True True
11 True True True True True True True True True True ... True True True True True True True True True True
12 True True True True True True True True True True ... True True True True True True True True True True
13 True True True True True True True True True True ... True True True True True True True True True True
14 True True True True True True True True True True ... True True True True True True True True True True
15 True True True True True True True True True True ... True True True True True True True True True True
16 True True True True True True True True True True ... True True True True True True True True True True
17 True True True True True True True True True True ... True True True True True True True True True True
18 True True True True True True True True True True ... True True True True True True True True True True
19 True True True True True True True True True True ... True True True True True True True True True True
20 True True True True True True True True True True ... True True True True True True True True True True
21 True True True True True True True True True True ... True True True True True True True True True True
22 True True True True True True True True True True ... True True True True True True True True True True
23 True True True True True True True True True True ... True True True True True True True True True True
24 True True True True True True True True True True ... True True True True True True True True True True
25 True True True True True True True True True True ... True True True True True True True True True True
26 True True True True True True True True True True ... True True True True True True True True True True
27 True True True True True True True True True True ... True True True True True True True True True True
28 True True True True True True True True True True ... True True True True True True True True True True
29 True True True True True True True True True True ... True True True True True True True True True True
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
45192 True True True True True True True True True True ... True True True True True True True True True True
45193 True True True True True True True True True True ... True True True True True True True True True True
45194 True True True True True True True True True True ... True True True True True True True True True True
45195 True True True True True True True True True True ... True True True True True True True True True True
45196 True True True True True True True True True True ... True True True True True True True True True True
45197 True True True True True True True True True True ... True True True True True True True True True True
45198 True True True True True True True True True True ... True True True True True True True True True True
45199 True True True True True True True True True True ... True True True True True True True True True True
45200 True True True True True True True True True True ... True True True True True True True True True True
45201 True True True True True True True True True True ... True True True True True True True True True True
45202 True True True True True True True True True True ... True True True True True True True True True True
45203 True True True True True True True True True True ... True True True True True True True True True True
45204 True True True True True True True True True True ... True True True True True True True True True True
45205 True True True True True True True True True True ... True True True True True True True True True True
45206 True True True True True True True True True True ... True True True True True True True True True True
45207 True True True True True True True True True True ... True True True True True True True True True True
45208 True True True True True True True True True True ... True True True True True True True True True True
45209 True True True True True True True True True True ... True True True True True True True True True True
45210 True True True True True True True True True True ... True True True True True True True True True True
45211 True True True True True True True True True True ... True True True True True True True True True True
45212 True True True True True True True True True True ... True True True True True True True True True True
45213 True True True True True True True True True True ... True True True True True True True True True True
45214 True True True True True True True True True True ... True True True True True True True True True True
45215 True True True True True True True True True True ... True True True True True True True True True True
45216 True True True True True True True True True True ... True True True True True True True True True True
45217 True True True True True True True True True True ... True True True True True True True True True True
45218 True True True True True True True True True True ... True True True True True True True True True True
45219 True True True True True True True True True True ... True True True True True True True True True True
45220 True True True True True True True True True True ... True True True True True True True True True True
45221 True True True True True True True True True True ... True True True True True True True True True True

45222 rows × 88 columns

Looks good! Let's move on.

In [51]:
final_training_data = full_pipeline_output

Model Selection

I'd like to narrow down my model search to 2 models, so I'll start by evaluating 5 models on the training data.

Model Selection

In [52]:
# Instantiating the 5 classifiers that I want to test.
    # Rather than using full default values, I've selected a few for the cross validation
rf_clf = RandomForestClassifier(n_estimators=100)
svc_clf = SVC(gamma='auto')
tree_clf = DecisionTreeClassifier(max_depth=15)
boost_clf = AdaBoostClassifier(n_estimators=100)
nb_clf = BernoulliNB()
In [55]:
def train_cv(learner, X_train, y_train,scoring_function,cv=5): 
    '''
    This function trains a model on training data, scores it, returns cross validation testing results.
    
    Inputs:
       - learner: the learning algorithm to be trained and predicted on
       - X_train: features training set
       - y_train: income training set
       - cv: number of folds used for cross validation testing
    
    Returns:
       - Dictionary containing cross validation testing results
    '''
    scorer = make_scorer(scoring_function)
    
    results = {'name':learner.__class__.__name__}
    # Timing time to train
    start = time()
    cv_scores = cross_val_score(learner,X_train,y=y_train,scoring=scorer,cv=cv)    
    end = time()
    
    # Storing results in the result dictionary
    results['scores'] = cv_scores
    results['average_score'] = cv_scores.mean()
    results['std_dev_score'] = cv_scores.std()
    results['total_time'] = end-start
    results['average_time'] = (end-start)/cv
    
    return results
In [56]:
classifiers = [rf_clf,svc_clf,tree_clf,boost_clf,nb_clf]

model_results = []
for clf in classifiers:
        model_results.append(train_cv(clf,final_training_data,training_labels,scoring_function=roc_auc_score))

Result Table

In [57]:
result_df = pd.DataFrame(model_results)
result_df.sort_values('average_score',ascending=False).drop(['scores'],axis=1)
Out[57]:
average_score average_time name std_dev_score total_time
3 0.783012 5.137789 AdaBoostClassifier 0.004695 25.688946
0 0.770614 4.424796 RandomForestClassifier 0.007787 22.123980
2 0.767556 0.353361 DecisionTreeClassifier 0.010492 1.766806
4 0.763671 0.264685 BernoulliNB 0.003098 1.323425
1 0.751991 92.014329 SVC 0.000994 460.071643

Looks like the AdaBoost and RandomForest classifiers performed the best on average (per "average_score" column).

Since they have the best average preliminary testing scores (based on AUC), I'll perform a hyperparameter search for both of these algorithms.

Model Tuning

I've decided to use RandomizedSearchCV to find good hyperparameter settings because it's been shown to result in similar hyperparameters as GridSearchCV, but with much faster run time.

Setting the parameters that I want to search...

In [58]:
boost_parameters = dict(n_estimators=list(np.arange(50,300,10)),
                   learning_rate=list(np.linspace(.5,2.5,20)),)
rf_parameters = dict(n_estimators=list(np.arange(50,300,10)),
                    criterion=['gini','entropy'],
                    min_samples_split=list(np.arange(2,500,5)),)

Instantiating the scorer object

In [59]:
scorer = make_scorer(roc_auc_score)

Instantiating the RandomizedSearchCV objects

In [60]:
boost_rscv_obj = RandomizedSearchCV(boost_clf,boost_parameters,n_iter=30,scoring=scorer,n_jobs=-1,cv=5)
In [61]:
rf_rscv_obj = RandomizedSearchCV(rf_clf,rf_parameters,n_iter=30,scoring=scorer,n_jobs=-1,cv=5)

Fitting the objects to our training data.

In [62]:
boost_rscv_fit = boost_rscv_obj.fit(final_training_data,training_labels)
rf_rscv_fit = rf_rscv_obj.fit(final_training_data,training_labels)
In [63]:
best_boost_clf = boost_rscv_fit.best_estimator_
best_rf_clf = rf_rscv_fit.best_estimator_
best_classifiers = [best_boost_clf,best_rf_clf]

best_model_results = []
for clf in best_classifiers:
        best_model_results.append(train_cv(clf,final_training_data,
                                           training_labels,scoring_function=roc_auc_score))
        
best_result_df = pd.DataFrame(best_model_results)
best_result_df.sort_values('average_score',ascending=False).drop(['scores'],axis=1)
Out[63]:
average_score average_time name std_dev_score total_time
0 0.792994 15.357944 AdaBoostClassifier 0.003176 76.789719
1 0.782356 1.598872 RandomForestClassifier 0.003883 7.994361
In [64]:
final_model = best_boost_clf
final_model
Out[64]:
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None,
                   learning_rate=1.3421052631578947, n_estimators=260,
                   random_state=None)

Final Model Evaluation

In [65]:
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')
In [66]:
testing_ids = testing_data['Unnamed: 0']
testing_data.drop('Unnamed: 0',axis=1,inplace=True)
In [67]:
# final_testing_data = full_pipeline.transform(testing_data)
final_testing_data = full_pipeline.transform(testing_data)
In [68]:
final_predictions = final_model.predict(final_testing_data)
# final_predictions = final_model.predict_proba(final_testing_data)
In [69]:
submission = pd.concat([testing_ids,pd.DataFrame(final_predictions)],axis=1)
In [70]:
submission.columns = ['id','income']
In [71]:
submission.head()
Out[71]:
id income
0 0 0
1 1 1
2 2 0
3 3 1
4 4 0
In [72]:
submission.to_csv('finding_donors_submission.csv',index=False)
In [73]:
pd.read_csv('finding_donors_submission.csv').head()
Out[73]:
id income
0 0 0
1 1 1
2 2 0
3 3 1
4 4 0

Reflection - Room for Improvement

  • Pipeline is messy and can be cleaned up
    • Transformers can be used to select categorical columns and numerical columns
    • Transformers can leverage existing library transformers like SimpleImputer and OneHotEncode
    • Once final model is selected, the final model can be worked into a final pipeline
      • Main benefit - GridSearchCV and RandomizedSearchCV can be used on model hyperparameters AND preprocessing hyperparameters if I run the search on the pipeline.
  • Function can be created to automate model hyperparameter tuning