Hello! Thanks for opening my notebook. Here you will find data analysis and shallow machine learning algorithms used to predict possible donors for a fictional non-profit. Our "target" is people who make more than $50,000 per year, because this group is the most likely to be willing to donate. So, the goal of the algorithm is to predict income level from the features that we will explore below.

Effective exploratory data analysis, evaluation of potentially useful algorithms, and hyperparameter tuning are all crucial tools that can improve analysis results and prediction accuracy.

In this notebook, you will find...

  1. Exploratory Data Analysis
  2. Data Preparation via Pipelines
  3. Initial model evaluation for multiple shallow ML algorithms
  4. Hyperparameter tuning using Randomized Search Cross Validation

Importing Libraries & Data

In [1]:
import numpy as np
import pandas as pd

from time import time

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, roc_auc_score, make_scorer
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
In [2]:
import plotly_express as px
import seaborn as sns

import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)

%matplotlib inline

from IPython.display import display
In [3]:
training_data = pd.read_csv('./udacity-mlcharity-competition/census.csv')
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')

Exploratory Data Analysis

In [4]:
explore_training = training_data.copy()
In [5]:
explore_training.head(1)
Out[5]:
age workclass education_level education-num marital-status occupation relationship race sex capital-gain capital-loss hours-per-week native-country income
0 39 State-gov Bachelors 13.0 Never-married Adm-clerical Not-in-family White Male 2174.0 0.0 40.0 United-States <=50K
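
Since the goal is to predict whether someone makes more than $50,000, the income column eventually has to become a 0/1 label. Here is a minimal sketch of one way to do that on a few made-up rows. The toy frame and the exact label string ">50K" are assumptions for illustration (only "<=50K" appears in the row above), and this is not necessarily how the rest of the notebook encodes the target.

```python
import pandas as pd

# Hypothetical toy rows; the real notebook loads its data from census.csv.
toy = pd.DataFrame({
    "age": [39, 50, 38, 53],
    "income": ["<=50K", ">50K", "<=50K", ">50K"],
})

# Assumption: the positive class is labelled exactly ">50K".
toy["target"] = (toy["income"] == ">50K").astype(int)

# Share of rows in the positive (likely donor) class.
positive_rate = toy["target"].mean()
print(toy)
print(f"positive rate: {positive_rate:.2f}")
```

Checking the positive rate early is a cheap way to see how imbalanced the two classes are before picking an evaluation metric.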

Looks like the training data is complete: every one of the 14 columns has 45,222 non-null values. Of the 13 features, 5 are numerical and 8 are categorical; the remaining column, income, is our target.

In [6]:
explore_training.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45222 entries, 0 to 45221
Data columns (total 14 columns):
age                45222 non-null int64
workclass          45222 non-null object
education_level    45222 non-null object
education-num      45222 non-null float64
marital-status     45222 non-null object
occupation         45222 non-null object
relationship       45222 non-null object
race               45222 non-null object
sex                45222 non-null object
capital-gain       45222 non-null float64
capital-loss       45222 non-null float64
hours-per-week     45222 non-null float64
native-country     45222 non-null object
income             45222 non-null object
dtypes: float64(4), int64(1), object(9)
memory usage: 4.8+ MB
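
The same completeness check can be done without reading the info() printout by eye. Below is a small sketch on a made-up frame that deliberately contains gaps, so the check has something to find. The toy data and the variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two deliberate gaps.
toy = pd.DataFrame({
    "age": [39, 50, 38],
    "workclass": ["State-gov", None, "Private"],
    "capital-gain": [2174.0, 0.0, np.nan],
})

# Number of missing values in each column.
missing_per_column = toy.isna().sum()

# True only if no cell in the frame is missing.
is_complete = not toy.isna().any().any()

# Split the column names by dtype.
numeric_cols = toy.select_dtypes(include="number").columns.tolist()
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()

print(missing_per_column)
print("complete:", is_complete)
print("numeric:", numeric_cols, "| categorical:", categorical_cols)
```

On the real training data, the info() output above implies that every per-column count would be 0.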

Numerical Data

Our capital-gain and capital-loss features are heavily right-skewed: most values are 0, with a long tail of large values.

The modes of education-num and hours-per-week make sense.

  • For education-num, the largest mode is high-school graduates, followed by some college, then a completed bachelor's degree (according to the 'education_level' feature explored below)
  • For hours-per-week, most people work a standard 40-hour work week
In [7]:
explore_training.hist(figsize=(13,7),bins=20,layout=(2,3))
plt.tight_layout()
plt.show()

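Capital-gain and capital-loss have standard deviations several times larger than their means (about 7506 vs. 1101 and 405 vs. 89), and at least 75% of their values are 0. We should address this during data preparation.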

In [8]:
explore_training.describe()
Out[8]:
age education-num capital-gain capital-loss hours-per-week
count 45222.000000 45222.000000 45222.000000 45222.000000 45222.000000
mean 38.547941 10.118460 1101.430344 88.595418 40.938017
std 13.217870 2.552881 7506.430084 404.956092 12.007508
min 17.000000 1.000000 0.000000 0.000000 1.000000
25% 28.000000 9.000000 0.000000 0.000000 40.000000
50% 37.000000 10.000000 0.000000 0.000000 40.000000
75% 47.000000 13.000000 0.000000 0.000000 45.000000
max 90.000000 16.000000 99999.000000 4356.000000 99.000000
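
One common way to tame this kind of skew is a log transform. The sketch below applies np.log1p to a handful of made-up values shaped like capital-gain (mostly zeros plus a few large values). Treat it as an illustration of the idea under that assumption, not as the transformation this notebook necessarily uses later.

```python
import numpy as np
import pandas as pd

# Hypothetical values mimicking capital-gain: mostly zeros, a few large values.
gains = pd.Series([0.0, 0.0, 0.0, 0.0, 2174.0, 99999.0], name="capital-gain")

# log1p computes log(1 + x), so zeros stay at zero instead of becoming -inf.
log_gains = np.log1p(gains)

print("std before:", gains.std())
print("std after: ", log_gains.std())

# expm1 is the inverse of log1p, so the original values can be recovered.
recovered = np.expm1(log_gains)
```

The transform compresses the long right tail while leaving the ordering of the values unchanged.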

Looking at linear correlations...

  • None of the quantitative features are strongly correlated with one another. The largest coefficient is between hours-per-week and education-num (0.1462), which is still a weak relationship.
In [9]:
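# Restrict to numeric columns; recent pandas versions raise an error on object columns
explore_training.select_dtypes(include='number').corr()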
Out[9]:
age education-num capital-gain capital-loss hours-per-week
age 1.000000 0.037623 0.079683 0.059351 0.101992
education-num 0.037623 1.000000 0.126907 0.081711 0.146206
capital-gain 0.079683 0.126907 1.000000 -0.032102 0.083880
capital-loss 0.059351 0.081711 -0.032102 1.000000 0.054195
hours-per-week 0.101992 0.146206 0.083880 0.054195 1.000000
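
Rather than scanning the matrix by eye, the strongest pair can be pulled out programmatically. Here is a rough sketch on synthetic data, where column "b" is built to correlate with "a". The column names and the data are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
a = rng.normal(size=n)

# Hypothetical toy frame: "b" depends on "a", "c" is independent noise.
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.5, size=n),
    "c": rng.normal(size=n),
})

corr = toy.corr()

# Keep only the upper triangle; k=1 excludes the diagonal of 1.0 values.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
pairs = corr.where(mask).stack().dropna()

# Pair with the largest absolute Pearson coefficient.
strongest_pair = pairs.abs().idxmax()
strongest_value = pairs.loc[strongest_pair]
print(strongest_pair, round(strongest_value, 4))
```

Keep in mind that Pearson correlation only measures linear relationships, so a low coefficient does not rule out a nonlinear one.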

I wondered whether there could be a relationship between age and hours-per-week. The plot below suggests that many people are working close to 100 hours per week. Let's look into it.

In [10]:
sns.jointplot(x='age', y='hours-per-week', data=explore_training)
Out[10]:
<seaborn.axisgrid.JointGrid at 0x1a24f31e10>

From the histogram below, 150 people report working between 95 and 99 hours per week, which is only about 0.3% of the 45,222 rows. These values don't feel accurate, but I can't rule them out, since some people may very well be working 99-hour weeks. I'll leave them untouched since there are so few.

In [11]:
explore_training['hours-per-week'].sort_values(ascending=False)
Out[11]:
31683    99.0
4979     99.0
12603    99.0
24643    99.0
11831    99.0
20676    99.0
14062    99.0
7444     99.0
21058    99.0
27793    99.0
22082    99.0
38656    99.0
32079    99.0
7991     99.0
32642    99.0
23494    99.0
3285     99.0
36796    99.0
24682    99.0
40828    99.0
39640    99.0
28497    99.0
12548    99.0
17661    99.0
34993    99.0
21526    99.0
18288    99.0
27839    99.0
41721    99.0
41379    99.0
         ... 
36169     2.0
27992     2.0
35564     2.0
18006     2.0
145       2.0
10018     2.0
17933     2.0
12951     2.0
7779      2.0
43051     2.0
36406     2.0
42691     2.0
13559     2.0
31314     2.0
32806     2.0
21544     2.0
5680      2.0
16342     2.0
175       1.0
10583     1.0
38556     1.0
19372     1.0
39810     1.0
23240     1.0
40494     1.0
21277     1.0
18307     1.0
30967     1.0
36702     1.0
22508     1.0
Name: hours-per-week, Length: 45222, dtype: float64
In [12]:
px.histogram(explore_training,x='hours-per-week',nbins=25,width=1000,height=400)
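
The count of extreme values can also be computed directly instead of being read off the histogram bars. Here is a small sketch on made-up hours. The values and the 95 to 99 cutoff are for illustration only.

```python
import pandas as pd

# Hypothetical sample of weekly hours.
hours = pd.Series(
    [40.0, 40.0, 45.0, 99.0, 20.0, 96.0, 40.0, 60.0],
    name="hours-per-week",
)

# between() is inclusive on both ends by default.
extreme = hours.between(95, 99)

n_extreme = int(extreme.sum())
share_extreme = n_extreme / len(hours)
print(f"{n_extreme} of {len(hours)} rows ({share_extreme:.1%}) fall in the 95-99 range")
```

Applied to explore_training['hours-per-week'], the same mask would give the exact number of rows in that range.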