Hello! Thanks for opening my notebook. In this notebook you will find data analysis and the use of shallow machine learning algorithms to predict possible donors for a fictional non-profit. Our "target" is people who make above $50,000 per year, because this group is most likely to be willing to donate. So, the goal of the algorithm is to predict income level based on the features that we will explore below.
Effective exploratory data analysis, evaluation of potentially useful algorithms, and hyperparameter tuning are all crucial tools that can improve analysis results and prediction accuracy.
In this notebook, you will find...
import numpy as np
import pandas as pd
from time import time
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, roc_auc_score,make_scorer
from sklearn.model_selection import RandomizedSearchCV,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
import plotly.express as px
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
%matplotlib inline
from IPython.display import display
training_data = pd.read_csv('./udacity-mlcharity-competition/census.csv')
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')
explore_training = training_data.copy()
explore_training.head(1)
Looks like the training data is complete. We have a near-even split between our numerical features and our categorical features.
explore_training.info()
Our capital-gain and capital-loss features have some irregular distributions.
The modes of education-num and hours-per-week make sense.
explore_training.hist(figsize=(13,7),bins=20,layout=(2,3))
plt.tight_layout()
plt.show()
Capital-gain and capital-loss show a huge standard deviation that we should address later.
explore_training.describe()
Looking at linear correlations...
explore_training.corr()
I wondered if there could be a relationship between age and hours-per-week. In the graph below, it appears many people are working 100-hour work weeks. Let's look into it.
sns.jointplot(x='age',y='hours-per-week',data= explore_training)
From the histogram below, about 150 people report working 95-99 hours per week, which isn't a significant share of the dataset. I'm not sure these values are valid, but some people may very well be working 99-hour work weeks; since there are so few, I'll leave them untouched.
explore_training['hours-per-week'].sort_values(ascending=False)
px.histogram(explore_training,x='hours-per-week',nbins=25,width=1000,height=400)
Taking a look at the values within each categorical feature.
explore_training['workclass'].value_counts()
Exploring hours-per-week between working class and education level.
px.histogram(explore_training,x='workclass',y='hours-per-week',histfunc='avg',width=700,height=400)
Looks like what I'm interpreting as the highest levels of education ('Prof-school', 'Masters', 'Doctorate', and 'Bachelors') are also associated with the most hours worked in a week. These may be good indicators of level of income.
px.histogram(explore_training,x='education_level',y='hours-per-week',histfunc='avg',width=700,height=400)
explore_training['education_level'].value_counts()
explore_training['marital-status'].value_counts()
explore_training['occupation'].value_counts()
Strange - about 40% of our observations are labeled as "Husband"
explore_training['relationship'].value_counts()
p = sum(explore_training['relationship']==' Husband')/len(explore_training['relationship'])
print('Percent of our dataset that is labeled "Husband": %.3f'% p)
explore_training['race'].value_counts()
Looks like most of our dataset observations are Male as well - it would be good to check whether our testing data is distributed like this too.
explore_training['sex'].value_counts()
Yup.
testing_data['sex'].value_counts()
Native-country doesn't seem like a very strong feature given more than 90% of our observations are from the US.
p = sum(explore_training['native-country']==' United-States')/len(explore_training['native-country'])
print('Percent of our dataset that is labeled "United-States": %.3f'% p)
explore_training['native-country'].value_counts()
First, I'll separate our labels and map them to binary values.
label_dict = {'<=50K':0,'>50K':1}
training_labels = training_data['income'].map(label_dict)
training_data.drop('income',axis=1,inplace=True)
num_features = ['age','education-num','capital-gain','capital-loss','hours-per-week']
cat_features = ['workclass','education_level','marital-status','occupation',
'relationship','race','sex','native-country']
num_training_data=training_data[num_features].copy()
cat_training_data=training_data[cat_features].copy()
full_training_data = training_data.copy()
*PLEASE READ BEFORE CONTINUING*
Rather than cleaning my data step by step, note that I've built several transformers to practice 1) building transformers and 2) assembling them into a pipeline. The two following sections show me building one pipeline for categorical data, one pipeline for numerical data, then combining them into one full pipeline that uses both of the previous pipelines.
Using a custom imputer to fill NaNs with median or mode.
# Creating a custom imputer because SimpleImputer() returns an array.
# I want it to return a DataFrame because I'm not comfortable with Pipelines yet and I want to inspect it easily after.
class CustomImputer(BaseEstimator,TransformerMixin):
def __init__(self,method):
self.method = method
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
if self.method == 'median':
for column in X.columns:
X[column] = X[column].fillna(X[column].median())
elif self.method == 'mode':
for column in X.columns:
X[column] = X[column].fillna(X[column].mode()[0])
return X
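One caveat with CustomImputer, left as-is above: it computes its fill values inside transform, so when the pipeline is later applied to the test set it will impute with test-set statistics rather than training-set ones. A sketch of a variant that learns its fill values in fit instead (the StatefulImputer name and toy data are mine):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class StatefulImputer(BaseEstimator, TransformerMixin):
    """Like CustomImputer, but learns its fill values during fit."""
    def __init__(self, method='median'):
        self.method = method

    def fit(self, X, y=None):
        # Learn one fill value per column from the data passed to fit
        if self.method == 'median':
            self.fill_values_ = X.median()
        else:  # 'mode'
            self.fill_values_ = X.mode().iloc[0]
        return self

    def transform(self, X, y=None):
        # Reuse the fitted fill values; copy so the caller's frame isn't mutated
        return X.copy().fillna(self.fill_values_)

train = pd.DataFrame({'a': [1.0, 2.0, None, 4.0]})
test = pd.DataFrame({'a': [None, 100.0]})
imp = StatefulImputer('median').fit(train)
print(imp.transform(test))  # the NaN is filled with 2.0, the *training* median
```

(Newer scikit-learn, 1.2+, also offers `SimpleImputer().set_output(transform='pandas')` to get DataFrames back directly.)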
Most of the categorical features are nominal (no inherent order), with the exception of education_level, which is ordinal. Here I'll manually map its ordered categories to the numerical values in the dictionary below, inside the EduLevelMapper custom transformer, which can also standardize or normalize the mapped values depending on its scale attribute.
# The ordered category mapping that is passed into the transformer below.
# Note this is subjective ordering and is not based on any metadata provided.
ordered_categories = {' Preschool':0,' 1st-4th':1,' 5th-6th':2,' 7th-8th':3,
' 9th':4,' 10th':5,' 11th':6,' 12th':7,' HS-grad':8,
' Assoc-acdm':9,' Assoc-voc':10,' Some-college':11,
' Bachelors':12,' Prof-school':13,' Masters':14,' Doctorate':15}
class EduLevelMapper(BaseEstimator,TransformerMixin):
def __init__(self,ordered_categories,scale=None):
self.ordered_categories = ordered_categories
self.scale=scale
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
# Mapping ordinal values
X['education_level'] = X['education_level'].map(self.ordered_categories)
# Normalizing or standardizing the values
if self.scale == 'normalization':
x_min = X['education_level'].min()
x_max = X['education_level'].max()
x_range = x_max-x_min
X['education_level']=(X['education_level']-x_min)/x_range
elif self.scale == 'standardization':
x_mean = X['education_level'].mean()
x_std = X['education_level'].std()
X['education_level'] = (X['education_level']-x_mean)/x_std
return X
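One thing to watch with Series.map here: any category missing from the dictionary silently becomes NaN, and the leading spaces in the census values make typos easy. A tiny illustration with a made-up subset of the mapping:

```python
import pandas as pd

# Subset of the ordered_categories mapping; note the leading spaces in the keys
ordered = {' HS-grad': 8, ' Bachelors': 12, ' Masters': 14}
s = pd.Series([' Bachelors', ' HS-grad', 'Bachelors'])  # last value is missing its leading space
mapped = s.map(ordered)
print(mapped)  # 12.0, 8.0, NaN -- the unmapped value silently becomes NaN
```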
For the rest of the categorical data, this transformer wraps pd.get_dummies to one-hot encode the remaining features. I selected pd.get_dummies over OneHotEncoder() because OneHotEncoder encodes ALL features, including the education_level column already encoded by EduLevelMapper (and I'm not sure how to force it to exclude a column), whereas pd.get_dummies only encodes the object-typed categorical columns and leaves the now-numeric education_level untouched.
class GetDummies(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
return pd.get_dummies(X)
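One pitfall with pd.get_dummies worth flagging (OneHotEncoder avoids it by remembering the categories seen during fit): if the test set is missing a training category, or contains an unseen one, the dummy columns won't line up with training. A common fix is reindexing the test columns to the training layout; a sketch with made-up values:

```python
import pandas as pd

train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
test = pd.DataFrame({'workclass': ['Private', 'Never-worked']})  # unseen category

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)
# Align the test columns to the training layout:
# unseen dummy columns are dropped, missing ones are zero-filled
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))  # True
```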
Assembling the two-step categorical data pipeline, and an example of how it transforms the categorical data below.
# Note I selected "standardization" rather than "normalization"
cat_pipeline = Pipeline([
('imputer',CustomImputer('mode')),
('edu_mapper',EduLevelMapper(ordered_categories,'standardization')),
('get_dummies',GetDummies())])
# Fitting our categorical data and then transforming it
cat_pipeline_output = cat_pipeline.fit_transform(cat_training_data)
# Category Pipeline Output - note education_level is standardized
# and the rest of our categorical features are one-hot encoded
cat_pipeline_output.head()
Now moving on to our numerical data...
Before numerical pipeline transformations...
num_training_data.hist(figsize=(8,5),layout=(2,3))
plt.tight_layout()
plt.show()
# This is the custom transformer that will apply a logarithmic transformation to the skewed features
skewed = ['capital-gain','capital-loss']
class SkewedDistTransformer(BaseEstimator,TransformerMixin):
def __init__(self,skewed_features):
self.skewed_features = skewed_features
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
        X[self.skewed_features] = X[self.skewed_features].apply(lambda x: np.log(x + 1))
return X
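As a side note, the log(x + 1) lambda above is equivalent to np.log1p, and the transform really does shrink the skew; a quick check on made-up capital-gain-like values:

```python
import numpy as np
import pandas as pd

x = pd.Series([0, 10, 100, 10000, 99999])  # heavy right skew, like capital-gain
manual = x.apply(lambda v: np.log(v + 1))
print(np.allclose(manual, np.log1p(x)))             # True: identical transforms
print(float(x.skew()) > float(np.log1p(x).skew()))  # True: skew shrinks
```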
# Again, creating a custom scaler (rather than using StandardScaler or MinMaxScaler)
# because I want to return a DataFrame
class CustomScaler(BaseEstimator,TransformerMixin):
    def __init__(self,method):
        self.method = method
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
if self.method == 'normalization':
for column in X.columns:
x_min = X[column].min()
x_max = X[column].max()
x_range = x_max-x_min
X[column]=(X[column]-x_min)/x_range
        else:  # standardization
for column in X.columns:
x_mean = X[column].mean()
x_std = X[column].std()
X[column] = (X[column]-x_mean)/x_std
return X
Assembling the three-step numerical pipeline and a sample output below.
num_pipeline = Pipeline([
('imputer',CustomImputer('median')),
('transform_skew',SkewedDistTransformer(skewed)),
('scaler',CustomScaler('normalization'))
])
# Fitting our numerical data and then transforming it
num_pipeline_output = num_pipeline.fit_transform(num_training_data)
# Numerical Pipeline Output
# Note capital-gain and loss features were transformed and all features were scaled down
num_pipeline_output.head()
After the transformation.
num_pipeline_output.hist(figsize=(8,5),layout=(2,3))
plt.tight_layout()
plt.show()
The ColumnTransformer allows you to combine both pipelines. It accepts a list of 3-tuples containing:
1) the name of the pipeline (can be any arbitrary name),
2) the pipeline object, and
3) a list of the columns that you want it to transform.
full_pipeline = ColumnTransformer([
("num",num_pipeline,num_features),
("cat",cat_pipeline,cat_features),
])
# Fitting and Transforming our full training set
full_pipeline_output = full_pipeline.fit_transform(full_training_data)
# Taking a look at the output
full_pipeline_output
The combined pipelines gave us a dense array with the same number of rows as our original dataset; the columns are the numerical features plus the one-hot-encoded categorical features, so there are more of them than in the raw data.
Just to confirm all of this looks good, let's compare the rows and columns of our individual pipelines to our full pipeline output.
# Sanity Check - Here's our categorical data pipeline output
cat_pipeline_output.head(1)
# Sanity Check - Here's our numerical data pipeline output
num_pipeline_output.head(1)
# Concatenating our pipelines along the second axis (the columns)
num_cat_pipeline_combined = pd.concat([num_pipeline_output,cat_pipeline_output],axis=1)
# Sanity Check - Here's our combined data pipeline output
num_cat_pipeline_combined.head(1)
# Comparing our combined pipelines with our full pipeline output
num_cat_pipeline_combined==full_pipeline_output[:,:]
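A caveat on the elementwise `==` check above: exact float equality is fragile, so np.allclose is the safer comparison when the two outputs come from separate computations. A toy illustration (the numbers are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.1 + 0.2], 'b': [1.0]})  # stand-in for the concatenated output
arr = np.array([[0.3, 1.0]])                       # stand-in for the ColumnTransformer output
print((df.values == arr).all())     # False: 0.1 + 0.2 != 0.3 exactly
print(np.allclose(df.values, arr))  # True: equal within floating-point tolerance
```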
Looks good! Let's move on.
final_training_data = full_pipeline_output
I'd like to narrow down my model search to 2 models, so I'll start by evaluating 5 models on the training data.
# Instantiating the 5 classifiers that I want to test.
# Rather than using full default values, I've selected a few for the cross validation
rf_clf = RandomForestClassifier(n_estimators=100)
svc_clf = SVC(gamma='auto')
tree_clf = DecisionTreeClassifier(max_depth=15)
boost_clf = AdaBoostClassifier(n_estimators=100)
nb_clf = BernoulliNB()
def train_cv(learner, X_train, y_train,scoring_function,cv=5):
    '''
    Runs k-fold cross validation for a model and returns timing and scoring results.
    Inputs:
        - learner: the learning algorithm to be trained and evaluated
        - X_train: features training set
        - y_train: income labels training set
        - scoring_function: metric used to score each fold (e.g. roc_auc_score)
        - cv: number of folds used for cross validation
    Returns:
        - Dictionary containing cross validation results
    '''
scorer = make_scorer(scoring_function)
results = {'name':learner.__class__.__name__}
    # Timing the cross validation run
start = time()
cv_scores = cross_val_score(learner,X_train,y=y_train,scoring=scorer,cv=cv)
end = time()
# Storing results in the result dictionary
results['scores'] = cv_scores
results['average_score'] = cv_scores.mean()
results['std_dev_score'] = cv_scores.std()
results['total_time'] = end-start
results['average_time'] = (end-start)/cv
return results
classifiers = [rf_clf,svc_clf,tree_clf,boost_clf,nb_clf]
model_results = []
for clf in classifiers:
model_results.append(train_cv(clf,final_training_data,training_labels,scoring_function=roc_auc_score))
result_df = pd.DataFrame(model_results)
result_df.sort_values('average_score',ascending=False).drop(['scores'],axis=1)
Looks like the AdaBoost and RandomForest classifiers performed the best on average (per "average_score" column).
Since they have the best average preliminary testing scores (based on AUC), I'll perform a hyperparameter search for both of these algorithms.
I've decided to use RandomizedSearchCV to find good hyperparameter settings because it's been shown to find hyperparameters of similar quality to those from GridSearchCV, with a much faster run time.
Setting the parameters that I want to search...
boost_parameters = dict(n_estimators=list(np.arange(50,300,10)),
learning_rate=list(np.linspace(.5,2.5,20)),)
rf_parameters = dict(n_estimators=list(np.arange(50,300,10)),
criterion=['gini','entropy'],
min_samples_split=list(np.arange(2,500,5)),)
Instantiating the scorer object
scorer = make_scorer(roc_auc_score)
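Worth noting (not changed above): make_scorer(roc_auc_score) computes AUC from the hard 0/1 predictions, whereas ROC AUC is normally defined over continuous scores; scikit-learn's built-in 'roc_auc' scorer uses decision_function/predict_proba for that. A quick sketch of the difference on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
clf = AdaBoostClassifier(n_estimators=20, random_state=0)

# Scores hard 0/1 predictions -- effectively the AUC of a single threshold
hard = cross_val_score(clf, X, y, scoring=make_scorer(roc_auc_score), cv=3)
# The built-in scorer uses continuous scores, the usual definition of AUC
soft = cross_val_score(clf, X, y, scoring='roc_auc', cv=3)
print(hard.mean(), soft.mean())  # the 'roc_auc' scorer is typically at least as high
```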
Instantiating the RandomizedSearchCV objects
boost_rscv_obj = RandomizedSearchCV(boost_clf,boost_parameters,n_iter=30,scoring=scorer,n_jobs=-1,cv=5)
rf_rscv_obj = RandomizedSearchCV(rf_clf,rf_parameters,n_iter=30,scoring=scorer,n_jobs=-1,cv=5)
Fitting the objects to our training data.
boost_rscv_fit = boost_rscv_obj.fit(final_training_data,training_labels)
rf_rscv_fit = rf_rscv_obj.fit(final_training_data,training_labels)
best_boost_clf = boost_rscv_fit.best_estimator_
best_rf_clf = rf_rscv_fit.best_estimator_
best_classifiers = [best_boost_clf,best_rf_clf]
best_model_results = []
for clf in best_classifiers:
best_model_results.append(train_cv(clf,final_training_data,
training_labels,scoring_function=roc_auc_score))
best_result_df = pd.DataFrame(best_model_results)
best_result_df.sort_values('average_score',ascending=False).drop(['scores'],axis=1)
final_model = best_boost_clf
final_model
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')
testing_ids = testing_data['Unnamed: 0']
testing_data.drop('Unnamed: 0',axis=1,inplace=True)
final_testing_data = full_pipeline.transform(testing_data)
final_predictions = final_model.predict(final_testing_data)
# final_predictions = final_model.predict_proba(final_testing_data)
submission = pd.concat([testing_ids,pd.DataFrame(final_predictions)],axis=1)
submission.columns = ['id','income']
submission.head()
submission.to_csv('finding_donors_submission.csv',index=False)
pd.read_csv('finding_donors_submission.csv').head()