Hello! Thanks for opening my notebook. In this notebook you will find data analysis and the use of shallow machine learning algorithms to predict possible donors for a fictional non-profit. Our "target" is people who make above $50,000 per year, because this group is most likely to be willing to donate. So, the goal of the algorithm is to predict income level based on the features that we will explore below.
Effective exploratory data analysis, evaluation of potentially useful algorithms, and hyperparameter tuning are all crucial tools that can improve analysis results and prediction accuracy.
In this notebook, you will find...
import numpy as np
import pandas as pd
from time import time
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder,StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator,TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.metrics import accuracy_score, roc_auc_score,make_scorer
from sklearn.model_selection import RandomizedSearchCV,cross_val_score
from sklearn.ensemble import RandomForestClassifier,AdaBoostClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import BernoulliNB
from sklearn.tree import DecisionTreeClassifier
import plotly.express as px
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
%matplotlib inline
from IPython.display import display
training_data = pd.read_csv('./udacity-mlcharity-competition/census.csv')
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')
explore_training = training_data.copy()
explore_training.head(1)
Looks like the training data is complete. We have a near-even split between our numerical features and our categorical features.
explore_training.info()
Our capital-gain and capital-loss features have some irregular distributions.
The modes of education-num and hours-per-week make sense.
explore_training.hist(figsize=(13,7),bins=20,layout=(2,3))
plt.tight_layout()
plt.show()
Capital-gain and capital-loss show a huge standard deviation that we should address later.
explore_training.describe()
Looking at linear correlations...
explore_training.corr()
I wondered if there could be a relationship between age and hours-per-week. In the graph below, it appears many people are working 100-hour work weeks. Let's look into it.
sns.jointplot(x='age',y='hours-per-week',data= explore_training)
From the histogram below, about 150 people report working 95-99 hours per week, which isn't a significant share of the dataset. I'm not sure these values are valid, but some people may very well be working 99-hour work weeks; since there are so few, I'll leave them untouched.
explore_training['hours-per-week'].sort_values(ascending=False)
px.histogram(explore_training,x='hours-per-week',nbins=25,width=1000,height=400)
Taking a look at the values within each categorical feature.
explore_training['workclass'].value_counts()
Exploring hours-per-week between working class and education level.
px.histogram(explore_training,x='workclass',y='hours-per-week',histfunc='avg',width=700,height=400)
Looks like what I'm interpreting as the highest levels of education ('Prof-school', 'Masters', 'Doctorate', and 'Bachelors') are also associated with the most hours worked in a week. These may be good indicators of level of income.
px.histogram(explore_training,x='education_level',y='hours-per-week',histfunc='avg',width=700,height=400)
explore_training['education_level'].value_counts()
explore_training['marital-status'].value_counts()
explore_training['occupation'].value_counts()
Strange - about 40% of our observations are labeled as "Husband"
explore_training['relationship'].value_counts()
p = sum(explore_training['relationship']==' Husband')/len(explore_training['relationship'])
print('Percent of our dataset that is labeled "Husband": %.3f'% p)
explore_training['race'].value_counts()
Looks like most of our dataset observations are Male as well - it would be good to check whether our testing data is distributed like this too.
explore_training['sex'].value_counts()
Yup.
testing_data['sex'].value_counts()
Native-country doesn't seem like a very strong feature given more than 90% of our observations are from the US.
p = sum(explore_training['native-country']==' United-States')/len(explore_training['native-country'])
print('Percent of our dataset that is labeled "United-States": %.3f'% p)
explore_training['native-country'].value_counts()
First, I'll separate our labels and map them to binary values.
label_dict = {'<=50K':0,'>50K':1}
training_labels = training_data['income'].map(label_dict)
training_data.drop('income',axis=1,inplace=True)
num_features = ['age','education-num','capital-gain','capital-loss','hours-per-week']
cat_features = ['workclass','education_level','marital-status','occupation',
'relationship','race','sex','native-country']
num_training_data=training_data[num_features].copy()
cat_training_data=training_data[cat_features].copy()
full_training_data = training_data.copy()
*PLEASE READ BEFORE CONTINUING*
Rather than cleaning my data step by step, note that I've built several transformers to practice 1) building transformers and 2) assembling them into a pipeline. The two following sections show me building one pipeline for categorical data, one pipeline for numerical data, then combining them into one full pipeline that uses both of the previous pipelines.
Using a custom imputer to fill NaNs with median or mode.
# Creating a custom imputer because SimpleImputer() returns an array.
# I want it to return a DataFrame because I'm not comfortable with Pipelines yet and I want to inspect it easily after.
class CustomImputer(BaseEstimator,TransformerMixin):
def __init__(self,method):
self.method = method
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
if self.method == 'median':
for column in X.columns:
X[column] = X[column].fillna(X[column].median())
elif self.method == 'mode':
for column in X.columns:
X[column] = X[column].fillna(X[column].mode()[0])
return X
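One caveat with CustomImputer, left as-is above: it computes its fill values inside transform, so when the pipeline is later applied to the test set it will impute with test-set statistics rather than training-set ones. A sketch of a variant that learns its fill values in fit instead (the StatefulImputer name and toy data are mine):

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin

class StatefulImputer(BaseEstimator, TransformerMixin):
    """Like CustomImputer, but learns its fill values during fit."""
    def __init__(self, method='median'):
        self.method = method

    def fit(self, X, y=None):
        # Learn one fill value per column from the data passed to fit
        if self.method == 'median':
            self.fill_values_ = X.median()
        else:  # 'mode'
            self.fill_values_ = X.mode().iloc[0]
        return self

    def transform(self, X, y=None):
        # Reuse the fitted fill values; copy so the caller's frame isn't mutated
        return X.copy().fillna(self.fill_values_)

train = pd.DataFrame({'a': [1.0, 2.0, None, 4.0]})
test = pd.DataFrame({'a': [None, 100.0]})
imp = StatefulImputer('median').fit(train)
print(imp.transform(test))  # the NaN is filled with 2.0, the *training* median
```

(Newer scikit-learn, 1.2+, also offers `SimpleImputer().set_output(transform='pandas')` to get DataFrames back directly.)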
Most of the categorical features are nominal (no inherent order), with the exception of education_level, which is ordinal. Here I'll manually map its ordered categories to the numerical values in the dictionary below, inside the EduLevelMapper custom transformer, which can also standardize or normalize the mapped values depending on its scale attribute.
# The ordered category mapping that is passed into the transformer below.
# Note this is subjective ordering and is not based on any metadata provided.
ordered_categories = {' Preschool':0,' 1st-4th':1,' 5th-6th':2,' 7th-8th':3,
' 9th':4,' 10th':5,' 11th':6,' 12th':7,' HS-grad':8,
' Assoc-acdm':9,' Assoc-voc':10,' Some-college':11,
' Bachelors':12,' Prof-school':13,' Masters':14,' Doctorate':15}
class EduLevelMapper(BaseEstimator,TransformerMixin):
def __init__(self,ordered_categories,scale=None):
self.ordered_categories = ordered_categories
self.scale=scale
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
# Mapping ordinal values
X['education_level'] = X['education_level'].map(self.ordered_categories)
# Normalizing or standardizing the values
if self.scale == 'normalization':
x_min = X['education_level'].min()
x_max = X['education_level'].max()
x_range = x_max-x_min
X['education_level']=(X['education_level']-x_min)/x_range
elif self.scale == 'standardization':
x_mean = X['education_level'].mean()
x_std = X['education_level'].std()
X['education_level'] = (X['education_level']-x_mean)/x_std
return X
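One thing to watch with Series.map here: any category missing from the dictionary silently becomes NaN, and the leading spaces in the census values make typos easy. A tiny illustration with a made-up subset of the mapping:

```python
import pandas as pd

# Subset of the ordered_categories mapping; note the leading spaces in the keys
ordered = {' HS-grad': 8, ' Bachelors': 12, ' Masters': 14}
s = pd.Series([' Bachelors', ' HS-grad', 'Bachelors'])  # last value is missing its leading space
mapped = s.map(ordered)
print(mapped)  # 12.0, 8.0, NaN -- the unmapped value silently becomes NaN
```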
For the rest of the categorical data, this transformer wraps pd.get_dummies to one-hot encode the remaining features. I selected pd.get_dummies over OneHotEncoder() because OneHotEncoder encodes ALL features, including the education_level column already encoded by EduLevelMapper (and I'm not sure how to force it to exclude a column), whereas pd.get_dummies only encodes the object-typed categorical columns and leaves the now-numeric education_level untouched.
class GetDummies(BaseEstimator,TransformerMixin):
    def __init__(self):
        pass
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
return pd.get_dummies(X)
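One pitfall with pd.get_dummies worth flagging (OneHotEncoder avoids it by remembering the categories seen during fit): if the test set is missing a training category, or contains an unseen one, the dummy columns won't line up with training. A common fix is reindexing the test columns to the training layout; a sketch with made-up values:

```python
import pandas as pd

train = pd.DataFrame({'workclass': ['Private', 'State-gov', 'Private']})
test = pd.DataFrame({'workclass': ['Private', 'Never-worked']})  # unseen category

train_d = pd.get_dummies(train)
test_d = pd.get_dummies(test)
# Align the test columns to the training layout:
# unseen dummy columns are dropped, missing ones are zero-filled
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)
print(list(test_d.columns) == list(train_d.columns))  # True
```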
Assembling the two-step categorical data pipeline, and an example of how it transforms the categorical data below.
# Note I selected "standardization" rather than "normalization"
cat_pipeline = Pipeline([
('imputer',CustomImputer('mode')),
('edu_mapper',EduLevelMapper(ordered_categories,'standardization')),
('get_dummies',GetDummies())])
# Fitting our categorical data and then transforming it
cat_pipeline_output = cat_pipeline.fit_transform(cat_training_data)
# Category Pipeline Output - note education_level is standardized
# and the rest of our categorical features are one-hot encoded
cat_pipeline_output.head()
Now moving on to our numerical data...
Before numerical pipeline transformations...
num_training_data.hist(figsize=(8,5),layout=(2,3))
plt.tight_layout()
plt.show()
# This is the custom transformer that will apply a logarithmic transformation to the skewed features
skewed = ['capital-gain','capital-loss']
class SkewedDistTransformer(BaseEstimator,TransformerMixin):
def __init__(self,skewed_features):
self.skewed_features = skewed_features
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
        X[self.skewed_features] = X[self.skewed_features].apply(lambda x: np.log(x + 1))
return X
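As a side note, the log(x + 1) lambda above is equivalent to np.log1p, and the transform really does shrink the skew; a quick check on made-up capital-gain-like values:

```python
import numpy as np
import pandas as pd

x = pd.Series([0, 10, 100, 10000, 99999])  # heavy right skew, like capital-gain
manual = x.apply(lambda v: np.log(v + 1))
print(np.allclose(manual, np.log1p(x)))             # True: identical transforms
print(float(x.skew()) > float(np.log1p(x).skew()))  # True: skew shrinks
```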
# Again, creating a custom scaler (rather than using StandardScaler or MinMaxScaler)
# because I want to return a DataFrame
class CustomScaler(BaseEstimator,TransformerMixin):
    def __init__(self,method):
        self.method = method
def fit(self,X,y=None):
return self
def transform(self,X,y=None):
if self.method == 'normalization':
for column in X.columns:
x_min = X[column].min()
x_max = X[column].max()
x_range = x_max-x_min
X[column]=(X[column]-x_min)/x_range
        else:  # standardization
for column in X.columns:
x_mean = X[column].mean()
x_std = X[column].std()
X[column] = (X[column]-x_mean)/x_std
return X
Assembling the three-step numerical pipeline and a sample output below.
num_pipeline = Pipeline([
('imputer',CustomImputer('median')),
('transform_skew',SkewedDistTransformer(skewed)),
('scaler',CustomScaler('normalization'))
])
# Fitting our numerical data and then transforming it
num_pipeline_output = num_pipeline.fit_transform(num_training_data)
# Numerical Pipeline Output
# Note capital-gain and loss features were transformed and all features were scaled down
num_pipeline_output.head()
After the transformation.
num_pipeline_output.hist(figsize=(8,5),layout=(2,3))
plt.tight_layout()
plt.show()
The ColumnTransformer allows you to combine both pipelines. It accepts a list of 3-tuples containing:
1) the name of the pipeline (can be any arbitrary name),
2) the pipeline object, and
3) a list of the columns that you want it to transform.
full_pipeline = ColumnTransformer([
("num",num_pipeline,num_features),
("cat",cat_pipeline,cat_features),
])
# Fitting and Transforming our full training set
full_pipeline_output = full_pipeline.fit_transform(full_training_data)
# Taking a look at the output
full_pipeline_output
The combined pipelines gave us a dense array with the same number of rows as our original dataset; the columns are the numerical features plus the one-hot-encoded categorical features, so there are more of them than in the raw data.
Just to confirm all of this looks good, let's compare the rows and columns of our individual pipelines to our full pipeline output.
# Sanity Check - Here's our categorical data pipeline output
cat_pipeline_output.head(1)
# Sanity Check - Here's our numerical data pipeline output
num_pipeline_output.head(1)
# Concatenating our pipelines along the second axis (the columns)
num_cat_pipeline_combined = pd.concat([num_pipeline_output,cat_pipeline_output],axis=1)
# Sanity Check - Here's our combined data pipeline output
num_cat_pipeline_combined.head(1)
# Comparing our combined pipelines with our full pipeline output
num_cat_pipeline_combined==full_pipeline_output[:,:]
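A caveat on the elementwise `==` check above: exact float equality is fragile, so np.allclose is the safer comparison when the two outputs come from separate computations. A toy illustration (the numbers are made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [0.1 + 0.2], 'b': [1.0]})  # stand-in for the concatenated output
arr = np.array([[0.3, 1.0]])                       # stand-in for the ColumnTransformer output
print((df.values == arr).all())     # False: 0.1 + 0.2 != 0.3 exactly
print(np.allclose(df.values, arr))  # True: equal within floating-point tolerance
```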
Looks good! Let's move on.
final_training_data = full_pipeline_output
I'd like to narrow down my model search to 2 models, so I'll start by evaluating 5 models on the training data.
# Instantiating the 5 classifiers that I want to test.
# Rather than using full default values, I've selected a few for the cross validation
rf_clf = RandomForestClassifier(n_estimators=100)
svc_clf = SVC(gamma='auto')
tree_clf = DecisionTreeClassifier(max_depth=15)
boost_clf = AdaBoostClassifier(n_estimators=100)
nb_clf = BernoulliNB()
def train_cv(learner, X_train, y_train,scoring_function,cv=5):
    '''
    Runs k-fold cross validation for a model and returns timing and scoring results.
    Inputs:
        - learner: the learning algorithm to be trained and evaluated
        - X_train: features training set
        - y_train: income labels training set
        - scoring_function: metric used to score each fold (e.g. roc_auc_score)
        - cv: number of folds used for cross validation
    Returns:
        - Dictionary containing cross validation results
    '''
scorer = make_scorer(scoring_function)
results = {'name':learner.__class__.__name__}
    # Timing the cross validation run
start = time()
cv_scores = cross_val_score(learner,X_train,y=y_train,scoring=scorer,cv=cv)
end = time()
# Storing results in the result dictionary
results['scores'] = cv_scores
results['average_score'] = cv_scores.mean()
results['std_dev_score'] = cv_scores.std()
results['total_time'] = end-start
results['average_time'] = (end-start)/cv
return results
classifiers = [rf_clf,svc_clf,tree_clf,boost_clf,nb_clf]
model_results = []
for clf in classifiers:
model_results.append(train_cv(clf,final_training_data,training_labels,scoring_function=roc_auc_score))
result_df = pd.DataFrame(model_results)
result_df.sort_values('average_score',ascending=False).drop(['scores'],axis=1)
Looks like the AdaBoost and RandomForest classifiers performed the best on average (per "average_score" column).
Since they have the best average preliminary testing scores (based on AUC), I'll perform a hyperparameter search for both of these algorithms.
I've decided to use RandomizedSearchCV to find good hyperparameter settings because it's been shown to find hyperparameters of similar quality to those from GridSearchCV, with a much faster run time.
Setting the parameters that I want to search...
boost_parameters = dict(n_estimators=list(np.arange(50,300,10)),
learning_rate=list(np.linspace(.5,2.5,20)),)
rf_parameters = dict(n_estimators=list(np.arange(50,300,10)),
criterion=['gini','entropy'],
min_samples_split=list(np.arange(2,500,5)),)
Instantiating the scorer object
scorer = make_scorer(roc_auc_score)
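Worth noting (not changed above): make_scorer(roc_auc_score) computes AUC from the hard 0/1 predictions, whereas ROC AUC is normally defined over continuous scores; scikit-learn's built-in 'roc_auc' scorer uses decision_function/predict_proba for that. A quick sketch of the difference on synthetic data:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import make_scorer, roc_auc_score
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=400, random_state=0)
clf = AdaBoostClassifier(n_estimators=20, random_state=0)

# Scores hard 0/1 predictions -- effectively the AUC of a single threshold
hard = cross_val_score(clf, X, y, scoring=make_scorer(roc_auc_score), cv=3)
# The built-in scorer uses continuous scores, the usual definition of AUC
soft = cross_val_score(clf, X, y, scoring='roc_auc', cv=3)
print(hard.mean(), soft.mean())  # the 'roc_auc' scorer is typically at least as high
```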
Instantiating the RandomizedSearchCV objects
boost_rscv_obj = RandomizedSearchCV(boost_clf,boost_parameters,n_iter=30,scoring=scorer,n_jobs=-1,cv=5)
rf_rscv_obj = RandomizedSearchCV(rf_clf,rf_parameters,n_iter=30,scoring=scorer,n_jobs=-1,cv=5)
Fitting the objects to our training data.
boost_rscv_fit = boost_rscv_obj.fit(final_training_data,training_labels)
rf_rscv_fit = rf_rscv_obj.fit(final_training_data,training_labels)
best_boost_clf = boost_rscv_fit.best_estimator_
best_rf_clf = rf_rscv_fit.best_estimator_
best_classifiers = [best_boost_clf,best_rf_clf]
best_model_results = []
for clf in best_classifiers:
best_model_results.append(train_cv(clf,final_training_data,
training_labels,scoring_function=roc_auc_score))
best_result_df = pd.DataFrame(best_model_results)
best_result_df.sort_values('average_score',ascending=False).drop(['scores'],axis=1)
final_model = best_boost_clf
final_model
testing_data = pd.read_csv('./udacity-mlcharity-competition/test_census.csv')
example_sub = pd.read_csv('./udacity-mlcharity-competition/example_submission.csv')
testing_ids = testing_data['Unnamed: 0']
testing_data.drop('Unnamed: 0',axis=1,inplace=True)
final_testing_data = full_pipeline.transform(testing_data)
final_predictions = final_model.predict(final_testing_data)
# final_predictions = final_model.predict_proba(final_testing_data)
submission = pd.concat([testing_ids,pd.DataFrame(final_predictions)],axis=1)
submission.columns = ['id','income']
submission.head()
submission.to_csv('finding_donors_submission.csv',index=False)
pd.read_csv('finding_donors_submission.csv').head()