Airbnb is the largest hospitality company in the world, yet it doesn't own any of the lodging it provides to customers! It has taken the world by storm and offers many different kinds of lodging options to people everywhere.
How is this made possible? By harnessing data.
Airbnb uses data to determine which listings to show to customers, how to price its listings, which Airbnb experiences users are most likely to book, and much more.
I'd like to understand what kinds of things we can draw from listings data.
Let's see what we can find.
The datasets can be found on Inside Airbnb.
These datasets were scraped by Inside Airbnb from the Airbnb website. They are a snapshot of listings from San Francisco, CA; New York, NY; and Austin, TX from earlier this year.
Below, I will import all of the libraries I will use, understand how this data is organized, clean it, and explore it.
Let's see if we can answer the following questions...
- Is there a difference in quality of service/product between hosts with different numbers of units?
- Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
- What are the strongest predictors for price?
from sklearn.feature_selection import RFECV, f_classif, SelectPercentile
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from category_encoders import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must come before the IterativeImputer import)
from sklearn.impute import IterativeImputer
import plotly.graph_objects as go
import plotly_express as px
import seaborn as sns
import datetime
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, t
import scipy.stats as stats
import math
import collections
import matplotlib.pyplot as plt
%matplotlib inline
sf = pd.read_csv('sf_listings.csv')
ny = pd.read_csv('ny_listings.csv')
aus = pd.read_csv('aus_listings.csv')
pd.options.display.max_rows = 106
sf.head(2).transpose()
I skimmed through the features to identify the columns I don't need.
drop = ['listing_url', 'scrape_id', 'last_scraped', 'summary', 'space', 'description', 'neighborhood_overview',
'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url',
'xl_picture_url', 'host_id', 'host_url', 'host_about', 'host_thumbnail_url', 'host_picture_url',
'calendar_updated', 'calendar_last_scraped', 'license', 'name', 'host_name', 'zipcode', 'id']
# Dropping all the columns I don't want
sf = sf.drop(drop, axis=1)
ny = ny.drop(drop, axis=1)
aus = aus.drop(drop, axis=1)
df = pd.concat([sf, ny, aus], ignore_index=True)
# Sanity checking rows all came in
df.shape[0] == sf.shape[0]+ny.shape[0]+aus.shape[0]
Here we need to clean the data enough for Exploratory Data Analysis.
df.shape
df = df.drop_duplicates()
df.shape
1 duplicate dropped.
cat_df = df.select_dtypes(include=['object', 'bool'])
cat_cols = df.select_dtypes(include=['object', 'bool']).columns.values
num_df = df.select_dtypes(include=['int', 'float'])
num_cols = df.select_dtypes(include=['int', 'float']).columns.values
# Sanity check that we have all our features accounted for
len(num_cols)+len(cat_cols) == df.shape[1]
So now we need to account for all of the following data types...
num_df.head().transpose()
From what I can tell, all the numerical columns look good except `host_acceptance_rate` and `square_feet`, which have missing values in the first few rows. We'll take a look at NA values later.
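As a quick preview (a minimal sketch, reusing the `num_df` defined above), we can check which numerical columns have the most missing values:
# Proportion of missing values per numerical column, largest first
num_df.isna().mean().sort_values(ascending=False).head(10)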
cat_df.head().transpose()
We have a lot more work to do for these categorical columns.
cat_to_num = ['host_response_rate', 'price', 'weekly_price',
'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']
prices = ['price', 'weekly_price', 'monthly_price',
'security_deposit', 'cleaning_fee', 'extra_people']
cat_to_num_df = cat_df[cat_to_num]
cat_to_num_df.head(2)
# Converting the response rate (e.g. '95%') to a float
cat_to_num_df['host_response_rate'] = cat_to_num_df['host_response_rate'] \
    .str.replace('%', '', regex=False).astype(float)
# Removing commas and dollar signs, then converting to float
for col in prices:
    cat_to_num_df[col] = (cat_to_num_df[col]
                          .str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(float))
cat_to_num_df.head(1)
cat_to_num_df.describe().head(3)
These columns look good now.
Appending the `cat_to_num_df` to the `num_df`.
num_df = pd.concat([num_df, cat_to_num_df], axis=1)
num_cols = num_df.columns.values
cat_df = cat_df.drop(cat_to_num, axis=1)
cat_cols = cat_df.columns.values
bi_cols = []
for col in cat_df.columns:
    if cat_df[col].nunique() == 2:
        bi_cols.append(col)
cat_df[bi_cols].head()
for col in bi_cols:
    cat_df[col] = cat_df[col].map({'f': 0, 't': 1})
I took a look at these columns since they looked redundant.
redundant_cols = ['city', 'market',
'jurisdiction_names', 'host_location', 'street']
Removing the following features:
- `market`: `smart_location` is a more detailed feature
- `city`: redundant to `smart_location`
- `jurisdiction_names`: redundant to `smart_location`
- `host_location`: 1,862 unique values; too many to be useful
- `street`: redundant to `smart_location`
cat_df = cat_df.drop(redundant_cols, axis=1)
cat_df.head()
def striplist(l):
    return [x.strip() for x in l]
cat_df['host_verifications'] = (cat_df['host_verifications']
                                .str.replace('[', '', regex=False)
                                .str.replace(']', '', regex=False)
                                .str.replace("'", '', regex=False)
                                .str.lower().str.split(',').apply(striplist))
cat_df['amenities'] = (cat_df['amenities']
                       .str.replace('{', '', regex=False)
                       .str.replace('}', '', regex=False)
                       .str.replace('"', '', regex=False)
                       .str.lower().str.split(',').apply(striplist))
Binarizing the lists.
mlb = MultiLabelBinarizer()
amenities_df = pd.DataFrame(mlb.fit_transform(
cat_df['amenities']), columns=mlb.classes_, index=cat_df.index)
mlb = MultiLabelBinarizer()
host_verif_df = pd.DataFrame(mlb.fit_transform(
cat_df['host_verifications']), columns=mlb.classes_, index=cat_df.index)
host_verif_df = host_verif_df.drop(axis=1, labels=[''])
amenities_df = amenities_df.drop(axis=1, labels=[''])
cat_df = cat_df.drop(['host_verifications', 'amenities'], axis=1)
Concatenating the DataFrames together.
cat_df = pd.concat([cat_df, amenities_df, host_verif_df], axis=1)
cat_cols = cat_df.columns.values
Converting the date features to "days since" features so we have numerical values to work with.
today = pd.to_datetime('today')
dt_cols = ['host_since', 'first_review', 'last_review']
cat_df[dt_cols].head(1)
# Converting to datetime data type
for col in dt_cols:
    cat_df[col] = pd.to_datetime(cat_df[col], infer_datetime_format=True)
# Creating "days since" features for each date column
for col in dt_cols:
    num_df[col + '_days'] = (today - cat_df[col]).apply(lambda x: x.days)
# Creating a discrete feature based on how recent the last review was
bins = [0, 90, 180, 365, 730, np.inf]
labels = ['last 90 days', 'last 180 days',
'last year', 'last 2 years', 'more than 2 years']
cat_df['last_review_discrete'] = pd.cut(
num_df['last_review_days'], bins=bins, labels=labels)
# Dropping the datetime columns since I don't need them anymore
cat_df = cat_df.drop(dt_cols, axis=1)
cat_cols = cat_df.columns.values
num_cols = num_df.columns.values
Time to explore the data. Let's see if we can answer any of our previously mentioned exploratory questions below.
# Sanity checking before I rejoin the DFs
num_df.shape[0] == cat_df.shape[0]
cleaned_df = pd.concat([num_df, cat_df], axis=1)
cleaned_df.head()
Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than those who do not?)
# Making a copy of the DataFrame just for exploring
eda = cleaned_df.copy()
Let's take a quick look at the distribution of `host_total_listings_count`.
plt.figure()
plt.title('Distribution of No. host listings.')
plt.hist(eda['host_total_listings_count'],bins=30)
plt.show()
We have some huge outliers here, with the majority of values less than 100. Let's clean this up.
eda['host_total_listings_count'].value_counts(
).sort_index(ascending=False).head(20)
So we have a significant number of hosts with hundreds of listings. Some of these could be errors, because it's difficult to imagine people owning over a thousand Airbnb listings. In a work setting, this would be worth following up on. However, I'm going to discretize these into bins to make these groups easier to compare.
I'll make a categorical column to distinguish between hosts. Here are the bins I'm going to use...
bins = [0,1,10,50,100,np.inf]
labels = ['1','2-10','11-50','51-100','101+']
eda['cat_num_units'] = pd.cut(eda['host_total_listings_count'],bins=bins,labels=labels)
Now plotting the bins...
plt.figure()
sns.countplot(x='cat_num_units',data=eda)
plt.title('No. Host Units Distribution')
plt.show()
Let's compare these categories on their ratings.
rating_cols = ['review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin',
'review_scores_communication', 'review_scores_location', 'review_scores_value']
def highlight_max(s):
    '''
    highlight the maximum in a Series lightgreen.
    '''
    is_max = s == s.max()
    return ['background-color: lightgreen' if v else '' for v in is_max]
def highlight_min(s):
    '''
    highlight the minimum in a Series red.
    '''
    is_min = s == s.min()
    return ['background-color: red' if v else '' for v in is_min]
eda.groupby('cat_num_units')[rating_cols].mean().transpose().style. \
    apply(highlight_max, axis=1).apply(highlight_min, axis=1)
Hosts with only 1 unit appear to have the highest average rating and scores. Let's look at their boxplot summaries side-by-side to compare.
plt.figure(1, figsize=(15, 13))
i = 1
for col in rating_cols:
    plt.subplot(3, 2, i)
    sns.boxplot(x='cat_num_units', y=col, data=eda)
    title = col.replace('_', ' ').title()
    plt.title(title)
    plt.ylabel(None)
    plt.xlabel(None)
    i += 1
plt.tight_layout()
plt.show()
So these plots tell me that hosts with only 1 unit have tighter and higher distributions, suggesting that these hosts are rated slightly better than hosts with more units. However, it's worth noting that the distributions overlap quite a bit, so the difference may be small; let's check it with a formal test.
Let's perform a simple hypothesis t-test with a 5% significance level to check whether the average `review_scores_rating` for hosts with 1 unit is greater than that of all other hosts.
def plot_t_dist_hyp(df, t_stat, p_val):
    """
    After a t-test, plot the t-distribution with the t-statistic and p-value annotated via Plotly Express.
    Dependencies: scipy.stats.t | plotly_express as px | plotly.graph_objects as go
    Inputs
    ------
    df: degrees of freedom
    t_stat: t-statistic
    p_val: p-value
    Returns
    ------
    Plotly Express Figure
    """
    rv = t(df=df)
    x_t = np.linspace(rv.ppf(0.00001), rv.ppf(0.99999), 100)
    y_t = rv.pdf(x_t)
    t_df = pd.DataFrame.from_dict(data={'x_t': x_t, 'y_t': y_t})
    fig = px.line(t_df, 'x_t', 'y_t', width=600, height=300,
                  labels={'x_t': 'X', 'y_t': ''},)
    fig.update_traces(textposition='top center')
    fig.update(layout=dict(title='Hypothesis Test Results', annotations=[{
        "x": t_stat,
        "y": rv.pdf(t_stat),
        'font': {'family': 'Arial', 'size': 12, 'color': 'black'},
        "showarrow": True,
        "text": f'P-value<br>{p_val:.2e}',
        "arrowhead": 7,
        "ay": -40
    }]))
    return fig
First, let's check our conditions for hypothesis testing using the Central Limit Theorem: the two groups are independent of each other, the observations within each group can reasonably be treated as independent, and both sample sizes are far larger than 30, so the sampling distributions of the group means are approximately normal.
Experiment Details
$H_0: \mu_{\text{1 unit}} = \mu_{\text{all others}}$
$H_A: \mu_{\text{1 unit}} > \mu_{\text{all others}}$
$\alpha = 0.05$
# Creating sampling arrays
one_unit_samp = eda[eda['cat_num_units'] == '1']['review_scores_rating']
all_others_samp = eda[eda['cat_num_units'] != '1']['review_scores_rating']
# Calculating Variance
one_unit_var = one_unit_samp.var()
all_others_var = all_others_samp.var()
# Variances are nearly the same
print(one_unit_var)
print(all_others_var)
# Two-sample t-test (ttest_ind reports a two-sided p-value; since the
# t-statistic is positive, the one-sided p-value is half of that)
t_stat, p_val = ttest_ind(
    a=one_unit_samp, b=all_others_samp, equal_var=True, nan_policy='omit')
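As a sanity check (a minimal sketch reusing `one_unit_samp` and `all_others_samp` from above), the pooled two-sample t-statistic that `ttest_ind` computes with `equal_var=True` can be reproduced by hand:
a = one_unit_samp.dropna()
b = all_others_samp.dropna()
n1, n2 = len(a), len(b)
# Pooled variance across both samples
sp2 = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (a.mean() - b.mean()) / np.sqrt(sp2 * (1 / n1 + 1 / n2))
print(t_manual)  # should closely match t_stat above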
# Plotting a t-distribution with the calculated t-statistic annotated
# (using a conservative choice of degrees of freedom for the plot)
df = min([len(one_unit_samp) - 1, len(all_others_samp) - 1])
plot_t_dist_hyp(df, t_stat, p_val)
Results
$t \approx 17.87$
$p \approx 0$
Thus, after performing a t-test comparing the ratings of hosts with only 1 unit to those of all other hosts, we reject our null hypothesis (that the average ratings of the two groups are the same) in favor of the alternative ($H_A$): there is convincing evidence that hosts with only one unit are rated higher than hosts with more than one unit.
Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
eda[['last_review_discrete', 'last_review_days']]
eda[['property_type', 'room_type']]
eda['price']
# Plotting a facet grid to compare box plots for each kind of room type. Also filtered out outlier prices (>500)
fgrid = sns.FacetGrid(eda[eda['price'] <= 500], col='room_type', height=5,)
fgrid.map(sns.boxplot, 'last_review_discrete', 'price', 'host_is_superhost')
for ax in fgrid.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)
    ax.set(xlabel=None, ylabel=None)
l = plt.legend(loc='upper right')
l.get_texts()[0].set_text('Is not Superhost')
l.get_texts()[1].set_text('Is Superhost')
fgrid.fig.tight_layout(w_pad=1)
Looking at the faceted plots above, it seems that units that haven't been reviewed in a long time are priced slightly higher than units with more recent reviews. It also appears that superhosts' pricing (dark blue) is higher than non-superhosts' (light blue), suggesting that hosts who are verified as superhosts (top-rated, experienced hosts) price their units higher than those who are not.
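To put rough numbers on that visual impression, here is a minimal sketch (reusing the `eda` DataFrame and the same $500 price cap as above) comparing median prices across review-recency bins and superhost status:
# Median price by review recency and superhost status (0 = not superhost, 1 = superhost)
(eda[eda['price'] <= 500]
 .groupby(['last_review_discrete', 'host_is_superhost'])['price']
 .median()
 .unstack())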
What are the strongest predictors for price?
This question requires deeper analysis using machine learning techniques. Let's revisit this later.
We need to prepare the data so we can pass it through our ML algorithms. This entails handling missing values, encoding the categorical features, standardizing everything, and reducing dimensionality.
cleaned_df = cleaned_df.drop(['last_review_discrete'], axis=1)
# Calculating proportion of NA values in each column
prop_na = cleaned_df.isna().sum()/len(cleaned_df)
prop_na_05 = prop_na[prop_na > .05]
prop_na_05 = prop_na_05.sort_values(0, ascending=True).reset_index()
Plotting the proportion of NA values for all columns where more than 5% is missing. The seven columns in red will be removed since they are missing > 30% of their values.
plt.figure(figsize=(9, 7))
barh = plt.barh(prop_na_05['index'], prop_na_05[0], alpha=0.85, color='green')
# Color the last seven bars (the columns missing > 30%) dark red
for i in range(1, 8):
    barh[-i].set_color('darkred')
plt.title('Proportion of NA Values')
plt.vlines(x=.3, ymin=0, ymax=20, color='red', linestyles='dashed')
plt.xticks(np.arange(.1, 1.01, .1))
plt.tight_layout()
drop_na_cols = ['host_acceptance_rate', 'square_feet', 'monthly_price', 'weekly_price',
'security_deposit', 'host_response_rate', 'host_response_time']
cleaned_df = cleaned_df.drop(drop_na_cols, axis=1)
Taking a look at `neighbourhood_group_cleansed` as well.
cleaned_df['neighbourhood_group_cleansed'].value_counts()
Looks like it is only relevant to the New York dataset, so I think it's safe to drop.
cleaned_df = cleaned_df.drop('neighbourhood_group_cleansed',axis=1)
cleaned_df['sum_na_row'] = cleaned_df.isna().sum(axis=1)
plt.figure()
sns.distplot(cleaned_df['sum_na_row'], bins=15, kde=False)
plt.xticks(np.arange(0, 21, 5))
plt.show()
Let's take a look at the rows missing 10 or more values.
temp = cleaned_df[cleaned_df['sum_na_row'] >= 10].isna().any().reset_index()
na_cols = temp[temp[0] == True]['index'].values
cleaned_df[cleaned_df['sum_na_row'] >= 10][na_cols].transpose()
NA values related to reviews are most likely missing because these units do not have any reviews yet. This could be useful information, so I'll encode these values as 0 so they're distinguishable from the values that we do have.
Although 0 may be misleading, I believe filling with 0 is better than simply removing the rows or imputing based on other values, since it maintains the distinction from units that do have reviews.
review_cols = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
'review_scores_value', 'reviews_per_month', 'first_review_days', 'last_review_days']
cleaned_df[review_cols].head()
cleaned_df[review_cols] = cleaned_df[review_cols].fillna(value=0)
It doesn't really make sense to impute the `host_neighbourhood` or `neighbourhood` features, so I'll drop the rows that are missing these values.
na_neigh = cleaned_df[(cleaned_df['host_neighbourhood'].isna()) | (
cleaned_df['neighbourhood'].isna())].index
print(
f'Dropping {na_neigh.shape[0]} rows due to missing values in neighborhood columns.')
cleaned_df = cleaned_df.drop(na_neigh)
State
cleaned_df[cleaned_df['state'].isna()][['state', 'neighbourhood',
'smart_location']]
The first three are NY, and the last one is TX.
update_state = cleaned_df[cleaned_df['state'].isna()].index
# Filling in the missing states with .at (chained indexing would not modify cleaned_df)
for i in update_state[:3]:
    cleaned_df.at[i, 'state'] = 'NY'
cleaned_df.at[update_state[-1], 'state'] = 'TX'
These features have some very common modes that I will use to impute their values.
bbb = ['bathrooms', 'bedrooms', 'beds']
plt.figure(figsize=(13, 4))
i = 1
for col in bbb:
    plt.subplot(1, 3, i)
    sns.countplot(x=col, data=cleaned_df)
    plt.title(col + ' Distribution')
    plt.xticks(np.arange(0, 10, 1, dtype=np.int64), rotation=30)
    plt.xlim(right=10)
    i += 1
plt.tight_layout()
# Filling with mode (denoted by highest bar in plots above)
for col in bbb:
    cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mode()[0])
cleaned_df[bbb].isna().sum()
cleaned_df.head()
Imputing with IterativeImputer
I'd like to practice using the `IterativeImputer` to impute missing values for the `cleaning_fee` feature. However, before we use the `IterativeImputer`, 1) the features need to be numerical and 2) they should be standardized.
However, using every available feature would make imputation very computationally expensive. So, I'm going to simplify the problem by selecting a small subset of relevant numerical features to feed the imputer.
features_for_imputer = ['accommodates', 'bathrooms', 'bedrooms', 'beds',
'guests_included', 'review_scores_rating', 'price', 'cleaning_fee']
df_impute = cleaned_df[features_for_imputer]
std = StandardScaler()
std_df_impute = std.fit_transform(df_impute)
std_df_impute = pd.DataFrame(data=std_df_impute, columns=df_impute.columns)
# Rows with a known cleaning_fee; these will be used to fit the imputer
rows_all_vals = std_df_impute[std_df_impute['cleaning_fee'].notnull()]
Instantiating the Imputer
imp = IterativeImputer(max_iter=5)
imp.fit(rows_all_vals)
rescaled_data = std.inverse_transform(imp.transform(std_df_impute))
# Calling the imputer on the subset of our DataFrame
imputed_std_df = pd.DataFrame(rescaled_data, columns=df_impute.columns)
imputed_std_df.head()
# Sanity checking before re-inserting back into the DataFrame
imputed_std_df.shape == df_impute.shape == cleaned_df[features_for_imputer].shape
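As a quick check (a minimal sketch reusing `df_impute` and `imputed_std_df` from above), the standardize-impute-inverse-transform round trip should leave the rows that already had a cleaning fee essentially unchanged:
known = df_impute['cleaning_fee'].notnull().values
# Compare original vs. round-tripped cleaning_fee values on the rows that were not missing
np.allclose(df_impute['cleaning_fee'].values[known],
            imputed_std_df['cleaning_fee'].values[known])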
Resetting the index so the imputed values align with `cleaned_df` when reassigned.
cleaned_df = cleaned_df.reset_index(drop=True)
cleaned_df[features_for_imputer] = imputed_std_df
# Sanity Check - no NA values
cleaned_df.isna().any().sum()
Now we're going to split the data into an X matrix of features and a y array for `price`, to figure out which features are the strongest predictors of price.
I've decided not to split my data into a training and testing set because we're simply using some ML algorithms to figure out which features are most important when predicting price. I'm not really concerned with using a testing set to figure out how well the algorithms are doing.
X = cleaned_df.drop(['price', 'sum_na_row'], axis=1)
y = cleaned_df['price']
cat_cols = X.select_dtypes('object').columns
print('Unique Values per categorical column...')
for col in cat_cols:
    print(f'{col}: {X[col].nunique()}')
We don't need any of the categorical columns with only 1 value.
# Tracking all the columns I plan to drop
drop_cols = ['experiences_offered','country_code','country','has_availability','is_business_travel_ready']
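Rather than hard-coding that list, the same single-value columns can be found programmatically; a minimal sketch (reusing `X` and `cat_cols` from above):
# Categorical columns with only one unique value carry no information
single_value_cols = [col for col in cat_cols if X[col].nunique() == 1]
print(single_value_cols)  # expected to match drop_cols above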
Let's take a look at some of these columns to see if we can easily just get rid of some of them.
X[['host_neighbourhood', 'neighbourhood',
'neighbourhood_cleansed', 'smart_location']].head(10)
I don't really know the difference between each of the neighbourhood columns. In a work setting, I'd follow up with someone who knows this dataset well to understand how to use these columns. For simplicity, I'll just move forward with `neighbourhood`. I'm also going to drop `smart_location` because it's redundant to `neighbourhood` and a bit less granular.
drop_cols.extend(['host_neighbourhood','neighbourhood_cleansed','smart_location'])
Let's take a look at `property_type`.
X['property_type'].value_counts()
This could be a valuable predictor, since I'd guess that a Villa would be priced higher than a Hostel.
# Dropping the columns
X = X.drop(drop_cols, axis=1)
X.head()
Now let's one-hot encode our columns.
ohe = OneHotEncoder(handle_unknown='value', use_cat_names=True,
return_df=True, handle_missing='value')
X = ohe.fit_transform(X)
X.head()
X.shape
feature_names = X.columns.values
feature_names.shape
Now standardizing our data using the `StandardScaler`.
std = StandardScaler()
X = std.fit_transform(X)
X.shape
I'd like to practice using ANOVA since I haven't used it before, so I'm using an ANOVA F-test to initially select the best 30% of features.
# Note: price is a continuous target, so f_regression would arguably be a more
# appropriate score function than f_classif here
fvalue_selector = SelectPercentile(f_classif, percentile=30)
X_best = fvalue_selector.fit_transform(X, y)
mask = fvalue_selector.get_support(indices=True)
f_val_features = feature_names[mask]
X_best.shape
I'd also like to further reduce dimensionality by running PCA on our best 30% features found with ANOVA above.
I want to have enough components to have 85% variance explained, so I'll test run with 75 components to see how much variance that explains, and go from there.
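As an aside, scikit-learn can also choose the number of components for a target explained variance directly; a minimal sketch (reusing `X_best` from above):
# Passing a float in (0, 1) keeps the smallest number of components whose
# cumulative explained variance reaches that fraction
pca_auto = PCA(n_components=0.85)
X_pca_auto = pca_auto.fit_transform(X_best)
print(pca_auto.n_components_, pca_auto.explained_variance_ratio_.sum())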
pca = PCA(75)
X_pca = pca.fit_transform(X_best)
pca.explained_variance_ratio_.sum()
So we have about 72% variance explained with the first 75 principal components. Let's keep going until we have 85%.
n_components = [100, 105, 110, 115, 120, 125]
for n in n_components:
    pca = PCA(n)
    X_pca = pca.fit_transform(X_best)
    if pca.explained_variance_ratio_.sum() >= .85:
        break
print(pca.explained_variance_ratio_.sum())
print(pca.n_components)
ticks = np.arange(pca.n_components)
values = pca.explained_variance_ratio_
plt.figure(figsize=(13, 5))
ax = plt.subplot(111)
cumvals = np.cumsum(values)
ax.bar(ticks, values)
ax.plot(ticks, cumvals)
ax.xaxis.set_tick_params(width=1)
ax.yaxis.set_tick_params(width=2, length=12)
ax.set_xlabel("No. Principal Components")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component\nScree Plot')
plt.show()
So we now have 110 principal components that explain 87% of our variance. Let's keep going.
Now that we have our features ready for our algorithms, let's train some models to predict `price`.
RFECV can be a very useful tool for figuring out which features are most important. It trains a model on all features, scores it via cross-validation, then recursively eliminates the least important features, keeping the number of features that gives the best cross-validated score.
In theory, we are left with the most important features (in our case, principal components). Let's see what we get using a linear model on our principal components...
ols = LinearRegression()
rfecv_ols = RFECV(estimator=ols, step=1,
scoring='neg_mean_squared_error', cv=3)
%time rfecv_ols.fit(X_pca, y)
rfecv_ols.n_features_
pc_kept = []
i = 1
print('Principal Components Kept...')
for boolean in rfecv_ols.support_:
    if boolean:
        print(f'Principal Component {i}')
        pc_kept.append(i)
    i += 1
print('Respective Linear Regression Principal Component Coefficients...\n')
for comp, coef in zip(pc_kept, rfecv_ols.estimator_.coef_):
    print(f'PC{comp}: {coef:.5f}')
Now we have enough information for a quick-and-dirty analysis. We could pick a few of these principal components and try to interpret what they describe, but first, I'd like to run a RandomForest ensemble and compare the RandomForest's `feature_importances_` to this list of principal components left over from RFECV.
I want to see if a trained Random Forest regressor's `feature_importances_` agrees with our findings from RFECV. Below, I'll train a Random Forest on the same principal components we passed to RFECV.
rf_reg = RandomForestRegressor(n_estimators=75)
rf_reg.fit(X_pca, y)
component_names = ['Comp ' + str(i)
                   for i in range(1, len(rf_reg.feature_importances_) + 1)]
importances = rf_reg.feature_importances_
# Grabbing the indices of the top 5 most important features, sorted by importance
indices = np.argsort(importances)[-5:]
# Grabbing an array of all the feature names
features = np.array(component_names)
# Plotting
plt.figure()
plt.title("Feature importances")
plt.barh(range(len(indices)), importances[indices],
color="r", align="center")
plt.yticks(range(len(indices)), features[indices])
plt.show()
According to our Random Forest regressor, the plot above shows the five most important principal components.
Let's compare this to our RFECV most important Principal Components.
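To make the comparison concrete, here is a minimal sketch (reusing `pc_kept` from the RFECV loop and `indices` from the Random Forest cell above) listing which principal components each method selected and where they agree:
rfecv_pcs = set(pc_kept)
rf_pcs = set(int(i) + 1 for i in indices)  # argsort indices are 0-based
print('RFECV kept:      ', sorted(rfecv_pcs))
print('RF top 5:        ', sorted(rf_pcs))
print('Selected by both:', sorted(rfecv_pcs & rf_pcs))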
Now that we have a clearer picture of the most important principal components per RFECV with Linear Regression and Random Forest Regression, we're ready to tackle the question...
What are the strongest predictors for price?
But first, a quick recap of what we've done since our initial Exploratory Data Analysis: we handled missing values, one-hot encoded the categorical features, standardized everything, selected the top 30% of features with an ANOVA F-test, reduced them to 110 principal components with PCA, and ranked those components with RFECV (linear regression) and a Random Forest regressor.
Below is a table summary showing the 5 components left after performing RFECV and the top 5 most important components per Random Forest regression.
Principal Component | From RFECV | From RF Reg |
---|---|---|
Principal Component 1 | - | x |
Principal Component 2 | - | x |
Principal Component 3 | x | x |
Principal Component 4 | x | - |
Principal Component 5 | - | x |
Principal Component 6 | - | x |
Principal Component 7 | x | - |
Principal Component 9 | x | - |
Principal Component 20 | x | - |
As you can see, both analyses highlighted principal component 3. Let's examine the features in this component. Additionally, the largest positive coefficient from the linear regression used in RFECV is associated with principal component 7, and the most important feature per the Random Forest regressor is principal component 2, so we'll take a look at these as well.
def plot_pca_dim(pca, feat_names, dim, features, fontsize=8):
    '''
    For a given fitted PCA object, the given component, and a given number of features, this plots
    the scaled coefficients for this particular component.
    Inputs:
        pca: PCA object
        feat_names: feature names from original pre-PCA dataset
        dim: component that you want to inspect
        features: number of most positive and most negative features to display
    '''
    names_to_weights = sorted(
        zip(pca.components_[dim], feat_names), reverse=False)
    names_to_weights = np.array(names_to_weights)
    nm_to_wt_df = pd.DataFrame(names_to_weights, columns=['coef', 'feature'])
    nm_to_wt_df = pd.concat(
        [nm_to_wt_df.head(features), nm_to_wt_df.tail(features)])
    nm_to_wt_df['coef'] = nm_to_wt_df['coef'].astype(float)
    fig = px.bar(nm_to_wt_df, y='feature', x='coef',
                 color='coef', color_continuous_scale=px.colors.sequential.Sunsetdark,
                 labels={'coef': 'Coefficients', 'feature': ''},
                 height=450, width=800, orientation='h',
                 title='Principal Component ' + str(dim + 1) + ' Coefficients',
                 hover_name='feature')
    fig.update_yaxes(tickfont=dict(size=fontsize))
    fig.update_xaxes(tickfont=dict(size=fontsize))
    return fig
def get_index_where_value(arr, value):
    """
    Returns index of array where that array's value equals input value.
    Inputs...
        arr: input array to search
        value: value to search
    Returns: Index Position or None
    """
    try:
        return np.argwhere(np.where(arr == value, arr, None))[0][0]
    except IndexError:
        return None
fig_3 = plot_pca_dim(pca, f_val_features, 2, 10)
fig_3
This component has a relatively large association with New York and room type, and negative associations with some amenities, so I interpret it as measuring New York apartment listings as predictors of unit price. Consequently, it penalizes non-NY listings because our feature value for `state_NY` is negative (due to standardization with `StandardScaler`) for non-NY listings. This doesn't provide much value since it's focused on one particular state, which probably stems from the fact that this dataset is heavily imbalanced in favor of New York listings. Intuitively, it's difficult to interpret why this component has such strong negative associations with amenities, so I'll treat those as unimportant price adjustments that the algorithm learned to help it predict `price` more accurately for NY listings.
fig_7 = plot_pca_dim(pca, f_val_features, 6, 10)
fig_7
This component has a very strong positive association with one room type (`Entire home/apt`) and a very negative association with another (`Private room`). It is also positively associated with `accommodates` (a discrete variable for the number of guests), `bedrooms`, `beds`, `family/kid friendly`, and `guests_included`. Therefore, this component measures the size of listings/units, valuing larger listings higher than smaller ones.
fig_2 = plot_pca_dim(pca, f_val_features, 1, 10)
fig_2
This component has fairly uniform coefficients related to amenities. Given the relatively high amenities coefficients, and given that these are luxury amenities (`wine cellar`, `sun deck`, `steam room`, `rooftop`, etc. - all things I would expect to find in a very luxurious listing), this component measures luxurious amenities, valuing listings higher if they have them.
In sum, by exploring the Airbnb listings datasets for New York, San Francisco, and Austin, we've been able to answer the following questions...
Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better or worse service than those who do not?)
- Using a one-sided t-test on `review_scores_rating`, we found convincing evidence that hosts with only one unit are rated higher, on average, than hosts with more units.

Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
- Faceting price by room type, review recency, and `host_is_superhost`, we found that, visually, units without recent reviews were priced slightly higher, and pricing seemed higher for superhosts than non-superhosts.

What are the strongest predictors for price?
- After ANOVA feature selection and PCA, we used RFECV with linear regression and a Random Forest regressor to identify the principal components most predictive of `price`. By analyzing three of these "important" components, we found that the attributes mostly captured by these components were New York listings, the overall size of a listing (entire home/apt, `accommodates`, `bedrooms`, `beds`), and luxurious amenities.