Exploring Airbnb Listings Datasets¶

Table of Contents¶

Business Understanding
Data Understanding
1. Preparing Data for Exploration
2. Exploratory Data Analysis
Preparing Data for ML Algorithms
1. Dealing with Missing Data
2. Feature Scaling
3. Encoding Categorial Data
4. Feature Selection & Transformation
Data Modeling
Evaluate Results

Business Understanding¶

Airbnb is the largest hospitality company in the world that doesn't really own any of the lodging that it provides to customers! It has taken the world by storm, and provides many different kinds of lodging options to people everywhere.

How is this made possible? By harnessing data.

Airbnb uses data to determine which listings to show to customers, how to price its listings, which Airbnb experiences users are most likely to book, and much more.

I'd like to understand what kinds of things we can draw from listings data.

Which hosts give better service than others?
Do guest reviews affect pricing?
What kinds of features are the strongest predictors of price?

Let's see what we can find.

Data Understanding¶

Datasets found here.

These datasets were scraped by InsideAirbnb from the Airbnb website. They are a snapshot of listings from San Francisco, CA; New York, NY; and Austin, TX from earlier this year.

Below, I will import all of the libraries I will use, understand how this data is organized, clean it, and explore it.

Let's see if we can answer the following questions...

Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than those who do not?)
Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
What are the strongest predictors of price?

from sklearn.feature_selection import RFECV, f_classif, SelectPercentile
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from category_encoders import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.impute._iterative import IterativeImputer
from sklearn.experimental import enable_iterative_imputer

import plotly.graph_objects as go
import plotly_express as px
import seaborn as sns
import datetime
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, t
import scipy.stats as stats
import math
import collections

import matplotlib.pyplot as plt
%matplotlib inline

sf = pd.read_csv('sf_listings.csv')
ny = pd.read_csv('ny_listings.csv')
aus = pd.read_csv('aus_listings.csv')

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/IPython/core/interactiveshell.py:3049: DtypeWarning:

Columns (43,61,62,94,95) have mixed types. Specify dtype option on import or set low_memory=False.

pd.options.display.max_rows = 106

sf.head(2).transpose()

Skimmed through features to filter out columns I don't need.

drop = ['listing_url', 'scrape_id', 'last_scraped', 'summary', 'space', 'description', 'neighborhood_overview',
        'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url',
        'xl_picture_url', 'host_id', 'host_url', 'host_about', 'host_thumbnail_url', 'host_picture_url',
        'calendar_updated', 'calendar_last_scraped', 'license', 'name', 'host_name', 'zipcode', 'id']

# Dropping all the columns I don't want
sf = sf.drop(drop, axis=1)
ny = ny.drop(drop, axis=1)
aus = aus.drop(drop, axis=1)

df = pd.concat([sf, ny, aus], ignore_index=True)

# Sanity checking rows all came in
df.shape[0] == sf.shape[0]+ny.shape[0]+aus.shape[0]

True

Preparing Data for Exploration¶

Here we need to clean the data enough for Exploratory Data Analysis.

Duplicate Value Check¶

df.shape

(68168, 78)

df = df.drop_duplicates()

df.shape

(68167, 78)

1 duplicate dropped.

Separate Data Types¶

cat_df = df.select_dtypes(include=['object', 'bool'])
cat_cols = df.select_dtypes(include=['object', 'bool']).columns.values

num_df = df.select_dtypes(include=['int', 'float'])
num_cols = df.select_dtypes(include=['int', 'float']).columns.values

# Sanity check that we have all our features accounted for
len(num_cols)+len(cat_cols) == df.shape[1]

True

So now we need to account for all of the following data types...

Numerical
- Continuous
- Discrete
Categorical
- Binary
- Ordinal
- Nominal
Mixed Type
Time

Numerical¶

num_df.head().transpose()

From what I can tell, all the numerical columns look good except host_acceptance_rate and square_feet since there are missing values in the first few rows. We'll take a look at NA values later.

Categorical¶

Binary
Ordinal
Nominal

cat_df.head().transpose()

We have a lot more work to do for these categorical columns.

Numerical
- Convert some of these to numerical columns (i.e. prices and response rate)
Binary Categorical
- We need to Binarize the Binary/Boolean columns
Ordinal
- The only ordinal-looking column I see is host_response_time, but I'm going to treat it as nominal
Nominal
- We have a lot of features that seem redundant that we should explore (i.e. city, market, smart_location, jurisdiction_names)
- host_verifications and amenities have multiple items that need to be extrapolated into their own columns
Date/Time
- Engineer features from the dates columns
  - Extrapolate days

Categorical to Numerical¶

cat_to_num = ['host_response_rate', 'price', 'weekly_price',
              'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']
prices = ['price', 'weekly_price', 'monthly_price',
          'security_deposit', 'cleaning_fee', 'extra_people']

cat_to_num_df = cat_df[cat_to_num]
cat_to_num_df.head(2)

# Converting the response rate to integers
def repl(m): return m.group(1)


cat_to_num_df['host_response_rate'] = cat_to_num_df['host_response_rate'].str.replace(
    r'(\d+)%', repl).astype(float)

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:6: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

# Removing commas and dollar signs, then converting to float
for col in prices:
    cat_to_num_df[col] = cat_to_num_df[col].str.replace(
        '$', '').str.replace(',', '')
    cat_to_num_df[col] = cat_to_num_df[col].astype(float)

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:5: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

cat_to_num_df.head(1)

cat_to_num_df.describe().head(3)

These columns look good now.

Appending the cat_to_num_df to the num_df.

num_df = pd.concat([num_df, cat_to_num_df], axis=1)

num_cols = num_df.columns.values

cat_df = cat_df.drop(cat_to_num, axis=1)
cat_cols = cat_df.columns.values

Binary Categorical¶

bi_cols = []
for col in cat_df.columns:
    if cat_df[col].nunique() == 2:
        bi_cols.append(col)

cat_df[bi_cols].head()

for col in bi_cols:
    cat_df[col] = cat_df[col].map({'f': 0, 't': 1})

Nominal¶

Redundant Columns¶

Took a look at these columns since they looked redundant.

redundant_cols = ['city', 'market',
                  'jurisdiction_names', 'host_location', 'street']

Removing the following features

market: smart_location is a more detailed feature
city : redundant to smart_location
jurisdiction_names: redundant to smart_location
host_location: 1,862 unique values. Too many unique values.
street: redundant to smart_location

cat_df = cat_df.drop(redundant_cols, axis=1)

Multiple Values in One Column¶

host_verifications
amenities

cat_df.head()

def striplist(l):
    return([x.strip() for x in l])


cat_df['host_verifications'] = cat_df['host_verifications'].str.replace('[', '').str.replace(']', '').str.replace("'",
                                                                                                                  '').str.lower().str.split(',').apply(striplist)

cat_df['amenities'] = cat_df['amenities'].str.replace('{', '').str.replace('}',
                                                                           '').str.replace('"', '').str.lower().str.split(',').apply(striplist)

Binarizing the lists.

mlb = MultiLabelBinarizer()
amenities_df = pd.DataFrame(mlb.fit_transform(
    cat_df['amenities']), columns=mlb.classes_, index=cat_df.index)

mlb = MultiLabelBinarizer()
host_verif_df = pd.DataFrame(mlb.fit_transform(
    cat_df['host_verifications']), columns=mlb.classes_, index=cat_df.index)

host_verif_df = host_verif_df.drop(axis=1, labels=[''])
amenities_df = amenities_df.drop(axis=1, labels=[''])

cat_df = cat_df.drop(['host_verifications', 'amenities'], axis=1)

Concatenating the DataFrames together.

cat_df = pd.concat([cat_df, amenities_df, host_verif_df], axis=1)

cat_cols = cat_df.columns.values

Date/Time Feature Engineering¶

Converting the date features to "days since" features so we have numerical values to work with.

today = pd.to_datetime('today')

dt_cols = ['host_since', 'first_review', 'last_review']

cat_df[dt_cols].head(1)

# Converting to datetime data type
for col in dt_cols:
    cat_df[col] = pd.to_datetime(cat_df[col], infer_datetime_format=True)

# Creating day features for each date column
for col in dt_cols:
    num_df[col+'_days'] = (today - cat_df[col]).apply(lambda x: x.days)

# Creating a discrete feature based on how recent the last review was
bins = [0, 90, 180, 365, 730, np.inf]
labels = ['last 90 days', 'last 180 days',
          'last year', 'last 2 years', 'more than 2 years']
cat_df['last_review_discrete'] = pd.cut(
    num_df['last_review_days'], bins=bins, labels=labels)

# Dropping the datetime columns since because I don't need them
cat_df = cat_df.drop(dt_cols, axis=1)
cat_cols = cat_df.columns.values
num_cols = num_df.columns.values

Exploratory Data Analysis¶

Time to explore the data. Let's see if we can answer any of our previously mentioned exploratory questions below.

Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than those who do not?)
Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
What are the strongest predictors for price?

# Sanity checking before I rejoin the DFs
num_df.shape[0] == cat_df.shape[0]

True

cleaned_df = pd.concat([num_df, cat_df], axis=1)

cleaned_df.head()

Exploratory Question 1¶

Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than those who do not?)

# Making a copy of the DataFrame just for exploring
eda = cleaned_df[:]

Let's take a quick look at the distribution of host_total_listings_count.

plt.figure()
plt.title('Distribution of No. host listings.')
plt.hist(eda['host_total_listings_count'],bins=30)
plt.show()

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/numpy/lib/histograms.py:824: RuntimeWarning:

invalid value encountered in greater_equal

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/numpy/lib/histograms.py:825: RuntimeWarning:

invalid value encountered in less_equal

We have some huge outliers here, with the majority of values less than 100. Let's clean this up.

eda['host_total_listings_count'].value_counts(
).sort_index(ascending=False).head(20)

1677.0     23
1614.0     72
1522.0     71
1235.0    241
1108.0    106
814.0     160
695.0       8
641.0      14
640.0       1
593.0     341
528.0      18
516.0      14
488.0      24
472.0       8
441.0      18
408.0      33
362.0       1
349.0       4
332.0       4
314.0      26
Name: host_total_listings_count, dtype: int64

So we have a significant number of hosts with hundreds of listings. Some of these could be errors because I it's difficult to imagine people owning over a thousand airbnb listings. In a work setting, this'd be worth following up. However, I'm going to discretize these into bins to make these groups easier to compare.

I'll make a categorical column to distinguish between hosts. Here are the bins I'm going to use...

1 unit
2-10 units
11-50 units
51-100 units
101+

bins = [0,1,10,50,100,np.inf]
labels = ['1','2-10','11-50','51-100','101+']
eda['cat_num_units'] = pd.cut(eda['host_total_listings_count'],bins=bins,labels=labels)

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/ipykernel_launcher.py:3: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy

Now plotting the bins...

plt.figure()
sns.countplot(x='cat_num_units',data=eda)
plt.title('No. Host Units Distribution')
plt.show()

Let's compare these categories on their ratings.

rating_cols = ['review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin',
               'review_scores_communication', 'review_scores_location', 'review_scores_value']

def highlight_max(s):
    '''
    highlight the maximum in a Series lightgreen.
    '''
    is_max = s == s.max()
    return ['background-color: lightgreen' if v else '' for v in is_max]

def highlight_min(s):
    '''
    highlight the minimum in a Series red.
    '''
    is_min = s == s.min()
    return ['background-color: red' if v else '' for v in is_min]

eda.groupby('cat_num_units')[rating_cols].mean().transpose().style. \
    apply(highlight_max, axis=1).apply(highlight_min, axis=1)

Looks like hosts with only 1 unit appear to have the highest average rating and scores. Let's look at their boxplot summaries side-by-side to compare.

plt.figure(1,figsize=(15,13))
i = 1

for col in rating_cols:
    plt.subplot(3,2,i)
    sns.boxplot(x='cat_num_units',y=col,data=eda)
    title = col.replace('_',' ').title()
    plt.title(title)
    plt.ylabel(None)
    plt.xlabel(None)
    i += 1

plt.tight_layout()    
plt.show()

Conclusion¶

So these plots tell me that hosts with only 1 unit have tighter and higher distributions, suggesting that these hosts are rated slightly better than hosts with more units. However, it's worth noting that...

this difference looks very marginal, and
observations on hosts with more than 10 units may be slightly misleading since our sample sizes for these buckets are much smaller than hosts with 1 unit and hosts with 2-10 units.

Hypothesis Testing¶

Let's perform a simple hypothesis t-test with a 5% significance level to check whether the average Review Scores Rating for hosts with 1 unit is greater than ratings of all other hosts.

def plot_t_dist_hyp(df, t_stat, p_val):
    """
    After t-test, plot t-statistic and p-value via Plotly Express.
    Depenencies: scipy.stat.t | plotly_express as px | plotly.graph_objects as go

    Inputs
    ------
    df: degrees of freedom
    t_stat: t-statistic
    p_val: p-value

    Returns
    ------
    Plotly Express Figure

    """
    rv = t(df=df)

    x_t = np.linspace(rv.ppf(0.00001), rv.ppf(0.99999), 100)
    y_t = rv.pdf(x_t)
    t_df = pd.DataFrame.from_dict(data={'x_t': x_t, 'y_t': y_t})

    fig = px.line(t_df, 'x_t', 'y_t', width=600, height=300,
                  labels={'x_t': 'X', 'y_t': ''},)
    fig.update_traces(textposition='top center')
    fig.update(layout=dict(title='Hypothesis Test Results', annotations=[{
        "x": t_stat,
        "y": rv.pdf(t_stat),
        'font': {'family': 'Arial', 'size': 12, 'color': 'black'},
        "showarrow": True,
        "text": f'P-value<br>{p_val:.2e}',
        "arrowhead": 7,
        "ay": -40
    }]))
    return fig

First let's check our conditions for hypothesis testing using the Central Limit Theorem...

Independence: I picked listing data from 3 of the largest cities in the US, and these listings were scraped off of Airbnb's website and represent a snapshot of these listings at a given point in time. Therefore, It's important to note that these listings were not randomly sampled. Additionally, it feels wrong to assume that ratings within each of these cities are independent of one another, since ratings of some listings can affect how guests rate other listings. Therefore, this test cannot reliably be generalized to the population of all Airbnb listings around the world. However, I will still proceed so I can practice hypothesis testing.
Sample Size & Skew: Because our sample size (>60,000) is greater than 30, we can reliable assume that this sampling distribution would be normally distributed if we continued to take samples of 50,000 listings.

Experiment Details

$ H_0: u_{1 unit} = u_{all others}$
$ H_A: u_{1 unit} > u_{all others}$
$\alpha = 0.05$

# Creating sampling arrays
one_unit_samp = eda[eda['cat_num_units'] == '1']['review_scores_rating']
all_others_samp = eda[eda['cat_num_units'] != '1']['review_scores_rating']

# Calculating Variance
one_unit_var = one_unit_samp.var()
all_others_var = all_others_samp.var()

# Variances are nearly the same
print(one_unit_var)
print(all_others_var)

60.04818004883296
69.84990857907735

t_stat, p_val = ttest_ind(
    a=one_unit_samp, b=all_others_samp, equal_var=True, nan_policy='omit')

# Plotting a T-distribution with the calculated t-statistic annotated
df = min([len(one_unit_samp)-1, len(all_others_samp)-1])
plot_t_dist_hyp(df, t_stat, p_val)

Results

$t \approx 17.87$
${p_{value}} \approx 0$

Thus, after performing a t-test comparing the ratings between hosts with only 1 unit and all other hosts, we reject our null hypothesis (that the average ratings between the two groups are the same) in favor of the alternative ($H_A$), providing convincing evidence that hosts with only one unit are rated higher than hosts with more than 1 unit.

Exploratory Question 2¶

Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?

eda[['last_review_discrete', 'last_review_days']]

eda[['property_type', 'room_type']]

eda['price']

0        170.0
1        235.0
2         65.0
3         65.0
4        685.0
5        255.0
6        139.0
7        135.0
8        265.0
9        177.0
10       194.0
11       139.0
12        85.0
13        85.0
14        79.0
15       166.0
16       215.0
17       450.0
18       107.0
19       110.0
20       125.0
21        65.0
22       100.0
23       155.0
24       180.0
25       155.0
26        79.0
27       195.0
28       150.0
29       110.0
30       125.0
31       110.0
32       450.0
33       150.0
34       199.0
35       220.0
36       155.0
37       150.0
38       107.0
39        94.0
40       108.0
41        83.0
42       159.0
43       125.0
44       136.0
45       170.0
46        85.0
47        85.0
48       150.0
49       120.0
50       175.0
51       116.0
52       125.0
         ...  
68115     69.0
68116     29.0
68117     69.0
68118     29.0
68119     59.0
68120     39.0
68121     39.0
68122    200.0
68123     32.0
68124    799.0
68125    199.0
68126     40.0
68127     41.0
68128    600.0
68129    200.0
68130    142.0
68131    200.0
68132     40.0
68133     70.0
68134    150.0
68135     70.0
68136     95.0
68137     89.0
68138    200.0
68139     60.0
68140     35.0
68141     69.0
68142     69.0
68143    125.0
68144    127.0
68145    100.0
68146    150.0
68147    180.0
68148     50.0
68149     50.0
68150     45.0
68151     85.0
68152    111.0
68153    119.0
68154     40.0
68155     90.0
68156    120.0
68157     49.0
68158     95.0
68159    110.0
68160    100.0
68161     40.0
68162     30.0
68163    150.0
68164     29.0
68165     30.0
68166    115.0
68167     90.0
Name: price, Length: 68167, dtype: float64

# Plotting facet grid to compare box plots for each kind of room type. Also filtered out outlier prices (>500)
fgrid = sns.FacetGrid(eda[eda['price'] <= 500], col='room_type', height=5,)
fgrid.map(sns.boxplot, 'last_review_discrete', 'price', 'host_is_superhost')
for ax in fgrid.axes.flat:
    plt.setp(ax.get_xticklabels(), rotation=45)
    ax.set(xlabel=None, ylabel=None)

l = plt.legend(loc='upper right')
l.get_texts()[0].set_text('Is not Superhost')
l.get_texts()[1].set_text('Is Superhost')

fgrid.fig.tight_layout(w_pad=1)

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:715: UserWarning: Using the boxplot function without specifying `order` is likely to produce an incorrect plot.
  warnings.warn(warning)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/seaborn/axisgrid.py:720: UserWarning: Using the boxplot function without specifying `hue_order` is likely to produce an incorrect plot.
  warnings.warn(warning)

Conclusion¶

Looking at the faceted plots above, it seems that units that haven't been reviewed for a long time are priced slightly higher than units with more recent reviews. Also, it appears that superhosts' pricing (dark blue) is higher than non-superhosts (light blue), suggesting that hosts who are verified as superhosts (hosts who are top-rated and most experienced) are priced higher than those who are not.

Exploratory Question 3¶

What are the strongest predictors for price?

This question requires deeper analysis using machine learning techniques. Let's revisit this later.

Preparing Data for ML¶

We need to prepare the data so we can pass it through our ML algorithms. This entails...

Removing or imputing missing data
Encoding our categorical data
Scaling our data

Assessing Missing Data¶

cleaned_df = cleaned_df.drop(['last_review_discrete'], axis=1)

Per Column¶

# Calculating proportion of NA values in each column
prop_na = cleaned_df.isna().sum()/len(cleaned_df)
prop_na_05 = prop_na[prop_na > .05]
prop_na_05 = prop_na_05.sort_values(0, ascending=True).reset_index()

Plotting proportion of NA values for all columns where more than 5% is missing. The six columns in red will be removed since they are missing > 30% of their values.

plt.figure(figsize=(9, 7))

barh = plt.barh(prop_na_05['index'], prop_na_05[0], alpha=0.85, color='green')
for i in range(7):
    i += 1
    barh[-i].set_color('darkred')

plt.title('Proportion of NA Values')
plt.vlines(x=.3, ymin=0, ymax=20, color='red', linestyles='dashed')
plt.xticks(np.arange(.1, 1.01, .1))

plt.tight_layout()

drop_na_cols = ['host_acceptance_rate', 'square_feet', 'monthly_price', 'weekly_price',
                'security_deposit', 'host_response_rate', 'host_response_time']
cleaned_df = cleaned_df.drop(drop_na_cols, axis=1)

Taking a look at neighbourhood_group_cleansed as well.

cleaned_df['neighbourhood_group_cleansed'].value_counts()

Manhattan        21790
Brooklyn         20019
Queens            5588
Bronx             1048
Staten Island      356
Name: neighbourhood_group_cleansed, dtype: int64

Looks like it is only relevant to the New York dataset, so I think it's safe to drop.

cleaned_df = cleaned_df.drop('neighbourhood_group_cleansed',axis=1)

Per Row¶

cleaned_df['sum_na_row'] = cleaned_df.isna().sum(axis=1)

plt.figure()
sns.distplot(cleaned_df['sum_na_row'], bins=15, kde=False)
plt.xticks(np.arange(0, 21, 5))
plt.show()

Let's take a look at the rows missing 10 or more values.

temp = cleaned_df[cleaned_df['sum_na_row'] >= 10].isna().any().reset_index()
na_cols = temp[temp[0] == True]['index'].values
cleaned_df[cleaned_df['sum_na_row'] >= 10][na_cols].transpose()

NA values related to reviews are most likely missing because this particular unit does not have any reviews. This could be useful information, so I'll encode these values as 0 so they're different from the values that we do have.

Although 0 may be misleading, I believe filling with 0 is better than simply removing the rows or imputing based on other values to maintain its variance from units have reviews.

review_cols = ['review_scores_rating', 'review_scores_accuracy', 'review_scores_cleanliness',
               'review_scores_checkin', 'review_scores_communication', 'review_scores_location',
               'review_scores_value', 'reviews_per_month', 'first_review_days', 'last_review_days']

cleaned_df[review_cols].head()

cleaned_df[review_cols] = cleaned_df[review_cols].fillna(value=0)

It doesn't really make any sense to impute the host_neighbourhood or the neighbourhood features, so I'll drop the rows that're missing these values

na_neigh = cleaned_df[(cleaned_df['host_neighbourhood'].isna()) | (
    cleaned_df['neighbourhood'].isna())].index

print(
    f'Dropping {na_neigh.shape[0]} rows due to missing values in neighborhood columns.')

Dropping 9930 rows due to missing values in neighborhood columns.

cleaned_df = cleaned_df.drop(na_neigh)

State

cleaned_df[cleaned_df['state'].isna()][['state', 'neighbourhood',
                                        'smart_location']]

The first three are NY, and the last one is TX

update_state = cleaned_df[cleaned_df['state'].isna(
)][['state', 'neighbourhood', 'smart_location']].index
cleaned_df.loc[update_state[:3]
               ]['state'] = cleaned_df.loc[update_state[:3]]['state'].fillna('NY')

for i in update_state[:3]:
    cleaned_df.at[i, 'state', ] = 'NY'

cleaned_df.at[update_state[-1], 'state', ] = 'TX'

Bathrooms, Bedrooms, and Beds¶

These features have some very common modes that I will use to impute their values.

bbb = ['bathrooms', 'bedrooms', 'beds']

plt.figure(figsize=(13, 4))
i = 1
for col in bbb:
    plt.subplot(1, 3, i)
    sns.countplot(x=col, data=cleaned_df)
    plt.title(col+' Distribution')
    plt.xticks(np.arange(0, 10, 1, dtype=np.int64), rotation=30)
    plt.xlim(right=10)
    i += 1

plt.tight_layout()

# Filling with mode (denoted by highest bar in plots above)
for col in bbb:
    cleaned_df[col] = cleaned_df[col].fillna(cleaned_df[col].mode()[0])

cleaned_df[bbb].isna().sum()

bathrooms    0
bedrooms     0
beds         0
dtype: int64

cleaned_df.head()

Feature Scaling (For `IterativeImputer`)¶

I'd like to practice using the IterativeImputer to impute missing values for the cleaning_fee feature. However, before we use IterativeImputer to impute the missing values, 1) the features need to be numerical and 2) they should be standardized.

However, we will have a lot of features if I use all that is available, making imputation very computationally expensive. So, I'm going to simplify the problem by doing the following...

Pick out a smaller number of numerical features that I can use to predict the missing values
Standardize these features
Predict missing values
Re-insert these values (with imputed NA values) back into the DataFrame.

features_for_imputer = ['accommodates', 'bathrooms', 'bedrooms', 'beds',
                        'guests_included', 'review_scores_rating', 'price', 'cleaning_fee']

df_impute = cleaned_df[features_for_imputer]

std = StandardScaler()
std_df_impute = std.fit_transform(df_impute)

std_df_impute = pd.DataFrame(data=std_df_impute, columns=df_impute.columns)

# These rows do not have missing values
rows_all_vals = std_df_impute[std_df_impute['cleaning_fee'].notnull()]

Instantiating the Imputer

imp = IterativeImputer(max_iter=5)

imp.fit(rows_all_vals)

IterativeImputer(add_indicator=False, estimator=None,
                 imputation_order='ascending', initial_strategy='mean',
                 max_iter=5, max_value=None, min_value=None, missing_values=nan,
                 n_nearest_features=None, random_state=None,
                 sample_posterior=False, tol=0.001, verbose=0)

rescaled_data = std.inverse_transform(imp.transform(std_df_impute))

# Calling the imputer on the subset of our DataFrame
imputed_std_df = pd.DataFrame(rescaled_data, columns=df_impute.columns)
imputed_std_df.head()

# Sanity checking before re-inserting back into the DataFrame
imputed_std_df.shape == df_impute.shape == cleaned_df[features_for_imputer].shape

True

Resetting indices so they can be concatennated

cleaned_df = cleaned_df.reset_index().drop('index', axis=1)

cleaned_df[features_for_imputer] = imputed_std_df

# Sanity Check - no NA values
cleaned_df.isna().any().sum()

0

Splitting Data¶

Now we're going to split the data into an X-matrix of features and y-array for price to figure out which features are the strongest predictors of price.

I've decided not to split my data into a training and testing set because we're simply using some ML algorithms to figure out which features are most important when predicting price. I'm not really concerned with using a testing set to figure out how well the algorithms are doing.

X = cleaned_df.drop(['price', 'sum_na_row'], axis=1)
y = cleaned_df['price']

Encoding Categorical Values¶

cat_cols = X.select_dtypes('object').columns
print('Unique Values per categorical column...')
for col in cat_cols:
    print(f'{col}: {X[col].nunique()}')

Unique Values per categorical column...
experiences_offered: 1
host_neighbourhood: 605
neighbourhood: 316
neighbourhood_cleansed: 291
state: 7
smart_location: 220
country_code: 1
country: 1
property_type: 41
room_type: 3
bed_type: 5
has_availability: 1
is_business_travel_ready: 1
cancellation_policy: 6

We don't need any of the categorical columns with only 1 value.

# Tracking all the columns I drop so I drop these in the test set as well
drop_cols = ['experiences_offered','country_code','country','has_availability','is_business_travel_ready']

Let's take a look at some of these columns to see if we can easily just get rid of some of them.

X[['host_neighbourhood', 'neighbourhood',
    'neighbourhood_cleansed', 'smart_location']].head(10)

I don't really know the difference between each of the neighbourhood columns. In a work setting, I'd follow up with someone that knows this dataset well to understand how to use these columns. For simplicity, I'll just move forward with neighborhood. I'm also going to drop smart_location because it's redundant to neighbourhood and a bit less granular.

drop_cols.extend(['host_neighbourhood','neighbourhood_cleansed','smart_location'])

Let's take a look at property_type.

X['property_type'].value_counts()

Apartment                 38470
House                      9441
Condominium                2655
Townhouse                  1946
Loft                       1461
Guest suite                 971
Serviced apartment          805
Guesthouse                  513
Boutique hotel              379
Bungalow                    348
Hotel                       222
Hostel                      189
Other                       168
Bed and breakfast           145
Resort                      118
Camper/RV                    84
Villa                        67
Aparthotel                   63
Cottage                      60
Tiny house                   47
Cabin                        16
Tent                         16
Boat                          7
Campsite                      7
Earth house                   6
Casa particular (Cuba)        5
Dome house                    4
Treehouse                     4
Farm stay                     3
Castle                        2
Barn                          2
Timeshare                     2
Yurt                          2
Houseboat                     2
In-law                        1
Cave                          1
Lighthouse                    1
Hut                           1
Chalet                        1
Tipi                          1
Bus                           1
Name: property_type, dtype: int64

This could be a valuable predictor, since I'd guess that a Villa would be priced higher than a Hostel.

# Dropping the columns
X = X.drop(drop_cols, axis=1)

X.head()

Now let's OneHotEncode our columns

ohe = OneHotEncoder(handle_unknown='value', use_cat_names=True,
                    return_df=True, handle_missing='value')

X = ohe.fit_transform(X)

X.head()

X.shape

(58237, 661)

feature_names = X.columns.values

feature_names.shape

(661,)

Feature Scaling (All)¶

Now standardizing our train data using the StandardScaler.

std = StandardScaler()

X = std.fit_transform(X)

X.shape

(58237, 661)

Feature Selection / Dimensionality Reduction¶

ANOVA¶

I'd like to practice using ANOVA since I haven't used it before. So, I'm using ANOVA to initially select the best 30% features

fvalue_selector = SelectPercentile(f_classif, percentile=30)

X_best = fvalue_selector.fit_transform(X,y)

/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/sklearn/feature_selection/univariate_selection.py:114: UserWarning: Features [649] are constant.
  UserWarning)
/Users/patrickdeguzman/anaconda3/lib/python3.7/site-packages/sklearn/feature_selection/univariate_selection.py:115: RuntimeWarning: invalid value encountered in true_divide
  f = msb / msw

mask = fvalue_selector.get_support(indices=True)

f_val_features = feature_names[mask]

X_best.shape

(58237, 199)

PCA¶

I'd also like to further reduce dimensionality by running PCA on our best 30% features found with ANOVA above.

I want to have enough components to have 85% variance explained, so I'll test run with 75 components to see how much variance that explains, and go from there.

pca = PCA(75)

X_pca = pca.fit_transform(X_best)

pca.explained_variance_ratio_.sum()

0.7179129999234619

So we have about 72% variance explained with the first 75 principal components. Let's keep going until we have 85%.

n_components = [100, 105, 110, 115, 120, 125]
for n in n_components:
    pca = PCA(n)
    X_pca = pca.fit_transform(X_best)
    if pca.explained_variance_ratio_.sum() >= .85:
        break

print(pca.explained_variance_ratio_.sum())
print(pca.n_components)

0.8666531501852338
110

ticks = np.arange(pca.n_components)
values = pca.explained_variance_ratio_

plt.figure(figsize=(13, 5))
ax = plt.subplot(111)
cumvals = np.cumsum(values)
ax.bar(ticks, values)
ax.plot(ticks, cumvals)

ax.xaxis.set_tick_params(width=1)
ax.yaxis.set_tick_params(width=2, length=12)

ax.set_xlabel("No. Principal Components")
ax.set_ylabel("Variance Explained (%)")
plt.title('Explained Variance Per Principal Component\nScree Plot')
plt.show()

So we now have 110 principal components that explain 87% of our variance. Let's keep going.

Data Modeling¶

Now that we have our features ready for our algorithms, let's train some models to predict price.

Recursive Feature Elimination with Cross Validation¶

RFECV can be a very useful tool to figure out which features are most important. It will train a model on all features, score it via cross validation, then recursively eliminate features and score until the score doesn't increase anymore.

In theory, we are left with the most important features (in our case, principal components). Let's see what we get using a Linear model on our principal components...

ols = LinearRegression()

rfecv_ols = RFECV(estimator=ols, step=1,
                  scoring='neg_mean_squared_error', cv=3)

%time rfecv_ols.fit(X_pca, y)

CPU times: user 54.6 s, sys: 3.71 s, total: 58.3 s
Wall time: 34.7 s

RFECV(cv=3,
      estimator=LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None,
                                 normalize=False),
      min_features_to_select=1, n_jobs=None, scoring='neg_mean_squared_error',
      step=1, verbose=0)

rfecv_ols.n_features_

5

pc_kept = []
i = 1
print('Principal Components Kept...')
for boolean in rfecv_ols.support_:
    if boolean == True:
        print(f'Principal Component {i}')
        pc_kept.append(i)
    i += 1

Principal Components Kept...
Principal Component 3
Principal Component 4
Principal Component 7
Principal Component 9
Principal Component 20

print('Respective Linear Regression Princpal Component Coefficients...\n')
for comp, coef in zip(pc_kept, rfecv_ols.estimator_.coef_):
    print(f'PC{comp}: {coef:.5f}')

Respective Linear Regression Princpal Component Coefficients...

PC3: -25.91075
PC4: -26.06227
PC7: 33.92885
PC9: 23.89627
PC20: 27.11547

Now we have enough information for quick and dirty analysis. We can pick a few of these Principal Components and try to interpret what they describe, but first, I'd like to run a RandomForest ensemble and compare the RandomForest's feature_importances_ to this list of principal components leftover from RFECV.

I want to see if a trained Random Forest regressor's feature_importancees_ agrees with our findings from RFECV. Below, I'll train a Random Forest on the 30% best features we found through ANOVA.

rf_reg = RandomForestRegressor(n_estimators=75)

rf_reg.fit(X_best, y)

RandomForestRegressor(bootstrap=True, criterion='mse', max_depth=None,
                      max_features='auto', max_leaf_nodes=None,
                      min_impurity_decrease=0.0, min_impurity_split=None,
                      min_samples_leaf=1, min_samples_split=2,
                      min_weight_fraction_leaf=0.0, n_estimators=75,
                      n_jobs=None, oob_score=False, random_state=None,
                      verbose=0, warm_start=False)

i = 1
component_names = []
while i <= len(rf_reg.feature_importances_):
    comp_name = 'Comp ' + str(i)
    component_names.append(comp_name)
    i += 1

importances = rf_reg.feature_importances_

# Grabbing the top 15 indices that would sort the feature importances according to their importance rating
indices = np.argsort(importances)[-5:]

# Grabbing an array of all the feature names
features = np.array(component_names)

# Plotting
plt.figure()
plt.title("Feature importances")
plt.barh(range(len(indices)), importances[indices],
         color="r", align="center")
plt.yticks(range(len(indices)), features[indices])
plt.show()

According to our Random Forest Regressor, the plot above shows the top most important Principal Components.

Let's compare this to our RFECV most important Principal Components.

Revisiting Exploratory Question 3¶

Now that we have a clearer picture of the the most important Principal Components per RFECV with Linear Regression and Random Forest Regression, we're ready to tackle the question...

What are the strongest predictors for price?

But first, a quick recap of what we've done since our initial Exploratory Data Analysis:

Prepared Data for ML Algorithms (Removed some rows & columns, imputed missing values, feature scaling, and encoding)
Reduced Dimensionality via ANOVA, then PCA
Algorithm Analysis
- Performed RFECV with a LinearRegression algorithm to obtain a list of 5 components that were the optimum 5 features to use with a Linear Model
- Trained a Random Forest Regressor on our dataset to obtain the top 5 most important components calculated by the Random Forest algorithm.
Now we're comparing these principal components and interpretting them to understand what are important predictors for price.

Below is a table summary showing the 5 components left after performing RFECV and the top 5 most important components per Random Forest Regression

Principal Component	From RFECV	From RF Reg
Principal Component 1	-	x
Principal Component 2	-	x
Principal Component 3	x	x
Principal Component 4	x	-
Principal Component 5	-	x
Principal Component 6	-	x
Principal Component 7	x	-
Principal Component 9	x	-
Principal Component 20	x	-

As you can see, both analyses highlighted principle component 3. Let's examine the features in this component. Additionally, the largest positive coefficient from the Linear Regression used in RFECV is associated with Principal Component 7, and the most important feature per the Random Forest Regressor is Principal Component 2, so we'll take a look at these as well.

def plot_pca_dim(pca, feat_names, dim, features, fontsize=8):
    '''
    For a given fitted PCA object, the given component, and a given number of features, this plots
    the scaled coefficients for this particular component.

    Inputs:
    pca: PCA object
    feat_names: feature names from original pre-PCA dataset
    dim: component that you want to inspect
    features: number of most positive and most negative features to display

    '''

    names_to_weights = sorted(
        zip(pca.components_[dim], feat_names), reverse=False)
    names_to_weights = np.array(names_to_weights)

    nm_to_wt_df = pd.DataFrame(names_to_weights, columns=['coef', 'feature'])
    nm_to_wt_df = pd.concat(
        [nm_to_wt_df.head(features), nm_to_wt_df.tail(features)])
    nm_to_wt_df['coef'] = nm_to_wt_df['coef'].astype(float)

    fig = px.bar(nm_to_wt_df, y='feature', x='coef',
                 color='coef', color_continuous_scale=px.colors.sequential.Sunsetdark,
                 labels={'coef': 'Coefficients', 'feature': ''},
                 height=450, width=800, orientation='h', title='Principal Component '+str(dim+1)+' Coefficients',
                 hover_name='feature')

    fig.update_yaxes(tickfont=dict(size=fontsize))
    fig.update_xaxes(tickfont=dict(size=fontsize))

    return fig

def get_index_where_value(arr, value):
    """
    Returns index of array where that array's value equals input value.

    Inputs...
    arr: input array to search
    value: value to search

    Returns: Index Position or None
    """
    try:
        return np.argwhere(np.where(arr == value, arr, None))[0][0]
    except:
        return None

Component 3¶

fig_3 = plot_pca_dim(pca, f_val_features, 2, 10)
fig_3

This component has a relatively large association with New York and room type, and has negative associations with some amenities, so I interpret this as measuring New York apartment listings as predictors of unit price. Consequently, this punishes non-NY listings beause our feature value for state_NY is negative (due to standardization with StandardScaler) for non-NY listings. This doesn't provide very much value since it's focused on one particular state, and this probably stems from the fact that this dataset is heavily imbalanced in favor of New York listings. Intuitively, it's difficult to interpret the reason this component has such strong negative associations with amenities. Therefore, I'll interpret these as un-important price adjustments that the algorithm learned to help it predict price with more accuracy for NY listings.

Component 7¶

fig_7 = plot_pca_dim(pca, f_val_features, 6, 10)
fig_7

This component has a very strong positive association with one room type Entire home/apt and a very negative association with another Private room. It also is positively associated with accomodates (discrete variable for number of guests), bedrooms, beds, family/kid friendly, and guests_included. Therefore, this component measures the size of listings/units, valuing larger listings higher than smaller listings.

Component 2¶

fig_2 = plot_pca_dim(pca, f_val_features, 1, 10)
fig_2

This component has very uniform coefficients related to amenities. Given the relatively high amenities coefficients, and given these amenities are associated with luxury amenities (wine cellar, sun deck, steam room, rooftop, etc - all things I would expect to find in a very luxurious listing), this component measures luxurious amenities, valuing listings higher if they have these.

Evaluate the Results¶

In sum, by exploring the Airbnb listings datasets for New York, San Francisco, and Austin, we've been able to answer the following questions...

Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better or worse service than those who do not?)
- We measured quality of service by looking at the review ratings. We found that hosts with only one unit were reviewed slightly higher than hosts with more than 1 unit. I used an inferential hypothesis test to see whether the difference was statistically significant, and we found that it was indeed a statistically significant difference. However, we may not be able to generalize these findings since we did not fulfill the independence requirement for the Central Limit Theorem generalizability to hold.
Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
- We discretized days since last review by placing each listing into one of several bins based on their days since last review, and we found (from a glance at side-by-side box plots) that pricing didn't appear to be different from a practical significance perspective between each of these bins. Note that we didn't dive into further statistical inference analysis like we did with our first question, which could have proven statistical significance. Futher, by separating bins on is_superhost, we found that, visually, pricing seemed higher for superhosts than non-superhosts.
What are the strongest predictors for price?
- After proprocessing our data, we were left with over 600 features. We selected the best 30% features based on ANOVA (selected features with highest variance), futher decomposed the features into 110 principal components, then used 1) Recursive Feature Elimination with Cross Validation (RFECV) using a Linear Regression model and 2) Random Forest Regression to figure out which components were most important/significant when predicting price. By analyzing three of these "important" components, we found that the following attributes were mostly captured by these components...
  1. Principal Component 3: New York Apartments
  2. Principal Component 7: Size and capacity of listings
  3. Principal Component 2: Luxurious Amenities
- We can interpret these latent features as being the strongest predictors of price for the algorithms that I used.

	0	1
id	958	5858
listing_url	https://www.airbnb.com/rooms/958	https://www.airbnb.com/rooms/5858
scrape_id	20190602163113	20190602163113
last_scraped	2019-06-02	2019-06-02
name	Bright, Modern Garden Unit - 1BR/1B	Creative Sanctuary
summary	New update: the house next door is under const...	NaN
space	Newly remodeled, modern, and bright garden uni...	We live in a large Victorian house on a quiet ...
description	New update: the house next door is under const...	We live in a large Victorian house on a quiet ...
experiences_offered	none	none
neighborhood_overview	Quiet cul de sac in friendly neighborhood St...	I love how our neighborhood feels quiet but is...
notes	Due to the fact that we have children and a do...	All the furniture in the house was handmade so...
transit	Public Transportation is 1/2 block away. Ce...	The train is two blocks away and you can stop ...
access	*Full access to patio and backyard (shared wit...	Our deck, garden, gourmet kitchen and extensiv...
interaction	A family of 4 lives upstairs with their dog. N...	NaN
house_rules	* No Pets - even visiting guests for a short t...	Please respect the house, the art work, the fu...
thumbnail_url	NaN	NaN
medium_url	NaN	NaN
picture_url	https://a0.muscache.com/im/pictures/b7c2a199-4...	https://a0.muscache.com/im/pictures/17714/3a7a...
xl_picture_url	NaN	NaN
host_id	1169	8904
host_url	https://www.airbnb.com/users/show/1169	https://www.airbnb.com/users/show/8904
host_name	Holly	Philip And Tania
host_since	2008-07-31	2009-03-02
host_location	San Francisco, California, United States	San Francisco, California, United States
host_about	We are a family with 2 boys born in 2009 and 2...	Philip: English transplant to the Bay Area and...
host_response_time	within a few hours	within a day
host_response_rate	100%	100%
host_acceptance_rate	NaN	NaN
host_is_superhost	t	f
host_thumbnail_url	https://a0.muscache.com/im/pictures/efdad96a-3...	https://a0.muscache.com/im/users/8904/profile_...
host_picture_url	https://a0.muscache.com/im/pictures/efdad96a-3...	https://a0.muscache.com/im/users/8904/profile_...
host_neighbourhood	Duboce Triangle	Bernal Heights
host_listings_count	1	2
host_total_listings_count	1	2
host_verifications	['email', 'phone', 'facebook', 'reviews', 'kba']	['email', 'phone', 'reviews', 'kba', 'work_ema...
host_has_profile_pic	t	t
host_identity_verified	t	t
street	San Francisco, CA, United States	San Francisco, CA, United States
neighbourhood	Lower Haight	Bernal Heights
neighbourhood_cleansed	Western Addition	Bernal Heights
neighbourhood_group_cleansed	NaN	NaN
city	San Francisco	San Francisco
state	CA	CA
zipcode	94117	94110
market	San Francisco	San Francisco
smart_location	San Francisco, CA	San Francisco, CA
country_code	US	US
country	United States	United States
latitude	37.7693	37.7451
longitude	-122.434	-122.421
is_location_exact	t	t
property_type	Apartment	Apartment
room_type	Entire home/apt	Entire home/apt
accommodates	3	5
bathrooms	1	1
bedrooms	1	2
beds	2	3
bed_type	Real Bed	Real Bed
amenities	{TV,"Cable TV",Internet,Wifi,Kitchen,"Pets liv...	{Internet,Wifi,Kitchen,Heating,"Family/kid fri...
square_feet	NaN	NaN
price	$170.00	$235.00
weekly_price	$1,120.00	$1,600.00
monthly_price	$4,200.00	$5,500.00
security_deposit	$100.00	NaN
cleaning_fee	$100.00	$100.00
guests_included	2	2
extra_people	$25.00	$0.00
minimum_nights	1	30
maximum_nights	30	60
minimum_minimum_nights	1	30
maximum_minimum_nights	1	30
minimum_maximum_nights	30	60
maximum_maximum_nights	30	60
minimum_nights_avg_ntm	1	30
maximum_nights_avg_ntm	30	60
calendar_updated	2 months ago	3 days ago
has_availability	t	t
availability_30	1	0
availability_60	3	0
availability_90	7	0
availability_365	92	0
calendar_last_scraped	2019-06-02	2019-06-02
number_of_reviews	192	111
number_of_reviews_ltm	56	0
first_review	2009-07-23	2009-05-03
last_review	2019-05-16	2017-08-06
review_scores_rating	97	98
review_scores_accuracy	10	10
review_scores_cleanliness	10	10
review_scores_checkin	10	10
review_scores_communication	10	10
review_scores_location	10	10
review_scores_value	9	9
requires_license	t	t
license	STR-0001256	NaN
jurisdiction_names	{"SAN FRANCISCO"}	{"SAN FRANCISCO"}
instant_bookable	t	f
is_business_travel_ready	f	f
cancellation_policy	moderate	strict_14_with_grace_period
require_guest_profile_picture	f	f
require_guest_phone_verification	f	f
calculated_host_listings_count	1	1
calculated_host_listings_count_entire_homes	1	1
calculated_host_listings_count_private_rooms	0	0
calculated_host_listings_count_shared_rooms	0	0
reviews_per_month	1.6	0.9

	0	1	2	3	4
host_acceptance_rate	NaN	NaN	NaN	NaN	NaN
host_listings_count	1.00000	2.00000	10.00000	10.00000	2.00000
host_total_listings_count	1.00000	2.00000	10.00000	10.00000	2.00000
latitude	37.76931	37.74511	37.76669	37.76487	37.77525
longitude	-122.43386	-122.42102	-122.45250	-122.45183	-122.43637
accommodates	3.00000	5.00000	2.00000	2.00000	5.00000
bathrooms	1.00000	1.00000	4.00000	4.00000	1.50000
bedrooms	1.00000	2.00000	1.00000	1.00000	2.00000
beds	2.00000	3.00000	1.00000	1.00000	2.00000
square_feet	NaN	NaN	NaN	NaN	NaN
guests_included	2.00000	2.00000	1.00000	1.00000	2.00000
minimum_nights	1.00000	30.00000	32.00000	32.00000	4.00000
maximum_nights	30.00000	60.00000	60.00000	90.00000	1125.00000
minimum_minimum_nights	1.00000	30.00000	32.00000	32.00000	4.00000
maximum_minimum_nights	1.00000	30.00000	32.00000	32.00000	4.00000
minimum_maximum_nights	30.00000	60.00000	60.00000	90.00000	1125.00000
maximum_maximum_nights	30.00000	60.00000	60.00000	90.00000	1125.00000
minimum_nights_avg_ntm	1.00000	30.00000	32.00000	32.00000	4.00000
maximum_nights_avg_ntm	30.00000	60.00000	60.00000	90.00000	1125.00000
availability_30	1.00000	0.00000	30.00000	11.00000	30.00000
availability_60	3.00000	0.00000	60.00000	41.00000	60.00000
availability_90	7.00000	0.00000	90.00000	71.00000	90.00000
availability_365	92.00000	0.00000	365.00000	346.00000	90.00000
number_of_reviews	192.00000	111.00000	17.00000	8.00000	27.00000
number_of_reviews_ltm	56.00000	0.00000	0.00000	1.00000	1.00000
review_scores_rating	97.00000	98.00000	85.00000	93.00000	97.00000
review_scores_accuracy	10.00000	10.00000	8.00000	9.00000	10.00000
review_scores_cleanliness	10.00000	10.00000	8.00000	9.00000	10.00000
review_scores_checkin	10.00000	10.00000	9.00000	10.00000	10.00000
review_scores_communication	10.00000	10.00000	9.00000	10.00000	10.00000
review_scores_location	10.00000	10.00000	9.00000	9.00000	10.00000
review_scores_value	9.00000	9.00000	8.00000	9.00000	9.00000
calculated_host_listings_count	1.00000	1.00000	9.00000	9.00000	2.00000
calculated_host_listings_count_entire_homes	1.00000	1.00000	0.00000	0.00000	2.00000
calculated_host_listings_count_private_rooms	0.00000	0.00000	9.00000	9.00000	0.00000
calculated_host_listings_count_shared_rooms	0.00000	0.00000	0.00000	0.00000	0.00000
reviews_per_month	1.60000	0.90000	0.14000	0.14000	0.23000

	0	1	2	3	4
experiences_offered	none	none	none	none	none
host_since	2008-07-31	2009-03-02	2009-06-17	2009-06-17	2009-07-02
host_location	San Francisco, California, United States	San Francisco, California, United States	San Francisco, California, United States	San Francisco, California, United States	San Francisco, California, United States
host_response_time	within a few hours	within a day	within a few hours	within a few hours	within an hour
host_response_rate	100%	100%	100%	100%	100%
host_is_superhost	t	f	f	f	f
host_neighbourhood	Duboce Triangle	Bernal Heights	Cole Valley	Cole Valley	Alamo Square
host_verifications	['email', 'phone', 'facebook', 'reviews', 'kba']	['email', 'phone', 'reviews', 'kba', 'work_ema...	['email', 'phone', 'reviews', 'jumio', 'govern...	['email', 'phone', 'reviews', 'jumio', 'govern...	['email', 'phone', 'reviews', 'kba']
host_has_profile_pic	t	t	t	t	t
host_identity_verified	t	t	t	t	t
street	San Francisco, CA, United States	San Francisco, CA, United States	San Francisco, CA, United States	San Francisco, CA, United States	San Francisco, CA, United States
neighbourhood	Lower Haight	Bernal Heights	Cole Valley	Cole Valley	Alamo Square
neighbourhood_cleansed	Western Addition	Bernal Heights	Haight Ashbury	Haight Ashbury	Western Addition
neighbourhood_group_cleansed	NaN	NaN	NaN	NaN	NaN
city	San Francisco	San Francisco	San Francisco	San Francisco	San Francisco
state	CA	CA	CA	CA	CA
market	San Francisco	San Francisco	San Francisco	San Francisco	San Francisco
smart_location	San Francisco, CA	San Francisco, CA	San Francisco, CA	San Francisco, CA	San Francisco, CA
country_code	US	US	US	US	US
country	United States	United States	United States	United States	United States
is_location_exact	t	t	t	t	t
property_type	Apartment	Apartment	Apartment	Apartment	House
room_type	Entire home/apt	Entire home/apt	Private room	Private room	Entire home/apt
bed_type	Real Bed	Real Bed	Real Bed	Real Bed	Real Bed
amenities	{TV,"Cable TV",Internet,Wifi,Kitchen,"Pets liv...	{Internet,Wifi,Kitchen,Heating,"Family/kid fri...	{TV,Internet,Wifi,Kitchen,"Free street parking...	{TV,Internet,Wifi,Kitchen,"Free street parking...	{TV,Internet,Wifi,Kitchen,Heating,"Family/kid ...
price	$170.00	$235.00	$65.00	$65.00	$685.00
weekly_price	$1,120.00	$1,600.00	$485.00	$490.00	NaN
monthly_price	$4,200.00	$5,500.00	$1,685.00	$1,685.00	NaN
security_deposit	$100.00	NaN	$200.00	$200.00	$0.00
cleaning_fee	$100.00	$100.00	$50.00	$50.00	$225.00
extra_people	$25.00	$0.00	$12.00	$12.00	$150.00
has_availability	t	t	t	t	t
first_review	2009-07-23	2009-05-03	2009-08-31	2014-09-08	2009-09-25
last_review	2019-05-16	2017-08-06	2016-11-21	2018-09-12	2018-08-11
requires_license	t	t	t	t	t
jurisdiction_names	{"SAN FRANCISCO"}	{"SAN FRANCISCO"}	{"SAN FRANCISCO"}	{"SAN FRANCISCO"}	{"SAN FRANCISCO"}
instant_bookable	t	f	f	f	f
is_business_travel_ready	f	f	f	f	f
cancellation_policy	moderate	strict_14_with_grace_period	strict_14_with_grace_period	strict_14_with_grace_period	moderate
require_guest_profile_picture	f	f	f	f	t
require_guest_phone_verification	f	f	f	f	t

	host_response_rate	price	weekly_price	monthly_price	security_deposit	cleaning_fee	extra_people
0	100%	$170.00	$1,120.00	$4,200.00	$100.00	$100.00	$25.00
1	100%	$235.00	$1,600.00	$5,500.00	NaN	$100.00	$0.00

	host_response_rate	price	weekly_price	monthly_price	security_deposit	cleaning_fee	extra_people
count	47511.000000	68167.000000	8364.000000	7392.000000	45780.000000	54230.000000	68167.000000
mean	93.684620	191.974768	1030.642994	3200.006899	314.606422	74.611304	14.484384
std	15.894384	423.200385	1025.214473	2793.350228	569.540004	70.839249	25.595083

	host_acceptance_rate	host_listings_count	host_total_listings_count	latitude	longitude	accommodates	bathrooms	bedrooms	beds	square_feet	...	reviews	work_email	last_review_discrete
0	NaN	1.0	1.0	37.76931	-122.43386	3	1.0	1.0	2.0	NaN	...	1	0	last 90 days
1	NaN	2.0	2.0	37.74511	-122.42102	5	1.0	2.0	3.0	NaN	...	1	1	last 2 years
2	NaN	10.0	10.0	37.76669	-122.45250	2	4.0	1.0	1.0	NaN	...	1	0	more than 2 years
3	NaN	10.0	10.0	37.76487	-122.45183	2	4.0	1.0	1.0	NaN	...	1	0	last year
4	NaN	2.0	2.0	37.77525	-122.43637	5	1.5	2.0	2.0	NaN	...	1	0	last year

cat_num_units	1	2-10	11-50	51-100	101+
review_scores_rating	95.1033	94.3074	91.2667	91.5408	93.6839
review_scores_cleanliness	9.39855	9.38816	9.14122	9.31654	9.54651
review_scores_checkin	9.81121	9.78747	9.63401	9.5491	9.47831
review_scores_communication	9.83415	9.78231	9.53695	9.53428	9.45503
review_scores_location	9.64073	9.5636	9.57101	9.47158	9.66455
review_scores_value	9.51381	9.42233	9.05917	8.92626	9.06561

	last_review_discrete	last_review_days
0	last 90 days	61.0
1	last 2 years	709.0
2	more than 2 years	967.0
3	last year	307.0
4	last year	339.0
5	last 90 days	58.0
6	last 90 days	56.0
7	last 90 days	45.0
8	last 90 days	60.0
9	last year	227.0
10	last year	325.0
11	last year	213.0
12	last year	299.0
13	more than 2 years	1367.0
14	last 90 days	62.0
15	last year	240.0
16	last 90 days	74.0
17	more than 2 years	1020.0
18	last year	197.0
19	last 90 days	58.0
20	last 180 days	161.0
21	last 2 years	699.0
22	last 90 days	53.0
23	last 90 days	55.0
24	last 90 days	65.0
25	last 90 days	72.0
26	last 90 days	51.0
27	last 90 days	48.0
28	last 90 days	77.0
29	last 90 days	47.0
30	last 90 days	62.0
31	last 180 days	115.0
32	last 180 days	113.0
33	last 2 years	472.0
34	last year	289.0
35	last year	191.0
36	last 90 days	46.0
37	last 2 years	411.0
38	last 2 years	690.0
39	last 2 years	710.0
40	last 180 days	119.0
41	last year	185.0
42	last 90 days	64.0
43	last 90 days	51.0
44	last 90 days	75.0
45	last 2 years	472.0
46	more than 2 years	775.0
47	more than 2 years	1196.0
48	last 2 years	695.0
49	last 2 years	367.0
50	more than 2 years	3207.0
51	last 90 days	57.0
52	last 90 days	49.0
...	...	...
68115	NaN	NaN
68116	NaN	NaN
68117	NaN	NaN
68118	NaN	NaN
68119	NaN	NaN
68120	NaN	NaN
68121	NaN	NaN
68122	NaN	NaN
68123	NaN	NaN
68124	NaN	NaN
68125	NaN	NaN
68126	last 90 days	65.0
68127	NaN	NaN
68128	NaN	NaN
68129	NaN	NaN
68130	NaN	NaN
68131	NaN	NaN
68132	NaN	NaN
68133	NaN	NaN
68134	NaN	NaN
68135	NaN	NaN
68136	NaN	NaN
68137	NaN	NaN
68138	NaN	NaN
68139	NaN	NaN
68140	NaN	NaN
68141	NaN	NaN
68142	NaN	NaN
68143	NaN	NaN
68144	NaN	NaN
68145	NaN	NaN
68146	NaN	NaN
68147	NaN	NaN
68148	NaN	NaN
68149	NaN	NaN
68150	NaN	NaN
68151	NaN	NaN
68152	NaN	NaN
68153	NaN	NaN
68154	NaN	NaN
68155	NaN	NaN
68156	NaN	NaN
68157	NaN	NaN
68158	NaN	NaN
68159	NaN	NaN
68160	NaN	NaN
68161	NaN	NaN
68162	NaN	NaN
68163	NaN	NaN
68164	NaN	NaN
68165	NaN	NaN
68166	NaN	NaN
68167	NaN	NaN

	review_scores_rating	review_scores_accuracy	review_scores_cleanliness	review_scores_checkin	review_scores_communication	review_scores_location	review_scores_value	reviews_per_month	first_review_days	last_review_days
0	97.0	10.0	10.0	10.0	10.0	10.0	9.0	1.60	3645.0	61.0
1	98.0	10.0	10.0	10.0	10.0	10.0	9.0	0.90	3726.0	709.0
2	85.0	8.0	8.0	9.0	9.0	9.0	8.0	0.14	3606.0	967.0
3	93.0	9.0	9.0	10.0	10.0	9.0	9.0	0.14	1772.0	307.0
4	97.0	10.0	10.0	10.0	10.0	10.0	9.0	0.23	3581.0	339.0

	state	neighbourhood	smart_location
8548	NaN	Ozone Park	NY,
44558	NaN	Harlem	New York,
47152	NaN	Manhattan	US,
56454	NaN	Travis Heights	US,

	accommodates	bathrooms	bedrooms	beds	guests_included	review_scores_rating	price	cleaning_fee
0	3.0	1.0	1.0	2.0	2.0	97.0	170.0	100.0
1	5.0	1.0	2.0	3.0	2.0	98.0	235.0	100.0
2	2.0	4.0	1.0	1.0	1.0	85.0	65.0	50.0
3	2.0	4.0	1.0	1.0	1.0	93.0	65.0	50.0
4	5.0	1.5	2.0	2.0	2.0	97.0	685.0	225.0

Exploring Airbnb Listings Datasets¶

Table of Contents¶

Business Understanding¶

Data Understanding¶

Preparing Data for Exploration¶

Duplicate Value Check¶

Separate Data Types¶

Numerical¶

Categorical¶

Categorical to Numerical¶

Binary Categorical¶

Nominal¶

Redundant Columns¶

Multiple Values in One Column¶

Date/Time Feature Engineering¶

Exploratory Data Analysis¶

Exploratory Question 1¶

Conclusion¶

Hypothesis Testing¶

Exploratory Question 2¶

Conclusion¶

Exploratory Question 3¶

Preparing Data for ML¶

Assessing Missing Data¶

Per Column¶

Per Row¶

Bathrooms, Bedrooms, and Beds¶

Feature Scaling (For IterativeImputer)¶

Splitting Data¶

Encoding Categorical Values¶

Feature Scaling (All)¶

Feature Selection / Dimensionality Reduction¶

ANOVA¶

PCA¶

Data Modeling¶

Recursive Feature Elimination with Cross Validation¶

Revisiting Exploratory Question 3¶

Component 3¶

Component 7¶

Component 2¶

Evaluate the Results¶

Feature Scaling (For `IterativeImputer`)¶