Exploring Airbnb Listings Datasets

Accompanying Medium Blogpost

Table of Contents

  1. Business Understanding
  2. Data Understanding
    1. Preparing Data for Exploration
    2. Exploratory Data Analysis
  3. Preparing Data for ML Algorithms
    1. Dealing with Missing Data
    2. Feature Scaling
    3. Encoding Categorical Data
    4. Feature Selection & Transformation
  4. Data Modeling
  5. Evaluate Results

Business Understanding

Airbnb is the largest hospitality company in the world, yet it owns almost none of the lodging it offers to customers. It has taken the world by storm, providing many different kinds of lodging options to people everywhere.

How is this made possible? By harnessing data.

Airbnb uses data to determine which listings to show to customers, how to price its listings, which Airbnb experiences users are most likely to book, and much more.

I'd like to understand what kinds of insights we can draw from listings data.

  1. Which hosts give better service than others?
  2. Do guest reviews affect pricing?
  3. What kinds of features are the strongest predictors of price?

Let's see what we can find.


Data Understanding

Datasets found here.

These datasets were scraped by InsideAirbnb from the Airbnb website. They are a snapshot of listings from San Francisco, CA; New York, NY; and Austin, TX from earlier this year.

Below, I will import all of the libraries I will use, understand how this data is organized, clean it, and explore it.

Let's see if we can answer the following questions...

  1. Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than hosts with a single unit?)
  2. Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
  3. What are the strongest predictors of price?
In [1]:
from sklearn.feature_selection import RFECV, f_classif, SelectPercentile
from sklearn.preprocessing import MultiLabelBinarizer, StandardScaler
from category_encoders import OneHotEncoder
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (must be imported before IterativeImputer)
from sklearn.impute import IterativeImputer

import plotly.graph_objects as go
import plotly_express as px
import seaborn as sns
import datetime
import pandas as pd
import numpy as np
from scipy.stats import ttest_ind, t
import scipy.stats as stats
import math
import collections

import matplotlib.pyplot as plt
%matplotlib inline
In [2]:
# low_memory=False avoids the mixed-dtype DtypeWarning on these wide CSVs
sf = pd.read_csv('sf_listings.csv', low_memory=False)
ny = pd.read_csv('ny_listings.csv', low_memory=False)
aus = pd.read_csv('aus_listings.csv', low_memory=False)

In [3]:
pd.options.display.max_rows = 106
In [4]:
sf.head(2).transpose()
Out[4]:
0 1
id 958 5858
listing_url https://www.airbnb.com/rooms/958 https://www.airbnb.com/rooms/5858
scrape_id 20190602163113 20190602163113
last_scraped 2019-06-02 2019-06-02
name Bright, Modern Garden Unit - 1BR/1B Creative Sanctuary
summary New update: the house next door is under const... NaN
space Newly remodeled, modern, and bright garden uni... We live in a large Victorian house on a quiet ...
description New update: the house next door is under const... We live in a large Victorian house on a quiet ...
experiences_offered none none
neighborhood_overview *Quiet cul de sac in friendly neighborhood *St... I love how our neighborhood feels quiet but is...
notes Due to the fact that we have children and a do... All the furniture in the house was handmade so...
transit *Public Transportation is 1/2 block away. *Ce... The train is two blocks away and you can stop ...
access *Full access to patio and backyard (shared wit... Our deck, garden, gourmet kitchen and extensiv...
interaction A family of 4 lives upstairs with their dog. N... NaN
house_rules * No Pets - even visiting guests for a short t... Please respect the house, the art work, the fu...
thumbnail_url NaN NaN
medium_url NaN NaN
picture_url https://a0.muscache.com/im/pictures/b7c2a199-4... https://a0.muscache.com/im/pictures/17714/3a7a...
xl_picture_url NaN NaN
host_id 1169 8904
host_url https://www.airbnb.com/users/show/1169 https://www.airbnb.com/users/show/8904
host_name Holly Philip And Tania
host_since 2008-07-31 2009-03-02
host_location San Francisco, California, United States San Francisco, California, United States
host_about We are a family with 2 boys born in 2009 and 2... Philip: English transplant to the Bay Area and...
host_response_time within a few hours within a day
host_response_rate 100% 100%
host_acceptance_rate NaN NaN
host_is_superhost t f
host_thumbnail_url https://a0.muscache.com/im/pictures/efdad96a-3... https://a0.muscache.com/im/users/8904/profile_...
host_picture_url https://a0.muscache.com/im/pictures/efdad96a-3... https://a0.muscache.com/im/users/8904/profile_...
host_neighbourhood Duboce Triangle Bernal Heights
host_listings_count 1 2
host_total_listings_count 1 2
host_verifications ['email', 'phone', 'facebook', 'reviews', 'kba'] ['email', 'phone', 'reviews', 'kba', 'work_ema...
host_has_profile_pic t t
host_identity_verified t t
street San Francisco, CA, United States San Francisco, CA, United States
neighbourhood Lower Haight Bernal Heights
neighbourhood_cleansed Western Addition Bernal Heights
neighbourhood_group_cleansed NaN NaN
city San Francisco San Francisco
state CA CA
zipcode 94117 94110
market San Francisco San Francisco
smart_location San Francisco, CA San Francisco, CA
country_code US US
country United States United States
latitude 37.7693 37.7451
longitude -122.434 -122.421
is_location_exact t t
property_type Apartment Apartment
room_type Entire home/apt Entire home/apt
accommodates 3 5
bathrooms 1 1
bedrooms 1 2
beds 2 3
bed_type Real Bed Real Bed
amenities {TV,"Cable TV",Internet,Wifi,Kitchen,"Pets liv... {Internet,Wifi,Kitchen,Heating,"Family/kid fri...
square_feet NaN NaN
price $170.00 $235.00
weekly_price $1,120.00 $1,600.00
monthly_price $4,200.00 $5,500.00
security_deposit $100.00 NaN
cleaning_fee $100.00 $100.00
guests_included 2 2
extra_people $25.00 $0.00
minimum_nights 1 30
maximum_nights 30 60
minimum_minimum_nights 1 30
maximum_minimum_nights 1 30
minimum_maximum_nights 30 60
maximum_maximum_nights 30 60
minimum_nights_avg_ntm 1 30
maximum_nights_avg_ntm 30 60
calendar_updated 2 months ago 3 days ago
has_availability t t
availability_30 1 0
availability_60 3 0
availability_90 7 0
availability_365 92 0
calendar_last_scraped 2019-06-02 2019-06-02
number_of_reviews 192 111
number_of_reviews_ltm 56 0
first_review 2009-07-23 2009-05-03
last_review 2019-05-16 2017-08-06
review_scores_rating 97 98
review_scores_accuracy 10 10
review_scores_cleanliness 10 10
review_scores_checkin 10 10
review_scores_communication 10 10
review_scores_location 10 10
review_scores_value 9 9
requires_license t t
license STR-0001256 NaN
jurisdiction_names {"SAN FRANCISCO"} {"SAN FRANCISCO"}
instant_bookable t f
is_business_travel_ready f f
cancellation_policy moderate strict_14_with_grace_period
require_guest_profile_picture f f
require_guest_phone_verification f f
calculated_host_listings_count 1 1
calculated_host_listings_count_entire_homes 1 1
calculated_host_listings_count_private_rooms 0 0
calculated_host_listings_count_shared_rooms 0 0
reviews_per_month 1.6 0.9

I skimmed through the features above to filter out the columns I don't need.

In [5]:
drop = ['listing_url', 'scrape_id', 'last_scraped', 'summary', 'space', 'description', 'neighborhood_overview',
        'notes', 'transit', 'access', 'interaction', 'house_rules', 'thumbnail_url', 'medium_url', 'picture_url',
        'xl_picture_url', 'host_id', 'host_url', 'host_about', 'host_thumbnail_url', 'host_picture_url',
        'calendar_updated', 'calendar_last_scraped', 'license', 'name', 'host_name', 'zipcode', 'id']  
In [6]:
# Dropping all the columns I don't want
sf = sf.drop(drop, axis=1)
ny = ny.drop(drop, axis=1)
aus = aus.drop(drop, axis=1)
In [7]:
df = pd.concat([sf, ny, aus], ignore_index=True)
In [8]:
# Sanity checking rows all came in
df.shape[0] == sf.shape[0]+ny.shape[0]+aus.shape[0]
Out[8]:
True

Preparing Data for Exploration

Here we need to clean the data enough for Exploratory Data Analysis.

Duplicate Value Check

In [9]:
df.shape
Out[9]:
(68168, 78)
In [10]:
df = df.drop_duplicates()
In [11]:
df.shape
Out[11]:
(68167, 78)

1 duplicate dropped.

Separate Data Types

In [12]:
cat_df = df.select_dtypes(include=['object', 'bool'])
cat_cols = df.select_dtypes(include=['object', 'bool']).columns.values
In [13]:
num_df = df.select_dtypes(include=['int', 'float'])
num_cols = df.select_dtypes(include=['int', 'float']).columns.values
In [14]:
# Sanity check that we have all our features accounted for
len(num_cols)+len(cat_cols) == df.shape[1]
Out[14]:
True

So now we need to account for all of the following data types...

  1. Numerical
    • Continuous
    • Discrete
  2. Categorical
    • Binary
    • Ordinal
    • Nominal
  3. Mixed Type
  4. Time
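
As a quick reference, here's a minimal sketch (not part of the original run) of how mixed the combined frame still is:

# Pandas dtype breakdown of the combined listings frame
df.dtypes.value_counts()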

Numerical

In [15]:
num_df.head().transpose()
Out[15]:
0 1 2 3 4
host_acceptance_rate NaN NaN NaN NaN NaN
host_listings_count 1.00000 2.00000 10.00000 10.00000 2.00000
host_total_listings_count 1.00000 2.00000 10.00000 10.00000 2.00000
latitude 37.76931 37.74511 37.76669 37.76487 37.77525
longitude -122.43386 -122.42102 -122.45250 -122.45183 -122.43637
accommodates 3.00000 5.00000 2.00000 2.00000 5.00000
bathrooms 1.00000 1.00000 4.00000 4.00000 1.50000
bedrooms 1.00000 2.00000 1.00000 1.00000 2.00000
beds 2.00000 3.00000 1.00000 1.00000 2.00000
square_feet NaN NaN NaN NaN NaN
guests_included 2.00000 2.00000 1.00000 1.00000 2.00000
minimum_nights 1.00000 30.00000 32.00000 32.00000 4.00000
maximum_nights 30.00000 60.00000 60.00000 90.00000 1125.00000
minimum_minimum_nights 1.00000 30.00000 32.00000 32.00000 4.00000
maximum_minimum_nights 1.00000 30.00000 32.00000 32.00000 4.00000
minimum_maximum_nights 30.00000 60.00000 60.00000 90.00000 1125.00000
maximum_maximum_nights 30.00000 60.00000 60.00000 90.00000 1125.00000
minimum_nights_avg_ntm 1.00000 30.00000 32.00000 32.00000 4.00000
maximum_nights_avg_ntm 30.00000 60.00000 60.00000 90.00000 1125.00000
availability_30 1.00000 0.00000 30.00000 11.00000 30.00000
availability_60 3.00000 0.00000 60.00000 41.00000 60.00000
availability_90 7.00000 0.00000 90.00000 71.00000 90.00000
availability_365 92.00000 0.00000 365.00000 346.00000 90.00000
number_of_reviews 192.00000 111.00000 17.00000 8.00000 27.00000
number_of_reviews_ltm 56.00000 0.00000 0.00000 1.00000 1.00000
review_scores_rating 97.00000 98.00000 85.00000 93.00000 97.00000
review_scores_accuracy 10.00000 10.00000 8.00000 9.00000 10.00000
review_scores_cleanliness 10.00000 10.00000 8.00000 9.00000 10.00000
review_scores_checkin 10.00000 10.00000 9.00000 10.00000 10.00000
review_scores_communication 10.00000 10.00000 9.00000 10.00000 10.00000
review_scores_location 10.00000 10.00000 9.00000 9.00000 10.00000
review_scores_value 9.00000 9.00000 8.00000 9.00000 9.00000
calculated_host_listings_count 1.00000 1.00000 9.00000 9.00000 2.00000
calculated_host_listings_count_entire_homes 1.00000 1.00000 0.00000 0.00000 2.00000
calculated_host_listings_count_private_rooms 0.00000 0.00000 9.00000 9.00000 0.00000
calculated_host_listings_count_shared_rooms 0.00000 0.00000 0.00000 0.00000 0.00000
reviews_per_month 1.60000 0.90000 0.14000 0.14000 0.23000

From what I can tell, all the numerical columns look good except host_acceptance_rate and square_feet since there are missing values in the first few rows. We'll take a look at NA values later.
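
As a quick preview (a minimal sketch; this cell isn't part of the original run), we can rank the numerical columns by their share of missing values:

# Share of missing values per numerical column, worst offenders first
num_df.isnull().mean().sort_values(ascending=False).head(10)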

Categorical

  • Binary
  • Ordinal
  • Nominal
In [16]:
cat_df.head().transpose()
Out[16]:
0 1 2 3 4
experiences_offered none none none none none
host_since 2008-07-31 2009-03-02 2009-06-17 2009-06-17 2009-07-02
host_location San Francisco, California, United States San Francisco, California, United States San Francisco, California, United States San Francisco, California, United States San Francisco, California, United States
host_response_time within a few hours within a day within a few hours within a few hours within an hour
host_response_rate 100% 100% 100% 100% 100%
host_is_superhost t f f f f
host_neighbourhood Duboce Triangle Bernal Heights Cole Valley Cole Valley Alamo Square
host_verifications ['email', 'phone', 'facebook', 'reviews', 'kba'] ['email', 'phone', 'reviews', 'kba', 'work_ema... ['email', 'phone', 'reviews', 'jumio', 'govern... ['email', 'phone', 'reviews', 'jumio', 'govern... ['email', 'phone', 'reviews', 'kba']
host_has_profile_pic t t t t t
host_identity_verified t t t t t
street San Francisco, CA, United States San Francisco, CA, United States San Francisco, CA, United States San Francisco, CA, United States San Francisco, CA, United States
neighbourhood Lower Haight Bernal Heights Cole Valley Cole Valley Alamo Square
neighbourhood_cleansed Western Addition Bernal Heights Haight Ashbury Haight Ashbury Western Addition
neighbourhood_group_cleansed NaN NaN NaN NaN NaN
city San Francisco San Francisco San Francisco San Francisco San Francisco
state CA CA CA CA CA
market San Francisco San Francisco San Francisco San Francisco San Francisco
smart_location San Francisco, CA San Francisco, CA San Francisco, CA San Francisco, CA San Francisco, CA
country_code US US US US US
country United States United States United States United States United States
is_location_exact t t t t t
property_type Apartment Apartment Apartment Apartment House
room_type Entire home/apt Entire home/apt Private room Private room Entire home/apt
bed_type Real Bed Real Bed Real Bed Real Bed Real Bed
amenities {TV,"Cable TV",Internet,Wifi,Kitchen,"Pets liv... {Internet,Wifi,Kitchen,Heating,"Family/kid fri... {TV,Internet,Wifi,Kitchen,"Free street parking... {TV,Internet,Wifi,Kitchen,"Free street parking... {TV,Internet,Wifi,Kitchen,Heating,"Family/kid ...
price $170.00 $235.00 $65.00 $65.00 $685.00
weekly_price $1,120.00 $1,600.00 $485.00 $490.00 NaN
monthly_price $4,200.00 $5,500.00 $1,685.00 $1,685.00 NaN
security_deposit $100.00 NaN $200.00 $200.00 $0.00
cleaning_fee $100.00 $100.00 $50.00 $50.00 $225.00
extra_people $25.00 $0.00 $12.00 $12.00 $150.00
has_availability t t t t t
first_review 2009-07-23 2009-05-03 2009-08-31 2014-09-08 2009-09-25
last_review 2019-05-16 2017-08-06 2016-11-21 2018-09-12 2018-08-11
requires_license t t t t t
jurisdiction_names {"SAN FRANCISCO"} {"SAN FRANCISCO"} {"SAN FRANCISCO"} {"SAN FRANCISCO"} {"SAN FRANCISCO"}
instant_bookable t f f f f
is_business_travel_ready f f f f f
cancellation_policy moderate strict_14_with_grace_period strict_14_with_grace_period strict_14_with_grace_period moderate
require_guest_profile_picture f f f f t
require_guest_phone_verification f f f f t

We have a lot more work to do for these categorical columns.

  • Numerical
    • Convert some of these to numerical columns (e.g. prices and response rate)
  • Binary Categorical
    • Binarize the binary/boolean columns
  • Ordinal
    • The only ordinal-looking column I see is host_response_time, but I'm going to treat it as nominal (see the sketch after this list)
  • Nominal
    • Several features look redundant and should be explored (e.g. city, market, smart_location, jurisdiction_names)
    • host_verifications and amenities pack multiple items into one cell and need to be expanded into their own columns
  • Date/Time
    • Engineer features from the date columns
      • Convert each date into a "days since" feature
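
For reference, if we did want to treat host_response_time as ordinal, a minimal sketch might look like this (the category order is my assumption from the labels; this isn't used below):

# Hypothetical ordinal encoding for host_response_time (not applied to cat_df)
response_order = {'within an hour': 0, 'within a few hours': 1,
                  'within a day': 2, 'a few days or more': 3}
host_response_time_ord = cat_df['host_response_time'].map(response_order)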

Categorical to Numerical

In [17]:
cat_to_num = ['host_response_rate', 'price', 'weekly_price',
              'monthly_price', 'security_deposit', 'cleaning_fee', 'extra_people']
prices = ['price', 'weekly_price', 'monthly_price',
          'security_deposit', 'cleaning_fee', 'extra_people']
In [18]:
cat_to_num_df = cat_df[cat_to_num].copy()  # .copy() so later assignments don't trigger SettingWithCopyWarning
cat_to_num_df.head(2)
Out[18]:
host_response_rate price weekly_price monthly_price security_deposit cleaning_fee extra_people
0 100% $170.00 $1,120.00 $4,200.00 $100.00 $100.00 $25.00
1 100% $235.00 $1,600.00 $5,500.00 NaN $100.00 $0.00
In [19]:
# Converting the response rate strings (e.g. "100%") to floats
cat_to_num_df['host_response_rate'] = (
    cat_to_num_df['host_response_rate'].str.rstrip('%').astype(float)
)
In [20]:
# Removing dollar signs and commas, then converting to float
for col in prices:
    cat_to_num_df[col] = (cat_to_num_df[col]
                          .str.replace('$', '', regex=False)
                          .str.replace(',', '', regex=False)
                          .astype(float))
In [21]:
cat_to_num_df.head(1)
Out[21]:
host_response_rate price weekly_price monthly_price security_deposit cleaning_fee extra_people
0 100.0 170.0 1120.0 4200.0 100.0 100.0 25.0
In [22]:
cat_to_num_df.describe().head(3)
Out[22]:
host_response_rate price weekly_price monthly_price security_deposit cleaning_fee extra_people
count 47511.000000 68167.000000 8364.000000 7392.000000 45780.000000 54230.000000 68167.000000
mean 93.684620 191.974768 1030.642994 3200.006899 314.606422 74.611304 14.484384
std 15.894384 423.200385 1025.214473 2793.350228 569.540004 70.839249 25.595083

These columns look good now.

Appending the cat_to_num_df to the num_df.

In [23]:
num_df = pd.concat([num_df, cat_to_num_df], axis=1)
In [24]:
num_cols = num_df.columns.values
In [25]:
cat_df = cat_df.drop(cat_to_num, axis=1)
cat_cols = cat_df.columns.values

Binary Categorical

In [26]:
bi_cols = []
for col in cat_df.columns:
    if cat_df[col].nunique() == 2:
        bi_cols.append(col)
In [27]:
cat_df[bi_cols].head()
Out[27]:
host_is_superhost host_has_profile_pic host_identity_verified is_location_exact requires_license instant_bookable require_guest_profile_picture require_guest_phone_verification
0 t t t t t t f f
1 f t t t t f f f
2 f t t t t f f f
3 f t t t t f f f
4 f t t t t f t t
In [28]:
for col in bi_cols:
    cat_df[col] = cat_df[col].map({'f': 0, 't': 1})

Nominal

Redundant Columns

Took a look at these columns since they looked redundant.

In [29]:
redundant_cols = ['city', 'market',
                  'jurisdiction_names', 'host_location', 'street']

Removing the following features (see the quick cardinality check after this list)...

  • market: redundant to smart_location, which is more detailed
  • city: redundant to smart_location
  • jurisdiction_names: redundant to smart_location
  • host_location: 1,862 unique values, too many to be useful
  • street: redundant to smart_location
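
As a quick check on these redundancy claims, a minimal sketch (not part of the original run) compares cardinalities against the column we're keeping:

# Unique-value counts for the redundant candidates vs. smart_location
cat_df[redundant_cols + ['smart_location']].nunique()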
In [30]:
cat_df = cat_df.drop(redundant_cols, axis=1)
Multiple Values in One Column
  • host_verifications
  • amenities
In [31]:
cat_df.head()
Out[31]:
experiences_offered host_since host_response_time host_is_superhost host_neighbourhood host_verifications host_has_profile_pic host_identity_verified neighbourhood neighbourhood_cleansed ... amenities has_availability first_review last_review requires_license instant_bookable is_business_travel_ready cancellation_policy require_guest_profile_picture require_guest_phone_verification
0 none 2008-07-31 within a few hours 1.0 Duboce Triangle ['email', 'phone', 'facebook', 'reviews', 'kba'] 1.0 1.0 Lower Haight Western Addition ... {TV,"Cable TV",Internet,Wifi,Kitchen,"Pets liv... t 2009-07-23 2019-05-16 1 1 f moderate 0 0
1 none 2009-03-02 within a day 0.0 Bernal Heights ['email', 'phone', 'reviews', 'kba', 'work_ema... 1.0 1.0 Bernal Heights Bernal Heights ... {Internet,Wifi,Kitchen,Heating,"Family/kid fri... t 2009-05-03 2017-08-06 1 0 f strict_14_with_grace_period 0 0
2 none 2009-06-17 within a few hours 0.0 Cole Valley ['email', 'phone', 'reviews', 'jumio', 'govern... 1.0 1.0 Cole Valley Haight Ashbury ... {TV,Internet,Wifi,Kitchen,"Free street parking... t 2009-08-31 2016-11-21 1 0 f strict_14_with_grace_period 0 0
3 none 2009-06-17 within a few hours 0.0 Cole Valley ['email', 'phone', 'reviews', 'jumio', 'govern... 1.0 1.0 Cole Valley Haight Ashbury ... {TV,Internet,Wifi,Kitchen,"Free street parking... t 2014-09-08 2018-09-12 1 0 f strict_14_with_grace_period 0 0
4 none 2009-07-02 within an hour 0.0 Alamo Square ['email', 'phone', 'reviews', 'kba'] 1.0 1.0 Alamo Square Western Addition ... {TV,Internet,Wifi,Kitchen,Heating,"Family/kid ... t 2009-09-25 2018-08-11 1 0 f moderate 1 1

5 rows × 29 columns

In [32]:
def striplist(l):
    return [x.strip() for x in l]


# Strip the list syntax from host_verifications, then split into real lists
cat_df['host_verifications'] = (cat_df['host_verifications']
                                .str.replace('[', '', regex=False).str.replace(']', '', regex=False)
                                .str.replace("'", '', regex=False)
                                .str.lower().str.split(',').apply(striplist))
In [33]:
# Same treatment for amenities, which uses {braces} and double quotes
cat_df['amenities'] = (cat_df['amenities'].str.replace('{', '', regex=False)
                       .str.replace('}', '', regex=False).str.replace('"', '', regex=False)
                       .str.lower().str.split(',').apply(striplist))

Binarizing the lists.

In [34]:
mlb = MultiLabelBinarizer()
amenities_df = pd.DataFrame(mlb.fit_transform(
    cat_df['amenities']), columns=mlb.classes_, index=cat_df.index)
In [35]:
mlb = MultiLabelBinarizer()
host_verif_df = pd.DataFrame(mlb.fit_transform(
    cat_df['host_verifications']), columns=mlb.classes_, index=cat_df.index)
In [36]:
host_verif_df = host_verif_df.drop(axis=1, labels=[''])
amenities_df = amenities_df.drop(axis=1, labels=[''])
In [37]:
cat_df = cat_df.drop(['host_verifications', 'amenities'], axis=1)

Concatenating the DataFrames together.

In [38]:
cat_df = pd.concat([cat_df, amenities_df, host_verif_df], axis=1)
In [39]:
cat_cols = cat_df.columns.values
Date/Time Feature Engineering

Converting the date features to "days since" features so we have numerical values to work with.

In [40]:
today = pd.to_datetime('today')
In [41]:
dt_cols = ['host_since', 'first_review', 'last_review']
In [42]:
cat_df[dt_cols].head(1)
Out[42]:
host_since first_review last_review
0 2008-07-31 2009-07-23 2019-05-16
In [43]:
# Converting to datetime data type
for col in dt_cols:
    cat_df[col] = pd.to_datetime(cat_df[col], infer_datetime_format=True)
In [44]:
# Creating day features for each date column
for col in dt_cols:
    num_df[col+'_days'] = (today - cat_df[col]).apply(lambda x: x.days)
In [45]:
# Creating a discrete feature based on how recent the last review was
bins = [0, 90, 180, 365, 730, np.inf]
labels = ['last 90 days', 'last 180 days',
          'last year', 'last 2 years', 'more than 2 years']
cat_df['last_review_discrete'] = pd.cut(
    num_df['last_review_days'], bins=bins, labels=labels)
In [46]:
# Dropping the datetime columns because I don't need them anymore
cat_df = cat_df.drop(dt_cols, axis=1)
cat_cols = cat_df.columns.values
num_cols = num_df.columns.values

Exploratory Data Analysis

Time to explore the data. Let's see if we can answer any of our previously mentioned exploratory questions below.

  1. Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than hosts with a single unit?)
  2. Do hosts with no recent reviews change pricing behavior from hosts with recent reviews?
  3. What are the strongest predictors for price?
In [47]:
# Sanity checking before I rejoin the DFs
num_df.shape[0] == cat_df.shape[0]
Out[47]:
True
In [48]:
cleaned_df = pd.concat([num_df, cat_df], axis=1)
In [49]:
cleaned_df.head()
Out[49]:
host_acceptance_rate host_listings_count host_total_listings_count latitude longitude accommodates bathrooms bedrooms beds square_feet ... photographer reviews selfie sent_id sesame sesame_offline weibo work_email zhima_selfie last_review_discrete
0 NaN 1.0 1.0 37.76931 -122.43386 3 1.0 1.0 2.0 NaN ... 0 1 0 0 0 0 0 0 0 last 90 days
1 NaN 2.0 2.0 37.74511 -122.42102 5 1.0 2.0 3.0 NaN ... 0 1 0 0 0 0 0 1 0 last 2 years
2 NaN 10.0 10.0 37.76669 -122.45250 2 4.0 1.0 1.0 NaN ... 0 1 0 0 0 0 0 0 0 more than 2 years
3 NaN 10.0 10.0 37.76487 -122.45183 2 4.0 1.0 1.0 NaN ... 0 1 0 0 0 0 0 0 0 last year
4 NaN 2.0 2.0 37.77525 -122.43637 5 1.5 2.0 2.0 NaN ... 0 1 0 0 0 0 0 0 0 last year

5 rows × 307 columns

Exploratory Question 1

Is there a difference in quality of service/product between hosts with different numbers of units? (i.e. do hosts with multiple units give better service than those who do not?)

In [50]:
# Making a copy of the DataFrame just for exploring
eda = cleaned_df.copy()  # a true copy avoids SettingWithCopyWarning when adding columns below

Let's take a quick look at the distribution of host_total_listings_count.

In [51]:
plt.figure()
plt.title('Distribution of No. of Host Listings')
plt.hist(eda['host_total_listings_count'].dropna(), bins=30)  # drop NaNs to avoid RuntimeWarnings
plt.show()

We have some huge outliers here, with the majority of values less than 100. Let's clean this up.

In [52]:
eda['host_total_listings_count'].value_counts(
).sort_index(ascending=False).head(20)
Out[52]:
1677.0     23
1614.0     72
1522.0     71
1235.0    241
1108.0    106
814.0     160
695.0       8
641.0      14
640.0       1
593.0     341
528.0      18
516.0      14
488.0      24
472.0       8
441.0      18
408.0      33
362.0       1
349.0       4
332.0       4
314.0      26
Name: host_total_listings_count, dtype: int64

So we have a significant number of hosts with hundreds of listings. Some of these could be errors, because it's difficult to imagine anyone managing over a thousand Airbnb listings. In a work setting, this would be worth following up on. For now, I'm going to discretize these counts into bins to make the groups easier to compare.

I'll make a categorical column to distinguish between hosts. Here are the bins I'm going to use...

  • 1 unit
  • 2-10 units
  • 11-50 units
  • 51-100 units
  • 101+
In [53]:
bins = [0,1,10,50,100,np.inf]
labels = ['1','2-10','11-50','51-100','101+']
eda['cat_num_units'] = pd.cut(eda['host_total_listings_count'],bins=bins,labels=labels)

Now plotting the bins...

In [54]:
plt.figure()
sns.countplot(x='cat_num_units',data=eda)
plt.title('No. Host Units Distribution')
plt.show()

Let's compare these categories on their ratings.

In [55]:
rating_cols = ['review_scores_rating', 'review_scores_cleanliness', 'review_scores_checkin',
               'review_scores_communication', 'review_scores_location', 'review_scores_value']
In [56]:
def highlight_max(s):
    '''
    highlight the maximum in a Series lightgreen.
    '''
    is_max = s == s.max()
    return ['background-color: lightgreen' if v else '' for v in is_max]

def highlight_min(s):
    '''
    highlight the minimum in a Series red.
    '''
    is_min = s == s.min()
    return ['background-color: red' if v else '' for v in is_min]
In [57]:
eda.groupby('cat_num_units')[rating_cols].mean().transpose().style. \
    apply(highlight_max, axis=1).apply(highlight_min, axis=1)
Out[57]:
cat_num_units 1 2-10 11-50 51-100 101+
review_scores_rating 95.1033 94.3074 91.2667 91.5408 93.6839
review_scores_cleanliness 9.39855 9.38816 9.14122 9.31654 9.54651
review_scores_checkin 9.81121 9.78747 9.63401 9.5491 9.47831
review_scores_communication 9.83415 9.78231 9.53695 9.53428 9.45503
review_scores_location 9.64073 9.5636 9.57101 9.47158 9.66455
review_scores_value 9.51381 9.42233 9.05917 8.92626 9.06561

Hosts with only 1 unit appear to have the highest average rating and scores. Let's look at their boxplot summaries side-by-side to compare.

In [58]:
plt.figure(1,figsize=(15,13))
i = 1

for col in rating_cols:
    plt.subplot(3,2,i)
    sns.boxplot(x='cat_num_units',y=col,data=eda)
    title = col.replace('_',' ').title()
    plt.title(title)
    plt.ylabel('')
    plt.xlabel('')
    i += 1

plt.tight_layout()    
plt.show()

Conclusion

These plots tell me that hosts with only 1 unit have tighter, higher distributions, suggesting that they are rated slightly better than hosts with more units. However, it's worth noting that...

  1. the difference looks very marginal, and
  2. observations on hosts with more than 10 units may be slightly misleading, since the sample sizes for those buckets are much smaller than for hosts with 1 unit or 2-10 units (see the quick count below).
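
A quick count backs up the sample-size caveat (a sketch; this cell isn't part of the original run):

# Listings per host-size bucket
eda['cat_num_units'].value_counts().sort_index()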
Hypothesis Testing

Let's perform a one-sided two-sample t-test at a 5% significance level to check whether the average review_scores_rating for hosts with 1 unit is greater than that of all other hosts.

In [59]:
def plot_t_dist_hyp(df, t_stat, p_val):
    """
    After t-test, plot t-statistic and p-value via Plotly Express.
    Dependencies: scipy.stats.t | plotly_express as px | plotly.graph_objects as go

    Inputs
    ------
    df: degrees of freedom
    t_stat: t-statistic
    p_val: p-value

    Returns
    ------
    Plotly Express Figure

    """
    rv = t(df=df)

    x_t = np.linspace(rv.ppf(0.00001), rv.ppf(0.99999), 100)
    y_t = rv.pdf(x_t)
    t_df = pd.DataFrame.from_dict(data={'x_t': x_t, 'y_t': y_t})

    fig = px.line(t_df, 'x_t', 'y_t', width=600, height=300,
                  labels={'x_t': 'X', 'y_t': ''},)
    fig.update_traces(textposition='top center')
    fig.update(layout=dict(title='Hypothesis Test Results', annotations=[{
        "x": t_stat,
        "y": rv.pdf(t_stat),
        'font': {'family': 'Arial', 'size': 12, 'color': 'black'},
        "showarrow": True,
        "text": f'P-value<br>{p_val:.2e}',
        "arrowhead": 7,
        "ay": -40
    }]))
    return fig

First let's check our conditions for hypothesis testing using the Central Limit Theorem...

  1. Independence: I picked listing data from 3 of the largest cities in the US, scraped from Airbnb's website as a snapshot at a single point in time, so these listings were not randomly sampled. It also feels wrong to assume that ratings within each city are independent of one another, since the ratings some listings receive can influence how guests rate others. This test therefore cannot reliably be generalized to the population of all Airbnb listings worldwide, but I will proceed anyway to practice hypothesis testing.
  2. Sample Size & Skew: Our sample size (>60,000) is far greater than 30, so the Central Limit Theorem lets us treat the sampling distribution of the mean as approximately normal (a quick check of the effective group sizes follows this list).
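
Here's a minimal sketch (not in the original run) of those effective, non-null group sizes:

# Non-null rating counts for hosts with 1 unit vs. all others
mask = eda['cat_num_units'] == '1'
print(eda.loc[mask, 'review_scores_rating'].notna().sum(),
      eda.loc[~mask, 'review_scores_rating'].notna().sum())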

Experiment Details

$H_0: \mu_{\text{1 unit}} = \mu_{\text{all others}}$
$H_A: \mu_{\text{1 unit}} > \mu_{\text{all others}}$
$\alpha = 0.05$
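
For reference, with equal_var=True, ttest_ind computes the pooled two-sample t-statistic

$t = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}, \quad s_p = \sqrt{\dfrac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$

with $n_1 + n_2 - 2$ degrees of freedom, which is why the df below is computed as the total sample size minus 2.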

In [60]:
# Creating sampling arrays
one_unit_samp = eda[eda['cat_num_units'] == '1']['review_scores_rating']
all_others_samp = eda[eda['cat_num_units'] != '1']['review_scores_rating']

# Calculating Variance
one_unit_var = one_unit_samp.var()
all_others_var = all_others_samp.var()
In [61]:
# Variances are similar, so equal_var=True is reasonable
print(one_unit_var)
print(all_others_var)
60.04818004883296
69.84990857907735
In [62]:
# ttest_ind returns a two-sided p-value; halve it for our one-sided alternative
t_stat, p_val = ttest_ind(
    a=one_unit_samp, b=all_others_samp, equal_var=True, nan_policy='omit')
p_val = p_val / 2
In [63]:
# Plotting a T-distribution with the calculated t-statistic annotated
# Pooled (equal-variance) t-test: df = n1 + n2 - 2, using non-null counts
df = len(one_unit_samp.dropna()) + len(all_others_samp.dropna()) - 2
plot_t_dist_hyp(df, t_stat, p_val)