Yelp - Recommend other Cuisines available in a restaurant based on past Cuisine Data

Association rule mining is a technique to identify underlying relations between different items.

In this use case, for example, if a restaurant offers Korean cuisine, there may be a pattern in which other cuisines such restaurants offer. For instance, a Korean barbeque restaurant may also list Seafood among its cuisines. As a food lover, I suspected that the cuisines offered together by restaurants follow patterns and wanted to explore this further. If the relationships between cuisines across restaurants can be identified, they can be used to suggest which other cuisines a restaurant is likely to offer.
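
The rules mined later in this notebook are reported with three standard measures: support, confidence and lift. As a minimal sketch of what those numbers mean, the snippet below computes them by hand on a tiny made-up set of restaurants. The transactions and the helper names `support`, `confidence` and `lift` are illustrative assumptions and are not part of the Yelp data or of any library.

```python
# Toy transactions: each list holds the cuisines of one made-up restaurant.
transactions = [
    ["Korean", "Barbeque", "Seafood"],
    ["Korean", "Seafood"],
    ["Korean", "Barbeque"],
    ["Italian", "Pizza"],
    ["Italian", "Seafood"],
]

def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= set(t) for t in transactions) / len(transactions)

def confidence(base, add, transactions):
    """Of the transactions containing base, the fraction that also contain add."""
    return support(set(base) | set(add), transactions) / support(base, transactions)

def lift(base, add, transactions):
    """Confidence relative to how common add is overall (1 means no association)."""
    return confidence(base, add, transactions) / support(add, transactions)

s = support(["Korean", "Seafood"], transactions)       # 2 of 5 restaurants
c = confidence(["Korean"], ["Seafood"], transactions)  # (2/5) / (3/5)
l = lift(["Korean"], ["Seafood"], transactions)        # (2/3) / (3/5)
print("Support:", s)
print("Confidence:", c)
print("Lift:", l)
```

A lift above 1 suggests the two cuisines appear together more often than they would if they were independent, which is why the rules further down are filtered with `min_lift=1`.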

Dataset used: Yelp Business dataset JSON file

https://www.yelp.com/dataset/download

Import the JSON file

In [1]:
import os
import pandas as pd
os.getcwd()
os.chdir('c:\\Users\\Nijanth Anand\\Downloads\\Apriori Problem') 

#Read the JSON file (one JSON object per line) into a pandas DataFrame
df = pd.read_json('yelp_academic_dataset_business.json', lines=True)

df.head(10)

print('Size of the JSON file',df.shape)
print(df.columns)
#print(df.dtypes)
Size of the JSON file (192609, 14)
Index(['address', 'attributes', 'business_id', 'categories', 'city', 'hours',
       'is_open', 'latitude', 'longitude', 'name', 'postal_code',
       'review_count', 'stars', 'state'],
      dtype='object')

We keep only the businesses that are restaurants by selecting rows whose attributes contain the key 'RestaurantsReservations'.

In [2]:
rest_data=df[df['attributes'].astype(str).str.contains('RestaurantsReservations')]

print('Total Number of businesses:',df.shape[0])
print('Number of Restaurants in it:',rest_data.shape[0])

#Let's see what a sample row in the restaurant data looks like
print('Sample record in the dataset')

rest_data.iloc[2]
Total Number of businesses: 192609
Number of Restaurants in it: 52287
Sample record in the dataset
Out[2]:
address                                   2450 E Indian School Rd
attributes      {'RestaurantsTakeOut': 'True', 'BusinessParkin...
business_id                                1Dfx3zM-rW4n-31KeC8sJg
categories      Restaurants, Breakfast & Brunch, Mexican, Taco...
city                                                      Phoenix
hours           {'Monday': '7:0-0:0', 'Tuesday': '7:0-0:0', 'W...
is_open                                                         1
latitude                                                  33.4952
longitude                                                -112.029
name                                                    Taco Bell
postal_code                                                 85016
review_count                                                   18
stars                                                           3
state                                                          AZ
Name: 11, dtype: object

Plot the distribution of Restaurants

In [3]:
import folium
from folium import plugins
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

#We generate a location list from our dataset.
rest_data=rest_data.reset_index(drop=True)
locations = rest_data[['latitude', 'longitude']].reset_index(drop=True)
locationlist = locations.values.tolist()
len(locationlist)
locationlist[7]

#For performance reasons we only plot a sample of the data.
#We sample row positions rather than the coordinates alone, so that each marker
#keeps the name of its own restaurant.
import random
random.seed(4)
sample_idx = random.sample(range(len(locationlist)), 500)
print('Number of Location samples taken for below Interactive Folium map',len(sample_idx))

#https://georgetsilva.github.io/posts/mapping-points-with-folium/
#The variable is named rest_map to avoid shadowing the built-in map().
rest_map = folium.Map(location=[37.0902, -95.7129], zoom_start=3.5)  #Set the starting point as centre of the United States.
for idx in sample_idx:
    folium.Marker(locationlist[idx], popup=rest_data['name'][idx],icon=folium.Icon(color='red',icon='glass')).add_to(rest_map)
rest_map
Number of Location samples taken for below Interactive Folium map 500
Out[3]:

Understanding the distribution of the data from different states.

In [4]:
rest_state_data = rest_data.groupby('state')['business_id'].nunique()
rest_state_data=rest_state_data.sort_values(ascending=False)
print('No of states',rest_state_data.size)
rest_state_data
No of states 20
Out[4]:
state
ON     12576
AZ     10466
NV      7398
OH      4868
QC      4551
NC      3911
PA      3495
AB      2568
WI      1519
IL       576
SC       340
NY        10
XWY        2
NM         1
XGM        1
TX         1
FL         1
BC         1
WA         1
AR         1
Name: business_id, dtype: int64

Data Preparation

In [5]:
#Generate the attributes from the categories column.
data=rest_data['categories']
#print(data.dtypes)

#Keywords for unwanted attributes (Arts, Casinos etc) and generic labels (Restaurants, Bars etc).
#A category is dropped if it contains any of these keywords as a substring.
exclude=['Bars','Restaurants','Food','Nightlife','Beer','Arts','Events','Caterers',
         'Hotels','Casinos','Event Planning & Services','Active Life','Art Galleries',
         'Gas Stations','Day Spas','Books','Shopping','Breakfast & Brunch',
         'American (New)','Sandwiches','Salads']

#We now convert the categories into items in a list so they can be used for Association Rule Mining.
a=[]   #one list of categories per restaurant (the transactions)
b=[]   #all remaining categories in one flat list, used for the frequency counts
for i in data:
    #Note: split(",") keeps the space after each comma, so items may carry a leading space.
    temp=str(i).split(",")
    temp=[k for k in temp if not any(word in k for word in exclude)]
    a.append(temp)
    b.extend(temp)
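
Because the category string is split on "," only, every item after the first keeps the space that followed the comma. The same cuisine can then exist as both 'Mexican' and ' Mexican', which the mining step treats as two different items; this is visible in the rule output further down, where some names are printed with a leading space. The sketch below shows one way to normalise the items. The helper name `clean_categories` and the shortened keyword list are assumptions made for illustration, not the code that produced the results in this notebook, and applying it would change the counts and rules reported below.

```python
def clean_categories(raw, exclude):
    """Split a comma-separated category string, trim whitespace,
    and drop every label that contains an excluded keyword."""
    if raw is None:
        return []
    items = [part.strip() for part in str(raw).split(",")]
    return [k for k in items if k and not any(word in k for word in exclude)]

# Shortened keyword list, for illustration only.
exclude = ["Restaurants", "Bars", "Food"]

print(clean_categories("Restaurants, Mexican, Tacos, Fast Food", exclude))
# ['Mexican', 'Tacos']
```

Returning an empty list for a missing value also avoids turning an empty categories field into the literal item 'None'.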

Descriptive Statistics

In [6]:
print('Total count of categories/cuisines (including repeats) in the Restaurant data: ',len(b))
print('\nCuisine Distribution of Top 50 by frequency: ')
fig=plt.gcf()
fig.set_size_inches(15,7)
pd.DataFrame(b)[0].value_counts().nlargest(50).plot(kind='bar')
plt.show()
Total count of categories/cuisines (including repeats) in the Restaurant data:  96088

Cuisine Distribution of Top 50 by frequency: 

Association Rule Mining

In [7]:
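from apyori import apriori  

#Note: min_length is not among apyori's documented options (min_support,
#min_confidence, min_lift, max_length), so it likely has no effect here.
association_rules = apriori(a, min_support=0.0017, min_confidence=0.40, min_lift=1, min_length=2)  
association_results = list(association_rules)  

for item in association_results:

    #item[0] is the frozenset of items in the itemset.
    #Its iteration order is arbitrary, so the arrow printed below does not
    #necessarily match the base -> add direction stored in item[2][0],
    #and only the first two items of a larger itemset are shown.
    pair = item[0] 
    items = [x for x in pair]
    print("Rule: " + items[0] + " -> " + items[1])

    #item[1] is the support of the itemset
    print("Support: " + str(item[1]))

    #item[2] is the list of ordered statistics for the itemset;
    #we report the confidence and lift of the first one
    print("Confidence: " + str(item[2][0][2]))
    print("Lift: " + str(item[2][0][3]))
    print("=====================================")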
Rule:  Breweries ->  Pubs
Support: 0.002256775106623061
Confidence: 0.4682539682539682
Lift: 17.133376653670563
=====================================
Rule:  Chicken Shop ->  Chicken Wings
Support: 0.0036146652131504964
Confidence: 0.47969543147208127
Lift: 13.235796847166602
=====================================
Rule:  Chinese ->  Dim Sum
Support: 0.0027540306385908544
Confidence: 0.712871287128713
Lift: 15.023740826319635
=====================================
Rule:  Indian ->  Pakistani
Support: 0.0036146652131504964
Confidence: 0.6019108280254777
Lift: 36.68078259320297
=====================================
Rule: Pakistani ->  Indian
Support: 0.0017212691491192839
Confidence: 0.8653846153846154
Lift: 52.737022592791824
=====================================
Rule:  Irish ->  Pubs
Support: 0.0018551456384952284
Confidence: 0.642384105960265
Lift: 23.504784988344557
=====================================
Rule:  Japanese ->  Ramen
Support: 0.0023332759576950293
Confidence: 0.7349397590361446
Lift: 22.498709122203103
=====================================
Rule:  Middle Eastern ->  Lebanese
Support: 0.002161149042783101
Confidence: 0.7337662337662338
Lift: 46.84546405974978
=====================================
Rule: Greek ->  Mediterranean
Support: 0.002237649893855069
Confidence: 0.420863309352518
Lift: 17.63275629496403
=====================================
Rule: Middle Eastern ->  Mediterranean
Support: 0.0023715263832310134
Confidence: 0.4261168384879725
Lift: 17.852861485593444
=====================================
Rule:  Mexican ->  Tacos
Support: 0.0026966550002868782
Confidence: 0.7921348314606741
Lift: 15.132756277889758
=====================================
Rule:  Tex-Mex ->  Mexican
Support: 0.005775814255933597
Confidence: 0.5571955719557196
Lift: 10.644532287485827
=====================================
Rule: Tex-Mex ->  Mexican
Support: 0.00252452808537495
Confidence: 0.8301886792452831
Lift: 15.859727976506436
=====================================
Rule: Italian ->  Pizza
Support: 0.009734733298907951
Confidence: 0.43319148936170215
Lift: 5.678185862184839
=====================================
Rule:  Salad ->  Soup
Support: 0.004838678830301987
Confidence: 0.4074074074074074
Lift: 11.020233373570155
=====================================
Rule:  Vegetarian ->  Vegan
Support: 0.004781303191998011
Confidence: 0.4132231404958678
Lift: 25.389187246894757
=====================================
Rule:  Cafes ->  Coffee & Tea
Support: 0.0020463977661751486
Confidence: 0.4196078431372549
Lift: 10.507679738562091
=====================================
Rule:  Pizza ->  Chicken Wings
Support: 0.0028496567024308144
Confidence: 0.5498154981549815
Lift: 7.206869629488474
=====================================
Rule: Pizza ->  Chicken Wings
Support: 0.002180274255551093
Confidence: 0.42066420664206644
Lift: 12.287859984745099
=====================================
Rule:  Chicken Wings -> Italian
Support: 0.0017977700001912521
Confidence: 0.9791666666666666
Lift: 12.834717347706192
=====================================
Rule:  Pizza ->  Salad
Support: 0.001702143936351292
Confidence: 0.44723618090452266
Lift: 5.862280819993677
=====================================
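
The loop above reads the two item names from a frozenset, whose order is arbitrary, so the printed arrow may not reflect the direction to which the confidence applies, and itemsets with more than two items are shown as pairs. This is a likely reason why some rules appear twice with different numbers. Each apyori result also carries a list of ordered statistics with explicit `items_base` and `items_add` fields, and the sketch below prints every rule from those fields. To keep the example runnable without apyori, it defines stand-in namedtuples with the same field names; the helper `format_rules` and the numbers in the hand-built record are made up for illustration.

```python
from collections import namedtuple

# Stand-ins that mirror the field names of apyori's result tuples.
OrderedStatistic = namedtuple(
    "OrderedStatistic", ["items_base", "items_add", "confidence", "lift"])
RelationRecord = namedtuple(
    "RelationRecord", ["items", "support", "ordered_statistics"])

def format_rules(record):
    """Return one line per ordered statistic, using the stored rule direction."""
    lines = []
    for stat in record.ordered_statistics:
        base = ", ".join(sorted(stat.items_base))
        add = ", ".join(sorted(stat.items_add))
        lines.append(
            f"{base} -> {add} | support={record.support:.4f} "
            f"confidence={stat.confidence:.2f} lift={stat.lift:.2f}")
    return lines

# Hand-built record with made-up values, for illustration only.
record = RelationRecord(
    items=frozenset({"Dim Sum", "Chinese"}),
    support=0.0028,
    ordered_statistics=[
        OrderedStatistic(frozenset({"Dim Sum"}), frozenset({"Chinese"}), 0.71, 15.02),
    ],
)

for line in format_rules(record):
    print(line)
```

With real results, the same function could be applied to each element of `association_results`, since those records expose the same field names.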