The purpose of this project is to understand the key parameters and factors driving click-through rate (CTR) on mobile devices. Is it the location or size of the advertisement, or the category of the app, that makes the difference? Read through to find out.
library(car, warn.conflicts = FALSE) #For VIF Function
library(caret, warn.conflicts = FALSE) #For Data imputation using KNN
library(corrplot, warn.conflicts = FALSE) #For Correlation plot
library(ggplot2, warn.conflicts = FALSE) #For Plotting graphs
library(dplyr, warn.conflicts = FALSE) #For Data PreProcessing
library(aspace, warn.conflicts = FALSE) #For converting latitude/longitude to radians
library(DMwR,warn.conflicts = FALSE) #For KNN Imputation
library(tidyr) #For the gather function in ggplot
library(psych) #For Descriptive statistics
library(regclass) #For Predicting the Confusion Matrix
library(ROCR) #For Prediction and ROC Curve.
library(mapproj) #For Getting a map output
We set the working directory to our local path and import the 'Geo-Fence Analytics.csv' file as a data frame. We also read the record count from the csv and check the number of missing values. We find that the app publisher name, the OS the device is running on, the zip code and the app category have missing values.
getwd()
setwd("C:/Users/Nijanth Anand/Downloads/BANA277-Customer Analytics/Assignment - Mobile Analytics")
data=read.csv("Geo-Fence Analytics.csv",header = TRUE,na.strings=c(""," ","NA"))
print(paste("The data from the Geo-Fence Analytics.csv file has been loaded"))
print(paste("Number of rows :",nrow(data)))
print(paste("Number of columns :",ncol(data)))
We store a backup of the data from the csv file before we proceed further.
data_backup <- data
Since the data is now available as a data frame, we proceed with the data preparation.
We run the sapply function to count the NA values in each column of the data frame.
print("Number of missing records in each column is")
sapply(data, function(x) sum(is.na(x)))
We remove the app publisher and device zip attributes since they have a high number of NAs, and the latitude/longitude and app category details can supplement them. For the remaining missing values we run a KNN imputation with k = 10, so the imputation uses the 10 nearest neighbours based on Euclidean distance.
data$app_pub<- NULL
data$device_zip<- NULL
data<-knnImputation(data,k=10)
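We can re-run the NA count to confirm that the imputation left no missing values:
sapply(data, function(x) sum(is.na(x))) #Should now be 0 for every remaining column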
We now work on the feature engineering as per the specifications in Question 2 -> Analysis -> Data PreProcessing. We create all 8 variables stated in the question, as below.
1. imp_large 2. cat_entertainment 3. cat_social 4. cat_tech
5. os_ios 6. distance (in km) 7. distance_squared 8. ln_app_review_vol
data <- mutate(data, imp_large = ifelse(imp_size == "728x90", 1, 0)) #1 if the banner is the large 728x90 size
data <- mutate(data, cat_entertainment = ifelse(app_topcat == "IAB1" | app_topcat == "IAB1-6", 1, 0)) #Entertainment category apps
data <- mutate(data, cat_social = ifelse(app_topcat == "IAB14", 1, 0)) #Social category apps
data <- mutate(data, cat_tech = ifelse(app_topcat == "IAB19-6", 1, 0)) #Technology category apps
data <- mutate(data, os_ios = ifelse(device_os == "iOS", 1, 0)) #1 if the device runs iOS
data <- mutate(data, ln_app_review_vol = log(app_review_vol)) #Log of app review volume
#Convert all coordinates from degrees to radians before applying the spherical law of cosines
data$device_lat = as_radians(data$device_lat)
data$device_lon = as_radians(data$device_lon)
data$geofence_lat = as_radians(data$geofence_lat)
data$geofence_lon = as_radians(data$geofence_lon)
#Great-circle distance in km (Earth radius 6371 km); pmin guards against floating-point values just above 1 making acos return NaN
data <- mutate(data, distance = acos(pmin(1, sin(device_lat) * sin(geofence_lat) + cos(device_lat) * cos(geofence_lat) * cos(device_lon - geofence_lon))) * 6371)
#head(data$distance)
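As a quick sanity check of the spherical law of cosines formula, the great-circle distance between Los Angeles and Chicago should come out near 2,800 km. The approximate airport coordinates below are illustrative values, not data from the assignment:
deg2rad <- function(deg) deg * pi / 180 #Small helper so the check does not depend on aspace
lat1 <- deg2rad(33.94); lon1 <- deg2rad(-118.41) #Los Angeles (LAX), approximate
lat2 <- deg2rad(41.97); lon2 <- deg2rad(-87.91) #Chicago (ORD), approximate
acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon1 - lon2)) * 6371 #~2,800 km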
data <- mutate(data, distance_group = cut(distance, breaks = c(0,0.5,1,2,4,7,10,Inf), labels = c(1,2,3,4,5,6,7)))
data$distance_group <- as.integer(data$distance_group)
data <- mutate(data, distance_squared=(data$distance)^2)
We select only the features below, as stated in the question, with distance represented by the distance_group and distance_squared variables derived above.
1. didclick 2. distance_group 3. distance_squared 4. imp_large 5. cat_entertainment
6. cat_social 7. cat_tech 8. os_ios 9. ln_app_review_vol 10. app_review_val
We create a new data frame, data1, and assign the values to it. The new data frame has 121,567 observations and 10 variables.
data1 <- select(data,didclick,imp_large,cat_tech,cat_entertainment,cat_social,ln_app_review_vol,app_review_val,os_ios,distance_group,distance_squared)
str(data1)
We now calculate summary statistics for the data: mean, median, standard deviation, minimum, maximum, skewness and kurtosis. We also plot histograms of the numeric variables.
psych::describe(data1, check = FALSE) #Descriptive statistics
dplyr::select_if(data1,is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + scale_fill_gradient(low = "red",high = "yellow") +
geom_histogram(bins=30)
We generate the correlation matrix for the numeric variables using the Pearson method. (Note that cor() uses only the first element of a method vector, so passing c("pearson", "kendall", "spearman") would still compute Pearson correlations.)
M = cor(data1, method = "pearson")
M
We generate a diagrammatic representation of the correlation values, with the values plotted in the corresponding boxes. Red markings indicate negative correlations while deep blue markings indicate strong positive correlations.
corrplot(M, method = "number", addrect = 2)
We compute the overall click-through rate across all impressions, both as a proportion and as a percentage.
click_through_rate=sum(data$didclick==1)/(sum(data$didclick==1)+sum(data$didclick==0))
print(paste("Click through rate for the advertising impression is ", round(click_through_rate,4)))
click_through_rate_percentage=click_through_rate*100
print(paste("Click through rate percentage for the advertising impression is ", round(click_through_rate_percentage,2),"%"))
We examine how the click-through rate varies over the distance groups (1-7) through the plot below.
data_temp=group_by(data1,distance_group)
abc=summarise(data_temp, count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n())
df = as.data.frame(abc)
plot(x=df$distance_group,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Distance from Geofencing POI", ylab = "Click Through Rate", main = "Distance Vs Click Through Rate")
We also plot the pairwise relationships between distance group, count of clicks, click-through rate and total impressions per group.
plot(summarise(data_temp, count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
We generate comparison graphs to study the variation of click-through rate across predictors in the data frame: device OS, impression size, technology category, entertainment category, social category, log of app review volume, and app review rating.
par(mfrow=c(4,2))
df=as.data.frame(summarise(group_by(data1,os_ios), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$os_ios,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Operating system (1= iOs)", ylab = "Click Through Rate", main = "Operating System Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,imp_large), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$imp_large,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Impression size (1=Large)", ylab = "Click Through Rate", main = "Impression Size Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,cat_tech), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$cat_tech,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Category Technology (1=Yes)", ylab = "Click Through Rate", main = "Category Technology Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,cat_entertainment), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$cat_entertainment,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Category Entertainment (1=Yes)", ylab = "Click Through Rate", main = "Category Entertainment Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,cat_social), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$cat_social,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Category Social (1=Yes)", ylab = "Click Through Rate", main = "Category Social Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,ln_app_review_vol), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$ln_app_review_vol,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Log of App review volume", ylab = "Click Through Rate", main = "Log of App Review Volume Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,app_review_val), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$app_review_val,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on App Review Rating", ylab = "Click Through Rate", main = "App Review Rating Vs Click Through Rate")
We build a basic logistic regression model using the glm function.
table(data1$didclick)
glm_model_1=glm(didclick~ . ,data=data1,family = binomial)
We generate the summary statistics of the model using the summary function. It shows that only a few predictors are significant: impression size, the technology category, the operating system of the mobile device, and the distance group based on distance from the geofence centre.
summary(glm_model_1)
We now evaluate the performance of the model using a confusion matrix.
confusion_matrix(glm_model_1,DATA=data1)
The model does not predict any positive didclick values, as the confusion matrix above shows, yet its accuracy is still high since the click-through rate is only 0.0068. One option is to weight the didclick observations (sketched in the commented code below); instead we oversample the rare positive class with SMOTE.
#data2<-data1
#data2$weight<- ifelse(data2$didclick==1,1,0.1)
#data2$weight<- 1:nrow(data2)
#glm_model_2=glm(didclick~ . ,data=data2,weights=weight,family = binomial)
#summary(glm_model_2)
#Supersampling Rare Events in R
data_smote <- data1
data_smote$didclick <- as.factor(data_smote$didclick)
data_smote=(DMwR::SMOTE(didclick ~ ., data_smote,perc.over = 100, perc.under=200))
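With perc.over = 100, SMOTE creates one synthetic positive for each existing positive (doubling the minority class), and perc.under = 200 samples two majority cases for each synthetic case created, so the resulting set is roughly balanced. We can verify the class balance:
table(data_smote$didclick) #Roughly equal counts of 0 and 1 after SMOTE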
glm_model_2=glm(didclick~os_ios+distance_group+cat_tech+imp_large ,data=data_smote,family = binomial)
summary(glm_model_2)
confusion_matrix(glm_model_2,DATA=data1)
We calculate the variance inflation factor (VIF) to check that the predictors do not exhibit high multicollinearity.
vif(glm_model_2)
We now check whether the reduced model (glm_model_2) fits as well as glm_model_1 with all predictors, using the chi-square test from anova.
glm_model_1=glm(didclick~ . ,data=data_smote,family = binomial) #To ensure models are fitted to the same size of dataset
anova(glm_model_1,glm_model_2, test="Chisq")
The non-significant chi-square probability (p = 0.25) suggests that the reduced model fits as well as the full model.
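The p-value reported by anova() is the upper tail of a chi-square distribution evaluated at the drop in residual deviance; equivalently, using the columns of the returned anova table:
a <- anova(glm_model_1, glm_model_2)
pchisq(abs(a$Deviance[2]), df = abs(a$Df[2]), lower.tail = FALSE) #Same p-value as test="Chisq"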
We also plot the ROC curve and compute the AUC for glm_model_2.
prob_train <- predict(glm_model_2, type = "response") ##List of predictions from model
pred <- prediction(prob_train, data_smote$didclick)
perf <- performance(pred, measure="tpr", x.measure="fpr") ##Get true-positive and false-positive rates
##AUC Score
perf_auc <- performance(pred, measure="auc")
auc <- perf_auc@y.values[[1]]
print(paste("AUC Value is ",auc)) ##0.564
#ROC Curve
par(mfrow=c(1,1))
plot(perf, col=rainbow(10), colorize=T, print.cutoffs.at=seq(0,1,0.05)) ##Plot ROC Curve
abline(a=0, b= 1)
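One optional way to use this curve (a sketch, not required by the assignment) is to pick the probability cutoff that maximises Youden's J statistic (TPR - FPR):
tpr <- perf@y.values[[1]] #True-positive rates along the curve
fpr <- perf@x.values[[1]] #False-positive rates along the curve
cutoffs <- perf@alpha.values[[1]] #The probability cutoffs ROCR evaluated
cutoffs[which.max(tpr - fpr)] #Cutoff with the best TPR/FPR trade-off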
We are also interested in whether the geofence radius parameter is of any significance, so we start by studying how the radius varies with latitude and longitude.
We use the summarise function to study the variation of radius across the unique latitude/longitude groups.
data2 <- data1
data2$gepfence_radius=data_backup$gepfence_radius
data2$geofence_lon=data_backup$geofence_lon
data2$geofence_lat=data_backup$geofence_lat
data2_temp <- data2 %>%
group_by(geofence_lon,geofence_lat) %>%
summarise(min=min(gepfence_radius),max=max(gepfence_radius),average_radius=mean(gepfence_radius))
data2_temp
There seems to be a pattern in the distribution of radius across locations: geofences at longitudes around -117 degrees have a radius of 11.263 km, while those at longitudes around -87 degrees have a radius of 5 km. To understand this better, we plot a geographical map.
data2_temp$average_radius=as.factor(data2_temp$average_radius)
data2_temp %>%
ggplot(aes(x=geofence_lon,y=geofence_lat,fill=average_radius)) + borders("state") +geom_point(aes(colour=average_radius),size=1.5) +coord_map() +theme_void() #Remove the theme
From the above graph we can infer that the red dots correspond to Chicago and the blue dots to Los Angeles. There also seems to be a difference in radius between the two cities: Los Angeles has a radius of 11.263 km and Chicago has a radius of 5 km. So we dig deeper to study whether the click-through rate changes by city.
We plot the click-through rate, total impressions and count of clicks for the two cities.
data2$city<- as.factor(ifelse(data$gepfence_radius == 5,"Chicago","Los Angeles"))
df=as.data.frame(summarise(group_by(data2,city), count_of_clicks=sum(didclick==1),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
require(gridExtra)
plot1<- qplot(x=df$city,y=df$click_through_rate,size=I(2),xlab = "City", ylab = "Click Through Rate", main = "Click Through Rate")
plot2<- qplot(x=df$city,y=df$impressions,size=I(2),xlab = "City", ylab = "Total Number of Impressions", main = "Impression Count")
plot3<- qplot(x=df$city,y=df$count_of_clicks,size=I(2),xlab = "City", ylab = "Number of Clicks", main = "Count of Clicks")
gridExtra::grid.arrange(plot1, plot2,plot3, ncol=3)
There is a significant difference in click-through rate between the two cities (CTR in Chicago = 0.02359, CTR in LA = 0.00644), so we investigate further. We create a ratio attribute denoting the normalised distance from the centre of the geofence, which expresses how far the user is from the geofence centre in units of the radius.
We define the formula as Normalised Distance = (Distance in km) / (Radius in km)
We also define some features based on the normalised distance, as below.
1. inside_quarterradius_circle is 1 if the user is within radius/4 of the geofence centre.
2. inside_halfradius_circle is 1 if the user is between radius/4 and radius/2 from the geofence centre.
3. outside_geofence_radius is 1 if the user is outside the circle centred on the geofence with the gepfence_radius.
We also study the distributions of the generated parameters.
data_city<- data2
data_city$normalised_distance<- sqrt(data2$distance_squared)/data2$gepfence_radius
data_city$inside_quarterradius_circle <- ifelse(data_city$normalised_distance<= 0.25,1,0)
data_city$inside_halfradius_circle <- ifelse((data_city$normalised_distance>= 0.25)&(data_city$normalised_distance<= 0.5) ,1,0)
data_city$outside_geofence_radius<- ifelse(data_city$normalised_distance>=1,1,0)
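As a quick check of how impressions split across these distance bands (the indicators are 0/1, so the column means are the shares of impressions in each band):
colMeans(select(data_city, inside_quarterradius_circle, inside_halfradius_circle, outside_geofence_radius)) #Share of impressions per band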
select(data_city,normalised_distance,inside_quarterradius_circle,inside_halfradius_circle,outside_geofence_radius)%>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + scale_fill_gradient(low = "red",high = "yellow") +
geom_histogram(bins=30) #+ facet_grid(cols = vars(key))
We take a log of the normalised distance, since its distribution is skewed, and study the correlation between the related parameters, since we do not want to feed correlated parameters to the glm model.
data_city$ln_normalised_distance <- log(data_city$normalised_distance)
data_corr<- select(data_city,normalised_distance,ln_normalised_distance,distance_group,distance_squared,gepfence_radius,inside_quarterradius_circle,inside_halfradius_circle,outside_geofence_radius)
M<- cor(data_corr)
corrplot(M,method="number",addrect=2) ##circle
From the above correlation plot we can see high correlation between normalised distance, log of normalised distance, distance group, and squared distance. So we keep only ln_normalised_distance, inside_quarterradius_circle and outside_geofence_radius, to avoid multicollinearity in the model.
data_city$distance_group<- NULL
data_city$distance_squared<- NULL
data_city$gepfence_radius<- NULL
data_city$geofence_lon<- NULL
data_city$geofence_lat<- NULL
data_city$inside_halfradius_circle <- NULL
data_city$normalised_distance <- NULL
Now that the highly correlated predictors have been removed, we build the model.
glm_model_3=glm(didclick~ . ,data=data_city,family = binomial)
summary(glm_model_3)
The model has several good predictors: whether the user is inside the quarter-radius circle, the city the user is in, the number of reviews for the app, whether the app is in the technology category, impression size, etc.
We calculate the confusion matrix for the model.
confusion_matrix(glm_model_3,DATA=data_city)
The model is not able to predict any didclicks, so we enhance it with SMOTE to oversample the didclick = 1 records and keep only the significant predictors.
data_smote <- data_city
data_smote$didclick <- as.factor(data_smote$didclick)
data_smote=(DMwR::SMOTE(didclick ~ ., data_smote,perc.over = 100, perc.under=200))
glm_model_4=glm(didclick~ imp_large+cat_tech+ln_app_review_vol+city+inside_quarterradius_circle ,data=data_smote,family = binomial)
summary(glm_model_4)
confusion_matrix(glm_model_4,DATA=data_city)
The model shows that parameters such as presence inside the quarter-radius circle and the city are good estimators of didclick. We also run a VIF check to ensure there is no multicollinearity in the glm model.
vif(glm_model_4)
The variance inflation factors for inside_quarterradius_circle, city, cat_tech, ln_app_review_vol and imp_large are all close to 1, indicating very little multicollinearity among these predictors.
We now check whether the reduced model (glm_model_4) fits as well as glm_model_3 with all predictors, using the chi-square test from anova.
glm_model_3=glm(didclick~ . ,data=data_smote,family = binomial) #To ensure models are fitted to the same size of dataset
anova(glm_model_3,glm_model_4, test="Chisq")
The non-significant chi-square probability (p = 0.56) suggests that the reduced model fits as well as the full model.
We also plot the ROC curve and compute the AUC for glm_model_4.
prob_train <- predict(glm_model_4, type = "response") ##List of predictions from model
pred <- prediction(prob_train, data_smote$didclick)
perf <- performance(pred, measure="tpr", x.measure="fpr") ##Get true-positive and false-positive rates
##AUC Score
perf_auc <- performance(pred, measure="auc")
auc <- perf_auc@y.values[[1]]
print(paste("AUC Value is ",auc)) ##0.564
#ROC Curve
par(mfrow=c(1,1))
plot(perf, col=rainbow(10), colorize=T, print.cutoffs.at=seq(0,1,0.05)) ##Plot ROC Curve
abline(a=0, b= 1)
The AUC for model 4 is 0.608, which is higher than but similar to model 2's 0.57, and its AIC of 4449.6 is lower than but similar to model 2's AIC of 4549.3. So we can conclude that models 2 and 4 are good, since their AIC values are far below those of the other models, which exceed 9,000.
We now calculate the exponentiated coefficients (odds ratios) of the significant variables from the two glm models to derive our findings and implications.
exp(coef(glm_model_2))
exp(coef(glm_model_4))
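To interpret these odds ratios on the probability scale, a small illustrative helper (the function name and its default baseline are ours, not from the assignment) converts an odds ratio into the implied CTR at the overall baseline CTR of 0.0068:
implied_ctr <- function(odds_ratio, baseline_ctr = 0.0068) {
  odds <- baseline_ctr / (1 - baseline_ctr) * odds_ratio #Scale the baseline odds by the odds ratio
  odds / (1 + odds) #Convert the odds back to a probability
}
implied_ctr(2) #An odds ratio of 2 roughly doubles a CTR this small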
Geofencing is a location-based digital marketing technology used by advertisers to target users with specific ads based on their proximity to local businesses. A local business can build a geofence around itself, i.e. a circle of a particular radius (e.g. 5 km), to target users who enter that geofence.
In this case study we are interested in which parameters have a significant impact on the click-through rate for impressions shown to users through geofenced advertising. The overall click-through rate is 0.0068, meaning about 7 of every 1,000 targeted users clicked the advertisement. The descriptive statistics and correlations indicate that the count of clicks is highest for group 3, users 1-2 km from the geofence centre (the local business); the number of impressions is also highest for this group, showing it is one of the key advertising areas for marketers. The click-through rate, however, is highest for group 1, users within 0.5 km of the centre. From the distributions we might conclude that click-through rate decreases as the distance group increases, i.e. distance from the business is inversely related to click-through rate, although distance group 6 is an exception that we address with the normalised-distance concept.
From the click-through rate comparison plots we also observed the following trends: the aggregated CTR for technology apps was higher than for entertainment and social apps; apps with a review rating of 4.2 had the highest click-through rates; and smaller impression/screen sizes had better click-through rates.
The technology category has the highest correlation with the other predictors: high negative correlation with the entertainment and social categories, app review value and os_ios, and high positive correlation with imp_large. We therefore use it as a major predictor and discard the other correlated variables so that it carries a lower VIF score.
The first logistic regression reinforces these findings: a user on iOS viewing a tech app has a higher probability of clicking, since those coefficients (as odds ratios) are greater than 1. The user is less likely to click when the impression/screen size is larger and when the distance group is higher.
To investigate further, since the AUC of the first glm model is only 0.57, we dig into geographical details such as city and radius, which yield some interesting predictors and a model with a higher AUC of 0.608. The sample contains data from two cities, Chicago and Los Angeles, with very different click-through rates (CTR in Chicago = 0.02359, CTR in LA = 0.00644): users in Chicago were almost 4 times more likely to click on ads than users in LA. We also generated the normalised-distance parameter described above.
From the logistic regression model (glm_model_4) we find geo predictors: users within a quarter of the geofence radius are more likely to click on ads than those farther out. Predictors for presence within half the geofence radius, or outside the geofence radius, are not significant. So presence within a quarter of the geofence radius is the only significant distance parameter, and raw distance from the centre is otherwise of little significance.