The purpose of this project is to understand the key parameters and factors driving click-through rate (CTR) on mobile devices. Is it the location or size of the advertisement, or the category of the app, that makes the difference? Read through to find out.
library(car, warn.conflicts = FALSE) #For VIF Function
library(caret, warn.conflicts = FALSE) #For Data imputation using KNN
library(corrplot, warn.conflicts = FALSE) #For Correlation plot
library(ggplot2, warn.conflicts = FALSE) #For Plotting graphs
library(dplyr, warn.conflicts = FALSE) #For Data PreProcessing
library(aspace, warn.conflicts = FALSE) #For converting latitude/longitude to radians
library(DMwR,warn.conflicts = FALSE) #For KNN Imputation
library(tidyr) #For the gather function in ggplot
library(psych) #For Descriptive statistics
library(regclass) #For Predicting the Confusion Matrix
library(ROCR) #For Prediction and ROC Curve.
library(mapproj) #For Getting a map output
We set the working directory to our local path and import the 'Geo-Fence Analytics.csv' file as a data frame. We also read the record count from the csv and check the number of missing values. We find that the app publisher name, the OS the device is running on, the zip code and the app category have missing values.
getwd()
setwd("C:/Users/Nijanth Anand/Downloads/BANA277-Customer Analytics/Assignment - Mobile Analytics")
data=read.csv("Geo-Fence Analytics.csv",header = TRUE,na.strings=c(""," ","NA"))
print(paste("The data from the Geo-Fence Analytics.csv file has been loaded"))
print(paste("Number of rows :",nrow(data)))
print(paste("Number of columns :",ncol(data)))
We store a backup of the data from the csv file before we proceed further.
data_backup <- data
Since the data is now available as a data frame, we proceed with the data preparation.
We run the sapply function to count the NA values in each column of the data frame.
print("Number of missing records in each column is")
sapply(data, function(x) sum(is.na(x)))
We remove the app publisher and device zip attributes since they have a high number of NAs, and the latitude/longitude and app category details can supplement them. For the remaining missing values we run a KNN imputation with k = 10, so the imputation uses the 10 nearest neighbours based on Euclidean distance.
data$app_pub<- NULL
data$device_zip<- NULL
data<-knnImputation(data,k=10)
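We can re-run the NA count to confirm that the imputation left no missing values:
sapply(data, function(x) sum(is.na(x))) #Should now be 0 for every remaining column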
We now work on the feature engineering as per the specifications in Question 2 -> Analysis -> Data PreProcessing. We create all 8 variables stated in the question, as below.
1. imp_large 2. cat_entertainment 3. cat_social 4. cat_tech
5. os_ios 6. distance (in km) 7. distance_squared 8. ln_app_review_vol
data <- mutate(data, imp_large = ifelse(imp_size == "728x90", 1, 0)) #1 if the banner is the large 728x90 size
data <- mutate(data, cat_entertainment = ifelse(app_topcat == "IAB1" | app_topcat == "IAB1-6", 1, 0)) #Entertainment category apps
data <- mutate(data, cat_social = ifelse(app_topcat == "IAB14", 1, 0)) #Social category apps
data <- mutate(data, cat_tech = ifelse(app_topcat == "IAB19-6", 1, 0)) #Technology category apps
data <- mutate(data, os_ios = ifelse(device_os == "iOS", 1, 0)) #1 if the device runs iOS
data <- mutate(data, ln_app_review_vol = log(app_review_vol)) #Log of app review volume
#Convert all coordinates from degrees to radians before applying the spherical law of cosines
data$device_lat = as_radians(data$device_lat)
data$device_lon = as_radians(data$device_lon)
data$geofence_lat = as_radians(data$geofence_lat)
data$geofence_lon = as_radians(data$geofence_lon)
#Great-circle distance in km (Earth radius 6371 km); pmin guards against floating-point values just above 1 making acos return NaN
data <- mutate(data, distance = acos(pmin(1, sin(device_lat) * sin(geofence_lat) + cos(device_lat) * cos(geofence_lat) * cos(device_lon - geofence_lon))) * 6371)
#head(data$distance)
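As a quick sanity check of the spherical law of cosines formula, the great-circle distance between Los Angeles and Chicago should come out near 2,800 km. The approximate airport coordinates below are illustrative values, not data from the assignment:
deg2rad <- function(deg) deg * pi / 180 #Small helper so the check does not depend on aspace
lat1 <- deg2rad(33.94); lon1 <- deg2rad(-118.41) #Los Angeles (LAX), approximate
lat2 <- deg2rad(41.97); lon2 <- deg2rad(-87.91) #Chicago (ORD), approximate
acos(sin(lat1) * sin(lat2) + cos(lat1) * cos(lat2) * cos(lon1 - lon2)) * 6371 #~2,800 km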
data <- mutate(data, distance_group = cut(distance, breaks = c(0,0.5,1,2,4,7,10,Inf), labels = c(1,2,3,4,5,6,7)))
data$distance_group <- as.integer(data$distance_group)
data <- mutate(data, distance_squared=(data$distance)^2)
We select only the features below, as stated in the question, with distance represented by the distance_group and distance_squared variables derived above.
1. didclick 2. distance_group 3. distance_squared 4. imp_large 5. cat_entertainment
6. cat_social 7. cat_tech 8. os_ios 9. ln_app_review_vol 10. app_review_val
We create a new data frame, data1, and assign the values to it. The new data frame has 121,567 observations and 10 variables.
data1 <- select(data,didclick,imp_large,cat_tech,cat_entertainment,cat_social,ln_app_review_vol,app_review_val,os_ios,distance_group,distance_squared)
str(data1)
We now calculate summary statistics for the data: mean, median, standard deviation, minimum, maximum, skewness and kurtosis. We also plot histograms of the numeric variables.
psych::describe(data1, check = FALSE) #Descriptive statistics
dplyr::select_if(data1,is.numeric) %>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + scale_fill_gradient(low = "red",high = "yellow") +
geom_histogram(bins=30)
We generate the correlation matrix for the numeric variables using the Pearson method. (Note that cor() uses only the first element of a method vector, so passing c("pearson", "kendall", "spearman") would still compute Pearson correlations.)
M = cor(data1, method = "pearson")
M
We generate a diagrammatic representation of the correlation values, with the values plotted in the corresponding boxes. Red markings indicate negative correlations while deep blue markings indicate strong positive correlations.
corrplot(M, method = "number", addrect = 2)
We compute the overall click-through rate across all impressions, both as a proportion and as a percentage.
click_through_rate=sum(data$didclick==1)/(sum(data$didclick==1)+sum(data$didclick==0))
print(paste("Click through rate for the advertising impression is ", round(click_through_rate,4)))
click_through_rate_percentage=click_through_rate*100
print(paste("Click through rate percentage for the advertising impression is ", round(click_through_rate_percentage,2),"%"))
We examine how the click-through rate varies over the distance groups (1-7) through the plot below.
data_temp=group_by(data1,distance_group)
abc=summarise(data_temp, count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n())
df = as.data.frame(abc)
plot(x=df$distance_group,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Distance from Geofencing POI", ylab = "Click Through Rate", main = "Distance Vs Click Through Rate")
We also plot the pairwise relationships between distance group, count of clicks, click-through rate and total impressions per group.
plot(summarise(data_temp, count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
We generate comparison graphs to study the variation of click-through rate across predictors in the data frame: device OS, impression size, technology category, entertainment category, social category, log of app review volume, and app review rating.
par(mfrow=c(4,2))
df=as.data.frame(summarise(group_by(data1,os_ios), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$os_ios,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Operating system (1= iOs)", ylab = "Click Through Rate", main = "Operating System Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,imp_large), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$imp_large,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Impression size (1=Large)", ylab = "Click Through Rate", main = "Impression Size Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,cat_tech), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$cat_tech,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Category Technology (1=Yes)", ylab = "Click Through Rate", main = "Category Technology Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,cat_entertainment), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$cat_entertainment,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Category Entertainment (1=Yes)", ylab = "Click Through Rate", main = "Category Entertainment Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,cat_social), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$cat_social,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Category Social (1=Yes)", ylab = "Click Through Rate", main = "Category Social Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,ln_app_review_vol), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$ln_app_review_vol,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on Log of App review volume", ylab = "Click Through Rate", main = "Log of App Review Volume Vs Click Through Rate")
df=as.data.frame(summarise(group_by(data1,app_review_val), count_of_clicks=sum(didclick),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
plot(x=df$app_review_val,y=df$click_through_rate, type = "o", col = "blue",xlab = "Group based on App Review Rating", ylab = "Click Through Rate", main = "App Review Rating Vs Click Through Rate")
We build a basic logistic regression model using the glm function.
table(data1$didclick)
glm_model_1=glm(didclick~ . ,data=data1,family = binomial)
We generate the summary statistics of the model using the summary function. It shows that only a few predictors are significant: impression size, the technology category, the operating system of the mobile device, and the distance group based on distance from the geofence centre.
summary(glm_model_1)
We now evaluate the performance of the model using a confusion matrix.
confusion_matrix(glm_model_1,DATA=data1)
The model does not predict any positive didclick values, as the confusion matrix above shows, yet its accuracy is still high since the click-through rate is only 0.0068. One option is to weight the didclick observations (sketched in the commented code below); instead we oversample the rare positive class with SMOTE.
#data2<-data1
#data2$weight<- ifelse(data2$didclick==1,1,0.1)
#data2$weight<- 1:nrow(data2)
#glm_model_2=glm(didclick~ . ,data=data2,weights=weight,family = binomial)
#summary(glm_model_2)
#Supersampling Rare Events in R
data_smote <- data1
data_smote$didclick <- as.factor(data_smote$didclick)
data_smote=(DMwR::SMOTE(didclick ~ ., data_smote,perc.over = 100, perc.under=200))
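With perc.over = 100, SMOTE creates one synthetic positive for each existing positive (doubling the minority class), and perc.under = 200 samples two majority cases for each synthetic case created, so the resulting set is roughly balanced. We can verify the class balance:
table(data_smote$didclick) #Roughly equal counts of 0 and 1 after SMOTE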
glm_model_2=glm(didclick~os_ios+distance_group+cat_tech+imp_large ,data=data_smote,family = binomial)
summary(glm_model_2)
confusion_matrix(glm_model_2,DATA=data1)
We calculate the variance inflation factor (VIF) to check that the predictors do not exhibit high multicollinearity.
vif(glm_model_2)
We now check whether the reduced model (glm_model_2) fits as well as glm_model_1 with all predictors, using the chi-square test from anova.
glm_model_1=glm(didclick~ . ,data=data_smote,family = binomial) #To ensure models are fitted to the same size of dataset
anova(glm_model_1,glm_model_2, test="Chisq")
The non-significant chi-square probability (p = 0.25) suggests that the reduced model fits as well as the full model.
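The p-value reported by anova() is the upper tail of a chi-square distribution evaluated at the drop in residual deviance; equivalently, using the columns of the returned anova table:
a <- anova(glm_model_1, glm_model_2)
pchisq(abs(a$Deviance[2]), df = abs(a$Df[2]), lower.tail = FALSE) #Same p-value as test="Chisq"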
We also plot the ROC curve and compute the AUC for glm_model_2.
prob_train <- predict(glm_model_2, type = "response") ##List of predictions from model
pred <- prediction(prob_train, data_smote$didclick)
perf <- performance(pred, measure="tpr", x.measure="fpr") ##Get true-positive and false-positive rates
##AUC Score
perf_auc <- performance(pred, measure="auc")
auc <- perf_auc@y.values[[1]]
print(paste("AUC Value is ",auc)) ##0.564
#ROC Curve
par(mfrow=c(1,1))
plot(perf, col=rainbow(10), colorize=T, print.cutoffs.at=seq(0,1,0.05)) ##Plot ROC Curve
abline(a=0, b= 1)
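One optional way to use this curve (a sketch, not required by the assignment) is to pick the probability cutoff that maximises Youden's J statistic (TPR - FPR):
tpr <- perf@y.values[[1]] #True-positive rates along the curve
fpr <- perf@x.values[[1]] #False-positive rates along the curve
cutoffs <- perf@alpha.values[[1]] #The probability cutoffs ROCR evaluated
cutoffs[which.max(tpr - fpr)] #Cutoff with the best TPR/FPR trade-off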
We are also interested in whether the geofence radius parameter is of any significance, so we start by studying how the radius varies with latitude and longitude.
We use the summarise function to study the variation of radius across the unique latitude/longitude groups.
data2 <- data1
data2$gepfence_radius=data_backup$gepfence_radius
data2$geofence_lon=data_backup$geofence_lon
data2$geofence_lat=data_backup$geofence_lat
data2_temp <- data2 %>%
group_by(geofence_lon,geofence_lat) %>%
summarise(min=min(gepfence_radius),max=max(gepfence_radius),average_radius=mean(gepfence_radius))
data2_temp
There seems to be a pattern in the distribution of radius across locations: geofences at longitudes around -117 degrees have a radius of 11.263 km, while those at longitudes around -87 degrees have a radius of 5 km. To understand this better, we plot a geographical map.
data2_temp$average_radius=as.factor(data2_temp$average_radius)
data2_temp %>%
ggplot(aes(x=geofence_lon,y=geofence_lat,fill=average_radius)) + borders("state") +geom_point(aes(colour=average_radius),size=1.5) +coord_map() +theme_void() #Remove the theme
From the above graph we can infer that the red dots correspond to Chicago and the blue dots to Los Angeles. There also seems to be a difference in radius between the two cities: Los Angeles has a radius of 11.263 km and Chicago has a radius of 5 km. So we dig deeper to study whether the click-through rate changes by city.
We plot the click-through rate, total impressions and count of clicks for the two cities.
data2$city<- as.factor(ifelse(data$gepfence_radius == 5,"Chicago","Los Angeles"))
df=as.data.frame(summarise(group_by(data2,city), count_of_clicks=sum(didclick==1),click_through_rate=sum(didclick==1)/(sum(didclick==0)+sum(didclick==1)),impressions=n()))
require(gridExtra)
plot1<- qplot(x=df$city,y=df$click_through_rate,size=I(2),xlab = "City", ylab = "Click Through Rate", main = "Click Through Rate")
plot2<- qplot(x=df$city,y=df$impressions,size=I(2),xlab = "City", ylab = "Total Number of Impressions", main = "Impression Count")
plot3<- qplot(x=df$city,y=df$count_of_clicks,size=I(2),xlab = "City", ylab = "Number of Clicks", main = "Count of Clicks")
gridExtra::grid.arrange(plot1, plot2,plot3, ncol=3)
There is a significant difference in click-through rate between the two cities (CTR in Chicago = 0.02359, CTR in LA = 0.00644), so we investigate further. We create a ratio attribute denoting the normalised distance from the centre of the geofence, which expresses how far the user is from the geofence centre in units of the radius.
We define the formula as Normalised Distance = (Distance in km) / (Radius in km)
We also define some features based on the normalised distance, as below.
1. inside_quarterradius_circle is 1 if the user is within radius/4 of the geofence centre.
2. inside_halfradius_circle is 1 if the user is between radius/4 and radius/2 from the geofence centre.
3. outside_geofence_radius is 1 if the user is outside the circle centred on the geofence with the gepfence_radius.
We also study the distributions of the generated parameters.
data_city<- data2
data_city$normalised_distance<- sqrt(data2$distance_squared)/data2$gepfence_radius
data_city$inside_quarterradius_circle <- ifelse(data_city$normalised_distance<= 0.25,1,0)
data_city$inside_halfradius_circle <- ifelse((data_city$normalised_distance>= 0.25)&(data_city$normalised_distance<= 0.5) ,1,0)
data_city$outside_geofence_radius<- ifelse(data_city$normalised_distance>=1,1,0)
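As a quick check of how impressions split across these distance bands (the indicators are 0/1, so the column means are the shares of impressions in each band):
colMeans(select(data_city, inside_quarterradius_circle, inside_halfradius_circle, outside_geofence_radius)) #Share of impressions per band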
select(data_city,normalised_distance,inside_quarterradius_circle,inside_halfradius_circle,outside_geofence_radius)%>%
gather() %>%
ggplot(aes(value)) +
facet_wrap(~ key, scales = "free") + scale_fill_gradient(low = "red",high = "yellow") +
geom_histogram(bins=30) #+ facet_grid(cols = vars(key))
We take a log of the normalised distance, since its distribution is skewed, and study the correlation between the related parameters, since we do not want to feed correlated parameters to the glm model.
data_city$ln_normalised_distance <- log(data_city$normalised_distance)
data_corr<- select(data_city,normalised_distance,ln_normalised_distance,distance_group,distance_squared,gepfence_radius,inside_quarterradius_circle,inside_halfradius_circle,outside_geofence_radius)
M<- cor(data_corr)
corrplot(M,method="number",addrect=2) ##circle
From the above correlation plot we can see high correlation between normalised distance, log of normalised distance, distance group, and squared distance. So we keep only ln_normalised_distance, inside_quarterradius_circle and outside_geofence_radius, to avoid multicollinearity in the model.
data_city$distance_group<- NULL
data_city$distance_squared<- NULL
data_city$gepfence_radius<- NULL
data_city$geofence_lon<- NULL
data_city$geofence_lat<- NULL
data_city$inside_halfradius_circle <- NULL
data_city$normalised_distance <- NULL
Now that the highly correlated predictors have been removed, we build the model.
glm_model_3=glm(didclick~ . ,data=data_city,family = binomial)
summary(glm_model_3)
The model has several good predictors: whether the user is inside the quarter-radius circle, the city the user is in, the number of reviews for the app, whether the app is in the technology category, impression size, etc.
We calculate the confusion matrix for the model.
confusion_matrix(glm_model_3,DATA=data_city)
The model is not able to predict any didclicks, so we enhance it with SMOTE to oversample the didclick = 1 records and keep only the significant predictors.
data_smote <- data_city
data_smote$didclick <- as.factor(data_smote$didclick)
data_smote=(DMwR::SMOTE(didclick ~ ., data_smote,perc.over = 100, perc.under=200))
glm_model_4=glm(didclick~ imp_large+cat_tech+ln_app_review_vol+city+inside_quarterradius_circle ,data=data_smote,family = binomial)
summary(glm_model_4)
confusion_matrix(glm_model_4,DATA=data_city)
The model shows that parameters such as presence inside the quarter-radius circle and the city are good estimators of didclick. We also run a VIF check to ensure there is no multicollinearity in the glm model.
vif(glm_model_4)
The variance inflation factors for inside_quarterradius_circle, city, cat_tech, ln_app_review_vol and imp_large are all close to 1, indicating very little multicollinearity among these predictors.
We now check whether the reduced model (glm_model_4) fits as well as glm_model_3 with all predictors, using the chi-square test from anova.
glm_model_3=glm(didclick~ . ,data=data_smote,family = binomial) #To ensure models are fitted to the same size of dataset
anova(glm_model_3,glm_model_4, test="Chisq")
The non-significant chi-square probability (p = 0.56) suggests that the reduced model fits as well as the full model.
We also plot the ROC curve and compute the AUC for glm_model_4.
prob_train <- predict(glm_model_4, type = "response") ##List of predictions from model
pred <- prediction(prob_train, data_smote$didclick)
perf <- performance(pred, measure="tpr", x.measure="fpr") ##Get true-positive and false-positive rates
##AUC Score
perf_auc <- performance(pred, measure="auc")
auc <- perf_auc@y.values[[1]]
print(paste("AUC Value is ",auc)) ##0.564
#ROC Curve
par(mfrow=c(1,1))
plot(perf, col=rainbow(10), colorize=T, print.cutoffs.at=seq(0,1,0.05)) ##Plot ROC Curve
abline(a=0, b= 1)
The AUC for model 4 is 0.608, which is higher than but similar to model 2's 0.57, and its AIC of 4449.6 is lower than but similar to model 2's AIC of 4549.3. So we can conclude that models 2 and 4 are good, since their AIC values are far below those of the other models, which exceed 9,000.
We now calculate the exponentiated coefficients (odds ratios) of the significant variables from the two glm models to derive our findings and implications.
exp(coef(glm_model_2))
exp(coef(glm_model_4))
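To interpret these odds ratios on the probability scale, a small illustrative helper (the function name and its default baseline are ours, not from the assignment) converts an odds ratio into the implied CTR at the overall baseline CTR of 0.0068:
implied_ctr <- function(odds_ratio, baseline_ctr = 0.0068) {
  odds <- baseline_ctr / (1 - baseline_ctr) * odds_ratio #Scale the baseline odds by the odds ratio
  odds / (1 + odds) #Convert the odds back to a probability
}
implied_ctr(2) #An odds ratio of 2 roughly doubles a CTR this small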
Geofencing is a location-based digital marketing technology used by advertisers to target users with specific ads based on their proximity to local businesses. A local business can build a geofence around itself, i.e. a circle of a particular radius (e.g. 5 km), to target users who enter that geofence.
In this case study we are interested in which parameters have a significant impact on the click-through rate for impressions shown to users through geofenced advertising. The overall click-through rate is 0.0068, meaning about 7 of every 1,000 targeted users clicked the advertisement. The descriptive statistics and correlations indicate that the count of clicks is highest for group 3, users 1-2 km from the geofence centre (the local business); the number of impressions is also highest for this group, showing it is one of the key advertising areas for marketers. The click-through rate, however, is highest for group 1, users within 0.5 km of the centre. From the distributions we might conclude that click-through rate decreases as the distance group increases, i.e. distance from the business is inversely related to click-through rate, although distance group 6 is an exception that we address with the normalised-distance concept.
From the click-through rate comparison plots we also observed the following trends: the aggregated CTR for technology apps was higher than for entertainment and social apps; apps with a review rating of 4.2 had the highest click-through rates; and smaller impression/screen sizes had better click-through rates.
The technology category has the highest correlation with the other predictors: high negative correlation with the entertainment and social categories, app review value and os_ios, and high positive correlation with imp_large. We therefore use it as a major predictor and discard the other correlated variables so that it carries a lower VIF score.
The first logistic regression reinforces these findings: a user on iOS viewing a tech app has a higher probability of clicking, since those coefficients (as odds ratios) are greater than 1. The user is less likely to click when the impression/screen size is larger and when the distance group is higher.
To investigate further, since the AUC of the first glm model is only 0.57, we dig into geographical details such as city and radius, which yield some interesting predictors and a model with a higher AUC of 0.608. The sample contains data from two cities, Chicago and Los Angeles, with very different click-through rates (CTR in Chicago = 0.02359, CTR in LA = 0.00644): users in Chicago were almost 4 times more likely to click on ads than users in LA. We also generated the normalised-distance parameter described above.
From the logistic regression model (glm_model_4) we find geo predictors: users within a quarter of the geofence radius are more likely to click on ads than those farther out. Predictors for presence within half the geofence radius, or outside the geofence radius, are not significant. So presence within a quarter of the geofence radius is the only significant distance parameter, and raw distance from the centre is otherwise of little significance.