High Note is an anonymized real music streaming company similar to Last.fm, Spotify or Pandora which uses a freemium business model. The expectation is to use the High Note customer data, and build a "free to fee" strategy to convert free accounts to premium subscribers.
We start of by importing the below packages for the functionalities as commented below.
library(dplyr) # Package for Data preprocessing
library(psych) # Package for summarising data in tables
library(ggplot2) # Package for Visualisations
library(MatchIt) # Package for Matching based on propensity score
library(corrplot) # Package for Correlation plots
library(gridExtra) # Package for plotting multiple plots
library(ggExtra) # Package for Visual Enhancement
library(car) # Package for VIF
options(scipen = 999) # To ensure results are not in exponential
We change the working directory, import the csv file and check the dimensions of the dataframe
getwd()
setwd('C:/Users/Nijanth Anand/Downloads/BANA277-Customer Analytics/Assignment - MidTerm')
data=read.csv("HighNote Data Midterm.csv",na.strings=c(""," ","NA"))
print('Dimension of data is')
print(paste('Row Count:',dim(data)[1]))
print(paste('Column Count:',dim(data)[2]))
We check if there are NA Values
sum(is.na(data))
Since there are no NA values, we work on the descriptive statistics based on the adopter value as stated in the reading
xyz<- data[data$adopter==1,c(2,3,4,8,9,12,11,13,10,7)]
a<-psych::describe(xyz,skew=FALSE)
xyz<- data[data$adopter==0,c(2,3,4,8,9,12,11,13,10,7)]
b<-psych::describe(xyz,skew=FALSE)
print("For Adopter =1")
a
print("For Adopter =0")
b
#data[data$adopter==0,-1] %>% psych::describe(,skew=FALSE,mat=TRUE)
We can see there is a difference in means between adopter=0 and adopter=1 grouped records, for instance the average songs listened by adopters is 33K while non adopters have listened to about 17K songs which is half of it.
We assign ID attribute to NULL since we will not be using it further for the modelling and descriptive analytics
data$ID<- NULL
Assign a new data frame for descriptive statistics computation since we will be factorising variables
data_desc<- data
We make the adopter variable a String Factor to help visualize how adopters and non-adopters differ from each other.
data_desc$paid <- ifelse(data_desc$adopter==1,'Subscriber','Free User')
table(data_desc$paid)
str(data)
cor_data<-data[,c('age','male','good_country','friend_cnt','adopter')]
M=cor(cor_data,method = c("pearson", "kendall", "spearman"))
corrplot(M,method="number",addrect=2) ##circle
cor_data<-data[,c('friend_cnt','avg_friend_age','avg_friend_male','friend_country_cnt','subscriber_friend_cnt','adopter')]
M=cor(cor_data,method = c("pearson", "kendall", "spearman"))
corrplot(M,method="number",addrect=2) ##circle
cor_data<-data[,c('songsListened','lovedTracks','posts','playlists','shouts','tenure','adopter')]
M=cor(cor_data,method = c("pearson", "kendall", "spearman"))
corrplot(M,method="number",addrect=2) ##circle
Plot between friends count and age which is studied for users based on country and adopter status.
labs <- paste("Country Type:", c("Good", "Bad"))
labs2 <- paste(c("Paid User", "Free User"))
data %>% mutate(good_country = ifelse(good_country == 1, labs[1], labs[2]))%>%
mutate(adopter = ifelse(adopter == 1, labs2[1], labs2[2]))%>%
ggplot(aes(x=age, y=friend_cnt,shape=adopter, color=adopter)) + geom_point(size=2)+
facet_wrap(~good_country)+ylim(0,2500)
We can conclude that the number of friends is relatively higher in bad countries and paid users are normally distributed along age specifically between th20 and 40.
Box Plots for Demographics to study the distribution
data_desc$paid=as.factor(data_desc$paid)
data_desc$male=as.factor(data_desc$male)
data_desc$good_country=as.factor(data_desc$good_country)
print("Box Plot for Demographic understanding to help visualize how adopters and non-adopters perform")
require(gridExtra)
abc1 <- ggplot(data_desc, aes(x=paid, y=age)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc1 <- abc1 + scale_fill_manual(values=c("#FF4500", "#32CD32")) +theme(legend.position="none")+theme(axis.title.x=element_blank())
abc2<- ggplot(data_desc, aes(x=age)) + geom_bar()
abc3<- ggplot(data_desc, aes(x=age)) + geom_density(kernel='gaussian')
data1<-data_desc[data_desc$male==0,]
abc4 <- ggplot(data1, aes(x=paid)) +labs(title = "Female Users")+ geom_bar(aes(fill=paid) )+theme(axis.title.x=element_blank())+ scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")
data2<-data_desc[data_desc$male==1,]
abc5 <- ggplot(data2, aes(x=paid)) +labs(title = "Male Users")+ geom_bar(aes(fill=paid) )+theme(axis.title.x=element_blank())+ scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none") #geom_bar(stat="identity") #geom_histogram()
abc6 <- ggplot(data_desc, aes(paid))
abc6 <- abc6 + geom_bar(aes(fill = male))+ scale_fill_manual(values=c("#808080", "#FFD700"))+theme(axis.title.x=element_blank())
data1<-data_desc[data_desc$good_country==0,]
abc7 <- ggplot(data1, aes(x=paid)) +labs(title = "Bad Country Users")+ geom_bar(aes(fill=paid) )+theme(axis.title.x=element_blank())+ scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")
data2<-data_desc[data_desc$good_country==1,]
abc8 <- ggplot(data2, aes(x=paid)) +labs(title = "Good Country Users")+ geom_bar(aes(fill=paid) ) +theme(axis.title.x=element_blank())+ scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")#geom_bar(stat="identity") #geom_histogram()
data_descx=mutate(data_desc,country=good_country)
abc9 <- ggplot(data_descx, aes(paid))
abc9 <- abc9 + geom_bar(aes(fill = country))+ scale_fill_manual(values=c("#808080", "#FFD700"))+theme(axis.title.x=element_blank())
grid.arrange(abc1,abc2,abc3,abc4,abc5,abc6,abc7,abc8,abc9, ncol=3)
Plot between friend count and subscriber count to understand if having more friends leads to more subscriber friends based on gender and adopter.
labs <- paste("Gender:", c("Male", "Female"))
labs2 <- paste(c("Paid User", "Free User"))
data %>% mutate(male = ifelse(male == 1, labs[1], labs[2]))%>%
mutate(adopter = ifelse(adopter == 1, labs2[1], labs2[2]))%>%
ggplot(aes(x=friend_cnt, y=subscriber_friend_cnt,shape=male, color=male)) + geom_point(size=2)+
facet_wrap(~adopter)+ylim(0,100)
We observe that the Paid users have lesser friends, compared to free users and gender doesn't have much difference in it.
We plot the Box plots to understand the spread of the variables across adopters and non-adopters
print("Box Plot for Peer Influence understanding to help visualize how adopters and non-adopters perform")
abc2 <- ggplot(data_desc, aes(x=paid, y=friend_cnt)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc2 <- abc2 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$friend_cnt, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc3 <- ggplot(data_desc, aes(x=paid, y=avg_friend_age)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc3 <- abc3 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$avg_friend_age, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc4 <- ggplot(data_desc, aes(x=paid, y=avg_friend_male)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc4 <- abc4 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$avg_friend_male, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc5 <- ggplot(data_desc, aes(x=paid, y=friend_country_cnt)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc5 <- abc5 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$friend_country_cnt, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc6 <- ggplot(data_desc, aes(x=paid, y=subscriber_friend_cnt)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc6 <- abc6 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$subscriber_friend_cnt, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
grid.arrange(abc2,abc3,abc4,abc5,abc6, ncol=3)
Plot between songs listened and loved to understand based on gender and adopter to check if user engagement changes based on the said covariants
labs <- paste("Gender:", c("Male", "Female"))
labs2 <- paste(c("Paid User", "Free User"))
data %>% mutate(male = ifelse(male == 1, labs[1], labs[2]))%>%
mutate(adopter = ifelse(adopter == 1, labs2[1], labs2[2]))%>%
ggplot(aes(x=songsListened, y=lovedTracks,shape=adopter, color=adopter)) + geom_point(size=2)+
facet_wrap(~male)#+ylim(0,100)
We can observe that in the given sample, male users listen to more number of songs compared to female users and the users who have listened to the maximum number of songs or at the top 95th percentile aren't essentially paid customers.
Plot between songs listened and playlists to understand based on gender and adopter to check if user engagement changes based on the said covariants
labs <- paste("Gender:", c("Male", "Female"))
labs2 <- paste(c("Paid User", "Free User"))
data %>% mutate(male = ifelse(male == 1, labs[1], labs[2]))%>%
mutate(adopter = ifelse(adopter == 1, labs2[1], labs2[2]))%>%
ggplot(aes(x=songsListened, y=playlists,shape=adopter, color=adopter)) + geom_point(size=2)+
facet_wrap(~male) #+ylim(0,150)
The number of playlists doesn't essentially depend on the the count of songs listened to but in general males have higher counts of playlists in comparision to female users. Similar to the previous graph, users who have the highest engagement with maximum number of playlists or being in the top 95th percentile doesn't essentially guarantee they are paid customers.
We plot the Box plots to understand the spread of the variables across adopters and non-adopters
print("Box Plot for User Engagement Influence understanding to help visualize how adopters and non-adopters perform")
abc7 <- ggplot(data_desc, aes(x=paid, y=songsListened)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc7 <- abc7 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$songsListened, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc8 <- ggplot(data_desc, aes(x=paid, y=lovedTracks)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc8 <- abc8 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$lovedTracks, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc9 <- ggplot(data_desc, aes(x=paid, y=posts)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc9 <- abc9 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$posts, c(0.1, 0.9)))+theme(axis.title.x=element_blank())
abc10 <- ggplot(data_desc, aes(x=paid, y=playlists)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc10 <- abc10 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$playlists, c(0, 0.95)))
abc11 <- ggplot(data_desc, aes(x=paid, y=shouts)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc11 <- abc11 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$shouts, c(0.1, 0.9)))
abc12 <- ggplot(data_desc, aes(x=paid, y=tenure)) + geom_boxplot(aes(fill=paid),outlier.colour="red", outlier.shape=8,outlier.size=1)
abc12 <- abc12 + scale_fill_manual(values=c("#FF4500", "#32CD32"))+theme(legend.position="none")+scale_y_continuous(limits = quantile(data_desc$tenure, c(0.1, 0.9)))
grid.arrange(abc7,abc8,abc9,abc10,abc11,abc12, ncol=3)
We plot the variation of parameters based on the user engagement.
pairs(data[,c(8:12,14)])
Correlation plot of all predictors
M=cor(data,method = c("pearson", "kendall", "spearman"))
corrplot(M,method="number",addrect=2)
We define treatment as users who have one or more than 1 subscriber friends and study the variation of tratment across users from the two adopter classes
data <- mutate(data,treatment=ifelse(data$subscriber_friend_cnt >=1,1,0))
data %>%
group_by(adopter) %>% summarise(mean_treatment = mean(treatment),users=n())
There seems to be a difference in the treatment for the two class of adopters since the means are not similar so we run a t-test to find if there is any significance.
with(data, t.test(treatment ~ adopter))
So we can say that the treatment is a significant predictor and try understanding the significance across covariants
data %>%
group_by(treatment) %>%
summarise_all(funs(mean(., na.rm = T)))
We can see there is a significant difference in mean of the etimators for the treatment so we run a t-test on the means.
They is significance between them since the means are not the same so we calculate the t-test for each covariate by treatment status
labels <- c('age','male','friend_cnt','avg_friend_age','avg_friend_male','friend_country_cnt','songsListened','lovedTracks','posts','playlists','shouts','tenure','good_country')
lapply(labels, function(v) {
t.test(data[, v] ~ data$treatment)
})
The alternative hypothesis: true difference in means is not equal to 0 is the outcome for all the variables in the data in comparision with the adopter variable.
The difference-in-means is statistically significant at conventional levels of confidence as the p value is less than 0.05
We estimate the propensity score by running a logit model (probit also works), where the outcome variable is a binary variable indicating treatment status.
It helps understand covariates(Independent Variables) we should include for the matching to give a causal estimate.In the end, you need to include any covariate that is related to both the treatment assignment and potential outcomes
model_pse <- glm(treatment ~ age+male+friend_cnt+avg_friend_age+avg_friend_male+friend_country_cnt+songsListened+lovedTracks+posts+playlists+shouts+tenure+good_country,
family = binomial(), data = data)
summary(model_pse)
vif(model_pse)
Using this model, we can now calculate the propensity score for each user. It is simply the user’s predicted probability of being Treated, given the estimates from the logit model. Below, I calculate this propensity score using predict() and create a dataframe that has the propensity score as well as the users actual treatment status.
propensity_score <- data.frame(pr_score = predict(model_pse, type = "response"),
treatment=model_pse$model$treatment)
We now examine the region of common support, so post estimating the propensity score, we plot histograms of the estimated propensity scores based on treatment status to study the distribution
labs <- paste("Treatment Type:", c("0 Subscriber friends", "1 or more Subscriber friends"))
propensity_score %>%
mutate(treatment = ifelse(treatment == 0, labs[1], labs[2])) %>%
ggplot(aes(x = pr_score)) +
geom_histogram(color = "white",bins=55) +
facet_wrap(~treatment) +
xlab("Probability of having subscriber friends") +
theme_bw()+xlim(0,1.01)
We check if there are NA Values before the matching since the MatchIt doesn't allow NA Values.
sum(is.na(data))
Since the count of NA values is zero we proceed with the MatchIt function
mod_match <- matchit(treatment ~ age+male+friend_cnt+avg_friend_age+avg_friend_male+friend_country_cnt+songsListened+lovedTracks+posts+playlists+shouts+tenure+good_country,
method = "nearest", data = data)
summary(mod_match)
Since there are 9823 records that have been treated[Subscriber Friend count is >= 1], the Matchit matches it with 9823 records from the Control that have the nearest Propensity score estimate.
So we now have 9823 treated and 9823 control records that have been matched based on Propensity score.
plot(mod_match)
We can see how post matching the QQ plots are normalised for the covariants where All on the left side denoted before Matching and Matched refers to the QQ Plot of the covariant after Propensity score matching.
We now create a dataframe with only the 9823 treatment records and 9823 matched Control records using the match.data function.
dta_m <- match.data(mod_match)
print("Dimensions of the matched data frame")
dim(dta_m)
print("Table of Treatment distriution in the matched data frame")
table(dta_m$treatment)
Out of curiousity we want to study the distribution of Propensity score estimated distance across Control and Treatment. So we plot a boxplot for it.
dta_m$treat <- as.factor(dta_m$treatment)
abc2 <- ggplot(dta_m, aes(y=distance,x=treat)) + geom_boxplot(outlier.colour="red", outlier.shape=8,outlier.size=1)
abc2 <- abc2 +theme(legend.position="none")
abc2
dta_m$treat <- NULL
We can observe that there are covariates whose values are matched between treatment and control, so we do a visual inspection of it.
fn_bal <- function(dta, variable) {
dta$variable <- dta[, variable]
dta$treatment <- as.factor(dta$treatment)
support <- c(min(dta$variable), max(dta$variable))
ggplot(dta, aes(x = distance, y = variable, color = treatment)) +
geom_point(alpha = 0.2, size = 1.3) +
geom_smooth(method = "loess", se = F) +
xlab("Propensity score") +
ylab(variable) +
theme_bw() +
ylim(support)
}
grid.arrange(
fn_bal(dta_m, "age"),
fn_bal(dta_m, "male") + theme(legend.position = "none"),
fn_bal(dta_m, "friend_cnt"),
fn_bal(dta_m, "avg_friend_age") + theme(legend.position = "none"),
fn_bal(dta_m, "avg_friend_male"),
fn_bal(dta_m, "songsListened") + theme(legend.position = "none"),
fn_bal(dta_m, "lovedTracks"),
fn_bal(dta_m, "posts") + theme(legend.position = "none"),
fn_bal(dta_m, "playlists"),
fn_bal(dta_m, "shouts") + theme(legend.position = "none"),
fn_bal(dta_m, "tenure"),
fn_bal(dta_m, "good_country") + theme(legend.position = "none"),
nrow = 7)
We compare the means across the treatments post propensity score matching
group_by(dta_m,treatment) %>%
summarise_all(funs(mean(., na.rm = T)))
The means are in range for selected predictors and we now try finding if the difference of means is significant through a t-test for the covariants
labels <- c('age','male','friend_cnt','avg_friend_age','avg_friend_male','friend_country_cnt','songsListened','lovedTracks','posts','playlists','shouts','tenure','good_country')
lapply(labels, function(v) {
t.test(dta_m[, v] ~ dta_m$treatment)
})
We now Estimate the treatment effects since we have the matched sample that we are happy with. We can use a t-test:
with(dta_m, t.test(adopter~ treatment))
Now, we will use a logistic regression approach to test which variables (including subscriber friends) are significant for explaining the likelihood of becoming an adopter. We use our judgment and visualization results to decide which variables to include in the regression. We also estimate the odds ratios for the key variables.
We build a glm model with only the treatment variable to check if it is a significant predictor of the adopter variable.
glm_treat1 <- glm(adopter ~ treatment, data = dta_m,family='binomial')
summary(glm_treat1)
exp(coef(glm_treat1))
The treatment [Having subscriber friends] is a significant predictor in judging if the user will be a free or a fee user. The odds ratio of the treatment is 2.27 which is greater than 1 and hence supports the case, also the probability is almost zero which makes it a good predictor
We now build a glm model with the treatment variable and other covariates to predict the adopter status.
glm_treat2 <- glm(adopter ~ treatment+age+male+friend_cnt+avg_friend_age+avg_friend_male+friend_country_cnt+songsListened+lovedTracks+posts+playlists+shouts+tenure+good_country+treatment, data = dta_m,family='binomial')
summary(glm_treat2)
Predictor variables such as friend count, average friend being a male, posts and shouts don't have much significance in deciding if the user would be a free or a fee customer. Variables such as age, gender, having subscriber friends, songs listened, loved tracks, country from and playlists have significant impact on the adopter variable.
We build the third glm model with only the significant predictors from the glm_model_2
glm_treat3 <- glm(adopter ~ treatment+age+male+avg_friend_age+friend_country_cnt+songsListened+lovedTracks+playlists+tenure+good_country, data = dta_m)
summary(glm_treat3)
We find all the predictors are significant in the model and hence calculate the odds ratio of the estimators below.
exp(coef(glm_treat3))
We build a glm model with the same significant predictors from glm_model_3 but with the original dataset which has not been matched by propensity score, to try finding if the odds ratio are similar.
glm_treat4 <- glm(adopter ~ treatment+age+male+avg_friend_age+friend_country_cnt+songsListened+lovedTracks+playlists+tenure+good_country, data = data)
summary(glm_treat4)
All the predictors are significant and we hence find the odds ratio.
exp(coef(glm_treat4))
We can find that the odds ratio is similar for the glm model built with the propensity score matched data and the given dataset.
Our conclusion from the result would be that, the odds ratio of the user converting to a paid subscriber of the music app increases if the user has friends who are subscribers as well. Also it was found that people who are older tend to have sunscriptions because of the affordability factor and males have a higher probability of converting to paid users.
User engagement helps increase the chances of the customer paying for the services as the engagement variables songs listened, loved tracks, playlists have positive odds ratio coeffecients. Tenure however has an odds ratio less than 1 which we discuss in the implications part of Q5. Average friend age also has a postive odds ratio similar to the age predictor.
Key take aways from our analysis and the suggestions we would give High Note for a “free-to-fee” strategy would be the following. We find that demographics, peer influence, user engagement all have good effect on the adopter variable and hence formulate the below findings and suggestions to the company.
Having subscriber friends has a positive effect in increasing the odds of becoming a fee customer. So we want to offer group packages and referral schemes to get more users excited about the paid subscription.
As the tenure increases the likelihood of converting to a paid customer reduces, so we want to target the user with ads more during the initial phase of usage.
Songs listened, Loved tracks and playlists all have positive odds ratio and hence to capitalise on it we want to increase the number of songs offered, maybe expand with regional songs to improve user engagement.
If the user is from a good country the odds ratio of him converting is lesser because there are more competitive players in the market and we want to personalise the app more in such countries.
Friend Country count has a positive effect on the odds ratio so we can suggest people from other countries as friends in the recommendation to enhace chances.