For many full-time music artists, high chart positions are their meal ticket: they need a prominent presence in the industry in order to make money, and chart positions are a clear way of showing just how prominent they are. At the end of each year, Spotify compiles a playlist of the songs streamed most often over the course of that year. This year’s playlist included 100 songs. The question is: What do these top songs have in common? Why do people like them?
Music, one can argue, is an art form composed and performed for many purposes, ranging from aesthetic pleasure and ceremonial use to an entertainment product made for mass consumption. Prior to the invention of sound recordings, people would buy sheet music and play it at home on a piano for enjoyment. In modern times, music can be easily enjoyed through online streaming services such as Spotify, Apple Music, YouTube Music, and Pandora.
With that being said, a song's popularity can be measured by its chart position. For many musicians, getting their songs to chart high is of paramount importance in order to garner a prominent presence in the music industry and, above all, to make money.
So now we ask: What do all these hit songs have in common? Are there certain patterns or characteristics that determine a hit song? Which predictors are significant and influential?
We acquired data using Spotify's API for songs released in 2017. Thanks to Spotify's acquisition of Echo Nest, a music intelligence and data platform, the API allows users to access a song's audio features. These features include attributes such as a song's acousticness, loudness, tempo, and energy.
We pulled a total of 7,429 records. Below is a detailed description of each of the features we used:
Features | Description |
---|---|
acousticness | A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic. |
danceability | Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable. |
duration_ms | The duration of the track in milliseconds. |
energy | Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy. |
instrumentalness | Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0. |
liveness | Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live. |
loudness | The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing the relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typically range between -60 and 0 dB. |
speechiness | Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks. |
tempo | The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration. |
valence | A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry). |
artist_popularity | The popularity of the artist. The value will be between 0 and 100, with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks. |
Using this data, we labeled each song as a hit (1) or not (0), marking it as a hit if it satisfied either of these rules: it appears in one of Spotify's public top-tracks playlists for 2017, or it appeared on the Billboard Hot 100 chart at some point during 2017.
Python was used to pull data from Spotify's API, primarily via a package called Spotipy (https://github.com/plamere/spotipy). Spotipy is a thin client library that handles the OAuth authorization flow (using a client id and client secret obtained after signing up for a Spotify web developer account) and provides access to all of the music data in JSON format.
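As a minimal setup sketch (the client id and secret below are placeholders that need to be replaced with your own credentials), the client used by the snippets that follow can be created with Spotipy's client-credentials flow:
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# Client-credentials flow is enough for reading public catalog data
credentials = SpotifyClientCredentials(client_id="YOUR_CLIENT_ID", client_secret="YOUR_CLIENT_SECRET")
sp = spotipy.Spotify(client_credentials_manager=credentials)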
Here's a code snippet for getting all the songs released in 2017 in the US:
import pandas as pd

def get_songs_by_year_released(client: spotipy.Spotify, year: int) -> pd.DataFrame:
    ''' returns a pandas dataframe of songs released in the given year
    Parameters:
    - client: spotipy wrapper client
    - year: the year in which songs were released
    '''
    result = []
    id_set = set()
    for i in range(200):
        # The search endpoint returns at most 50 items per request, so page through with an offset
        search_results = client.search(q="year:{}".format(year), type="track", market="US", limit=50, offset=50*i)
        for item in search_results.get("tracks").get("items"):
            artist = item.get("artists")[0]
            artist_name = artist.get("name")
            artist_id = artist.get("id")
            artist_popularity = client.artist(artist_id).get("popularity")
            track = item.get("name")
            track_id = item.get("id")
            if track_id not in id_set:  # skip duplicate tracks returned across pages
                t = {"aid": artist_id, "artist": artist_name, "artist_popularity": artist_popularity, "track": track, "id": track_id}
                result.append(t)
                id_set.add(track_id)
    return pd.DataFrame(result)
Spotify's search endpoint, which lets users retrieve Spotify Catalog information about artists, albums, tracks, or playlists matching a keyword string, only returns 50 results per query. Hence, we had to provide an offset value for every query, making a total of 200 requests. This was intended to collect 10,000 records; however, duplicate songs in the result sets brought the number down to 7,429.
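Pulling the data then looks roughly like this (a sketch, reusing the sp client created earlier):
songs_df = get_songs_by_year_released(sp, 2017)
print(len(songs_df))  # 7429 unique tracks after de-duplication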
Then, for the songs we pulled, we annotated each record with its corresponding audio features:
def get_songs_features(client: spotipy.Spotify, df: pd.DataFrame) -> pd.DataFrame:
    ''' returns a pandas dataframe of song features for the given songs
    Parameters:
    - client: spotipy wrapper client
    - df: pandas dataframe of songs from get_songs_by_year_released
    '''
    result = []
    songs_id = df["id"]
    # Request the track ids in batches of 50 (the endpoint caps the number of ids per call)
    for i in range(0, len(songs_id), 50):
        features = client.audio_features(songs_id[i:i+50])
        for s in features:
            if s is not None:
                result.append(s)
    features = pd.DataFrame(result)
    return pd.merge(features, df, on='id')
Next, we marked each record to indicate whether or not it was a hit. Here, we also used another Python package called billboard.py (https://github.com/guoguo12/billboard-charts), which provides access to the music charts on Billboard.com.
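As a quick illustration of billboard.py (the date below is just one of the monthly chart dates we query), a single chart can be fetched and inspected like so:
import billboard

chart = billboard.ChartData('hot-100', date='2017-06-01')
print(chart[0].title, '-', chart[0].artist)  # the chart's top entry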
def get_top_tracks_of_2017(client: spotipy.Spotify, df: pd.DataFrame, playlist_ids: list) -> pd.DataFrame:
    ''' returns an updated pandas dataframe of song features with a new hit column:
    0 - not a hit
    1 - a hit (appears in one of the given top playlists or on the 2017 Billboard Hot 100)
    Parameters:
    - client: spotipy wrapper client
    - df: pandas dataframe of song features from get_songs_features
    - playlist_ids: list of public playlist ids of the top songs
    '''
    df["hit"] = 0
    ## First, query the Spotify public playlists
    result = set()
    for pid in playlist_ids:
        uri = "spotify:user:spotifycharts:playlist:{}".format(pid)
        username = uri.split(':')[2]
        playlist_id = uri.split(':')[4]
        playlist_results = client.user_playlist(username, playlist_id)
        hit_songs = set([item.get("track").get("id") for item in playlist_results.get("tracks").get("items")])
        result = result.union(hit_songs)  # running union of all playlist hit ids
        df.loc[df["id"].isin(hit_songs), "hit"] = 1
    ## Second, query the Billboard data (the first Hot 100 chart of every month in 2017)
    date_template = "2017-{:02d}-01"
    songs = set()
    for i in range(1, 13):
        print(date_template.format(i))  # progress indicator
        chart = billboard.ChartData('hot-100', date = date_template.format(i))
        to_add = set([c.title for c in chart])
        songs = songs.union(to_add)
    df.loc[df["track"].isin(songs), "hit"] = 1
    return df
Here is the resulting data set after selecting the columns we needed and filtering out records that weren't really songs (i.e. records on Spotify such as sleep sounds, white noise, and instrumental tracks with no vocals).
data = pd.read_csv("song_features_2017.csv")
# Filter out records that aren't really songs: zero tempo, mostly spoken word,
# live recordings, and (near-)instrumental tracks
data.drop(data.loc[data["tempo"] == 0, "track"].index, inplace = True)
data.drop(data.loc[data["speechiness"] >= 0.66, "track"].index, inplace = True)
data.drop(data.loc[data["liveness"] >= 0.8, "track"].index, inplace = True)
data.drop(data.loc[data["instrumentalness"] >= 0.8, "track"].index, inplace = True)
df = data[["acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity", "hit"]]
df.head(10)
Since this is a binary classification problem, we will use the following methods: logistic regression, Gaussian Naive Bayes, decision trees, and random forests.
For evaluating the models, we also need to take into consideration that our data has a class imbalance: 287 hits versus 6,358 non-hits.
Hence, we cannot use accuracy as an evaluation measure, since a model could simply predict everything as a non-hit and still achieve roughly a 95% accuracy rate.
Thus, we will use AUC: because it is a function of sensitivity and specificity, the ROC curve is insensitive to disparities in the class proportions. Alongside it, we will report other metrics such as precision, recall, and F1-score.
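To make the imbalance concrete, here is a quick sketch of the accuracy a trivial majority-class baseline would get with our counts:
n_hits, n_non_hits = 287, 6358
baseline_accuracy = n_non_hits / (n_hits + n_non_hits)
print(round(baseline_accuracy, 3))  # ~0.957: always predicting "non-hit" already looks very accurate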
First, we separate the data, with X as the independent variables and y as the dependent variable.
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
# Split training data and validating data (70-30 split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
logisticRegr = LogisticRegression(solver='liblinear', random_state=10)
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
From the results, we can already see that this is not a good predictive model: it did not predict a single song to be a hit, and classified essentially every song as a non-hit.
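For readability, the confusion matrix can be labeled explicitly (sklearn orders the classes False then True, with rows as the actual class and columns as the predicted class):
# Label the raw confusion matrix so it's clear every track ended up in the "predicted: non-hit" column
pd.DataFrame(cnf_matrix,
             index=["actual: non-hit", "actual: hit"],
             columns=["predicted: non-hit", "predicted: hit"])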
If we look at the significance of our model's predictors, we get the following:
logit = sm.Logit(y, X)
logit.fit().summary()
We see that valence and danceability are not statistically significant at the 0.05 alpha level. If we re-run the model without those two predictors, we get the following:
predictors = ["acousticness", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
# Split training data and validating data (70-30 split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
logisticRegr = LogisticRegression(solver='liblinear', random_state=10)
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We still get a bad result (precision, recall, and F1-score are all at 0), but as we noted earlier, this is due to the imbalanced class sample, so our model needs pre-processing, by means of class weighting (both sklearn's built-in balanced weights and weights tuned via grid search) and re-sampling with SMOTE (both with default settings and with a tuned sampling ratio):
## Method 1: Putting weights using sklearn's class_weight parameter
logisticRegr = LogisticRegression(solver='liblinear', random_state=10, class_weight='balanced')
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
Now we can see some progress: 63 hit songs were correctly predicted, but our precision and F1-score are relatively low, at 0.10 and 0.17 respectively.
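For intuition, the 'balanced' option sets each class weight to n_samples / (n_classes * class_count), i.e. inversely proportional to class frequency. A sketch using the overall counts from above (sklearn computes the weights on the training labels):
import numpy as np

counts = np.array([6358, 287])         # [non-hits, hits]
weights = counts.sum() / (2 * counts)  # the 'balanced' heuristic
print(weights)                         # roughly [0.52, 11.58]: errors on hits cost ~22x more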
We can see the model's AUC like so:
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
The AUC is reported to be 0.72, so this is somewhat of an improvement over our previous model without pre-processing.
Another way of assigning weights is to do a manual grid search over varying weights and see which weights correspond to the highest F1-score.
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=LogisticRegression(solver='liblinear', random_state=10),
param_grid={
'class_weight': [{0: x, 1: 1.0-x} for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(x_train, y_train)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
We get the best class weight parameters to be roughly 0.098 for class 0 and 0.903 for class 1.
## Method 2: Putting calculated weights using grid search
logisticRegr = LogisticRegression(solver='liblinear', random_state=10, **grid_result.best_params_)
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We do get a better model compared to sklearn's balanced class weight parameter as our precision score has now improved to 0.17, f1-score to 0.21, and AUC to 0.73.
Next, we can try a re-sampling technique known as SMOTE: Synthetic Minority Over-sampling Technique. SMOTE is a very common sampling technique that synthesises new minority instances between existing (real) minority instances.
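Conceptually, each synthetic sample is an interpolation between a real minority sample and one of its minority-class nearest neighbors. A minimal sketch of that idea on toy values (not imblearn's actual implementation):
import numpy as np

rng = np.random.default_rng(10)
x_i = np.array([0.4, 0.7])        # a real minority sample (two features, toy values)
x_nn = np.array([0.5, 0.9])       # one of its minority-class nearest neighbors
lam = rng.uniform(0, 1)           # random interpolation factor
x_new = x_i + lam * (x_nn - x_i)  # synthetic sample lies on the segment between the two
print(x_new)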
## Method 3: SMOTE
pipe = make_pipeline(
SMOTE(),
LogisticRegression(solver='liblinear', random_state=10)
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We get results very similar to the manual weighting method, except that precision has dropped back to 0.1 and the F1-score to 0.17 (recall has improved to 0.66). So now, what if we combine the two methods in order to find the optimal ratio for resampling the data?
pipe = make_pipeline(
SMOTE(),
LogisticRegression(solver='liblinear', random_state=10)
)
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=pipe,
param_grid={
"smote__sampling_strategy": [x for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(X, y)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
## Method 4: SMOTE + optimal ratio
pipe = make_pipeline(
SMOTE(sampling_strategy=0.381578947368421),
LogisticRegression(solver='liblinear', random_state=10)
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
At this point, this is about the extent to which our logistic regression model can be improved: precision is at 0.17 (the highest so far) and the F1-score at 0.22 (also the highest so far).
Next, we try Gaussian Naive Bayes. First, we separate the data, with X as the independent variables and y as the dependent variable.
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
# Split training data and validating data (70-30 split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
gnb = GaussianNB()
gnb.fit(x_train, y_train)
y_test_pred = gnb.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = gnb.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Already, using Gaussian Naive Bayes, we get a result comparable to the pre-processed logistic models (albeit with a lower recall score). The AUC is at 0.76 as well, the highest score we've achieved so far.
Next, we can implement the same pre-processing techniques we used before.
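By default, GaussianNB estimates the class priors from the training-class frequencies, so the grid search below is essentially overriding a very skewed prior. A quick sketch of what that default prior looks like for our data:
empirical_prior_hit = 287 / (287 + 6358)
print(round(empirical_prior_hit, 3))  # ~0.043: by default the model starts out heavily favoring "non-hit"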
# Method 1: Using optimal priors (Prior probabilities of the classes) with grid search
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=GaussianNB(),
param_grid={
'priors': [[x, 1.0-x] for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(x_train, y_train)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
gnb = GaussianNB(**grid_result.best_params_)
gnb.fit(x_train, y_train)
y_test_pred = gnb.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = gnb.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
The precision score dropped to 0.14, albeit with an increase in recall. F1 remains about the same, hovering around 0.2, and the AUC stays at 0.76 as well.
Now, if we apply the SMOTE method with an optimal re-sampling ratio:
## Method 2: SMOTE + optimal ratio
pipe = make_pipeline(
SMOTE(),
GaussianNB()
)
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=pipe,
param_grid={
"smote__sampling_strategy": [x for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(X, y)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
pipe = make_pipeline(
SMOTE(sampling_strategy=0.05),
GaussianNB()
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
The SMOTE method provided a minimal improvement in precision, but lower recall and F1-scores.
Next, we try decision trees. First, we separate the data, with X as the independent variables and y as the dependent variable.
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
decTree = DecisionTreeClassifier(random_state = 10)
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Using decision trees, we get results very similar to the two other models we used before (Gaussian Naive Bayes and logistic regression). However, notice that our AUC score is at its lowest so far (0.579).
Next, we can implement the same pre-processing techniques we used before.
Using decision trees, we can attempt to further improve our model by selecting features that provide a good impurity reduction (impurity is the measure on which the locally optimal split condition is chosen).
Thus, when training a tree, we can compute how much each feature decreases the weighted impurity across the tree. For a forest, the impurity decrease from each feature can be averaged over the trees and the features ranked according to this measure.
For sklearn's decision trees, the impurity measure defaults to Gini impurity.
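For reference, the Gini impurity of a node with class proportions p_k is 1 - sum(p_k^2), so a pure node scores 0 and a 50/50 binary node scores 0.5. A quick sketch using our overall class balance:
import numpy as np

def gini(p):
    # Gini impurity for a vector of class proportions
    p = np.asarray(p)
    return 1.0 - np.sum(p ** 2)

print(gini([6358 / 6645, 287 / 6645]))  # ~0.083: a node with our overall class mix is already quite "pure"
print(gini([0.5, 0.5]))                 # 0.5: the most impure a two-class node can be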
feature_imp = pd.Series(decTree.feature_importances_,index=predictors).sort_values(ascending=False)
feature_imp
Here, we see that instrumentalness has the lowest importance score (it contributes the least impurity reduction), so we can attempt to remove it and train our model again.
predictors = ["acousticness", "danceability", "energy", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
decTree = DecisionTreeClassifier(random_state = 10)
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We get a slight improvement from the feature reduction.
## Method 1: Putting weights using sklearn's class_weight parameter
decTree = DecisionTreeClassifier(random_state=10, class_weight='balanced')
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Using sklearn's class weight parameter, we get an even higher F1-score of 0.25. Our AUC score has also slightly increased, to 0.615.
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=DecisionTreeClassifier(random_state=10),
param_grid={
'class_weight': [{0: x, 1: 1.0-x} for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(x_train, y_train)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
## Method 2: Putting calculated weights using grid search
decTree = DecisionTreeClassifier(random_state=10, **grid_result.best_params_)
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We get pretty much the same result as Method 1 (arguably slightly worse).
## Method 3: SMOTE
pipe = make_pipeline(
SMOTE(),
DecisionTreeClassifier(random_state=10)
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Using the SMOTE method, we get a higher AUC score (0.62) and a higher recall score, but our precision has taken a hit (0.12).
## Method 4: SMOTE + optimal ratio
pipe = make_pipeline(
SMOTE(),
DecisionTreeClassifier(random_state=10)
)
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=pipe,
param_grid={
"smote__sampling_strategy": [x for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(X, y)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
pipe = make_pipeline(
SMOTE(sampling_strategy=0.19210526315789472),
DecisionTreeClassifier(random_state=10)
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Method 4 provides us a model that is very similar to all the other pre-processed decision tree models.
Finally, we try random forests. First, we separate the data, with X as the independent variables and y as the dependent variable.
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Now we're getting somewhere! Our precision score has increased vastly, to 0.83, albeit with low recall and F1-scores.
The AUC is at 0.764, which is a good indication that the model performs well.
Very similar to decision trees, we can conduct feature reduction using the same method based on impurity scores.
In fact, a random forest is simply an ensemble of decision trees (in our case, 100 trees). Every node in each decision tree is a condition on a single feature, designed to split the dataset in two so that similar response values end up in the same set.
For sklearn's random forests, our impurity measure is defaulted to Gini impurity.
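The importances reported by sklearn are (approximately) just each tree's impurity-based importances averaged across the forest; a quick sketch verifying that, assuming the fitted rfTree and predictors from above:
import numpy as np

# Average the per-tree, impurity-based importances over the 100 trees in the forest
tree_avg = np.mean([t.feature_importances_ for t in rfTree.estimators_], axis=0)
pd.Series(tree_avg, index=predictors).sort_values(ascending=False)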
feature_imp = pd.Series(rfTree.feature_importances_,index=predictors).sort_values(ascending=False)
feature_imp
Here, we see that instrumentalness again has the lowest score, thus we can remove it and train our model again.
predictors = ["acousticness", "danceability", "energy", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Removing instrumentalness as a predictor gave us a precision score of 1.00! The AUC score remains roughly the same, at 0.759.
Next, we can attempt to provide weights and re-sample using SMOTE.
## Method 1: Putting weights using sklearn's class_weight parameter
rfTree=RandomForestClassifier(n_estimators=100, random_state=20, class_weight="balanced")
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We get very similar results, with the precision score at 1.0 again. The AUC has marginally improved to 0.76.
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=RandomForestClassifier(n_estimators=100, random_state=20),
param_grid={
'class_weight': [{0: x, 1: 1.0-x} for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(x_train, y_train)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
## Method 2: Putting calculated weights using grid search
rfTree=RandomForestClassifier(n_estimators=100, random_state=20, **grid_result.best_params_)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Method 2 did not provide an improvement and performed worse than Method 1.
## Method 3: SMOTE
pipe = make_pipeline(
SMOTE(),
RandomForestClassifier(n_estimators=100, random_state=20)
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Using the SMOTE method, we seem to get results similar to our previous ML techniques.
## Method 4: SMOTE + optimal ratio
pipe = make_pipeline(
SMOTE(),
RandomForestClassifier(n_estimators=100, random_state=20)
)
weights = np.linspace(0.05, 0.95, 20)
gsc = GridSearchCV(
estimator=pipe,
param_grid={
"smote__sampling_strategy": [x for x in weights]
},
scoring='f1',
cv=3
)
grid_result = gsc.fit(X, y)
print("Best parameters : %s" % grid_result.best_params_)
# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
'weight': weights })
dataz.plot(x='weight')
pipe = make_pipeline(
SMOTE(sampling_strategy=0.14473684210526316),
RandomForestClassifier(n_estimators=100, random_state=20)
)
# Fit..
pipe.fit(x_train, y_train)
# Predict..
y_pred = pipe.predict(x_test)
# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
Method 4 did provide better results than Method 3.
Method | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|
Logistic Regression (NO pre-processing) | 0 | 0 | 0 | 0.732 |
Logistic Regression (Balanced Class Weights) | 0.1 | 0.65 | 0.17 | 0.727 |
Logistic Regression (Calculated Weights) | 0.17 | 0.27 | 0.21 | 0.734 |
Logistic Regression (SMOTE) | 0.1 | 0.67 | 0.18 | 0.734 |
Logistic Regression (SMOTE + optimal ratio) | 0.17 | 0.31 | 0.22 | 0.729 |
Gaussian Naive Bayes (NO pre-processing) | 0.27 | 0.17 | 0.21 | 0.765 |
Gaussian Naive Bayes (Calculated Priors) | 0.14 | 0.47 | 0.21 | 0.765 |
Gaussian Naive Bayes (SMOTE + optimal ratio) | 0.26 | 0.11 | 0.16 | 0.769 |
Decision Trees (No pre-processing) | 0.16 | 0.21 | 0.18 | 0.579 |
Decision Trees (After feature-reduction) | 0.19 | 0.25 | 0.22 | 0.600 |
Decision Trees (Balanced Class Weights) | 0.23 | 0.27 | 0.25 | 0.615 |
Decision Trees (Calculated Weights) | 0.17 | 0.2 | 0.18 | 0.578 |
Decision Trees (SMOTE) | 0.12 | 0.36 | 0.19 | 0.625 |
Decision Trees (SMOTE + optimal ratio) | 0.15 | 0.29 | 0.20 | 0.609 |
Random Forests (No pre-processing) | 0.83 | 0.13 | 0.22 | 0.764 |
Random Forests (After feature-reduction) | 1.00 | 0.12 | 0.21 | 0.759 |
Random Forests (Balanced Class Weights) | 1.00 | 0.10 | 0.19 | 0.761 |
Random Forests (Calculated Weights) | 0.80 | 0.10 | 0.18 | 0.716 |
Random Forests (SMOTE) | 0.18 | 0.28 | 0.22 | 0.736 |
Random Forests (SMOTE + optimal ratio) | 0.56 | 0.18 | 0.27 | 0.754 |
Above, we've compiled all the models' performance results in predicting hit songs.
For logistic regression, it seems like the pre-processed methods yielded better performance (albeit all similar) compared to the un-preprocessed method.
For Gaussian Naive Bayes, it seems like we weren't able to improve on the benchmark set by the un-preprocessed method. Although the calculated priors did yield a better recall score, they performed worse in terms of precision. The SMOTE + optimal ratio method did objectively worse than its Gaussian Naive Bayes counterparts.
For decision trees, this technique was the weakest among all the techniques we tried. However, we were able to beat the benchmark performance set by the un-preprocessed decision tree method after balancing the class weights.
For random forests, this technique overall performed the best among all the techniques we tried. We were able to roughly match the benchmark set by the un-preprocessed method with both feature reduction and balanced class weights.
If we run the un-preprocessed random forest model, the feature-reduced random forest model, and the balanced-class-weights random forest model on the entire data set, we get the following:
# Un-preprocessed
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(X)
cnf_matrix = metrics.confusion_matrix(y, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y, y_pred))
y_pred_proba = rfTree.predict_proba(X)[::,1]
fpr, tpr, _ = metrics.roc_curve(y, y_pred_proba)
auc = metrics.roc_auc_score(y, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We correctly predict 196 hit songs, with a precision score of 0.99, recall of 0.68, F1-score of 0.81, and AUC of 0.931.
# Feature-reduced
predictors = ["acousticness", "danceability", "energy", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(X)
cnf_matrix = metrics.confusion_matrix(y, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y, y_pred))
y_pred_proba = rfTree.predict_proba(X)[::,1]
fpr, tpr, _ = metrics.roc_curve(y, y_pred_proba)
auc = metrics.roc_auc_score(y, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We correctly predict 210 hit songs, with a precision score of 0.99, recall of 0.73, F1-score of 0.84, and AUC of 0.953.
rfTree=RandomForestClassifier(n_estimators=100, random_state=20, class_weight="balanced")
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(X)
cnf_matrix = metrics.confusion_matrix(y, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y, y_pred))
y_pred_proba = rfTree.predict_proba(X)[::,1]
fpr, tpr, _ = metrics.roc_curve(y, y_pred_proba)
auc = metrics.roc_auc_score(y, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()
We correctly predict 214 hit songs, with a precision score of 0.97, recall of 0.75, F1-score of 0.84, and AUC of 0.951.
Here are the compiled results:
Method | Precision | Recall | F1-Score | AUC |
---|---|---|---|---|
Random Forests (Un-preprocessed) | 0.99 | 0.68 | 0.81 | 0.931 |
Random Forests (After feature reduction) | 0.99 | 0.73 | 0.84 | 0.953 |
Random Forests (Balanced Class Weights) | 0.97 | 0.75 | 0.84 | 0.951 |
Overall, the two pre-processed methods did beat the benchmark set by the un-preprocessed method, with higher recall scores!
Out of curiosity, here are all the songs that the two pre-processed methods predicted to be hits:
print(data.loc[(data.RF1_pred) & (data.RF2_pred), ["track", "artist"]].to_string())