Predicting Chartbusters using ML Techniques (Title WIP)

For many full-time music artists, getting high chart positions is their meal ticket; they need to have a prominent presence in the industry in order to make money and chart positions are a clear way of showing just how prominent they are. At the end of each year, Spotify compiles a playlist of the songs streamed most often over the course of that year. This year’s playlist included 100 songs. The question is: What do these top songs have in common? Why do people like them?

Introduction

Music, as one can claim, is a art-form composed and performed for many purposes ranging from aesthetic pleasure, ceremonial purposes, or to an entertainment product used for mass comsumption. Prior to the invention of sound recordings, people would actually buy sheet music and play at home on a piano for enjoyment. Now in modern times, music can be easily enjoyed through online streaming services such as Spotify, Apple Music, YouTube Music, and Pandora etc.

With that being said, a song's popularity can be measured by its chart position. For many musicians, getting their songs to chart high is of paramount importance in order garner a prominent presense in the music industry, and above all, to make money.

So now we ask: What do all these hit songs have in common? Is there a certain pattern or characteristics to determine a hit song? What are the significant and influential predictors?

Executive Summary

Business Implications

Data Summary & Description

We acquired data using Spotify's API for songs released in 2017. With Spotify's acquisition of Echo Nest, a music intelligence and data platform, their API allows users to access a song's audio features. These features includes attributes such as a song's level of acousticess, loudness, tempo and energy etc.

We pulled a total number of 7429 records, each categorized by 10 features.

Below is a detailed description of each of the features we used:

Features Description
acousticness A confidence measure from 0.0 to 1.0 of whether the track is acoustic. 1.0 represents high confidence the track is acoustic.
danceability Danceability describes how suitable a track is for dancing based on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity. A value of 0.0 is least danceable and 1.0 is most danceable.
duration_ms The duration of the track in milliseconds.
energy Energy is a measure from 0.0 to 1.0 and represents a perceptual measure of intensity and activity. Typically, energetic tracks feel fast, loud, and noisy. For example, death metal has high energy, while a Bach prelude scores low on the scale. Perceptual features contributing to this attribute include dynamic range, perceived loudness, timbre, onset rate, and general entropy.
instrumentalness Predicts whether a track contains no vocals. “Ooh” and “aah” sounds are treated as instrumental in this context. Rap or spoken word tracks are clearly “vocal”. The closer the instrumentalness value is to 1.0, the greater likelihood the track contains no vocal content. Values above 0.5 are intended to represent instrumental tracks, but confidence is higher as the value approaches 1.0.
liveness Detects the presence of an audience in the recording. Higher liveness values represent an increased probability that the track was performed live. A value above 0.8 provides strong likelihood that the track is live.
loudness The overall loudness of a track in decibels (dB). Loudness values are averaged across the entire track and are useful for comparing relative loudness of tracks. Loudness is the quality of a sound that is the primary psychological correlate of physical strength (amplitude). Values typical range between -60 and 0 db.
speechiness Speechiness detects the presence of spoken words in a track. The more exclusively speech-like the recording (e.g. talk show, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
tempo The overall estimated tempo of a track in beats per minute (BPM). In musical terminology, tempo is the speed or pace of a given piece and derives directly from the average beat duration.
valence A measure from 0.0 to 1.0 describing the musical positiveness conveyed by a track. Tracks with high valence sound more positive (e.g. happy, cheerful, euphoric), while tracks with low valence sound more negative (e.g. sad, depressed, angry).
artist_popularity The popularity of the artist. The value will be between 0 and 100, with 100 being the most popular. The artist’s popularity is calculated from the popularity of all the artist’s tracks.

Using this data, we mapped each song to whether or not it was a hit if it satisfies either of these rules:

  • The song was in the Spotify's Top Tracks of 2017 playlist.
  • The song was in the Spotify's Top Artists of 2017 playlist.
  • The song was in the Top tracks of 2017: USA playlist.
  • The song charted at least once in the Billboard's Hot-100 playlist throughout 2017.

Data Acquisition

Python was used to pull data from Spotify's API, primarily utilizing a package called Spotipy (https://github.com/plamere/spotipy). Spotipy is a thin client library that handles the OAuth authorization flow (using a client id and client secret one needs to provide after signing up for a web developer account at Spotify) and access to all of the music data provided in JSON format.

Here's a code snippet for getting all the songs released in 2017 in the US:

In [ ]:
def get_songs_by_year_released(client: spotipy.Spotify, year: int) -> pd.DataFrame:
    ''' returns a pandas dataframe of songs of given year

        Parameters:
            - client: spotipy wrapper client
            - year: the year in which songs were released
    '''
    result = []
    id_set = set()
    for i in range(200):
        search_results = client.search(q="year:{}".format(year), type="track", market="US", limit=50, offset=50*i)
        for item in search_results.get("tracks").get("items"):
            artist = item.get("artists")[0]
            artist_name = artist.get("name")
            artist_id = artist.get("id")
            artist_popularity = client.artist(artist_id).get("popularity")
            track = item.get("name")
            track_id = item.get("id")
            if track_id not in id_set:
                t = {"aid": artist_id, "artist": artist_name, "artist_popularity": artist_popularity, "track": track, "id": track_id}
                result.append(t)
                id_set.add(track_id)
    return pd.DataFrame(result)

Spotify's search endpoint that allows users to get Spotify Catalog information about artists, albums, tracks or playlists which match a keyword string only allows 50 results per query. Hence, I had to provide an offset value for every query, making a total number of 200 requests. This was primarily intended to get 10000 records, however there were duplicate songs provided in each result set, decreasing the number down to 7429.

Then, using the songs we pulled, we marked each of them with their corresponding song attributes:

In [ ]:
def get_songs_features(client: spotipy.Spotify, df: pd.DataFrame) -> pd.DataFrame:
    ''' returns a pandas dataframe of song features of given songs

        Parameters:
            - client: spotipy wrapper client
            - df: pandas dataframe of songs from get_songs_by_year_released
    '''
    result = []
    songs_id = df["id"]
    for i in range(200):
        features = client.audio_features(songs_id[50*i:50*(i+1)])
        for s in features:
            if s is not None:
                result.append(s)
    features = pd.DataFrame(result)
    return pd.merge(features, df, on='id')

Next, we marked each record indicating whether or not they were a hit. Here, we also used another python package called billboard.py (https://github.com/guoguo12/billboard-charts) which allows accessing music charts from Billboard.com.

In [ ]:
def get_top_tracks_of_2017(client: spotipy.Spotify, df: pd.DataFrame, playlist_ids: str) -> pd.DataFrame:
    ''' returns an updated pandas dataframe of song features with a new hit value:
        0 - not in the given top playlist
        1 - in the given top playlist

        Parameters:
            - client: spotipy wrapper client
            - df: pandas dataframe of song features from get_songs_features
            - playlist_ids: list of public playlist ids of the top songs
    '''
    df["hit"] = 0
    
    ## First query the Spotify public playlists
    result = set()
    for pid in playlist_ids:
        uri = "spotify:user:spotifycharts:playlist:{}".format(pid)
        username = uri.split(':')[2]
        playlist_id = uri.split(':')[4]
        playlist_results = client.user_playlist(username, playlist_id)
        hit_songs = set([item.get("track").get("id") for item in playlist_results.get("tracks").get("items")])
        result = result.union(hit_songs)
    df.loc[df["id"].isin(hit_songs), "hit"] = 1
    
    ## Second, query the Billboard data
    date_template = "2017-{:02d}-01"
    songs = set()
    for i in range(1, 13):
        print(date_template.format(i))
        chart = billboard.ChartData('hot-100', date = date_template.format(i))
        to_add = set([c.title for c in chart])
        songs = songs.union(to_add)
    df.loc[df["track"].isin(songs), "hit"] = 1
    return df

Here is the resulting data set after we've selected the columns we needed, and filtering out records that weren't songs (i.e. records on Spotify that are sleep sounds, white noise, instrumental music with no vocals).

In [3]:
data = pd.read_csv("song_features_2017.csv")
data.drop(data.loc[data["tempo"] == 0, "track"].index, inplace = True)
data.drop(data.loc[data["speechiness"] >= 0.66, "track"].index, inplace = True)
data.drop(data.loc[data["liveness"] >= 0.8, "track"].index, inplace = True)
data.drop(data.loc[data["instrumentalness"] >= 0.8, "track"].index, inplace = True)
df = data[["acousticness", "danceability", "duration_ms", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity", "hit"]]
df.head(10)
Out[3]:
acousticness danceability duration_ms energy instrumentalness liveness loudness speechiness tempo valence artist_popularity hit
0 0.469000 0.872 119133 0.391 0.000004 0.2970 -9.144 0.2420 134.021 0.437 98 1
1 0.017200 0.797 146520 0.533 0.000152 0.1030 -9.740 0.0412 131.036 0.329 98 0
2 0.847000 0.734 95467 0.570 0.000021 0.1120 -7.066 0.1330 129.953 0.689 98 1
3 0.737000 0.483 203569 0.412 0.000000 0.1160 -8.461 0.0402 170.163 0.247 93 0
4 0.149000 0.880 172800 0.428 0.000051 0.1140 -8.280 0.2060 100.007 0.333 91 0
5 0.002640 0.732 182707 0.750 0.000000 0.1090 -6.366 0.2310 155.096 0.401 89 1
6 0.138000 0.718 156000 0.767 0.000000 0.1140 -5.641 0.1660 160.084 0.519 85 0
7 0.493000 0.763 190472 0.572 0.001180 0.1060 -7.312 0.0595 151.930 0.250 73 0
8 0.199000 0.798 202547 0.539 0.000017 0.1650 -6.351 0.0421 136.949 0.394 96 1
9 0.000282 0.908 177000 0.621 0.000054 0.0958 -6.638 0.1020 150.011 0.421 91 1

Analysis

Since this is a binary classification problem, we will use the following the methods:

  • Logistic Regression
  • Gaussian Naive Bayes
  • Decision Tree
  • Random Forests

And for evaluating the model, we will need to take consideration that our model is has class imbalance: 287 hits and 6358 no-hits.

Hence, we cannot use Accurary as an evaluation measure since our model can simply predict all as no-hits have acheive a 95% accuracy rate.

Thus, we will use the AUC as it is a function of sensitivity and specificity, the curve is insensitive to disparities in the class proportions. Along with other metrics such as F1-score, precision and recall.

Technique 1: Logistic Regression

First, we separate the data indicating X as the independent variables, and y as the dependent variable.

In [9]:
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

Without pre-processing data:

In [10]:
# Split training data and validating data (70-30 split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
logisticRegr = LogisticRegression(solver='liblinear', random_state=10)
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred)) 
[[1903    0]
 [  91    0]]
              precision    recall  f1-score   support

       False       0.95      1.00      0.98      1903
        True       0.00      0.00      0.00        91

   micro avg       0.95      0.95      0.95      1994
   macro avg       0.48      0.50      0.49      1994
weighted avg       0.91      0.95      0.93      1994

From the results, we can already see that this model is not a good model for prediction as it did not predict any songs to be hits, and pretty much predicted all songs to be non-hits.

If we look at the significance of our model's predictors, we get the following:

In [15]:
logit = sm.Logit(y, X)
logit.fit().summary()
Optimization terminated successfully.
         Current function value: 0.163836
         Iterations 10
Out[15]:
Logit Regression Results
Dep. Variable: hit No. Observations: 6645
Model: Logit Df Residuals: 6635
Method: MLE Df Model: 9
Date: Sat, 01 Dec 2018 Pseudo R-squ.: 0.07933
Time: 21:18:27 Log-Likelihood: -1088.7
converged: True LL-Null: -1182.5
LLR p-value: 1.291e-35
coef std err z P>|z| [0.025 0.975]
acousticness -1.7869 0.311 -5.738 0.000 -2.397 -1.177
danceability -0.2411 0.434 -0.555 0.579 -1.092 0.610
energy -4.5888 0.419 -10.964 0.000 -5.409 -3.768
instrumentalness -2.8824 1.521 -1.895 0.058 -5.864 0.099
liveness -1.1263 0.564 -1.998 0.046 -2.231 -0.021
loudness 0.2883 0.031 9.351 0.000 0.228 0.349
speechiness 1.1299 0.542 2.086 0.037 0.068 2.192
tempo -0.0079 0.002 -3.781 0.000 -0.012 -0.004
valence 0.1865 0.314 0.594 0.552 -0.429 0.802
artist_popularity 0.0427 0.005 9.384 0.000 0.034 0.052

We see that valence and danceability are statistically not significant at alpha level 0.05. If we re-run the model without those two predictors, we get the following:

In [ ]:
predictors = ["acousticness", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

# Split training data and validating data (70-30 split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
In [24]:
logisticRegr = LogisticRegression(solver='liblinear', random_state=10)
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
[[1897    0]
 [  97    0]]
              precision    recall  f1-score   support

       False       0.95      1.00      0.98      1897
        True       0.00      0.00      0.00        97

   micro avg       0.95      0.95      0.95      1994
   macro avg       0.48      0.50      0.49      1994
weighted avg       0.91      0.95      0.93      1994

In [27]:
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We still get a bad result (as our precision, recall, f1-score is at 0), but as we noted prior, this is due to an inbalanced class sample, so our model needs pre-processing via means of:

  • Putting weights on errors proportional to the class imbalance.
  • Re-sample the data to balance the positive/negatives classes.

With pre-processing data:

In [28]:
## Method 1: Putting weights using skleaern's class_weight parameter

logisticRegr = LogisticRegression(solver='liblinear', random_state=10, class_weight='balanced')
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
[[1298  599]
 [  34   63]]
              precision    recall  f1-score   support

       False       0.97      0.68      0.80      1897
        True       0.10      0.65      0.17        97

   micro avg       0.68      0.68      0.68      1994
   macro avg       0.53      0.67      0.48      1994
weighted avg       0.93      0.68      0.77      1994

Now, we can see some progress as there were 63 correctly predicted hit songs, but our precision and f1-score is relatively low at 0.10 and 0.17 respectively.

We can see the model's AUC like so:

In [29]:
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

The AUC is reported to be at 72, so it's somewhat of an improved model to our previous un-preprocessed model.

Another method of providing weights can be achieved doing a manual grid search with varying weights to see which weights correspond to the highest f1-score.

In [31]:
weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=LogisticRegression(solver='liblinear', random_state=10),
    param_grid={
        'class_weight': [{0: x, 1: 1.0-x} for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(x_train, y_train)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'class_weight': {0: 0.09736842105263158, 1: 0.9026315789473685}}
Out[31]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd611de6240>

We get the best class weight parameters to be: 0.098 for class 0 and 0.903 for class 1

In [32]:
## Method 2: Putting calculated weights using grid search

logisticRegr = LogisticRegression(solver='liblinear', random_state=10, **grid_result.best_params_)
logisticRegr.fit(x_train, y_train)
y_test_pred = logisticRegr.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
[[1772  125]
 [  71   26]]
              precision    recall  f1-score   support

       False       0.96      0.93      0.95      1897
        True       0.17      0.27      0.21        97

   micro avg       0.90      0.90      0.90      1994
   macro avg       0.57      0.60      0.58      1994
weighted avg       0.92      0.90      0.91      1994

In [33]:
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We do get a better model compared to sklearn's balanced class weight parameter as our precision score has now improved to 0.17, f1-score to 0.21, and AUC to 0.73.

Next, we can try a re-sampling technique known as SMOTE: Synthetic Minority Over-sampling Technique. SMOTE is a very common sampling technique that synthesises new minority instances between existing (real) minority instances.

In [40]:
## Method 3: SMOTE

pipe = make_pipeline(
    SMOTE(),
    LogisticRegression(solver='liblinear', random_state=10)
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1323  574]
 [  32   65]]
              precision    recall  f1-score   support

       False       0.98      0.70      0.81      1897
        True       0.10      0.67      0.18        97

   micro avg       0.70      0.70      0.70      1994
   macro avg       0.54      0.68      0.50      1994
weighted avg       0.93      0.70      0.78      1994

In [36]:
y_pred_proba = logisticRegr.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We get very similar results to the manual weighting method except now precision has dropped back to .1 and f1-score to 0.17 (recall has improved to 66). So now.. what if we combine the two methods in order to finding the optimal ratio for resampling the data?

In [37]:
pipe = make_pipeline(
    SMOTE(),
    LogisticRegression(solver='liblinear', random_state=10)
)

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=pipe,
    param_grid={
        "smote__sampling_strategy": [x for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(X, y)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'smote__sampling_strategy': 0.381578947368421}
Out[37]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd611c5f8d0>
In [38]:
## Method 4: SMOTE + optimal ratio
pipe = make_pipeline(
    SMOTE(sampling_strategy=0.381578947368421),
    LogisticRegression(solver='liblinear', random_state=10)
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1748  149]
 [  67   30]]
              precision    recall  f1-score   support

       False       0.96      0.92      0.94      1897
        True       0.17      0.31      0.22        97

   micro avg       0.89      0.89      0.89      1994
   macro avg       0.57      0.62      0.58      1994
weighted avg       0.92      0.89      0.91      1994

In [39]:
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

At this point, all we can say is this is the extent our logistic model can improve in terms of performance. The precision is at 0.17 (highest so far) and f1-score at 0.22 (highest so far).

Technique 2: Gaussian Naive Bayes

First, we separate the data indicating X as the independent variables, and y as the dependent variable.

In [42]:
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

Without pre-processing data:

In [46]:
# Split training data and validating data (70-30 split)
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
gnb = GaussianNB()
gnb.fit(x_train, y_train)
y_test_pred = gnb.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred)) 
[[1866   40]
 [  73   15]]
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1906
        True       0.27      0.17      0.21        88

   micro avg       0.94      0.94      0.94      1994
   macro avg       0.62      0.57      0.59      1994
weighted avg       0.93      0.94      0.94      1994

In [47]:
y_pred_proba = gnb.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Already, using Gaussian Naive Bayes, we get a comparable result to the pre-processed Logistic Model (albeit a lower recall score). The AUC is at 0.76 as well, the highest score we've achieved so far.

Next, we can implement the same pre-processing techniques we used before.

With pre-processing data:

In [51]:
# Method 1: Using optimal priors (Prior probabilities of the classes) with grid search

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=GaussianNB(),
    param_grid={
        'priors': [[x, 1.0-x] for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(x_train, y_train)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'priors': [0.9026315789473683, 0.09736842105263166]}
Out[51]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd612171f98>
In [52]:
gnb = GaussianNB(**grid_result.best_params_)
gnb.fit(x_train, y_train)
y_test_pred = gnb.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred)) 
[[1649  257]
 [  47   41]]
              precision    recall  f1-score   support

       False       0.97      0.87      0.92      1906
        True       0.14      0.47      0.21        88

   micro avg       0.85      0.85      0.85      1994
   macro avg       0.55      0.67      0.56      1994
weighted avg       0.94      0.85      0.88      1994

In [53]:
y_pred_proba = gnb.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Precision scored dropped to 0.14, albeit increasing our recall score. F1 remains the same, hovering around .2. The AUC remains at 0.76 as well.

Now, if we do the SMOTE method with optimal priors:

In [54]:
## Method 2: SMOTE + optimal ratio
pipe = make_pipeline(
    SMOTE(),
    GaussianNB()
)

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=pipe,
    param_grid={
        "smote__sampling_strategy": [x for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(X, y)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'smote__sampling_strategy': 0.05}
Out[54]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd612115320>
In [55]:
pipe = make_pipeline(
    SMOTE(sampling_strategy=0.05),
    GaussianNB()
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1877   29]
 [  78   10]]
              precision    recall  f1-score   support

       False       0.96      0.98      0.97      1906
        True       0.26      0.11      0.16        88

   micro avg       0.95      0.95      0.95      1994
   macro avg       0.61      0.55      0.56      1994
weighted avg       0.93      0.95      0.94      1994

In [56]:
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

The SMOTE method provided minimal inprovement in precision, but lower recall and f1-scores.

Technique 3: Decision Trees

First, we separate the data indicating X as the independent variables, and y as the dependent variable.

In [86]:
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

Without pre-processing data:

In [87]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
decTree = DecisionTreeClassifier(random_state = 10)
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred)) 
[[1806   98]
 [  71   19]]
              precision    recall  f1-score   support

       False       0.96      0.95      0.96      1904
        True       0.16      0.21      0.18        90

   micro avg       0.92      0.92      0.92      1994
   macro avg       0.56      0.58      0.57      1994
weighted avg       0.93      0.92      0.92      1994

In [88]:
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Using Decision Trees, we get a very similar results to the two other models we used before (Naive Gaussian Bayes and Logistic Regression). However, notice that our AUC score is at its lowest (0.579)

Next, we can implement the same pre-processing techniques we used before.

With pre-processing data:

Using decision tress, we can attempt to further improve our model by actually selecting features that provides good impurity reduction (The measure based on which the (locally) optimal condition is chosen is called impurity).

Thus when training a tree, it can be computed how much each feature decreases the weighted impurity in a tree. For a forest, the impurity decrease from each feature can be averaged and the features are ranked according to this measure.

For sklearn's decision trees, our impurity measure is defaulted to Gini impurity.

In [89]:
feature_imp = pd.Series(decTree.feature_importances_,index=predictors).sort_values(ascending=False)
feature_imp
Out[89]:
loudness             0.143632
acousticness         0.137652
tempo                0.129540
danceability         0.119468
liveness             0.117228
speechiness          0.082234
energy               0.077908
artist_popularity    0.077029
valence              0.072830
instrumentalness     0.042480
dtype: float64

Here, we see that instrumentalneess has the highest impurity score, thus we can attempt to remove it and train our model again.

In [90]:
predictors = ["acousticness", "danceability", "energy", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
In [96]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)
decTree = DecisionTreeClassifier(random_state = 10)
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred)) 
[[1820   89]
 [  64   21]]
              precision    recall  f1-score   support

       False       0.97      0.95      0.96      1909
        True       0.19      0.25      0.22        85

   micro avg       0.92      0.92      0.92      1994
   macro avg       0.58      0.60      0.59      1994
weighted avg       0.93      0.92      0.93      1994

In [97]:
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We get a small but slight improvement from the feature reduction.

In [100]:
## Method 1: Putting weights using skleaern's class_weight parameter

decTree = DecisionTreeClassifier(random_state=10, class_weight='balanced')
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
[[1833   76]
 [  62   23]]
              precision    recall  f1-score   support

       False       0.97      0.96      0.96      1909
        True       0.23      0.27      0.25        85

   micro avg       0.93      0.93      0.93      1994
   macro avg       0.60      0.62      0.61      1994
weighted avg       0.94      0.93      0.93      1994

In [101]:
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Using sklearn's class weight parameter, we get an even higher f1-score at 0.25. Our AUC score has also slightly increased to (0.615)

In [102]:
weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=DecisionTreeClassifier(random_state=10),
    param_grid={
        'class_weight': [{0: x, 1: 1.0-x} for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(x_train, y_train)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'class_weight': {0: 0.23947368421052628, 1: 0.7605263157894737}}
Out[102]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd611a03c88>
In [104]:
## Method 2: Putting calculated weights using grid search

decTree = DecisionTreeClassifier(random_state=10, **grid_result.best_params_)
decTree.fit(x_train, y_train)
y_test_pred = decTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_test_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_test_pred))
[[1826   83]
 [  68   17]]
              precision    recall  f1-score   support

       False       0.96      0.96      0.96      1909
        True       0.17      0.20      0.18        85

   micro avg       0.92      0.92      0.92      1994
   macro avg       0.57      0.58      0.57      1994
weighted avg       0.93      0.92      0.93      1994

In [105]:
y_pred_proba = decTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We pretty much get a similar result as Method 1 (might even be worse).

In [106]:
## Method 3: SMOTE

pipe = make_pipeline(
    SMOTE(),
    DecisionTreeClassifier(random_state=10)
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1690  219]
 [  54   31]]
              precision    recall  f1-score   support

       False       0.97      0.89      0.93      1909
        True       0.12      0.36      0.19        85

   micro avg       0.86      0.86      0.86      1994
   macro avg       0.55      0.62      0.56      1994
weighted avg       0.93      0.86      0.89      1994

In [107]:
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Using the SMOTE method, we get a higher AUC score (0.62) and seems to have a higher recall score. But our precision rate has taken a hit (0.12).

In [108]:
## Method 4: SMOTE + optimal ratio

pipe = make_pipeline(
    SMOTE(),
    DecisionTreeClassifier(random_state=10)
)

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=pipe,
    param_grid={
        "smote__sampling_strategy": [x for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(X, y)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'smote__sampling_strategy': 0.19210526315789472}
Out[108]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd6118a6ef0>
In [109]:
pipe = make_pipeline(
    SMOTE(sampling_strategy=0.19210526315789472),
    DecisionTreeClassifier(random_state=10)
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1766  143]
 [  60   25]]
              precision    recall  f1-score   support

       False       0.97      0.93      0.95      1909
        True       0.15      0.29      0.20        85

   micro avg       0.90      0.90      0.90      1994
   macro avg       0.56      0.61      0.57      1994
weighted avg       0.93      0.90      0.91      1994

In [110]:
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Method 4 provdies us a model that is very similar to all the pre-processed Decision Tree models.

Technique 4: Random Forests

First, we separate the data indicating X as the independent variables, and y as the dependent variable.

In [163]:
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

Without pre-processing:

In [119]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1913    2]
 [  69   10]]
              precision    recall  f1-score   support

       False       0.97      1.00      0.98      1915
        True       0.83      0.13      0.22        79

   micro avg       0.96      0.96      0.96      1994
   macro avg       0.90      0.56      0.60      1994
weighted avg       0.96      0.96      0.95      1994

In [120]:
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Now we're getting somewhere! We see that our precision sore has increased vastly to 0.83, albeit our low recall score and f1-score.

The AUC is at 0.764 which signals a good indicator that our model performs well.

With pre-processing:

Very similar to decision trees, we can conduct feature reduction using the same method based on impurity scores.

In fact Random forest is simply of a number of decision trees (in our case 100 trees). Every node in the decision trees is a conditional on a single feature, designed to split the dataset into two so that similar response values end up in the same set.

For sklearn's random forests, our impurity measure is defaulted to Gini impurity.

In [121]:
feature_imp = pd.Series(rfTree.feature_importances_,index=predictors).sort_values(ascending=False)
feature_imp
Out[121]:
tempo                0.118183
danceability         0.109975
speechiness          0.109001
artist_popularity    0.107442
loudness             0.105766
liveness             0.101695
acousticness         0.100775
energy               0.097329
valence              0.093922
instrumentalness     0.055913
dtype: float64

Here, we see that instrumentalness again has the lowest score, thus we can remove it and train our model again.

In [167]:
predictors = ["acousticness", "danceability", "energy", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")
In [142]:
x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1916    0]
 [  69    9]]
              precision    recall  f1-score   support

       False       0.97      1.00      0.98      1916
        True       1.00      0.12      0.21        78

   micro avg       0.97      0.97      0.97      1994
   macro avg       0.98      0.56      0.59      1994
weighted avg       0.97      0.97      0.95      1994

In [143]:
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Removing instrumentalness as a predictor, gave us a precision score of 1.00! The AUC score remains the same at 0.759.

Next, we can attempt to provide weights and re-sample using SMOTE.

In [144]:
## Method 1: Putting weights using skleaern's class_weight parameter

rfTree=RandomForestClassifier(n_estimators=100, random_state=20, class_weight="balanced")
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1916    0]
 [  70    8]]
              precision    recall  f1-score   support

       False       0.96      1.00      0.98      1916
        True       1.00      0.10      0.19        78

   micro avg       0.96      0.96      0.96      1994
   macro avg       0.98      0.55      0.58      1994
weighted avg       0.97      0.96      0.95      1994

In [145]:
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Very similar results, with precision score at 1.0 again. The AUC has marginally improved to 0.76.

In [149]:
weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=RandomForestClassifier(n_estimators=100, random_state=20),
    param_grid={
        'class_weight': [{0: x, 1: 1.0-x} for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(x_train, y_train)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'class_weight': {0: 0.7131578947368421, 1: 0.2868421052631579}}
Out[149]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd611b03470>
In [150]:
## Method 2: Putting calculated weights using grid search

rfTree=RandomForestClassifier(n_estimators=100, random_state=20, **grid_result.best_params_)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(x_test)
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1914    2]
 [  70    8]]
              precision    recall  f1-score   support

       False       0.96      1.00      0.98      1916
        True       0.80      0.10      0.18        78

   micro avg       0.96      0.96      0.96      1994
   macro avg       0.88      0.55      0.58      1994
weighted avg       0.96      0.96      0.95      1994

In [152]:
y_pred_proba = rfTree.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Method 2 did not provide an improvement and performed worse than Method 1.

In [153]:
## Method 3: SMOTE

pipe = make_pipeline(
    SMOTE(),
    RandomForestClassifier(n_estimators=100, random_state=20)
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1813  103]
 [  56   22]]
              precision    recall  f1-score   support

       False       0.97      0.95      0.96      1916
        True       0.18      0.28      0.22        78

   micro avg       0.92      0.92      0.92      1994
   macro avg       0.57      0.61      0.59      1994
weighted avg       0.94      0.92      0.93      1994

In [154]:
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Using the SMOTE method, we seem to get results similar to our previous ML techniques.

In [155]:
## Method 4: SMOTE + optimal ratio

pipe = make_pipeline(
    SMOTE(),
    RandomForestClassifier(n_estimators=100, random_state=20)
)

weights = np.linspace(0.05, 0.95, 20)

gsc = GridSearchCV(
    estimator=pipe,
    param_grid={
        "smote__sampling_strategy": [x for x in weights]
    },
    scoring='f1',
    cv=3
)
grid_result = gsc.fit(X, y)

print("Best parameters : %s" % grid_result.best_params_)

# Plot the weights vs f1 score
dataz = pd.DataFrame({ 'score': grid_result.cv_results_['mean_test_score'],
                       'weight': weights })
dataz.plot(x='weight')
Best parameters : {'smote__sampling_strategy': 0.14473684210526316}
Out[155]:
<matplotlib.axes._subplots.AxesSubplot at 0x7fd611de6668>
In [161]:
pipe = make_pipeline(
    SMOTE(sampling_strategy=0.14473684210526316),
    RandomForestClassifier(n_estimators=100, random_state=20)
)

# Fit..
pipe.fit(x_train, y_train)

# Predict..
y_pred = pipe.predict(x_test)

# Evaluate the model
cnf_matrix = metrics.confusion_matrix(y_test, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y_test, y_pred))
[[1905   11]
 [  64   14]]
              precision    recall  f1-score   support

       False       0.97      0.99      0.98      1916
        True       0.56      0.18      0.27        78

   micro avg       0.96      0.96      0.96      1994
   macro avg       0.76      0.59      0.63      1994
weighted avg       0.95      0.96      0.95      1994

In [162]:
y_pred_proba = pipe.predict_proba(x_test)[::,1]
fpr, tpr, _ = metrics.roc_curve(y_test, y_pred_proba)
auc = metrics.roc_auc_score(y_test, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

Method 4 did provide better results than Method 3.

Model Summary (Evaluations)

Method Precision Recall F1-Score AUC
Logistic Regression (NO pre-processing) 0 0 0 0.732
Logistic Regression (Balanced Class Weights) 0.1 0.65 0.17 0.727
Logistic Regression (Calculated Weights) 0.17 0.27 0.21 0.734
Logistic Regression (SMOTE) 0.1 0.67 0.18 0.734
Logistic Regression (SMOTE + optimal ratio) 0.17 0.31 0.22 0.729
Gaussian Naive Bayes (NO pre-processing) 0.27 0.17 0.21 0.765
Gaussian Naive Bayes (Calculated Priors) 0.14 0.47 0.21 0.765
Gaussian Naive Bayes (SMOTE + optimal ratio) 0.26 0.11 0.16 0.769
Decision Trees (No pre-processing) 0.16 0.21 0.18 0.579
Decision Trees (After feature-reduction) 0.19 0.25 0.22 0.600
Decision Trees (Balanced Class Weights) 0.23 0.27 0.25 0.615
Decision Trees (Calculated Weights) 0.17 0.2 0.18 0.578
Decision Trees (SMOTE) 0.12 0.36 0.19 0.625
Decision Trees (SMOTE + optimal ratio) 0.15 0.29 0.20 0.609
Random Forests (No pre-processing) 0.83 0.13 0.22 0.764
Random Forests (After feature-reduction) 1.00 0.12 0.21 0.759
Random Forests (Balanced Class Weights) 1.00 0.10 0.19 0.761
Random Forests (Calculated Weights) 0.80 0.10 0.18 0.716
Random Forests (SMOTE) 0.18 0.28 0.22 0.736
Random Forests (SMOTE + optimal ratio) 0.56 0.18 0.27 0.754

Above, we've compiled all the models' performance results in predicing hit songs.

For logistic regression, it seems like the pre-processed methods yielded better performance (albeit all similar) compared to the un-preprocessed method.

For Gaussian Naive Bayes, it seems like we weren't able to improve the benchmark set by the un-preprocessed method. Although, the calculated priors did yield a better recall score, it performed worsed in terms in precision. The SMOTE + optimal ration method did objectively worse amongst its Gaussian Naive Bayes counterparts.

For Decision Trees, it seems like this technique was the weakest amongst all the techiques we tried. However, we were able to beat the benchmark performance set by the un-preprocssed Decision Tree method after balancing the class weights.

For Random Forests, this technique overall performed the best amongst all the techinques we tried. We were able to marginally match the benchmark set by the un-preprocessed method with both feature-reduction and balanced class weights.

If we run the un-preprocessed Random Forest model, the feature-reduceed Random Forest model, and the balanced class weights Random Forest model on the entire data set we get the following:

In [169]:
# Un-preprocessed
predictors = ["acousticness", "danceability", "energy", "instrumentalness", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(X)
cnf_matrix = metrics.confusion_matrix(y, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y, y_pred))
[[6356    2]
 [  91  196]]
              precision    recall  f1-score   support

       False       0.99      1.00      0.99      6358
        True       0.99      0.68      0.81       287

   micro avg       0.99      0.99      0.99      6645
   macro avg       0.99      0.84      0.90      6645
weighted avg       0.99      0.99      0.98      6645

In [170]:
y_pred_proba = rfTree.predict_proba(X)[::,1]
fpr, tpr, _ = metrics.roc_curve(y, y_pred_proba)
auc = metrics.roc_auc_score(y, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We get 196 songs that we predicted correctly, with precision score of 0.99, recall of 0.68, f1-score of 0.81, and AUC of 0.931.

In [172]:
# Feature-reduced

predictors = ["acousticness", "danceability", "energy", "liveness", "loudness", "speechiness", "tempo", "valence", "artist_popularity"]
X = df[predictors]
y = df["hit"].astype("bool")

x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.3)

rfTree=RandomForestClassifier(n_estimators=100, random_state=20)
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(X)
cnf_matrix = metrics.confusion_matrix(y, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y, y_pred))
[[6355    3]
 [  77  210]]
              precision    recall  f1-score   support

       False       0.99      1.00      0.99      6358
        True       0.99      0.73      0.84       287

   micro avg       0.99      0.99      0.99      6645
   macro avg       0.99      0.87      0.92      6645
weighted avg       0.99      0.99      0.99      6645

In [173]:
y_pred_proba = rfTree.predict_proba(X)[::,1]
fpr, tpr, _ = metrics.roc_curve(y, y_pred_proba)
auc = metrics.roc_auc_score(y, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We get 210 songs that we predicted correctly, with precision score of 0.99, recall of 0.73, f1-score of 0.84, and AUC of 0.953.

In [174]:
rfTree=RandomForestClassifier(n_estimators=100, random_state=20, class_weight="balanced")
rfTree.fit(x_train, y_train)
y_pred = rfTree.predict(X)
cnf_matrix = metrics.confusion_matrix(y, y_pred)
print(cnf_matrix)
print(metrics.classification_report(y, y_pred))
[[6352    6]
 [  73  214]]
              precision    recall  f1-score   support

       False       0.99      1.00      0.99      6358
        True       0.97      0.75      0.84       287

   micro avg       0.99      0.99      0.99      6645
   macro avg       0.98      0.87      0.92      6645
weighted avg       0.99      0.99      0.99      6645

In [175]:
y_pred_proba = rfTree.predict_proba(X)[::,1]
fpr, tpr, _ = metrics.roc_curve(y, y_pred_proba)
auc = metrics.roc_auc_score(y, y_pred_proba)
plt.plot(fpr,tpr,label="data 1, auc="+str(auc))
plt.legend(loc=4)
plt.show()

We get 214 songs that we predicted correctly, with precision score of 0.97, recall of 0.75, f1-score of 0.84, and AUC of 0.951.

Here are the compiled results:

Method Precision Recall F1-Score AUC
Random Forests (Un-preprocessed) 0.99 0.68 0.81 0.931
Random Forests (After feature reduction) 0.99 0.73 0.84 0.953
Random Forests (Balanced Class Weights) 0.97 0.75 0.84 0.951

Overall, the two pre-processed methods did beat the beat the benchmark set by the un-preprocessed method with higher recall scores!

For curiosity, here are all the songs that the two pre-processed method predicted to be hits:

In [183]:
print(data.loc[(data.RF1_pred) & (data.RF2_pred), ["track", "artist"]].to_string())
                                           track                      artist
2             Everybody Dies In Their Nightmares                XXXTENTACION
8                             Young Dumb & Broke                      Khalid
9                                        HUMBLE.              Kendrick Lamar
17                                  Let You Down                          NF
18                                      Location                      Khalid
22                                       Silence                  Marshmello
27                                       Perfect                  Ed Sheeran
31                                   Look At Me!                XXXTENTACION
34                  Drowning (feat. Kodak Black)      A Boogie Wit da Hoodie
38                                          DNA.              Kendrick Lamar
41                                    Plain Jane                   A$AP Ferg
43                          Too Good At Goodbyes                   Sam Smith
47              Love Galore (feat. Travis Scott)                         SZA
48                                 Losin Control                        Russ
51                                   Bad At Love                      Halsey
55                             Whatever It Takes             Imagine Dragons
58                                     Hurricane                  Luke Combs
60                                      do re mi                   Blackbear
63                               Sorry Not Sorry                 Demi Lovato
69                   Slippery (feat. Gucci Mane)                       Migos
71                                        Wolves                Selena Gomez
74                                     New Rules                    Dua Lipa
86                                        Issues              Julia Michaels
89                                    Hallelujah                  Pentatonix
91                               Strip That Down                  Liam Payne
99                                 First Day Out                Tee Grizzley
100                                Tunnel Vision                 Kodak Black
103                                     Betrayed                     Lil Xan
105                                     The Race                       Tay-K
106                                        Mercy                 Brett Young
113                                     No Limit                      G-Eazy
117                               Small Town Boy                Dustin Lynch
123                                  Look At Me!                XXXTENTACION
127                                  The Weekend                         SZA
132              There's Nothing Holdin' Me Back                Shawn Mendes
133                                  Sauce It Up                Lil Uzi Vert
150                                   Rake It Up                    Yo Gotti
151                                        Mercy                Shawn Mendes
153                                Unforgettable                Thomas Rhett
157                     Look What You Made Me Do                Taylor Swift
161                                     Portland                       Drake
163                                    Fake Love                       Drake
167                     Something Just Like This            The Chainsmokers
177                                        Yours           Russell Dickerson
180                                      T-Shirt                       Migos
184                                  Galway Girl                  Ed Sheeran
192                                       Chanel                 Frank Ocean
202                                 Broken Halos             Chris Stapleton
209                                         Dive                  Ed Sheeran
216                                         4 AM                    2 Chainz
222   Swalla (feat. Nicki Minaj & Ty Dolla $ign)                Jason Derulo
228                                      Praying                       Kesha
238                                        Paris            The Chainsmokers
240                                   Patty Cake                 Kodak Black
244                                What About Us                        P!nk
245                                     Mi Gente                    J Balvin
249                                    Despacito                  Luis Fonsi
250                                        Slide               Calvin Harris
253                                      Mayores                     Becky G
254                                 Losing Sleep                 Chris Young
260                                        Rolex                   Ayo & Teo
265                                     Gorgeous                Taylor Swift
270                          Scared to Be Lonely               Martin Garrix
280                                     No Limit                      G-Eazy
292                                        Human              Rag'n'Bone Man
293                                     Bad Liar                Selena Gomez
294             Bad Things (with Camila Cabello)           Machine Gun Kelly
296                                  Gyalchester                       Drake
298                                  Craving You                Thomas Rhett
303                                        Moves                    Big Sean
312                                      My Girl                 Dylan Scott
313                                        Signs                       Drake
336                     Something Just Like This            The Chainsmokers
340                        Call It What You Want                Taylor Swift
353                          Nobody Else But You                  Trey Songz
364                                     HandClap       Fitz and The Tantrums
374                                          DNA                         BTS
376                                      Praying                       Kesha
391                            Sleep Without You                 Brett Young
392                             Drinkin' Problem                     Midland
418                                        Feels               Calvin Harris
428                                    This Town                 Niall Horan
435                                         Weak                         AJR
470                                If I Told You               Darius Rucker
477                                  Untouchable  YoungBoy Never Broke Again
482                                    New Rules                    Dua Lipa
490                                        Lemon                     N.E.R.D
555                                     Location                      Khalid
569                                    Too Hotty             Quality Control
581                               Play That Song                       Train
582                                    Liability                       Lorde
594                              Black SpiderMan                       Logic
627                                      Selfish                      Future
640                                        Alone                      Halsey
652                          More Girls Like You                   Kip Moore
656                                    Questions                 Chris Brown
673                                        Paris            The Chainsmokers
676                                 Love So Soft              Kelly Clarkson
677                                       Malibu                 Miley Cyrus
701                               No Limit REMIX                      G-Eazy
717                                  Patek Water                      Future
753                                  Untouchable  YoungBoy Never Broke Again
784                        Chained To The Rhythm                  Katy Perry
799                                    No Frauds                 Nicki Minaj
830                                            X                Lil Uzi Vert
837                                     Drowning                       KREAM
855                                        Woman                       Kesha
857                                         Down               Fifth Harmony
868                                      Privacy                 Chris Brown
874                         Yours If You Want It               Rascal Flatts
877                                  Swish Swish                  Katy Perry
904                           Every Little Thing                Carly Pearce
914                                    No Favors                    Big Sean
926                                        Water                    Ugly God
959                                        Slide               Calvin Harris
963                                   Fresh Eyes                Andy Grammer
1023                                80s Mercedes                Maren Morris
1097                                 Candy Paint                 Post Malone
1180                                 Guys My Age                  Hey Violet
1182                                    Location               Playboi Carti
1223                               Walk On Water                    A$AP Mob
1235                             Too Much To Ask                 Niall Horan
1360                                     Thunder                   Roy Blair
1362                                    The Plan                      G-Eazy
1405                                       Today                Trippie Redd
1539                                Let You Down                          NF
1625                          Not Afraid Anymore                      Halsey
1635                                  Peek A Boo                  Lil Yachty
1647            Bad Things (with Camila Cabello)           Machine Gun Kelly
1659                                 Sauce It Up                Lil Uzi Vert
1727                               First Day Out                 Kodak Black
1813                                       Feels               Calvin Harris
1842                                 Craving You                Thomas Rhett
1844                                   Liability                       Lorde
1883                                       Alone                    Yo Trane
1902                                 No Promises                Shawn Mendes
1922                                     Privacy                 Chris Brown
1923                            Drinkin' Problem                     Midland
1932             All I Want For Christmas Is You           Laurence Nerbonne
1960                               Tunnel Vision                 Kodak Black
1963                                       Mercy                      Lookas
2008                               XO TOUR Llif3                Lil Uzi Vert
2150                               XO TOUR Llif3                Fame on Fire
2173                                      Change                          RM
2179                           Homemade Dynamite                       Lorde
2194                                  Better Man  YoungBoy Never Broke Again
2278                                   Too Hotty             Quality Control
2416                                           X                Lil Uzi Vert
2469                                     Perfect                   Dave East
2505                               Unforgettable                Thomas Rhett
2670                                      Change                Lana Del Rey
2748                               Cheap Thrills                  The Bellas
2749                                   Questions                    PnB Rock
2776                                     Friends                    PnB Rock
2929                                 Look At Me!                XXXTENTACION
2939                                 It Ain't Me                        Kygo
2951                             It's Goin' Down                Dove Cameron
3162                                 Untouchable                      Eminem
3238                                  Peek A Boo                  Lil Yachty
3292                                   I Got You                Josh Mirenda
3299                                       Heavy              Our Last Night
3394                                       Water                Caleb Belkin
3434             There's Nothing Holdin' Me Back               Kidz Bop Kids
3468                                       Woman                       Kesha
3653             All I Want For Christmas Is You              Clementine Duo
3676                                      Closer                      POWERS
3700                           Don't Let Me Down                Joy Williams
3878                                    High End                 Chris Brown
4287                             Congratulations                  Shy Glizzy
4294                                    Everyday                     Guvna B
4365             There's Nothing Holdin' Me Back                Shawn Mendes
4380                                    Caroline                Animal Years
4418                                        Dive                Coast Modern
4444                                 Bounce Back                    Big Sean
4549                                        DNA.              Kendrick Lamar
4590                                       Slide                       Aminé
4705                                 Light It Up                      Neffex
4739                            Jingle Bell Rock                 Dylan Scott
4743                                      Closer                Boyce Avenue
4785                                      Honest                    Jay Wile
4922                                    The Plan                      G-Eazy
5074                                Passionfruit                       Drake
5118                               Unforgettable              French Montana
5151                    Look What You Made Me Do                 The Mayries
5161                                       Heavy               Jeremy Zucker
5162                               Tunnel Vision                 Kodak Black
5182                                      Wolves                Rise Against
5208                                The Greatest               Kidz Bop Kids
5278                                  Plain Jane                   A$AP Ferg
5431                                       Human                   Sevdaliza
5596                              Jocelyn Flores                       Q.Z.B
5708                                       Curve                        SoMo
5779                                     Friends                      Dbangz
5785                              1-800-273-8255                       Logic
5862                                    Portland                       Drake
5881                                   Despacito               Conor Maynard
6041                                    Caroline                       Aminé
6087                                       Heavy                   Oh Wonder
6140                                 Sauce It Up                Lil Uzi Vert
6171                                     Privacy                 Chris Brown
6196                                      Change                     Sir Sly
6199                                       Mercy                Shawn Mendes
6327                                     T-Shirt                Foo Fighters
6476                                    No Limit                      G-Eazy
6693                           Whatever It Takes                 Anita Baker
6815                                  Sky Walker                      Miguel
7025                         Mary, Did You Know?                Zara Larsson
7155                                Transportin'                 Kodak Black
7304                                      Honest            Promoting Sounds
7313                                  Plain Jane                   A$AP Ferg
7327                              1-800-273-8255                       Logic

Takeaways