Music Streaming Model Success


Would a model that gives artists on music streaming platforms individualized insights into the potential success or popularity of their music be effective?



Introduction

While the present-day music industry has a substantial impact on contemporary trends and pop culture, its main purpose is still profit, as can be seen in the stringent enforcement of laws such as the Digital Millennium Copyright Act (DMCA) on online video streaming services. As a result, record labels and music distributors might express interest in an algorithmic approach to evaluating the potential popularity of a new song, based on features available from music streaming platforms, to aid decisions on which artists to sign. Thus, the motivation behind this project was to determine whether a model that gives artists on music streaming platforms individualized insights into the potential success or popularity of their music would be effective. To answer this question, I tested 3 models built on artist metadata and multiple musical properties of songs in order to identify the factors that influence a song's potential popularity. The Random Forest model, Logistic Regression model, and Neural Network reached maximum accuracies of 80.11%, 78.00%, and 41.23% under their respective labeling schemes.

Data Collection & Exploratory Data Analysis

For this project, I used the Million Song Dataset created by Columbia University’s LabROSA and The Echo Nest, which contains information on 1,000,000 songs sampled from multiple music genres and time periods. This dataset was chosen primarily because of its ease of access, volume of data, and the fact that many modern music streaming platforms, such as Spotify, provide similar or identical musical analysis features via their API. Due to the size of the full dataset (approximately 250 GB), I opted to use a smaller 10,000 song subset. Each entry in the dataset has 53 features, which include musical attributes such as tempo, time signature, timbre (texture features of the music), song and artist metadata, and a popularity score ranging from 0 to 1, with 1 being the most popular.

One might anticipate that metadata such as artist familiarity and the artist's geographic location are distinguishing factors in whether a song becomes popular. For this reason, the geographic locations of the artists were plotted, as shown below:
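The sketch below shows roughly how such a plot can be produced with matplotlib; it is illustrative rather than the original analysis code, and it assumes the subset has already been loaded into a pandas DataFrame named df (a name used here purely for illustration) that carries the MSD fields artist_latitude and artist_longitude.

    import matplotlib.pyplot as plt

    # Scatter plot of artist locations; `df` is assumed to be a pandas DataFrame
    # holding the 10,000-song subset with the MSD metadata fields.
    coords = df[["artist_longitude", "artist_latitude"]].dropna()

    plt.figure(figsize=(10, 5))
    plt.scatter(coords["artist_longitude"], coords["artist_latitude"], s=4, alpha=0.3)
    plt.xlabel("artist_longitude")
    plt.ylabel("artist_latitude")
    plt.title("Artist locations in the 10,000-song subset")
    plt.show()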



Notably, the artists' latitudes and longitudes tend to be concentrated in established music-industry centers such as Los Angeles, London, and parts of Latin America. Because there is so little geographic variation separating popular from unpopular songs, location may not be a significant factor for popularity in the context of this model. Moreover, the distribution of popularity scores was found to be right-skewed. The smaller quantity of data available in the higher score range raised concerns about how effective a model would be at predicting the percentile into which a given song might fall.
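As a quick illustration rather than the original analysis code, the right skew can be confirmed with pandas, again assuming the cleaned subset sits in a DataFrame df with the MSD song_hotttnesss column:

    import matplotlib.pyplot as plt

    # Histogram and skewness of the popularity scores; `df` is assumed to hold
    # the cleaned subset with `song_hotttnesss` values in [0, 1].
    df["song_hotttnesss"].plot.hist(bins=20)
    plt.xlabel("song_hotttnesss")
    plt.ylabel("Number of songs")
    plt.show()

    # A positive skewness value indicates a right-skewed distribution.
    print("Skewness:", df["song_hotttnesss"].skew())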



Data Cleaning & Feature Engineering

The Million Song Dataset is originally stored in the HDF5 file format, which is specialized for storing large volumes of structured data. The HDF5 file containing the data consists of 3 groups: analysis (musical features), metadata (song and artist metadata), and musicbrainz (song metadata). I used the h5py library to convert the original HDF5 file into multiple pandas DataFrames, which were then inner joined on the audio_md5 feature that uniquely identifies each song. After the groups were joined, data points containing NaN values or whose features were all 0 were removed, along with songs missing a value for year or popularity. Categorical features such as time signature were encoded with a one-hot scheme. Throughout this process, I also identified features such as idx_bars_confidence whose value was 0 for every data point; these were removed as well, leaving 17 features in the fully cleaned dataset.
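The conversion and cleaning steps can be sketched roughly as follows. The file name msd_subset.h5 and the assumption that each group exposes a songs table containing audio_md5 are illustrative; the exact layout of the Million Song Dataset files may differ.

    import h5py
    import numpy as np
    import pandas as pd

    def group_to_dataframe(h5_file, group):
        # Each group is assumed to expose a structured `songs` table.
        return pd.DataFrame(h5_file[group]["songs"][:])

    with h5py.File("msd_subset.h5", "r") as h5_file:   # hypothetical aggregated file
        analysis = group_to_dataframe(h5_file, "analysis")
        metadata = group_to_dataframe(h5_file, "metadata")
        musicbrainz = group_to_dataframe(h5_file, "musicbrainz")

    # Inner join the three groups on the song identifier.
    df = analysis.merge(metadata, on="audio_md5").merge(musicbrainz, on="audio_md5")

    # Drop rows with NaNs, rows whose numeric features are all zero, and songs
    # whose year is recorded as 0 (missing).
    df = df.dropna()
    df = df[(df.select_dtypes(include=[np.number]) != 0).any(axis=1)]
    df = df[df["year"] != 0]

    # One-hot encode categorical features such as the time signature.
    df = pd.get_dummies(df, columns=["time_signature"])

    # Drop constant columns such as idx_bars_confidence, which is 0 everywhere.
    df = df.loc[:, df.nunique() > 1]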

The original dataset provides a real-valued popularity score in the range [0, 1]. In an effort to enhance the relevance of the popularity metric, I created multiple versions of the dataset using several schemes for determining the song popularity label (a sketch of all three schemes follows the list):

  1. 2 classes (Scheme 1): Songs were given the label 1 if their song_hotttnesss value was greater than or equal to 0.5, 0 otherwise.
  2. 2 classes (Scheme 2): Songs were given the label 1 if their song_hotttnesss value was greater than or equal to the median song_hotttnesss in the dataset, 0 otherwise.
  3. 10 classes (Scheme 3): Songs were given a label in the range [0, 9] corresponding to the 10-percentile bucket (decile) into which their score falls. For example, a song in the 7th percentile falls in the 0-10 bucket and receives the label 0, while a song in the 15th percentile falls in the 10-20 bucket and receives the label 1, and so on.
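The sketch below shows one way these three label variants could be derived from the song_hotttnesss column, assuming the cleaned DataFrame df from the previous section:

    # Three labeling schemes derived from the popularity score; `df` is the
    # cleaned DataFrame with `song_hotttnesss` in [0, 1].
    hot = df["song_hotttnesss"]

    # Scheme 1: popular (1) if the score is at least 0.5, otherwise 0.
    labels_scheme1 = (hot >= 0.5).astype(int)

    # Scheme 2: popular (1) if the score is at least the dataset median.
    labels_scheme2 = (hot >= hot.median()).astype(int)

    # Scheme 3: label 0-9 by percentile rank, so the bottom 10% of songs get
    # label 0, the next 10% get label 1, and so on.
    labels_scheme3 = (hot.rank(pct=True) * 10).astype(int).clip(upper=9)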



Modelling

The models used were a Random Forest, Logistic Regression, and a 3-layer neural network. A Random Forest was chosen because decision trees provide a clear metric of feature importance and make relationships between features easy to observe, which was one of the initial goals of the project. Logistic regression served as a general benchmark for comparison with the other models. Lastly, a neural network was implemented with the goal of capturing any potential nonlinear relationships between features.







I used sklearn to implement the Random Forest and Logistic Regression models and PyTorch for the neural network. Most parameters of the former 2 models were left at the sklearn defaults; the modified parameters of each model are summarized in the table below:
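As a rough sketch of the setup (not the exact configuration from the table), the three models might be instantiated as follows; the hidden-layer sizes and other hyperparameters shown here are illustrative placeholders.

    import torch
    import torch.nn as nn
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.linear_model import LogisticRegression

    # sklearn models, mostly left at their default parameters.
    rf_model = RandomForestClassifier(n_estimators=100, random_state=0)
    lr_model = LogisticRegression(max_iter=1000)

    # 3-layer feed-forward network for the 10-class scheme; the input size of 17
    # matches the number of cleaned features, and the layer widths are assumptions.
    nn_model = nn.Sequential(
        nn.Linear(17, 64),
        nn.ReLU(),
        nn.Linear(64, 32),
        nn.ReLU(),
        nn.Linear(32, 10),   # one output per popularity decile
    )
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.Adam(nn_model.parameters(), lr=1e-3)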



Results



The Random Forest model showed an accuracy of 80.11% in Scheme 1 and 69.80% in Scheme 2, which is a marginal improvement over random binary classification predictions. The Logistic Regression model showed 78% accuracy in Scheme 1 and 71% in Scheme 2. Finally, the Neural Network achieved an accuracy of 41.23% in Scheme 3, significantly better than the 10% expected from random predictions across ten classes. Based on the Gini importance of each feature as computed by sklearn, metadata such as artist_hotttnesss and artist_familiarity proved to be less influential on positive labels than initially predicted. However, the ranking of artist_latitude and artist_longitude as the least important features is consistent with the initial predictions, since most songs in the dataset originate from the regions where popular music production is most concentrated.
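For reference, the Gini importances mentioned above can be read from a fitted sklearn Random Forest roughly as follows, assuming rf_model has been trained on a feature DataFrame X (both names are used here purely for illustration):

    import pandas as pd

    # Rank the 17 features by Gini importance after fitting the Random Forest.
    importances = pd.Series(rf_model.feature_importances_, index=X.columns)
    print(importances.sort_values(ascending=False))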