Introduction

For this research project, we wanted to explore trends in music over the past decade. We used a dataset of the top Spotify songs from 2010 to 2019. The dataset has 603 rows (one per song) and 15 columns. The 15 columns/variables are:

  • id: an ID number
  • title: The song’s title
  • artist: The song’s artist, includes featuring artists
  • top genre: The song’s main genre
  • year: The song’s year on Billboard
  • bpm: beats per minute, indicates the song’s tempo
  • nrgy: a measurement of the song’s energy; higher values mean a more energetic song
  • dnce: danceability, a measurement of how easy it is to dance to the song
  • dB: loudness; the higher the value, the louder the song
  • live: liveness; the higher the value, the more likely the song is a live recording
  • val: valence; the higher the value, the more positive mood for the song
  • dur: length; the duration of the song
  • acous: acousticness; the higher the value, the more acoustic the song is
  • spch: speechiness; the higher the value, the more spoken word the song contains
  • pop: popularity; the higher the value, the more popular the song is

The categorical variables are id, title, artist, top genre and year (which can also be understood as a quantitative variable). Every other variable is a continuous quantitative variable.

We wanted to address the following research questions:

  • Which values of different song components, such as genre, valence, or danceability, are correlated with the most popular songs of the decade 2010 - 2019 on Spotify?
  • Are songs that are more positive (higher valence) more popular? Which song components are more correlated with valence? Do median values of valence change throughout the decade?
  • What lyrics are common within different genres of music? Do these themes change over time?



Data Manipulation

Before we began our analysis of this dataset, we created a new variable, genre, which aggregates the 50 original genres into 10 levels: “alternative pop,” “boy band,” “dance pop,” “edm,” “electro,” “hip hop,” “international pop,” “pop,” “r&b,” and “other.” This made faceting by genre, as well as the correlations and statistical analysis, clearer and more insightful.
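As an illustration only, this kind of aggregation could be sketched in Python as follows. The report does not list the actual mapping from the 50 original genres to the 10 levels, so the keyword rules, the function name aggregate_genre, and the rule order below are all assumptions, not the mapping used in the analysis.

```python
# Hypothetical keyword rules: the report does not give the real mapping.
# Rules are checked in order, so more specific labels come first.
GENRE_RULES = [
    (("boy band",), "boy band"),
    (("dance pop",), "dance pop"),
    (("hip hop",), "hip hop"),
    (("edm", "big room", "house"), "edm"),
    (("electro",), "electro"),
    (("r&b", "soul"), "r&b"),
    (("indie", "art pop", "alternative"), "alternative pop"),
]


def aggregate_genre(top_genre: str) -> str:
    """Collapse a fine-grained genre label into one of ten broad levels."""
    label = top_genre.strip().lower()
    if label == "pop":
        return "pop"
    for keywords, level in GENRE_RULES:
        if any(keyword in label for keyword in keywords):
            return level
    # Assumption: any remaining "<something> pop" label (e.g. "canadian pop")
    # is treated as international pop.
    if label.endswith(" pop"):
        return "international pop"
    return "other"
```

For example, under these assumed rules aggregate_genre("canadian pop") gives "international pop" and aggregate_genre("neo mellow") falls through to "other".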



Exploratory Data Analysis (EDA)

The EDA produced here shows that there is very little correlation between bpm and pop, even when accounting for genre. There is some variation in the faceted scatterplots: different genres appear to have slightly different spreads (for example, “dance pop” has a larger range in both pop and bpm). Still, at least here, most genres of music in the dataset (of top Spotify songs from 2010 to 2019) are more similar to each other than dissimilar.

We also looked at whether certain genres tended to be more popular. Based on side-by-side boxplots, “dance pop” has the most extreme outliers. “r&b” has the highest median popularity, followed by “electro” and “pop”, whereas “alternative pop” has the lowest median popularity. Overall, “pop” has the smallest difference in popularity between the 25th and 75th percentiles, whereas “hip hop” and “other” have the largest.
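The two quantities read off the boxplots above, the median and the difference between the 25th and 75th percentiles (the interquartile range), can be computed directly. The following Python sketch is illustrative: the function name popularity_summary and the (genre, pop) input format are assumptions, and the quartile method may differ slightly from the one used to draw the boxplots.

```python
from collections import defaultdict
from statistics import median, quantiles


def popularity_summary(rows):
    """Return {genre: (median, iqr)} for an iterable of (genre, pop) pairs.

    Each genre needs at least two songs for the quartiles to be defined.
    """
    by_genre = defaultdict(list)
    for genre, pop in rows:
        by_genre[genre].append(pop)

    summary = {}
    for genre, values in by_genre.items():
        # "inclusive" treats the data as the full population of interest.
        q1, _, q3 = quantiles(values, n=4, method="inclusive")
        summary[genre] = (median(values), q3 - q1)
    return summary
```

Sorting the resulting dictionary by median or by interquartile range gives the same kind of ranking of genres that the boxplots show visually.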



Multi-Dimensional Scaling (MDS)

To better understand potential underlying patterns in the data, we performed Multi-Dimensional Scaling (MDS) on the quantitative variables of this dataset, of which there are many (bpm, nrgy, dnce, dB, live, val, dur, acous, spch, and pop).

We found that there was only one cluster, with no discernible quantitative variables associated with the single cluster. Because of the single large cluster, we wanted to see if there was any large correlation between the variables themselves, which can be more clearly discerned through Principal Component Analysis (PCA).
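MDS starts from a matrix of pairwise distances between songs. As a rough sketch of that first step only (not the scaling itself), the Python below standardizes each quantitative variable and computes Euclidean distances. Whether the original analysis standardized the variables, and which distance it used, is not stated in the report, so both choices here are assumptions, as are the function names.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Standardize each column of a list of numeric rows to mean 0, sd 1."""
    columns = list(zip(*rows))
    means = [mean(col) for col in columns]
    sds = [pstdev(col) for col in columns]
    return [
        # A constant column carries no information, so it maps to 0.
        [(v - m) / s if s else 0.0 for v, m, s in zip(row, means, sds)]
        for row in rows
    ]


def euclidean_distances(rows):
    """Return the full symmetric matrix of pairwise Euclidean distances."""
    return [
        [sqrt(sum((a - b) ** 2 for a, b in zip(r1, r2))) for r2 in rows]
        for r1 in rows
    ]
```

Standardizing first keeps variables on large scales, such as dur or bpm, from dominating the distances; without it, a single cluster could partly reflect scale rather than structure.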



Principal Component Analysis (PCA)

We performed PCA on all of the quantitative variables in the Spotify dataset, with songs grouped by the new 10-level genre variable. Because there is no clear clustering among the 10 genre groups, no concrete relationship can be found between changes in a variable and a song’s genre. That said, we can still look at the angles between the variable arrows to discern the correlations between the variables. We can see that the variables that have the largest negative correlation are live, bpm, and spch. The most discernible correlations are positive correlations among dB, nrgy, dnce, and val; a strong negative correlation between acous and nrgy; and perhaps some correlation among live, bpm, and spch.

After computing the correlations between the variables with the strongest relationships, we can see that there is a strong positive correlation (correlation value greater than 0.5) of 0.6636233 between dB and nrgy, and a correlation of 0.494928 between dnce and val, just under that threshold. There is also a strong negative correlation (correlation value less than -0.5) of -0.5765065 between acous and nrgy.
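For reference, a pairwise correlation of this kind can be computed as in the Python sketch below. It assumes the reported values are Pearson correlations, which the report does not state explicitly. The helper strength applies the report's 0.5 and -0.5 thresholds; both function names are illustrative.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / sqrt(sxx * syy)


def strength(r, threshold=0.5):
    """Label a correlation using the report's +/-0.5 cutoff."""
    if r > threshold:
        return "strong positive"
    if r < -threshold:
        return "strong negative"
    return "below threshold"
```

Applied to the values reported above, strength(0.6636233) and strength(-0.5765065) are labeled strong, while strength(0.494928) falls just below the cutoff.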



Exploring Valence

We can see that val has a fairly strong correlation with dnce, which motivated a closer look at the val variable. We looked at trends in valence over time and across genres. Are certain genres of songs “happier” in general? Are songs that were released in a certain year “happier”?

Based on the boxplots, the spread of valence is smallest for EDM and largest for boy band songs. Boy band songs also seem to have the highest valence overall, since the boxplot for that genre sits highest in the plot, meaning that boy band songs put us in the most positive mood. There is one electro song in the dataset with a surprisingly high valence, shown by the outlier in the boxplot. The first set of boxplots is colored by the year the song was released to see whether there was any pattern between year and these attributes (e.g., were less energetic songs released in 2016 after the presidential election?), but there seems to be no distinct difference in the range of valence levels among different years.

To tie this back to the popularity of songs, we then examined how valence relates to popularity and to the other variables. This analysis supported the correlation between danceability and valence.

There doesn’t seem to be a correlation between the popularity of the song and the positive mood levels of the song. This is surprising, because we would expect that people like songs that put them in a better mood.



Exploring Lyrics

We wanted to bring in another factor to provide more information about the data: lyrics. First, we made an overall word cloud to see the common words across all of these popular songs.

A lot of words in this plot are related to love and relationships, like “love”, “like”, and “want”. This suggests that popular songs, or songs in general, are often about romance. Next, we wanted to see whether the trends shown in the overall word cloud would also hold within each genre.

Our initial idea was to create word clouds for each genre’s song lyrics. However, this resulted in a lot of very similar plots, which implied that all genres have many words in common, so we decided to change tactics. Our next step was to identify the differences between each genre’s lyrics and the lyrics of all of the songs in our corpus. We did this by creating a comparison word cloud of each genre’s lyrics against the lyrics of all of the songs in our dataset.

Since there are 10 genres, we chose to display only the 2 most interesting comparison clouds, which were hip hop and international pop. We considered these two the most interesting because they differed the most from the overall corpus.

We found that there are far fewer words on the “all lyrics” side of the cloud (in all of the word clouds, not just these two), even after increasing the max.words argument in the word cloud function. This led us to believe that many of the most common words in the individual genres are also common words for songs in general. One notable aspect of the international pop genre’s comparison cloud is the mix of non-English and English words on the international pop side, which suggests that the songs in this genre are generally a mix of Spanish and English lyrics.
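The idea behind a comparison cloud, finding words that are relatively more frequent in one genre than in the corpus as a whole, can be sketched in Python as below. This is a simplified illustration of the idea, not the word cloud function's actual algorithm; the tokenization rule and the function names are assumptions.

```python
import re
from collections import Counter


def word_rates(texts):
    """Map each word to its share of all words in a list of lyric strings."""
    counts = Counter(
        word for text in texts for word in re.findall(r"[a-z']+", text.lower())
    )
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()} if total else {}


def distinctive_words(genre_texts, all_texts, top_n=5):
    """Rank words by how much more common they are in the genre than overall."""
    genre = word_rates(genre_texts)
    overall = word_rates(all_texts)
    diffs = {
        word: genre.get(word, 0.0) - overall.get(word, 0.0)
        for word in set(genre) | set(overall)
    }
    # Largest positive difference first; ties broken alphabetically.
    ranked = sorted(diffs.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]
```

Words that are common everywhere, such as “love”, end up with differences near zero under this scheme, which fits the observation that the “all lyrics” side of each cloud is sparse.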

For the next part of the lyric analysis, we also wanted to see how lyrics changed over time within each genre. We split our dataset at 2015, since this gave a roughly even number of songs before and after, and we used the same two genres as before to maintain consistency.

A couple of the words in the post-2015 halves are essentially nonsense words, like “doo”, “hmm”, or “boom”, which could indicate that these genres use a lot of filler words in between verses. We decided not to filter these words out of our corpus, since they do indicate some meaning (or lack of meaning) within songs.
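One way to quantify this observation would be to split the songs at the cutoff year and compare the share of filler words in each half, as in the Python sketch below. The report does not say which half the 2015 songs fall into, so placing them in the later half is an assumption. The filler set contains only the three examples named above, and the dictionary keys "year" and "lyrics" are hypothetical.

```python
from collections import Counter

# Only the examples named in the text; a real list would need to be longer.
FILLERS = {"doo", "hmm", "boom"}


def split_at_year(songs, cutoff=2015):
    """Split songs into (before, after); the cutoff year goes in 'after'."""
    before = [song for song in songs if song["year"] < cutoff]
    after = [song for song in songs if song["year"] >= cutoff]
    return before, after


def filler_share(songs):
    """Fraction of all lyric words in these songs that are filler words."""
    words = [word for song in songs for word in song["lyrics"].lower().split()]
    if not words:
        return 0.0
    counts = Counter(words)
    return sum(counts[word] for word in FILLERS) / len(words)
```

Comparing filler_share for the two halves within a single genre would put a number on the pattern that the split word clouds suggest.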



Conclusion

In conclusion, we have gathered some key takeaways from each of our research questions. From the EDA examining the pop variable, we found that despite the different genres, many popular songs from 2010 - 2019 are very similar in terms of their ranges for quantitative variables such as bpm.

From our MDS and PCA, we found that there are different correlations between different pairs of variables and that genre shows no discernible clustering or patterns among songs. This supports the view that there are no distinct patterns between song genres that correlate with the level of a song’s popularity. However, a song’s decibel level provides a lot of information about its energy level, while a song’s danceability provides a lot of information about how “happy” the song is. In a similar vein, how acoustic a song is has an inverse relationship with the song’s energy level.

When looking at val (valence) across different genres and years, it seems that boy band songs have the highest levels of valence, meaning that boy band songs put us in the most positive mood. There seems to be no distinct difference in the range of valence levels between different years, despite the fact that there may be certain events that take place in different years. Tying this back to popularity, there doesn’t seem to be a correlation between the popularity of a song and its positive mood level for different genres.

Moving on to lyrics, we found that some of the most common words across all songs relate to romance, which suggests that relationships and love are common topics within songs. Using similar analysis across different genres, we saw that every genre had lyrics very similar to those of the overall corpus, although each comparison cloud still had many words in its genre-specific section. This implies that within different genres the lyrics and themes are fairly similar to those of the entire song corpus. When we split the songs at 2015, we saw an increase in meaningless words, such as “boom” and “yuh”, after 2015. For hip hop specifically, the words became more explicit after 2015.