Scenario
About Spotify
1. Ask Phase
1.1. Business Task
2. Prepare Phase
2.1. Data Source
2.2. Information About The Dataset and Data Organization
3. Process Phase
3.1. Loading Packages
3.2. Importing Dataset
3.3. Previewing the Data Frame
3.4. Cleaning And Formatting
3.4.1. Data Cleaning
3.4.2. Data Formatting
3.4.3. Merging and Standardizing the Date Column
3.4.4. Cleaning and Converting Character Columns to Numeric or Integer Values
4. Analyze and Share Phase
4.1. What are the Reflections of the Popular Songs on Other Platforms?
4.2. Is there a correlation between BPM and Danceability Values?
4.3. Is There a Common Pattern in BPM and Danceability?
4.4. Do Popular Songs Have High Energy and High Danceability Values?
4.5. Is There a Popular “Word” for a Song Name?
5. Act Phase
This analysis explores key audio features—danceability, energy, loudness, and valence—to identify trends in Spotify’s most streamed songs. The dataset is preprocessed by handling missing values and categorizing songs by energy and danceability levels for structured comparisons.
Visualizations, including bar charts and word clouds, highlight patterns in popular music. The findings help identify which characteristics contribute to a song’s success, providing insights for artists, producers, and music analysts.
Spotify is a popular music streaming platform launched in 2008 that allows users to listen to millions of songs, podcasts, and audiobooks. With a free, ad-supported version and premium subscription options, Spotify provides access to curated playlists, personalized recommendations, and offline listening. Its algorithm uses user preferences and listening history to suggest content, making discovery seamless. Spotify supports cross-device listening and is available on mobile, desktop, smart speakers, and other devices. It also empowers artists to share their music globally, offering detailed analytics through Spotify for Artists. Known for its intuitive interface, Spotify is a leader in the audio streaming industry.
This dataset is analyzed to uncover trends in song releases, danceability, acoustics, and popularity metrics. Optimal characteristics and timing for future songs are identified to maximize streaming performance and enhance audience engagement.
Analyze the Most Streamed Spotify Songs 2023 dataset to determine the release months when danceable songs are most common, identify key trends in the top 1,000 songs, and evaluate the time it takes for songs to reach their peak popularity after release. Use these insights to develop actionable strategies for optimizing future song releases and maximizing audience engagement.
- During which months of the year are danceable songs mostly released?
- What are the trends in the top 1000 songs of 2023?
- How many days after the release date does it take for songs to reach their highest level?
- How can these trends be applied for future songs?
This analysis uses the dataset titled Most Streamed Spotify Songs 2023, obtained from Kaggle. The dataset is described there as follows:
“This dataset contains a comprehensive list of the most famous songs of 2023 as listed on Spotify. The dataset offers a wealth of features beyond what is typically available in similar datasets. It provides insights into each song’s attributes, popularity, and presence on various music platforms. The dataset includes information such as track name, artist(s) name, release date, Spotify playlists and charts, streaming statistics, Apple Music presence, Deezer presence, Shazam charts, and various audio features.”
The dataset comprises the 953 most-streamed songs on Spotify in 2023 and is provided as a single .csv file for analysis.
The analysis will be conducted in R due to its accessibility, ability to handle large datasets, and robust tools for creating data visualizations to effectively share results with stakeholders.
The following packages will be used for our analysis:
# Install the required packages in a single call (only needed once)
install.packages(c("tidyverse", "dplyr", "janitor", "lubridate", "ggpubr",
                   "here", "skimr", "ggrepel", "ggplot2", "fmsb",
                   "tm", "wordcloud", "tidytext", "gridExtra"))
library(tidyverse)
library(dplyr)
library(janitor)
library(lubridate)
library(ggpubr)
library(here)
library(skimr)
library(ggrepel)
library(ggplot2)
library(fmsb)
library(gridExtra)
library(tm)
library(wordcloud)
library(tidytext)
First, the dataset is imported for use in our analysis, and the necessary evaluations are made.
# Load data from a .csv file using read.csv()
spotify_data <-
read.csv("~/Data Analysis/Projects/spotifyAnalysis/spotify2023.csv")
# Check the imported data frame
View(spotify_data)
The data frame created above is previewed to identify its key details: the available columns, their types, and a sample of the values they hold.
# View first few rows and structure.
head(spotify_data)
str(spotify_data)
Before proceeding, it is essential to ensure that the data frame is thoroughly cleaned and properly formatted. This step is vital for accurate analysis and reliable results.
Data cleaning is conducted to handle any duplicate entries or NA values present in the dataset.
# Remove duplicate rows and rows with NA values from "spotify_data" data frame
spotify_data_cleaned <- spotify_data %>%
distinct() %>%
drop_na()
A thorough check is performed to ensure that there are no remaining duplicates or missing values in the dataset.
# Check for remaining duplicates and missing values (both should return 0)
sum(duplicated(spotify_data_cleaned))
sum(is.na(spotify_data_cleaned))
Column names are standardized to ensure consistency throughout the analysis. The clean_names() function from the janitor package is used for this purpose, as it automatically transforms column names into a consistent and clean format. This includes converting them to lowercase, removing special characters, and replacing spaces with underscores, improving readability and compatibility during analysis.
# Clean and check the column names
spotify_data_cleaned <- clean_names(spotify_data_cleaned)
colnames(spotify_data_cleaned)
Upon examining the dataset, it was observed that the date information is stored across three separate columns: released_year, released_month, and released_day. To streamline the analysis, these columns need to be merged into a single date column. This transformation simplifies the analysis process, enhances compatibility with data analysis libraries, and facilitates easier and more effective data visualization.
# Create a single date column, then remove year, month, day columns
spoty <- spotify_data_cleaned %>%
  # Build a zero-padded "YYYY-MM-DD" string and parse it as a Date
  mutate(date = as.Date(sprintf("%04d-%02d-%02d",
                                released_year,
                                released_month,
                                released_day))) %>%
  select(-released_year,
         -released_month,
         -released_day)
The released_year, released_month, and released_day columns were combined into a single string using the sprintf() function. This string was then converted into a proper Date object using the as.Date() function to standardize the date format for further analysis.
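As a minimal sketch of this conversion, the two steps can be tried on a few made-up values. The three vectors below are illustrative assumptions, not rows from the dataset.

```r
# Illustrative values only; not taken from the dataset
released_year  <- c(2023L, 2019L)
released_month <- c(7L, 12L)
released_day   <- c(14L, 3L)

# Step 1: build zero-padded "YYYY-MM-DD" strings
date_string <- sprintf("%04d-%02d-%02d",
                       released_year, released_month, released_day)

# Step 2: parse the strings into Date objects
release_date <- as.Date(date_string)

date_string          # "2023-07-14" "2019-12-03"
class(release_date)  # "Date"
```

Since lubridate is already loaded, make_date(released_year, released_month, released_day) would be an alternative that skips the intermediate string.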
For the analysis to work properly, some character columns must be converted to numeric values so that the data is processed correctly and without errors. Since each column has a different problem, each one is handled separately.
# Clean and convert 'streams' column to numeric values.
# Keep only the rows whose 'streams' value consists entirely of digits.
spoty <- spoty[grepl("^[0-9]+$", spoty$streams), ]
spoty$streams <- gsub("[^0-9]", "", spoty$streams)
spoty$streams[spoty$streams == "" | is.na(spoty$streams)] <- 0
# Converted to numeric rather than integer, because stream counts
# can exceed the 32-bit integer limit.
spoty$streams <- as.numeric(spoty$streams)
# Clean and convert 'in_shazam_charts' column to integers.
spoty$in_shazam_charts <- gsub("[^0-9]", "", spoty$in_shazam_charts)
spoty$in_shazam_charts[spoty$in_shazam_charts == "" |
is.na(spoty$in_shazam_charts)] <- 0
spoty$in_shazam_charts <- as.integer(spoty$in_shazam_charts)
# Clean and convert 'in_deezer_playlists' column to integers.
spoty$in_deezer_playlists <- gsub("[^0-9]", "", spoty$in_deezer_playlists)
spoty$in_deezer_playlists[spoty$in_deezer_playlists == "" |
is.na(spoty$in_deezer_playlists)] <- 0
spoty$in_deezer_playlists <- as.integer(spoty$in_deezer_playlists)
# Check the final result
str(spoty)
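The cleaning pattern used above, and the reason streams is stored as numeric rather than integer, can be illustrated with a short sketch. The input strings are made up for illustration; the thousands separator and the blank entry are assumptions about what the raw columns may contain.

```r
# Illustrative raw values (assumed shapes: thousands separator, blank, NA)
raw_counts <- c("1,021", "953", "", NA)

# Strip every non-digit character, then treat blanks and NA as zero
digits <- gsub("[^0-9]", "", raw_counts)
digits[digits == "" | is.na(digits)] <- "0"
counts <- as.integer(digits)
counts  # 1021 953 0 0

# R integers are 32-bit: the largest value is .Machine$integer.max (2147483647)
big_stream_count <- "3000000000"
suppressWarnings(as.integer(big_stream_count))  # NA, value is out of range
as.numeric(big_stream_count)                    # 3e+09
```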
As a result of the data examination and analysis, the following recommendations have been developed:
When the 2023 data was examined, the 10 most streamed songs on Spotify were found to be similar to those on Deezer, while only a slight similarity was observed with Apple Music. A closer resemblance can be established between Spotify and Shazam: 6 of the 10 songs appear in both platforms’ charts, while the remaining 4 are not present in Shazam’s chart. Based on these findings, it can be inferred that Spotify and Deezer users may behave similarly, given their preference for similar songs. By closely examining Deezer’s operations, various benchmarks can be created. Additionally, exploring the countries where both applications are most widely used and analyzing the gender distribution of their users may reveal further similarities. This understanding can be used to better comprehend users’ preferences and expectations.
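One way such a cross-platform comparison could be checked is sketched below. The rows are made up, and the sketch assumes that a value of 0 in in_shazam_charts means the track is absent from that chart; it is not necessarily how the comparison in Section 4.1 was carried out.

```r
# Made-up rows for illustration; column names follow the cleaned data frame
toy <- data.frame(
  track_name       = c("A", "B", "C", "D"),
  streams          = c(900, 1500, 300, 1200),
  in_shazam_charts = c(0L, 12L, 5L, 0L),
  stringsAsFactors = FALSE
)

# Take the n most-streamed tracks (n = 10 in the analysis, 3 here)
n <- 3
top_tracks <- head(toy[order(-toy$streams), ], n)

# Count how many of them also appear on the other platform's chart
# (assumption: 0 means "not on the chart")
shared <- sum(top_tracks$in_shazam_charts > 0)
shared
```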
When the songs with high danceability were examined, it was observed that they do not necessarily have high tempos. In other words, no clear and strong relationship was found between BPM and danceability. This suggests that users evaluate elements other than tempo when assessing a song’s danceability. Interestingly, no songs with a danceability score of 85 or above were found to have a BPM above 150, and songs with a danceability score between 80 and 100 were observed to have a BPM in the range of 100 to 150. This indicates that users still expect songs to have a certain tempo to be considered danceable.
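A relationship of this kind is commonly quantified with a correlation coefficient. The following is a minimal sketch on made-up values; the column names bpm and danceability are assumptions about the cleaned data frame, and the numbers are not results from the dataset.

```r
# Made-up values for illustration; column names are assumed
toy <- data.frame(
  bpm          = c(90, 105, 120, 128, 140, 170),
  danceability = c(70, 85, 78, 90, 60, 55)
)

# Pearson correlation coefficient (between -1 and 1; values near 0 mean
# no clear linear relationship)
r <- cor(toy$bpm, toy$danceability, method = "pearson")

# Significance test for the same coefficient
test <- cor.test(toy$bpm, toy$danceability)
test$p.value

# Check a range claim: is any highly danceable song faster than 150 BPM?
any_fast_danceable <- any(toy$danceability >= 85 & toy$bpm > 150)
any_fast_danceable
```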
In Section 4.3, where the danceability and BPM values were analyzed in greater detail, two important findings were revealed. First, the top and narrowest ring of the graph can be regarded as the area where the optimal ratio is achieved. This region also showed a high density of songs. Second, attention should be paid to the second ring surrounding this area, where songs clustered around 100 and 120 BPM, with danceability values ranging between 70 and 80. By focusing on these two regions, the foundation of a highly danceable song can be identified.
When the songs were classified into levels according to their energy and danceability scores, it was found that their distribution is surprisingly similar. This suggests that users evaluate songs in terms of both energy and danceability together.
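Classifying scores into levels can be sketched with cut(). The cut points (50 and 75) and the level names below are assumptions chosen for illustration, not the thresholds used in Section 4.4, and the scores are made up.

```r
# Made-up scores on a 0-100 scale
scores <- data.frame(
  energy       = c(35, 62, 88, 71),
  danceability = c(40, 66, 90, 58)
)

# Hypothetical thresholds: [0, 50] Low, (50, 75] Medium, (75, 100] High
breaks <- c(0, 50, 75, 100)
labels <- c("Low", "Medium", "High")

scores$energy_level <- cut(scores$energy, breaks = breaks,
                           labels = labels, include.lowest = TRUE)
scores$danceability_level <- cut(scores$danceability, breaks = breaks,
                                 labels = labels, include.lowest = TRUE)

# Cross-tabulate the two classifications to compare their distributions
level_table <- table(scores$energy_level, scores$danceability_level)
level_table
```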
The word cloud created in Section 4.5 revealed unexpected insights. Several words that were not anticipated appeared in the cloud, which was generated by analyzing the track_name column containing only song names. The results can be divided into two groups: the first contains “Christmas,” “Spiderman,” and “Spiderverse,” and the second contains “BTS,” “Metro Boomin,” and “BZRP.” In addition to confirming the continued popularity of Christmas-themed songs, the first group shows that collaborations with successful movies have a significant positive impact on songs’ stream counts. Even though common words such as “version,” “feat,” and “vol” were excluded, the names of the collaborators involved in these partnerships remain present in the track_name column. This explains why popular names like BTS and Metro Boomin were prominently featured in the word cloud, and it further highlights their importance.
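The word counts behind such a cloud can be sketched in base R as follows. The titles are made up for illustration, and only the excluded words “version,” “feat,” and “vol” come from the analysis; the tokenization rule is an assumption and may differ from the one used in Section 4.5.

```r
# Made-up titles for illustration
track_name <- c("Last Christmas",
                "Christmas Lights (feat. Someone)",
                "Sunflower - Spider-Man Version")

# Words excluded from the counts
excluded <- c("version", "feat", "vol")

# Lowercase, then split on anything that is not an ASCII letter or digit
# (note: this simple rule also splits on accented characters)
words <- unlist(strsplit(tolower(track_name), "[^a-z0-9]+"))
words <- words[nzchar(words) & !(words %in% excluded)]

# Frequency table, most frequent word first
word_freq <- sort(table(words), decreasing = TRUE)
head(word_freq, 3)
```

The resulting names and counts could then be passed to wordcloud() from the wordcloud package, for example as wordcloud(names(word_freq), as.numeric(word_freq), min.freq = 1).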