Can We Predict a Song’s Genre from its Lyrics? - Part 1
Implementing a Simple Naive Bayes in Python
Have you ever wondered if different music genres use certain words more often than others? Do country songs talk about dirt roads and beer more often than rock songs? Do rap songs swear more often than jazz? If you're like Alex and me, these questions keep you up at night. We built a corpus and did the analysis so you don't have to. Along the way, we learned more about ourselves than we originally intended (and certainly less about lyrics; the project wasn't that successful).
Our approach was to build a corpus of songs with genres and lyrics, then implement Naive Bayes and k-nearest neighbors for prediction. This blog post details our Naive Bayes implementation and the accompanying code. At this point, we had already built the corpus and saved it in a database, so the next step was to train the model and predict new songs.
A quick note: the full code can be viewed here.
The constructor
Nothing too exciting here in the constructor. In this case, X_train_data is a list of song lyrics, with each item being one song. Y_train_data is an accompanying list of genres that correspond to each song. The data has already been divided into train and test buckets at this point. You can see that our four categories are Easy Listening, Rap, Country, and Post-1980s Rock. The word_counts dict has a dict for each genre, wherein the inner dict keeps track of each word used within that genre. The priors dict simply measures what portion of our training data falls in each category.
def __init__(self, X_train_data, Y_train_data, X_test_data, Y_test_data):
    self.X_train_data = X_train_data
    self.Y_train_data = Y_train_data
    self.X_test_data = X_test_data
    self.Y_test_data = Y_test_data
    # tokenize() below expects a stemmer on the instance; a Porter stemmer is assumed here
    self.stemmer = nltk.stem.porter.PorterStemmer()
    self.vocab = {}
    self.word_counts = {
        "Easy": {},
        "Rap": {},
        "Country": {},
        "Post-1980s Rock": {}
    }
    self.priors = {
        "Easy": 0.,
        "Rap": 0.,
        "Country": 0.,
        "Post-1980s Rock": 0.
    }
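To make the inputs concrete, here's a tiny instantiation sketch. The class name SongGenreNB and the lyrics are just placeholders for this post; the real data came out of our database.

# a minimal sketch; SongGenreNB and the lyrics below are made-up placeholders
X_train = ["dirt road and a cold beer", "spit a verse on the mic"]
Y_train = ["Country", "Rap"]               # one genre label per song, same order
X_test = ["down by the river in my truck"]
Y_test = ["Country"]
nb = SongGenreNB(X_train, Y_train, X_test, Y_test)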
Auxiliary functions - tokenize, stem and count
To start off, we needed to define a few auxiliary functions to help us out with tokenizing, stemming, and counting. To speed this up, we imported the nltk library, which handles most of these basic tasks for us. nltk is a very powerful library and something I've discussed before, so check it out. nltk even has a built-in Naive Bayes classifier, but there's nothing like building it yourself.
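(If you'd rather not build it yourself, here's roughly what the built-in version looks like; the feature dicts below are made up, and in practice you'd derive them from the lyrics.)

import nltk

# toy featuresets: {word: True} dicts paired with a genre label
train_feats = [({"truck": True, "beer": True}, "Country"),
               ({"mic": True, "flow": True}, "Rap")]
clf = nltk.NaiveBayesClassifier.train(train_feats)
print(clf.classify({"truck": True}))   # most likely 'Country' on this toy data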
The first function is tokenize, which strips a sentence down into an array of individual words. stem_tokens stems a list of words, reducing each word to its stem (turning "swimming" into "swim", for example) so that like words are combined and our word list stays tight. Finally, count_words builds a dict recording the count of each word it's given. Normally we would also remove stopwords, but that has already been done for us.
def tokenize(self, text):
    tokens = nltk.word_tokenize(text)
    stems = self.stem_tokens(tokens, self.stemmer)
    return stems

def stem_tokens(self, tokens, stemmer):
    stemmed = []
    for item in tokens:
        stemmed.append(stemmer.stem(item.decode("utf-8", "replace")))
    return stemmed

def count_words(self, words):
    wc = {}
    for word in words:
        wc[word] = wc.get(word, 0.0) + 1.0
    return wc
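To see what these helpers actually produce, here's a standalone sketch of the same pipeline using nltk directly (it assumes the punkt tokenizer data is downloaded, and the lyric is made up):

import nltk
from nltk.stem.porter import PorterStemmer

stemmer = PorterStemmer()
tokens = nltk.word_tokenize("swimming in them dirt roads")   # made-up lyric
stems = [stemmer.stem(t) for t in tokens]                    # roughly ['swim', 'in', 'them', 'dirt', 'road']
wc = {}
for w in stems:                                              # same counting idea as count_words()
    wc[w] = wc.get(w, 0.0) + 1.0
print(wc)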
Training the corpus
The next step is to “train” our model. We’re really looking to accomplish two things here:
- Take a tally of all the songs in our database, looking at the distribution of genres across the songs. Is country 20% of the songs, or 40%? This is also known as looking at the prior.
- Take a tally of all the words used, and how often per genre. We basically want to know how often the word “truck” was used in general, and then how often it was used across Rap, Easy Listening, Country and Rock.
def train_corpus(self):
    for i, text in enumerate(self.X_train_data):
        category = self.Y_train_data[i]
        self.priors[category] += 1
        words = self.tokenize(text)
        counts = self.count_words(words)
        for word, count in counts.items():
            # if we haven't seen a word yet, add it to our dictionaries with a count of 0
            if word not in self.vocab:
                self.vocab[word] = 0.0
            if word not in self.word_counts[category]:
                self.word_counts[category][word] = 0.0
            self.vocab[word] += count
            self.word_counts[category][word] += count
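Continuing the placeholder sketch from earlier (nb is the hypothetical instance built above), training just fills in those tallies:

nb.train_corpus()
# with the toy data above, this leaves roughly:
#   nb.priors      -> {'Easy': 0.0, 'Rap': 1.0, 'Country': 1.0, 'Post-1980s Rock': 0.0}
#   nb.vocab       -> {'dirt': 1.0, 'beer': 1.0, 'mic': 1.0, ...}   (corpus-wide word counts)
#   nb.word_counts -> {'Country': {'dirt': 1.0, ...}, 'Rap': {'mic': 1.0, ...}, ...}
print(nb.priors["Country"] / sum(nb.priors.values()))   # the Country prior: 0.5 here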
Predicting a new song
The beast of the code is in the prediction phase, but let's take it step by step.
First, we tokenize and count the words from our new song. Then we calculate the prior for each genre, which again is just the percentage breakdown of the genres: we divide the number of songs in each category by the total number of songs.
def predict_song(self, song):
    words = self.tokenize(song)
    counts = self.count_words(words)
    prior_rock = (self.priors["Post-1980s Rock"] / sum(self.priors.values()))
    prior_rap = (self.priors["Rap"] / sum(self.priors.values()))
    prior_country = (self.priors["Country"] / sum(self.priors.values()))
    prior_easy = (self.priors["Easy"] / sum(self.priors.values()))
Second, we calculate the log-probability for our new song by looking up each of its words in our model. Why log probabilities? We're dealing with a very large corpus with pretty wide-reaching word use (despite the fact that most Top 40 songs may sound the same). The raw probabilities get very small very fast, so it's better to log everything and work with negative numbers.
This log-probability is made up of three elements:
- Evidence - What is the probability that this word appears in our corpus?
- Likelihood - What is the probability that this word appears in each genre?
- Count - How many times does this word appear in this song?
For each word, we multiply how often it appears in the new song by the probability of the word appearing in the genre, then divide by the overall probability of the word appearing in the corpus. If a word shows up often in the new song and is much more common in rap lyrics than in the corpus as a whole, the rap log-probability gets a big boost. That's the basic premise of Naive Bayes.
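In symbols, this is the quantity the loop below accumulates for each genre $g$ (with the prior added at the end); it's just a restatement of the code that follows, not extra theory:

$$\mathrm{score}(g) \;=\; \log P(g) \;+\; \sum_{w \in \text{song}} \log\!\left(\frac{c_w \, P(w \mid g)}{P(w)}\right)$$

where $c_w$ is the number of times word $w$ appears in the new song.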
    log_prob_rock = 0.0
    log_prob_country = 0.0
    log_prob_rap = 0.0
    log_prob_easy = 0.0
    for w, cnt in counts.items():
        # skip words that we haven't seen before, as well as words of 3 letters or fewer
        if w not in self.vocab or len(w) <= 3:
            continue
        # calculate the probability that the word occurs at all
        p_word = self.vocab[w] / sum(self.vocab.values())
        # for all categories, calculate P(word|category), or the probability a
        # word will appear, given that we know that the document is <category>
        p_w_given_rock = self.word_counts["Post-1980s Rock"].get(w, 0.0) / sum(self.word_counts["Post-1980s Rock"].values())
        p_w_given_rap = self.word_counts["Rap"].get(w, 0.0) / sum(self.word_counts["Rap"].values())
        p_w_given_easy = self.word_counts["Easy"].get(w, 0.0) / sum(self.word_counts["Easy"].values())
        p_w_given_country = self.word_counts["Country"].get(w, 0.0) / sum(self.word_counts["Country"].values())
        # add each word's contribution to the running total: log_prob_<category>.
        # if the probability is 0 (i.e. the word never appears for the category), skip it
        if p_w_given_rock > 0:
            log_prob_rock += math.log(cnt * p_w_given_rock / p_word)
        if p_w_given_rap > 0:
            log_prob_rap += math.log(cnt * p_w_given_rap / p_word)
        if p_w_given_easy > 0:
            log_prob_easy += math.log(cnt * p_w_given_easy / p_word)
        if p_w_given_country > 0:
            log_prob_country += math.log(cnt * p_w_given_country / p_word)
Once we finish looping through the words of our new song, we need to add in the prior of that genre occurring (addition here because we're on a log scale). The winner of our Naive Bayes battle is whoever has the highest score!
    rock_score = (log_prob_rock + math.log(prior_rock))
    country_score = (log_prob_country + math.log(prior_country))
    rap_score = (log_prob_rap + math.log(prior_rap))
    easy_score = (log_prob_easy + math.log(prior_easy))
    winner = max(rock_score, country_score, rap_score, easy_score)
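Note that winner as written holds the best score, not a genre name. One small, hypothetical way to finish the function and hand back the label would be:

    # hypothetical addition (not in the original code): map the best score back to its genre
    scores = {
        "Post-1980s Rock": rock_score,
        "Country": country_score,
        "Rap": rap_score,
        "Easy": easy_score
    }
    return max(scores, key=scores.get)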
I hope this code helps you! Look forward to our k-nearest neighbors implementation soon!
Notes:
- Some code inspiration from here, courtesy of Yhat