Can We Predict a Song’s Genre from its Lyrics? - Part 2
Implementing kNN in Python
Rivalries are as old as time. Leah vs Rachel, Monica vs Chandler, and now, Naive Bayes vs k nearest neighbors. A real statistician would go through the pros and cons of each, but alas, I am not a real statistician. So instead, I write a witty introduction and move on!
A quick note: the full code can be viewed here, and the previous blog post is here.
Overview
k Nearest Neighbors is a deceptively simple algorithm. The idea is to plot each song in your library on a map. For simplicity, let's assume it's a two-dimensional map with the x-axis being the number of words and the y-axis being the average word length. In this case, we would simply compute an (x,y) coordinate for each song, then plot all the songs on the map. The second step is to take a new song, not in our corpus, and plot it on the same map. Finally, we find the k closest neighbors and take the mode (most frequently occurring) genre among them. If \(k=1\), we simply assume the new song's genre is the same as that of its single closest neighbor.
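Here's a minimal sketch of that two-dimensional version, just to make the idea concrete (the toy songs, features, and genres below are made up for illustration):

from collections import Counter
import numpy as np

def knn_predict_2d(train_points, train_genres, new_point, k=3):
    # Euclidean distance from the new song to every song in the library
    distances = np.linalg.norm(train_points - new_point, axis=1)
    # Indices of the k closest songs
    nearest = distances.argsort()[:k]
    # Most common genre among those neighbors
    return Counter(train_genres[i] for i in nearest).most_common(1)[0][0]

# Toy library: (number of words, average word length) per song
library = np.array([[120, 4.1], [300, 3.8], [90, 5.2], [250, 3.9]])
genres = ["country", "rap", "country", "rap"]
print(knn_predict_2d(library, genres, np.array([270, 3.7]), k=3))  # -> 'rap'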
The Complication
Well, that's all well and good in two dimensions, but it's pretty basic. We want to leverage all the words at once, not just two features of the song. A song with 100 different words would yield 100 different dimensions! And it would just be a bunch of 0s and 1s. I can barely think in 3 dimensions, let alone 100!
The Solution
Cosine mother-f-ing similarity (if you thought this blog was SFW, you forgot how desperate I am for attention). Cosine similarity measures the similarity between two non-zero vectors \(\bf{A}\) and \(\bf{B}\) by dividing their dot product by the product of their magnitudes:
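\[cos(\theta) = \frac{\bf{A} \cdot \bf{B}}{||\bf{A}|| \, ||\bf{B}||}\]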
The results range from -1, meaning exact opposite, to 1, meaning exactly the same. Let’s show a brief example. Consider the following sentences:
A. "This sentence rules yeah"
B. "This sentence rocks"
C. "This phrase bad"
Step 1 is to convert these into feature vectors:
sent_a = {"this": 1, "sentence": 1, "rules": 1, "yeah": 1}
sent_b = {"this": 1, "sentence": 1, "rocks": 1}
sent_c = {"this": 1, "phrase": 1, "bad": 1}
Step 2 is to perform the analysis! Starting with sentences A and B:
\[cos(\theta) = \frac{\bf{A} \cdot \bf{B}}{||\bf{A}|| \, ||\bf{B}||}\]
\[cos(\theta) = \frac{1 \cdot 1 + 1 \cdot 1}{\sqrt{1^2+1^2+1^2+1^2} \, \sqrt{1^2+1^2+1^2}}\]
\[cos(\theta) = \frac{2}{\sqrt{4} \, \sqrt{3}}\]
\[cos(\theta) = \frac{2}{2\sqrt{3}}\]
\[cos(\theta) \approx 0.577\]

Not bad! Let's try comparing sentences A and C.
\[cos(\theta) = \frac{\bf{A} \cdot \bf{C}}{||\bf{A}|| \, ||\bf{C}||}\]
\[cos(\theta) = \frac{1 \cdot 1}{\sqrt{1^2+1^2+1^2+1^2} \, \sqrt{1^2+1^2+1^2}}\]
\[cos(\theta) = \frac{1}{\sqrt{4} \, \sqrt{3}}\]
\[cos(\theta) = \frac{1}{2\sqrt{3}}\]
\[cos(\theta) \approx 0.289\]

Less similar, but that's to be expected. Basically, the more words two vectors share, the bigger the numerator gets, which increases the similarity. Now that we have a similarity measure, the rest is easy! All we do is take our new song, compute the cosine similarity between it and every song in our corpus, sort in descending order, grab the top \(k\), and take the mode of those.
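If you'd rather let the computer do the arithmetic, here's a minimal sketch that reproduces those two numbers with scikit-learn (the CountVectorizer and cosine_similarity calls are just for this illustration; the real pipeline below uses tf-idf instead of raw counts):

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics.pairwise import cosine_similarity

sentences = ["This sentence rules yeah", "This sentence rocks", "This phrase bad"]

# Binary bag-of-words vectors over the shared vocabulary
vectors = CountVectorizer(binary=True).fit_transform(sentences)

sims = cosine_similarity(vectors)
print(round(sims[0, 1], 3))  # A vs B -> 0.577
print(round(sims[0, 2], 3))  # A vs C -> 0.289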
Training the Corpus
The constructor and auxiliary functions (stemming, tokenizing) are the same as before, so let's skip those and walk through the training and prediction functions in detail. Starting with the training function:
from sklearn.feature_extraction.text import TfidfVectorizer  # at the top of the module

def train_corpus(self):
    # Learn the vocabulary from the training lyrics and turn each song into a tf-idf vector
    vectorizer = TfidfVectorizer(tokenizer=self.tokenize, stop_words='english')
    self.X_train = vectorizer.fit_transform(self.X_train_data)
I know all of you fans at home are screaming at your computer screens right now. You're probably saying things like: "I stayed up all night waiting for this new blog post, only for you to cop out and use a pre-programmed vectorizer?!" Loyal fans, it's true. That first line, the TfidfVectorizer, takes all of our sentences and turns them into tf-idf feature vectors for us. tf-idf is a very important concept in text mining that essentially weighs words by how important they are within a document relative to the whole corpus. I don't have time to go through the whole concept, so I copped out with a scikit-learn function. The next line goes through our training data and transforms it into a beautiful corpus. I'm moving on, loyal readers; I hope you can too.
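To give a flavor of what "weighing words by importance" means in practice, here's a tiny sketch with made-up lyrics: a word that shows up in every song gets a lower weight than one unique to a single song.

from sklearn.feature_extraction.text import TfidfVectorizer

lyrics = [
    "baby baby tonight",
    "baby dance tonight",
    "baby truck tonight",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(lyrics)

# 'baby' and 'tonight' appear in every song, so they get a lower idf weight
# than song-specific words like 'dance' or 'truck'
for word, idf in zip(vectorizer.get_feature_names_out(), vectorizer.idf_):
    print(word, round(idf, 3))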
Testing the Corpus
from collections import Counter                       # at the top of the module
from sklearn.metrics.pairwise import linear_kernel

def predict_song(self, song):
    # tf-idf vectors are L2-normalized, so the linear kernel (dot product) is the cosine similarity
    cosine_similarities = linear_kernel(self.X_train, song).flatten()
    # argsort is ascending, so walk backwards to grab the k most similar songs
    related_docs_indices = cosine_similarities.argsort()[:-self.k - 1:-1]
    # Map those row indices back to their genre labels and take the mode
    words_to_count = self.Y_train_data[related_docs_indices]
    c = Counter(words_to_count)
    return c.most_common(1)[0][0]
The predict function takes in a song (that has already been converted into a feature vector) and outputs a category. The first line of the function computes the cosine similarity between the new song and every song in our training corpus. We then sort the similarities and take the top \(k\) results. Now, the tricky part is that the cosine similarities are just numbers, while our categories are stored in the accompanying Y_train_data. Since X_train and Y_train_data line up row for row, we take the indices of the top \(k\) matches and pull the corresponding categories out of Y_train_data. Finally, we use a Counter to find the mode and return the first result!
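To see how the pieces hang together end to end, here's a self-contained toy run of the same pipeline outside the class (the lyrics and genres are invented; your real corpus and class wiring replace them):

import numpy as np
from collections import Counter
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import linear_kernel

# Toy corpus standing in for the real lyrics and genre labels
train_lyrics = [
    "love love love in the summer rain",
    "money cars and gold chains on my neck",
    "dust on the road and a broken down truck",
    "big trucks and cold beer on a friday night",
]
train_genres = np.array(["pop", "rap", "country", "country"])
k = 3

# Train: fit the vocabulary and build the tf-idf matrix
vectorizer = TfidfVectorizer(stop_words='english')
X_train = vectorizer.fit_transform(train_lyrics)

# Predict: vectorize the new song with the same fitted vectorizer,
# grab the k most similar training songs, and take the mode of their genres
new_song = vectorizer.transform(["cold beer and an old truck on a dirt road"])
similarities = linear_kernel(X_train, new_song).flatten()
top_k = similarities.argsort()[:-k - 1:-1]
print(Counter(train_genres[top_k]).most_common(1)[0][0])  # -> 'country'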
Conclusion + Next Steps
I hope this code helps you design your own kNN. The next steps here would be calibrating an appropriate \(k\) and perhaps exploring other similarity measures. There's also the fun of feature selection and messing around with your data a little more.
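For calibrating \(k\), one quick-and-dirty option is to lean on scikit-learn's built-in kNN with a cosine metric and cross-validate over candidate values. This is only a sketch with toy data, not the custom class above:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

# Toy stand-in corpus; swap in the real lyrics and genre labels
lyrics = [
    "love in the summer rain", "dancing all night with you",
    "money cars and gold chains", "stacks of cash and diamond rings",
    "dust on a country road", "cold beer and an old truck",
    "heartbreak on the radio", "spinning records till the dawn",
]
genres = ["pop", "pop", "rap", "rap", "country", "country", "pop", "pop"]

# Cross-validate the number of neighbors over the whole tf-idf + kNN pipeline
pipeline = make_pipeline(
    TfidfVectorizer(stop_words='english'),
    KNeighborsClassifier(metric='cosine'),
)
param_grid = {'kneighborsclassifier__n_neighbors': [1, 3]}
search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(lyrics, genres)
print(search.best_params_)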
Notes:
- Title Inspiration - link