Naïve Bayes Classifier 101: Your Guide to Understanding a Simple ML Classifier
Who you calling naive…
In my last post, I used a Naïve Bayes Classifier from NLTK to classify tweets as positive or negative, but I didn’t really talk about where the classifier comes from. I wasn’t terribly satisfied with the “what is a naïve bayes classifier” Google results, so I’ll pitch in my two cents here.
Bayesian Probability
The Bayes classifier is a direct byproduct of Bayesian probability. Bayesian probability stems directly from Bayes’ theorem. Bayes’ theorem is that thing you learned in college but probably forgot an hour later. At a high level, a Bayes classifier utilizes the independent probability of each “feature” to classify our data. For our Twitter example, our classifier looks at each individual word from our training set and determines the independent probability of that word occurring in a negative or positive tweet.
Rather than write 1,000 words on Bayesian vs. Frequentist approaches or fill the page with examples of Bayes’ theorem, I’d like to keep it high level. We basically need three elements for Bayesian inference: a prior, a likelihood, and evidence. Here’s our formula:
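Written in terms of our tweet example, with “positive” as the class and “words” as the observed tweet, Bayes’ theorem reads:

$$P(\text{positive} \mid \text{words}) = \frac{P(\text{words} \mid \text{positive}) \times P(\text{positive})}{P(\text{words})} = \frac{\text{likelihood} \times \text{prior}}{\text{evidence}}$$

The left-hand side, P(positive | words), is the posterior: the probability a tweet is positive, given the words it contains.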
The Prior
The prior is our probability of X before we know anything. Quite literally, “prior” to meeting this person or addressing this problem, what’s the probability of X? For our tweets, the prior is the general probability that a tweet about my movie will be positive. To estimate the prior, I look to Rotten Tomatoes and find that my movie scored 74% positive (not bad), so I assume a 74% prior: 74% of my reviews should be positive before I know anything else.
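In the notation of the formula above, that assumption is simply:

$$P(\text{positive}) = 0.74$$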
The Likelihood
This percentage describes how often the word appears in positive tweets. For example, given that a tweet is positive, what percentage of the time will that tweet contain the word “great”? In our “bag of words” model, we’ll compute this for each word and multiply them all together along with our prior (the probability a tweet is positive). As an example, let’s assume that “great” appears in 45% of positive tweets.
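In symbols, that gives us the likelihood for this one-word example:

$$P(\text{great} \mid \text{positive}) = 0.45$$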
The Evidence
This is the denominator of our equation and is the probability the word “great” shows up in any tweet, positive or negative. When dealing with multiple words, we’ll again multiply these independent word probabilities together. As an example, let’s say the word “great” shows up in 60% of total tweets.
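In symbols, that’s our evidence:

$$P(\text{great}) = 0.60$$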
After performing the calculations, we are left with a posterior: the probability that a tweet is positive, given the words we provided. To keep things simple, we assume that this tweet is only one word, “great.”
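Plugging our example numbers into Bayes’ theorem gives the posterior:

$$P(\text{positive} \mid \text{great}) = \frac{0.45 \times 0.74}{0.60} \approx 0.55$$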
At 55%, our classifier says this tweet makes the threshold and can be classified as positive.* In a real situation, we’d add many more words.
So What?
Why should you care? Well, the Naïve Bayes classifier is super popular because it’s grounded in intuition and simplicity: we multiply a bunch of percentages together and pick the highest number. If you can break a problem down into words and associate each of those words with a class (positive/negative tweet, spam/not spam message, angry/happy customer), you can quickly work up a model to automate this classification process.
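To make that intuition concrete, here’s a minimal sketch of the idea in Python. All of the probabilities below are made-up illustrations, not values estimated from a real training set.

```python
# A minimal sketch of "multiply a bunch of percentages together and pick
# the highest number." All probabilities are made-up illustrative values.

priors = {"positive": 0.74, "negative": 0.26}

# P(word | class), as if estimated from a labeled corpus of tweets
likelihoods = {
    "positive": {"great": 0.45, "terrible": 0.02},
    "negative": {"great": 0.10, "terrible": 0.30},
}

def classify(words):
    scores = {}
    for label, prior in priors.items():
        score = prior
        for word in words:
            # Unseen words get a small default so the product never hits zero
            score *= likelihoods[label].get(word, 0.01)
        scores[label] = score
    # The class with the highest (unnormalized) posterior wins
    return max(scores, key=scores.get)

print(classify(["great"]))     # -> positive
print(classify(["terrible"]))  # -> negative
```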
I hope by the end of this post, the path to utilizing NLP and ML in your work is becoming clearer. By manually classifying tweets as positive or negative, we create a corpus for our Naïve Bayes classifier. Our classifier calculates how often each word appears in positive tweets and in negative tweets and stores that away when we do our “training.” When we classify test tweets, all the classifier needs to do is look up the probabilities for each word, multiply them together and spit out an answer.
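For concreteness, here’s roughly how that training and classification step might look with NLTK’s NaiveBayesClassifier. The hand-labeled tweets and the bag-of-words feature function are made up for illustration.

```python
import nltk

# A tiny, hand-labeled "corpus" of made-up tweets
train_tweets = [
    ("what a great movie", "positive"),
    ("great acting, loved every minute", "positive"),
    ("terrible plot, total waste of time", "negative"),
    ("boring and way too long", "negative"),
]

def word_features(text):
    # Simple bag-of-words features: every word that appears maps to True
    return {word: True for word in text.lower().split()}

# "Training" stores away how often each feature shows up per label
train_set = [(word_features(text), label) for text, label in train_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classifying a test tweet just looks up and combines those probabilities
print(classifier.classify(word_features("that movie was great")))
```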
What’s Next?
“Bayesian inference, classification, machine learning, it all seems so easy!” you say. “Why spend more than 10 minutes on this?” Well, you and I both know this is a simplification. But even within my simplification, there’s one critical ML aspect I glossed over (that I shouldn’t have): feature selection. So far, we have fed our classifier everything, as the “bag of words” model dictates. But is that always effective? Of course not. In computer science, there’s a saying that goes “garbage in, garbage out.” If you put garbage in your model, your model will spit out garbage. Instead of feeding our classifier every word, what if we only fed it words longer than 3 characters? Or only the top 1,000 words from our training set? Or only 500 pre-selected words that seem informative? What about collocations?
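As a rough sketch of what one of those feature-selection choices might look like in code, here’s a version that keeps only longer words drawn from a “top N” vocabulary. The length cutoff, vocabulary size, and toy training texts are arbitrary illustrations, not recommendations.

```python
from collections import Counter

# The same made-up training texts, used here just to build a vocabulary
train_texts = [
    "what a great movie",
    "great acting, loved every minute",
    "terrible plot, total waste of time",
    "boring and way too long",
]

# "Top N words" vocabulary from the training set (N = 1000 is arbitrary)
word_counts = Counter(w for text in train_texts for w in text.lower().split())
top_words = {w for w, _ in word_counts.most_common(1000)}

def select_features(text, vocabulary=top_words, min_length=4):
    # Keep only words with at least 4 characters that also appear in the
    # pre-built vocabulary, instead of feeding the classifier every word.
    words = [w for w in text.lower().split() if len(w) >= min_length]
    return {w: True for w in words if w in vocabulary}

print(select_features("what a great movie"))
# -> {'what': True, 'great': True, 'movie': True}
```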
Personally, I don’t think the role of the data scientist is just to understand how the underlying classifier works, but also to understand how to maximize results with this classifier and what to do with those results. That includes how to gather, clean and select the best data for your model. It’s a combination of statistics (to understand how Bayesian inference and classifiers work), computer science (to code up the NLP and ML objects) and business (to choose the correct features/data and analyze business impact).
*In true Bayesian inference, our prior and posterior would be distribution functions.
Links:
I may have been unsatisfied with the Google results, but here are some good resources: