Sentiment Analysis on Tweets Using NLTK: An Introduction

A little bit of NLP by my side, a little bit of ML all night long…


Editor’s Note: This post is written for someone with basic knowledge of NLTK/Python and is a complement to existing resources. If you’re not in the NLTK-person demographic, I highly recommend you look at one of the many wonderful other resources to get started first. Here’s a goodie.

Sentiment analysis is a fancy name for a simple problem: can a machine tell whether a piece of text is happy or sad? That simple problem is called classification, and it can be applied to many different fields: finance, law, health care, and so on. Today, we’ll use it for the entertainment industry, specifically for tweets about my fake movie, “Ross’s Excellent Adventure”.

When it comes to basic sentiment analysis, we need two things:

  1. A Corpus
  2. A Feature Detector

Corpus

A corpus is a collection of already-annotated texts. Our corpus is going to be a few tweets that we’ve marked as either “positive” or “negative.” Twitter strictly prohibits redistributing their content outside of their display rules, so it’s tough (and would be illegal) to find an established corpus ready for download. Here are some resources to get you started, though. NLTK comes with a bunch of annotated corpora built in, including one for movie reviews, but I’ll ignore those so we can learn the entire process together. Here’s my corpus:

pos_tweets = [("I love this movie! You are #genius ross.", "positive"),
              ("How innovative #rossexad is. I was lol'ing the entire time. More more more!", "positive"),
              ("I feel great this morning after having watched this movie. ", "positive"),
              ("I feel alive. Love this. Really fun stuff. Hilarious. Love his humor", "positive"),
              ("I am so excited about the movie", "positive"),
              ("Love love this movie. I wish I could watch it 100 times. I really enjoyed myself", "positive"),
              ("He is my best friend", "positive")]

neg_tweets = [("Ross's Excellent adventure sucksssss. Sucky suck suck suck. Please give me my $10 back.", "negative"),
              ("FU, fu very much ross for a terrible movie. Garbage. Racist and sexist you pig.", "negative"),
              ("I wonder if I lost IQ points from watching #rossexad. Stupid hashtag too.", "negative"),
              ("If I had a gun with one bullet and had to choose between watching #rossexad and killing myself, I'd watch #rossexad. I'm not an idiot, but not a good movie. ", "negative"),
              ("Ross, not a good guy. #rossexad", "negative")]

Note how each corpus is really just a list of tuples. Is this a big corpus? No, of course not, silly reader. In reality, this is very small, very insignificant, and somewhat useless. In my humble opinion, you’d want at least 1,000 tweets in a corpus before doing any real analysis. You should also design corpora for your specific domain; “crying my eyes out” may be a good phrase for movies (crying from laughter, sadness), but not so much for health care (no example needed). If you want something bigger to practice on, see the sketch below.
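For scale, here’s a minimal sketch of pulling a larger labeled corpus out of NLTK’s built-in movie_reviews corpus. This isn’t part of our tweet workflow, just an illustration; note these are full reviews rather than tweets, and you’ll need to run nltk.download('movie_reviews') once first:

import nltk
from nltk.corpus import movie_reviews

# Each entry becomes a (text, label) tuple, the same shape as our tweet corpus.
pos_reviews = [(movie_reviews.raw(fid), "positive")
               for fid in movie_reviews.fileids('pos')]
neg_reviews = [(movie_reviews.raw(fid), "negative")
               for fid in movie_reviews.fileids('neg')]

print(len(pos_reviews), len(neg_reviews))  # 1000 of each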

Feature Detector

Right now, we have some tweets that we manually went through and marked as positive or negative. We know we’re going to feed these into our classifier, but the next step is to tell the classifier what to pay attention to. After all, the classifier is a machine; how should it know what to look for? For this, we break out our feature detector.

At a high level, a feature detector breaks down sentences and analyzes them for the important parts. For example, words shorter than three characters (is, a, I, to, me, as) probably don’t help us much in discerning the sentiment of a tweet, so let’s remove them. We don’t need the hashtag “#rossexad” either, so let’s remove that too.

def extract_word_feats(tweet):
    '''
    Extract features from a tweet. Criteria: at least 3 characters
    long and not the #rossexad hashtag.
    Returns the filtered words as a dict: {word: True, ...}
    '''
    # Tokenize, drop short tokens and the hashtag, lowercase the rest.
    return dict((e.lower(), True) for e in nltk.word_tokenize(tweet)
                if len(e) >= 3 and e.lower() != "rossexad")

This function has a lot going on. First, we tokenize the tweet with NLTK, transforming our sentence into a list of words. Then we iterate through each word and throw it out if it’s shorter than 3 characters or if it’s the hashtag. We then make each surviving word lowercase and put it into a (word, True) pair, which gets added to the dictionary. With me so far? Good.
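To make that concrete, here’s what the pieces produce on a sample tweet (a quick demo, assuming NLTK’s default tokenizer, which splits “#rossexad” into “#” and “rossexad”):

import nltk

print(nltk.word_tokenize("I love this movie! #rossexad"))
# ['I', 'love', 'this', 'movie', '!', '#', 'rossexad']

print(extract_word_feats("I love this movie! #rossexad"))
# {'love': True, 'this': True, 'movie': True}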

Why do we make it a tuple? Well, we need to supply our classifier with a “feature vector,” something that tells the classifier what each tweet has. In this example, we kept it easy and just fed it all the words the sentence has (referred to as the bag-of-words model), hence the (word, True) notation. Everything else is assumed to be false. As we make our feature detector more advanced, we may want to feed it words from a specific set of “significant words,” or only search for 10 specific words to begin with. I’ve included links at the bottom of this post that detail more advanced models.
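As a taste of that “significant words” idea, here’s a minimal sketch in the style of the NLTK book’s document classifier; the word list below is made up purely for illustration, and in practice you’d derive it from word frequencies over a real corpus:

import nltk

# Hypothetical hand-picked vocabulary, for illustration only.
significant_words = ['love', 'great', 'fun', 'terrible', 'sucks', 'stupid']

def extract_significant_feats(tweet):
    tokens = set(w.lower() for w in nltk.word_tokenize(tweet))
    # Every feature vector gets the same keys, marked True or False.
    return dict(('contains(%s)' % w, w in tokens) for w in significant_words)

The difference from the bag-of-words detector above is that absent words are explicitly False instead of simply missing from the dictionary.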

To review: we’ve established a corpus, a collection of tweets already marked as positive or negative, and a feature detector, which returns every word from a tweet that’s at least three characters long and not equal to “rossexad.” Let’s throw it all together:

training_set = nltk.classify.apply_features(extract_word_feats, pos_tweets+neg_tweets)
classifier = nltk.NaiveBayesClassifier.train(training_set)

The first line creates an official training set for our classifier in the exact format it needs (I’m still working out exactly what data structure NLTK wants, so I use apply_features to build it for me). The second line “trains” the classifier (in this case a NaiveBayesClassifier). We can now play with our classifier and test it out.
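If you’re curious what apply_features hands to the classifier, it’s (lazily) building a list of (feature_dict, label) pairs. A hand-rolled equivalent, using the same corpus and detector from above, would look like this:

# Equivalent to apply_features, but built eagerly as a plain list.
training_set = [(extract_word_feats(tweet), label)
                for (tweet, label) in pos_tweets + neg_tweets]
classifier = nltk.NaiveBayesClassifier.train(training_set)

The advantage of apply_features is that it’s lazy, so with a big corpus you don’t hold every feature dictionary in memory at once.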

    test_tweet_pos = "I loved this movie. What an awesome way to spend an afternoon. More of this please! Thanks!"
    test_tweet_neg = "I hated this movie. Fu you ross and you're terrible sense of humor. I hate you."

    print('The negative tweet is classified as:')
    print(classifier.classify(extract_word_feats(test_tweet_neg)))

    print('The positive tweet is classified as:')
    print(classifier.classify(extract_word_feats(test_tweet_pos)))

In this case, the classifier works well and classifies both tweets appropriately. The third part of classification is testing, something I won’t detail now (because we don’t have a large enough training or testing set), but testing will either validate or destroy our feature extractor. In the future, I hope to expand this example to include more sophisticated corpora, feature extractors, and testing data.
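As a preview of that testing step, the usual recipe is to hold some labeled tweets out of training and measure accuracy on them with nltk.classify.util.accuracy. A minimal sketch; the 80/20 split is arbitrary, and with only a dozen tweets the resulting number is meaningless:

import random
import nltk
import nltk.classify.util

labeled = pos_tweets + neg_tweets
random.shuffle(labeled)
cutoff = int(len(labeled) * 0.8)

# Train on the first 80%, test on the held-out 20%.
train_set = nltk.classify.apply_features(extract_word_feats, labeled[:cutoff])
test_set = nltk.classify.apply_features(extract_word_feats, labeled[cutoff:])

classifier = nltk.NaiveBayesClassifier.train(train_set)
print('Accuracy: %.2f' % nltk.classify.util.accuracy(classifier, test_set))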

Here’s my full code:

import nltk
from nltk.classify import NaiveBayesClassifier
 
pos_tweets = [("I love this movie! You are #genius ross.", "positive"),
              ("How innovative #rossexad is. I was lol'ing the entire time. More more more!", "positive"),
              ("I feel great this morning after having watched this movie. ", "positive"),
              ("I feel alive. Love this. Really fun stuff. Hilarious. Love his humor", "positive"),
              ("I am so excited about the movie", "positive"),
              ("Love love this movie. I wish I could watch it 100 times. I really enjoyed myself", "positive"),
              ("He is my best friend", "positive")]
 
neg_tweets = [("Ross's Excellent adventure sucksssss. Sucky suck suck suck. Please give me my $10 back.", "negative"),
              ("FU, fu very much ross for a terrible movie. Garbage. Racist and sexist you pig.", "negative"),
              ("I wonder if I lost IQ points from watching #rossexad. Stupid hashtag too.", "negative"),
              ("If I had a gun with one bullet and had to choose between watching #rossexad and killing myself, I'd watch #rossexad. I'm not an idiot, but not a good movie. ", "negative"),
              ("Ross, not a good guy. #rossexad", "negative")]
 
 
def extract_word_feats(tweet):
    '''
    Extract features from a tweet. Criteria: at least 3 characters
    long and not the #rossexad hashtag.
    Returns the filtered words as a dict: {word: True, ...}
    '''
    # Tokenize, drop short tokens and the hashtag, lowercase the rest.
    return dict((e.lower(), True) for e in nltk.word_tokenize(tweet)
                if len(e) >= 3 and e.lower() != "rossexad")
 
 
def main():
    # Build the training set and train the classifier.
    training_set = nltk.classify.apply_features(extract_word_feats, pos_tweets + neg_tweets)
    classifier = NaiveBayesClassifier.train(training_set)
    classifier.show_most_informative_features()

    test_tweet_pos = "I loved this movie. What an awesome way to spend an afternoon. More of this please! Thanks!"
    test_tweet_neg = "I hated this movie. Fu you ross and you're terrible sense of humor. I hate you."

    print('The negative tweet is classified as:')
    print(classifier.classify(extract_word_feats(test_tweet_neg)))

    print('The positive tweet is classified as:')
    print(classifier.classify(extract_word_feats(test_tweet_pos)))
 
 
if __name__ == "__main__":
    main()

Special thanks to the following blogs/people for help with this post. They have amazing information and much better examples, so consult them first when you’re confused (in fact, don’t even read my post; just go to them).

  1. Stream Hacker: Text Classification for Sentiment Analysis - Eliminate Low Information Features. A great read on decreasing dimensionality to improve results.
  2. Stream Hacker: Text Classification for Sentiment Analysis - Naive Bayes Classifier. A basic introduction to classification with movie reviews. 10x better than what I’ve written here.
  3. Twitter Sentiment Analysis Using Python and NLTK. Our posts share the same title, but hers is much more thorough. She’s like the book; I’m the CliffsNotes.
  4. Twitter Sentiment Analysis Gist. A link to my code.
  5. Title Inspiration.
