Big Data: Sentiment Analysis in the Real World

Part 2 in a series on Big Data


The movie industry is fascinating, and I’d be doing it a disservice if I didn’t say so right off the bat. It’s complex, layered and multivariate, so I’m going to slice off a small chunk of it and use a simple example, movie reviews, to demonstrate how unstructured data can help the industry. The critical time for a movie is the first four weeks after its release, when it makes the most money. For you marketing nerds out there, you may recognize movies as an exception to the usual Bass Model curve: adoption starts near its peak and decreases over time. A good marketing manager wants to maximize ticket sales during this window and needs the best data possible to make that happen. How does she do it?

Before a movie comes out, the studio screens it for movie reviewers to give them time to write a review. When the review comes out, the reviewer assigns the movie a number of stars and justifies that rating with 500 words or so. Readers read, get influenced, and decide whether or not to go. But how often do people agree with the critics? Roger Ebert trashed Zoolander and it’s one of my favorite movies. As great as it is to gather a consensus from professionals, ultimately movies are attended by laymen like myself. Where’s the best place to find laymen these days? Twitter.

Twitter’s API offers mass consumption of diverse public sentiment on a diverse set of issues (read: everyone talks about everything). So upon the release of my movie Ross’s Excellent Adventure, I want to learn how the public perceives it. I log in to my Twitter API account, download all tweets mentioning my movie (or using the hashtag #rossexad), and stare at my screen. Now what? This is the unstructured data I mentioned before, which, while useless in its current state, can provide us some insights with a little elbow grease.

I’m going to apply sentiment analysis to learn how people feel about my movie. I start up my algorithm, parse the tweets and rank each on a sentiment scale from 1-10 (higher being better). Someone tweeting “that movie was awesome! I loved it #rossexad” gets a 10; I won’t insult your intelligence by demoing a 1. I find that 30% of people fall in the 1-5 category, 50% fall in the 6-7 category, and the remaining 20% of people fall in the 8-10 category. Great, now what?
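To make that step a little more concrete, here is a minimal sketch of how the scoring and bucketing might look. It is not the algorithm described above, just a toy stand-in: the `LEXICON` word weights, the `score_tweet` and `bucket_shares` helpers, and the sample tweets are all made up for illustration, and a real system would lean on a proper lexicon or a trained classifier.

```python
import re
from collections import Counter

# Hypothetical toy lexicon (word -> weight). These weights are assumptions
# chosen for illustration, not values from any real sentiment resource.
LEXICON = {
    "awesome": 2.0, "loved": 2.5, "great": 1.5, "good": 1.0, "fun": 1.0,
    "boring": -1.5, "bad": -1.5, "awful": -2.0, "terrible": -2.0, "hated": -2.5,
}

def score_tweet(text):
    """Map a tweet to an integer from 1 (very negative) to 10 (very positive)."""
    words = re.findall(r"[a-z']+", text.lower())
    raw = sum(LEXICON.get(word, 0.0) for word in words)
    # Clamp the raw sum to [-4.5, 4.5], then shift it onto the 1-10 scale.
    clamped = max(-4.5, min(4.5, raw))
    return 1 + int(clamped + 4.5)

def bucket(score):
    """Assign a 1-10 score to one of the three bands used in this post."""
    if score <= 5:
        return "1-5"
    if score <= 7:
        return "6-7"
    return "8-10"

def bucket_shares(tweets):
    """Return the fraction of tweets that land in each band."""
    counts = Counter(bucket(score_tweet(t)) for t in tweets)
    total = len(tweets) or 1  # avoid dividing by zero on an empty list
    return {band: counts[band] / total for band in ("1-5", "6-7", "8-10")}

# Made-up sample tweets, purely for demonstration.
sample_tweets = [
    "that movie was awesome! I loved it #rossexad",
    "good movie #rossexad",
    "boring #rossexad",
    "saw it last night #rossexad",
]
shares = bucket_shares(sample_tweets)
```

Summing word weights like this ignores negation, sarcasm and context, which is exactly where real tweets get tricky, so treat it as a picture of the pipeline's shape rather than something to run on production data.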

Well, I’ll don my marketing manager pants and say “hey, who are the people that fall in the 30% category?” I make a customer profile of the bottom 30% and find that, on average, they’re X years old, live in Y community and have Z interests, and I store that as the XYZ profile. I make a different ABC profile for the top 20% and transfer my XYZ marketing dollars to bump up my ABC dollars. More ABC people will watch my movie now and tell their friends, helping my bottom line and saving me from wasting dollars on segments I shouldn’t be marketing to.
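As a rough sketch of the profiling step, assuming each tweeter has already been scored and somehow matched to a few demographic fields, the segment summaries and the budget shift could look something like this. The `viewers` records, the field names, the dollar amounts and the helper functions are all hypothetical; where the demographic data would actually come from is a separate problem.

```python
from collections import Counter
from statistics import mean

# Hypothetical records, one per tweeter. Every value here is invented
# for illustration; none of it is real audience data.
viewers = [
    {"score": 3, "age": 41, "community": "suburban", "interest": "sports"},
    {"score": 4, "age": 45, "community": "suburban", "interest": "sports"},
    {"score": 2, "age": 49, "community": "rural", "interest": "cooking"},
    {"score": 6, "age": 33, "community": "urban", "interest": "music"},
    {"score": 9, "age": 22, "community": "urban", "interest": "comics"},
    {"score": 10, "age": 26, "community": "urban", "interest": "comics"},
]

def segment(records, low, high):
    """Keep the records whose sentiment score falls in [low, high]."""
    return [r for r in records if low <= r["score"] <= high]

def build_profile(records):
    """Summarize a segment: average age plus most common community and interest."""
    return {
        "avg_age": mean(r["age"] for r in records),
        "community": Counter(r["community"] for r in records).most_common(1)[0][0],
        "interest": Counter(r["interest"] for r in records).most_common(1)[0][0],
    }

def reallocate(budget, from_segment, to_segment):
    """Return a new budget with all of one segment's dollars moved to another."""
    new_budget = dict(budget)
    new_budget[to_segment] += new_budget[from_segment]
    new_budget[from_segment] = 0
    return new_budget

xyz_profile = build_profile(segment(viewers, 1, 5))   # the low-sentiment group
abc_profile = build_profile(segment(viewers, 8, 10))  # the high-sentiment group

# Assumed starting budget in dollars.
budget = reallocate({"XYZ": 30000, "ABC": 20000}, "XYZ", "ABC")
```

Moving every XYZ dollar at once is the bluntest possible version of the idea; in practice you would probably shift spend gradually and check whether ticket sales follow.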

The next post will be a brief detour from the big data series: I will demo some Python and NLTK code that shows how we might approach sentiment analysis with tweets.
