3rd bite

For a while now I've been playing with the idea of sentiment analysis for restaurants. I love using apps like Yelp, but sometimes the variety and amount of reviews is a bit overwhelming, especially when I have yet to eat and am still hunting for food. So I came up with this simple app that scouts the web for restaurant reviews and ranks them by sentiment analysis, so I won't have to read them on an empty stomach. Here is what the first screens look like, along with the logic behind training the data – the way the sentiment analysis works.

[Screenshot: the first 3rd bite screen]

Now let's say you are around Prospect Heights in Brooklyn and you type 'Ramen'. You'll see something like the following screen:

[Screenshot: 3rd Bite search results for 'Ramen' around Prospect Heights]

This is pretty straightforward; I'm sure you have seen many apps that find things for you to eat nearby. The difference between those apps and 3rd bite is that 3rd bite does much of the work of analyzing the results as well, giving you a more precise choice to make based on a mathematical idea – the ratio of positive to negative reviews it found.

A few sentiments on sentiment analysis

Sentiment analysis has become a very popular tool, mainly for brands to get a better understanding of what people are saying about them. I find it also makes a great tool for bringing much of this information to the public, so everyone can know what everyone else is saying. Before I dive into the process of analyzing sentiments, here is the complete code – https://github.com/screename/Sentiment-Analysis.git – if you want to take a look.

The idea behind sentiment analysis is very simple: first we take data whose sentiment we already know and train our algorithm with it. This creates a probability estimate that, if a certain word is included in a sentence, the sentence is most likely positive or negative.

Naive Bayes classifier

The core of the training and classifying is done with the Naive Bayes theorem. In Python it is really easy, since a classifier based on it comes built into the nltk library. The Naive Bayes classifier predicts the probability that a given feature set belongs to a particular label using the following formula:

$$P(label \mid features) = \frac{P(label) \times P(features \mid label)}{P(features)}$$

P(label) is the probability that a given label will occur, based on the data we used to train our classifier. So if our training database has a 40% positive to 60% negative ratio, the probability that a sentiment is positive is 40%.

P(features|label) is the probability that a given set of features will occur within a specific category – label; in other words, how likely a word like 'fun' is to show up in a positive sentence.

P(features) is the probability that a given feature set will occur; in other words, it's the likelihood that a word will show up, based on how many times that word is repeated in our training database.

P(label|features) – the result of the Naive Bayes theorem – is the probability that a given feature set will have a given label; obviously, the higher the result, the more confident we can be that the classification is correct.

While it is great to know more about the way the Naive Bayes classifier works, it is certainly not needed in order to use it with nltk. Here is an example of how it works using the built-in movie reviews corpus.
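The snippet below is a sketch of that classic nltk recipe; the exact code in the post's repo may differ slightly:

```python
# Train a Naive Bayes classifier on nltk's built-in movie reviews corpus.
# Requires the corpus data: nltk.download('movie_reviews')
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

def word_feats(words):
    # nltk classifiers expect a dict of feature -> value pairs
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# use 3/4 of each category for training, the rest for testing
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4
trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]

classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))
classifier.show_most_informative_features()
```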

 

Running this will print the classifier's accuracy along with its most informative features.

Now all you need to do is pass a sentence to your classifier and see if it returns a positive or a negative sentiment.
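Something along these lines (the example sentence here is mine):

```python
# Classify a new sentence with the classifier trained above
sentence = 'the ramen was rich and absolutely delicious'
feats = word_feats(sentence.split())

print(classifier.classify(feats))          # 'pos' or 'neg'

# or look at the underlying probabilities
dist = classifier.prob_classify(feats)
print(dist.prob('pos'), dist.prob('neg'))
```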

Step I – Build your own sentiment analysis

Using the Naive Bayes classifier and the movie reviews corpus is a great place to start, but you can do much more with it. As you'll see, some sentiments are better classified with data you have collected from the same source. A great example of that is Twitter or blog comments, where people, even though they use English, have a very specific way of saying things, which may be interpreted differently when you cross it with data from other sources.

The first step in building the sentiment analysis is to build a custom corpus which will include the relevant sentiments for our training. Again, you can see the full code for this here – https://github.com/screename/Sentiment-Analysis.git – and it follows these steps:

1. Define a connection to download the right data – in our case the Twitter API. Since we are using Python, we'll use the Twython library to handle the connection and get the tweets for us.

2. Run a loop (with a pause, to make sure we do not violate the Twitter API limit of 350 calls per hour) and get all the data needed.

3. Filter the results – remove stop words, URLs, hashtags, @mentions and such – and save them to a text file, which will be our new tweet corpus.

This is what our code will look like, beginning with the import statements.
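Roughly something like this (the exact module list is an assumption, not the verbatim code from the repo):

```python
import re
import time

from twython import Twython
from nltk.corpus import stopwords            # requires nltk.download('stopwords')
from nltk.tokenize import word_tokenize      # requires nltk.download('punkt')
```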

We initiate the class with two objects. The first is a list which will contain all the custom stop words which we don't want to save to our corpus. Note that I'm using a source which contains a lot of tweets about Apple, Google and Microsoft, so many of the stop words relate to them; there is no point in classifying a word like 'iphone' as positive or negative, even though it may be included in a sentiment that is.

The second object is a dictionary which will hold the data for each category while we download it from Twitter; we'll use the keys in this dictionary to save the corresponding files once our loop is done.
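A sketch of that constructor (the class name and exact stop words are placeholders of mine):

```python
class TweetCorpusBuilder(object):

    def __init__(self):
        # brand words that carry no sentiment on their own, since the source
        # tweets are mostly about apple, google and microsoft
        self.custom_stopwords = ['apple', 'iphone', 'ipad', 'google',
                                 'android', 'microsoft', 'windows', 'rt']

        # one bucket per category; the keys double as the corpus file names
        self.tweets = {'positive': [], 'negative': []}
```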

You can download tweets without initializing an API key, but you'll be limited to a much lower rate (I think it's around 100 calls per hour). Using Twython this process is very simple: all you need is to set up a developer account, create an app and get the API key and secret, and Twython will handle all the rest for you.
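A minimal Twython setup looks like this (fill in the credentials from your app's settings page):

```python
APP_KEY = 'YOUR_APP_KEY'
APP_SECRET = 'YOUR_APP_SECRET'
OAUTH_TOKEN = 'YOUR_ACCESS_TOKEN'
OAUTH_TOKEN_SECRET = 'YOUR_ACCESS_TOKEN_SECRET'

twitter = Twython(APP_KEY, APP_SECRET, OAUTH_TOKEN, OAUTH_TOKEN_SECRET)

# quick smoke test: Twython mirrors the REST API, so search returns
# a dict with a 'statuses' list
results = twitter.search(q='ramen', lang='en')
print(len(results['statuses']))
```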

We'll use the following methods to clean and tokenize the tweets – pretty straightforward nltk and regex.
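A sketch, continuing the class above (the exact regular expressions are assumptions):

```python
    def clean_tweet(self, text):
        # strip urls, @mentions, hashtags and anything that isn't a letter
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'[@#]\w+', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return text.lower().strip()

    def tokenize_tweet(self, text):
        # break the cleaned tweet into words and drop stop words
        ignore = set(stopwords.words('english')) | set(self.custom_stopwords)
        tokens = word_tokenize(self.clean_tweet(text))
        return [t for t in tokens if t not in ignore and len(t) > 2]
```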

These methods will load and save our data at the beginning and end of the process.
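For example (the file naming scheme is an assumption):

```python
    def load_corpus(self, category):
        # pick up where a previous run left off, if the file already exists
        try:
            with open('%s_tweets.txt' % category) as f:
                self.tweets[category] = [line.strip() for line in f if line.strip()]
        except IOError:
            self.tweets[category] = []

    def save_corpus(self, category):
        # one cleaned tweet per line, so it can be read back as a plain-text corpus
        with open('%s_tweets.txt' % category, 'w') as f:
            for tweet in self.tweets[category]:
                f.write(tweet + '\n')
```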

And finally we are ready to run the loop and get the tweets we need.
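The post doesn't reproduce the loop itself, but here is one way it could look, assuming the labelled tweet ids come from a pre-labelled list such as the Sanders corpus linked in the resources below:

```python
    def download_tweets(self, labelled_ids):
        # labelled_ids is a list of (tweet_id, 'positive' | 'negative') pairs;
        # assumes the authenticated Twython client from above
        for tweet_id, category in labelled_ids:
            try:
                status = twitter.show_status(id=tweet_id)
                tokens = self.tokenize_tweet(status['text'])
                if tokens:
                    self.tweets[category].append(' '.join(tokens))
            except Exception:
                # deleted or protected tweets just get skipped
                continue
            # stay under 350 calls per hour: roughly one call every 11 seconds
            time.sleep(11)

        for category in self.tweets:
            self.save_corpus(category)
```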

Note that it will take around 18 hours to get the tweets due to the Twitter API limits. I just let it run overnight, and when I came back from work the day after, it was all completed and ready for me to start training my classifier.

Step II – Train your classifier

This step is very similar to the one we did with the movie reviews corpus, only now we'll use our freshly downloaded Twitter sentiments corpus and set it up in an object-oriented manner, with a few additions to get content directly from specific web pages which we want our app to scan.

We'll use the following import statements in our class. Note that we'll be using a text summarization class similar to the one I wrote about here – http://lab.onetwoclick.com/gps-trivia/ – as well as the Python Goose library to get content from web pages.
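The imports would look roughly like this (the summarizer import is a placeholder for the class from the gps-trivia post):

```python
import re

import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews, stopwords
from nltk.tokenize import word_tokenize, sent_tokenize

from goose import Goose            # python-goose: pulls article text from a url
# from summarize import Summarize  # placeholder: the text summarization class
```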

We'll initiate the class with a few lists to hold the data we want to add to the classifier, and a function to load a corpus file and return its content as a list.
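A sketch of that constructor and loader (the class and attribute names are assumptions):

```python
class SentimentClassifier(object):

    def __init__(self):
        self.positive_sentiments = []   # custom sentences added by hand
        self.negative_sentiments = []
        self.classifier = None

    def load_corpus_file(self, path):
        # each line of our twitter corpus files is one cleaned tweet
        with open(path) as f:
            return [line.strip() for line in f if line.strip()]
```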

For the training data we can use one of the following options: the nltk movie reviews corpus, our Twitter sentiments corpus, or custom data that we'll pass directly to the classifier. Let's begin with the movie reviews corpus, which has an identical structure to the one we saw above.
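Wrapped as methods of the class sketched above, it is the same recipe as earlier:

```python
    def word_feats(self, words):
        return dict([(word, True) for word in words])

    def load_movie_reviews(self, split=0.75):
        neg = [(self.word_feats(movie_reviews.words(fileids=[f])), 'neg')
               for f in movie_reviews.fileids('neg')]
        pos = [(self.word_feats(movie_reviews.words(fileids=[f])), 'pos')
               for f in movie_reviews.fileids('pos')]
        ncut, pcut = int(len(neg) * split), int(len(pos) * split)
        return neg[:ncut] + pos[:pcut], neg[ncut:] + pos[pcut:]
```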

To load our Twitter sentiments corpus we'll use the following method. Note that our database is much smaller than the movie reviews corpus, which means that it will be faster to load and work with, but also may be less accurate for certain tasks; you may want to play around with the split value when we set this up for training and see what returns the best results.
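Something like this, assuming the corpus files saved in Step I are named positive_tweets.txt and negative_tweets.txt:

```python
    def load_twitter_corpus(self, split=0.75):
        pos_lines = self.load_corpus_file('positive_tweets.txt')
        neg_lines = self.load_corpus_file('negative_tweets.txt')
        pos = [(self.word_feats(line.split()), 'pos') for line in pos_lines]
        neg = [(self.word_feats(line.split()), 'neg') for line in neg_lines]
        # play with the split value to see which train/test ratio works best
        ncut, pcut = int(len(neg) * split), int(len(pos) * split)
        return neg[:ncut] + pos[:pcut], neg[ncut:] + pos[pcut:]
```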

When we add custom sentiments to our data, we'll need to make sure they are included in the training set we pass to the classifier, as the training method below does.

Here we'll train the classifier with the data we have collected.
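A sketch of that training method, which also folds in the custom sentiments mentioned above (the parameter names are mine):

```python
    def train(self, use_movie_reviews=False, split=0.75):
        if use_movie_reviews:
            train_feats, test_feats = self.load_movie_reviews(split)
        else:
            train_feats, test_feats = self.load_twitter_corpus(split)

        # make sure any hand-added sentiments end up in the training set
        train_feats += [(self.word_feats(self.tokenize(s)), 'pos')
                        for s in self.positive_sentiments]
        train_feats += [(self.word_feats(self.tokenize(s)), 'neg')
                        for s in self.negative_sentiments]

        self.classifier = NaiveBayesClassifier.train(train_feats)
        print('accuracy:', nltk.classify.util.accuracy(self.classifier, test_feats))
```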

And a few utility methods to add custom sentiments to our lists. One thing I'd like to add to these is the ability to save them to a custom corpus file for later use.
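For example (the method names are mine):

```python
    def add_positive_sentiment(self, sentence):
        self.positive_sentiments.append(sentence)

    def add_negative_sentiment(self, sentence):
        self.negative_sentiments.append(sentence)

    # todo: also save these to a custom corpus file for later use
```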

And a few methods to clean and tokenize new data.
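Similar to the cleaning we did while building the corpus:

```python
    def clean(self, text):
        # strip urls and anything that isn't a plain word
        text = re.sub(r'http\S+', '', text)
        text = re.sub(r'[^a-zA-Z\s]', '', text)
        return text.lower().strip()

    def tokenize(self, text):
        ignore = set(stopwords.words('english'))
        return [t for t in word_tokenize(self.clean(text)) if t not in ignore]
```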

Step III – Classify

Finally we get to test our classifier and see what type of sentiment it returns.
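The classification itself is a thin wrapper on top of everything above:

```python
    def classify(self, sentence):
        feats = self.word_feats(self.tokenize(sentence))
        return self.classifier.classify(feats)
```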

Let's try it.
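The example sentences below are mine, not the ones from the original post:

```python
sc = SentimentClassifier()
sc.train()

print(sc.classify('the ramen was amazing and the broth was rich'))   # expect 'pos'
print(sc.classify('waited an hour for a cold, bland bowl'))          # expect 'neg'
```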

Even though the accuracy level is lower than the one we got using the movie reviews corpus, the test data shows more reasonable results, since the classifier is trained with data that comes from the same source.

We can also test it on a paragraph, using the following logic to return the accumulative sentiment value based on the sentiment of each of the participating sentences. For example, a paragraph can be composed of 2 negative sentences and 3 positive sentences, which makes it a positive paragraph.
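A sketch of that logic, continuing the same class:

```python
    def classify_paragraph(self, paragraph):
        # classify each sentence and let the majority decide the paragraph
        labels = [self.classify(s) for s in sent_tokenize(paragraph)]
        pos, neg = labels.count('pos'), labels.count('neg')
        return {'pos': pos, 'neg': neg,
                'sentiment': 'pos' if pos >= neg else 'neg'}
```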

Let's test it with a real user comment from this post about some new pizza place around Union Square in NY; obviously you can see from the text that the sentiment is very negative.

This results in a 7 to 3 negative-to-positive ratio.

Another interesting use of the sentiment analysis is in combination with grabbing content from URLs, since this is what we intend to do with the 3rd bite app. I was contemplating two options: one, to break each page into paragraphs and return the accumulative value of each; or two, to summarize the text on the page first and run the summary through the classifier.
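A sketch of the URL version, with the summarizer kept as a placeholder (the summarize call is an assumption based on the class from the gps-trivia post):

```python
    def classify_url(self, url, summarize=False):
        # option one: classify the full page text sentence by sentence;
        # option two: summarize the text first and classify the summary
        article = Goose().extract(url=url)
        text = article.cleaned_text
        if summarize:
            text = self.summarizer.summarize(text)   # placeholder summarizer
        return self.classify_paragraph(text)
```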

Like with the following review

which returns these values

As you can see, the results are not far from each other, though I'm not sure the tweet sentiments are the right filter to look at them through; perhaps this is the time to build a user comments corpus :)

I hope this gives some background on how sentiment analysis works. Post a comment or send me an email if you have any questions or want to see how it gets integrated into the 3rd bite app.

Step IV – The analysis (Zipf's Law)

Some interesting learning from this experiment: apparently the data regarding user reviews also follows Zipf's Law, which I came across in this NY Times article. The law was established around the idea that the most frequent word in a text will appear about twice as often as the 2nd most frequent word. The same law has been known to apply to the evolution of cities, where the largest city will be twice as large as the 2nd largest city and three times as large as the 3rd, and so on. It's interesting that the law has been shown to hold in many situations, following this formula:

$$f(k; s, N) = \frac{1 / k^s}{\sum_{n=1}^{N} 1 / n^s}$$

where N is the number of elements, k is their rank, and s is the value of the exponent characterizing the distribution. What is really interesting about the 3rd bite data is that it also follows the Zipf's Law pattern, where the most positively reviewed restaurant in a given area is twice as positive as the 2nd and three times as positive as the 3rd – but that is probably a topic for a whole new post.
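To make the formula concrete, here is a quick check of those ratios in Python:

```python
def zipf_share(k, s, N):
    # expected share of the element ranked k out of N, with exponent s
    return (1.0 / k ** s) / sum(1.0 / n ** s for n in range(1, N + 1))

# with s = 1 the top element is twice the 2nd and three times the 3rd,
# regardless of N
print(zipf_share(1, 1, 10) / zipf_share(2, 1, 10))   # ~2.0
print(zipf_share(1, 1, 10) / zipf_share(3, 1, 10))   # ~3.0
```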

 

Here are a few of the resources I encountered while working on the sentiment analysis:

Source code: https://github.com/screename/Sentiment-Analysis

http://en.wikipedia.org/wiki/Naive_Bayes_classifier

http://nltk.org

http://andybromberg.com/sentiment-analysis-python/

http://sentiment.christopherpotts.net/

http://sananalytics.com/lab/twitter-sentiment/

http://www.laurentluce.com/posts/twitter-sentiment-analysis-using-python-and-nltk/

Twython: https://github.com/ryanmcgrath/twython

Goose: https://github.com/grangier/python-goose

Zipf's Law: http://en.wikipedia.org/wiki/Zipf's_law

http://economix.blogs.nytimes.com/2010/04/20/a-tale-of-many-cities/