GPS Trivia

I used to love playing trivia as a kid, but after a while you simply memorize all the answers, and the game becomes more a test of memory than a way of earning new knowledge. I wondered what it would be like to have a trivia game that is built around my life and constantly updated, one that can teach me more about the places I go, their history and culture. So I came up with the following ‘Local Trivia’ app.

The local Trivia app

Local Trivia Question


So let’s say that I’m visiting NY and taking a walk down Broome St; the question above may appear, teaching me more about the history of Broome St. The interesting challenge here is to have enough questions to keep the game interesting and relevant. Obviously, the more people who contribute, the more fun the game will become, but this is not something the app can rely on; therefore I chose to develop an algorithm that can ask questions (and answer them correctly) based on the coordinates of a geographic location. The rules of GPS Trivia are similar to the normal trivia game you know; the only addition is that I decided to include two potential types of trivia questions.

1. Text question – similar to a regular trivia question.
2. Image question – a question that features a few images as optional answers, taking advantage of mass geo-tagging.

For now, let’s focus on the first type of trivia question.

How can a computer ask a meaningful question about a text?

To generate automated text (traditional trivia) questions, I decided to use the following logic.

The idea is simple: first we run a search to find relevant web pages on the topic, then we extract the topic from each of them. We test whether the topic can be composed into a question by running it through a search and seeing what kind of results we find; if we find a meaningful answer, we compose the wrong answers, and if not, we repeat the process.
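The loop described above can be sketched in Python as follows; `search_pages`, `extract_topic`, `find_answer` and `compose_wrong_answers` are hypothetical stand-ins for the real search and NLP calls, not actual APIs:

```python
# Sketch of the question-generation loop: try candidate pages until one
# yields a topic with a meaningful answer, then build the wrong answers.
def generate_question(location_query, search_pages, extract_topic,
                      find_answer, compose_wrong_answers, max_tries=5):
    """Try up to max_tries pages before giving up."""
    for page in search_pages(location_query)[:max_tries]:
        topic = extract_topic(page)        # e.g. 'Broome Street'
        answer = find_answer(topic)        # run the topic through a search
        if answer is not None:             # a meaningful answer was found
            wrong = compose_wrong_answers(topic, answer)
            return {'topic': topic, 'answer': answer, 'wrong': wrong}
    return None                            # no luck -- repeat the process later
```

Passing the helpers in as parameters keeps the loop testable before the real search backend exists.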

Text summarizer 

I came across this text summarizer code posted by The Tokenizer and took steps to integrate some NLTK functions into it. This covers the first two steps in the process above. I chose to use python-goose, as it already removes all the HTML tags from a page and returns the exact content we need.

The idea behind the summarizer class (adapted from here) is as follows:

1. Split the text into sentences.

2. Create a clean version of the text: remove stop words and stem the remaining words, so that when we compare words we match similar words even if they were originally written in different forms.

3. Create a sentence dictionary where each word in each sentence gets a rank based on the number of times it appears in the whole text. The idea is that if a word appears many times in the text, it must be an important word (that’s why we remove the stop words in step 2, so words like ‘the’, ‘a’, ‘and’ and ‘or’ don’t affect the result).

4. Run a loop over each paragraph and check whether it contains sentences with a high rank; if so, consider them part of the summarized text.
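Step 4 can be sketched as follows, assuming we already have a rank score per sentence (the `threshold` parameter and the names are mine, not from the original code):

```python
# Sketch of step 4: keep, from each paragraph, the sentences whose rank
# is well above the average rank across the whole document.
def summarize(paragraphs, ranks, threshold=1.5):
    """paragraphs: list of lists of sentences; ranks: sentence -> score."""
    avg = sum(ranks.values()) / float(len(ranks))
    summary = []
    for paragraph in paragraphs:
        # a sentence makes the cut if it scores threshold-times the average
        best = [s for s in paragraph if ranks[s] > avg * threshold]
        summary.extend(best)
    return ' '.join(summary)
```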

NLTK is a library that provides natural language processing functionality in Python. It includes many functions to tokenize text, stem words, and remove ‘stop words’, which makes it ideal for any text processing.

Here is what my modifications of the summarizer code look like; you can find my complete code here: https://github.com/screename/web-page-summarizer (note: though we are developing a mobile app, this script will run on the server side, probably on an Amazon server with clustering, whenever a relevant question is not found in the database; therefore it is written in Python).

The import statements

The rankSentences method classifies each sentence based on the rank value of the words it includes and the number of times they appear in the whole text. The assumption is that if a word appears many times in the text it must be an important word, and a sentence with many important words is likely to be an important sentence.
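The full method lives in the repo linked above; as a reconstruction (the names and details are mine), the core of the ranking logic amounts to this:

```python
# A sentence's score is the sum of how often its (cleaned) words
# appear across the whole text.
def rank_sentences(sentences, cleaned_words_per_sentence):
    """sentences: the original sentences; cleaned_words_per_sentence:
    the same sentences with stop words removed and words stemmed."""
    # global frequency of every cleaned word
    freq = {}
    for words in cleaned_words_per_sentence:
        for w in words:
            freq[w] = freq.get(w, 0) + 1
    # a sentence's rank is the total frequency of the words it contains
    ranks = {}
    for sentence, words in zip(sentences, cleaned_words_per_sentence):
        ranks[sentence] = sum(freq[w] for w in words)
    return ranks
```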

The modified methods use NLTK to:

1. Remove stop words from each sentence. Since we rank sentences based on repeated words, failing to filter out words like ‘and’, ‘the’, ‘or’ and ‘a’ would skew the value of a sentence; therefore we use NLTK’s stopwords corpus, and if a word appears in the stop words we don’t count it toward the value of the sentence.

2. Stem each sentence. Writing is often inconsistent: we frequently write the same word in different forms even though it carries the same value, for example ‘read’ and ‘reading’, or ‘car’ and ‘cars’. When we stem a word we find its root and use that to create the rank, so that it intersects with other words sharing the same stem.

Extracting the content of a URL is easy when you are using Goose.

This will get us a summarized version of the page that we can try to develop questions from. More about how to transform the summarized version into a meaningful question will come soon; assuming that it works, we’ll end up with something like this.

Correct answer

Wrong answer

Obviously, I can answer the question right or wrong. Once I have, I can also add or view comments on the question or share it with my friends. Alternatively, if I didn’t know the answer and didn’t want to take a risk, I can consult my friends on the answer once my ‘time is up’.

Ask a friend - Time out

Image based question

Questions can also be image based. For example, if I am standing in line at Shake Shack in Madison Square Park in NY, I may get a challenge to identify the Flatiron Building (across the street).

Open source questions

What if we can’t rely on our question-populator algorithm?

Well… ideally the questions will be open source and submitted by the public, so when someone has a question, they can simply go ahead and submit it based on their current location (or a destination).

Related reading 

While researching the Local Trivia post, I came across the following resources:

Python goose – article extractor: https://github.com/grangier/python-goose

NLTK – Python natural language toolkit: http://nltk.org/

TextRank: http://en.wikipedia.org/wiki/Automatic_summarization#Unsupervised_keyphrase_extraction:_TextRank

Build your own summary tool: http://thetokenizer.com/2013/04/28/build-your-own-summary-tool/

The tokenizer: http://thetokenizer.com/

ConceptNet: http://conceptnet5.media.mit.edu/

Stanford Named Entity Recognizer (NER): http://www-nlp.stanford.edu/software/CRF-NER.shtml

Natural Language Processing APIs and Python NLTK Demos: http://text-processing.com/

Text Classification for sentiment analysis – Naive Bayes Classifier: http://streamhacker.com/2010/05/10/text-classification-sentiment-analysis-naive-bayes-classifier/

NLTK, Text processing: http://streamhacker.com/