NLP meets NLP: neuro-linguistic programming meets natural language processing, a mobile app that tracks facial, voice, and body patterns and matches them with neuro-linguistic practice. I'm really excited to start working on this app, as it combines three passions of mine: computer vision, natural language processing, and learning about people.

The NLP practice is centered around listening to and learning from people, the way they move and talk. This involves recognizing and reading their facial gestures, analyzing their eye movements and body gestures, as well as identifying the language they use to infer what is important to them and the way they view the world.

You can get a general idea of how it's going to work from the first design sketches of the app.

[Design sketches: the Home, Listener, and Keywords screens]

In this post I'd like to focus on one aspect of the app that I find really cool: the keyword density listener. Before I go ahead and write my own voice recognition library, I decided to give the native voice recognition functionality that comes with Android a try. I already anticipated a few challenges, but wanted to see if it could produce minimal results.

The way it works is actually very simple: using the ever-improving native speech functionality of mobile phones, we get the text as a string and analyze its keyword density, with just one caveat: no NLTK this time, as most mobile phones find it a bit challenging to run Python from within an app. I guess this is an exciting opportunity to play with processing language in a more basic form.

I decided to divide the task into two phases, recording and analyzing, which will also make it easier to improve the voice recognition part at a later point.

Step I – Voice recognition

This turns out to be very simple, actually. First we add a validation check for whether the VoiceRecognizer is available on the device we are testing with. Note: it is currently not available on the Android emulator, so you'll have to use a real device.

Next, we'll associate the "record" button with the record function, which will initiate the speech intent.

When a speech command is passed to the app, we'll receive it in the onActivityResult function. At the beginning of the method we run a check to see whether the request code matches the VOICE_RECOGNITION_REQUEST_CODE we passed at the end of the record method. I find it best to define it as a static variable at the top of the class.

That's it; we are pretty much done with Step I. From this point, all we need to do is analyze textInput.get(0), which will be a string containing the newly recorded text.
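Since the original snippets aren't reproduced here, this is a rough sketch of Step I, assuming the classic RecognizerIntent flow; the layout, the button id, and the analyze helper are hypothetical placeholders, not the app's actual code.

```java
import java.util.ArrayList;
import java.util.List;

import android.app.Activity;
import android.content.Intent;
import android.content.pm.ResolveInfo;
import android.os.Bundle;
import android.speech.RecognizerIntent;
import android.view.View;
import android.widget.Button;

public class ListenerActivity extends Activity {

    // Best defined once, at the top of the class.
    private static final int VOICE_RECOGNITION_REQUEST_CODE = 1234;

    @Override
    public void onCreate(Bundle savedInstanceState) {
        super.onCreate(savedInstanceState);
        setContentView(R.layout.main); // hypothetical layout

        Button recordButton = (Button) findViewById(R.id.record_button); // hypothetical id

        // Validation: is a voice recognition activity available on this device?
        List<ResolveInfo> activities = getPackageManager().queryIntentActivities(
                new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH), 0);
        if (activities.isEmpty()) {
            recordButton.setEnabled(false); // e.g. on the emulator
        } else {
            recordButton.setOnClickListener(new View.OnClickListener() {
                public void onClick(View v) {
                    record();
                }
            });
        }
    }

    // Initiate the speech intent.
    private void record() {
        Intent intent = new Intent(RecognizerIntent.ACTION_RECOGNIZE_SPEECH);
        intent.putExtra(RecognizerIntent.EXTRA_LANGUAGE_MODEL,
                RecognizerIntent.LANGUAGE_MODEL_FREE_FORM);
        startActivityForResult(intent, VOICE_RECOGNITION_REQUEST_CODE);
    }

    @Override
    protected void onActivityResult(int requestCode, int resultCode, Intent data) {
        if (requestCode == VOICE_RECOGNITION_REQUEST_CODE && resultCode == RESULT_OK) {
            ArrayList<String> textInput =
                    data.getStringArrayListExtra(RecognizerIntent.EXTRA_RESULTS);
            analyze(textInput.get(0)); // hypothetical handoff to Step II
        }
        super.onActivityResult(requestCode, resultCode, data);
    }

    private void analyze(String recording) { /* Step II, below */ }
}
```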

Step II – Keyword density analysis

In this step we'll first set up a list of the stop words we want to filter out, something like the following:

Before we go ahead and tokenize the recording, we'll initialize the stop words list and create a variation without punctuation as well.

The tokenizing method will also clean the list of any stop words. I can already see how this could benefit from some more regex, but for this test it should be good enough.
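Without NLTK, the tokenizer can be sketched in plain Java; the stop-word list here is a short placeholder (the real one would be much longer), and I strip punctuation from the tokens rather than keeping punctuated variants of the stop words.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class Tokenizer {
    // A short placeholder list; the real app would load a much longer one.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
            "a", "an", "the", "is", "are", "to", "of", "and", "in", "it"));

    // Lowercase the recording, split on whitespace, strip punctuation
    // from each token, and drop stop words.
    static List<String> tokenize(String recording) {
        List<String> tokens = new ArrayList<String>();
        for (String word : recording.toLowerCase().split("\\s+")) {
            String clean = word.replaceAll("[^a-z0-9']", "");
            if (!clean.isEmpty() && !STOP_WORDS.contains(clean)) {
                tokens.add(clean);
            }
        }
        return tokens;
    }
}
```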

 The keyword density analysis takes two steps.

1. Add every word to our keywords list, or increment the count of words that are already there.

2. Calculate the density of each word relative to the total number of words we have recorded.

To collect the words we run a double loop: one pass through all the words, adding each as a single-word keyword, and an inner pass up to the current index plus the static variable KEYWORDS_TAIL_LENGTH. I like to set it to 2, so we get analysis of combinations of up to three words.

The addKeyword method checks whether we already have an entry for the keyword in the keywords map; if so, we simply increment its value, and if not, we add a new entry for the word with a value of 1.
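The double loop and addKeyword described above might look like this; KEYWORDS_TAIL_LENGTH and addKeyword are the names used in the post, the rest is my sketch.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class KeywordCounter {
    static final int KEYWORDS_TAIL_LENGTH = 2; // up to 3-word combinations

    final Map<String, Integer> keywords = new HashMap<String, Integer>();

    // Double loop: every single word, plus phrases extending up to
    // KEYWORDS_TAIL_LENGTH words past the current index.
    void collect(List<String> words) {
        for (int i = 0; i < words.size(); i++) {
            StringBuilder phrase = new StringBuilder(words.get(i));
            addKeyword(phrase.toString());
            for (int j = i + 1; j <= i + KEYWORDS_TAIL_LENGTH && j < words.size(); j++) {
                phrase.append(' ').append(words.get(j));
                addKeyword(phrase.toString());
            }
        }
    }

    // Increment an existing keyword's count, or add it with a value of 1.
    void addKeyword(String keyword) {
        Integer count = keywords.get(keyword);
        keywords.put(keyword, count == null ? 1 : count + 1);
    }
}
```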

The density analysis is based on the following formulas:

For a single word, we divide the word's count by the total number of words recorded so far and multiply by 100:

density = (count / total_words) × 100

For a phrase, we multiply the occurrence count of the phrase by the number of words in the phrase, divide by the total number of words recorded, and multiply by 100:

density = ((count × phrase_length) / total_words) × 100
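In code, the two formulas reduce to a couple of one-liners (the method names are mine):

```java
class Density {
    // Single word: count divided by total words recorded, times 100.
    static double wordDensity(int count, int totalWords) {
        return 100.0 * count / totalWords;
    }

    // Phrase: count times phrase length, divided by total words, times 100.
    static double phraseDensity(int count, int phraseLength, int totalWords) {
        return 100.0 * count * phraseLength / totalWords;
    }
}
```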

This seems to work pretty well for single-person speech recognition. The challenge, which I expected, is that in order to make a better analysis we need more information about the recorded voice: first, to learn to distinguish between two (or more) voices, so the analysis is not of the whole conversation but of a specific participant in it. Secondly, there is a lot of information encompassed in the intensity of a specific recording; this is crucial for understanding the way a person emphasizes something with their way of speaking. It seems that Google (and Apple) are yet to give you access to the real data in the recording, so I'll have to make an attempt to build my own voice recognition (perhaps using Sphinx).

Another interesting approach I'd like to test, instead of building a speech recognition library from scratch, is an integration between the built-in VoiceRecognizer and the ability to record and listen. Basically, instead of having the record button activate the record function, it will activate a listening function which first records the voices while activating the speech intent; we'll run an analysis of the voices and pass a new VOICE_RECOGNITION_REQUEST_CODE for each voice, so when speech is captured we know which voice activated it.

Speaker Recognition

As you may expect, the speaker recognition process is divided into two steps: training and testing. In the case of our app, we'll first train it to recognize the main speaker (probably the owner of the phone). When the app runs, if we do not recognize a speaker as the main speaker, we'll train a new speaker with a unique code, so the next time our app listens to this speaker they will already be in the database.

To compute the analysis of the voices I decided to use the Mel-frequency cepstral coefficients (MFCC) algorithm. It seems like the task is getting somewhat more complicated, but I'm up for the challenge. This is a general diagram of how I see the app operating.


While steps 1, 3, and 4 are covered above, step 2 has been the most surprising for me in terms of learning, and a great introduction to the way computers hear sound. I'm still not sure I got it all right, but it seems to work. So here is a reflection of my understanding, using the MFCC and DTW (dynamic time warping) algorithms to identify voices.


The idea behind MFCC is that it enables a computer to analyze sound in a similar way to how the human ear processes it. Mimicking the way sound is shaped is the task of MFCC, which takes place in the following steps.

Pre-emphasis – the first step is not mandatory, to my understanding. The idea here is to emphasize the higher frequencies, so the sequence we deal with is more extreme and it becomes easier to identify differences between voices.

Framing – since voice input and sound files are rather long and complex, framing is the process of separating them into small frames of 20-40 ms.

Windowing – I guess this is somewhat similar to smoothing when we deal with images. The result of windowing is reduced noise in the signal: we look at the changes in the sequence and analyze the difference between point i and point i + 1.
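A minimal sketch of the pre-emphasis and windowing steps, assuming the usual first-order pre-emphasis filter and a Hamming window (common choices, though not the only ones); framing itself is just slicing the emphasized signal into 20-40 ms chunks.

```java
class PreProcess {
    // Pre-emphasis: boost the higher frequencies, y[n] = x[n] - a * x[n-1].
    // A common value for a is 0.95-0.97.
    static double[] preEmphasize(double[] signal, double a) {
        double[] out = new double[signal.length];
        out[0] = signal[0];
        for (int n = 1; n < signal.length; n++) {
            out[n] = signal[n] - a * signal[n - 1];
        }
        return out;
    }

    // Hamming window applied to one frame, tapering the frame's edges
    // so the FFT sees fewer artificial discontinuities.
    static double[] hamming(double[] frame) {
        double[] out = new double[frame.length];
        for (int n = 0; n < frame.length; n++) {
            out[n] = frame[n]
                    * (0.54 - 0.46 * Math.cos(2 * Math.PI * n / (frame.length - 1)));
        }
        return out;
    }
}
```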

FFT – the fast Fourier transform is the core of MFCC. The job of the FFT is to convert time (or space, when dealing with images) into frequencies. There are many FFT algorithms for Java that I looked into; still, I wanted to keep it simple and consider coding my own, and many of the examples I encountered are listed below in the resources. The computation of the FFT takes two arrays, one of the real data and one of the imaginary data, usually an equal-length array of zero values. The two most interesting approaches I found were either to use a Matrix object to do the calculations, or to define a custom object which includes the basic mathematical functions that need to be applied, something like this:
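A minimal version of such an object, with just the operations an FFT needs (this is my sketch, not the post's original snippet):

```java
// A small immutable complex-number object for FFT arithmetic.
class Complex {
    final double re, im;

    Complex(double re, double im) {
        this.re = re;
        this.im = im;
    }

    Complex plus(Complex b)  { return new Complex(re + b.re, im + b.im); }
    Complex minus(Complex b) { return new Complex(re - b.re, im - b.im); }

    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i
    Complex times(Complex b) {
        return new Complex(re * b.re - im * b.im, re * b.im + im * b.re);
    }

    // Magnitude, used when building the periodogram from the FFT output.
    double abs() { return Math.hypot(re, im); }
}
```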

I guess this is somewhat similar to using a matrix. I'm not going to go into the process of the FFT, as it is outside the scope of this post (and I doubt I'd do a great job explaining it); however, you can check out many of the examples listed below if you want to write your own, or use an existing library which includes an FFT algorithm.

Mel filter processing – this is a set of 20-40 (usually 26) triangular filters we apply to the periodogram we got from the FFT. The result of this calculation gives us an indication of how much energy was in each filterbank. The conversion from frequency to mel is done with the following formula:

m = 2595 × log10(1 + f / 700)

while the conversion from mel back to frequency is:

f = 700 × (10^(m / 2595) − 1)

both of which we compute with the following functions:
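The original functions aren't shown here; a version using the standard constants (2595 and 700) would be:

```java
class Mel {
    // Frequency (Hz) to mel: m = 2595 * log10(1 + f / 700).
    static double freqToMel(double freq) {
        return 2595.0 * Math.log10(1.0 + freq / 700.0);
    }

    // Mel back to frequency: f = 700 * (10^(m / 2595) - 1).
    static double melToFreq(double mel) {
        return 700.0 * (Math.pow(10.0, mel / 2595.0) - 1.0);
    }
}
```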

Discrete cosine transform – this converts the Mel spectrum into an array representing each of the acoustic vectors, also called the Mel-frequency cepstral coefficients. This function returns an array with the acoustic vector values, which we pass to the DTW.

DTW – dynamic time warping. The DTW will give us the similarity in pattern between voices: DTW looks for patterns which can vary over time, so even if a speech pattern is slower or faster it may still be the same pattern. As with the FFT, there are many implementations of DTW. From what I've learned it is also very useful in the area of movement recognition, as it can track a movement pattern regardless of its speed over time; this will need to be implemented in other areas of the app as well. I guess that, just like the FFT, DTW deserves its own post and full exploration.
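As a sketch, here is the classic dynamic-programming form of DTW over two sequences of acoustic vectors (my own minimal version, using Euclidean distance between frames):

```java
import java.util.Arrays;

class DTW {
    // DTW distance between two sequences of acoustic vectors. The warping
    // path may advance either sequence independently, which is what makes
    // the match insensitive to speed.
    static double distance(double[][] a, double[][] b) {
        int n = a.length, m = b.length;
        double[][] cost = new double[n + 1][m + 1];
        for (double[] row : cost) Arrays.fill(row, Double.POSITIVE_INFINITY);
        cost[0][0] = 0;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                double d = euclidean(a[i - 1], b[j - 1]);
                cost[i][j] = d + Math.min(cost[i - 1][j - 1],
                        Math.min(cost[i - 1][j], cost[i][j - 1]));
            }
        }
        return cost[n][m];
    }

    static double euclidean(double[] x, double[] y) {
        double sum = 0;
        for (int k = 0; k < x.length; k++) sum += (x[k] - y[k]) * (x[k] - y[k]);
        return Math.sqrt(sum);
    }
}
```

A distance of 0 means the two sequences follow the same pattern, regardless of how it is stretched over time.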

Working on the 3rd bite app, I mentioned the idea of Zipf's law in the analysis of written language; it would be interesting to look at spoken language and see if it still applies.