The Ear

The Ear – Home

The Ear is a live ASL translation app that recognizes both ASL gestures and spoken language, enabling translation in both directions. I'm really excited to be working on this app as it touches so many great technologies, from computer vision and NLP to the analysis of sound and voices. In other words, quite a lot, but as usual it's written one line of code at a time, and actually it seems that most of it has been written for me already :)

The idea behind The Ear app is to enable live translation between ASL (American Sign Language) and spoken language on one hand, while also enabling voice recognition on the other, so the translation works both ways: the person who speaks sign language can read the text of the words they cannot hear, while the person who doesn't understand sign language can see a live translation of the conversation.

As you can see, the app interface is very simple: it has two modes, Listen and Translate.

The Ear – Listen

The Ear – Translate

Handling each mode is almost an app of its own: one is centered around voice recognition and NLP, while the other is based on computer vision, gesture recognition, and image training.

Phase I – Listen

This phase would have been quite complicated if I hadn't had the chance to work on the nlp² app, which basically does exactly that (what we need here is even simpler). This is the flow I used in the nlp² app:

Speaker recognition flow

In this case we can skip the speaker-recognition phase, since it doesn't matter who is talking: we want to translate it all. I also realized that it would be challenging to rely on the native voice recognition pack that comes with the phone, so I decided to use Sphinx, or rather CMU Sphinx, via the OpenEars framework.


When I said earlier that most of the code was written for me, I wasn't kidding, as OpenEars really does most of the work. One advantage of using a framework like OpenEars is that it runs locally on the device, which means the translation can happen instantly, without long calls to a server. On the other hand, the data set we can use is much smaller than one you could host on a server, so we add an additional dictionary: with OpenEars we get to define the vocabulary of the app, which could be small or large (I guess we'll keep it somewhere in the middle).

The code (taken from the OpenEars documentation) seems pretty straightforward to customize in order to build the dictionary. OpenEars is ideal for a vocabulary of up to 300 words, which doesn't seem ideal for our app, but I wanted to give it a try and, if it works well, develop a similar framework that will support more elements.
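The OpenEars snippet itself is Objective-C; just to illustrate the idea, here is a rough equivalent using the pocketsphinx Python bindings (CMU Sphinx is the engine underneath OpenEars). Treat it as a sketch rather than the app's actual code: with no arguments LiveSpeech listens on the microphone with the bundled US English model, whereas in the app OpenEars would load the smaller custom vocabulary we define.

```python
# Illustrative only: continuous recognition with the pocketsphinx Python
# bindings. LiveSpeech with no arguments uses the bundled US English model;
# the app's OpenEars setup would point at the custom, smaller vocabulary
# instead of the full language model.
from pocketsphinx import LiveSpeech

for phrase in LiveSpeech():
    # Each hypothesis is a chunk of recognized speech; it gets handed to the
    # UI and kept so the user can scroll back through the transcript.
    print(phrase)
```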

The Ear – Listen bar

We also display the audio waveform, so the user can scroll back and repeat the translation.

Phase II – Look

To identify a specific gesture we follow a path identical to the one I used in the pets recognition project, only now we're looking for ASL signs in each picture rather than a cat face. The concept is very simple: first we train a model with a set of images reflecting each sign and classify that set as representing sign X. This is probably where the main challenge of the app lies. I am using the same concept I posted about here for the training, as this is manual training (no easy way out this time) on a set of videos where people use sign language; video is ideal for this since you get a variety of positions and lighting conditions to train on. The challenge is that there are around 11,000 terms in ASL, and assuming we'll need at least 100 images for each gesture to get decent recognition, that's over a million training images, which means many hours of training, but it has to be done. Another way to increase the productivity of the training process is to use real images, or alternatively to set up an online training environment where people can train ASL terms in their free time; the results would be saved to a database and used for future training in the app. Playing around with this for a bit, I see the tool having the following states (a rough sketch of this loop follows the list):

Gesture Training

Capture – a countdown to take a picture of the desired gesture.

Crop – Crop the sign / gesture from the larger image.

Tag – Translate the image into spoken language and tag it.

Save – Save the image to the database and move to the next gesture.
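To make that loop concrete, here is a minimal sketch of it in Python with OpenCV; the app itself is an iOS app, so this is purely an illustration. The dataset folder, the labels.csv index, the fixed crop region, and the key bindings are all made up for the sketch, and a key press stands in for the countdown.

```python
# Minimal sketch of the Capture -> Crop -> Tag -> Save loop using OpenCV.
# Folder name, CSV index file, crop region and key bindings are illustrative.
import csv
import os
import time
import cv2

DATASET_DIR = "asl_dataset"
os.makedirs(DATASET_DIR, exist_ok=True)

cam = cv2.VideoCapture(0)
crop = (100, 100, 300, 300)  # fixed x, y, w, h region where the hand is framed

with open(os.path.join(DATASET_DIR, "labels.csv"), "a", newline="") as index_file:
    writer = csv.writer(index_file)
    while True:
        ok, frame = cam.read()
        if not ok:
            break
        x, y, w, h = crop
        cv2.rectangle(frame, (x, y), (x + w, y + h), (0, 255, 0), 2)
        cv2.imshow("Gesture training", frame)

        key = cv2.waitKey(1) & 0xFF
        if key == ord("q"):              # quit the tool
            break
        if key == ord("c"):              # Capture + Crop
            sample = frame[y:y + h, x:x + w]
            label = input("Tag this gesture (spoken-language word): ")  # Tag
            filename = f"{label}_{int(time.time())}.png"
            cv2.imwrite(os.path.join(DATASET_DIR, filename), sample)    # Save
            writer.writerow([filename, label])

cam.release()
cv2.destroyAllWindows()
```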

Once we've collected all the data, we define the model and train it. First we load the data and compose two vectors: one with the objects we'd like to train on and one with the corresponding labels.

The function getIndexFromFileName will give us the index value of the label we'll assign to the selected gesture. Then we pass both vectors to the model.
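A rough Python/NumPy sketch of this step, carrying over the assumptions from the capture sketch above (the word_timestamp.png filename pattern and a 64×64 crop size). The getIndexFromFileName here is my own guess at what that helper looks like, and since the recognition step below is a nearest-template comparison, "training" is reduced to stacking the flattened images next to their label indices.

```python
# Sketch of the training step: load every saved gesture image into a samples
# matrix and build a parallel labels vector. Filenames, sizes and the
# nearest-template interpretation of the "model" are assumptions.
import glob
import os
import cv2
import numpy as np

DATASET_DIR = "asl_dataset"
paths = sorted(glob.glob(os.path.join(DATASET_DIR, "*.png")))

# One entry per distinct word; each label index points into this list.
LABELS = sorted({os.path.basename(p).split("_")[0] for p in paths})

def getIndexFromFileName(path):
    """Map a file such as 'hello_1690000000.png' to the index of its label."""
    return LABELS.index(os.path.basename(path).split("_")[0])

samples = []   # one flattened grayscale image per row
labels = []    # the corresponding label index for each row
for path in paths:
    img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
    img = cv2.resize(img, (64, 64))
    samples.append(img.flatten().astype(np.float32))
    labels.append(getIndexFromFileName(path))

samples = np.array(samples, dtype=np.float32)
labels = np.array(labels, dtype=np.int32)

# For a nearest-template matcher, "training" amounts to keeping both vectors
# side by side; the recognition sketch below consumes them directly.
model = (samples, labels)
```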

Once the model is trained with the collected data, we can start recognizing gestures.

The getSimilarity function compares the two matrices by evaluating the square root of the sum of squared errors.

If we find a sign, we look up its index value in our labels vector and pass it to the front end of the app.
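Putting those last three steps together, here is an illustrative Python sketch that reuses samples, labels and LABELS from the training sketch above. getSimilarity is exactly the square root of the sum of squared errors, and the match threshold is an arbitrary placeholder to be tuned on real data.

```python
# Recognition sketch: compare a cropped frame against the stored templates
# and return the word for the closest one, or None if nothing is close enough.
import cv2
import numpy as np

MATCH_THRESHOLD = 2500.0   # made-up cut-off; tune on real data

def getSimilarity(a, b):
    """Square root of the sum of squared errors between two image matrices."""
    diff = a.astype(np.float32) - b.astype(np.float32)
    return float(np.sqrt(np.sum(diff ** 2)))

def recognizeGesture(crop_bgr):
    """Return the recognized word for a cropped gesture image, or None."""
    gray = cv2.cvtColor(crop_bgr, cv2.COLOR_BGR2GRAY)
    vec = cv2.resize(gray, (64, 64)).flatten().astype(np.float32)

    # Compare against every stored template and keep the closest one.
    best_index, best_score = None, float("inf")
    for row, label_index in zip(samples, labels):
        score = getSimilarity(vec, row)
        if score < best_score:
            best_index, best_score = int(label_index), score

    if best_index is None or best_score > MATCH_THRESHOLD:
        return None                 # no sign found in this frame
    return LABELS[best_index]       # index -> word, handed to the front end
```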

The Ear – Translate bar

Same as with the audio waveform, we display the signs found so the user can start learning them intuitively, as well as scroll back and see what was said earlier.

Update: while testing one of the beta versions of the app, one of the testers suggested integrating the same online training tool into the app itself, so that users can contribute to the growth of the ASL sign database. I really like that idea and hope to have it be part of the official launch.