Facebook pumps up character recognition to mine memes
Facebook described how they applied convolutional neural networks, already widely used for image recognition, to detect whether there's text in an image and then transcribe that text. It could be useful for helping visually impaired users, though the immediate appeal of the system would seem to be to mine the vast quantities of internet memes being uploaded by the millions.
Researchers at Facebook offered up a summary of a system they call "Rosetta," a machine learning approach that boosts traditional optical character recognition, or "OCR," to mine the hundreds of millions of photos uploaded to Facebook daily.
Say you want to search for memes in images on Facebook: The site's challenge is to detect whether there are letters printed within an image, and then parse those letters to know what a phrase says.
This technology has, of course, been in use for document processing for ages, but the challenge at Facebook was both to recognize text in any number of complex images, including text laid over the image, as in an internet meme or text such as a sign that was part of the original image, and then to make it work at the scale of the site's constant stream of images.
Facebook split the task of "extracting" text from an image into two separate problems: first detecting whether there is any text in the image at all, and then parsing what that word or phrase might be.
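That two-stage split can be sketched as a simple pipeline interface. The function names, the `TextRegion` structure, and the stub return values below are purely illustrative — they are not Facebook's actual API, just a sketch of how detection hands off to recognition.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TextRegion:
    # Bounding box of one detected text region, in pixel coordinates.
    x: int
    y: int
    w: int
    h: int

def detect_text_regions(image) -> List[TextRegion]:
    """Stage 1 (detection): decide whether the image contains text
    and, if so, where. A stand-in for the Faster R-CNN detector."""
    # Hypothetical stub: a real detector returns learned region proposals.
    return [TextRegion(x=10, y=20, w=120, h=30)]

def recognize_text(image, region: TextRegion) -> str:
    """Stage 2 (recognition): transcribe the characters inside one
    detected region. A stand-in for the recognition CNN."""
    # Hypothetical stub: a real model reads the cropped region.
    return "example"

def extract_text(image) -> List[str]:
    # The full pipeline: detect first, then recognize each region.
    return [recognize_text(image, r) for r in detect_text_regions(image)]
```

The interesting design point is that the two stages only communicate through region coordinates, so each model can be trained and tuned independently.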
For the first task, detection, the authors used a convolutional neural network (CNN) called "Faster R-CNN," which itself derived from work done originally by Facebook's Ross Girshick when he was at Microsoft. While CNNs have been used quite a bit in the last decade for image recognition tasks, such as ImageNet, the R-CNN adds the notion of "regions" as a way to speedily pick out objects in an image and say where precisely in the image the object is located.
Facebook has already widely deployed an object-recognition system throughout its infrastructure called "Detectron," and having that in place clearly helped in this case.
Once text is located in an image, the coordinates of that region are passed to another CNN that discerns the word or phrase, character by character. The product of that second step is a sequence of characters making up words and phrases.
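The article doesn't say how those per-position character predictions are collapsed into a final word. One common scheme for this kind of character-by-character sequence output is a greedy CTC-style decode, sketched below; the blank symbol and the collapsing rule are assumptions about the decoding scheme, not details from the article.

```python
BLANK = "-"  # special "no character" symbol (a CTC-style assumption)

def greedy_decode(chars_per_step):
    """Collapse a per-timestep character sequence into a word:
    merge consecutive repeats, then drop blank symbols."""
    out = []
    prev = None
    for c in chars_per_step:
        if c != prev and c != BLANK:
            out.append(c)
        prev = c
    return "".join(out)

# e.g. a raw network output of "m-eeem-e" collapses to "meme"
```

The repeat-merging step is what lets the network emit one character over several timesteps without duplicating it in the transcript.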
Because recognizing long words or long phrases can be especially challenging, the authors describe using what's called a "curriculum" approach to train the character recognition system. They started out by training the system on small words of five characters or less, and progressively increased the length of words with subsequent iterations of the training.
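A curriculum of this kind can be expressed as a schedule over the training data. The starting threshold of five characters mirrors the description above; the step size by which the limit grows is an assumption for illustration.

```python
def curriculum_stages(vocab, start_len=5, step=3, max_len=None):
    """Yield progressively larger training sets: first only short
    words, then longer ones as training proceeds."""
    if max_len is None:
        max_len = max(len(w) for w in vocab)
    limit = start_len
    while True:
        yield [w for w in vocab if len(w) <= limit]
        if limit >= max_len:
            break
        limit += step

vocab = ["cat", "memes", "recognition", "internet"]
stages = list(curriculum_stages(vocab))
# The first stage trains only on words of five characters or fewer;
# later stages admit progressively longer words.
```

Each stage is a superset of the previous one, so the model keeps seeing easy short words even after long words are introduced.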
All the training work for both the detection part and the recognition part was performed using the "Caffe2" framework.
The authors spend a substantial amount of time in the original paper describing how they tuned the system for optimal speed for "inference," when a new photo is looked at and has to be quickly searched for text and transcribed. "Given our scale and throughput requirements, we spent [a] significant amount of time improving the execution speed of text detection model while keeping the detection accuracy high," they write.