Google slips out open source OCR engine

Summary: Google has helped out with the dusting-off and release of an old HP Labs optical character recognition engine

Google has announced that it "quietly released" a veteran optical character recognition (OCR) engine as open source a few months ago.

The engine, Tesseract, was developed between 1985 and 1995 by HP Labs to some acclaim, but was filed away when the company pulled out of the OCR business.

According to a recent Google Code Blog post by "Uber Tech Lead" Luc Vincent, a couple of HP employees decided to dust it off and release it as open source software with the help of the Information Science Research Institute at UNLV, which in turn called on Google to help with debugging.

Tesseract is mostly covered by the Apache open source licence, although part is covered by a second licence that may put some restrictions on commercial use.

Vincent admitted that Tesseract was not currently a strong competitor to commercial OCR engines: it only supports English, performs poorly with multi-column material and balks at greyscale or colour documents. He nonetheless insisted it was "far more accurate than any other open source OCR package out there".

Interestingly, the post also mentioned that Google was looking to hire "top-notch OCR engineers".

David Meyer

About David Meyer

David Meyer is a freelance technology journalist. He fell into journalism when he realised his musical career wouldn't pay the bills. David's main focus is on communications, as well as internet technologies, regulation and mobile devices.

Talkback

1 comment
  • Shame it doesn't compile under Linux.

    Well, it fails on my Ubuntu machine, and when I looked on Sourceforge there were other cases noted on Debian and Fedora. Doesn't seem to be much (read: no) activity on investigating the problems.
    anonymous