Amazon Web Services on Wednesday announced the general availability of Textract, a fully managed service that uses machine learning to automatically extract text and data from documents, including from tables and forms. Textract was one of multiple AI-powered tools and services unveiled at last year's AWS re:Invent conference that require no machine learning expertise to use.
Typically, companies use optical character recognition (OCR) software to extract text and data from files like contracts, tax documents, expense reports or patient forms. However, traditional OCR technologies can't recognize common layouts like forms and tables. They consequently generate a lengthy and often inaccurate text dump.
By comparison, AWS has called Textract an "OCR++" service. It can, for instance, see a document with a table and recognize that the data belongs in rows and columns. "It's able to identify there's a table and able to lay out for you what that table should look like so you can use and read that data," AWS CEO Andy Jassy said at re:Invent.
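To make the rows-and-columns idea concrete, here is a minimal Python sketch of how a table could be rebuilt from the kind of output Textract returns. It assumes the response is a flat list of "blocks" in which a TABLE block points to CELL blocks, and each CELL carries a RowIndex, a ColumnIndex and links to its WORD blocks. The function name and the sample data are hypothetical and exist only for illustration.

```python
def tables_from_blocks(blocks):
    """Rebuild every TABLE block as a list of rows, each a list of cell strings."""
    by_id = {b["Id"]: b for b in blocks}

    def child_ids(block):
        # Collect the Ids listed under CHILD relationships, if any.
        return [
            cid
            for rel in block.get("Relationships", [])
            if rel["Type"] == "CHILD"
            for cid in rel["Ids"]
        ]

    tables = []
    for table in (b for b in blocks if b["BlockType"] == "TABLE"):
        cells = {}
        for cid in child_ids(table):
            cell = by_id[cid]
            if cell["BlockType"] != "CELL":
                continue
            words = [
                by_id[wid]["Text"]
                for wid in child_ids(cell)
                if by_id[wid]["BlockType"] == "WORD"
            ]
            # RowIndex and ColumnIndex are 1-based positions in the table.
            cells[(cell["RowIndex"], cell["ColumnIndex"])] = " ".join(words)

        n_rows = max((r for r, _ in cells), default=0)
        n_cols = max((c for _, c in cells), default=0)
        tables.append(
            [[cells.get((r, c), "") for c in range(1, n_cols + 1)]
             for r in range(1, n_rows + 1)]
        )
    return tables


# Hypothetical, hand-written blocks standing in for a real response.
SAMPLE_BLOCKS = [
    {"Id": "t1", "BlockType": "TABLE",
     "Relationships": [{"Type": "CHILD", "Ids": ["c1", "c2", "c3", "c4"]}]},
    {"Id": "c1", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w1"]}]},
    {"Id": "c2", "BlockType": "CELL", "RowIndex": 1, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w2"]}]},
    {"Id": "c3", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 1,
     "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    {"Id": "c4", "BlockType": "CELL", "RowIndex": 2, "ColumnIndex": 2,
     "Relationships": [{"Type": "CHILD", "Ids": ["w4"]}]},
    {"Id": "w1", "BlockType": "WORD", "Text": "Item"},
    {"Id": "w2", "BlockType": "WORD", "Text": "Total"},
    {"Id": "w3", "BlockType": "WORD", "Text": "Coffee"},
    {"Id": "w4", "BlockType": "WORD", "Text": "3.50"},
]

if __name__ == "__main__":
    for row in tables_from_blocks(SAMPLE_BLOCKS)[0]:
        print(row)
```

A plain OCR text dump would return the same four words as one undifferentiated run; keeping the row and column positions is what lets downstream code treat "3.50" as the total for "Coffee".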
Textract's API accepts multiple input types, including scans, PDFs and photos, and customers can use it with database and analytics services like Amazon Elasticsearch Service, Amazon DynamoDB and Amazon Athena. They can also use it with other machine learning services like Amazon Comprehend, Comprehend Medical, Amazon Translate or Amazon SageMaker.
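As a rough illustration of what calling the API might look like, the Python sketch below sends a single image to Textract through the boto3 SDK's `analyze_document` operation and pulls out the detected lines of text. The helper names, the default region and the file name in the usage note are assumptions made for this example; it covers a local image file only, and handing the results to a database or another machine learning service is left out.

```python
def analyze_image(path, client=None, region_name="us-east-1"):
    """Send one local image to Textract and return the list of blocks.

    `client` may be injected (for example, a stub in tests); otherwise a
    boto3 Textract client is created. The region default is an assumption.
    """
    if client is None:
        import boto3  # imported lazily so the helpers load without the SDK
        client = boto3.client("textract", region_name=region_name)

    with open(path, "rb") as f:
        image_bytes = f.read()

    # Ask for table and form analysis in addition to plain text detection.
    response = client.analyze_document(
        Document={"Bytes": image_bytes},
        FeatureTypes=["TABLES", "FORMS"],
    )
    return response["Blocks"]


def lines_of_text(blocks):
    """Return the text of every LINE block, in the order received."""
    return [b["Text"] for b in blocks if b["BlockType"] == "LINE"]
```

With AWS credentials configured, something like `lines_of_text(analyze_image("receipt.png"))` would return the recognized lines, where `receipt.png` is a placeholder file name. Accepting an optional client keeps the helper easy to exercise without network access.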
Customers using the service already include The Globe and Mail, PwC, Healthfirst, UiPath, Teradact, Ripcord, BluePrism and Alfresco.
Textract is now available in the US East (Ohio), US East (N. Virginia), US West (Oregon) and EU (Ireland) regions. AWS will bring it to additional regions in the coming year.