Language agnostic document processing: Finding relations using statistics, machine learning, and graphs

Would you like to be able to find related work regardless of domain or language, more efficiently than you ever thought possible? Omnity is out to help achieve this, using a mix of techniques.
Written by George Anadiotis, Contributor
Image: Andrew Ostrovsky

Relations are hard. Any relationship counselor can tell you that, as well as anyone who has ever done research on any topic. Finding what other people have done in your domain is hard work, but it is necessary to properly position and relate your work, discover competitors or collaborators, and improve experimental design.

That has always been the case, but as the pace of innovation and research accelerates, keeping up is getting harder and harder. As someone who has some 120 patents to his name, Brian Sager has been there and done that, and decided he's had enough.

As often happens in research, deciding to tackle one issue may have far-reaching implications. Sager, a seasoned R&D professional and serial entrepreneur, saw the potential and decided to found Omnity to commercialize his solution to that problem.

"Documents are much less connected than they could be. Only a small fraction, typically in the area of about 1 percent of all possible references actually exists in the related work section. Why? Because no author knows everything.

And it's only going to get worse -- our brains stay the same, while the information we have to deal with is exploding. So you can expect that 1 percent to drop. To solve that problem, what we did is we developed the next generation of search."

The way Omnity leverages a mix of data-centric techniques to do this makes its approach not only interesting in and of itself, but also applicable to a number of other problems.

First you get the power law, then you get the language


An example power-law graph. To the right is the long tail, and to the left are the few that dominate (also known as the 80-20 rule). This also applies to languages. Image: Wikipedia.

It works by leveraging the power law -- the statistical distribution of word frequencies in a language. In every language, a small number of words are used by everyone, and a very large number of words are used by hardly anyone.

In English, which has roughly 700K words, the top 10 words account for about 25 percent of usage, the top 100 for 50 percent, and the top 7K for 90 percent.
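To make the power law concrete, here is a minimal Python sketch that measures what share of all word occurrences in a text is covered by its k most frequent words. The function name `top_k_coverage` and the toy sentence are illustrative assumptions, not anything Omnity has published, and a sentence this short only hints at the effect that shows up in a large corpus.

```python
from collections import Counter


def top_k_coverage(tokens, k):
    """Share of all token occurrences covered by the k most frequent tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # an empty text has nothing to cover
    covered = sum(count for _, count in counts.most_common(k))
    return covered / total


# Toy text (an assumption for illustration); real measurements need a large corpus.
sample = "the cat sat on the mat and the dog sat on the log".split()

for k in (1, 3, 8):
    print(f"top {k} words cover {top_k_coverage(sample, k):.0%} of the text")
```

Run against a large body of text, the same function would show the curve flattening quickly: a handful of words covers a large share of occurrences, and the long tail covers the rest.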

According to Sager:

"The ones using those words in the long tail mean exactly what they mean and nothing else.

So if two documents have the same pattern governing the distribution of rare words in them, it means they almost certainly are talking about the same thing. We call this pattern the document's semantic signature.

And we can apply this not just in English, but in other languages too, as they all follow the same power law. For example, if I want to file for a patent, I need to make sure it's not a duplicate.

So I need to go and check every patent registry in the world. Perhaps there is someone in Japan who did something similar, but I don't speak Japanese so there is no way I could possibly know.

With Omnity, this is not a problem: we'll analyze your document's semantic signature and find potential matches in any language. Our processing is based on math, not language. We break document content into tokens and work with tokens."
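Omnity has not published how it computes a semantic signature, so the following Python sketch is only one plausible reading of the idea: drop the common words, count what remains, and compare the resulting rare-word profiles. The cosine measure, the hand-picked `COMMON` set, and the function names are all assumptions made for illustration, and the sketch does not attempt the cross-language mapping Sager describes.

```python
import math
from collections import Counter

# Assumption: a tiny hand-picked set of common words. A real system would
# derive this from corpus-wide frequency counts.
COMMON = {"the", "of", "a", "with", "on"}


def signature(tokens, common=COMMON):
    """Count only the rare tokens, i.e. those outside the common-word set."""
    return Counter(t for t in tokens if t not in common)


def cosine(sig_a, sig_b):
    """Cosine similarity between two rare-word count profiles."""
    dot = sum(sig_a[t] * sig_b[t] for t in sig_a.keys() & sig_b.keys())
    norm_a = math.sqrt(sum(v * v for v in sig_a.values()))
    norm_b = math.sqrt(sum(v * v for v in sig_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


doc_a = "the perovskite layer of the solar cell".split()
doc_b = "a perovskite solar cell with a tin layer".split()
doc_c = "the court ruled on the appeal".split()

sig_a, sig_b, sig_c = signature(doc_a), signature(doc_b), signature(doc_c)
print(cosine(sig_a, sig_b))  # high: the two documents share their rare words
print(cosine(sig_a, sig_c))  # zero: no rare words in common
```

The two solar cell sentences score close to 1 even though their common words differ, while the legal sentence scores 0 against both.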

Mix + match = language-agnostic document processing


Graph processing is central to Omnity, to the point where it has developed its own solution for it. Image: Omnity

But wait. How do you know if a word is frequently used or not? Don't you need something like a lexicon for this? How language-agnostic can language processing really be?

"One of the problems with machine translation is you need context and a number of other things, but we don't need that. We simply map rare words, we don't translate," says Sager.

Document similarity is calculated by adding up scores of primary, secondary, and tertiary matches: words that are present in both documents, words whose synonyms or synsets are present in both documents, and words whose synsets are related and present in the documents.
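The three-tier scoring described above can be sketched in Python as follows. The weights, the `synset_of` lookup table, and the `related` pairs are hypothetical stand-ins. The article does not say how Omnity weights each tier or where its synonym data comes from.

```python
# Assumed weights: the article says scores are added up, not what they are.
WEIGHTS = {"primary": 3.0, "secondary": 2.0, "tertiary": 1.0}


def match_level(word, other_words, synset_of, related):
    """Return the strongest tier at which `word` matches the other document."""
    if word in other_words:
        return "primary"  # same word in both documents
    syn = synset_of.get(word)
    if syn is None:
        return None  # no synonym data for this word
    other_synsets = {synset_of[w] for w in other_words if w in synset_of}
    if syn in other_synsets:
        return "secondary"  # a synonym appears in the other document
    if any(frozenset((syn, other)) in related for other in other_synsets):
        return "tertiary"  # a word from a related synset appears
    return None


def similarity(doc_a, doc_b, synset_of, related):
    """Add up tier weights for every distinct word of doc_a matched in doc_b."""
    words_b = set(doc_b)
    score = 0.0
    for word in set(doc_a):
        level = match_level(word, words_b, synset_of, related)
        if level is not None:
            score += WEIGHTS[level]
    return score


# Hypothetical synonym data, invented for this example.
synset_of = {
    "car": "vehicle.01", "automobile": "vehicle.01",
    "engine": "motor.01", "wheel": "wheel.01", "tire": "tire.01",
}
related = {frozenset(("wheel.01", "tire.01"))}

doc_a = ["car", "engine", "wheel", "red"]
doc_b = ["automobile", "engine", "tire", "blue"]
print(similarity(doc_a, doc_b, synset_of, related))
```

Here "engine" is a primary match, "car" matches "automobile" at the secondary tier, and "wheel" matches "tire" at the tertiary tier. Note that this sketch scores doc_a against doc_b only; a production system would likely normalize for document length and make the score symmetric.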

But is that all there is to it -- statistics? No, not really. Omnity also uses a combination of machine learning (ML) and graph processing.

Omnity has its own internal database of 15 TB worth of documents, and when users submit documents to be processed it searches against it. It also organizes documents in corpora such as medicine, law, etc., and uses ML to classify newly submitted documents.

ML is also used to improve Omnity's algorithms by measuring how well they perform. Users are presented with a list of matches for their documents, and events and metrics such as click-through rate (CTR) and mouse-overs are recorded and used to evaluate and evolve the algorithms.

Omnity evaluates user intent and uses it to boost or downplay document ranking, combining it with its graph structure: results are nodes, relations between them are edges. Some results will be more connected (cited) than others, which signifies higher importance.
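As a rough illustration of how connectedness can feed into ranking, the Python sketch below boosts each result's relevance score by the number of other results that cite it. The function `rank_results`, the logarithmic boost, and the `boost` parameter are assumptions chosen for simplicity. Omnity's actual graph engine and ranking formula are not public.

```python
import math
from collections import defaultdict


def rank_results(relevance, citations, boost=0.5):
    """Order documents by relevance, boosted by how often they are cited.

    relevance: dict mapping document id to a base score (assumed to already
               reflect the match quality and the inferred user intent).
    citations: iterable of (citing_doc, cited_doc) edges.
    boost:     hypothetical tuning knob for how much citations matter.
    """
    in_degree = defaultdict(int)
    for _citing, cited in citations:
        in_degree[cited] += 1
    scored = {
        doc: score * (1.0 + boost * math.log1p(in_degree[doc]))
        for doc, score in relevance.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


relevance = {"A": 0.9, "B": 0.8, "C": 0.5}
citations = [("A", "B"), ("C", "B"), ("A", "C")]
print(rank_results(relevance, citations))
```

Document B starts with a lower base score than A, but being cited twice is enough to lift it to the top of the list.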

Graph processing is central to Omnity. "We looked at solutions like Neo4j, but we ran into problems with scale," says Sager. "When we checked, Neo4j had a 30 billion node limit, which was not good enough for us. We have quadrillions of nodes, so we just had to develop our own solution. We may even license it independently at some point."

Relations -- what are they good for?


Omnity lets users find relations in documents regardless of domain or language. Image: Omnity

That's all well and good, but why should you care if you're not a researcher? Sager says their focus is on knowledge workers -- people who think for a living. But you don't have to be authoring research papers or submitting patents to benefit from this.

Omnity sees R&D and law as its prime application domains, but also has "opportunistic" involvement in domains like finance and content management. According to Sager, good candidates for Omnity are use cases that combine a large number of highly connected documents full of technical terms with a need for clarity and a sense of urgency.

In law, by using semantic pattern detection people can link their thesis to other documents to see how well it holds up, says Sager. For example, they could establish precedent or discover relevant legislation or even evidence.

In finance, Omnity has worked on merger and acquisition deals as well as due diligence. In both cases, there is a huge number of documents that need to be processed as fast as possible, and Omnity says it can retrieve in a second a body of documents that would take an analyst days or weeks to assemble.

Obviously, there are also limits to what Omnity can do. It can be very efficient in getting related documents, but that's where its mission ends. So, what happens with that Japanese patent?

By bringing it into the system and linking it to its English equivalent, Omnity helps decide what to do with it -- assign more resources to it or maybe translate it. Omnity also supports the notion of workspaces for users.

Documents are ingested in workspaces, where they are processed and relations are built leveraging user world views via ML. "We use ML to categorize user documents using their own labels," says Sager.

"If we have an idea of their workspace, we tailor our processing accordingly. Everyone has their own view of semantics, and we can accommodate all of them. For example, if a Fortune 100 is looking for product A, they don't care about where a document was found, they care about where it fits semantically.

We have a unique way of processing documents, and our goal is to be the best in our market."
