Using machine learning to solve your dark data nightmare

Documents are everywhere, but they're also static, dark data. Can a startup founded by AI and document experts change that?

We live in a world full of documents. This is one.

We create a lot of documents. This is one I made.

It's one of many; a hard drive full of writing from the 1990s to today.

But if you were to ask me about how I construct these, and the invoices I send my clients, I'd have to run a search to find what I'm looking for. I certainly couldn't pull a list of all the topics I've covered, the applications and hardware I've reviewed, the reports I've written, the contracts I've signed. They're all what we think of as "dark data", unstructured content that's just there, static data filling up flash memory here on my PC and up there in a cloud or two.

Jean Paoli, one of the creators of XML, has been thinking a lot about that dark data since he left Microsoft two years ago. The results of that thinking, and that of his co-founders at Docugami, are starting to come out, as the stealth startup slowly unveils what it's doing with a team that mixes document expertise with machine learning.

He's calling the problem "document dysfunction": the morass of files and words that businesses create and use. It's a problem that affects the quality of our documents, along with their consistency, and it's one that puts us at risk of failing to meet regulatory compliance requirements. It's not deliberate; it's just that there's so much unstructured data in our businesses and on our PCs.

Part of that problem is one of scale, with Paoli pointing out that the vast majority of businesses around the world are small and medium-sized organizations that don't have the resources or the tools to build the mammoth enterprise content management tools used by larger companies, and certainly don't have the time to build templates and form tools to automate the construction of commonly used documents.

Paoli's assessment of the document dysfunction problem is a depressing one: he estimates that 85% of enterprise data is buried in email, in tools like Slack and Teams, and in billions of ad hoc documents. It's a problem that's only going to get worse, despite the compute we can throw at it in cloud-hosted data lakes. We've already seen how bad it can get, in the document catastrophes of the 2008 financial collapse that left banks not knowing who owned mortgages and how contracts were structured. It's also visible in the complex discharge processes after hospital stays, where medication and prescriptions are easily lost.

As Paoli points out, while documents are written for humans, they need to be understood by computers. We've tried to build systems that let humans build computer-readable documents, applying descriptive markup, but they've been relatively inflexible, handling only a limited set of use cases, or else they've been complex, requiring manual tagging of existing content. What's needed is a new approach to the problem, one that uses computers as an assistive technology, helping us write common documents.
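To make the idea of descriptive markup concrete, here is a minimal sketch in Python. It is not anything Docugami has shown; it assumes a hypothetical, hand-tagged NDA fragment (the element names `nda`, `party`, `effectiveDate` and `termMonths` are invented for illustration) and shows how, once that tagging exists, a program can pull structured fields out of a document written for humans.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup for an NDA. The element and attribute names are
# invented for this example and do not come from any real schema.
NDA_XML = """
<nda>
  <party role="disclosing">Example Corp</party>
  <party role="receiving">Jane Doe</party>
  <effectiveDate>2019-06-01</effectiveDate>
  <termMonths>24</termMonths>
</nda>
"""


def extract_fields(xml_text):
    """Return the tagged fields of the NDA as a plain dictionary."""
    root = ET.fromstring(xml_text.strip())
    return {
        # Map each party's role attribute to the party's name.
        "parties": {p.get("role"): p.text for p in root.findall("party")},
        "effective_date": root.findtext("effectiveDate"),
        "term_months": int(root.findtext("termMonths")),
    }


fields = extract_fields(NDA_XML)
print(fields)
```

The extraction itself is trivial. The expensive part is the manual tagging it depends on, which is exactly the cost that has kept this kind of markup from spreading beyond a limited set of use cases.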

The company name gives some clues to how Paoli's team is planning to solve the problem: it's a portmanteau of "document" and the Japanese arts of paper cutting (kirigami) and paper folding (origami). A very early version of the Docugami tooling is in use at a small number of customers, with a public beta still six to nine months away.

Instead of finding a better way of indexing and storing those unstructured documents, Paoli is working on new ways of creating common documents, using AI techniques to construct reusable documents. As he says, "We take the repetition out of things, you can use the five minutes to add creativity."

One important point he makes is that this is a world of what he calls "small data". Big data is of the order of terabytes of information, not 50 or so contracts or NDAs. Small teams, so Paoli thinks, need small algorithms, their own machine-learning models. It's actually essential for them, as not only are lowest common denominator approaches unreliable, they're a possible vector for information leaks. If a model is yours and yours alone, it can be secured and can't be used by an attacker to infer your document structures.

If something like this is to be successful it also needs to operate inside several key constraints: it can't need expensive consultants to get working, and it can't be expensive to run. Paoli characterizes his possible audience as individuals and small teams, like the public defender who has too many documents and too many forms to fill in to manage their case load effectively, as well as larger enterprises.

So why now, when we've tried to deliver some of this idea so many times over the last few decades? Paoli sees it as the point where acceptance of the cloud means it's easy for businesses to pick up a new tool that takes advantage of cloud compute to deliver results faster and more accurately than on-premises software and hardware.

The Docugami team is certainly well suited to the task at hand, with an application development team that comes from Office and Windows (including many of the original creators of Microsoft's form management tool InfoPath), and a pure science team that mixes XML and machine-learning skills, as well as human/machine-learning interfaces. It's an interesting approach to working with documents, mixing natural language processing and evolutionary machine-learning skills with a deep enterprise history.

With a public beta still some time away, and much of the technical detail still being kept secret, it's going to be interesting to watch what Paoli and his team come up with.

We live in a world of documents.

Soon this may be one a machine helped me make.