Using machine learning to solve your dark data nightmare

Data is being generated about the activities of people and inanimate objects on a massive and increasing scale. We examine how much data is involved, how much might be useful, what tools and techniques are available to analyse it, and whether businesses are actually getting to grips with big data.

We live in a world full of documents, and we keep creating more. This is one I made.

It’s one of many: a hard drive full of writing from the 1990s to today.

But if you were to ask me how I construct these, or the invoices I send my clients, I’d have to run a search to find what I’m looking for. I certainly couldn’t pull a list of all the topics I’ve covered, the applications and hardware I’ve reviewed, the reports I’ve written, the contracts I’ve signed. They’re all what we think of as “dark data”: unstructured content that’s just there, static data filling up flash memory here on my PC and up there in a cloud or two.

Jean Paoli, one of the creators of XML, has been thinking a lot about that dark data since he left Microsoft two years ago. The results of that thinking, and that of his co-founders at Docugami, are starting to come out, as the stealth startup slowly unveils what it’s doing with a team that mixes document experts with […].

He’s calling the problem “document dysfunction”: the morass of files and words that businesses create and use. It affects the quality and consistency of our documents, and it puts us at risk of failing to meet regulatory compliance requirements. It’s not deliberate; there’s simply so much unstructured data in our businesses and on our PCs.

Part of that problem is one of scale. Paoli points out that the vast majority of businesses around the world are small and medium-sized organizations. They don’t have the resources or the tools to build the mammoth enterprise content management systems used by larger companies, and they certainly don’t have the time to build templates and form tools to automate the construction of commonly used documents.

Paoli’s assessment of the document dysfunction problem is a depressing one: he estimates that 85% of enterprise data is buried in email, in tools like Slack and Teams, and in billions of ad hoc documents. It’s a problem that’s only going to get worse, despite the compute we can throw at it in cloud-hosted data lakes. We’ve already seen how bad it can get in the document catastrophes of the 2008 financial collapse, which left banks not knowing who owned mortgages or how contracts were structured. It’s also visible in the complex discharge processes after hospital stays, where medication and prescriptions are easily lost.[…]
