So you’re collecting lots of data with the intention to automate decision-making through the strategic use of machine learning. That’s great! But as your data scientists and data engineers quickly realize, building a production AI system is a lot easier said than done, and there are many steps to master before you get that ML magic.
At a high level, the anatomy of AI is fairly simple. You start with some data, train a machine learning model on it, and then deploy the model to run inference on real-world data. Sounds easy, right? Unfortunately, as the old saying goes, the devil is in the details. And in the case of AI, there are a lot of small details you have to get right before you can claim victory.
One person who’s familiar with the complexities of data workflows for AI and ML projects is Avinash Shahdadpuri. As head of data and infrastructure at Nexla, Shahdadpuri has helped build AI pipelines for some sizable firms, including a Big 4 asset management firm in New York and a major logistics firm that tracks packages for retailers.
According to Nexla, which has just published a white paper titled “Managing Data for AI: From Development to Production,” upwards of 70% of the time and energy spent on AI projects goes into preparing the data for the ML algorithms. This data-management work can be broken down into a handful of phases, including:
1. Data discovery: What data do you need, and how do you find it?
2. Data analysis: Is the data useful for ML? How do you validate it?
3. Data preparation: What format is the data in? How do you transform it into the correct format?
4. Data modeling: What algorithm can you use to model the data?
Time spent on different tasks in an AI project (Image courtesy of Nexla)
Some of these tasks might be longer or shorter depending upon the type of data you’re working with and the type of problem you’re solving, Shahdadpuri says. “If your data is very segmented, you might have a bigger discovery phase,” he says. “If you’re very intimate with the data, you might have shorter discovery phase, but you’ll still be doing these things on and off.”
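To make the data analysis and data preparation phases more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the records, the field names (`package_id`, `weight_kg`, `shipped`), and the two accepted date formats are all hypothetical and do not come from Nexla or its white paper.

```python
from datetime import datetime

# Hypothetical raw records; the field names are illustrative assumptions.
RAW_RECORDS = [
    {"package_id": "A1", "weight_kg": "2.5", "shipped": "2019-03-01"},
    {"package_id": "A2", "weight_kg": "", "shipped": "2019-03-02"},
    {"package_id": "A3", "weight_kg": "1.0", "shipped": "03/04/2019"},
]

REQUIRED_FIELDS = ("package_id", "weight_kg", "shipped")


def validate(record):
    """Data analysis: list the required fields that are missing or empty."""
    return [f"missing {field}" for field in REQUIRED_FIELDS if not record.get(field)]


def parse_date(text):
    """Data preparation: accept two assumed date formats, emit ISO dates."""
    for fmt in ("%Y-%m-%d", "%m/%d/%Y"):
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


def prepare(record):
    """Data preparation: convert string fields into typed, uniform values."""
    return {
        "package_id": record["package_id"],
        "weight_kg": float(record["weight_kg"]),
        "shipped": parse_date(record["shipped"]),
    }


def run(records):
    """Split records into prepared rows and rejected rows with reasons."""
    clean, rejected = [], []
    for record in records:
        problems = validate(record)
        if problems:
            rejected.append((record.get("package_id"), problems))
        else:
            clean.append(prepare(record))
    return clean, rejected


clean, rejected = run(RAW_RECORDS)
```

In this sketch, records that fail validation are set aside with a reason instead of being silently dropped, so the team can see how much of the source data is unusable before any model is trained.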
Data catalog and discovery tools can be very useful for finding relevant data sets, Shahdadpuri says. But many enterprises have millions of different data sets, many of which have not been cataloged via metadata analysis. That could lessen the usefulness of cataloging tools when working with dark data sets. Now you’re back to fishing in the data lake with a fishing pole and hoping to get lucky. […]