Most companies remain in the research and development phase of AI implementation, and one reason why few have actual AI deployments is that data science teams are facing data shortages. 

Analysts agree that the more data you have, the better trained your models will be. So how does a data shortage factor in when determining how to create a data set for machine learning? The solution may be to look for data in less obvious places and to draw on research and data you have already collected.

At the recent AI World Conference & Expo, data scientist Madhu Bhattacharyya, managing director of enterprise data and analytics at global consultancy firm Protiviti, talked about internal data shortages, mediating bias and the importance of external data collection.

What are some tips for how to create a data set for machine learning if you have limited internal data?

Madhu Bhattacharyya: In reality, the more data you have, the better the model is because you can check for seasonality, you can check for factors that become inherent to the model when you’re building it. From a prediction perspective, accuracy also increases with more data.

So if the data you have is very lean, or you’re a company that doesn’t have enough data but wants to come up with insights, you need to figure out a way to get more, whether through analytics, analysis, data multiplication or data mining exercises.

Say you’re a startup, or you’re just developing a new product. There will be some data which will be available right away, because before you start up with something, you do a lot of research. Nothing starts off out of the blue. Before releasing any product or service, think of what you do that collects data. You check for viability, you check for market penetration, you check for potential ROI.

If you’re selling a product, a platform as a service or a service, even before you generate your own data, you will have the initial market data that you researched. How did you identify your potential customer? How did you decide that you need to have the launch in Boston versus in Dallas, for example? All of the information that helped you strategize from multiple angles before the launch of the product is useful for building models and creating a data pipeline.

Don’t restrict yourself only to internal data. Try and bring in relevant external data. Ideally, you want a huge amount of data to fall back on, drawn from an amalgamation of both internal and external data. That combination makes models and AI training much more robust from a decision-making perspective. […]
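As a rough illustration of combining internal and external data, the sketch below joins a few internal sales records with pre-launch market research on a shared key. It is a minimal sketch, not a method described in the interview: the records, the field names (`units_sold`, `population`, `median_income`) and the helper `enrich` are all hypothetical, and the numbers are made up for the example.

```python
# Hypothetical internal records: early sales from a product launch.
internal_sales = [
    {"city": "Boston", "units_sold": 120},
    {"city": "Dallas", "units_sold": 95},
    {"city": "Denver", "units_sold": 40},
]

# Hypothetical external data: market research gathered before launch.
external_market = [
    {"city": "Boston", "population": 650_000, "median_income": 81_000},
    {"city": "Dallas", "population": 1_300_000, "median_income": 58_000},
]


def enrich(internal, external, key):
    """Left-join internal rows with external rows on `key`.

    Internal rows with no external match are kept, with the external
    fields set to None so a later step can decide how to handle them.
    """
    external_by_key = {row[key]: row for row in external}
    # Every external field name except the join key, in a stable order.
    external_fields = sorted({f for row in external for f in row if f != key})

    enriched = []
    for row in internal:
        match = external_by_key.get(row[key], {})
        combined = dict(row)  # copy so the internal data is not mutated
        for field in external_fields:
            combined[field] = match.get(field)
        enriched.append(combined)
    return enriched


training_rows = enrich(internal_sales, external_market, key="city")
for row in training_rows:
    print(row)
```

Keeping unmatched internal rows, rather than dropping them, is a deliberate choice here: when internal data is already scarce, losing records because the external source has no entry for them would make the shortage worse.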

