Data is precious – so it’s been asserted; it has become the world’s most valuable commodity. And when it comes to training artificial intelligence (AI) and machine learning (ML) models, it’s absolutely essential.


Copyright: – “Why synthetic data makes real AI better”


Still, due to various factors, high-quality, real-world data can be hard – sometimes even impossible – to come by.

This is where synthetic data becomes so valuable.

Synthetic data reflects real-world data, both mathematically and statistically, but it’s generated in the digital world by computer simulations, algorithms, statistical modeling, simple rules and other techniques. This is opposed to data that’s collected, compiled, annotated and labeled based on real-world sources, scenarios and experimentation.
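To make the "statistical modeling" route concrete, here is a minimal Python sketch, not a production method: it fits a simple two-column Gaussian model to a handful of records and then draws new records from that model. The toy records (age, income), the function names `fit_gaussian` and `sample_synthetic`, and the choice of a Gaussian model are all assumptions made for illustration; real generators are typically far more sophisticated.

```python
import math
import random
import statistics

# Invented toy "real" records (age, income); illustrative values only.
REAL_ROWS = [(25, 31000), (32, 42000), (41, 58000),
             (47, 61000), (53, 72000), (60, 69000)]


def fit_gaussian(rows):
    """Estimate means, standard deviations and correlation of two columns."""
    xs = [r[0] for r in rows]
    ys = [r[1] for r in rows]
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    sx, sy = statistics.stdev(xs), statistics.stdev(ys)
    # Sample covariance, using the same n - 1 divisor as statistics.stdev.
    cov = sum((x - mx) * (y - my) for x, y in rows) / (len(rows) - 1)
    return mx, my, sx, sy, cov / (sx * sy)


def sample_synthetic(params, n, seed=0):
    """Draw n synthetic rows from the fitted two-column Gaussian."""
    mx, my, sx, sy, rho = params
    rng = random.Random(seed)
    # Guard against tiny negative values caused by floating-point rounding.
    residual = math.sqrt(max(0.0, 1.0 - rho ** 2))
    rows = []
    for _ in range(n):
        z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
        x = mx + sx * z1
        # Mix z1 into y so the synthetic columns keep the fitted correlation.
        y = my + sy * (rho * z1 + residual * z2)
        rows.append((x, y))
    return rows


params = fit_gaussian(REAL_ROWS)
synthetic = sample_synthetic(params, 1000)
```

The synthetic rows share the fitted means, spreads and correlation of the original records, but none of them is a copy of an actual record, which is the property the census example below relies on.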

The concept of synthetic data has been around since the early 1990s, when Harvard statistics professor Donald Rubin generated a set of anonymized U.S. Census responses that mirrored those of the original dataset (but without identifying respondents by home address, phone number or Social Security number).

Synthetic data came to be more widely used in the 2000s, particularly in the development of autonomous vehicles. Now, synthetic data is increasingly being applied to numerous AI and ML use cases.

Synthetic data vs. real data

Real-world data is almost always the best source of insights for AI and ML models (because, well, it’s real). That said, it can often simply be unavailable, unusable due to privacy regulations and constraints, imbalanced or expensive. Errors can also be introduced through bias.

To this point, Gartner estimates that through 2022, 85% of AI projects will deliver erroneous outcomes.

“Real-world data is happenstance and does not contain all permutations of conditions or events possible in the real world,” Alexander Linden, VP analyst at Gartner, said in a firm-conducted Q&A.

Synthetic data may counter many of these challenges. According to experts and practitioners, it’s often quicker, easier and less expensive to produce and doesn’t need to be cleaned and maintained. It removes or reduces constraints in using sensitive and regulated data, can account for edge cases, can be tailored to certain conditions that might otherwise be unobtainable or have not yet occurred, and can allow for quicker insights. Also, training is less cumbersome and much more effective, particularly when real data can’t be used, shared or moved.[…]
