Why synthetic data may be better than the real thing

To implement successful AI, organizations need data to train models.

That said, high-quality data isn’t always easy to access, creating a significant hurdle for organizations when launching AI initiatives.

This is where synthetic data can be so useful.

Unlike data that is collected and measured in the real world, synthetic data is generated in the digital world using computer simulations, algorithms, simple rules, statistical models, and other techniques. It is an alternative to real-world data, but it reflects real-world data mathematically and statistically.
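As a rough illustration of the statistical-model approach, the sketch below fits a normal distribution to a small sample and then draws synthetic values from it. It is a minimal sketch, not any vendor's method: the sample values, the function names `fit_gaussian` and `sample_synthetic`, and the choice of a normal distribution are all assumptions made for the example.

```python
import random
import statistics


def fit_gaussian(real_values):
    """Estimate the mean and standard deviation of the real sample."""
    return statistics.mean(real_values), statistics.stdev(real_values)


def sample_synthetic(mean, stdev, n, seed=0):
    """Draw n synthetic values from the fitted normal distribution."""
    rng = random.Random(seed)  # seeded so the run is repeatable
    return [rng.gauss(mean, stdev) for _ in range(n)]


# Hypothetical "real" measurements (made-up transaction amounts).
real_amounts = [42.0, 55.5, 61.2, 38.9, 47.3, 52.8, 49.1, 58.4]

mean, stdev = fit_gaussian(real_amounts)
synthetic_amounts = sample_synthetic(mean, stdev, n=10_000)

print(f"real:      mean={mean:.2f}, stdev={stdev:.2f}")
print(f"synthetic: mean={statistics.mean(synthetic_amounts):.2f}, "
      f"stdev={statistics.stdev(synthetic_amounts):.2f}")
```

The synthetic values contain none of the original records, yet their mean and spread track those of the sample. Production systems generally rely on far richer models than a single normal distribution.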

Some experts even claim that synthetic data is better than data drawn from real-world people, places, and things when it comes to training AI models. Restrictions on the use of sensitive and regulated data are removed or reduced; data sets can be tailored to certain conditions that would otherwise be impossible to obtain; insights can be obtained much more quickly; and training is less cumbersome and much more efficient.

Indeed, Gartner projects that synthetic data will completely dwarf actual data in AI models by 2030.

“The fact is, you won’t be able to build high-quality, high-value AI models without synthetic data,” according to the Gartner report.

Leaders in synthetic data

To support accelerating demand, a growing number of companies are offering synthetic data. Major startups in the space include Mostly AI, AI.Reverie, Sky Engine, and Datagen. Leading data engineering company Innodata has also entered the market, today launching an e-commerce portal where customers can purchase synthetic data sets on demand and train models immediately.

“The kind of data sets we’re looking at reflect real-world problems that CIOs and customers have come back to us with,” said CPO Rahul Singhal. “We started looking at: How do we create the large amounts of training data that machines need?”

The Innodata AI Data Marketplace was developed by in-house experts specifically for building and training AI/ML models. The data packages are ready to use, easily previewable, unbiased, diverse, complete and secure, according to Singhal. Innodata is initially launching 17 data packs in four languages that focus on financial services. These packages consist of documents, including invoices, purchase orders, and bank and credit card statements.

“One of the great needs of AI is data diversity,” Singhal said. “We need many different ways to create invoices, we need visibility. It seems very easy, but in reality it is very complicated.”

The marketplace complements Innodata’s open source repository of over 4,000 data sets. These help in prototyping supervised and unsupervised ML projects.

The new synthetic data sets take that a step further by drawing on real-world information. “Machines learn by looking at real-world examples,” Singhal said.

For example, he pointed out the many ways a credit card statement could be structured: one could have names listed on the right-hand side; another on the left; one could use a table format; another a column format. To be accurate, the machines must account for these variations, both in quality and quantity. Innodata models come with hundreds of templates to allow for such variations and replicate real-world scenarios.
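The template idea can be sketched in a few lines of code. The example below is a minimal illustration under stated assumptions, not Innodata's implementation: the `Template` class, the name and merchant lists, and the rendering rules are all hypothetical. It varies the two things Singhal mentions, the side on which the name appears and whether transactions are laid out as a table or as a column.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Template:
    name_side: str  # "left" or "right"
    layout: str     # "table" or "column"


# Every combination of name placement and transaction layout.
TEMPLATES = [Template(side, layout)
             for side in ("left", "right")
             for layout in ("table", "column")]

# Made-up values; no real people or businesses are represented.
NAMES = ["Alex Example", "Sam Sample", "Jo Placeholder"]
MERCHANTS = ["Corner Grocery", "City Transit", "Book Nook", "Cafe Uno"]


def make_transactions(rng, count):
    """Create (merchant, amount) pairs with random amounts."""
    return [(rng.choice(MERCHANTS), round(rng.uniform(5, 500), 2))
            for _ in range(count)]


def render(template, holder, transactions, width=40):
    """Render one statement as plain text according to the template."""
    if template.name_side == "left":
        header = holder.ljust(width)
    else:
        header = holder.rjust(width)
    if template.layout == "table":
        # One row per transaction: merchant on the left, amount on the right.
        rows = [f"{m:<28}{a:>12.2f}" for m, a in transactions]
    else:
        # One field per line, stacked vertically.
        rows = [line for m, a in transactions
                for line in (f"Merchant: {m}", f"Amount: {a:.2f}")]
    return "\n".join([header, *rows])


def generate(n, seed=0):
    """Return n (template, text) pairs, each with a randomly chosen template."""
    rng = random.Random(seed)
    docs = []
    for _ in range(n):
        template = rng.choice(TEMPLATES)
        holder = rng.choice(NAMES)
        transactions = make_transactions(rng, rng.randint(2, 5))
        docs.append((template, render(template, holder, transactions)))
    return docs


statements = generate(100)
print(statements[0][1])
```

Because each document is returned alongside the template that produced it, the layout labels come for free, which is one reason template-driven generation is convenient for training document-understanding models.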

“Machine learning (ML) relies on a diversity of data sets,” Singhal said. “We create real-world data sets as much as possible and replicate what real-world document types will look like.”

Why synthetic data?

Among their many advantages, synthetic data sets are free of personal data and therefore not subject to compliance restrictions or other privacy protection laws, Singhal noted. This also protects against security breaches. Biases are removed to help automate workflows and enable predictive modeling. Singhal noted that “things in the real world aren’t flawless” and that people can smear bank statements or obfuscate things accidentally or deliberately.

Ultimately, synthetic data will be an important tool in driving AI adoption, Singhal said.

The ultimate intent with the Innodata marketplace is to expand into third-party AI training data sets, as well as beyond documents to images, video, audio, and voice (the latter in response to the growth of conversational AI). These data sets will also span industries (telecommunications and utilities, transportation and logistics, energy services, pharmaceuticals, hospitality, insurance, retail, health care) and will be provided in an increasing number of languages so that data scientists can build from a global perspective.

“Our goal is to create a vibrant marketplace where companies can contribute datasets and monetize datasets,” Singhal said. “This has the potential to democratize data for AI.”
