It was only a matter of time until the rise of artificial intelligence brought about the creation of artificial data to feed the AI models. While the idea of synthetic data isn’t new, its use and potential have grown rapidly over the past year. A number of tech startups and universities now offer synthetic data services for a variety of uses, including insurance and finance.
While synthetic data can be collected via sensors and through images, videos or audio, just as real data is collected, Ali Jahanian, a former research scientist at MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL), says that “the idea is to have an algorithm, like a simulator or a generative model, to generate such modalities of data with the goal of having this synthetic data as realistic as the real one.”
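To make the idea of a generative model concrete, here is a minimal sketch in Python. It is not the method used at CSAIL; it assumes the simplest possible "model," a single Gaussian fitted to a handful of real values, and the sample figures and function names are hypothetical, invented purely for illustration.

```python
import random
import statistics


def fit_gaussian(values):
    """Estimate the mean and sample standard deviation of the real observations."""
    return statistics.mean(values), statistics.stdev(values)


def generate_synthetic(values, n, seed=0):
    """Draw n synthetic values from a Gaussian fitted to the real ones."""
    mu, sigma = fit_gaussian(values)
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [rng.gauss(mu, sigma) for _ in range(n)]


# Hypothetical "real" measurements (assumed values, not from any actual data set).
real_claim_amounts = [1200.0, 950.0, 1430.0, 1100.0, 1275.0, 1010.0]

# A much larger synthetic set that roughly follows the same distribution.
synthetic_claim_amounts = generate_synthetic(real_claim_amounts, n=1000)
```

Real generative models for images, audio or text are far more elaborate, but the pattern is the same: learn the shape of the real data, then sample new records from what was learned.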
Synthetic Data Offers Cost Savings and Eliminates Privacy Concerns
The advantages of synthetic data over real-world data signal huge potential. At a basic level, synthetic data is simply less expensive than real data to collect and maintain; real-world data sets can cost millions of dollars.
Another consideration is that training AI models on real data has resulted in challenges with data privacy, biases and fairness — issues that are essentially eliminated with the use of synthetic data. And often, the type of real-world data required for a project is simply unavailable or of low quality.
Jahanian says to think of “a generative model that generates synthetic data as an interface to your real data. That means you can get your real data but transform it in a way you couldn’t do with your real data.” At CSAIL, Jahanian and his team were able to turn daylight scenes into nighttime scenes and turn a dormant volcano into an active one.
“These are examples of transformations that you can get for free from a generative model and that are not available in the real data that you collected,” he says.
Already, research conducted by Jahanian and his colleagues has shown that some results with synthetic data are comparable to those obtained with real-world data, and others are even better. Synthetic data also allows AI to train itself, something that Jahanian says can be “both cool and scary.”
Use of Synthetic Data Will Continue to Grow
When it comes to the higher education space, one application of synthetic data might be to “provide different narratives about a concept by being able to generate rich content. Imagine if everyone could generate the content they need by personalizing it. This can help each individual learn in their own style of learning. Maybe one person needs more background for understanding a concept,” says Jahanian.
60%: The percentage of data used for the development of artificial intelligence and analytics projects that will be synthetically generated by 2024
Source: blogs.gartner.com, “By 2024, 60% of the data used for the development of AI and analytics projects will be synthetically generated,” July 24, 2021
Synthetic data might entirely remove the need for real-world data in the near future. Research firm Gartner projects that synthetic data will completely overshadow real data in AI models by 2030. Jahanian says he agrees.
“I believe it will create parallel worlds, and it will have its utilities,” he says. “Depending on how rich we want this world to be, it could take a few years. However, currently we see examples of synthetic language or image generations — like OpenAI GPT-3 and DALL-E — that are very close to human capabilities or even beyond, in some specific cases.”
Wei-Chiu Ma and Jose-Luis Olivares/MIT