Synthetic data could be better than real data

Credit: Janelle Barone

When more than 155,000 students from all over the world signed up to take free online classes in electronics in 2012, offered through the fledgling US provider edX, they set in motion an explosion in the popularity of online courses.

The edX platform, created by the Massachusetts Institute of Technology (MIT) and Harvard University, both in Cambridge, Massachusetts, was not the first attempt at teaching classes online, but the number of participants it attracted was unusual. The activity created a massive amount of information on how people interact with online education, and presented researchers with an opportunity to answer questions such as: what might encourage people to complete courses, and what might give them a reason to drop out?

"We had a tonne of data," says Kalyan Veeramachaneni, a data scientist at MIT's Laboratory for Information and Decision Systems. Although the university had long dealt with large data sets generated by others, that was the first time that MIT had big data "in its own backyard", says Veeramachaneni.

Hoping to take advantage, Veeramachaneni assigned 20 MIT students to run analyses of the information. But he soon ran into a roadblock: legally, the data had to be private. This wealth of information was held on a single computer in his laboratory, with no connection to the Internet, to prevent hacking. The researchers had to schedule a time to use it. "It was a nightmare," Veeramachaneni says. "I just couldn't get the work done because the barrier to the data was very high."

His solution, eventually, was to create 'synthetic students': computer-generated versions of edX participants that shared characteristics with real students using the platform, but that did not give away private details. The team then applied machine-learning algorithms to the synthetic students' activity, and in doing so discovered several factors associated with a person failing to complete a course1. For instance, students who tended to submit assignments right on a deadline were more likely to drop out. Other groups took the findings of this analysis and used them to design interventions that help real people complete future courses2.

This experience of building and using a synthetic data set led Veeramachaneni and his colleagues to create the Synthetic Data Vault, a set of open-source software that allows users to model their own data and then use those models to generate alternative versions of the data3. In 2020, he co-founded a company called DataCebo, based in Boston, Massachusetts, which helps other companies to do this.
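
For tabular data of this kind, the Synthetic Data Vault's open-source Python packages follow roughly the workflow sketched below: describe the table, fit a generative model to it, then sample artificial rows. This is a hedged illustration only; the column names are invented, and the class and method names are assumed to match a recent SDV 1.x release (older versions differ).

```python
# A minimal sketch of generating a synthetic table with the open-source SDV
# library. Column names are invented; the API shown assumes a recent SDV 1.x
# release and may differ in other versions.
import pandas as pd
from sdv.metadata import SingleTableMetadata
from sdv.single_table import GaussianCopulaSynthesizer

# A toy stand-in for a real, private table of course activity.
real_df = pd.DataFrame({
    "hours_active":     [1.5, 4.0, 0.5, 7.2, 2.1, 6.3],
    "assignments_done": [1, 4, 0, 6, 2, 5],
    "completed_course": [False, True, False, True, False, True],
})

metadata = SingleTableMetadata()
metadata.detect_from_dataframe(real_df)            # infer column types

synthesizer = GaussianCopulaSynthesizer(metadata)
synthesizer.fit(real_df)                           # learn the joint statistics
synthetic_df = synthesizer.sample(num_rows=1000)   # draw artificial rows
print(synthetic_df.head())
```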

The desire to preserve privacy is one of the driving forces behind synthetic-data research. Because artificial intelligence (AI) and machine learning have expanded rapidly, finding their way into areas as diverse as health care, art and financial analysis, concerns about the data used to train these systems are also growing. To learn, the algorithms must consume vast amounts of information, much of which relates to individuals. A system could reveal private details, or be used to discriminate against people when making decisions on hiring, lending or housing, for example. The data fed to these machines might also be owned by an individual or company that does not want the information to be used to create a tool that might then compete with them, or that at least does not want to give the data away for free.

Some researchers think that the answer to these concerns could lie in synthetic data. Getting computers to manufacture data that are close enough to the real thing, without recycling real information, could help to address privacy problems. But it could also do much more. "I want to move away from just privacy," says Mihaela van der Schaar, a machine-learning researcher and director of the Cambridge Centre for AI in Medicine, UK. "I hope that synthetic data could help us create better data."

All data sets come with issues that go beyond privacy considerations. They can be expensive to produce and maintain. In some cases, such as trying to diagnose a rare medical condition using imaging, there simply might not be enough real-world data available to train a system to do the task reliably. Bias is also a problem: both social biases, which might cause systems to favour one group of people over another, and subtler issues, such as a training set of photos that includes only a handful taken at night. Synthetic data, its proponents say, can get around these problems by adding absent information to data sets faster and more cheaply than gathering it from the real world, assuming it is even possible to obtain the real thing at all.

"To me, it's about making data this living, controllable object that you can change towards your application and your goals," says Phillip Isola, a computer scientist at MIT who specializes in machine vision. "It's a fundamental new way of working with data."

There are several ways to synthesize data, but they all rely on the same concept. A computer, using a machine-learning algorithm or a neural network, analyses a real data set and learns the statistical relationships within it. It then creates a new data set containing different data points from the original, but retaining the same relationships. A familiar example is ChatGPT, the text-generation engine. ChatGPT is based on a large language model, Generative Pre-trained Transformer, which pored over billions of examples of text written by humans, analysed the relationships between the words and built a model of how they fit together. When given a prompt, such as "Write me an ode to ducks", ChatGPT takes what it has learnt about odes and ducks and produces a string of words, with each word choice informed by the statistical probability of it following the previous one:

Oh ducks, feathered and free,

Paddling in ponds with such glee,

Your quacks and waddles are a delight,

A joy to behold, day or night.
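
The same recipe works for plainer data than poetry. In the hedged sketch below, which uses made-up height and weight figures rather than anything from a real data set, a simple model learns the means and covariance of the original records and then samples fresh rows that preserve the correlation between the columns without copying any individual.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up "real" data: height (cm) and weight (kg) for 1,000 people,
# generated here only so that the example is self-contained.
height = rng.normal(170, 10, 1000)
weight = 0.9 * height - 80 + rng.normal(0, 8, 1000)
real = np.column_stack([height, weight])

# Learn the statistical relationships: the means and the covariance.
mean = real.mean(axis=0)
cov = np.cov(real, rowvar=False)

# Generate synthetic records from the learned model. No row is copied
# from `real`, but the height-weight relationship is preserved.
synthetic = rng.multivariate_normal(mean, cov, size=1000)

print(np.corrcoef(real, rowvar=False)[0, 1])
print(np.corrcoef(synthetic, rowvar=False)[0, 1])  # similar correlation
```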

With the right training, machines can produce not only text but also images, audio or the rows and columns of tabular data. The question is, how accurate is the output? "That's one of the challenges in synthetic data," says Thomas Strohmer, a mathematician who directs the Center for Data Science and Artificial Intelligence Research at the University of California, Davis (UC Davis).

Jason Adams, Thomas Strohmer and Rachael Callcut (left to right) are part of the synthetic data research team at UC Davis Health.

"You first have to figure out what you mean by accuracy," he says. To be useful, a synthetic data set must retain the aspects of the original that are relevant to the outcome: the all-important statistical relationships. But AI has accomplished many of its impressive feats by identifying patterns in data that are too subtle for humans to notice. If we could understand medical data well enough to easily identify the relationships that suggest someone is at risk of a disease, we would have no need for a machine to find those relationships in the first place, Strohmer says.

This catch-22 means that the clearest way to know whether a synthetic data set has captured the important nuances of the original is to see whether an AI system trained on the synthetic data makes predictions that are similarly accurate to those of a system trained on the original. The more capable the machine, the harder it is for humans to distinguish the real from the fake. AI-generated images and text are already at the point where they seem realistic to most people, and the technology is advancing rapidly. "We're getting close to the level where, even to the expert, the imagery looks correct, but it still might not be correct," Isola says. It is therefore important that users treat synthetic data with some caution, and don't lose sight of the fact that it isn't real data, he says. "It still might be misleading."
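
In practice, this check is often run as a "train on synthetic, test on real" experiment: fit one model on the real training data and another on the synthetic data, then score both on the same held-out real records. The sketch below illustrates the idea with scikit-learn and a crude class-conditional Gaussian generator; the data set and model are placeholders, not anything used by the researchers quoted here.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Stand-in for a real data set (placeholder, not real patient records).
X, y = make_classification(n_samples=4000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Crude synthetic generator: fit a Gaussian per class and sample from it.
X_synth, y_synth = [], []
for label in np.unique(y_train):
    Xc = X_train[y_train == label]
    samples = rng.multivariate_normal(Xc.mean(axis=0),
                                      np.cov(Xc, rowvar=False),
                                      size=len(Xc))
    X_synth.append(samples)
    y_synth.append(np.full(len(Xc), label))
X_synth, y_synth = np.vstack(X_synth), np.concatenate(y_synth)

# Train one model on real data and one on synthetic data,
# then score both on the same held-out *real* test set.
real_model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
synth_model = RandomForestClassifier(random_state=0).fit(X_synth, y_synth)
print("trained on real:     ", accuracy_score(y_test, real_model.predict(X_test)))
print("trained on synthetic:", accuracy_score(y_test, synth_model.predict(X_test)))
```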

Last April, Strohmer and two of his colleagues at UC Davis Health in Sacramento, California, won a four-year, US$1.2-million grant from the US National Institutes of Health to work out ways to generate high-quality synthetic data that could help physicians to predict, diagnose and treat diseases. As part of the project, Strohmer is developing mathematical methods of proving just how accurate synthetic data sets are.

He also wants to include a mathematical guarantee of privacy, especially given the stringent medical-privacy laws in place around the world, such as the Health Insurance Portability and Accountability Act in the United States and the European Union's General Data Protection Regulation. The difficulty is that the utility and privacy of data are in tension: increasing one means decreasing the other.

To increase privacy, scientists add statistical noise to a data set. If, for instance, one of the data points collected is a person's age, they throw in some random ages to make individuals less identifiable. It's easier to pinpoint a 45-year-old man with diabetes than a person with diabetes who might be 38, or 51, or 62. But if the age of diabetes onset is one of the factors being studied, this privacy-protecting measure will lead to less accurate results.
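
The article describes this noise only loosely; one standard formalization is the Laplace mechanism from differential privacy, sketched here in a simplified per-record form (in formal differential privacy, the noise is usually added to aggregate query results rather than to individual entries). The ages and parameters below are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up ages of people in a sensitive data set.
ages = np.array([45, 38, 51, 62, 29, 47], dtype=float)

# Laplace noise: a smaller epsilon means more noise, hence more privacy
# but less accurate downstream analysis (the utility-privacy tension).
def add_laplace_noise(values, sensitivity=1.0, epsilon=0.5):
    scale = sensitivity / epsilon
    return values + rng.laplace(loc=0.0, scale=scale, size=values.shape)

noisy_ages = add_laplace_noise(ages)
print(ages)
print(np.round(noisy_ages, 1))  # individuals are harder to pin down,
                                # but age-related patterns are blurred too
```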

Part of the difficulty of guaranteeing privacy is that scientists are not completely sure how synthetic data reveal private information, or how to measure how much they reveal, says Florimond Houssiau, a computer scientist at the Alan Turing Institute in London. One way in which secrets could be spilled is if the synthetic data are too similar to the original data. In a data set that contains many pieces of information associated with an individual, it can be hard to grasp the statistical relationships. In this case, the system generating the synthetic version is more likely to replicate what it sees than to make up something entirely new. "Privacy is not actually that well understood," Houssiau says. Scientists can assign a numerical value to the privacy level of a data set, but "we don't exactly know which values should be considered safe or not. And so it's difficult to do that in a way that everyone would agree on."
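
One rough heuristic for this "too similar" failure mode is to measure how far each synthetic record sits from its nearest real record: values near zero suggest the generator is copying individuals rather than generalizing. The sketch below uses made-up numbers, and, as Houssiau notes, there is no agreed threshold for what counts as safe.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up real and synthetic records (rows of numeric features).
real = rng.normal(size=(500, 5))
synthetic = rng.normal(size=(500, 5))

# Distance from each synthetic row to its closest real row. Distances near
# zero hint that the generator may be replicating original individuals.
dists = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=2)
distance_to_closest = dists.min(axis=1)

print("median distance to closest real record:", np.median(distance_to_closest))
print("minimum distance:", distance_to_closest.min())
```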

The varied nature of medical data sets also makes generating synthetic versions of them challenging. They might include notes written by physicians, X-rays, temperature measurements, blood-test results and more. A medical professional with years of training and experience might be able to put those factors together and come up with a diagnosis. Machines, so far, cannot. "We just don't know enough, in terms of machine learning, to extract information from different modalities," Strohmer says. That's a problem for analysis tools, but it's also a problem for machines tasked with creating synthetic data sets that retain the all-important relationships. "We don't understand yet how to automatically detect these relationships," he says.

There are also fundamental theoretical limits to how much improvement data can undergo, says Isola. Information theory contains a principle called the data-processing inequality, which states that processing data can only reduce the amount of information available, not add to it4. And all synthetic data must have real data at their root, so all the problems with real data (privacy, bias, expense and more) still exist at the start of the pipeline. "You're not getting something for free; you're still ultimately learning from the world, from data. You're just reformatting that into an easier-to-work-with format that you can control better," Isola says. With synthetic data, data come in and a better version of the data comes out.
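
In information-theoretic terms: if the synthetic data $Z$ are generated only from the real data $Y$, which in turn were measured from the world $X$, the three form a Markov chain $X \to Y \to Z$, and the data-processing inequality gives

$$I(X; Z) \le I(X; Y),$$

where $I(\cdot\,;\cdot)$ denotes mutual information. No amount of processing can make the synthetic data more informative about the world than the real data they came from.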

Although synthetic data in medicine haven't yet made their way into clinical use, there are some areas in which such data sets have taken off. They are being widely used in finance, Strohmer says, with many companies springing up to help financial institutions create new data that protect privacy. Part of the reason for this difference might be that the stakes are lower in finance than in medicine. "If in finance you get it wrong, it still hurts, but it doesn't lead to death, so they can push things a little bit faster than in the medical field," Strohmer says.

In 2021, the US Census Bureau announced that it was looking at creating synthetic data to enhance the privacy of people who respond to its annual American Community Survey, which provides detailed information about households in subsections of the country. Some researchers have objected, however, on the grounds that the move could undermine the data's usefulness. In February, Administrative Data Research UK, a partnership that enables the sharing of public-sector data, announced a grant to study the value of synthetic versions of data sets created by the Office for National Statistics and the UK Data Service.

Some people are also using synthetic data to test software that they hope eventually to use on real data that they do not yet have access to, says Andrew Elliott, a statistician at the University of Glasgow, UK. These fake data have to look something like the real data, but they can be meaningless, because they exist only to test the code. A scientist who wants to analyse a sensitive data set to which they are granted only limited access can perfect the code first with synthetic data, rather than waste time once they get hold of the real thing.
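
A hedged sketch of what Elliott describes: a placeholder table with the same (invented) column names and types as the sensitive data set expected later, used purely to exercise the analysis code before access is granted.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Placeholder data matching the schema we expect from the sensitive table
# (column names here are made up for illustration). The values are
# meaningless; they exist only so that the analysis code can be exercised.
n = 200
dummy = pd.DataFrame({
    "record_id": np.arange(n),
    "age": rng.integers(18, 90, size=n),
    "blood_pressure": rng.normal(120, 15, size=n),
    "diagnosis": rng.choice(["A", "B", "C"], size=n),
})

def analyse(df: pd.DataFrame) -> pd.Series:
    """The analysis we want to debug before the real data arrive."""
    return df.groupby("diagnosis")["blood_pressure"].mean()

print(analyse(dummy))  # verifies that the pipeline runs end to end
```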

For now, synthetic data are a relatively niche pursuit. Van der Schaar thinks that more people should be talking about synthetic data and their potential impact, and not just scientists. "It's important that not only computer scientists understand, but also the general public," she says. People need to wrap their heads around this technology because it could affect everyone.

The issues around synthetic data not only raise interesting research questions for scientists, but also pose important issues for society at large, Strohmer says. "Data privacy is so important in the age of surveillance capitalism," he says. Creating good synthetic data that both preserve privacy and reflect diversity, and that are made widely available, has the potential not just to improve the performance of AI and expand its uses, but also to help democratize AI research. A lot of data is owned by a few big companies, and that creates an imbalance. "Synthetic data could help to re-establish this balance a little bit," Strohmer says. "I think that's an important, bigger goal behind synthetic data."
