What is data poisoning and how do we stop it? – TechRadar

Businesses are increasingly adopting machine learning models to bolster their AI systems. However, as this process becomes more and more automated, these systems are exposed to new and emerging threats to the function and integrity of AI, including data poisoning.

About the author

Spiros Potamitis is Senior Data Scientist at Global Technology Practice at SAS.

Below, discover what data poisoning is, how it threatens business systems, and finally how to defeat it and win the fight against those who wish to manipulate data for their own gain.

Before we discuss data poisoning, it's worth revisiting how machine learning models work. We train these models to make predictions by feeding them with historical data. In these data, we already know the outcome that we would like to predict in the future and the characteristics that drive this outcome. These data teach the model to learn from the past. The model can then use what it has learned to predict the future. As a rule of thumb, when more data are available to train the model, its predictions will be more accurate and stable.

AI systems that include machine learning models are normally developed by experienced data scientists. They thoroughly examine and explore the data, remove outliers and run several sanity and validation checks before, during and after the model development process. This means that, as far as possible, the data used for training genuinely reflect the outcomes that the developers want to achieve.

However, what happens when this training process is automated? This rarely occurs during development, but there are many occasions when we want models to learn continuously from new operational data: on-the-job learning. At that stage, it would not be difficult for someone to craft misleading data that feed directly into AI systems and make them produce faulty predictions.

Consider, for example, Amazon's or Netflix's recommendation engines. Think how easy it is to change the recommendations you receive by buying something for someone else. Now consider that it is possible to set up bot-based accounts to rate programs or products millions of times. This will clearly change ratings and recommendations, and poison the recommendation engine.
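To make the arithmetic concrete, here is a minimal sketch in Python. It assumes a deliberately naive engine that ranks products by their plain average star rating; the rating values and the number of bot accounts are invented for illustration and do not describe how any real recommendation engine works.

```python
def average_rating(ratings):
    """Plain mean of star ratings: the naive score a simple engine might rank by."""
    return sum(ratings) / len(ratings)

# Hypothetical data: 200 genuine ratings for a mediocre product.
genuine = [1, 2, 2, 3] * 50

# Hypothetical attack: 1,000 bot accounts each submit a five-star rating.
bots = [5] * 1000

before = average_rating(genuine)
after = average_rating(genuine + bots)

print(before)  # 2.0
print(after)   # 4.5
```

In this toy setup, the bots outnumber the genuine raters five to one, which is enough to lift a two-star product to 4.5 stars. A model retrained on these ratings would learn from the inflated score as if it were real.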

This is known as data poisoning. It is particularly easy if those involved suspect that they are dealing with a self-learning system, like a recommendation engine. All they need to do is make their attack clever enough to pass the automated data checks, which is not usually very hard.

The other issue with data poisoning is that it could be a long, slow process. Hackers can afford to take their time to change the data by feeding in a few results at a time. Indeed, this is often more effective, because it is harder to detect than a massive influx of data at a single point in time, and significantly harder to undo.
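The following Python sketch illustrates one way this can play out. It assumes a hypothetical automated check that compares the mean of each incoming batch with the batch before it, and contrasts that with a check against a fixed, trusted baseline. The tolerance, the batch means and the function name are all assumptions made for the example, not a description of any particular product.

```python
def exceeds(reference, observed, tolerance=0.5):
    """Return True when observed differs from reference by more than tolerance."""
    return abs(observed - reference) > tolerance

baseline_mean = 2.0  # hypothetical mean rating of trusted historical data

# Slow attack: the batch mean creeps up by 0.2 per batch, from 2.2 to 4.0.
slow_batches = [baseline_mean + 0.2 * i for i in range(1, 11)]

# Check A compares each batch only with the batch that came before it.
previous = [baseline_mean] + slow_batches[:-1]
step_alerts = [exceeds(p, b) for p, b in zip(previous, slow_batches)]

# Check B compares each batch with the fixed, trusted baseline.
baseline_alerts = [exceeds(baseline_mean, b) for b in slow_batches]

# Sudden attack: the mean jumps from 2.0 to 4.0 in a single batch.
sudden_alert = exceeds(baseline_mean, 4.0)

print(any(step_alerts))             # False: no single step looks suspicious
print(baseline_alerts.index(True))  # 2: the third batch crosses the tolerance
print(sudden_alert)                 # True: the sudden jump is caught at once
```

Under these assumptions, the gradual attack never trips the batch-to-batch check, while the same total shift delivered in one go is flagged immediately. Keeping a trusted reference to compare against is what exposes the slow version.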

Fortunately, there are steps that organizations can take to prevent data poisoning. These include:

1. Establish an end-to-end ModelOps process to monitor all aspects of model performance and data drifts, to closely inspect system function.

2. For automatic re-training of models, establish a business flow. This means that your model will have to go through a series of checks and validations by different people in the business before the updated version goes live.

3. Hire experienced data scientists and analysts. There is a growing tendency to assume that everything technical can be handled by software engineers, especially with the shortage of qualified and experienced data scientists. However, this is not the case. We need experts who really understand AI systems and machine learning algorithms, and who know what to look for when we are dealing with threats like data poisoning.

4. Use open source data with caution. Open source data are very appealing because they provide access to more data to enrich existing sources. In principle, this should make it easier to develop more accurate models. However, these data are just that: open. This makes them an easy target for fraudsters and hackers. The recent attack on PyPI, which flooded it with spam packages, shows just how simple this can be.
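As a rough illustration of steps 1 and 2 above, the Python sketch below computes a simple drift score between training data and new operational data, then only lets a retrained model go live if the drift is low and the required people have signed off. The Population Stability Index is one commonly used drift measure, chosen here as an assumption rather than something the steps prescribe. The 0.25 threshold is a rule of thumb, and the reviewer roles, function names and data are hypothetical.

```python
import math
from collections import Counter

def proportions(values, categories, eps=1e-6):
    """Share of values in each category, floored at eps to avoid log(0)."""
    counts = Counter(values)
    return [max(counts[c] / len(values), eps) for c in categories]

def population_stability_index(expected, actual, categories):
    """PSI: sum over categories of (actual - expected) * ln(actual / expected)."""
    e = proportions(expected, categories)
    a = proportions(actual, categories)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

def may_go_live(drift_score, approvals,
                required=("data_science", "business_owner"), max_drift=0.25):
    """Gate: low drift AND sign-off from every required (hypothetical) role."""
    return drift_score <= max_drift and all(role in approvals for role in required)

stars = [1, 2, 3, 4, 5]
training = [1, 2, 2, 3] * 50        # hypothetical trusted ratings
clean_batch = [1, 2, 2, 3] * 10     # new data with the same distribution
poisoned_batch = [1, 2, 2, 3] * 10 + [5] * 200  # new data flooded with 5s

clean_psi = population_stability_index(training, clean_batch, stars)
poisoned_psi = population_stability_index(training, poisoned_batch, stars)

print(may_go_live(clean_psi, {"data_science", "business_owner"}))     # True
print(may_go_live(clean_psi, {"data_science"}))                       # False
print(may_go_live(poisoned_psi, {"data_science", "business_owner"}))  # False
```

The gate fails closed: a missing approval or a high drift score is enough to keep the updated model out of production, which leaves time for a person to inspect the data before anything changes for users.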

It is vital that businesses follow the recommendations above to defend against the threat of data poisoning. However, there remains a crucial means of protection that often gets overlooked: human intervention. While businesses can automate their systems as much as they like, it is paramount that they rely on the trained human eye to ensure effective oversight of the entire process. This helps to prevent data poisoning from the outset, allowing organizations to innovate through insights, with their AI assistants beside them.
