Join our MCubed web lecture this week to find out how to get machine learning into production – The Register

Special series If you've ever worked with an application that uses some form of machine learning, you'll know that some component or other is always evolving. If it isn't the training data that's changing, you'll surely come across a model that needs updating, and if all is well in those areas, there's a good chance a feature request is waiting for implementation, so code modifications are due.

In regular software projects, we already know how to automatically take care of changes and make sure that we have a way of keeping our systems up to date without (too many) manual steps. The number of variables at play in ML, however, makes it really tricky to come up with similar processes in that discipline, which is why this is often cited as one of the major roadblocks in getting machine-learning-based applications into production.

For the second episode of our MCubed webcast, on October 7, we therefore decided to sit down with you and have an in-depth look at how to tackle the operational side of ML. Joining in will be DevOps and data expert Danilo Sato, who helped quite a few organisations set up a comprehensible continuous delivery (CD) workflow for their machine-learning projects.

You might know Danilo from a popular article series on CD4ML, however his work reaches far beyond that. In his 2014 book DevOps in Practice: Reliable and Automated Software Delivery, he shared insights from working on all sorts of platform modernisation and data engineering projects.

On the webcast, Danilo will discuss how the principles of Continuous Delivery apply to machine-learning applications, and walk you through the technical components necessary to implement a system that takes care of CD for your ML project. He'll also cover the differences between MLOps and CD4ML, take a closer look at the peculiarities of version control and artifact repositories in ML projects, give you some tips on what to observe, and introduce you to the many different ways a model can be deployed.

And in case you have all of this figured out already, Danilo will provide a look into the future of machine-learning infrastructure as well as give you some food for thought on open challenges such as explainability and auditability.

The MCubed webcast on October 7 will start at 11am BST (noon CEST) with a roundup of the latest in machine-learning-related software development news, and then it's straight on to the talk.

Don't forget to let us know if you have any topics you'd like to learn more about, or if you are interested in practical experience reports from specific industries: we really want to make these webcasts worth your time, so every hint helps. Also, reach out if you want to share some tricks yourself; we always love to hear from you!

Register here to get a quick reminder on the day; we're really looking forward to seeing you on Thursday.

See the rest here:
Join our MCubed web lecture this week to find out how to get machine learning into production - The Register

The Evolution Of Data Science And AI At The New York Times – Forbes

Data science and machine learning are evolving in just about every single industry. The adoption of AI at companies continues to grow, and AI developers are trying to prove that machine learning can add value to different parts of the company. Not surprisingly, journalism, an industry whose primary focus is the communication of ideas in both text and visual format, has come to adopt the tools and techniques of data science to put power behind analysis and visualization of data.

The New York Times (NYT) has had a data science group since 2012, but only recently has this group moved out of the experimental phase and taken a major role in the company, adding value through machine learning. The Director of Data Science at the New York Times, Colin Russel, will be sharing some of the insights learned from the NYT data science team at an upcoming Data for AI event on November 4, 2021. Colin uses his background in predictive modeling and in designing and applying machine learning algorithms to turn the Times' vast quantities of data into models and visualizations that can help different segments of the company. In this article, we share some of his insights into where data science is heading at the NYT and beyond, as well as insights previously shared by the NYT at the Data for AI conference in 2020.

Applications of AI

Colin Russel, New York Times

The NYT has invested in building out different machine learning teams that combine aspects of data science, data analytics, and engineering. These teams are centralized, with different data science groups working with the newsroom, others with marketing, and others with different business operations. Although each of these teams is focused on a different aspect of the company's overall mission, they are all looking to build a machine learning platform that can take all of the overlapping deployment and infrastructure development and centralize it for overall use.

Traditionally, the newsroom and editorial operations are separate from the business side of the company, for obvious reasons of conflict of interest and maintaining a separation between revenue-generating and news-generating activities. Because the data journalism side and the data science side of the organization are separate, there is also a separation of culture. This makes working in AI at a large company challenging, and it is crucial to have clear and constant communication around the process and goals of AI implementation.

The use of data to drive decision-making and insights is spread across the entirety of the organization, however, with data analysis being used to power both business decisions as well as journalistic and editorial insights. The newsroom is very interested in data and understanding the audience of the NYT in a world where many people are getting their news from social media. Likewise, operations is interested in data-driven insights to improve advertising performance, deliver optimized content to readers, and generate more visibility of various operations and offerings.

Technology for AI

While many companies outsource their AI tools, the NYT is focused on building, not buying. Implementing AI technology is often not the hardest part of a project; the challenge is usually engineering, organizing, and manipulating the data to where it can be efficiently modeled. Years ago, data was all over the place, and as a data scientist trying to use data from different sectors of the company, you needed to get credentials for every different part. Add the difficulty of obtaining data to the difficulty of deciding which parts of the data are appropriate for the model, and the actual technology for AI becomes a smaller issue.

Due to the different areas of focus and priorities for different parts of the company, AI developers must figure out how to balance the competing concerns. The NYT recently went through an overhaul where they wanted to consolidate data on the cloud. This gave them the opportunity to start fresh and make it easy to upload data from different parts of the company.

Dealing with Variability

Data science and machine learning models are verified and evaluated to measure baseline performance as well as to test model improvements that are being developed. One of the main difficulties in taking advantage of AI is quantifying the goal and choosing the metric that you want to optimize. In the news and journalism industry, there is a lot of variability based on news cycles. For example, the Covid-19 pandemic has changed the company a lot, as it is now giving free access to Covid-19 related news. The subscription business, which wants as many subscribers as possible, now has a public service component and believes that free access to information at a certain level is very important.

Certain types of recommendation algorithms respond better in different types of news cycles. Models are retrained as a matter of protocol, and a model's performance must be interpreted in the context of the news cycle. To evaluate the quality of a model, performance must be measured over a longer period because of news cycles and environmental effects. Figuring out which models to use in each news cycle is a challenge that Colin and his team are looking to solve.
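As a rough illustration of that longer-horizon view, the sketch below scores a model's predictions over weekly windows rather than a single snapshot. The column names, the weekly granularity and the AUC metric are assumptions for illustration; the NYT's actual evaluation setup is not public.

```python
# Minimal sketch: judge a model's quality per time window rather than once,
# so that news-cycle effects show up in the evaluation.
import pandas as pd
from sklearn.metrics import roc_auc_score

def rolling_auc(df: pd.DataFrame, freq: str = "W") -> pd.Series:
    """AUC of the predicted click probability ('score') against the observed
    outcome ('clicked'), computed separately for each time window."""
    grouped = df.groupby(pd.Grouper(key="timestamp", freq=freq))
    return grouped.apply(
        lambda g: roc_auc_score(g["clicked"], g["score"])
        if g["clicked"].nunique() == 2 else float("nan")
    )

# Usage: rolling_auc(predictions_df), where predictions_df has columns
# 'timestamp', 'clicked' (0/1) and 'score' (model output).
```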

Implementing AI and ML algorithms can be a challenge in any company and determining the technology, metrics, and data to be used is very difficult. The New York Times handles these issues daily, with greater details and insights to be shared at the upcoming Data for AI event.

Here is the original post:
The Evolution Of Data Science And AI At The New York Times - Forbes

Why predictive modelling and machine learning can revolutionise the biopharma industry – PharmaTimes

Implementing AI in biopharma and healthcare

But the first crucial step in unlocking the power of predictive modelling and machine learning is the generation of a robust and well-maintained data backbone. No matter how each company achieves this, they must each build in a high level of interoperability, not only to facilitate the transition from siloed to streamlined data management within various company functions but more broadly to facilitate the common usage and interrogation of data within the industry. Such data federation requires the aggregation of data from a myriad of sources and the integration of this data into a common model.

Achieving this integration at an industry-wide level is challenging. The complex landscape of equipment and instruments leads to inaccessible data silos with important context and process knowledge locked away in paper records and spreadsheets. To overcome this, leading organisations are forging relationships with science and technology companies and embracing new innovations to help accelerate their digital transformation and realise the potential of AI and Biopharma 4.0.

IDBS, one of the Danaher operating companies, has recently launched the world's first biopharma life cycle management (BPLM) platform, a new cloud-based data management platform that enables biopharma companies to more efficiently design, scale and run their processes. An embedded integration layer simplifies the curation of a process data backbone that powers data analytics and machine learning. By eliminating repetitive manual tasks, users are able to focus on developing more robust and scalable processes, supported by easily accessible data and actionable insights that accelerate process optimisation, technology transfer and regulatory filings.

Dedicated solutions are also being developed to help overcome specific challenges in the biopharma life cycle. For example, predictive modelling can be leveraged to scale up bioproduction processes, and to perform fast, efficient and affordable experiments in silico. Another Danaher operating company, Cytiva, has created a digital bioreactor scaling tool, which allows users to scale bioreactor processes from the bench to its XDR bioreactor platform and vice versa with a high degree of accuracy. The Cytiva Digital Bioreactor Scaler takes into account process parameters, along with bioreactor and cell line characteristics, and will propose options for scaling from process development to manufacturing scale.

We believe that in silico simulations will play a huge role in the future of bioprocessing. Earlier this year Cytiva acquired German scientific software manufacturer GoSilico, whose technologies create predictions to assist in the development of downstream chromatography purification processes. Process development is an intense and time-consuming part of making any therapy. By using the Cytiva Digital Bioreactor Scaler and mechanistic models to test different options for upstream and downstream processes, drug developers can expect to predictably manage resources, time and risk, and help in process understanding across the organisation. This sort of digital innovation will provide transformational capabilities for our industry, at a time when speed to market is more important than ever.

Embracing disruption

Broader applications of these groundbreaking technologies are going to radically disrupt the industry, all the way from the laboratory bench to point of care. The amount and complexity of health data at point of care is growing at an exponential rate, with increasing adoption of electronic health records (EHRs), wearable health apps and trackers that collect long-term personalised data, and services such as direct-to-consumer genetic testing that can be integrated into social networks. Meanwhile, machine learning is already being applied to quickly and accurately analyse hundreds of radiological images, enabling precise decisions as reliable as those made by trained and experienced physicians.

As digitalisation becomes increasingly fundamental to healthcare, it is imperative that the biopharma industry also does everything it can to embrace these disruptive technologies so we can all move towards a more personalised and precise approach to medicine. R&D experiments performed in silico are providing new insights into health and disease biology, revealing novel treatment targets, and providing new ways to precisely hit those targets. These digital experiments can also analyse huge data sets to identify novel biomarkers for delivering personalised therapy: ensuring that the right patient is treated with the right drug at the right time.

Building on federated data in robust data backbones, predictive modelling with machine learning is what will move our industry away from the inefficient empirical approach currently used in drug discovery and process development. Instead of carrying out hundreds of time-consuming, costly and inefficient wet bench experiments to identify what succeeds, we will first experiment in silico by using highly trained machine learning-based models to quickly and accurately predict the outcomes.

Positive outcomes will then be tested in a few select experiments, which will also test the validity and utility of the digital models. Validated models and experimental data will then inform decision-making.
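As a rough sketch of that loop, the example below trains a surrogate model on historical wet-bench results, screens a large number of candidate process settings in silico, and selects only the most promising few for physical validation. The synthetic data, the random-forest surrogate and the parameter names are assumptions for illustration, not any vendor's actual model.

```python
# Minimal sketch of "predict in silico first, validate at the bench later".
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

# Historical wet-bench runs: scaled process parameters -> measured yield.
X_train = rng.uniform(size=(200, 4))   # e.g. pH, temperature, feed rate, duration (all scaled)
y_train = X_train @ np.array([0.4, 0.3, 0.2, 0.1]) + rng.normal(0, 0.05, 200)

surrogate = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_train, y_train)

# Screen many candidate settings digitally ...
candidates = rng.uniform(size=(10_000, 4))
predicted_yield = surrogate.predict(candidates)

# ... and send only the top few to the wet bench, which also tests the model itself.
top_candidates = candidates[np.argsort(predicted_yield)[-5:]]
print("Settings selected for bench validation:\n", top_candidates)
```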

Through predictive modelling and beyond, using machine learning and other forms of AI and data science, we will be able to push the boundaries of what can be achieved by the biopharma industry. It will require investment, innovation and a mindset shift, but will enable us to deliver better medicines, faster, more accessibly, and with greater personalisation towards patients.

Kevin Chance is vice president of Danaher Corporation

Here is the original post:
Why predictive modelling and machine learning can revolutionise the biopharma industry - PharmaTimes

Kennesaw State professor awarded NSF grant to teach students detection of cybersecurity attacks – Kennesaw State University

Hossain Shahriar

KENNESAW, Ga. (Oct 11, 2021) Kennesaw State University information technology professor Hossain Shahriar, along with colleagues Dan Lo and Michael Whitman, has been awarded a National Science Foundation (NSF) grant to develop hands-on, interactive materials for students to recognize cybersecurity threats.

Shahriar and his team will use the nearly $280,000 grant from the NSF Secure and Trustworthy Cyberspace Program to create 10 modules to solve common cybersecurity problems using machine learning algorithms. Whitman is executive director of the Institute of Cybersecurity Workforce Development and a professor of information security and assurance, and Lo is a professor of computer science.

"Machine learning technologies are being used by large companies like Apple and Google to protect our personal data, detect and filter spam emails, identify phishing websites and detect malware," said Shahriar, a professor in the College of Computing and Software Engineering.

Shahriar wants students to learn about this technology early in their degree program so they develop marketable skillsets for future employers.

"We want to ensure students have more confidence tackling these cyber challenges," said Shahriar. "We start with the basics. For example, students should be comfortable detecting whether an email they get is spam or a normal email by using machine learning techniques."
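A starter exercise along those lines might look something like the sketch below, which trains a simple bag-of-words classifier to separate spam from normal email. The tiny inline corpus is purely illustrative; the actual course modules have not been published.

```python
# Minimal sketch of spam detection with machine learning.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

emails = [
    "Win a free prize now, click here",
    "Meeting moved to 3pm, agenda attached",
    "Cheap pills, limited offer, act fast",
    "Can you review my draft before Friday?",
]
labels = [1, 0, 1, 0]   # 1 = spam, 0 = normal

clf = make_pipeline(CountVectorizer(), MultinomialNB()).fit(emails, labels)

print(clf.predict(["Claim your free offer now"]))          # expected: spam
print(clf.predict(["Lunch tomorrow after the standup?"]))  # expected: normal
```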

Shahriar, who teaches several courses at KSU including one on ethical hacking and network security, said cybersecurity threats are constantly evolving because hackers also use machine learning techniques to generate new malware. In addition to research done at KSU, Shahriar will partner with faculty at Tuskegee University in Alabama to introduce materials into their curriculums.

"Thats very significant for this project because our goals include increasing the cybersecurity knowledge base and workforce around the country and increasing participation from underrepresented, minority students, he said. Tuskegee University is a prominent historically Black university, and we are excited about this collaboration.

The research team, which includes undergraduate and graduate students at KSU, plans to spend a year building online modules and then piloting them in the classroom. Once that is complete, Shahriar hopes to disseminate those modules to professors around the Southeast to use in their undergraduate and graduate curriculums. KSU's Burruss Institute of Public Service and Research and Hillary Steiner, Associate Director for the Scholarship of Teaching and Learning at the Center for Excellence in Teaching and Learning, will also have a role in evaluating the project outcomes.

As cybersecurity threats and hackers evolve, Shahriar said it's more important than ever for students to have the skills to detect them.

Abbey O'Brien Barrows

A leader in innovative teaching and learning, Kennesaw State University offers undergraduate, graduate and doctoral degrees to its nearly 43,000 students. With 11 colleges on two metro Atlanta campuses, Kennesaw State is a member of the University System of Georgia. The university's vibrant campus culture, diverse population, strong global ties and entrepreneurial spirit draw students from throughout the country and the world. Kennesaw State is a Carnegie-designated doctoral research institution (R2), placing it among an elite group of only 6 percent of U.S. colleges and universities with an R1 or R2 status. For more information, visit kennesaw.edu.

More:
Kennesaw State professor awarded NSF grant to teach students detection of cybersecurity attacks - Kennesaw State University

What Is Machine Learning, and How Does It Work? Here’s a Short Video Primer – Scientific American

Machine learning is the process by which computer programs grow from experience.

This isn't science fiction, where robots advance until they take over the world.

When we talk about machine learning, we're mostly referring to extremely clever algorithms.

In 1950, mathematician Alan Turing argued that it's a waste of time to ask whether machines can think. Instead, he proposed a game: a player has two written conversations, one with another human and one with a machine. Based on the exchanges, the human has to decide which is which.

This imitation game would serve as a test for artificial intelligence. But how would we program machines to play it?

Turing suggested that we teach them, just like children. We could instruct them to follow a series of rules, while enabling them to make minor tweaks based on experience.

For computers, the learning process just looks a little different.

First, we need to feed them lots of data: anything from pictures of everyday objects to details of banking transactions.

Then we have to tell the computers what to do with all that information.

Programmers do this by writing lists of step-by-step instructions, or algorithms. Those algorithms help computers identify patterns in vast troves of data.

Based on the patterns they find, computers develop a kind of model of how that system works.

For instance, some programmers are using machine learning to develop medical software. First, they might feed a program hundreds of MRI scans that have already been categorized. Then, they'll have the computer build a model to categorize MRIs it hasn't seen before. In that way, that medical software could spot problems in patient scans or flag certain records for review.
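In code, that train-then-categorize workflow might look roughly like the sketch below. The synthetic arrays stand in for labelled scans; real medical imaging models are far more elaborate, so treat this only as an outline of the idea.

```python
# Minimal sketch: learn from scans that are already categorized,
# then categorize scans the model has not seen before.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
scans = rng.normal(size=(300, 64 * 64))                  # flattened toy "images"
labels = (scans[:, :10].mean(axis=1) > 0).astype(int)    # toy "needs review" flag

X_seen, X_new, y_seen, _ = train_test_split(scans, labels, test_size=0.2, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_seen, y_seen)  # build the model from labelled scans
flags = model.predict(X_new)                                   # apply it to unseen scans
print(f"{flags.sum()} of {len(flags)} new scans flagged for review")
```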

Complex models like this often require many hidden computational steps. For structure, programmers organize all the processing decisions into layers. That's where deep learning comes from.

These layers mimic the structure of the human brain, where neurons fire signals to other neurons. That's why we also call them neural networks.

Neural networks are the foundation for services we use every day, like digital voice assistants and online translation tools. Over time, neural networks improve in their ability to listen and respond to the information we give them, which makes those services more and more accurate.

Machine learning isn't just something locked up in an academic lab, though. Lots of machine learning algorithms are open-source and widely available. And they're already being used for many things that influence our lives, in large and small ways.

People have used these open-source tools to do everything from training their pets to creating experimental art to monitoring wildfires.

They've also done some morally questionable things, like create deepfakes: videos manipulated with deep learning. And because the data and algorithms that machines use are produced by fallible human beings, they can contain biases. Algorithms can carry the biases of their makers into their models, exacerbating problems like racism and sexism.

But there is no stopping this technology. And people are finding more and more complicated applications for it, some of which will automate things we are accustomed to doing for ourselves, like using neural networks to help power driverless cars. Some of these applications will require sophisticated algorithmic tools, given the complexity of the task.

And while that may be down the road, the systems still have a lot of learning to do.

View post:
What Is Machine Learning, and How Does It Work? Here's a Short Video Primer - Scientific American

Getting machine learning into production is hard the MCubed webcast is here for support DEVCLASS – DevClass

The MCubed webcast returns this week to tackle a whole other beast: Continuous Delivery for Machine Learning. Join us on October 7th at 11am BST (that's 12 o'clock for you CEST peeps) to get into the nitty gritty of the operational side of ML.

If you've ever worked with an application that uses some form of machine learning, you'll know that some component or other is always evolving: If it isn't the training data that's changing, you'll surely come across a model that needs updating, and if all is well in those areas, there's a good chance a feature request is waiting for implementation, so code modifications are due.

In regular software projects, we already know how to automatically take care of changes and make sure that we have a way of keeping our systems up to date without (too many) manual steps. The number of variables at play in ML, however, makes it really tricky to come up with similar processes in that discipline, which is why this is often cited as one of the major roadblocks in getting machine learning-based applications into production.

For the second episode of our free MCubed webcast on October 7th, we therefore decided to sit down with you and have an in-depth look at how to tackle the operational side of ML. Joining in will be DevOps and data expert Danilo Sato, who helped quite a few organisations to set up a comprehensible continuous delivery (CD) workflow for their machine learning projects.

You might know Mr Sato from a popular article series on CD4ML, however his work reaches far beyond that. In his 2014 book DevOps in Practice: Reliable and Automated Software Delivery, he shared insights from working on all sorts of platform modernisation and data engineering projects that also informed some of the good practices he has recently investigated.

On the webcast, Sato will discuss how the principles of Continuous Delivery apply to machine learning applications, and walk you through the technical components necessary to implement a system that takes care of CD for your ML project. He'll also cover the differences between MLOps and CD4ML, take a closer look at the peculiarities of version control and artifact repositories in ML projects, give you some tips on what to observe, and introduce you to the many different ways a model can be deployed.

And in case you have all of this figured out already, Danilo Sato will provide a look into the future of machine learning infrastructure as well as give you some food for thought on open challenges such as explainability and auditability.

The MCubed webcast on October 7th will start at 11am BST (12pm CEST) with a roundup of the latest in machine learning-related software development news, but then it's straight on to the talk.

Don't forget to let us know if you have any topics you'd like to learn more about, or if you are interested in practical experience reports from specific industries: we really want to make these webcasts worth your time, so every hint helps. Also, reach out if you want to share some tricks yourself; we always love to hear from you!

Register here to receive a quick reminder on the day; we're really looking forward to seeing you on Thursday!

Go here to read the rest:
Getting machine learning into production is hard the MCubed webcast is here for support DEVCLASS - DevClass

Immunis.AI Chosen by Amazon Web Services to Showcase its Cloud-Based Genomic Pipeline for Machine Learning – Business Wire

ROYAL OAK, Mich.--(BUSINESS WIRE)--Immunis.AI, Inc., an immunogenomics platform company developing noninvasive blood-based tests to optimize patient care, today announced that Amazon Web Services (AWS) will showcase the company's cloud-based genomic pipeline for machine learning. In collaboration with Mission Cloud Services, the platform will be highlighted in a virtual event, Behind the Innovation, hosted by AWS, today.

Immunis.AI engaged Mission, a partner with deep life science expertise, to design an AWS architecture to leverage Amazon S3 alongside a backend data pipeline using Amazon EC2 and Amazon EBS infrastructure. The challenge Immunis.AI faced was data ingestion and real-time analytics of its large immunotranscriptomic data sets and parallel processing of thousands of samples through its machine learning pipelines. Through the collaboration with Mission and AWS, the ingestion of data by Immunis.AI, which took two weeks to finish manually, can be completed within hours.
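To give a flavour of what parallel ingestion from such an architecture can look like, here is a minimal sketch that pulls many sample files out of Amazon S3 concurrently before handing them to downstream processing. The bucket name, prefix and local handling are illustrative assumptions; Immunis.AI's actual pipeline has not been published.

```python
# Minimal sketch: list objects under an S3 prefix and download them in parallel.
from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
BUCKET = "example-immunotranscriptomic-data"   # hypothetical bucket name
PREFIX = "samples/"

def list_sample_keys(bucket: str, prefix: str):
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        for obj in page.get("Contents", []):
            yield obj["Key"]

def ingest(key: str) -> str:
    local_path = "/tmp/" + key.replace("/", "_")
    s3.download_file(BUCKET, key, local_path)   # downstream analytics picks up from here
    return local_path

with ThreadPoolExecutor(max_workers=32) as pool:
    paths = list(pool.map(ingest, list_sample_keys(BUCKET, PREFIX)))
print(f"Ingested {len(paths)} samples")
```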

The virtual event will be held Tuesday, October 5th, from 9 to 10:30 a.m. Pacific Time (12 to 1:30 p.m. Eastern Time). For more information, or to register for the virtual event, click here.

"Machine learning presents its own set of unique challenges, and managing large data sets is a major problem in the field of genomics. Working with Mission and AWS to design an architecture to streamline our data ingestion and analytics has enabled us to drastically accelerate development of our immunogenomic tests to improve diagnosis and treatment of cancer patients," said Geoffrey Erickson, a founder and Senior VP of Corporate Development at Immunis.AI, who will be presenting at the event. "We are pleased to have been chosen by AWS to highlight our architecture, our powerful partnership, and the life-changing outcomes it enables."

"While it is still evolving, Mission provided Immunis.AI with a tested and proven blueprint for a viable research-oriented genomic platform, all backed by AWS and ready to scale quickly and economically," said Jonathan LaCour, Chief Technology Officer & Sr. Vice President, Service Delivery. "We are pleased to support Immunis.AI's important mission to develop tests that can improve the lives of cancer patients, and proud of the Mission-built AWS infrastructure that is helping them."

Immunis.AI continues to leverage its successful blueprint across several clinical studies, as it develops and plans to commercialize its products. Mission will also continue to help Immunis.AI with data modernization, including data lake and analytics initiatives on AWS.

About Immunis.AI, Inc.

IMMUNIS.AI is a privately held immunogenomics company with a patented liquid biopsy platform that offers unique insights into disease biology and individualized assessment. The Intelligentia platform combines the power of the immune system, RNAseq technology and Machine Learning (ML) for the development of disease-specific signatures. This proprietary method leverages the immune system's surveillance apparatus to overcome the limitations of circulating tumor cells (CTCs) and cell-free DNA (cfDNA). The platform improves detection of early-stage disease, at the point of immune escape, when there is the greatest opportunity for cure. For more information, please visit our website: https://immunis.ai/

Read more:
Immunis.AI Chosen by Amazon Web Services to Showcase its Cloud-Based Genomic Pipeline for Machine Learning - Business Wire

Why automation, artificial intelligence and machine learning are becoming increasingly critical for SOC operations – Security Magazine


Read the rest here:
Why automation, artificial intelligence and machine learning are becoming increasingly critical for SOC operations - Security Magazine

Come And Do Research In Particle Physics With Machine Learning In Padova! – Science 2.0

I used some spare research funds to open a six-month internship to help my research group in Padova, and the call is open for applications at this site (the second in the list right now, the number is #23584). So here I wish to answer a few questions from potential applicants, namely:
1) Can I apply?
2) When is the call deadline?
3) What is the salary?
4) What is the purpose of the position? What can I expect to gain from it?

5) What will I be doing if I get selected?

Answers:

1 - You can apply if you have completed a master's degree in a scientific discipline (physics, astronomy, mathematics, statistics, computer science) not earlier than one year ago. You are supposed to possess some programming skills, although your wish to learn is more important than your knowledge base.

2 - The deadline is October 16. The application process is simple, but you want to look into the electronic procedure early on to verify that you have the required documents.

3 - The salary is in line with the wage of Ph.D. students enrolled in the course in Padova. I do not know the net after taxation, but it is of the order of 1100 euros per month. This is not a lot of money, but it is enough to live on in Padova as a student. You won't get rich, but your focus should be to gain experience and titles for your future career!

4 - The purpose of the internship is to endow the recipient with skills in machine learning applied to fundamental physics research. Ideally, the recipient would be interested in applying for a Ph.D. at the University of Padova after finishing the internship, and the research work would be a very useful asset for his or her CV, along with the probable authorship of a publication on machine learning applications to particle physics; but the six months of work may also be good training for graduates who wish to move out of academia and pursue a career in industry. The point is that what we will be working on together is a topic at the real bleeding edge of innovative applications of deep learning - something which will be invaluable in the future both in research and in industry. I will explain more about what this involves below.

5 - If you get selected, you will join my research team, which is embedded in the MODE collaboration of which I am the leader. We want to use differentiable programming techniques (available through python libraries offered by packages such as Pytorch or TensorFlow) to create software that studies the end-to-end optimization of complex instruments used for particle physics research or for industrial applications such as muon tomography or proton therapy.

More in detail, we are currently tackling an "easy" application of deep-learning-powered end-to-end optimization, which consists in finding the most advantageous layout of detection elements in a muography apparatus. Muon tomography consists in detecting the flow of cosmic-ray muons in and out of an unknown volume, of which we wish to determine the inner material distribution. This has applications to volcanology (where is the magma?), archaeology (study of hidden chambers in ancient buildings or pyramids), foundries (where is the melted material?), nuclear waste disposal (is there uranium in this box of scrap metal?), or detecting defects in pipelines or other industrial equipment.

To find the optimal layout we treat the geometry, technology and cost of the detector as parameters, and we find the optimal solution by maximizing a utility function that combines how well the imaging is performed in a given time with the cost of the apparatus and other constraints. So this is a relatively simple application of differentiable programming - you can pull it off if you model the various elements of the problem with continuous functions. If we manage to create a good software product we will share it freely, and then move on to some harder detector optimization problem (there is a long list of candidates). In parallel, we are actually starting to study the optimization of calorimeters for particle detectors, a much, much more ambitious project that a fresh new Ph.D. student working with me, Federico Nardi, will investigate, again within the MODE collaboration.
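To make the idea concrete, here is a minimal PyTorch sketch of the gradient-based part: layout parameters are tensors with gradients, and we ascend a utility that trades imaging quality against cost. The toy utility and the panel-position parameterization are my own illustrative assumptions; the real MODE objective involves a full differentiable model of muon transport and reconstruction.

```python
# Minimal sketch of end-to-end layout optimization by gradient ascent on a utility.
import torch

panel_positions = torch.tensor([0.2, 0.5, 0.8], requires_grad=True)  # detector panel heights (toy units)

def utility(pos: torch.Tensor) -> torch.Tensor:
    # Toy proxy: imaging resolution benefits from spreading the panels apart,
    # while a cost term penalizes panels placed far from the instrumented region.
    spread = pos.max() - pos.min()
    cost = (pos - 0.5).pow(2).sum()
    return spread - 0.5 * cost

optimizer = torch.optim.Adam([panel_positions], lr=0.05)
for step in range(200):
    optimizer.zero_grad()
    loss = -utility(panel_positions)   # maximize utility by minimizing its negative
    loss.backward()
    optimizer.step()

print("Optimized panel positions:", panel_positions.detach())
```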

So, if you are a bright master's graduate and you want to deepen your skills in machine learning, please consider applying! If we select you, we will have loads of fun attacking these hard problems together!

If you need more information, please feel free to email me at dorigo (at) pd (dot) infn (dot) it. Thank you for your interest, and share this information with other potential applicants!

Read more:
Come And Do Research In Particle Physics With Machine Learning In Padova! - Science 2.0

What is data poisoning and how do we stop it? – TechRadar

The latest trend in business is the adoption of machine learning models to bolster AI systems. However, as this process becomes more and more automated, it naturally puts these systems at greater risk from new and emerging threats to the function and integrity of AI, including data poisoning.

About the author

Spiros Potamitis is Senior Data Scientist at Global Technology Practice at SAS.

Below, discover what data poisoning is, how it threatens business systems, and finally how to defeat it and win the fight against those who wish to manipulate data for their own gain.

Before we discuss data poisoning, it's worth revisiting how machine learning models work. We train these models to make predictions by feeding them with historical data. From these data, we already know the outcome that we would like to predict in the future and the characteristics that drive this outcome. These data teach the model to learn from the past. The model can then use what it has learned to predict the future. As a rule of thumb, when more data are available to train the model, its predictions will be more accurate and stable.
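That rule of thumb is easy to see in a quick experiment: train the same model on progressively larger slices of a dataset and watch the held-out accuracy rise and its variance shrink. The synthetic dataset below is purely illustrative.

```python
# Minimal sketch: more training data tends to give more accurate, more stable predictions.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=5000, n_features=20, random_state=0)

for n in (100, 500, 2500, 5000):
    scores = cross_val_score(LogisticRegression(max_iter=500), X[:n], y[:n], cv=5)
    print(f"n={n:5d}  mean accuracy={scores.mean():.3f}  std={scores.std():.3f}")
```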

AI systems that include machine learning models are normally developed by experienced data scientists. They thoroughly examine and explore the data, remove outliers and run several sanity and validation checks before, during and after the model development process. This means that, as far as possible, the data used for training genuinely reflect the outcomes that the developers want to achieve.

However, what happens when this training process is automated? This does not often occur during development, but there are many occasions when we want models to continuously learn from new operational data: on-the-job learning. At that stage, it would not be difficult for someone to develop misleading data that would feed directly into AI systems and make them produce faulty predictions.

Consider, for example, Amazon or Netflix's recommendation engines. Think how easy it is to change the recommendations you receive by buying something for someone else. Now consider that it is possible to set up bot-based accounts to rate programs or products millions of times. This will clearly change ratings and recommendations, and poison the recommendation engine.
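The mechanics of such an attack are trivial to demonstrate. In the sketch below, a recommender that simply ranks items by average rating is flipped by a flood of bot votes; the item names and rating counts are made up for illustration.

```python
# Minimal sketch: bot ratings skew an average-rating recommender.
import pandas as pd

genuine = pd.DataFrame({
    "item":   ["A"] * 900 + ["B"] * 900,
    "rating": [4.5] * 900 + [3.0] * 900,
})
bot_votes = pd.DataFrame({"item": ["B"] * 5000, "rating": [5.0] * 5000})

before = genuine.groupby("item")["rating"].mean()
after = pd.concat([genuine, bot_votes]).groupby("item")["rating"].mean()

print("Top recommendation before poisoning:", before.idxmax())  # A
print("Top recommendation after poisoning: ", after.idxmax())   # B
```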

This is known as data poisoning. It is particularly easy if those involved suspect that they are dealing with a self-learning system, like a recommendation engine. All they need to do is make their attack clever enough to pass the automated data checks, which is not usually very hard.

The other issue with data poisoning is that it could be a long, slow process. Hackers can afford to take their time to change the data by feeding in a few results at a time. Indeed, this is often more effective, because it is harder to detect than a massive influx of data at a single point in time, and significantly harder to undo.

Fortunately, there are steps that organizations can take to prevent data poisoning. These include:

1. Establish an end-to-end ModelOps process to monitor all aspects of model performance and data drift, to closely inspect system function (a minimal drift check is sketched after this list).

2. For automatic re-training of models, establish a business flow. This means that your model will have to go through a series of checks and validations by different people in the business before the updated version goes live.

3. Hire experienced data scientists and analysts. There is a growing tendency to assume that everything technical can be handled by software engineers, especially with the shortage of qualified and experienced data scientists. However, this is not the case. We need experts who really understand AI systems and machine learning algorithms, and who know what to look for when we are dealing with threats like data poisoning.

4. Use open source with caution. Open-source data are very appealing because they provide access to more data to enrich existing sources. In principle, this should make it easier to develop more accurate models. However, these data are just that: open. This makes them an easy target for fraudsters and hackers. The recent attack on PyPI, which flooded it with spam packages, shows just how simple this can be.
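As one concrete ingredient of the monitoring mentioned in step 1, the sketch below compares the distribution of an incoming feature against its training distribution with a two-sample Kolmogorov-Smirnov test and raises an alert when it drifts. The threshold and the synthetic data are illustrative assumptions, not a complete ModelOps setup.

```python
# Minimal sketch: flag a feature whose live distribution has drifted from training.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(1)
training_feature = rng.normal(loc=0.0, scale=1.0, size=10_000)
incoming_feature = rng.normal(loc=0.4, scale=1.0, size=2_000)   # slowly poisoned or drifted data

statistic, p_value = ks_2samp(training_feature, incoming_feature)
if p_value < 0.01:
    print(f"Drift alert: KS statistic {statistic:.3f}, p={p_value:.1e}; hold automatic retraining for review")
else:
    print("Incoming feature looks consistent with the training distribution")
```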

It is vital that businesses follow the recommendations above to defend against the threat of data poisoning. However, there remains a crucial means of protection that often gets overlooked: human intervention. While businesses can automate their systems as much as they would like, it is paramount that they rely on the trained human eye to ensure effective oversight of the entire process. This prevents data poisoning from the outset, allowing organizations to innovate through insights, with their AI assistants beside them.

Continue reading here:
What is data poisoning and how do we stop it? - TechRadar