In recent years, machine learning has been very successful in solving a wide range of problems.
In particular, neural networks have reached human, and sometimes super-human, levels of ability in tasks such as language translation, object recognition, game playing, and even driving cars.
With this growth in capability has come a growth in complexity. Data scientists and machine learning engineers must perform feature engineering, design model architectures, and optimize hyperparameters.
Since the purpose of machine learning is to automate tasks normally done by humans, the natural next step is to automate the tasks of data scientists and engineers.
This area of research is called automated machine learning, or AutoML.
There have been many exciting developments in AutoML recently, making this a good time to examine the current state of the art and learn what is happening now and what is coming in the future.
InfoQ reached out to the following subject matter experts in the industry to discuss the current state and future trends in the AutoML space.
InfoQ: What is AutoML and why is it important?
Francesca Lazzeri: AutoML is the process of automating the time-consuming, iterative tasks of machine learning model development, including model selection and hyperparameter tuning. When automated systems are used, the high costs of running a single experiment (e.g., training a deep neural network) and the high sample complexity (i.e., the large number of experiments required) can be decreased. AutoML is important because data scientists, analysts, and developers across industries can leverage it to:
Matthew Tovbin: Similarly to how we use software to automate repetitive or complex processes, automated machine learning is a set of techniques we apply to efficiently build predictive models without manual effort. Such techniques include methods for data processing, feature engineering, model evaluation, and model serving. With AutoML, we can focus on higher-level objectives, such as answering questions and delivering business value faster, while avoiding mundane tasks, e.g., data wrangling, by standardizing the methods we apply.
Adrian de Wynter: AutoML is the idea that the machine learning process, from data selection to modeling, can be automated by a series of algorithms and heuristics. In its most extreme version, AutoML is a fully automated system: you give it data, and it returns a model (or models) that generalizes to unseen data. The common hurdles that modelers face, such as tuning hyperparameters, feature selection, and even architecture selection, are all handled by a series of algorithms and heuristics.
I think its importance stems from the fact that a computer does precisely what you want it to do, and it is fantastic at repetition. The large majority of the hurdles I mentioned above are precisely that: repetition. Finding a hyperparameter set that works for a problem is arduous. Finding a hyperparameter set and an architecture that works for a problem is even harder. Add to the mix data preprocessing, the time spent on debugging code, and trying to get the right environment to work, and you start wondering whether computers are actually helping you solve said problem, or just getting in the way. Then, you have a new problem, and you have to start all over again.
The key insight of AutoML is that you might be able to get away with reusing some things you tried out before (i.e., your prior knowledge) to speed up your modeling process. It turns out that said process is effectively an algorithm, and thus it can be written into a computer program for automation.
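To make the "process as algorithm" point concrete, here is a minimal sketch of the simplest form of this automation: a random search over hyperparameters. The model and search space are illustrative stand-ins, not any particular AutoML system.

```python
# Minimal random-search HPO loop: each iteration is one "repetition" a
# modeler would otherwise perform by hand. Model and space are examples only.
import random

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, random_state=0)

search_space = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10, None],
    "min_samples_leaf": [1, 5, 10],
}

best_score, best_params = -1.0, None
for _ in range(20):
    params = {name: random.choice(values) for name, values in search_space.items()}
    score = cross_val_score(
        RandomForestClassifier(random_state=0, **params), X, y, cv=3
    ).mean()
    if score > best_score:
        best_score, best_params = score, params

print(best_params, round(best_score, 3))
```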
Leah McGuire: AutoML is machine learning experts automating themselves. Creating quality models is a complex, time-consuming process. It requires understanding the dataset and question to be answered. This understanding is then used to collect and join the needed data, select features to use, clean the data and features, transform the features into values that can be used by a model, select an appropriate model type for the question, and tune feature-engineering and model parameters. AutoML uses algorithms based on machine learning best practices to build high-quality models without time-intensive work from an expert.
AutoML is important because it makes it possible to create high quality models with less time and expertise. Companies, non-profits, and government agencies all collect vast amounts of data; in order for this data to be utilized, it needs to be synthesized to answer pertinent questions. Machine learning is an effective way of synthesizing data to answer relevant questions, particularly if you do not have the resources to employ analysts to spend huge amounts of time looking at the data. However, machine learning requires both expertise and time to implement. AutoML seeks to decrease these barriers. This means that more data can be analyzed and used to make decisions.
Marios Michailidis: Broadly speaking, I would call it the process of automatically deriving or extracting useful information from data via harnessing the power of machines. Digital data is being produced at an incredible pace. Now that companies have found ways to harness it to extract value, it has become imperative to invest in data science and machine learning. However, the supply of data scientists is not enough to meet the current needs, hence making existing data scientists more productive is of the essence. This is where the notion of automated machine learning can provide the most value: by equipping existing data scientists with tools and processes that can make their work easier, quicker, and generally more efficient.
InfoQ: What parts of the ML process can be automated and what are some parts unlikely to be automated?
Lazzeri: With automated ML, the following tasks can be automated:
However, a few important tasks during the model development cycle cannot be automated, such as developing industry-specific knowledge and data acumen; these are hard to automate, and it is impossible not to keep humans in the loop. Another important aspect to consider is operationalizing machine learning models: AutoML is very useful for the machine learning model development cycle, but for automating the deployment step, other tools need to be used, such as MLOps, which enables data science and IT teams to collaborate and increase the pace of model development and deployment via monitoring, validation, and governance of machine learning models.
Tovbin: Through the years of development of the machine learning domain, we have seen that a large number of tasks around data manipulation, feature engineering, feature selection, model evaluation, and hyperparameter tuning can be defined as optimization problems and, with enough computing power, efficiently automated. We can see numerous proofs of that not only in research but also in the software industry, as platform offerings or open-source libraries. All these tools use predefined methods for data processing, model training, and evaluation.
The creative work of framing problems and applying new techniques to existing problems is unlikely to be replicated by machine automation, due to the large number of possible permutations, complex context, and expertise the machine lacks. As an example, look at the design of neural net architectures and their applications, a problem where the search space is so ample that progress is still mostly human-driven.
de Wynter: In theory, the entire ML process is computationally hard. From fitting data to, say, a neural network, to hyperparameter selection, to neural architecture search (NAS), these are all hard problems in the general case. However, all of these components have been automated with varying degrees of success for specific problems thanks to a combination of algorithmic advances, computational power, and patience.
I would like to think that the data preprocessing step and feature selection processes are the hardest to automate, given that a machine learning model will only learn what it has seen, and its performance (and hence the solution provided by the system) is dependent on its input. That said, there is a growing body of research on that aspect, too, and I hope that it will not remain hard for many natural problems.
McGuire: I would break the process of creating a machine learning model into four main components: data ETL and cleaning, feature engineering, model selection and tuning, and model explanation and evaluation.
Data cleaning can be relatively straightforward or incredibly challenging, depending on your dataset. One of the most important factors is history: if you have information about your data at every point in time, data cleaning can be automated quite well. If you have only a static representation of the current state, cleaning becomes much more challenging. Older data systems designed before relatively cheap storage tend to keep only the current state of information. This means that many important datasets do not have a history of actions taken on the data. Cleaning this type of history-less data has been a challenge for AutoML to provide good quality models for our customers.
Feature engineering is, again, a combination of steps that are easy and extremely difficult to automate. Some types of feature engineering are easy to automate given sufficient metadata about particular features. For example, parsing a phone number to validate it and extract the location from the area code is straightforward as long as you know that a particular string is a phone number. However, feature engineering that requires intimate, domain-specific knowledge of how a business works is unlikely to be automated. For example, if profits from a sale need to account for local taxes before being analyzed for cost-to-serve, some human input is likely required to establish this relationship (unless you have a massive amount of data to learn from). One reason deep learning has overtaken feature engineering in fields like vision and speech is the massive amount of high-quality training data. Tabular data is often quite source-specific, making it difficult to generalize, so feature engineering remains a challenge. In addition, defining the correct way to combine sources of data is often incredibly complex and labor-intensive. Once you have the relationship defined, the combination can be automated, but establishing this relationship takes a fair amount of manual work and is unlikely to be automated any time soon.
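To illustrate the metadata-driven case McGuire describes, here is a hedged sketch of a phone-number featurizer; the schema annotation, the area-code lookup table, and the helper function are all hypothetical.

```python
# Sketch: metadata tells the system a column holds phone numbers, so it can
# apply a phone-specific featurizer. Lookup table and names are illustrative.
import re

AREA_CODE_REGIONS = {"415": "San Francisco, CA", "212": "New York, NY"}  # toy table

def featurize_phone(value: str) -> dict:
    """Validate a US phone number and derive a region from its area code."""
    digits = re.sub(r"\D", "", value)
    if len(digits) == 11 and digits.startswith("1"):
        digits = digits[1:]  # strip country code
    valid = len(digits) == 10
    return {
        "phone_is_valid": valid,
        "phone_region": AREA_CODE_REGIONS.get(digits[:3]) if valid else None,
    }

column_metadata = {"contact": "phone"}    # hypothetical schema annotation
featurizers = {"phone": featurize_phone}  # type -> transformer registry

row = {"contact": "(415) 555-0199"}
print(featurizers[column_metadata["contact"]](row["contact"]))
# {'phone_is_valid': True, 'phone_region': 'San Francisco, CA'}
```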
Model selection and tuning is the easiest component to automate and many libraries already do this; there are even AutoML algorithms to find entirely new deep learning architectures. However, model selection and tuning libraries assume that the data you are using for modeling is clean and that you have a good way of evaluating the efficacy of your model. Massive data sets also help. Establishing clean datasets and evaluation frameworks still remain the biggest challenges.
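For illustration, here is a minimal sketch of automated model selection and tuning with off-the-shelf scikit-learn search tooling; it assumes, as McGuire notes, that the dataset is already clean and the evaluation metric is trustworthy.

```python
# Try several model families, tune each with cross-validated grid search,
# and keep the best performer. Candidates and grids are examples only.
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

X, y = load_breast_cancer(return_X_y=True)

candidates = [
    (LogisticRegression(max_iter=5000), {"C": [0.01, 0.1, 1, 10]}),
    (GradientBoostingClassifier(), {"n_estimators": [50, 200], "learning_rate": [0.05, 0.1]}),
]

best = max(
    (GridSearchCV(model, grid, cv=5).fit(X, y) for model, grid in candidates),
    key=lambda search: search.best_score_,
)
print(type(best.best_estimator_).__name__, round(best.best_score_, 3))
```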
Model explanation has been an important area of research for machine learning in general. While it is not, strictly speaking, part of AutoML, the growth of AutoML makes it even more important. It is also the case that the way in which you implement automation has implications for explainability: specifically, tracking metadata about what was tried and selected determines how deep explanations can go. Building explanations into AutoML requires a conscious effort and is very important. At some point the automation has to stop, and someone will look at and use the result. The more information the model provides about how it works, the more useful it is to the end consumer.
Michailidis: I would divide the areas where automation can be applied into the following main areas:
Regarding problems that are hard to automate, the first thing that comes to mind is anything related to translating the business problem into a machine learning problem. For AutoML to succeed, the business problem must be mapped to a solvable type of machine learning problem, and it must be supported by data of the right quality and relevance. The testing of the model and the success criteria also need to be defined carefully by the data scientist.
Another area where AutoML will find it hard to succeed is where ethical dilemmas may arise from the use of machine learning. For example, if there is an accident due to an algorithmic error, who will be responsible? I feel this kind of situation can be a challenge for AutoML.
InfoQ: What type of problems or use cases are better candidates to use AutoML?
Lazzeri: Classification, regression, and time series forecasting are the best candidates for AutoML. Azure Machine Learning offers featurizations specifically for these tasks, such as deep neural network text featurizers for classification.
Common classification examples include fraud detection, handwriting recognition, and object detection. Unlike classification, where predicted output values are categorical, regression models predict numerical output values based on independent predictors; for example, predicting automobile price based on features like gas mileage and safety rating.
Finally, building forecasts is an integral part of any business, whether it's revenue, inventory, sales, or customer demand. Data scientists can use automated ML to combine techniques and approaches and get a recommended, high-quality time series forecast.
Tovbin: Classification or regression problems relying on structured or semi-structured data, where one can define an evaluation metric, can usually be automated: for example, predicting user churn, real estate price prediction, and autocomplete.
de Wynter: It depends. Let us assume that you want the standard goal of machine learning: you need to learn an unseen probability distribution from samples. You also know that there is some AutoML system that does an excellent job for various, somewhat related tasks. There's absolutely no reason why you shouldn't automate it, especially if you don't have the time to be trying out possible solutions by yourself.
I do need to point out, however, that in theory a model that performs well for a specific problem does not have any guarantees around other problems; in fact, it is well known that there exists at least one task where it will fail. Still, this statement is quite general and can be worked around in practice.
On the other hand, from an efficiency point of view, a problem that has been studied for years by many researchers might not be a great candidate, unless you are particularly interested in marginal improvements. This follows immediately from the fact that most AutoML results, and more concretely NAS results, for well-known problems are usually equivalent, within a small delta, to the human-designed solutions. However, making the problem "interesting" (e.g., by including newer constraints such as parameter size) makes it effectively a new problem, and again perfect for AutoML.
McGuire: If you have a clean dataset with a very well-defined evaluation method, it is a good candidate for AutoML. Early advances in AutoML have focused on areas such as hyperparameter tuning: a well-defined but time-consuming problem. These AutoML solutions are essentially taking advantage of increases in computational power, combined with models of the problem space, to arrive at solutions that are often better than an expert could achieve, with less human time input. The key here is the clean dataset with a direct and easily measurable effect on the well-defined evaluation set. AutoML will maximize your evaluation criteria very well. However, if there is any mismatch between those criteria and what you are trying to do, or any confounding factors in the data, AutoML will not see that in the way a human expert (hopefully) would.
Michailidis: Well-defined problems are good use cases for AutoML. In these problems, the preparatory work has already been done. There are clear inputs and outputs and well-defined success criteria. Under these constraints, AutoML can produce the best results.
InfoQ: What are some important research problems in AutoML?
Lazzeri: An interesting open research question in AutoML is feature selection in supervised learning tasks. This is also called the differentiable feature selection problem: a gradient-based search approach to feature selection. Feature selection remains a crucial step in machine learning pipelines and continues to see active research; a few researchers from Microsoft Research are developing a feature selection method that is both statistically and computationally efficient.
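The Microsoft Research method is not spelled out here, but the general idea of differentiable feature selection can be sketched: learn a per-feature gate jointly with the model by gradient descent, with a sparsity penalty pushing uninformative gates toward zero. The toy setup below illustrates that idea only; it is not the method referenced above.

```python
# Illustrative differentiable feature selection: sigmoid gates over features,
# trained jointly with a linear model; an L1 penalty encourages sparsity.
import torch

torch.manual_seed(0)
n, d = 512, 20
X = torch.randn(n, d)
true_w = torch.zeros(d)
true_w[:3] = torch.tensor([2.0, -3.0, 1.5])      # only three features matter
y = X @ true_w + 0.1 * torch.randn(n)

w = torch.zeros(d, requires_grad=True)            # model weights
gate_logits = torch.zeros(d, requires_grad=True)  # one learnable gate per feature
opt = torch.optim.Adam([w, gate_logits], lr=0.05)

for _ in range(500):
    gates = torch.sigmoid(gate_logits)
    loss = torch.mean(((X * gates) @ w - y) ** 2) + 0.05 * gates.sum()
    opt.zero_grad()
    loss.backward()
    opt.step()

selected = (torch.sigmoid(gate_logits) > 0.5).nonzero().flatten()
print("selected features:", selected.tolist())    # ideally recovers 0, 1, 2
```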
Tovbin: The two significant ones that come to my mind are the transparency and bias of trained models.
Both experts and users often disagree or do not understand why ML systems, especially automated ones, make specific predictions. It is crucial to provide deeper insights into model predictions to allow users to gain confidence in such predictive systems. For example, when providing recommendations of products to consumers, a system can additionally highlight the contributing factors that influenced particular recommendations. In order to provide such functionality, in addition to the trained model, one would need to maintain additional metadata and expose it together with provided recommendations, which often cannot be easily achieved due to the size of the data or privacy concerns.
The same concerns apply to model bias, but the problem has different roots, e.g., incorrect data collection resulting in skewed datasets. This problem is more challenging to address because we often need to modify business processes and costly software. With applied automation, one can detect invalid datasets and sometimes even data collection practices early and allow removing bias from model predictions.
de Wynter: I think first and foremost, provably efficient and correct algorithms for hyperparameter optimization (HPO) and NAS. The issue with AutoML is that you are solving the problem of, well, problem solving (or rather, approximation), which is notoriously hard in the computational sense. We as researchers often focus on testing a few open benchmarks and call it a day, but, more often than not, such algorithms fail to generalize, and, as was pointed out last year, they tend not to outperform a simple random search.
There is also the issue that from a computational point of view, a fully automated AutoML system will face problems that are not necessarily similar to the ones that it has seen before; or worse, they might have a similar input but completely different solutions. Normally, this is related to the field of "learning to learn", which often involves some type of reinforcement learning (or neural network) to learn how previous ML systems solved a problem, and approximately solve a new one.
McGuire: I think there is a lot of interesting work to do on automating feature engineering and data cleaning. This is where most of the time is spent in machine learning, and domain expertise can be hugely important. Add to that the fact that most real-world data is extremely messy and complex, and you see that the biggest gains from automation come from automating as much data processing and transformation as possible.
Automating the data preparation work that currently takes a huge amount of human expertise and time is not a simple task. Techniques that have removed the need for custom feature engineering in fields like vision and language do not currently generalize to small messy datasets. You can use deep learning to identify pictures of cats because a cat is a cat and all you need to do is get enough labeled data to let a complex model fill in the features for you. A table tracking customer information for a bank is very different from a table tracking customer information for a clothing store. Using these datasets to build models for your business is a small data problem. Such problems cannot be solved simply by throwing enough data at a model that can capture the complexities on its own. Hand cleaning and feature engineering can use many different approaches and determining the best is currently something of an art form. Turning these steps into algorithms that can be applied across a wide range of data is a challenging but important area of research.
Being able to automatically create and more importantly explain models of such real world data is invaluable. Storage is cheap but experts are not. There is a huge amount of data being collected in the world today. Automating the cleaning and featurization of such data provides the opportunity to use it to answer important real world questions.
Michailidis: I personally find the area of (automation-aided) explainable AI and machine learning interpretability very interesting and very important for bridging the gap between black-box modelling and a model that stakeholders can comfortably trust.
Another area I am interested in is "model compression". I think it can be a huge game changer if we can automatically go from a powerful, complicated solution down to a much simpler one that can basically produce the same or similar performance, but much faster and utilizing fewer resources.
InfoQ: What are some AutoML techniques and open-source tools practitioners can use now?
Lazzeri: AutoML democratizes the machine learning model development process and empowers its users, no matter their data science expertise, to identify an end-to-end machine learning pipeline for any problem. There are several AutoML techniques that practitioners can use now; my favorite ones are:
Tovbin: In recent years we have seen an explosion of tooling for machine learning practitioners, from cloud platforms (Google Cloud AutoML, Salesforce Einstein, AWS SageMaker Autopilot, H2O AutoML) to open-source software (TPOT, AutoSklearn, TransmogrifAI). More information on these and other solutions can be found online.
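As a taste of the open-source options, here is a minimal TPOT run; TPOT searches over scikit-learn pipelines with genetic programming, and the dataset and budget settings below are illustrative.

```python
# TPOT evolves scikit-learn pipelines; small budgets here keep the demo fast.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from tpot import TPOTClassifier

X_train, X_test, y_train, y_test = train_test_split(
    *load_digits(return_X_y=True), random_state=42
)

tpot = TPOTClassifier(generations=5, population_size=20, random_state=42, verbosity=2)
tpot.fit(X_train, y_train)
print(tpot.score(X_test, y_test))
tpot.export("best_pipeline.py")  # the winning pipeline as plain scikit-learn code
```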
de Wynter: Disclaimer: I work for Amazon. This is an active area of research, and there are quite a few well-known algorithms (with more appearing every day) focusing on different parts of the pipeline, with well-known successes on various problems. It's hard to name them all, but some of the best-known examples are grid search, Bayesian, and gradient-based methods for HPO; and search strategies (e.g., hill climbing) and population/RL-based methods (e.g., ENAS, DARTS for one-shot NAS, and the algorithm used for AmoebaNet) for NAS. On the other hand, full end-to-end systems have achieved good results for a variety of problems.
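For illustration, here is a toy Bayesian-optimization run with scikit-optimize, one of the HPO families named above; the quadratic objective stands in for an expensive train-and-validate loop.

```python
# Gaussian-process Bayesian optimization over one continuous hyperparameter.
from skopt import gp_minimize

result = gp_minimize(
    func=lambda params: (params[0] - 2.0) ** 2,  # pretend this trains a model
    dimensions=[(-5.0, 5.0)],                    # search range for the parameter
    n_calls=30,
    random_state=0,
)
print(result.x, result.fun)  # should land near [2.0] and 0.0
```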
McGuire: Well, of course I need to mention our own open-source AutoML library, TransmogrifAI. We focus mainly on automating data cleaning and feature engineering, with some model selection, and we are built on top of Spark.
There are also a large number of interesting AutoML libraries coming out in Python, including Hyperopt, scikit-optimize, and TPOT.
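A minimal Hyperopt sketch follows, minimizing a cross-validation loss over a small, illustrative search space with the TPE algorithm.

```python
# Hyperopt's fmin minimizes, so the objective returns a negated CV accuracy.
from hyperopt import fmin, hp, tpe
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)

space = {
    "n_estimators": hp.choice("n_estimators", [50, 100, 200]),
    "max_depth": hp.choice("max_depth", [3, 5, 10]),
}

def objective(params):
    model = RandomForestClassifier(random_state=0, **params)
    return -cross_val_score(model, X, y, cv=3).mean()

best = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=25)
print(best)  # indices into the hp.choice lists above
```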
Michailidis: In the open-source space, H2O.ai has a tool called AutoML that incorporates many of the elements I mentioned in the previous questions. It is also very scalable and can be used on any OS. Other tools are auto-sklearn and Auto-WEKA.
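Based on H2O AutoML's documented Python API, a short sketch of a typical run follows; the file path and column name are hypothetical.

```python
# H2O AutoML trains and cross-validates a family of models (GBMs, GLMs, deep
# learning, stacked ensembles) and ranks them on a leaderboard.
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.import_file("train.csv")                   # hypothetical dataset
features = [c for c in train.columns if c != "label"]  # "label" is the target

aml = H2OAutoML(max_models=20, seed=1)
aml.train(x=features, y="label", training_frame=train)

print(aml.leaderboard.head())   # models ranked by the default metric
predictions = aml.leader.predict(train)
```

One caveat: for a classification target, the label column should be converted to a factor first (e.g., `train["label"] = train["label"].asfactor()`).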
InfoQ: What are the limitations of AutoML?
Lazzeri: AutoML raises a few challenges, such as model parallelization, result collection, resource optimization, and iteration. Searching for the best model and hyperparameters is an iterative process constrained by many limits, such as compute, money, and time. Machine learning pipelines provide a solution to those AutoML challenges with a clear definition of the process and automation features. An Azure Machine Learning pipeline is an independently executable workflow of a complete machine learning task. Pipelines should focus on machine learning tasks such as:
Tovbin: One problem that AutoML does not handle well is complex data types. The majority of automated methods expect certain data types, e.g., numerical, categorical, text, geo coordinates, and, therefore, specific distributions. Such methods are a poor fit for more complicated scenarios, such as behavioral data, e.g., online store visit sessions.
Another problem is feature engineering that needs to consider domain-specific properties of the data. For example, suppose we would like to build a system to automate email classification for an insurance sales team. Input from the sales team members defining which parts of an email are and are not necessary would usually be more valuable than a metric. When building such systems, it is essential to reinforce the system with domain-expert feedback to achieve more reliable results.
de Wynter: There is the practical limitation of the sheer amount of computational resources you have to throw at a problem to get it solved. It is not a true obstacle insofar as you can always use more machines, but, environmentally speaking, there are consequences associated with such a brute-force approach. Now, not all of AutoML is brute-force (as I mentioned earlier, this is a computationally hard problem, so brute-forcing a problem will only get you so far), and it relies heavily on heuristics, but you still need sizable compute to solve a given AutoML problem, since you have to try out multiple solutions end-to-end. There's a push in the science community to obtain better, "greener" algorithms, and I think it's fantastic and the way to go.
From a theoretical point of view, the hardness of AutoML is quite interesting: ultimately, it is a statement on how intrinsically difficult the problem is, regardless of what type or number of computers you use. Add to that the earlier point that there is (theoretically) no such thing as "one model to rule them all," and AutoML becomes a very complex computational problem.
Lastly, current AutoML systems have a well-defined model search space (e.g., neural network layers, or a mix of classifiers) that is expected to work for every input problem, which is not the case. However, search spaces that provably generalize well to all possible problems are somewhat hard to implement in practice, so there is still an open question of how to bridge that gap.
McGuire: I don't think AutoML is ready to replace having a human in the loop. AutoML can build a model, but as we automate more and more of modeling, developing tools to provide transparency into what the model is doing becomes more and more important. Models are only as good as the data used to build them. As we move away from having a human spend time cleaning and deeply understanding relationships in the data, we need to provide new tools to allow users of the model to understand what the models are doing. You need a human to take a critical look at the models and the elements of the data they use and ask: is this the right thing to predict, and is this data OK to use? Without tools to answer these questions for AutoML models, we run the risk of unintentionally shooting ourselves in the foot. We need the ability to ensure we are not using inappropriate models or perpetuating and reinforcing issues and biases in society without realizing it.
Michailidis: This was covered mostly in previous sections. Another thing I would like to mention is that performance is greatly affected by the resources allocated: more powerful machines will be able to cover a search space of potential algorithms, features, and techniques much faster.
These tools (unless they are built to support very specific applications) do not have domain knowledge but are made to solve generic problems. For example, they would not know out of the box that if one field in the data is called "distance travelled" and another is called "duration in time", the two can be used to compute "speed", which may be an important feature for a given task. They may have a chance to generate that feature by stochastically trying different transformations of the data, but a domain expert would figure this out much quicker, so these tools will produce better results in the hands of an experienced data practitioner. Hence, these tools will be more successful if they have the option to incorporate domain knowledge coming from the expert.
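Michailidis's example takes one line of pandas when the expert supplies it, whereas an automated search would have to stumble on the ratio among many candidate transformations; the column names below follow his example.

```python
# An expert-supplied domain feature: speed = distance / time.
import pandas as pd

df = pd.DataFrame({"distance travelled": [120.0, 45.0], "duration in time": [2.0, 0.5]})
df["speed"] = df["distance travelled"] / df["duration in time"]
print(df)
```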
The panelists agreed that AutoML is important because it saves time and resources, removing much of the manual work and allowing data scientists to deliver business value faster and more efficiently. The panelists predict, however, that AutoML will not remove the need for a "human in the loop," particularly for industry-specific knowledge and the ability to translate business problems into machine-learning problems. Important research areas in AutoML include feature engineering and model explanation.
The panelists highlighted several existing commercial and open-source AutoML tools and described the different parts of the machine-learning process that can be automated. Several panelists noted that one limitation of AutoML is the amount of computational resources required, while others pointed out the need for domain knowledge and model transparency.
Francesca Lazzeri, PhD is an experienced scientist and machine learning practitioner with over 12 years of both academic and industry experience. She is the author of a number of publications, including technology journals, conferences, and books. She currently leads an international team of cloud advocates and AI developers at Microsoft. Before joining Microsoft, she was a research fellow at Harvard University in the Technology and Operations Management Unit. Find her on Twitter: @frlazzeri and Medium: @francescalazzeri
Matthew Tovbin is a Co-Founder of Faros AI, a software automation platform for DevOps. Before founding Faros AI, he acted as Software Engineering Architect at Salesforce, developing the Salesforce Einstein AI platform, which powers the world's smartest CRM. In addition, Matthew is a creator of TransmogrifAI, co-organizer of the Scala Bay meetup, a presenter, and an active member of numerous functional programming groups. Matthew lives in the San Francisco Bay Area with his wife and kid, and enjoys photography, hiking, good whisky, and computer gaming.
Adrian de Wynter is an Applied Scientist in Alexa AI's Secure AI Foundations organization. His work can be categorized into three broad, sometimes overlapping, areas: language modeling, neural architecture search, and privacy-preserving machine learning. His research interests involve meta-learning and natural language understanding, with a special emphasis on the computational foundations of these topics.
Leah McGuire is a Machine Learning Architect at Salesforce, working on automating as many of the steps involved in machine learning as possible. This automation has been instrumental in developing and shipping a number of customer-facing machine learning offerings at Salesforce, with the goal of bringing intelligence to each customer's unique data and business goals. Before focusing on developing machine learning products, she completed a PhD and a Postdoctoral Fellowship in Computational Neuroscience at the University of California, San Francisco, and at the University of California, Berkeley, where she studied the neural encoding and integration of sensory signals.
Marios Michailidis is a competitive data scientist at H2O.ai, developing the next generation of machine learning products in the AutoML space. He holds a BSc in Accounting and Finance from the University of Macedonia in Greece, an MSc in Risk Management from the University of Southampton, and a PhD in machine learning from University College London (UCL), with a focus on ensemble modelling. He is the creator of KazAnova, a freeware GUI for credit scoring and data mining made 100% in Java, as well as the creator of the StackNet Meta-Modelling Framework. In his spare time he loves competing in data science challenges, and he was ranked 1st out of 500,000 members on the popular Kaggle.com data science platform.