Home

Data

Automated Machine Learning

By Adnan Masood

Book

eBook $35.99 $24.99

Print $48.99

Subscription $15.99 $10 p/m for three months

BUY NOW

$10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

BUY NOW $10 p/m for first 3 months. $15.99 p/m after that. Cancel Anytime!

eBook $35.99 $24.99

Print $48.99

Subscription $15.99 $10 p/m for three months

What do you get with a Packt Subscription?

This book & 7000+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook + Subscription?

Download this book in EPUB and PDF formats, plus a monthly download credit

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with a Packt Subscription?

This book & 6500+ ebooks & video courses on 1000+ technologies

60+ curated reading lists for various learning paths

50+ new titles added every month on new and emerging tech

Early Access to eBooks as they are being written

Personalised content suggestions

Customised display settings for better reading experience

50+ new titles added every month on new and emerging tech

Playlists, Notes and Bookmarks to easily manage your learning

Mobile App with offline access

What do you get with eBook?

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Download this book in EPUB and PDF formats

Access this title in our online reader

DRM FREE - Read whenever, wherever and however you want

Online reader with customised display settings for better reading experience

What do I get with Print?

Get a paperback copy of the book delivered to your specified Address*

Access this title in our online reader

Online reader with customised display settings for better reading experience

What do you get with video?

Download this video in MP4 format

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with video?

Stream this video

Access this title in our online reader

DRM FREE - Watch whenever, wherever and however you want

Online reader with customised display settings for better learning experience

What do you get with Audiobook?

Download a zip folder consisting of audio files (in MP3 Format) along with supplementary PDF

What do you get with Exam Trainer?

Flashcards, Mock exams, Exam Tips, Practice Questions

Access these resources with our interactive certification platform

Mobile compatible-Practice whenever, wherever, however you want

About this book

Every machine learning engineer deals with systems that have hyperparameters, and the most basic task in automated machine learning (AutoML) is to automatically set these hyperparameters to optimize performance. The latest deep neural networks have a wide range of hyperparameters for their architecture, regularization, and optimization, which can be customized effectively to save time and effort. This book reviews the underlying techniques of automated feature engineering, model and hyperparameter tuning, gradient-based approaches, and much more. You'll discover different ways of implementing these techniques in open source tools and then learn to use enterprise tools for implementing AutoML in three major cloud service providers: Microsoft Azure, Amazon Web Services (AWS), and Google Cloud Platform. As you progress, you’ll explore the features of cloud AutoML platforms by building machine learning models using AutoML. The book will also show you how to develop accurate models by automating time-consuming and repetitive tasks in the machine learning development lifecycle. By the end of this machine learning book, you’ll be able to build and deploy AutoML models that are not only accurate, but also increase productivity, allow interoperability, and minimize feature engineering tasks.

Publication date:: February 2021
Publisher: Packt
Pages: 312
ISBN: 9781800567689

Chapter 1: A Lap around Automated Machine Learning

"All models are wrong, but some are useful."

– George Edward Pelham Box FRS

"One of the holy grails of machine learning is to automate more and more of the feature engineering process."

– Pedro Domingos, A Few Useful Things to Know about Machine Learning

This chapter will provide an overview of the concepts, tools, and technologies surrounding automated Machine Learning (ML). This introduction hopes to provide both a solid overview for novices and serve as a reference for experienced ML practitioners. We will start by introducing the ML development life cycle while navigating through the product ecosystem and the data science problems it addresses, before looking at feature selection, neural architecture search, and hyperparameter optimization.

It's very plausible that you are reading this book on an e-reader that's connected to a website that recommended this manuscript based on your reading interests. We live in a world today where your digital breadcrumbs give telltale signs of not only your reading interests, but where you like to eat, which friend you like most, where you will shop next, whether you will show up to your next appointment, and who you would vote for. In this age of big data, this raw data becomes information that, in turn, helps build knowledge and insights into so-called wisdom.

Artificial Intelligence (AI) and its underlying implementations of ML and deep learning help us not only find the metaphorical needle in the haystack, but also to see the underlying trends, seasonality, and patterns in these large data streams to make better predictions. In this book, we will cover one of the key emerging technologies in AI and ML; that is, automated ML, or AutoML for short.

In this chapter, we will cover the following topics:

The ML development life cycle
Automated ML
How automated ML works
Democratization of data science
Debunking automated ML myths
Automated ML ecosystem (open source and commercial)
Automated ML challenges and limitations

Let's get started!

The ML development life cycle

Before introducing you to automated ML, we should first define how we operationalize and scale ML experiments into production. To go beyond Hello-World apps and works-on-my-machine-in-my-Jupyter-notebook kinds of projects, enterprises need to adapt a robust, reliable, and repeatable model development and deployment process. Just as in a software development life cycle (SDLC), the ML or data science life cycle is also a multi-stage, iterative process.

The life cycle includes several steps – the process of problem definition and analysis, building the hypothesis (unless you are doing exploratory data analysis), selecting business outcome metrices, exploring and preparing data, building and creating ML models, training those ML models, evaluating and deploying them, and maintaining the feedback loop:

Figure 1.1 – Team data science process

A successful data science team has the discipline to prepare the problem statement and hypothesis, preprocess the data, select the appropriate features from the data based on the input of the Subject-Matter Expert (SME) and the right model family, optimize model hyperparameters, review outcomes and the resulting metrics, and finally fine-tune the models. If this sounds like a lot, remember that it is an iterative process where the data scientist also has to ensure that the data, model versioning, and drift are being addressed. They must also put guardrails in place to guarantee the model's performance is being monitored. Just to make this even more interesting, there are also frequent champion challenger and A/B experimentations happening in production – may the best model win.

In such an intricate and multifaceted environment, data scientists can use all the help they can get. Automated ML extends a helping hand with the promise to take care of the mundane, the repetitive, and the intellectually less efficient tasks so that the data scientists can focus on the important stuff.

Automated ML

"How many members of a certain demographic group does it take to perform a specified task?"

"A finite number: one to perform the task and the remainder to act in a manner stereotypical of the group in question." <insert your light bulb joke here>

This is meta humor – the finest type of humor for ensuing hilarity for those who are quantitatively inclined. Similarly, automated ML is a class of meta learning, also known as learning to learn – the idea that you can apply the automation principles to themselves to make the process of gaining insights even faster and more elegant.

Automated ML is the approach and underlying technology of applying certain automation techniques to accelerate the model's development life cycle. Automated ML enables citizen data scientists and domain experts to train ML models, and helps them build optimal solutions to ML problems. It provides a higher level of abstraction for finding out what the best model is, or an ensemble of models suitable for a specific problem. It assists data scientists by automating the mundane and repetitive tasks of feature engineering, including architecture search and hyperparameter optimization. The following diagram represents the ecosystem of automated ML:

Figure 1.2 – Automated ML ecosystem

These three key areas – feature engineering, architecture search, and hyperparameter optimization – hold the most promise for the democratization of AI and ML. Some automated feature engineering techniques that are finding domain-specific usable features in datasets include expand/reduce, hierarchically organizing transformations, meta learning, and reinforcement learning. For architectural search (also known as neural architecture search), evolutionary algorithms, local search, meta learning, reinforcement learning, transfer learning, network morphism, and continuous optimization are employed.

Last, but not least, we have hyperparameter optimization, which is the art and science of finding the right type of parameters outside the model. A variety of techniques are used here, including Bayesian optimization, evolutionary algorithms, Lipchitz functions, local search, meta learning, particle swarm optimization, random search, and transfer learning, to name a few.

In the next section, we will provide a detailed overview of these three key areas of automated ML. You will see some examples of them, alongside code, in the upcoming chapters. Now, let's discuss how automated ML really works in detail by covering feature engineering, architecture search, and hyperparameter optimization.

How automated ML works

ML techniques work great when it comes to finding patterns in large datasets. Today, we use these techniques for anomaly detection, customer segmentation, customer churn analysis, demand forecasting, predictive maintenance, and pricing optimization, among hundreds of other use cases.

A typical ML life cycle is comprised of data collection, data wrangling, pipeline management, model retraining, and model deployment, during which data wrangling is typically the most time-consuming task.

Extracting meaningful features out of data, and then using them to build a model while finding the right algorithm and tuning the parameters, is also a very time-consuming process. Can we automate this process using the very thing we are trying to build here (meta enough?); that is, should we automate ML? Well, that is how this all started – with someone attempting to print a 3D printer using a 3D printer.

A typical data science workflow starts with a business problem (hopefully!), and it is used to either prove a hypothesis or to discover new patterns in the existing data. It requires data; the need to clean and preprocess the data, which takes an awfully large amount of time – almost as much as 80% of your total time; and "data munging" or wrangling, which includes cleaning, de-duplication, outlier analysis and removal, transforming, mapping, structuring, and enriching. Essentially, we're taming this unwieldy vivacious raw real-world data and putting it in a tame desired format for analysis and modeling so that we can gain meaningful insights from it.

Next, we must select and engineer features, which means figuring out what features are useful, and then brainstorming and working with SMEs on the importance and validity of these features. Validating how these features would work with your model, the fitness from both a technical and business perspective, and improving these features as needed is also a critical part of the feature engineering process. The feedback loop to the SME is often very important, albeit being the least emphasized part of the feature engineering pipeline. The transparency of models stems from clear features – if features such as race or gender give higher accuracy regarding your loan repayment propensity model, this does not mean it's a good idea use them. In fact, an SME would tell you – if your conscious mind hasn't – that it's a terrible idea and that you should look for more meaningful and less sexist, racist, and xenophobic features. We will discuss this further in Chapter 10, AutoML in the Enterprise, when we discuss operationalization.

Even though the task of "selecting a model family" sounds like a reality show, that is what data scientists and ML engineers do as part of their day-to-day job. Model selection is the task of picking the right model that best describes the data at hand. This involves selecting a ML model from a set of candidate models. Automated ML can give you a helping hand with this.

Hyperparameters

You will hear about hyperparameters a lot, so let's make sure you understand what they are.

Each model has its own internal and external parameters. Internal parameters (also known as model parameters, or just parameters) are the ones intrinsic to the model, such as its weight and predictor matrix, while external parameters or hyperparameters are "outside" the model itself, such as the learning rate and its number of iterations. An intuitive example can be derived from k-means, a well-understood unsupervised clustering algorithm known for its simplicity.

The k in k-means stands for the number of clusters required, and epochs (pronounced epics, as in Doctor Who is an epic show!) are used to specify the number of passes that are done over the training data. Both of these are examples of hyperparameters – that is, the parameters that are not intrinsic to the model itself. Similarly, the learning rate for training a neural network, C and sigma for support vector machines, the k number of leaves or depth of a tree, the latent factors in a matrix factorization, and the number of hidden layers in a deep neural network are all examples of hyperparameters.

Selecting the right hyperparameters has been called tuning your instrument, which is where the magic happens. In ML tribal folklore, these elusive numbers have been brandished as "nuisance parameters", to the point where proverbial statements such as "tuning is more of an art than a science" and "tuning models is like black magic" tend to discourage newcomers in the industry. Automated ML is here to change this perception by helping you choose the right hyperparameters; more on this later. Automated ML enables citizen data scientists to build, train, and deploy ML models, thus possibly disrupting the status quo.

Important note

Some consider the term "citizen data scientists" as a euphuism for non-experts, but SME and people who are curious about analytics are some of the most important people – and don't let anyone tell you otherwise.

In conclusion, from building the correct ensembles of models to preprocessing the data, selecting the right features and model family, choosing and optimizing model hyperparameters, and evaluating the results, automated ML offers algorithmic solutions that can programmatically address these challenges.

The need for automated ML

At the time of writing, Open AI's GPT-3 model has recently been announced, and it has an incredible 175 billion parameters. Due to this ever-increasing model complexity, which includes big data and an exponentially increasing number of features, we now have a necessity to not only be able to tune these parameters, but also have sophisticated, repeatable procedures in place to tweak these proverbial knobs so that they can be adjusted. This complexity makes it less accessible for citizen data scientists, business subject matter experts, and domain experts – which might sound like job security, but it is not good for business, nor for the long-term success of the field.

Also, this isn't just about the hyperparameters, but the entire pipeline and the reproducibility of the results becoming harder as the model's complexity grows, which curtails AI democratization.

Democratization of data science

To nobody's surprise, data scientists are in high demand! As a LinkedIn Workforce Report found in August 2018, there were more than 151,000 data scientist jobs going unfilled across the US (https://economicgraph.linkedin.com/resources/linkedin-workforce-report-august-2018). Due to this disparity in supply and demand, the notion of democratization of AI, which is enabling people who are not formally trained in math, statistics, computer science, and related quantitative fields to design, develop, and use predictive models, has become quite popular. There are arguments on both sides regarding whether an SME, a domain SME, a business executive, or a program manager can effectively work as a citizen data scientist – which I consider to be a layer of abstraction argument. For businesses to gain meaningful actionable insights in a timely manner, there is no other way than to accelerate the process of raw data to insight, and insights to action. It is quite evident to anyone who has served in the analytics trenches. This means that no citizen data scientists are left behind.

As disclaimers and caveats go, like everything else, automatic ML is not the proverbial silver bullet. However, automated methods for model selection and hyperparameter optimization bear the promise of enabling non-experts and citizen data scientists to train, test, and deploy high quality ML models. The tooling around automated ML is shaping up and hopefully, this gap will be reduced, allowing for increased participation. Now, let's review some of the myths surrounding automated ML and debunk them, MythBusters style!

Debunking automated ML myths

Much like the moon landing, when it comes to automated ML, there are more than a few conspiracy theories and myths surrounding it. Let's take a look at a few that have been debunked.

Myth #1 – The end of data scientists

One of the most frequently asked questions around automated ML is, "Will automated ML be a job killer for data scientists?"

The short answer is, not anytime soon – and the long answer, as always, is more nuanced and boring.

The data science life cycle, as we discussed previously, has several moving parts where domain expertise and subject matter insights are critical. The data scientists collaborate with businesses to build a hypothesis, analyze the results, and decide on any actionable insights that may create business impact. The act of automating mundane and repeatable tasks in data science, does not take away from the cognitively challenging task of discovering insights. If anything, instead of spending hours sifting through data and cleaning up features, it frees up data scientists to learn more about the underlying business. A large variety of real-world data science applications need dedicated human supervision, as well as the steady gaze of domain experts to ensure the fine-grained actions that come out of these insights reflect the desired outcome.

One of the proposed approaches, A Human-in-the-Loop (HITL) Perspective on AutoML: Milestones and the Road Ahead by Doris Jung-Lin Lee et al., builds upon the notion of keeping humans in the loop. HITL suggests three different level of automation in data science workflows: user-driven, cruise control, and autopilot. As you progress through the maturity curve and the confidence of specific models increases, the user-driven flows move to cruise control and eventually to the autopilot stage. By leveraging different areas of expertise by building a talent pool, automated ML can help in multiple stages of the data science life cycle by engaging humans.

Myth #2 – Automated ML can only solve toy problems

This is a frequent argument from the skeptics of automated ML – that it can only be used to solve well-defined, controlled toy problems in data science and does not bode well for any real-world scenario.

The reality is quite the contrary – but I think the confusion arises from an incorrect assumption that we can just take a dataset, throw it to an automated ML model, and we will get meaningful insights. If we were to believe the hype around automated ML, then it should be able to look at messy data, perform a magical cleanup, figure out all the important features (including target variables), find the right model, tune its hyperparameters, and voila – it's built a magical pipeline!

Even though it does sound absurd when spoken out loud, this is exactly what you see in carefully crafted automated ML product demos. Then, there's the hype cycle, which has the opposite effect of diminishing the real value of automated ML offerings. The technical approaches powering automated ML are robust, and the academic rigor that's put into bringing these theories and techniques to life is like any other area of AI and ML.

In future chapters, we will look at several examples of hyperscalar platforms that benefit from automated ML, including – but not limited to – Google Cloud Platform, AWS, and Azure. These testimonials lead us to believe that real-world automated ML is not limited to eking out better accuracy in Kaggle championships, but rather poised to disrupt the industry in a big way.

Automated ML ecosystem

It almost feels redundant to point out that automated ML is a rapidly growing field; it's far from being commoditized – existing frameworks are constantly being evolved and new offerings and platforms are becoming mainstream. In the upcoming chapters, we will discuss some of these frameworks and libraries in detail. For now, we will provide you with a breadth-first introduction to get you acquainted with the automated ML ecosystem before we do a deep dive.

Open source platforms and tools

In this section, we will briefly review some of the open source automated ML platforms and tools that are available. We will deep dive into some of these platforms in Chapter 3, Automated Machine Learning with Open Source Tools and Libraries.

Microsoft NNI

Microsoft Neural Network Intelligence (NNI) is an open source platform that addresses the three key areas of any automated ML life cycle – automated feature engineering, architectural search (also referred to as neural architectural search or NAS), and hyperparameter tunning (HPI). The toolkit also offers model compression features and operationalization via KubeFlow, Azure ML, DL Workspace (DLTS), and Kubernetes over AWS.

The toolkit is available on GitHub to be downloaded: https://github.com/microsoft/nni.

auto-sklearn

Scikit-learn (also known as sklearn) is a popular ML library for Python development. As part of this ecosystem and based on Efficient and Robust Automated ML by Feurer et al., auto-sklearn is an automated ML toolkit that performs algorithm selection and hyperparameter tuning using Bayesian optimization, meta-learning, and ensemble construction.

The toolkit is available on GitHub to be downloaded: github.com/automl/auto-sklearn.

Auto-Weka

Weka, short for Waikato Environment for Knowledge Analysis, is an open source ML library that provides a collection of visualization tools and algorithms for data analysis and predictive modeling. Auto-Weka is similar to auto-sklearn but is built on top of Weka and implements the approaches described in the paper for model selection, hyperparameter optimization, and more.

The developers describe Auto-WEKA as going beyond selecting a learning algorithm and setting its hyperparameters in isolation. Instead, it implements a fully automated approach. The author's intent is for Auto-WEKA "to help non-expert users to more effectively identify ML algorithms" – that is, democratization for SMEs – via "hyperparameter settings appropriate to their applications".

The toolkit is available on GitHub to be downloaded: github.com/automl/autoweka.

Auto-Keras

Keras is one of the most widely used deep learning frameworks and is an integral part of the TensorFlow 2.0 ecosystem. Auto-Keras, based on the paper by Jin et al., proposes that it is "a novel method for efficient neural architecture search with network morphism, enabling Bayesian optimization". This helps the neural architectural search "by designing a neural network kernel and algorithm for optimizing acquisition functions in a tree-structured space". Auto-Keras is the implementation of this deep learning architecture search via Bayesian optimization.

The toolkit is available on GitHub to be downloaded: github.com/jhfjhfj1/autokeras.

TPOT

The Tree-based Pipeline Optimization Tool, or TPOT for short (nice acronym, eh!), is a product of University of Pennsylvania, Computational Genetics Lab. TPOT is an automated ML tool written in Python. It helps build and optimize ML pipelines with genetic programming. Built on top of scikit-learn, TPOT helps automate feature selection, preprocessing, construction, model selection, and parameter optimization by "exploring thousands of possible pipelines to find the best one". It is just one of the many toolkits with a small learning curve.

The toolkit is available on GitHub to be downloaded: github.com/EpistasisLab/tpot.

Ludwig – a code-free AutoML toolbox

Uber's automated ML tool, Ludwig, is an open source deep learning toolbox used for experimentation, testing, and training ML models. Built on top of TensorFlow, Ludwig enables users to create model baselines and perform automated ML-style experiments with different network architectures and models. In its latest release (at the time of writing), Ludwig now integrates with CometML and supports BERT text encoders.

The toolkit is available on GitHub to be downloaded: https://github.com/uber/ludwig.

AutoGluon – an AutoML toolkit for deep learning

From AWS Labs, with the goal of democratization of ML in mind, AutoGluon has been developed to enable "easy-to-use and easy-to-extend AutoML with a focus on deep learning and real-world applications spanning image, text, or tabular data". AutoGluon, an integral part of AWS's automated ML strategy, enables both junior and seasoned data scientists to build deep learning models and end-to-end solutions with ease. Like other automated ML toolkits, AutoGluon offers network architecture search, model selection, and custom model improvements.

The toolkit is available on GitHub to be downloaded: https://github.com/awslabs/autogluon.

Featuretools

Featuretools is an excellent Python framework that helps with automated feature engineering by using deep feature synthesis. Feature engineering is a tough problem due to its very nuanced nature. However, this open source toolkit, with its excellent timestamp handling and reusable feature primitives, provides an excellent framework you can use to build and extract a combination of features and look at what impact they have.

The toolkit is available on GitHub to be downloaded: https://github.com/FeatureLabs/featuretools/.

H2O AutoML

H2O's AutoML provides an open source version of H2O's commercial product, with APIs in R, Python, and Scala. This is an open source, distributed (multi-core and multi-node) implementation for automated ML algorithms and supports basic data preparation via a mix of grid and random search.

The toolkit is available on GitHub to be downloaded: github.com/h2oai/h2o-3.

Commercial tools and platforms

Now, let's go through some commercial tools and platforms that are used for automated ML.

DataRobot

DataRobot is a proprietary platform for automated ML. As one of the leaders in the automated ML space, Data Robot claims to "automate the end-to-end process for building, deploying, and maintaining AI at scale". Data Robot's model repository contains open source as well as proprietary algorithms and approaches for data scientists, with a focus on business outcomes. Data Robot's offerings are available for both the cloud and on-premises implementations.

The platform can be accessed here: https://www.datarobot.com/platform/.

Google Cloud AutoML

Integrated in the Google Cloud Compute platform, the Google Cloud AutoML offering aims to help train high-quality custom ML models with minimal effort and ML expertise. This offering provides AutoML Vision, AutoML Video Intelligence, AutoML Natural Language, AutoML Translation, and AutoML Tables for structured data analysis. We will discuss this Google offering in more detail in Chapter 8, Machine Learning with Google Cloud Platform, and Chapter 9, Automated Machine Learning with GCP Cloud AutoML of this book.

Google Cloud AutoML can be accessed at https://cloud.google.com/automl.

Amazon SageMaker Autopilot

AWS offers a wide variety of capabilities around AI and ML. SageMaker Autopilot is among one of these offerings and helps to "automatically build, train, and tune models" as part of the AWS ecosystem. SageMaker Autopilot provides an end-to-end automated ML life cycle that includes automatic feature engineering, model and algorithm selection, model tuning, deployment, and ranking based on performance. We will discuss AWS SageMaker Autopilot in Chapter 6, Machine Learning with Amazon Web Services, and Chapter 7, Doing Automated Machine Learning with Amazon SageMaker Autopilot.

Amazon SageMaker Autopilot can be accessed at https://aws.amazon.com/sagemaker/autopilot/.

Azure Automated ML

Microsoft Azure provides automated ML capabilities to help data scientists build ML models with speed and at scale. The platform offers automated feature engineering capabilities such as missing value imputation, transformations and encodings, drop ping high cardinality, and no variance features. Azure's automated ML also supports time series forecasting, algorithm selection, hyperparameter tunning, guardrails to keep model bias in check, and a model leaderboard for ranking and scoring. We will discuss the Azure ML and AutoML offerings in Chapter 4, Getting Started with Azure Machine Learning, and Chapter 5, Automated Machine Learning with Microsoft Azure.

Azure's automated ML offering can be accessed at https://azure.microsoft.com/en-us/services/machine-learning/automatedml/.

H2O Driverless AI

H2O's open source offerings were discussed earlier in the Open source platforms and books section. The commercial offering of H2O Driverless AI is an automated ML platform that addresses the needs of feature engineering, architecture search, and pipeline generation. The "bring your own recipe" feature is unique (even though it's now being adapted by other vendors) and is used to integrate custom algorithms. The commercial product has extensive capabilities and a feature-rich user interface for data scientists to get up to speed.

H2O Driverless AI can be accessed at https://www.h2o.ai/products/h2o-driverless-ai/.

Other notable frameworks and tools in this space include Autoxgboost, RapidMiner Auto Model, BigML, MLJar, MLBox, DATAIKU, and Salesforce Einstein (powered by Transmogrif AI). The links to their toolkits can be found in this book's Appendix. The following table is from Mark Lin's Awesome AutoML repository and outlines some of the most important automated machine learning toolkits, along with their corresponding links:

Figure 1.3 – Automated ML projects from Awesome-AutoML-Papers by Mark Lin

The classification type column specifies whether the library supports Network Architecture Search (NAS), Hyperparameter Optimization (HPO), and Automated Feature Engineering (AutoFE).

The future of automated ML

As the industry makes significant investments in the area surrounding automated ML, it is poised to become an important part of our enterprise data science workflows, if it isn't already. Serving as a valuable assistant, this apprentice will help data scientists and knowledge workers focus on the business problem and take care of any thing unwieldy and trivial. Even though the current focus is limited to automated feature engineering, architecture search, and hyperparameter optimization, we will also see that meta-learning techniques will be introduced in other areas to help automate this automation process.

Due to the increasing demand of democratization of AI and ML, we will see automated ML become mainstream in the industry – with all the major tools and hyperscaler platforms providing it as an inherent part of their ML offerings. This next generation of automated ML equipped tools will allow us to perform data preparation, domain customized feature engineering, model selection and counterfactual analysis, operationalization, explainability, monitoring, and create feedback loops. This will make it easier for us to focus on what's important in the business, including business insights and impact.

The automated ML challenges and limitations

As we mentioned earlier, data scientists aren't getting replaced, and automated ML is not a job killer – for now. The job of data scientists will evolve as the toolsets and their functions continue to change.

The reasons for this are twofold. Firstly, automated ML does not automate data science as a discipline. It is definitely a time saver for performing automated feature engineering, architecture search, hyperparameter optimization, or running multiple experiments in parallel. However, there are various other essential parts of the data science life cycle that cannot be easily automated, thus providing the current state of automated ML.

The second key reason is that being a data scientist is not a homogenous role – the competencies and responsibilities related to it vary across the industry and organizations. In lieu of democratizing data science with automated ML, the so-called junior data scientists will gain assistance from automated feature engineering capabilities, and this will speed up their data munging and wrangling practices. Meanwhile, senior engineers will have more time to focus on improving their business outcomes by designing better KPI metrices and enhancing the model's performance. As you can see, this will help all tiers of data science practitioners gain familiarity with the business domain and explore any cross-cutting concerns. Senior data scientists also have the responsibility of monitoring model and data quality and drift, as well as maintaining versioning, auditability, governance, lineage, and other MLOps (Machine Learning Operations) cross-cutting concerns.

Enabling the explainability and transparency of models to address any underlying bias is also a critical component for regulated industries across the world. Due to its highly subjective nature, there is limited functionality to address this automatically in the current toolsets; this is where a socially aware data scientist can provide a tremendous amount of value to stop the perpetuation of algorithmic bias.

A Getting Started guide for enterprises

Congratulations! You have almost made it to the end of the first chapter without dozing off – kudos! Now, you must be wondering: this automated ML thing sounds rad, but how do I go about using it in my company? Here are some pointers.

First, read the rest of this book to familiarize yourself with the concepts, technology, tools, and platforms. It is important to understand the landscape and understand that automated ML is a tool in your data science toolkit – it does not replace your data scientists.

Second, use automated ML as a democratization tool across the enterprise when you're dealing with analytics. Build a training plan for your team to become familiar with the tools, provide guidance, and chart a path to automation in data science workflows.

Lastly, due to the large churn in the feature sets, start with a smaller footprint, probably with an open source stack, before you commit to an enterprise framework. Scaling up this way will help you understand your own automation needs and give you time to do comparison shopping.

Summary

In this chapter, we covered the ML development life cycle and then defined automated ML and how it works. While building a case for the need for automated ML, we discussed the democratization of data science, debunked the myths surrounding automated ML, and provided a detailed walk-through of the automated ML ecosystem. Here, we reviewed the open source tools and then explored the commercial landscape. Finally, we discussed the future of automated ML, commented on the challenges and limitations of it, and finally provided some pointers on how to get started in an enterprise.

In the next chapter, we'll look under the hood of the technologies, techniques, and tools that are used to make automated ML possible. We hope that this chapter has introduced you to the automated ML fundamentals and that you are now ready to do a deeper dive into the topics that we discussed.

Automated Machine Learning

Chapter 1: A Lap around Automated Machine Learning

The ML development life cycle

Automated ML

How automated ML works

Hyperparameters

The need for automated ML

Democratization of data science

Debunking automated ML myths

Myth #1 – The end of data scientists

Myth #2 – Automated ML can only solve toy problems

Automated ML ecosystem

Open source platforms and tools

Microsoft NNI

auto-sklearn

Auto-Weka

Auto-Keras

TPOT

Ludwig – a code-free AutoML toolbox

AutoGluon – an AutoML toolkit for deep learning

Featuretools

H2O AutoML

Commercial tools and platforms

DataRobot

Google Cloud AutoML

Amazon SageMaker Autopilot

Azure Automated ML

H2O Driverless AI

The future of automated ML

The automated ML challenges and limitations

A Getting Started guide for enterprises

Summary

Further reading