Machine Learning Automation with TPOT

By Dario Radečić
About this book

The automation of machine learning tasks allows developers more time to focus on the usability and reactivity of the software powered by machine learning models. TPOT is a Python automated machine learning tool used for optimizing machine learning pipelines using genetic programming. Automating machine learning with TPOT enables individuals and companies to develop production-ready machine learning models cheaper and faster than with traditional methods.

With this practical guide to AutoML, developers working with Python on machine learning tasks will be able to put their knowledge to work and become productive quickly. You'll adopt a hands-on approach to learning the implementation of AutoML and associated methodologies. Complete with step-by-step explanations of essential concepts, practical examples, and self-assessment questions, this book will show you how to build automated classification and regression models and compare their performance to custom-built models. As you advance, you'll also develop state-of-the-art models using only a couple of lines of code and see how those models outperform all of your previous models on the same datasets.

By the end of this book, you'll have gained the confidence to implement AutoML techniques in your organization on a production level.

Publication date: May 2021
Publisher: Packt
Pages: 270
ISBN: 9781800567887

 

Chapter 1: Machine Learning and the Idea of Automation

In this chapter, we'll quickly review the essential machine learning topics, including supervised machine learning and the basic concepts of regression and classification.

We will understand why machine learning is essential for success in the 21st century from various perspectives: those of students, professionals, and business users, and we will discuss the different types of problems machine learning can solve.

Further, we will introduce the concept of automation and understand how it applies to machine learning tasks. We will go over automation options in the Python ecosystem and compare their pros and cons. We will briefly introduce the TPOT library, and discuss its role in the modern-day automation of machine learning.

This chapter will cover the following topics:

  • Reviewing the history of machine learning
  • Reviewing automation
  • Applying automation to machine learning
  • Automation options for Python
 

Technical requirements

To complete this chapter, you only need Python installed, alongside the basic data processing and machine learning libraries, such as numpy, pandas, matplotlib, and scikit-learn. You'll learn how to install and configure these in a virtual environment in Chapter 2, Deep Dive into TPOT, but let's keep this one easy. These libraries come preinstalled with any Anaconda distribution, so you shouldn't have to worry about it. If you are using raw Python instead of Anaconda, executing this line from the Terminal will install everything needed:

> pip install numpy pandas matplotlib scikit-learn

Keep in mind it's always a good practice to install libraries in a virtual environment, and you'll learn how to do that shortly.
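If you'd like to confirm that the installation worked, the following is a minimal sketch that looks up the installed version of each library using only the Python standard library. The REQUIRED list and the installed_versions helper are names made up for this illustration, not something the libraries provide:

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names as used by pip (scikit-learn is imported as "sklearn").
REQUIRED = ["numpy", "pandas", "matplotlib", "scikit-learn"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(REQUIRED)
for name, ver in versions.items():
    print(f"{name}: {ver or 'NOT INSTALLED'}")
```

Any library reported as NOT INSTALLED can be added with the pip command shown earlier.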

The code for this chapter can be downloaded here:

https://github.com/PacktPublishing/Machine-Learning-Automation-with-TPOT/tree/main/Chapter01

 

Reviewing the history of machine learning

Just over 25 years ago (1994), a question was asked in an episode of The Today Show: "What is the internet, anyway?" It's hard to imagine that a couple of decades ago, the general population had difficulty defining what the internet is and how it works. Little did they know that we would have intelligent systems managing themselves only a quarter of a century later, available to the masses.

The concept of machine learning was introduced much earlier, in 1949, by Donald Hebb. He presented theories on neuron excitement and communication between neurons (A Brief History of Machine Learning – DATAVERSITY, Foote, K.; March 26, 2019). He was the first to introduce the concept of artificial neurons, their activation, and their relationships through weights.

In the 1950s, Arthur Samuel developed a computer program for playing checkers. The memory was quite limited at that time, so he designed a scoring function that attempted to measure every player's probability of winning based on the positions on the board. The program chose its next move using a minimax strategy, which eventually evolved into the minimax algorithm (A Brief History of Machine Learning – DATAVERSITY, Foote, K.; March 26, 2019). Samuel was also the first one to come up with the term machine learning.

Frank Rosenblatt decided to combine Hebb's artificial brain cell model with the work of Arthur Samuel to create a perceptron. In 1957, a perceptron was planned as a machine, which led to building a Mark 1 perceptron machine, designed for image classification.

The idea seemed promising, to say the least, but the machine couldn't recognize useful visual patterns, which caused a stall in further research – this period is known as the first AI winter. There wasn't much going on with the perceptron and neural network models until the 1990s.

The preceding couple of paragraphs tell us more than enough about the state of machine learning and deep learning at the end of the 20th century. Groups of individuals were making tremendous progress with neural networks, while the general population had difficulty understanding even what the internet is.

To make machine learning useful in the real world, scientists and researchers required two things:

  • Data
  • Computing power

The first was rapidly becoming more available due to the rise of the internet. The second was slowly moving into a phase of exponential growth – both in CPU performance and storage capacity.

Still, the state of machine learning in the late 1990s and early 2000s was nowhere near where it is today. Today's hardware has led to a significant increase in the use of machine-learning-powered systems in production applications. It is difficult to imagine a world where Netflix doesn't recommend movies, or Google doesn't automatically filter spam from regular email.

But, what is machine learning, anyway?

What is machine learning?

There are a lot of definitions of machine learning out there, some more and some less formal. Here are a couple worth mentioning:

  • Machine learning is an application of artificial intelligence (AI) that provides systems the ability to automatically learn and improve from experience without being explicitly programmed (What is Machine Learning? A Definition – Expert System, Expert System Team; May 6, 2020).
  • Machine learning is the concept that a computer program can learn and adapt to new data without human intervention (Machine Learning – Investopedia, Frankenfield, J.; August 31, 2020).
  • Machine learning is a field of computer science that aims to teach computers how to learn and act without being explicitly programmed (Machine Learning – DeepAI, Deep AI Team; May 17, 2020).

Even though these definitions are expressed differently, they convey the same information. Machine learning aims to develop a system or an algorithm capable of learning from data without human intervention.

The goal of a data scientist isn't to instruct the algorithm on how to learn, but rather to provide an adequately sized and prepared dataset to the algorithm and briefly specify the relationships between the dataset variables. For example, suppose the goal is to produce a model capable of predicting housing prices. In that case, the dataset should provide observations on a lot of historical prices, measured through variables such as location, size, number of rooms, age, whether it has a balcony or a garage, and so on.

It's up to the machine learning algorithm to decide which features are important and which aren't, ergo, which features have significant predictive power. The example in the previous paragraph explained the idea of a regression problem solved with supervised machine learning methods. We'll soon dive into both concepts, so don't worry if you don't quite understand them yet.

Further, we might want to build a model that can predict, with a decent amount of confidence, whether a customer is likely to churn (break the contract). Useful features would be the list of services the client is using, how long they have been using the service, whether the previous payments were made on time, and so on. This is another example of a supervised machine learning problem, but the target variable (churn) is categorical (yes or no) and not continuous, as was the case in the previous example. We call these types of problems classification machine learning problems.

Machine learning isn't limited to regression and classification. It is applied to many other areas, such as clustering and dimensionality reduction. These fall into the category of unsupervised machine learning techniques. These topics won't be discussed in this chapter.

But first, let's answer a question on the usability of machine learning models, and discuss who uses these models and in which circumstances.

In which sectors are the companies using machine learning?

In a single word – everywhere. But you'll have to continue reading to get a complete picture. Machine learning has been adopted in almost every industry in the last decade or two. The main reason is the advancements in hardware. Also, machine learning has become easier for the broader masses to use and understand.

It would be impossible to list every industry in which machine learning is used and to discuss further the specific problems it solves. The easier task would be to list the industries that can't benefit from machine learning, as there are far fewer of those.

We'll focus only on the better-known industries in this section.

Here's a list and explanation of the ten most common use cases of machine learning, both from the industry standpoint and as a general overview:

  • The finance industry: Machine learning is gaining more and more popularity in the financial sector. Banks and financial institutions can use it to make smarter decisions. With machine learning, banks can detect clients who most likely won't repay their loans. Further, banks can use machine learning methods to track and understand the spending patterns of their customers. This can lead to the creation of more personalized services to the satisfaction of both parties. Machine learning can also be used to detect anomalies and fraud through unexpected behaviors on some client accounts.
  • Medical industry: The recent advancements in medicine are at least partly due to advancements in machine learning. Various predictive methods can be used to detect diseases in the early stages, based on which medical experts can construct personalized therapy and recovery plans. Computer vision techniques such as image classification and object detection can be used, for example, to perform classification on lung images. These can also be used to detect the presence of a tumor based on a single image or a sequence of images.
  • Image recognition: This is probably the most widely used application of machine learning because it can be applied in any industry. You can go from a simple cat-versus-dog image classification to classifying the skin conditions of endangered animals in Africa. Image recognition can also be used to detect whether an object of interest is present in the image. For example, the automatic detection of Waldo in the Where's Waldo? game has roughly the same logic as an algorithm in autonomous vehicles that detects pedestrians.
  • Speech recognition: Yet another exciting and promising field. The general idea is that an algorithm can automatically recognize the spoken words in an audio clip and then convert them to a text file. Some of the better-known applications are appliance control (controlling the air conditioner with voice commands), voice dialing (automated recognition of a contact to call just from your voice), and internet search (browsing the web with your voice). These are only a couple of examples that immediately pop into mind. Automatic speech recognition software is challenging to develop. Not all languages are supported, and many non-native speakers have accents when speaking in a foreign language, which the ML algorithm may struggle to recognize.
  • Natural Language Processing (NLP): Companies in the private sector can benefit tremendously from NLP. For example, a company can use NLP to analyze the sentiments of online reviews left by their customers if there are too many to classify manually. Further, companies can create chatbots on web pages that immediately start conversations with users, which then leads to more potential sales. For a more advanced example, NLP can be used to write summaries of long documents and even segment and analyze protein sequences.
  • Recommender systems: As of late 2020, it's difficult to imagine a world where Google doesn't tailor the search results based on your past behaviors, Amazon doesn't automatically recommend similar products, Netflix doesn't recommend movies and TV shows based on the past watches, and Spotify doesn't recommend music that somehow flew under your radar. These are only a couple of examples, but it's not difficult to recognize the importance of recommender systems.
  • Spam detection: Just like it's hard to imagine a world where the search results aren't tailored to your liking, it's also hard to imagine an email service that doesn't automatically filter out messages about that now-or-never discount on a vacuum cleaner. We are bombarded with information every day, and automatic spam detection algorithms can help us focus on what's important.
  • Automated trading: Even the stock market is moving too fast to fully capture what's happening without automated means. Developing trading bots isn't easy, but machine learning can help you pick the best times to buy or sell, based on tons of historical data. If fully automated, you can watch how your money creates money while sipping margaritas on the beach. It might sound like a stretch to some of you, but with robust models and a ton of domain knowledge, I can't see why not.
  • Anomaly detection: Let's dial back to our banking industry example. Banks can use anomaly detection algorithms for various use cases, such as flagging suspicious transactions and activities. Lately, I've been using anomaly detection algorithms to detect suspicious behavior in network traffic with the goal of automatic detection of cyberattacks and malware. It is another technique applicable to any industry if the data is formatted in the right way.
  • Social networks: How many times has Facebook recommended you people you may know? Or YouTube recommended the video on the topic you were just thinking about? No, they are not reading your mind, but they are aware of your past behaviors and decisions and can predict your next move with a decent amount of confidence.

These are just a couple of examples of what machine learning can do – not an exhaustive list by any means. You are now familiar with a brief history of machine learning and know how machine learning can be applied to a wide array of tasks.

The next section will provide a brief refresher on supervised machine learning techniques, such as regression and classification.

Supervised learning

The majority of practical machine learning problems are solved through supervised learning algorithms. Supervised learning refers to a situation where you have an input variable (a predictor), typically denoted with X, and an output variable (what you are trying to predict), typically denoted with y.

There's a reason why features (X) are denoted with a capital letter and the target variable (y) isn't. In math terms, X denotes a matrix of features, and matrices are typically denoted with capital letters. On the other hand, y is a vector, and lowercase letters are typically used to denote vectors.
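To make the matrix-versus-vector distinction concrete, here is a small sketch with NumPy. The numbers are hypothetical (four observations with two features each) and serve only to show the shapes involved:

```python
import numpy as np

# Hypothetical values: 4 observations, 2 features each (living area, rooms).
X = np.array([
    [300, 2],
    [356, 2],
    [501, 3],
    [407, 3],
])

# One target value per observation.
y = np.array([100, 120, 180, 152])

print(X.shape)  # two-dimensional: rows are observations, columns are features
print(y.shape)  # one-dimensional: one target per row of X
```

Every row of X pairs with exactly one element of y, which is why the two must always have the same number of rows.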

The goal of a supervised machine learning algorithm is to learn the function that can transform any input into the output. The most general math representation of a supervised learning algorithm is represented with the following formula:

$$ y = f(X) $$

Figure 1.1 – General supervised learning formula

We must apply one of two corrections to make this formula acceptable. The first one is to replace y with y-hat, as y generally denotes the true value, and y-hat denotes the prediction. The second correction we could make is to add the error term, as only then can we have the correct value of y on the other side. The error term represents the irreducible error – the type of error that can't be reduced by further training.

Here's how the first corrected formula looks:

$$ \hat{y} = f(X) $$

Figure 1.2 – Corrected supervised learning formula (v1)

And here's the second one:

$$ y = f(X) + \epsilon $$

Figure 1.3 – Corrected supervised learning formula (v2)

It's more common to see the second one, but don't be confused by any of the formats – these formulas generally represent the same thing.
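The role of the error term can be illustrated with a short simulation. This is only a sketch under made-up assumptions: the function f, the noise level noise_std, and the sample size are all hypothetical. It shows that even a model that knows the true function exactly is still off by roughly the size of the noise:

```python
import numpy as np

rng = np.random.default_rng(seed=42)


def f(X):
    """Hypothetical 'true' relationship between the input and the output."""
    return 3.0 * X + 10.0


X = rng.uniform(0, 100, size=10_000)
noise_std = 5.0
eps = rng.normal(0, noise_std, size=X.shape)  # the error term

y = f(X) + eps   # observed values: the true function plus the error term
y_hat = f(X)     # predictions of a "perfect" model that knows f exactly

# Even the perfect model has an error close to the noise level.
rmse = np.sqrt(np.mean((y - y_hat) ** 2))
print(f"Noise level: {noise_std:.2f}, RMSE of the perfect model: {rmse:.2f}")
```

No amount of additional training can push the error below this floor, which is what makes it irreducible.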

Supervised machine learning is called "supervised" because we have the labeled data at our disposal. You might have already picked up on this from the feature and target discussion. This means that we have the correct answers already, ergo, we know which combinations of X yield the corresponding values of y.

The end goal is to make the best generalization from the data available. We want to produce the most unbiased model capable of generalizing on new, unseen data. The concepts of overfitting, underfitting, and the bias-variance trade-off are important to produce such a model, but they are not in the scope of this book.

As we've already mentioned, supervised learning problems are grouped into two main categories:

  • Regression: The target variable is continuous in nature, such as the price of a house in USD, the temperature in degrees Fahrenheit, weight in pounds, height in inches, and so on.
  • Classification: The target variable is a category – either binary (true/false, positive/negative, disease/no disease), or multi-class (no symptoms/mild symptoms/severe symptoms, school grades, and so on).

Both regression and classification are explored in the following sections.

Regression

As briefly discussed in the previous sections, regression refers to a type of problem where the target variable is continuous. The target variable could represent a price, a weight, or a height, to name a few.

The most common type of regression is linear regression, a model where a linear relationship between variables is assumed. Linear regression further divides into a simple linear regression (only one feature), and multiple linear regression (multiple features).
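The difference between the two is easiest to see side by side. The following sketch uses hypothetical, noise-free data generated from a known linear rule (the coefficients 0.3, 10.0, and 5.0 are arbitrary choices for this illustration), so we can check that multiple linear regression recovers the rule:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(seed=0)

# Hypothetical features: living area and number of rooms.
area = rng.uniform(300, 1000, size=50)
rooms = rng.integers(1, 6, size=50)

# Noise-free target built from a known linear rule, so the fit can be checked.
price = 0.3 * area + 10.0 * rooms + 5.0

# Simple linear regression: a single feature.
X_simple = area.reshape(-1, 1)
simple = LinearRegression().fit(X_simple, price)

# Multiple linear regression: two features, one coefficient per feature.
X_multiple = np.column_stack([area, rooms])
multiple = LinearRegression().fit(X_multiple, price)

print("Simple   R2:", round(simple.score(X_simple, price), 3))
print("Multiple R2:", round(multiple.score(X_multiple, price), 3))
print("Coefficients:", multiple.coef_, "Intercept:", multiple.intercept_)
```

The simple model can only use the living area, so it misses the part of the price explained by the number of rooms, while the multiple model learns one coefficient for each feature.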

Important note

Keep in mind that linear regression isn't the only type of regression. You can perform regression tasks with algorithms such as decision trees, random forests, support vector machines, gradient boosting, and artificial neural networks, but the same concepts still apply.
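As a rough sketch of what "the same concepts still apply" means in practice, scikit-learn exposes the same fit and predict calls for different regression algorithms. The data below is made up, and the hyperparameter values (max_depth, n_estimators) are arbitrary choices for illustration rather than recommendations:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

# Small made-up dataset: living area -> price (000 USD).
X = np.array([[300], [356], [501], [407], [950], [782], [664], [456], [673]])
y = np.array([100, 120, 180, 152, 320, 260, 210, 150, 245])

models = {
    "linear regression": LinearRegression(),
    "decision tree": DecisionTreeRegressor(max_depth=3, random_state=42),
    "random forest": RandomForestRegressor(n_estimators=50, random_state=42),
}

predictions = {}
for name, model in models.items():
    model.fit(X, y)                                # same call for every algorithm
    predictions[name] = model.predict([[600]])[0]  # price estimate for an area of 600
    print(f"{name}: {predictions[name]:.1f}")
```

Swapping one algorithm for another only changes the line where the model is created, which is exactly the kind of search that can be automated.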

To make a quick recap of the regression concept, we'll declare a simple pandas.DataFrame object consisting of two columns – Living area and Price. The goal is to predict the price based only on the living space. We are using a simple linear regression model here just because it makes the data visualization process simpler, which, as the end result, makes the regression concept easy to understand:

  1. The following is the dataset – both columns contain arbitrary and made-up values:
    import pandas as pd 
    df = pd.DataFrame({
        'LivingArea': [300, 356, 501, 407, 950, 782, 
                       664, 456, 673, 821, 1024, 900, 
                       512, 551, 510, 625, 718, 850],
        'Price': [100, 120, 180, 152, 320, 260, 
                  210, 150, 245, 300, 390, 305, 
                  175, 185, 160, 224, 280, 299]
    })
  2. To visualize these data points, we will use the matplotlib library. By default, the charts don't look very appealing, so a couple of tweaks are made through the matplotlib.rcParams settings object:
    import matplotlib.pyplot as plt 
    from matplotlib import rcParams
    rcParams['figure.figsize'] = 14, 8
    rcParams['axes.spines.top'] = False
    rcParams['axes.spines.right'] = False
  3. The preceding options make the charts larger by default, and remove the borders (spines) on the top and right. The following code snippet visualizes our dataset as a two-dimensional scatter plot:
    plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200)
    plt.title('Living area vs. Price (000 USD)', size=20)
    plt.xlabel('Living area', size=14)
    plt.ylabel('Price (000 USD)', size=14)
    plt.show()

    The preceding code produces the following graph:

    Figure 1.4 – Regression – Scatter plot of living area and price (000 USD)

  4. Training a linear regression model is most easily achieved with the scikit-learn library. The library contains tons of different algorithms and techniques we can apply to our data. The sklearn.linear_model module contains the LinearRegression class. We'll use it to train the model on the entire dataset, and then to make predictions on the entire dataset. That's not something you would usually do in a production environment, but it is essential here to get a further understanding of how the model works:
    from sklearn.linear_model import LinearRegression
    model = LinearRegression()
    model.fit(df[['LivingArea']], df[['Price']])
    preds = model.predict(df[['LivingArea']])
    df['Predicted'] = preds
  5. We've assigned the prediction as yet another dataset column, just to make data visualization simpler. Once again, we can create a chart containing the entire dataset as a scatter plot. This time, we will add a line that represents the line of best fit – the line where the error is smallest:
    plt.scatter(df['LivingArea'], df['Price'], color='#7e7e7e', s=200, label='Data points')
    plt.plot(df['LivingArea'], df['Predicted'], color='#040404', label='Best fit line')
    plt.title('Living area vs. Price (000 USD)', size=20)
    plt.xlabel('Living area', size=14)
    plt.ylabel('Price (000 USD)', size=14)
    plt.legend()
    plt.show()

    The preceding code produces the following graph:

    Figure 1.5 – Regression – Scatter plot of living area and price (000 USD) with the line of best fit

  6. As we can see, the simple linear regression model almost perfectly captures our dataset. This is not a surprise, as the dataset was created for this purpose. New predictions would be made along the line of best fit. For example, if we were interested in predicting the price of a house that has a living space of 1,000 square meters, the model would make a prediction just a bit north of $350K. The implementation of this in the code is simple:
    model.predict([[1000]])
    >>> array([[356.18038708]])
  7. Further, if you were interested in evaluating this simple linear regression model, metrics like R2 and RMSE are a good choice. R2 measures the goodness of fit, ergo it tells us what proportion of the variance in the target our model captures. It is at most 1 and typically falls between 0 and 1, although it can be negative for a model that fits worse than always predicting the mean. It is more formally referred to as the coefficient of determination. RMSE measures the typical size of the model's error, in the unit of interest. For example, an RMSE value of 10 would mean that our model is typically wrong by roughly $10K, in either the positive or negative direction.

    Both the R2 score and the RMSE are calculated as follows:

    import numpy as np
    from sklearn.metrics import r2_score, mean_squared_error
    rmse = lambda y, ypred: np.sqrt(mean_squared_error(y, ypred))
    model_r2 = r2_score(df['Price'], df['Predicted'])
    model_rmse = rmse(df['Price'], df['Predicted'])
    print(f'R2 score: {model_r2:.2f}')
    print(f'RMSE: {model_rmse:.2f}')
    >>> R2 score: 0.97
    >>> RMSE: 13.88
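To see what these two metrics compute under the hood, here is a small sketch that derives both by hand with NumPy and compares them with the scikit-learn functions. The values in y_true and y_pred are hypothetical and unrelated to the dataset used previously:

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical true values and predictions, in thousands of USD.
y_true = np.array([100.0, 120.0, 180.0, 152.0, 320.0])
y_pred = np.array([110.0, 115.0, 170.0, 160.0, 300.0])

residuals = y_true - y_pred

# RMSE: square the errors, average them, and take the square root.
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R2: 1 - (unexplained variation / total variation around the mean).
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

# The library functions should agree with the manual calculations.
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_pred))
r2_sklearn = r2_score(y_true, y_pred)

print(f"RMSE: manual={rmse_manual:.4f}, sklearn={rmse_sklearn:.4f}")
print(f"R2:   manual={r2_manual:.4f}, sklearn={r2_sklearn:.4f}")
```

Because the errors are squared before averaging, a few large mistakes raise the RMSE more than many small ones.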

To conclude, we've built a simple but accurate model. Don't expect data in the real world to behave this nicely, and also don't expect to build such accurate models most of the time. The process of model selection and tuning is tedious and prone to human error, and that's where automation libraries such as TPOT come into play.
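We trained and evaluated the model on the same rows, which tends to paint an overly optimistic picture. The following is a minimal sketch of the more usual approach, in which part of the data is held out for evaluation. The test_size and random_state values are arbitrary choices for this illustration:

```python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# The same made-up dataset as in the regression example.
df = pd.DataFrame({
    'LivingArea': [300, 356, 501, 407, 950, 782,
                   664, 456, 673, 821, 1024, 900,
                   512, 551, 510, 625, 718, 850],
    'Price': [100, 120, 180, 152, 320, 260,
              210, 150, 245, 300, 390, 305,
              175, 185, 160, 224, 280, 299]
})

# Hold out a quarter of the rows; the model never sees them during training.
X_train, X_test, y_train, y_test = train_test_split(
    df[['LivingArea']], df['Price'], test_size=0.25, random_state=42
)

model = LinearRegression()
model.fit(X_train, y_train)

# Evaluate only on the held-out rows.
test_preds = model.predict(X_test)
test_r2 = r2_score(y_test, test_preds)
print(f'Rows used for training: {len(X_train)}, for testing: {len(X_test)}')
print(f'Held-out R2: {test_r2:.2f}')
```

With only 18 rows, the held-out score depends heavily on which rows end up in the test set, so treat it as an illustration of the workflow rather than a reliable estimate.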

We'll cover a classification refresher in the next section, again with a fairly simple example.

Classification

Classification in machine learning refers to a type of problem where the target variable is categorical. We could turn the example from the Regression section into a classification problem by converting the target variable into categories, such as Sold/Did not sell.

In a nutshell, classification algorithms help us in various scenarios, such as predicting customer attrition, whether a tumor is malignant or not, whether someone has a given disease or not, and so on. You get the point.

Classification tasks can be further divided into binary classification tasks and multi-class classification tasks. We'll explore binary classification tasks briefly in this section. The most basic classification algorithm is logistic regression, and we'll use it in this section to build a simple classifier.

Note

Keep in mind that you are not limited only to logistic regression for performing classification tasks. On the contrary – it's good practice to use a logistic regression model as a baseline, and to use more sophisticated algorithms in production. More sophisticated algorithms include decision trees, random forests, gradient boosting, and artificial neural networks.
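Here is a rough sketch of the baseline idea from this note. A logistic regression model is compared with two of the more sophisticated algorithms through the same fit and score calls. The data is hypothetical, the hyperparameters are arbitrary, and accuracy is measured on the training data only to keep the sketch short:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Hypothetical one-feature dataset: small values are class 0, large values are class 1.
X = np.array([[0.2], [0.3], [0.5], [0.6], [0.4],
              [1.6], [1.8], [2.0], [1.7], [1.9]])
y = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

classifiers = {
    "logistic regression (baseline)": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(random_state=42),
    "random forest": RandomForestClassifier(n_estimators=50, random_state=42),
}

accuracies = {}
for name, clf in classifiers.items():
    clf.fit(X, y)
    accuracies[name] = clf.score(X, y)  # share of correctly classified training rows
    print(f"{name}: {accuracies[name]:.2f}")
```

A more complex model earns its place only if it beats the baseline on data it hasn't seen, which is why a baseline is worth having before trying anything fancier.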

The data is completely made up and arbitrary in this example:

  1. We have two columns – the first indicates a measurement of some sort (called Radius), and the second column denotes the classification (either 0 or 1). The dataset is constructed with the following Python code:
    import numpy as np
    import pandas as pd
    df = pd.DataFrame({
        'Radius': [0.3, 0.1, 1.7, 0.4, 1.9, 2.1, 0.25, 
                   0.4, 2.0, 1.5, 0.6, 0.5, 1.8, 0.25],
        'Class': [0, 0, 1, 0, 1, 1, 0, 
                  0, 1, 1, 0, 0, 1, 0]
    })
  2. We'll use the matplotlib library once again for visualization purposes. Here's how to import it and make it a bit more visually appealing:
    import matplotlib.pyplot as plt 
    from matplotlib import rcParams
    rcParams['figure.figsize'] = 14, 8
    rcParams['axes.spines.top'] = False
    rcParams['axes.spines.right'] = False
  3. We can reuse the same logic from the previous regression example to make a visualization. This time, however, we won't see data that closely resembles a line. Instead, we'll see data points separated into two groups. On the lower left are the data points where the Class attribute is 0, and on the upper right are those where it's 1:
    plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200)
    plt.title('Radius classification', size=20)
    plt.xlabel('Radius (cm)', size=14)
    plt.ylabel('Class', size=14)
    plt.show()

    The following graph is the output of the preceding code:

    Figure 1.6 – Classification – Scatter plot between measurements and classes

    The goal of a classification model isn't to produce a line of best fit, but instead to draw out the best possible separation between the classes.

  4. The logistic regression model is available in the sklearn.linear_model module. We'll use it to train the model on the entire dataset, and then to make predictions on the entire dataset. Again, that's not something we will keep doing later on in the book, but it is essential to get insights into the inner workings of the model at this point:
    from sklearn.linear_model import LogisticRegression
    model = LogisticRegression()
    model.fit(df[['Radius']], df['Class'])
    preds = model.predict(df[['Radius']])
    df['Predicted'] = preds
  5. We can now use this model to make predictions on a dense grid of X values, ranging from 0 to slightly above the largest Radius in the dataset. The evenly spaced numbers are obtained through the np.linspace function. It takes three arguments here – start, stop, and the number of elements. We'll set the number of elements to 1000.
  6. Then, we can draw a line that shows the predicted class for every generated value of X. By doing so, we can visualize the decision boundary of the model:
    xs = np.linspace(0, df['Radius'].max() + 0.1, 1000)
    # Predict all grid points in one call; the column name matches the training data
    ys = model.predict(pd.DataFrame({'Radius': xs}))
    plt.scatter(df['Radius'], df['Class'], color='#7e7e7e', s=200, label='Data points')
    plt.plot(xs, ys, color='#040404', label='Decision boundary')
    plt.title('Radius classification', size=20)
    plt.xlabel('Radius (cm)', size=14)
    plt.ylabel('Class', size=14)
    plt.legend()
    plt.show()

    The preceding code produces the following visualization:

    Figure 1.7 – Classification – Scatter plot between measurements and classes and the decision boundary

    Our classification model is basically a step function, which is understandable for this simple problem. Nothing more complex is needed to correctly classify every instance in our dataset. This won't always be the case, but more on that later.

  7. A confusion matrix is one of the best methods for evaluating classification models. Our negative class is 0, and the positive class is 1. The confusion matrix is just a square matrix that shows the following:
    • True negatives: The upper left number. These are instances that had the class of 0 and were predicted as 0 by the model.
    • False negatives: The bottom left number. These are instances that had the class of 1, but were predicted as 0 by the model.
    • False positives: The top right number. These are instances that had the class of 0, but were predicted as 1 by the model.
    • True positives: The bottom right number. These are instances that had the class of 1 and were predicted as 1 by the model.

      Read the previous list as many times as necessary to completely understand the idea. The confusion matrix is an essential concept in classifier evaluation, and the later chapters in this book assume you know how to interpret it.

  8. The confusion matrix is available in the sklearn.metrics module. Here's how to import it and obtain the results:
    from sklearn.metrics import confusion_matrix
    confusion_matrix(df['Class'], df['Predicted'])

    Here are the results:

Figure 1.8 – Classification – Evaluation with a confusion matrix

The previous figure shows that our model was able to classify every instance correctly. As a rule of thumb, if the off-diagonal elements (the bottom left and top right numbers) are zeros, it means the model classified every evaluated instance correctly.
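A perfect matrix doesn't show where each kind of mistake would appear, so here is a small sketch with hypothetical labels that contain one false positive and one false negative. It relies on the scikit-learn convention that rows are the actual classes and columns are the predicted classes:

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Hypothetical labels with one mistake of each kind, so no cell is empty.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 1, 1, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[3 1]
#  [1 3]]

# Rows are actual classes and columns are predicted classes,
# so flattening row by row yields TN, FP, FN, TP.
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of instances on the main diagonal.
accuracy = (tn + tp) / cm.sum()
print(f"TN={tn}, FP={fp}, FN={fn}, TP={tp}, accuracy={accuracy:.2f}")
# TN=3, FP=1, FN=1, TP=3, accuracy=0.75
```

The single false positive lands in the top right cell and the single false negative in the bottom left cell, which matches the positions described in the list above.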

The confusion matrix interpretation concludes our brief refresher on supervised machine learning methods. Next, we will dive into the idea of automation, and discuss why we need it in machine learning.

 

Reviewing automation

This section briefly discusses the idea of automation, why we need it, and how it applies to machine learning. We will also answer the age-old question of machine learning replacing humans in their jobs, and the role of automation in that regard.

Automation plays a huge role in the modern world, and in the past centuries it has allowed us to completely remove the human factor from dangerous and repetitive jobs. This has opened a new array of possibilities on the job market, where jobs are generally based on something that cannot be automated, at least at this point in time.

But first, we have to understand what automation is.

What is automation?

There are many syntactically different definitions out there, but they all share the same basic idea. The following one presents the idea in the simplest terms:

Automation is a broad term that can cover many areas of technology where human input is minimized (What is Automation? – IBM, IBM Team; February 28, 2020).

The essential part of the definition is the minimization of the human input. An automated process is entirely or almost entirely managed by a machine. Up to a couple of years back, machines were a great way to automate boring, routine tasks, and leave creative things to people. As you might guess, machines are not that great with creative tasks. That is, they weren't until recently.

Machine learning provides us with a mechanism to automate not only calculations, spreadsheet management, and expense tracking, but also more cognitive tasks, such as decision making. The field evolves by the day and it's hard to say when exactly we can expect machines to take over some more creative jobs.

The concept of automation in machine learning is discussed later, but it's important to remember that machine learning can take automation to a whole other level. Not every form of automation is equal, and the generally accepted division of automation is into four levels, based on complexity:

  • Basic automation: Automation of the simplest tasks. Robotic Process Automation (RPA) is the perfect example, as its goal is to use software bots to automate repetitive tasks. The end goal of this automation category is to completely remove the human factor from the equation, resulting in faster execution of repetitive tasks without error.
  • Process automation: This uses and applies basic automation techniques to an entire business process. The end goal is to completely automate a business activity and leave humans to only give the final approval.
  • Integration automation: This uses rules defined by humans to mimic human behavior in task completion. The end goal is to minimize human intervention in more complex business tasks.
  • AI automation: The most complex form of automation. The goal is to have a machine that can learn and make decisions based on previous situations and the decisions made in those situations.

You now know what automation is, and next, we'll discuss why it is a must in the 21st century.

Why is automation needed?

Both companies and customers can benefit from automation. Automation can improve resource allocation and management, and can make the business scaling process easier. Due to automation, companies can provide a more reliable and consistent service, which results in a more consistent user experience. As the end result, customers are more likely to buy and spend more than if the service quality was not consistent.

In the long run, automation simplifies and reduces human activities and cuts costs. Further, a well-implemented automated process is likely to perform more consistently than the same process performed by humans. Machines don't get tired, don't have a bad day, and don't require a salary.

The following list shows some of the most important reasons for automation:

  • Time saving: Automation simplifies daily routine tasks by making machines do them instead of humans. As the end result, humans can focus on more creative tasks right from the start.
  • Reduced cost: Automation should be thought of as a long-term investment. It comes with some start-up costs, sure, but those are covered quickly if automation is implemented correctly.
  • Accuracy and consistency: As mentioned before, humans are prone to errors, bad days, and inconsistencies. That's not the case with machines.
  • Workflow enhancements: Due to automation, more time can be spent on important tasks, such as providing individual assistance to customers. Employees tend to be happier and deliver better results if their shift isn't made up solely of repetitive and routine tasks.

The difficult question is not "do you automate?" but rather, "when do you automate?" There are a lot of different opinions on this topic and there isn't a single right or wrong answer. Deciding when to automate depends on the budget you have available and on the opportunity cost (the decisions/investments you would be able to make if time was not an issue).

Automating anything you are good at and focusing on the areas that require improvement is a general rule of thumb for most companies. Even as an individual, there is a high probability that you are doing something on a daily or weekly basis that can be described in plain language. And if something can be described step by step, it can be automated.

But how does the concept of automation apply to machine learning? Are machine learning and automation synonymous? That's what we will discuss next.

Are machine learning and automation the same thing?

Well, no. But machine learning can take automation to a whole different level. Let's refer back to the four levels of automation discussed a few paragraphs ago. Only the last one uses machine learning, and it is the most advanced form of automation.

Let's consider a single activity in our day as a process. If you know exactly how the process will start and end, and everything that will happen in between and in which order, then this process can be automated without machine learning.

Here's an example. For the last couple of months, you've been monitoring real-estate prices in an area you want to move to. Every morning you make yourself a cup of coffee, sit in front of a laptop, and go to a real estate website. You filter the results to see only the ads that were placed in the last 24 hours, and then enter the data, such as the location, unit price, number of rooms, and so on, into a spreadsheet.

This process takes about an hour of your day, which adds up to roughly 30 hours per month. That is a lot. In 30 hours, you can easily read a book or take an online course to further develop your skills in some other area. The process described above can be automated easily, without the need for machine learning.
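As a rough illustration of why no machine learning is needed here, the filtering and spreadsheet steps can be written as fixed rules. The following is only a sketch: the listings are hypothetical in-memory records (fetching them from a real website is left out), and the helper names recent_listings and to_csv are made up for this example.

```python
import csv
import io
from datetime import datetime, timedelta

# Hypothetical listings; in practice these would come from the website.
now = datetime(2021, 5, 1, 8, 0)
listings = [
    {"location": "Center", "price": 250000, "rooms": 3,
     "posted": datetime(2021, 5, 1, 6, 30)},
    {"location": "Suburb", "price": 180000, "rooms": 4,
     "posted": datetime(2021, 4, 29, 12, 0)},
    {"location": "Riverside", "price": 320000, "rooms": 2,
     "posted": datetime(2021, 4, 30, 21, 15)},
]

def recent_listings(rows, reference_time, hours=24):
    """Keep only the ads placed within the last `hours` hours."""
    cutoff = reference_time - timedelta(hours=hours)
    return [row for row in rows if row["posted"] >= cutoff]

def to_csv(rows, fields):
    """Write the selected fields to CSV text (a stand-in for the spreadsheet)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fields, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

fresh = recent_listings(listings, now)
print(to_csv(fresh, ["location", "price", "rooms"]))
```

Every step is known in advance and involves no decision making, which is exactly what makes this kind of process a candidate for plain, rule-based automation.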

Let's take a look at another example. You are spending multiple hours per day on the stock market, deciding what to buy and what to sell. This process is different from the previous one, as it involves some sort of decision making. The thing is, with all of the datasets available online, a skilled individual can use machine learning methods to automate the buy/sell decision-making process.

This is the form of automation that includes machine learning, but no, machine learning and automation are not synonymous. Each can work without the other.

The following sections discuss in great detail the role of automation in machine learning (not vice versa), and answer what we are trying to automate and how it can be achieved in the modern day and age.

 

Applying automation to machine learning

We've covered the idea of automation and various types of automation thus far, but what's the connection between automation and machine learning? What exactly is it that we are trying to automate in machine learning?

That's what this section aims to demystify. By the end of this section, you will know the difference between the terms automation with machine learning and automating machine learning. These two might sound similar at first, but are very different in reality.

What are we trying to automate?

Let's get one thing straight – automation of machine learning processes has nothing to do with business process automation with machine learning. In the former, we're trying to automate the machine learning itself, that is, the process of selecting the best model and the best hyperparameters. The latter refers to automating a business process with the help of machine learning; for example, making a decision system that decides when to buy or sell a stock based on historical data.

It's crucial to remember this distinction. The primary focus of this book is to demonstrate how automation libraries can be used to automate the process of machine learning. By doing so, you will follow the exact same approach, regardless of the dataset, and end up with a strong model, typically the best one the search can find in the time you give it.

Choosing an appropriate machine learning algorithm isn't an easy task. Just take a look at the following diagram:

Figure 1.9 – Algorithms in scikit-learn (source: Scikit-learn: Machine Learning in Python, Pedregosa et al., JMLR 12, pp. 2825-2830, 2011)


As you can see, multiple decisions are required to select an appropriate algorithm. In addition, every algorithm has its own set of hyperparameters (parameters specified by the engineer). To make things even worse, some of these hyperparameters are continuous in nature, so when you add it all up, there are hundreds of thousands or even millions of hyperparameter combinations that you as an engineer should test for.

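Every hyperparameter combination requires training and evaluation of a completely new model. Techniques such as grid search can help you avoid writing tens of nested loops, but they are far from an optimal solution, because the number of models to train still grows multiplicatively with every hyperparameter you add.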
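The idea behind grid search can be sketched in a few lines. This is an illustration only: the search space is hypothetical, and evaluate is a toy stand-in for "train a model and return its validation score", not a real training routine.

```python
from itertools import product

# Hypothetical search space; the names mimic common tree-model hyperparameters.
grid = {
    "max_depth": [2, 4, 6],
    "learning_rate": [0.01, 0.1, 0.3],
    "subsample": [0.5, 1.0],
}

def evaluate(params):
    """Toy stand-in for training a model and returning its validation score."""
    return -((params["max_depth"] - 4) ** 2
             + (params["learning_rate"] - 0.1) ** 2
             + (params["subsample"] - 1.0) ** 2)

names = list(grid)
best_score, best_params = float("-inf"), None
n_evaluated = 0

# product() replaces one nested loop per hyperparameter.
for values in product(*grid.values()):
    params = dict(zip(names, values))
    score = evaluate(params)
    n_evaluated += 1
    if score > best_score:
        best_score, best_params = score, params

print(n_evaluated)   # 18 combinations, one "model" each
print(best_params)   # {'max_depth': 4, 'learning_rate': 0.1, 'subsample': 1.0}
```

Even this tiny grid requires 18 evaluations, and each one would be a full training run in a real project.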

Modern machine learning engineers don't spend their time and energy on model training and optimization – but instead on raising the data quality and availability. Hyperparameter tweaking can squeeze that additional 2% increase in accuracy, but it is the data quality that can make or break your project.

We'll dive a bit deeper into hyperparameters next and demonstrate why searching for the optimal ones manually isn't that good an idea.

The problem of too many parameters

Let's take a look at some of the hyperparameters available for one of the most popular machine learning algorithms – XGBoost. The following list shows the general ones:

  • booster
  • verbosity
  • validate_parameters
  • nthread
  • disable_default_eval_metric
  • num_pbuffer
  • num_feature

That's not many, and some of these hyperparameters are set automatically by the algorithm. The problem lies in the choices that follow. For example, if you choose gbtree as a value for the booster parameter, you can immediately tweak the values for the following:

  • eta
  • gamma
  • max_depth
  • min_child_weight
  • max_delta_step
  • subsample
  • sampling_method
  • colsample_bytree
  • colsample_bylevel
  • colsample_bynode
  • lambda
  • alpha
  • tree_method
  • sketch_eps
  • scale_pos_weight
  • updater
  • refresh_leaf
  • process_type
  • grow_policy
  • max_leaves
  • max_bin
  • predictor
  • num_parallel_tree
  • monotone_constraints
  • interaction_constraints

And that's a lot! As mentioned before, some hyperparameters take in continuous values, which tremendously increases the total number of combinations. Here's the final icing on the cake – these are only hyperparameters for a single model. Different models have different hyperparameters, which makes the tuning process that much more time consuming.
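A quick back-of-the-envelope calculation shows how fast the numbers grow. The sketch below uses a deliberately coarse, hypothetical grid over just eight of the gbtree hyperparameters listed above; both the candidate values and the assumed 10 seconds per training run are illustrative, not recommendations.

```python
from math import prod

# Hypothetical candidate values for a handful of gbtree hyperparameters.
candidates = {
    "eta": [0.01, 0.05, 0.1, 0.3],
    "gamma": [0, 1, 5],
    "max_depth": [3, 5, 7, 9],
    "min_child_weight": [1, 3, 5],
    "subsample": [0.6, 0.8, 1.0],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "lambda": [0, 1, 10],
    "alpha": [0, 1, 10],
}

# The grid size is the product of the number of candidates per hyperparameter.
n_combinations = prod(len(values) for values in candidates.values())

# Assumed cost of a single training-and-evaluation run, in seconds.
seconds_per_fit = 10
days = n_combinations * seconds_per_fit / (60 * 60 * 24)

print(n_combinations)  # 11664
print(days)            # a bit more than one day of sequential compute
```

Continuous hyperparameters were reduced to three or four values each here; a finer grid, or more of the hyperparameters from the list, pushes the count into the millions.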

Put simply, model selection and hyperparameter tuning aren't things you should do manually. There are more important tasks to spend your energy on. Even if there's nothing else you have to do, I'd prefer going for lunch instead of manual tuning any day of the week.

AutoML enables us to do just that, so we'll explore it briefly in the next section.

What is AutoML?

AutoML stands for Automated Machine Learning, and its primary goal is to reduce or completely eliminate the role of data scientists in building machine learning models. Hearing that sentence might be harsh at first. I know what you are thinking. But no – AutoML can't replace data scientists and other data professionals.

In the best-case scenario, AutoML technologies enable other software engineers to utilize the power of machine learning in their application, without the need to have a solid background in ML. This best-case scenario is only possible if the data is adequately gathered and prepared – a task that's not the specialty of a backend developer.

To make things even harder for the non-data scientist, the machine learning process often requires extensive feature engineering. This step can be skipped, but more often than not, this will result in poor models.

In conclusion, AutoML won't replace data scientists. Quite the contrary – it's here to make the life of data scientists easier. The only parts AutoML automates to the full extent are model selection and tuning.

There are some AutoML services that advertise themselves as fully automating even the data preparation and feature engineering jobs, but that's just by combining various features together and making something that is not interpretable most of the time. A machine doesn't know the true relationships between variables. That's the job of a data scientist to discover.

 

Automation options

AutoML isn't that new a concept. The idea and some implementations have been around for years, and are receiving positive feedback overall. Still, some fail to implement and fully utilize AutoML solutions in their organization due to a lack of understanding.

AutoML can't do everything – someone still has to gather the data, store it, and prepare it. This isn't a small task, and more often than not requires a significant amount of domain knowledge. Then and only then can automated solutions be utilized to their full potential.

This section explores a couple of options for implementing AutoML solutions. We'll compare one code-based tool written in Python, and one that is delivered as a browser application, meaning that no coding is required. We'll start with the code-based one first.

PyCaret

PyCaret has been widely used to make production-ready machine learning models with as little code as possible. It is a completely free solution capable of training, visualizing, and interpreting machine learning models with ease.

It has built-in support for regression and classification models and shows in an interactive way which models were used for the task, and which generated the best result. It's up to the data scientist to decide which model will be used for the task. Both training and optimization are as simple as a function call.

The library also provides an option to explain machine learning models with game-theoretic algorithms such as SHAP (SHapley Additive exPlanations), also with a single function call.

PyCaret still requires a bit of human interaction, though. Oftentimes, the initialization and training process of a model must be selected explicitly by the user, and that breaks the idea of a fully-automated solution.

Further, PyCaret can be slow to run and optimize for a larger dataset. Let's take a look at a code-free AutoML solution next.

ObviouslyAI

Not all of us know how to develop machine learning models, or even how to write code. That's where drag-and-drop solutions come into play. ObviouslyAI is certainly one of the best out there, and is also easy to use.

This service allows for in-browser model training and evaluation, and can even explain the reasoning behind decisions made by a model. It's a no-brainer for companies in which machine learning isn't the core business, as it's pretty easy to start with and doesn't cost nearly as much as an entire data science team.

A big gotcha with services like this one is the pricing. There's always a free plan included, but in this particular case it's limited to datasets with fewer than 50,000 rows. That's completely fine for occasional tests here and there, but is a deal-breaker for most production use cases.

The second deal-breaker is the actual automation. You can't easily automate mouse clicks and dataset loads. This service automates the machine learning process itself completely, but you still have to do some manual work.

TPOT

The acronym TPOT stands for Tree-based Pipeline Optimization Tool. It is a Python library designed to handle machine learning tasks in an automated fashion.

Here's a statement from the official documentation:

Consider TPOT your Data Science Assistant. TPOT is a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming (TPOT Documentation page, TPOT Team; November 5, 2019).

Genetic programming is a term that is further discussed in the later chapters. For now, just know that it is based on evolutionary algorithms – a family of search algorithms, inspired by natural selection, that can discover solutions to problems humans don't know how to solve directly.
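To give a feel for the evolutionary idea, here is a minimal sketch of a generic evolutionary algorithm on a toy problem (maximizing the number of ones in a list of bits). It is not TPOT's implementation; TPOT evolves machine learning pipelines rather than bit lists. The constants, function names, and the elitist selection scheme are all choices made for this illustration.

```python
import random

random.seed(42)  # fixed seed so the run is repeatable

GENOME_LENGTH = 20
POPULATION_SIZE = 10
GENERATIONS = 30

def fitness(genome):
    """Toy objective: the number of ones in the bit list."""
    return sum(genome)

def mutate(genome, rate=0.05):
    """Flip each bit independently with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in genome]

def crossover(parent_a, parent_b):
    """Single-point crossover: a prefix of one parent, the suffix of the other."""
    point = random.randint(1, GENOME_LENGTH - 1)
    return parent_a[:point] + parent_b[point:]

# Start from a random population of candidate solutions.
population = [[random.randint(0, 1) for _ in range(GENOME_LENGTH)]
              for _ in range(POPULATION_SIZE)]
initial_best = max(fitness(g) for g in population)

for _ in range(GENERATIONS):
    # Selection: keep the fitter half unchanged, then refill with offspring.
    population.sort(key=fitness, reverse=True)
    survivors = population[:POPULATION_SIZE // 2]
    children = [mutate(crossover(*random.sample(survivors, 2)))
                for _ in range(POPULATION_SIZE - len(survivors))]
    population = survivors + children

final_best = max(fitness(g) for g in population)
print(initial_best, final_best)
```

Because the fitter half always survives unchanged, the best fitness can never get worse from one generation to the next. The same loop of selection, crossover, and mutation is the general pattern behind evolutionary search.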

In a way, TPOT is your data science assistant. You can use it to automate everything boring in a data science project. The term "boring" is subjective, but throughout the book, we use it to refer to the tasks of manually selecting and tweaking machine learning models (read: spending days waiting for the model to tune).

TPOT can't automate the process of data gathering and cleaning, and the reason is obvious – a machine can't read your mind. It can, however, perform machine learning tasks on well-prepared datasets better than most data scientists.

The following chapters discuss the library in great detail.

 

Summary

You've learned a lot in this chapter – or had a brief recap, at least. You are now fresh on the concepts of machine learning, regression, classification, and automation. All of these are crucial for the following, more demanding chapters.

The chapters after the next one will dive deep into the code, so you will get a full grasp of the library. Everything from the most basic regression and classification automation, to parallel training, neural networks, and model deployment will be discussed.

In the next chapter, we'll dive deep into the TPOT library, its use cases, and its underlying architecture. We will discuss the core principle behind TPOT – genetic programming – and how it is used to solve regression and classification tasks. We will also fully configure the environment for the Windows, macOS, and Linux operating systems.

 

Q&A

  1. In your own words, define the term "machine learning."
  2. Explain supervised learning in a couple of sentences.
  3. What's the difference between regression and classification machine learning tasks?
  4. Name three areas where machine learning is used and provide concrete examples.
  5. How would you describe automation?
  6. Why do we need automation in this day and age?
  7. What's the difference between the terms "automation with machine learning" and "machine learning automation"?
  8. Are the terms "machine learning" and "automation" synonyms? Explain your answer.
  9. Explain the problem of too many parameters in manual machine learning.
  10. Define and briefly explain AutoML.
 


About the Author
  • Dario Radečić

    Dario Radečić is a full-time data scientist at Neos, in Croatia, a part-time data storyteller at Appsilon, in Poland, and a business owner. Dario has a master's degree in data science and years of experience in data science and machine learning, with an emphasis on automated machine learning. He is also a top writer in artificial intelligence on Medium and the owner of a data science blog called Better Data Science.
