1. Building an end-to-end machine learning pipeline in Azure
This first chapter covers all the required components for running a custom end-to-end machine learning (ML) pipeline in Azure. Some sections might be a recap of your existing knowledge, with useful practical tips, step-by-step guidelines, and pointers to Azure services for performing ML at scale. Think of this chapter as an overview of the book: we will dive into each section in great detail, with many practical examples and plenty of code, throughout the remaining chapters.
First, we will look at data experimentation techniques as a step-by-step process for gathering common insights, such as missing values, data distributions, feature importance, and two-dimensional embedding techniques, to estimate the expected model performance of a classification task. In the second section, we will use these insights about the data to perform data preprocessing and feature engineering, such as normalization, the encoding...
Performing descriptive data exploration
Descriptive data exploration is, without a doubt, one of the most important steps in an ML project. If you want to clean data and build derived features or select an ML algorithm to predict a target variable in your dataset, then you need to understand your data first. Your data will define many of the necessary cleaning and preprocessing steps; it will define which algorithms you can choose and it will ultimately define the performance of your predictive model.
Hence, data exploration should be considered an important analytical step toward understanding whether your data is informative enough to build an ML model in the first place. By analytical step, we mean that the exploration should be performed as a structured analytical process rather than a set of experimental tasks. Therefore, we will go through a checklist of data exploration tasks that you can perform as an initial step in every ML project, before starting any data cleaning, preprocessing...
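As a minimal sketch of the first items on such a checklist, the steps can be run in a few lines of pandas. The dataset and column names below are purely illustrative; in practice, you would load your own data:

```python
import numpy as np
import pandas as pd

# Hypothetical toy dataset standing in for your own data.
df = pd.DataFrame({
    "age": [25, 32, np.nan, 41, 29],
    "income": [40000, 52000, 61000, np.nan, 48000],
    "churned": [0, 1, 0, 1, 0],
})

# 1. Missing values per column
missing = df.isna().sum()

# 2. Basic distribution statistics (count, mean, std, quartiles)
stats = df.describe()

# 3. Linear relationship between each feature and the target
corr_with_target = df.corr()["churned"].drop("churned")
```

Even this small checklist already tells you which columns need imputation and which features are likely to carry signal for the target.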
Exploring common techniques for data preparation
After the data experimentation phase, you should have gathered enough knowledge to start preprocessing the data. This process is also often referred to as feature engineering. When coming from multiple sources, such as applications, databases, or warehouses, as well as external sources, your data cannot be analyzed or interpreted immediately.
It is, therefore, critically important to preprocess data before you choose a model to interpret your problem. In addition, the steps involved in data preparation depend on the data available to you, the problem you want to solve, and, with that, the ML algorithms that could be used for it.
You might ask yourself why data preparation is so important. The answer is that, when done properly, data preparation can lead to significant improvements in model accuracy. This could be due to the relationships within your data that have been simplified...
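Two of the most common preparation steps mentioned above, normalization and categorical encoding, can be sketched with scikit-learn's `ColumnTransformer`. The columns and values are hypothetical:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative raw data with one numerical and one categorical column.
df = pd.DataFrame({
    "amount": [10.0, 200.0, 35.0],
    "country": ["DE", "US", "DE"],
})

# Scale the numerical column to zero mean / unit variance and
# one-hot encode the categorical column in a single transformer.
preprocess = ColumnTransformer([
    ("scale", StandardScaler(), ["amount"]),
    ("onehot", OneHotEncoder(), ["country"]),
])
X = preprocess.fit_transform(df)  # 1 scaled column + 2 one-hot columns
```

Bundling all preparation steps into one transformer has the advantage that exactly the same transformation can later be applied to unseen data at scoring time.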
Choosing the right ML model to train data
Similar to data experimentation and preprocessing, training an ML model is an analytical, step-by-step process. Each step involves a thought process that evaluates the pros and cons of each algorithm according to the results of the experimentation phase. As in every other scientific process, it is recommended that you come up with a hypothesis first and then verify whether it holds.
Let's look at the steps that define the process of training an ML model:
- Define your ML task: First, we need to define the ML task we are facing, which most of the time is determined by the business decision behind your use case. Depending on the amount of labeled data, you can choose between unsupervised, semi-supervised, and supervised learning, as well as many other subcategories.
- Pick a suitable model to perform this task: Pick a suitable model for the chosen ML task. This includes logistic regression, a gradient-boosted...
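This selection step can be sketched with scikit-learn by training a small set of candidate models against the same held-out split. The synthetic dataset and the two candidates below are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic classification data standing in for your real dataset.
X, y = make_classification(n_samples=500, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Candidate models for the chosen ML task.
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each candidate and compare accuracy on the same held-out split.
scores = {
    name: model.fit(X_train, y_train).score(X_test, y_test)
    for name, model in candidates.items()
}
```

Comparing all candidates on an identical split keeps the evaluation fair and makes it easy to verify the hypothesis you formulated beforehand.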
Optimization techniques
If we have trained a simple ensemble model that performs reasonably better than the baseline model and achieves acceptable performance according to the expected performance estimated during data preparation, we can progress with optimization. This is a point we really want to emphasize. It's strongly discouraged to begin model optimization and stacking when a simple ensemble technique fails to deliver useful results. If this is the case, it would be much better to take a step back and dive deeper into data analysis and feature engineering.
Common ML optimization techniques, such as hyperparameter optimization, model stacking, and even automated machine learning, help you squeeze the last 10% of performance out of your model, while the first 90% is achieved by a single well-chosen ensemble model. If you decide to use any of these optimization techniques, it is advisable to run them fully automated and in parallel on a distributed cluster.
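The idea behind automated, parallel hyperparameter optimization can be sketched on a single machine with scikit-learn's `RandomizedSearchCV`; on Azure, the same pattern scales out to a cluster. The parameter ranges below are arbitrary choices for illustration:

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic data standing in for your prepared dataset.
X, y = make_classification(n_samples=300, random_state=0)

# Randomly sample 10 hyperparameter configurations and evaluate
# each with 3-fold cross-validation, using all available cores.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": randint(50, 200),
        "max_depth": randint(2, 10),
    },
    n_iter=10,
    cv=3,
    n_jobs=-1,  # evaluate candidate configurations in parallel
    random_state=0,
)
search.fit(X, y)
```

After fitting, `search.best_params_` and `search.best_score_` hold the winning configuration and its cross-validated score; a distributed run simply evaluates the same candidates on many nodes instead of many cores.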
After seeing too...
Deploying and operating models
Once you have trained and optimized an ML model, it is ready for deployment. Many data science teams, in practice, stop here and move the model to production as a Docker image, often embedded in a REST API using Flask or similar frameworks. However, as you can imagine, this is not always the best solution depending on your use case requirements. An ML or data engineer's responsibility doesn't stop here.
The deployment and operation of an ML pipeline are best understood as testing the model on live data in production, with the goal of collecting insights and new data to continuously improve the model. Hence, tracking model performance over time is an essential step toward guaranteeing and improving that performance.
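Collecting performance over time can be as simple as appending timestamped metrics to a log store. A minimal sketch follows; the function and log names are illustrative, and in Azure you would typically persist such records to a monitoring service rather than an in-memory list:

```python
from datetime import datetime, timezone

# Illustrative in-memory log; a real pipeline would persist these records.
performance_log = []

def log_performance(y_true, y_pred):
    """Compute accuracy on a batch of live predictions and record it with a timestamp."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    accuracy = correct / len(y_true)
    performance_log.append({
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "accuracy": accuracy,
        "n_samples": len(y_true),
    })
    return accuracy
```

Plotting these records over time reveals performance degradation (for example, due to data drift) and tells you when the model needs to be retrained.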
In general, we differentiate two architectures for ML-scoring pipelines, which we will briefly discuss in this section:
- Batch scoring using pipelines
- Real-time scoring using a container-based web service
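As a minimal sketch of the batch variant (the model, column names, and training data are hypothetical), scoring amounts to loading a trained model and applying it to a whole set of records at once; a real-time service would instead wrap the same predict call in a container-based web endpoint:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Hypothetical trained model; in practice you would load it from a model registry.
model = LogisticRegression().fit([[0.0], [1.0], [2.0], [3.0]], [0, 0, 1, 1])

def batch_score(input_df: pd.DataFrame) -> pd.DataFrame:
    """Score a whole batch of records at once and attach the predictions."""
    out = input_df.copy()
    out["prediction"] = model.predict(input_df[["feature"]].to_numpy())
    return out

# Score a small batch of unseen records.
scored = batch_score(pd.DataFrame({"feature": [0.1, 2.9]}))
```

Batch scoring amortizes model-loading cost over many records and suits scheduled pipelines, while real-time scoring trades throughput for per-request latency.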
Summary
In this chapter, we saw an overview of all the steps involved in building a custom ML pipeline. You might have recognized familiar concepts for data preprocessing or analytics, and learned an important lesson: data experimentation is a structured, step-by-step process rather than ad hoc experimentation. Look for missing values, data distributions, and relationships between features and targets. This analysis will greatly help you understand which preprocessing steps to perform and what model performance to expect.
You now know that data preprocessing, or feature engineering, is the most important part of the whole ML process. The more prior knowledge you have about the data, the better you can encode categorical and temporal variables or transform text into numerical space using NLP techniques. You learned that choosing the proper ML task, model, error metric, and train-test split is mostly defined by business decisions (for example, object detection versus segmentation) or a performance...