Azure Data Scientist Associate Certification Guide

By Andreas Botsikas, Michael Hlobil

Early Access

This is an Early Access product. Early Access chapters haven’t received a final polish from our editors yet. Every effort has been made in the preparation of these chapters to ensure the accuracy of the information presented. However, the content in this book will evolve and be updated during the development process.


About this book

The Azure Data Scientist Associate Certification Guide helps you acquire practical knowledge for machine learning experimentation on Azure. It covers everything you need to pass the DP-100 exam and become a certified Azure Data Scientist Associate.

Starting with an introduction to data science, you'll learn the terminology that will be used throughout the book and then move on to the Azure Machine Learning (Azure ML) workspace. You'll discover the studio interface and manage various components, such as data stores and compute clusters.

Next, the book focuses on no-code and low-code experimentation, and shows you how to use the Automated ML wizard to locate and deploy optimal models for your dataset. You'll also learn how to run end-to-end data science experiments using the designer provided in Azure ML Studio.

You'll then explore the Azure ML Software Development Kit (SDK) for Python and advance to creating experiments and publishing models using code. The book also guides you in optimizing your model's hyperparameters using Hyperdrive before demonstrating how to use responsible AI tools to interpret and debug your models. Once you have a trained model, you'll learn to operationalize it for batch or real-time inferences and monitor it in production.

By the end of this Azure certification study guide, you'll have gained the knowledge and the practical skills required to pass the DP-100 exam.

Publication date:
December 2021
Publisher
Packt
Pages
448
ISBN
9781800565005

 

1 An Overview of Modern Data Science

Data science has its roots in the early eighteenth century and has gained tremendous popularity during the last couple of decades.

In this book, you will learn how to run a data science project within Azure, the Microsoft public cloud infrastructure. You will gain all the skills needed to become a certified Azure Data Scientist Associate. You will start with this chapter, which introduces some foundational terminology used throughout the book. Then, you will take a deep dive into Azure Machine Learning (AzureML) services, starting by provisioning a workspace. You will then work with the no-code and low-code experiences built into the AzureML Studio web interface. Finally, you will dive into code-first data science experimentation, working with the AzureML Software Development Kit (SDK).

In this chapter, you will learn some fundamental data science-related terms needed for the DP-100 exam. You will start by understanding the typical life cycle of a data science project. You will then read about big data and how Apache Spark technology enables you to train machine learning models against it. Then, you will explore what the DevOps mindset is and how it can help you become a member of a highly efficient, multi-disciplinary, agile team that builds machine learning-enhanced products.

In this chapter, we are going to cover the following main topics:

  • The evolution of data science
  • Working on a data science project
  • Using Spark in data science
  • Adopting the DevOps mindset
 

The evolution of data science

If you try to find the roots of data science practices, you will probably end up discovering evidence at the beginning of civilization. In the eighteenth century, governments were gathering demographic and financial data for taxation purposes, a practice called statistics. As years progressed, the use of this term was expanded to include the summarization and analysis of the data collected. In 1805, Adrien-Marie Legendre, a French mathematician, published a paper describing the method of least squares for fitting linear equations, although most people credit Carl Friedrich Gauss for the complete description he published a couple of years later. In 1900, Karl Pearson published in the Philosophical Magazine his observations on the chi-square statistic, a cornerstone in data science for hypothesis testing. In 1962, John Tukey, the scientist famous for the fast Fourier transform and the box plot, published a paper expressing his passion for data analysis and how statistics needed to evolve into a new science.

In parallel, with the rise of informatics in the middle of the twentieth century, the field of Artificial Intelligence (AI) was introduced in 1955 by John McCarthy as the official term for thinking machines. AI is a field of computer science that develops systems that can imitate intelligent human behavior. Using programming languages such as Information Processing Language (IPL) and LISt Processor (LISP), developers were writing programs that could manipulate lists and various other data structures to solve complex problems. In 1955, Arthur Samuel's checkers player was the first piece of software that would learn from the games it had already played, by storing in a cache the board states and the chance of winning when ending up in each state. This checkers program may have been the first example of machine learning, a subfield of AI that utilizes historical data and the patterns encoded in the data to train models and enable systems to mimic human tasks without explicitly coding the entire logic. In fact, you can think of machine learning models as software code that is generated by training an algorithm against a dataset to recognize certain types of patterns.

In 2001, William S. Cleveland published the first article in which the term data science was used in the way we refer to it today, a science at the intersection of statistics, data analysis, and informatics that tries to explain phenomena based on data.

Although most people correlate data science with machine learning, data science has a much broader scope, which includes the analysis and preparation of data before the actual machine learning model training process, as you will see in the next section.

 

Working on a data science project

A data science project aims to infuse an application with intelligence extracted from data. In this section, you will discover the common tasks and key considerations needed within such a project. There are quite a few well-established life cycle processes, such as Team Data Science Process (TDSP) and Cross-Industry Standard Process for Data Mining (CRISP-DM), that describe the iterative stages executed in a typical project. The most common stages are shown in Figure 1.1:

Figure 1.1 – The iterative stages of a data science project

Although the diagram shows some indicative flows between the phases, you are free to jump from one phase to any other if needed. Moreover, this approach is iterative, and the data science team should go through multiple iterations, improving its business understanding and the resulting model until the success criteria are met. You will read more about the benefits of an iterative process in this chapter’s Adopting the DevOps mindset section. The data science process starts from the business understanding phase, something you will read more about in the next section.

Understanding of the business problem

The first stage in a data science project is that of business understanding. In this stage, the data science team collaborates with the business stakeholders to define a short, straightforward question that machine learning will try to answer.

Figure 1.2 shows the five most frequent questions that machine learning can answer:

Figure 1.2 – Five questions machine learning can answer

Behind each of those questions, there is a group of modeling techniques you will use.

  • Regression models allow you to predict a numeric value based on one or more features. For example, in Chapter 8, Experimenting with Python Code, you will be trying to predict a numeric value based on 10 measurements that were taken one year before the value you are trying to predict. Training a regression model is a supervised machine learning task, meaning that you need to provide enough sample data to train the model to predict the desired numeric value.
  • Classification models allow you to predict a class label for a given set of inputs. This label can be as simple as a yes/no label or a blue, green, or red color. For example, in Chapter 5, Letting the Machines Do the Model Training, you will be training a classification model to detect whether a customer is going to cancel their phone subscription or not. Predicting whether a person is going to stop doing something is referred to as churn or attrition detection. Training a classification model is a supervised machine learning task and requires a labeled dataset to train the model. A labeled dataset contains both the inputs and the label that you want the model to predict.
  • Clustering is an unsupervised machine learning task that groups data. In contrast to the previous two model types, clustering doesn’t require any labeled training data. It operates on the given dataset and creates the desired number of clusters, assigning each data point to the cluster to which it belongs. A common use case of clustering models is when you try to identify distinct consumer groups in your customer base that you will be targeting with specific marketing campaigns.
  • Recommender systems are designed to recommend the best options based on user profiles. Search engines, e-shops, and popular video streaming platforms utilize this type of model to produce personalized recommendations on what to do next.
  • Anomaly detection models can detect outliers from a dataset or within a data stream. Outliers are items that don’t belong with the rest of the elements, indicating anomalies. For example, if a vibration sensor of a machine starts sending abnormal measurements, it may be a good indication that the device is about to fail.
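To make the vibration sensor example from the last bullet more concrete, the following is a minimal sketch of outlier detection. It is not how any particular AzureML component works; it swaps in the modified Z-score, a simple statistical technique based on the median and the median absolute deviation. The readings, the function name, and the 3.5 threshold are assumptions chosen for illustration.

```python
from statistics import median


def find_outliers(readings, threshold=3.5):
    """Return the readings whose modified Z-score exceeds the threshold."""
    center = median(readings)
    # Median absolute deviation: a spread estimate that outliers barely affect.
    mad = median(abs(value - center) for value in readings)
    if mad == 0:
        return []  # all readings are (almost) identical; nothing to flag
    return [
        value
        for value in readings
        if abs(0.6745 * (value - center) / mad) > threshold
    ]


# Hypothetical vibration amplitudes; the last one is abnormal.
vibration = [0.51, 0.49, 0.50, 0.52, 0.48, 0.50, 1.90]
outliers = find_outliers(vibration)
print(outliers)
```

In a real project, you would typically train a dedicated anomaly detection model instead of applying a fixed statistical rule, but the idea of flagging items that do not belong with the rest is the same.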

During the business understanding phase, you will try to understand the problem statement and define the success criteria. Setting up proper expectations of what machine learning can and cannot do is key to ensure alignment between teams.

Throughout a data science project, it is common to have multiple rounds of business understanding. The data science team acquires a lot of insights after exploring datasets or training a model. It is helpful to bring those insights to the business stakeholders and either verify your team’s hypotheses or gain even more insights into the problem you are tackling. For example, business stakeholders may be able to explain the rationale behind a pattern that you detected in the data but could not account for.

Once you get a good grasp of what you are trying to solve, you need to get some data, explore it, and even label it, something you will read about in the next section.

Acquiring and exploring the data

After understanding the problem you are trying to solve, it’s time to gather the data to support the machine learning process. Data can have many forms and formats. It can be either well-structured tabular data stored in database systems or even files, such as images, stored in file shares. Initially, you will not know which data to collect, but you must start from somewhere. A typical anecdote while looking for data is the belief that there is always an Excel file that will contain critical information, and you must keep asking for it until you find it.

Once you have located the data, you will have to analyze it to understand whether the dataset is complete or not. Data is often stored within on-premises systems or Online Transaction Processing (OLTP) databases that you cannot easily access. Even if data is accessible, it is not advised to explore it directly within the source system, as you may accidentally impact the performance of the underlying engine that hosts the data. For example, a complex query on top of a sales table may affect the performance of the e-shop solution. In these cases, it is common to export the required datasets in a file format, such as the highly interoperable Comma-Separated Values (CSV) format or the Parquet format, which is much better optimized for analytical processing. These files are then uploaded to inexpensive cloud storage and become available for further analysis.

Within Microsoft Azure, the most common target is either a Blob container within a storage account or a folder in the filesystem of Azure Data Lake Storage Gen2, which offers a far more granular access control mechanism. Copying the data can be done in a one-off manner by using tools such as AzCopy or Storage Explorer. If you would like to configure a repeatable process that incrementally pulls new data on a schedule, you can use more advanced tools such as the pipelines of Azure Data Factory or Azure Synapse Analytics. In Chapter 4, Configuring the Workspace, you will review the components needed to pull data from on-premises systems and the available datastores to which you can connect from within the AzureML workspace to access the various datasets. In the Working with datasets section of Chapter 4, Configuring the Workspace, you will read about the dataset types supported by AzureML and how you can explore them to gain insights into the information stored within them.

A common task when gathering data is the data cleansing step. During this step, you remove duplicate records, impute missing values, or fix common data entry issues. For example, you could harmonize a country text field by replacing UK records with United Kingdom. Within AzureML, you can perform such cleansing operations either in the designer that you will see in Chapter 6, Visual Model Training and Publishing, or through the notebooks experience you will be working with from Chapter 7, The AzureML Python SDK, onward. Although you may start doing those cleansing operations with AzureML, as the project matures, these cleansing activities tend to move into the pipelines of Azure Data Factory or Azure Synapse Analytics, which pull the data out of the source systems.
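The three cleansing operations mentioned above can be sketched with pandas as follows. This is only an illustration: the column names, the sample records, and the choice of the median as the imputation value are assumptions, not something prescribed by AzureML.

```python
import pandas as pd

# Hypothetical raw extract; column names and values are made up for illustration.
raw = pd.DataFrame(
    {
        "customer_id": [1, 2, 2, 3, 4],
        "country": ["UK", "United Kingdom", "United Kingdom", "Greece", "UK"],
        "monthly_spend": [20.0, 35.0, 35.0, None, 50.0],
    }
)

# 1. Remove exact duplicate records (the repeated row for customer 2).
cleansed = raw.drop_duplicates().reset_index(drop=True)

# 2. Harmonize the country text field.
cleansed["country"] = cleansed["country"].replace({"UK": "United Kingdom"})

# 3. Impute missing numeric values, here with the column median (an assumption).
median_spend = cleansed["monthly_spend"].median()
cleansed["monthly_spend"] = cleansed["monthly_spend"].fillna(median_spend)

print(cleansed)
```

Whether the median, the mean, or a domain-specific default is the right imputation value depends on the dataset, so treat that line as a placeholder for a decision you make with the business stakeholders.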

Important note

While doing data cleansing, be aware of yak shaving. The term yak shaving was coined in the 90s to describe the situation where, while working on a task, you realize that you must do another task, which leads to another one, and so on. This chain of tasks may take you away from your original goal. For example, in the country text field example, you may realize that some records have an invalid encoding, although you can still understand the referenced country. You decide to change the export encoding of the CSV file, and you realize that the export tool you were using is old and doesn’t support UTF-8. That leads you on a quest to find a system administrator to get your software updated. Instead of going down that route, make a note of what needs to be done and add it to your backlog. You can fix this issue in the next iteration, when you will have a better understanding of whether you actually need this field or not.

Another common task is labeling the dataset, especially if you will be dealing with supervised machine learning models. For example, if you are curating a dataset to predict whether a customer will churn or not, you will have to flag the records of the customers that canceled their subscriptions. A more complex labeling case is when you create a sentiment analysis model for social media messages. In that case, you will need to get a feed of messages, go through them, and assign a label on whether it is a positive or negative sentiment.

Within AzureML Studio, you can create labeling projects that allow you to scale the labeling efforts of datasets. AzureML allows you to define either a text labeling or an image labeling task. You then bring in team members to label the data based on the given instructions. Once the team has started labeling the data, AzureML automatically trains a model relative to your defined task. When the model is good enough, it starts providing suggestions to the labelers to improve their productivity. Figure 1.3 shows the labeling project creation wizard and the various options available currently in the image labeling task:

Figure 1.3 – Creating an AzureML labeling project

Through this project phase, you should have discovered the related source systems and produced a cleansed dataset ready for the machine learning training. In the next section, you will learn how to create additional data features that will assist the model training process, a process known as feature engineering.

Feature engineering

During the feature engineering phase, you will be generating new data features that will better represent the problem you are trying to solve and help machines learn from the dataset. For example, the following code block creates a new feature named product_id by transforming the product column of the sales dataset:

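# Assumes dataset is a pandas DataFrame that contains a product text column
# Map each product name to a numeric identifier
product_map = {"orange juice": 1, "lemonade juice": 2}
dataset["product_id"] = dataset["product"].map(product_map)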

This code block uses the pandas map method to convert text into numerical values. The product column is referred to as a categorical variable, as all records fall within a finite number of categories, in this case, orange juice or lemonade juice. If you had a 1-to-5 rating feature in the same dataset, that would have been a discrete numeric variable with a finite number of values that it can take, in this case, only 1, 2, 3, 4, or 5. If you had a column that kept how many liters or gallons the customer bought, that would have been a continuous numeric variable that could take any numeric value greater than or equal to zero, such as half a liter. Besides numeric values, date fields are also considered continuous variables.

Important note

Although the product_id feature is a discrete numeric variable in the preceding example, features such as that are commonly treated as a categorical variable, as you will see in Chapter 5, Letting the Machines Do the Model Training.

There are many featurization techniques available. An indicative list is as follows:

  • Scaling of numeric features: This technique converts all numeric features into ranges that can be easily compared. For example, in Chapter 8, Experimenting with Python Code, you will be training a machine learning model on top of medical measurements. Blood glucose measurements range from 80 to 200 mg/dL, while blood pressure measurements range from 60 to 128 mm Hg. These numeric values are rescaled by subtracting their mean value and dividing by their standard deviation, a transformation referred to as standardization or Z-score normalization. The resulting features have a mean of 0 and a standard deviation of 1, so most values end up within a few units of 0, which puts the features on a comparable scale and helps many algorithms learn from them.
  • Split: Splitting a column into two new features is something very common. For example, the full name will be split into name and surname for further analysis.
  • Binning: This technique groups continuous features into distinct groups or bins that may expose important information regarding the problem you are trying to solve. For example, if you have the year of birth, you can create bins to group the different generations. In this case, folks with a year of birth between 1965 and 1980 would have been the generation X group, and people in the 1981 to 1996 range would have formed the millennial bin. It is common to use the clustering models that you saw in the Understanding of the business problem section to produce cohorts and define those bins.
  • One-hot encoding of categorical features: Some machine learning algorithms cannot handle categorical data and require all inputs to be numeric. In the example with product, you performed a label encoding: you converted the categorical variable into a numeric one. A typical example of label encoding is t-shirt sizes, where you convert small to 1, medium to 2, and large to 3. In the product example, though, you accidentally defined an order between orange juice (1) and lemonade juice (2), which may confuse a machine learning algorithm. In this case, instead of the ordinal encoding that produced the product_id feature, you could have utilized one-hot encoding. You would introduce two binary features called orange_juice and lemonade_juice. These features would accept either 0 or 1 values, depending on which juice the customer bought.
  • Generate lag features: If you deal with time-series data, you may need to produce features from values at preceding points in time. For example, if you are trying to forecast the temperature 10 minutes from now, you may need to have the current temperature and the temperature 30 minutes ago and 1 hour ago. These two additional past temperatures are lag features that you will have to engineer.
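Several of the techniques in this list can be sketched in a few lines of pandas. The following is an illustration under stated assumptions: the sample values, the column names, and the bin edges are made up, and the lag feature assumes that the rows are ordered in time and evenly spaced.

```python
import pandas as pd

# Hypothetical sample; the values are made up for illustration.
df = pd.DataFrame(
    {
        "glucose": [80.0, 120.0, 160.0, 200.0],
        "year_of_birth": [1970, 1985, 1979, 1990],
        "product": ["orange juice", "lemonade juice", "orange juice", "orange juice"],
        "temperature": [20.0, 21.0, 23.0, 22.0],
    }
)

# Standardization (Z-score): subtract the mean, divide by the standard deviation.
df["glucose_z"] = (df["glucose"] - df["glucose"].mean()) / df["glucose"].std()

# Binning: group years of birth into generations (intervals include the right edge).
df["generation"] = pd.cut(
    df["year_of_birth"],
    bins=[1964, 1980, 1996],
    labels=["generation X", "millennial"],
)

# One-hot encoding: one binary column per product category.
one_hot = pd.get_dummies(df["product"], dtype=int)
one_hot.columns = [name.replace(" ", "_") for name in one_hot.columns]
df = df.join(one_hot)

# Lag feature: the temperature observed one row (one time step) earlier.
df["temperature_lag_1"] = df["temperature"].shift(1)

print(df)
```

Note that the first row has no preceding observation, so its lag feature is missing; you would normally drop or impute such rows before training.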

Important note

Making all those transformations in big datasets may require a tremendous amount of memory and processing time. This is where technologies like Spark come into play to parallelize the process. You will learn more about Spark in the Using Spark in data science section of this chapter.

In Chapter 10, Understanding Model Results, you will use the MinMaxScaler class from the sklearn library to scale numeric features.

As a last step in the feature engineering stage, you normally remove unnecessary or highly correlated features, a process called feature selection. You will be dropping columns that will not be used to train the machine learning model. By dropping those columns, you reduce the memory requirements of the machines that will be doing the training, you reduce the computation time needed to train the model, and the resulting model will be much smaller in size.

While creating those features, it is logical that you may need to go back to the Acquiring and exploring the data phase or even to the Understanding of the business problem stage to get more data and insights. At some point, though, your training dataset will be ready to train the model, something you will read about in the next section.

Training the model

As soon as you have prepared the dataset, the machine learning training process can begin. If the model requires supervised learning and you have enough data, you split it into a training dataset and a validation dataset in a 70% to 30% or 80% to 20% ratio. You select the model type you want to train, specify the model’s training parameters (called hyperparameters), and train the model. With the remaining validation dataset, you evaluate the trained model’s performance according to a metric, and you decide whether the model is good enough to move to the next stage or whether you should return to the Understanding of the business problem stage. The training process of a supervised model is depicted in Figure 1.4:

Figure 1.4 – Training a supervised machine learning model

There are a couple of variations to the preceding statement:

  • If the model is in the unsupervised learning category, such as the clustering algorithms, you just pass all the data to train the model. You then evaluate whether the detected clusters address the business need or not, modify the hyperparameters, and try again.
  • If you have a model that requires supervised learning but don’t have enough data, the k-fold cross-validation technique is commonly used. With k-fold, you specify the number of folds into which you want to split the dataset. AutoML in AzureML performs either 10 folds if the dataset has fewer than 1,000 rows or 3 folds if the dataset has between 1,000 and 20,000 rows. Once you have those folds, you start an iterative process where you do the following:
    1. Keep one fold away for validation and train a new model with the rest of the folds.
    2. Evaluate the produced model against the fold that you kept out.
    3. Record the model score and discard the model.
    4. Repeat from step 1, keeping another fold away for validation, until all folds have been used for validation.
    5. Aggregate the recorded scores to produce the model’s overall performance.
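The k-fold loop described above can be sketched in plain Python. To keep the example self-contained, the "model" here is a toy one that always predicts the mean of its training values, and the score is the mean absolute error; both are assumptions for illustration, as are the function names and the sample data.

```python
import random
from statistics import mean


def k_fold_indices(n_rows, k, seed=0):
    """Shuffle the row indices and split them into k roughly equal folds."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(values, k):
    """Return the aggregated score and the per-fold scores of the toy model."""
    scores = []
    for held_out in k_fold_indices(len(values), k):
        # Keep one fold away for validation and train on the rest.
        held_out_set = set(held_out)
        train = [v for i, v in enumerate(values) if i not in held_out_set]
        model = mean(train)  # "training" the toy model

        # Evaluate against the held-out fold, record the score, discard the model.
        errors = [abs(values[i] - model) for i in held_out]
        scores.append(mean(errors))

    # Aggregate the recorded scores into a single performance estimate.
    return mean(scores), scores


targets = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 5.5, 4.5, 6.5, 7.5]
average_mae, fold_scores = cross_validate(targets, k=5)
print(f"MAE per fold: {fold_scores}")
print(f"Aggregated MAE: {average_mae:.3f}")
```

Every row is used for validation exactly once, which is what makes the aggregated score a reasonable estimate even when the dataset is small.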

Important note

In the machine learning research literature, there is an approach called semi-supervised learning. In that approach, a small amount of labeled data is combined with a large amount of unlabeled data to train the model.

Instead of training a single model, evaluating the results, and trying again with a different set of hyperparameters, you can automate the process and evaluate multiple models in parallel. This process is called hyperparameter tuning, something you will dive deep into in Chapter 9, Optimizing the ML Model. In the same chapter, you will learn how you can even automate the model selection, an AzureML capability referred to as AutoML.

Metrics help you select the model that minimizes the difference between the predicted value and the actual one. They differ depending on the model type you are training. In regression models, metrics measure the error between the predicted value and the actual one. The most common ones are Mean Absolute Error (MAE), Root Mean Squared Error (RMSE), Relative Squared Error (RSE), Relative Absolute Error (RAE), the coefficient of determination (R²), and Normalized Root Mean Squared Error (NRMSE), which you are going to see in Chapter 8, Experimenting with Python Code.
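As a rough illustration of how some of these regression metrics are computed, the following sketch implements MAE, RMSE, RAE, RSE, and the coefficient of determination (R²) from their textbook definitions. The function name and the sample values are assumptions; NRMSE is left out because its normalization convention (by range, by mean, and so on) varies between tools.

```python
from math import sqrt


def regression_metrics(actual, predicted):
    """Compute common regression metrics from paired actual/predicted values."""
    n = len(actual)
    errors = [a - p for a, p in zip(actual, predicted)]
    mean_actual = sum(actual) / n

    sum_abs_error = sum(abs(e) for e in errors)
    sum_sq_error = sum(e * e for e in errors)
    # Errors of a naive baseline that always predicts the mean of the actuals.
    baseline_abs = sum(abs(a - mean_actual) for a in actual)
    baseline_sq = sum((a - mean_actual) ** 2 for a in actual)

    return {
        "MAE": sum_abs_error / n,
        "RMSE": sqrt(sum_sq_error / n),
        "RAE": sum_abs_error / baseline_abs,
        "RSE": sum_sq_error / baseline_sq,
        "R2": 1 - sum_sq_error / baseline_sq,
    }


actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.5, 5.0, 8.0, 9.5]
metrics = regression_metrics(actual, predicted)
print(metrics)
```

For MAE, RMSE, RAE, and RSE, lower values are better, while for R² a value closer to 1 indicates a better fit.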

In a classification model, metrics are slightly different, as they have to evaluate both how many results it got right and how many it misclassified. For example, in the churn binary classification problem, there are four possible results:

  • The model predicted that the customer would churn, and the customer churned. This is considered a True Positive (TP).
  • The model predicted that the customer would churn, but the customer remained loyal. This is considered a False Positive (FP), since the model was wrong about the customer leaving.
  • The model predicted that the customer would not churn, and the customer churned. This is considered a False Negative (FN), since the model was wrong about the customer being loyal.
  • The model predicted that the customer would not churn, and the customer remained loyal. This is considered a True Negative (TN).

These four states make up the confusion matrix that is shown in Figure 1.5:

Figure 1.5 – The classification model’s evaluation

Through that confusion matrix, you can calculate other metrics, such as accuracy, which is the proportion of correct results in the evaluation test (in this case, 1132 TP + 2708 TN = 3840 records out of 2708 + 651 + 2229 + 1132 = 6720 total records). On the other hand, precision, or Positive Predictive Value (PPV), evaluates how many positive predictions are actually positive (in this case, 1132 TP out of 1132 + 2229 total positive predictions). Recall, also known as sensitivity, measures how many actual positives were correctly classified (in this case, 1132 TP out of 1132 + 651 total actual positives). Depending on the business problem you are trying to solve, you will have to find the balance between the various metrics, as one metric may be more helpful than others. For example, during the COVID-19 pandemic, a model that determines whether someone is infected and has a recall equal to one would identify all infected patients. However, it may have misclassified some of the non-infected ones as infected, something that other metrics, such as precision, would have revealed.
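The arithmetic behind these three metrics can be reproduced with a few lines of Python, using the counts quoted in the preceding paragraph (1132 TP, 2229 FP, 651 FN, and 2708 TN). The function name is an assumption made for this sketch.

```python
def classification_metrics(tp, fp, fn, tn):
    """Derive accuracy, precision, and recall from confusion matrix counts."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,  # correct results out of all records
        "precision": tp / (tp + fp),    # correct positives out of predicted positives
        "recall": tp / (tp + fn),       # correct positives out of actual positives
    }


churn_metrics = classification_metrics(tp=1132, fp=2229, fn=651, tn=2708)
for name, value in churn_metrics.items():
    print(f"{name}: {value:.3f}")
# accuracy: 0.571
# precision: 0.337
# recall: 0.635
```

In this example, the model finds roughly two out of three churning customers (recall), but only about one in three of its churn predictions is correct (precision), which shows why a single metric rarely tells the whole story.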

Important note

Be aware when your model fits your data too well. This is something that we refer to as overfitting, and it may indicate that the model has identified a certain pattern within your training dataset that may not exist in real life. Such models tend to perform poorly when put into production and make inferences on top of unknown data. A common reason for overfitting is a biased training dataset that exposes only a subset of real-world examples. Another reason is target leakage, which means that somehow the value you are trying to predict is passed as an input to the model, perhaps through a feature engineered using the target column. See the Further reading section for guidance on how to handle overfitting and imbalanced data.

As you have seen so far, there are many things to consider while training a machine learning model, and throughout this book, you will get some hands-on experience in training models. In most cases, the first thing you will have to select is the type of computer that is going to run the training process. Currently, you have two options, Central Processing Unit (CPU) or Graphics Processing Unit (GPU) compute targets. Both targets have at least a CPU in them, as this is the core element of any modern computer. The difference is that the GPU compute targets also offer some very powerful graphics cards that can perform massively parallel data processing, making training much faster. To take advantage of the GPU, the model you are training needs to support GPU-based training. GPUs are usually used in neural network training with frameworks such as TensorFlow, PyTorch, and Keras.

Once you have trained a machine learning model that satisfies the success criteria defined during the Understanding of the business problem stage of the data science project, it is time to operationalize it and start making inferences with it. That’s what you will read about in the next section.

Deploying the model

When it comes to model operationalization, you have two main approaches:

  • Real-time inferences: The model is always loaded, waiting to make inferences on top of incoming data. Typical use cases are web and mobile applications that invoke a model to predict based on user input.
  • Batch inferences: The model is loaded every time the batch process is invoked, and it generates predictions on top of the incoming batch of records. For example, imagine that you have trained a model to identify your face in pictures and you want to label all the images you have on your hard drive. You will configure a process to use the model against each image, storing the results in a text or CSV file.

The main difference between these two is whether you already have the data to perform the predictions or not. If you already have the data and it does not change, you can make inferences in batch mode. For example, if you are trying to predict the football scores for next week’s matches, you can run a batch inference and store the results in a database. When customers ask for specific predictions, you can retrieve the value from the database. During the football match, though, the model predicting the end score needs features such as the current number of players and how many injuries there are, information that will become available in real time. In those situations, you might want to deploy a web service that exposes a REST API, where you send in the required information and the model makes the real-time inference. You will dive deep into both real-time and batch approaches in Chapter 12, Operationalizing Models with Code.
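The difference between the two approaches can be sketched as follows. Everything in this example is hypothetical: the predict_home_win function stands in for a trained model, the feature names are made up, and an in-memory buffer stands in for the database or CSV file where batch results would be stored. It does not use the AzureML deployment APIs, which are covered later in the book.

```python
import csv
import io


def predict_home_win(features):
    """Hypothetical stand-in for a trained model: favors the stronger team."""
    return features["home_strength"] > features["away_strength"]


def batch_inference(records, output_file):
    """Score every record that is already available and persist the results."""
    writer = csv.writer(output_file)
    writer.writerow(["match_id", "home_win"])
    for record in records:
        writer.writerow([record["match_id"], predict_home_win(record)])


def real_time_inference(payload):
    """Score a single request as it arrives, as a web service handler would."""
    return {"match_id": payload["match_id"], "home_win": predict_home_win(payload)}


# Batch: next week's matches are known in advance, so score them all at once.
next_week = [
    {"match_id": 1, "home_strength": 0.8, "away_strength": 0.6},
    {"match_id": 2, "home_strength": 0.4, "away_strength": 0.7},
]
results = io.StringIO()  # stands in for a CSV file or a database table
batch_inference(next_week, results)
print(results.getvalue())

# Real time: during the match, the features change and arrive one request at a time.
live_update = {"match_id": 1, "home_strength": 0.5, "away_strength": 0.6}
live_prediction = real_time_inference(live_update)
print(live_prediction)
```

The same model serves both paths; what changes is when the input data becomes available and where the prediction is delivered.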

In this section, you have reviewed the life cycle of a data science project and gone through all the stages, from understanding what needs to be done all the way to operationalizing a model by deploying a batch or real-time service. Especially for real-time streaming, you may have heard the term structured streaming, a scalable processing engine built on Spark that allows developers to perform real-time inferences the same way they would perform batch inferences on top of static data. You will learn more about Spark in the next section.

 

Using Spark in data science

At the beginning of the twenty-first century, the big data problem became a reality. Data stored in data centers was growing in volume and velocity. In 2021, we refer to datasets as big data when they reach at least a couple of terabytes in size, while it is not uncommon to see even petabytes of data in large organizations. These datasets grow at a rapid rate, which can range from a couple of gigabytes per day to a couple of gigabytes per minute, for example, when you are storing user interactions with a website in an online store to perform clickstream analysis.

In 2009, a research project started at the University of California, Berkeley, trying to provide the parallel computing tools needed to handle big data. In 2014, version 1.0 of Apache Spark was released, building on this research project. Members of that research team founded Databricks, one of the most significant contributors to the open source Apache Spark project.

Apache Spark provides an easy-to-use scalable solution that allows people to perform parallel processing on top of data in a distributed manner. The main idea behind the Spark architecture is that a driver node is responsible for executing your code. Your code is split into smaller parallel actions that can be performed against smaller portions of data. These smaller jobs are scheduled to be executed by the worker nodes, as seen in Figure 1.6:

Figure 1.6 – Parallel processing of big data in a Spark cluster

For example, suppose you wanted to calculate how many products your company sold during the last year. In that case, Spark could spin up 12 jobs that would produce the monthly aggregates, and then the results would be processed by another job that would sum up the totals for all months. If you were tempted to load the entire dataset into the memory of a single computer and perform those aggregates there, consider how much memory you would need. Let’s assume that the sales data for a single month is stored in a CSV file that is 1 GB. This file can require approximately 10 GB of memory to load. Compressed Parquet files will require even more memory. For example, a similar 1 GB Parquet file may end up needing 40 GB of memory to load as a pandas.DataFrame object. As you can understand, loading all 12 files in memory simultaneously is not practical on a single machine. You need to parallelize the processing, something Spark can do for you automatically.
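The split into monthly jobs followed by a final combining job can be sketched in plain Python. This is not Spark code: it uses a thread pool from the standard library to imitate worker nodes on one machine, and the names monthly_total, yearly_total, and sales_by_month, as well as the sample data, are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List


def monthly_total(units_sold: List[int]) -> int:
    """One job: aggregate a single month's partition of the data."""
    return sum(units_sold)


def yearly_total(sales_by_month: Dict[str, List[int]]) -> int:
    """Run one job per month in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=4) as pool:
        partial_totals = list(pool.map(monthly_total, sales_by_month.values()))
    # Final job: sum the 12 monthly aggregates into a single figure.
    return sum(partial_totals)


# Hypothetical data: 12 months, each holding a handful of order quantities.
sales_by_month = {f"2021-{month:02d}": [month, 10, 20] for month in range(1, 13)}
total_units = yearly_total(sales_by_month)
```

Each job only ever holds one month of data, which is the point of the approach. By the estimates above, holding all 12 CSV files in memory at once would take roughly 12 × 10 GB = 120 GB.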

Important note

Parquet files store data in a columnar format, which allows you to load only the columns you need. In the 1 GB Parquet example, if you load only half the columns from the dataset, you will probably need only 20 GB of memory. This is one of the reasons why the Parquet format is widely used in analytical workloads.
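A toy model of a columnar layout shows why loading fewer columns reduces the memory needed. This sketch does not read real Parquet files: the columnar_store dictionary, its column names, and the load_columns helper are hypothetical, and the columns are assumed to be of equal size.

```python
from typing import Dict, List

# Hypothetical dataset with four equally sized columns, stored column by column.
columnar_store: Dict[str, List[float]] = {
    "quantity": [1.0, 2.0, 3.0],
    "unit_price": [9.5, 4.0, 2.5],
    "discount": [0.0, 0.5, 0.0],
    "shipping": [1.0, 1.0, 2.0],
}


def load_columns(store: Dict[str, List[float]], wanted: List[str]) -> Dict[str, List[float]]:
    """Copy only the requested columns; the other columns are never touched."""
    return {name: list(store[name]) for name in wanted}


def count_values(table: Dict[str, List[float]]) -> int:
    """Number of values held in memory for a loaded table."""
    return sum(len(values) for values in table.values())


full_table = load_columns(columnar_store, list(columnar_store))
half_table = load_columns(columnar_store, ["quantity", "unit_price"])
loaded_fraction = count_values(half_table) / count_values(full_table)
```

With pandas, the columns argument of pandas.read_parquet plays the role of the wanted list in this sketch.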

Spark is written in the Scala programming language. It offers APIs for Scala, Python, Java, R, and even C#. Still, the data science community mostly works either in Scala, to achieve maximum computational performance and utilize the Java library ecosystem, or in Python, which is widely adopted by the modern data science community. When you are writing Python code to utilize the Spark engine, you are using the PySpark tool to perform operations on top of Resilient Distributed Datasets (RDDs) or on top of Spark.DataFrame objects, which were introduced later in the Spark framework. To benefit from the distributed nature of Spark, you need to be handling big datasets. This means that Spark may be overkill if you deal with only hundreds of thousands of records or even a couple of million records.

Spark offers two machine learning libraries, the old MLlib and the new version called Spark ML. Spark ML uses the Spark.DataFrame structure, a distributed collection of data that offers similar functionality to the DataFrame objects used in Python pandas or R. Moreover, the Koalas project provides an implementation that allows data scientists with existing knowledge of pandas.DataFrame manipulations to use their existing coding skills on top of Spark.

AzureML allows you to execute Spark jobs on top of PySpark, either using its native compute clusters or by attaching to Azure Databricks or Synapse Spark pools. Although you will not write any PySpark code in this book, in Chapter 12, Operationalizing Models with Code, you will learn how to achieve similar parallelization benefits without the need for Spark or a driver node.

No matter whether you are coding in regular Python, PySpark, R, or Scala, you are producing some code artifacts that are probably part of a larger system. In the next section, you will explore the DevOps mindset, which emphasizes the communication and collaboration of software engineers, data scientists, and system administrators to achieve faster release of valuable product features.

 

Adopting the DevOps mindset

DevOps is a team mindset that tries to minimize the silos between developers and system operators to shorten the development life cycle of a product. Developers are constantly changing a product to introduce new features and modify existing behaviors. On the other side, system operators need to keep the production systems stable and up and running. In the past, these two groups of people were isolated, and developers would throw the new piece of software over to the operations team, who would try to deploy it in production. As you can imagine, things didn’t always work that well, causing friction between those two groups. When it comes to DevOps, one fundamental practice is that a team needs to be autonomous and should contain all required disciplines, both developers and operators.

When it comes to data science, some people refer to the practice as MLOps, but the fundamental ideas remain the same. A team should be self-sufficient, capable of developing all required components for the overall solution, from the data engineering parts that bring in data and the training of the models all the way to operationalizing the model in production. These teams usually work in an agile manner, which embraces an iterative approach, seeking constant improvement based on feedback, as seen in Figure 1.7:

Figure 1.7 – The feedback flow in an agile MLOps team

The MLOps team operates on its backlog and performs the iterative steps you saw in the Working on a data science project section. Once the model is ready, the system administrators, who are part of the team, are aware of what needs to be done to take the model into production. The model is monitored closely, and if a defect or performance degradation is observed, a backlog item is created for the MLOps team to address in their next sprint.

In order to minimize the development and deployment life cycle of new features in production, automation needs to be embraced. The goal of a DevOps team is to minimize the number of human interventions in the deployment process and automate as many repeatable tasks as possible.

Figure 1.8 shows the most frequently used components while developing real-time models using the MLOps mindset:

Figure 1.8 – Components usually seen in MLOps-driven data science projects

Let’s analyze those components:

  • ARM templates allow you to automate the deployment of Azure resources. This enables the team to spin up and down development, testing, or even production environments in no time. These artifacts are stored within Azure DevOps in a Git version-control repository. The deployment of multiple environments is automated using Azure DevOps pipelines. You are going to read about ARM templates in Chapter 2, Deploying Azure Machine Learning Workspace Resources.
  • Using Azure Data Factory, the data science team orchestrates the pulling and cleansing of the data from the source systems. The data is copied within a data lake, which is accessible from the AzureML workspace. Azure Data Factory uses ARM templates to define its orchestration pipelines, templates that are stored within the Git repository to track changes and be able to deploy in multiple environments.
  • Within the AzureML workspace, data scientists are working on their code. Initially, they start working on Jupyter notebooks. Notebooks are a great way to prototype some ideas, as you will see in Chapter 7, The AzureML Python SDK. As the project progresses, the code is exported from the notebooks and organized into script files. All those code artifacts are version-controlled in Git, using the terminal and commands such as the ones seen in Figure 1.9:

    Figure 1.9 – Versioning a notebook and a script file using Git within AzureML
  • When a model is trained, if it is performing better than the model that is currently in production, it is registered within AzureML, and an event is emitted. This event is captured by the AzureML DevOps plugin, which triggers the automatic deployment of the model in the test environment. The model is tested within that environment, and if all tests pass and no errors have been logged in Application Insights, which is monitoring the deployment, the artifacts can be automatically deployed to the next environment, all the way to production.
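The two decisions in the last bullet, whether to register a model and whether to promote a deployment, can be sketched as simple gate functions. This is a minimal sketch and not the AzureML or Azure DevOps API: should_register, can_promote, and DeploymentReport are hypothetical names, and the sketch assumes a metric where a higher score is better.

```python
from dataclasses import dataclass


@dataclass
class DeploymentReport:
    """Hypothetical summary of what happened in the test environment."""
    tests_passed: bool
    logged_errors: int  # errors counted in the monitoring logs for this deployment


def should_register(candidate_score: float, production_score: float) -> bool:
    """Register only when the new model beats the model currently in production.

    Assumes a higher score is better (for example, accuracy).
    """
    return candidate_score > production_score


def can_promote(report: DeploymentReport) -> bool:
    """Promote to the next environment only if all tests passed and no errors were logged."""
    return report.tests_passed and report.logged_errors == 0
```

Keeping the gates as small, pure functions makes them easy to test and to call from an automated pipeline step.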

The ability to ensure both code and model quality plays a crucial role in this automation process. In Python, you can use various tools, such as Flake8, Bandit, and Black, to ensure code quality, check for common security issues, and consistently format your code base. You can also use the pytest framework to write your functional tests, where you test the model results against a golden dataset. With pytest, you can even perform integration testing to verify that the end-to-end system is working as expected.
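A functional test against a golden dataset might look like the following sketch. Everything in it is an assumption made for illustration: predict is a hypothetical stand-in for the trained model, GOLDEN_DATASET holds made-up inputs and labels, and the 0.95 accuracy threshold is arbitrary.

```python
from typing import List, Tuple


def predict(features: List[float]) -> int:
    """Hypothetical stand-in for the trained model: classify by the sum of the features."""
    return 1 if sum(features) > 1.0 else 0


# Hypothetical golden dataset: inputs paired with labels the team agrees are correct.
GOLDEN_DATASET: List[Tuple[List[float], int]] = [
    ([0.2, 0.3], 0),
    ([0.9, 0.4], 1),
    ([1.5, 0.0], 1),
    ([0.0, 0.0], 0),
]


def golden_accuracy() -> float:
    """Share of golden examples for which the model returns the expected label."""
    correct = sum(1 for features, expected in GOLDEN_DATASET if predict(features) == expected)
    return correct / len(GOLDEN_DATASET)


def test_model_matches_golden_dataset() -> None:
    """pytest collects functions named test_*; a failing assert fails the test run."""
    assert golden_accuracy() >= 0.95
```

Saved in a file whose name starts with test_, this function would be picked up when you run the pytest command, so the same check can run on every commit in an automated pipeline.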

Adopting DevOps is a never-ending journey. The team will get better every time it repeats the process. The trick is to build trust in the end-to-end development and deployment process so that everyone is confident to make changes and deploy them in production. When the process fails, understand why it failed and learn from your mistakes. Create the mechanisms that will prevent future failures and move on.

 

Summary

In this chapter, you learned about the origins of data science and how it relates to machine learning. You then learned about the iterative nature of a data science project and discovered the various phases you will be working on. Starting from the problem understanding phase, you will then acquire and explore data, create new features, train a model, and then deploy to verify your hypothesis. Then, you saw how you can scale out the processing of big data files using the Spark ecosystem. In the last section, you discovered the DevOps mindset that helps agile teams be more efficient, meaning that they develop and deploy new product features in short periods of time. You saw the components that are commonly used within an MLOps-driven team, and you saw that AzureML sits at the center of that diagram.

In the next chapter, you will learn how to deploy an AzureML workspace and understand the Azure resources that you will be using in your data science journey throughout this book.

 

Further reading

This section offers a list of helpful web resources that will help you augment your knowledge of the topics addressed in this chapter:

About the Authors

  • Andreas Botsikas

    Andreas Botsikas is an experienced advisor working in the software industry. He has worked in the finance sector, leading highly efficient DevOps teams, and architecting and building high-volume transactional systems. He then traveled the world, building AI-infused solutions with a group of engineers and data scientists. Currently, he works as a trusted advisor for customers onboarding into Azure, de-risking and accelerating their cloud journey. He is a strong engineering professional with a Doctor of Philosophy (Ph.D.) in resource optimization with artificial intelligence from the National Technical University of Athens.

  • Michael Hlobil

    Michael Hlobil is an experienced architect focused on quickly understanding customers' business needs, with over 25 years of experience in IT pitfalls and successful projects, and is dedicated to creating solutions based on the Microsoft Platform. He has an MBA in Computer Science and Economics (from the Technical University and the University of Vienna) and an MSc (from the ESBA) in Systemic Coaching. He has worked on advanced analytics projects over the last decade, including massively parallel systems and machine learning systems. He enjoys working with customers and supporting their journey to the cloud.
