You're reading from Azure Data Scientist Associate Certification Guide

Product type: Book
Published in: Dec 2021
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781800565005
Edition: 1st Edition
Authors (2):
Andreas Botsikas

Andreas Botsikas is an experienced advisor working in the software industry. He has worked in the finance sector, leading highly efficient DevOps teams, and architecting and building high-volume transactional systems. He then traveled the world, building AI-infused solutions with a group of engineers and data scientists. Currently, he works as a trusted advisor for customers onboarding into Azure, de-risking and accelerating their cloud journey. He is a strong engineering professional with a Doctor of Philosophy (Ph.D.) in resource optimization with artificial intelligence from the National Technical University of Athens.

Michael Hlobil

Michael Hlobil is an experienced architect focused on quickly understanding customers' business needs. With over 25 years of experience in IT, covering both pitfalls and successful projects, he is dedicated to creating solutions based on the Microsoft Platform. He has an MBA in Computer Science and Economics (from the Technical University and the University of Vienna) and an MSc in Systemic Coaching (from the ESBA). Over the last decade, he has been working on advanced analytics projects, including massively parallel systems and machine learning systems. He enjoys working with customers and supporting their journey to the cloud.

Chapter 11: Working with Pipelines

In this chapter, you will learn how to author repeatable processes by defining pipelines that consist of multiple steps. You can use these pipelines to author training pipelines that transform your data and then train models, or to perform batch inference using pre-trained models. Once you register one of these pipelines, you can invoke it through an HTTP endpoint or the SDK, or even configure it to execute on a schedule. With this knowledge, you will be able to implement and consume pipelines using the Azure Machine Learning (AzureML) SDK.

In this chapter, we are going to cover the following main topics:

  • Understanding AzureML pipelines
  • Authoring a pipeline
  • Publishing a pipeline to expose it as an endpoint
  • Scheduling a recurring pipeline

Technical requirements

You will need access to an Azure subscription. Within that subscription, you will need a resource group named packt-azureml-rg, and you will need either the Contributor or the Owner role assigned at the resource group level through Access control (IAM). Within that resource group, you should have already deployed an Azure Machine Learning workspace named packt-learning-mlw, as described in Chapter 2, Deploying Azure Machine Learning Workspace Resources.

You will also need to have a basic understanding of the Python language. The code snippets target Python version 3.6 or newer. You should also be familiar with working in the notebook experience within AzureML studio, something that was covered in Chapter 8, Experimenting with Python Code.

This chapter assumes you have registered the loans dataset you generated in Chapter 10, Understanding Model Results. It is also assumed that you have created a compute cluster named cpu-sm-cluster, as described in the Working with compute...

Understanding AzureML pipelines

In Chapter 6, Visual Model Training and Publishing, you saw how you can design a training process using building blocks. Similar to those workflows, the AzureML SDK allows you to author Pipelines that orchestrate multiple steps. For example, in this chapter, you will author a Pipeline that consists of two steps. The first step pre-processes the loans dataset, which serves as the raw training data, and stores the result in a temporary location. The second step then reads that data and trains a machine learning model, which is stored in a blob store location. In this example, each step is nothing more than a Python script file that is executed on a specific compute target using a predefined Environment.
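To give you a sense of what such a pipeline looks like in code, the following is a minimal sketch of a two-step pipeline. It assumes you already have a reference to your workspace in the ws variable, the cpu-sm-cluster compute cluster, and a predefined Environment stored in the my_environment variable; the script file names and argument names are illustrative placeholders, not necessarily the exact names used later in this chapter:

from azureml.core import Experiment
from azureml.core.runconfig import RunConfiguration
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import PythonScriptStep

# Intermediate location where the first step writes and the second step reads
prepared_data = OutputFileDatasetConfig(name="prepared_loans_data")

# Wrap the predefined Environment (assumed to already exist) in a run configuration
run_config = RunConfiguration()
run_config.environment = my_environment

step_01 = PythonScriptStep(
    name="prepare-data",
    source_directory="step01",
    script_name="prepare_data.py",  # illustrative file name
    arguments=["--output-path", prepared_data],
    compute_target="cpu-sm-cluster",
    runconfig=run_config,
)

step_02 = PythonScriptStep(
    name="train-model",
    source_directory="step02",
    script_name="train_model.py",  # illustrative file name
    arguments=["--input-path", prepared_data.as_input()],
    compute_target="cpu-sm-cluster",
    runconfig=run_config,
)

pipeline = Pipeline(workspace=ws, steps=[step_01, step_02])
run = Experiment(ws, "loans-pipeline").submit(pipeline)

Note that it is the data dependency through prepared_data, and not the order of the steps list, that makes AzureML execute step_01 before step_02.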

Important note

Do not confuse AzureML Pipelines with the sklearn Pipelines you read about in Chapter 10, Understanding Model Results. The sklearn ones allow you to chain various transformations and feature engineering methods to transform the data...

Authoring a pipeline

Let's assume that you need to create a repeatable workflow that has two steps:

  1. It loads the data from a registered dataset and splits it into training and test datasets. These datasets are converted into a special construct needed by the LightGBM tree-based algorithm. The converted constructs are stored to be used by the next step. In our case, you will use the loans dataset that you registered in Chapter 10, Understanding Model Results. You will be writing the code for this step within a folder named step01.
  2. It loads the pre-processed data and trains a LightGBM model that is then stored in the /models/loans/ folder of the default datastore attached to the AzureML workspace. You will be writing the code for this step within a folder named step02.

    Each step will be a separate Python file, taking some arguments to specify where to read the data from and where to write the data to, as sketched below. These scripts will utilize the same mechanics as the scripts you authored...
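For illustration, the first step's script could follow a pattern similar to the following sketch, reading the output location from its command-line arguments. The file name, the argument name, the registered dataset name (loans), and the target column name (approved_loan) are assumptions for the sake of the example:

# step01/prepare_data.py - hypothetical sketch of the first step
import argparse
import os

import lightgbm as lgb
from azureml.core import Run
from sklearn.model_selection import train_test_split

parser = argparse.ArgumentParser()
parser.add_argument("--output-path", dest="output_path", type=str)
args = parser.parse_args()

# Get a reference to the workspace through the run context
run = Run.get_context()
ws = run.experiment.workspace

# Load the registered loans dataset and split it into train and test sets
loans_df = ws.datasets["loans"].to_pandas_dataframe()
train_df, test_df = train_test_split(loans_df, test_size=0.2, random_state=42)

# Convert to the LightGBM Dataset construct and store it for the next step
os.makedirs(args.output_path, exist_ok=True)
train_data = lgb.Dataset(
    train_df.drop(columns=["approved_loan"]), label=train_df["approved_loan"]
)
test_data = lgb.Dataset(
    test_df.drop(columns=["approved_loan"]),
    label=test_df["approved_loan"],
    reference=train_data,
)
train_data.save_binary(os.path.join(args.output_path, "train.bin"))
test_data.save_binary(os.path.join(args.output_path, "test.bin"))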

Publishing a pipeline to expose it as an endpoint

So far, you have defined a pipeline using the AzureML SDK. If you had to restart the kernel of your Jupyter notebook, you would lose the reference to the pipeline you defined, and you would have to rerun all the cells to recreate the pipeline object. The AzureML SDK allows you to publish a pipeline that effectively registers it as a versioned object within the workspace. Once a pipeline is published, it can be submitted without the Python code that constructed it.

In a new cell in your notebook, add the following code:

published_pipeline = pipeline.publish(
    "Loans training pipeline", 
    description="A pipeline to train a LightGBM model")

This code publishes the pipeline and returns a PublishedPipeline object, the versioned object registered within the workspace. The most interesting attribute of that object is the endpoint, which returns the REST endpoint URL...
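Having the endpoint URL, any client that can acquire an Azure Active Directory token is able to trigger the published pipeline over HTTP. As a sketch, the following code invokes it from Python using the requests library; the experiment name passed in the payload is an illustrative choice:

import requests
from azureml.core.authentication import InteractiveLoginAuthentication

# Acquire an Azure Active Directory token to authenticate the request
auth = InteractiveLoginAuthentication()
auth_header = auth.get_authentication_header()

response = requests.post(
    published_pipeline.endpoint,
    headers=auth_header,
    json={"ExperimentName": "loans-pipeline-via-rest"},
)
response.raise_for_status()
print("Submitted pipeline run:", response.json()["Id"])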

Scheduling a recurring pipeline

Being able to invoke a pipeline through the published REST endpoint is great when you have third-party systems that need to invoke a training process after a specific event has occurred. For example, suppose you are using Azure Data Factory to copy data from your on-premises databases. You could use the Machine Learning Execute Pipeline activity and trigger a published pipeline, as shown in Figure 11.9:

Figure 11.9 – Sample Azure Data Factory pipeline triggering an AzureML published pipeline following a copy activity

If you wanted to schedule the pipeline to be triggered monthly, you would need to publish the pipeline as you did in the previous section, get the published pipeline ID, create a ScheduleRecurrence, and then create the Schedule. Return to your notebook where you already have a reference to published_pipeline. Add a new cell with the following code:

from azureml.pipeline.core.schedule import ScheduleRecurrence...
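Based on that description, a minimal sketch of the scheduling code could look as follows; the schedule name and the experiment name are illustrative:

from azureml.pipeline.core.schedule import Schedule, ScheduleRecurrence

# Trigger the published pipeline once a month
recurrence = ScheduleRecurrence(frequency="Month", interval=1)

schedule = Schedule.create(
    workspace=ws,
    name="loans-monthly-training",  # illustrative schedule name
    pipeline_id=published_pipeline.id,
    experiment_name="loans-pipeline-schedule",
    recurrence=recurrence,
    description="Monthly retraining of the loans model",
)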

Summary

In this chapter, you learned how to define AzureML pipelines using the AzureML SDK. These pipelines allow you to orchestrate various steps in a repeatable manner. You started by defining a training pipeline consisting of two steps. You then learned how to trigger the pipeline and how to troubleshoot potential code issues. After that, you published the pipeline to register it within the AzureML workspace and acquired an HTTP endpoint that third-party software systems can use to trigger pipeline executions. In the last section, you learned how to schedule a published pipeline to run on a recurring basis.

In the next chapter, you will learn how to operationalize the models you have been training so far in this book. In that context, you will use the knowledge you acquired in this chapter to author batch inference pipelines, which you can then publish and trigger over HTTP, or schedule to run on a recurring basis.

Questions

In each chapter, you will find a couple of questions to validate your understanding of the topics discussed in this chapter.

  1. What affects the execution order of the pipeline steps?

    a. The order in which the steps were defined when constructing the Pipeline object.

    b. The data dependencies between the steps.

    c. All steps execute in parallel, and you cannot affect the execution order.

  2. True or false: All steps within a pipeline need to execute within the same compute target and Environment.
  3. True or false: PythonScriptStep, by default, reuses the previous execution results if nothing has changed in the parameters or the code files.
  4. You are trying to debug a child run execution issue. Which of the following methods should you call in the StepRun object?

    a. get_file_names

    b. get_details_with_logs

    c. get_metrics

    d. get_details

  5. You have just defined a pipeline in Python code. What steps do you need to take to schedule a daily execution of that pipeline?
...

Further reading

This section offers a list of web resources that can help you augment your knowledge of the AzureML SDK and the various code snippets used in this chapter.
