All ML projects are unique
    That said, no matter the details, broadly speaking, all successful ML projects actually have a good deal in common. They require the translation of a business problem into a technical problem, a lot of research and understanding, proofs of concept, analyses, iterations, the consolidation of work, the construction of the final product, and its deployment to an appropriate environment. That is ML engineering in a nutshell!
    Developing this a bit further, you can start to bucket these activities into rough categories or stages, the results of each being necessary inputs for later stages. This is shown in Figure 2.6 :
    Figure 2.6: The stages that any ML project goes through as part of the ML development process.
    Each category of work has a slightly different flavor, but taken together, they provide the backbone of any good ML project. The next few sections will develop the details of each of these categories and begin to show you how they can be used to build your ML engineering solutions. As we will discuss later, it is also not necessary for you to tackle your entire project in four steps like this; you can actually work through each of these steps for a specific feature or part of your overall project. This will be covered in the Selecting a software development methodology  section.
    Let’s make this a bit more real. The main outputs you should expect from each of these stages are summarized in Table 2.1:
    
      
        
          
    Stage | Outputs
    Discover | Clarity on the business question. Clear arguments for ML over another approach. Definition of the KPIs and metrics you want to optimize. A sketch of the route to value.
    Play | Detailed understanding of the data. Working proof of concept. Agreement on the model/algorithm/logic that will solve the problem. Evidence that a solution is doable within realistic resource scenarios. Evidence that good ROI can be achieved.
    Develop | A working solution that can be hosted on appropriate and available infrastructure. Thorough test results and performance metrics (for algorithms and software). An agreed retraining and model deployment strategy. Unit tests, integration tests, and regression tests. Solution packaging and pipelines.
    Deploy | A working and tested deployment process. Provisioned infrastructure with appropriate security and performance characteristics. Model retraining and management processes. An end-to-end working solution!
           
         
       
    
    Table 2.1: The outputs of the different stages of the ML development process.
    
      IMPORTANT NOTE
      You may think that an ML engineer only really needs to consider the latter two stages, develop , and deploy , and that earlier stages are owned by the data scientist or even a business analyst. We will indeed focus mainly on these stages throughout this book and this division of labor can work very well. It is, however, crucially important that if you are going to build an ML solution, you understand all of the motivations and development steps that have gone before – you wouldn’t build a new type of rocket without understanding where you want to go first, would you?
     
    Comparing this to CRISP-DM 
    The high-level categorization of project steps that we will outline in the rest of this chapter has many similarities to, and some important differences from, the Cross-Industry Standard Process for Data Mining (CRISP-DM), which defines the following phases for any data project:
    
      Business understanding: This is all about getting to know the business problem and domain area. This becomes part of the Discover phase in the four-step model.
      Data understanding: Extending the knowledge of the business domain to include the state of the data, its location, and how it is relevant to the problem. Also included in the Discover phase.
      Data preparation: Starting to take the data and transform it for downstream use. This will often have to be iterative. Captured in the Play stage.
      Modeling: Taking the prepared data and then developing analytics on top of it; this could now include ML of various levels of sophistication. This is an activity that occurs in both the Play and Develop phases of the four-step methodology.
      Evaluation: This stage is concerned with confirming whether the solution will meet the business requirements and performing a holistic review of the work that has gone before. This helps confirm if anything was overlooked or could be improved upon. This is very much part of the Develop and Deploy phases; in the methodology we will describe in this chapter, these tasks are baked in across the whole project.
      Deployment: In CRISP-DM, this was originally focused on deploying simple analytics solutions like dashboards or scheduled ETL pipelines that would run the decided-upon analytics models.
    In the world of modern ML engineering, this stage can represent, well, anything talked about in this book! CRISP-DM suggests sub-stages around planning and then reviewing the deployment.
 
    As you can see, there is a lot of overlap between CRISP-DM and the four-step approach outlined here. The CRISP-DM methodology is just another way to group the important activities of any data project in order to deliver value. However, there are a few reasons why I believe it is less well suited to modern ML engineering:
    
      The process outlined in CRISP-DM is relatively rigid and quite linear. This can be beneficial for providing structure but might inhibit moving fast in a project. 
      The methodology is very big on documentation. Most steps detail writing some kind of report, review, or summary. Writing and maintaining good documentation is absolutely critical in a project but there can be a danger of doing too much. 
      CRISP-DM was written in a world before “big data” and large-scale ML. It is unclear to me whether its details still apply in such a different world, where classic extract-transform-load patterns are only one of so many. 
      CRISP-DM definitely comes from the data world and then tries to move toward the idea of a deployable solution in the last stage. This is laudable, but in my opinion, this is not enough. ML engineering is a different discipline in the sense that it is far closer to classic software engineering than not. This is a point that this book will argue time and again. It is therefore important to have a methodology where the concepts of deployment and development are aligned with software and modern ML techniques all the way through. 
     
    The four-step  methodology
    Given this, let’s now go through the four steps in detail.
    Discover 
    Before you start working on any solution, it is vital that you understand the problem you are trying to solve and the business context around it. This activity is often termed discovery in business analysis and is crucial if your ML project is going to be a success.
    The key things to do during the discovery phase are the following:
    
      Speak to the customer! And then speak to them again: You must understand the end user requirements in detail if you are to design and build the right system.
      Document everything: You will be judged on how well you deliver against the requirements, so make sure that all of the key points from your discussion are documented and signed off by members of your team and the customer or their appropriate representative.
      Define the metrics that matter: It is very easy at the beginning of a project to get carried away and to feel like you can solve any and every problem with the amazing new tool you are going to build. Fight this tendency as aggressively as you can, as it can easily cause major headaches later on. Instead, steer your conversations toward defining a single or very small number of metrics that define what success will look like.
      Start finding out where the data lives!: If you can start working out what kind of systems you will have to access to get the data you need, this saves you time later and can help you find any major issues before they derail your project.
    Using user stories 
    Once you have spoken with your customers and stakeholders, you can start to translate their requirements into user stories. User stories are concise and consistently formatted expressions of what the user or customer wants to see and the acceptance criteria for that feature or unit of work. For example, we may want to define a user story based on the taxi ride example from Chapter 1, Introduction to ML Engineering: “As a user of our internal web service, I want to see anomalous taxi rides and be able to investigate them further.”
    Let’s begin!
    
      To add this in Jira, select the Create  button. 
      Next, select Story . 
      Then, fill in the details as you deem appropriate. 
     
    You have now added a user story to your backlog, which will look something like Figure 2.7.
    Figure 2.7: An example user story in Jira.
    The data sources you use are particularly crucial to understand. As you know, garbage in, garbage out , or even worse, no data, no go ! The particular questions you have to answer about the data are mainly centered around access , technology , quality , and relevance .
    For access and technology, you are trying to pre-empt how much work the data engineers have to do to start their pipeline of work and how much this will hold up the rest of the project. It is therefore crucial that you get this one right.
    A good example would be if you find out quite quickly that the main bulk of data you will need lives in a legacy internal financial system with no real modern APIs and no access request mechanism for non-finance team members. If its main backend is on-premises and you need to migrate locked-down financial data to the cloud, but this makes your business nervous, then you know you have a lot of work to do before you type a line of code. If the data already lives in an enterprise data lake that your team has access to, then you are obviously in a better position. Any challenge is surmountable if the value proposition is strong enough, but finding all this out early will save you time, energy, and money later on.
    Relevance is a bit harder to find out before you kick off, but you can begin to get an idea. For example, if you want to perform the inventory forecast we discussed in Chapter 1 , Introduction to ML Engineering , do you need to pull in customer account information? If you want to create the classifier of premium  or non-premium  customers as marketing targets, also mentioned in Chapter 1 , Introduction to ML Engineering , do you need to have data on social media feeds? The question as to what is relevant will often be less clear-cut than for these examples but an important thing to remember is that you can always come back to it if you really missed something important. You are trying to capture the most important design decisions early, so common sense and lots of stakeholder and subject-matter expert engagement will go a long way.
    Data quality is something that you can try to anticipate a little before moving forward in your project with some questions to current users or consumers of the data or those involved in its entry processes. To get a more quantitative understanding though, you will often just need to get your hands on a sample of the data and explore it yourself.
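    As a minimal sketch of what that first quantitative look might involve, assuming pandas is available and using a hypothetical extract file and column names:

import pandas as pd

# Hypothetical extract of the source data
df = pd.read_csv("taxi_rides_sample.csv")

# Quick quality checks: volume, types, missing values, and duplicates
print(df.shape)
print(df.dtypes)
print(df.isnull().mean().sort_values(ascending=False))  # fraction missing per column
print(df.duplicated().sum())

# Basic distributional sanity check on the numeric columns
print(df.describe())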
    In the next section, we will look at how we develop proof-of-concept ML solutions in the most research-intensive phase, Play .
    Play 
    In the play stage, your aim is to work out whether the problem can actually be solved with the data you have and to experiment with candidate approaches, without committing yet to building the final solution.
    In this part of the process, you are not overly concerned with details of implementation, but with exploring the realms of possibility and gaining an in-depth understanding of the data and the problem, which goes beyond initial discovery work. Since the goal here is not to create production-ready  code or to build reusable tools, you should not worry about whether or not the code you are writing is of the highest quality, or using sophisticated patterns. For example, it will not be uncommon to see code that looks something like the following examples (taken, in fact, from the repo for this book):
    Figure 2.8: Some example prototype code that will be created during the play stage.
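    If you do not have the screenshots to hand, the code shown in Figure 2.8 is roughly of the following flavor (an illustrative sketch rather than the exact code from the book’s repository):

# Typical exploratory code in a Jupyter notebook cell
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_csv("rides.csv")  # hypothetical data file
df.head()
df.dtypes

# Ad hoc plotting with a non-descriptive temporary variable
tmp = df.groupby("hour").size()
plt.plot(tmp)
plt.show()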
    Even a quick glance at these screenshots tells you a few things:
    
      The code is in a Jupyter notebook, which is run by a user interactively in a web browser. 
      The code sporadically calls methods to simply check or explore elements of the data (for example, df.head() and df.dtypes). 
      There is ad hoc code for plotting (and it’s not very intuitive!). 
      There is a variable called tmp, which is not very descriptive. 
     
    All of this is absolutely fine during the play stage; the point is to explore quickly and learn. When the time comes to build the production solution, however, code like this will need to be rewritten to a much higher standard, which is the subject of the develop stage.
    Develop 
    As we have mentioned, the develop stage is where you take the findings of the discovery and play stages and engineer them into a solution built to production standards.
    This section explores several of those methodologies, processes, and considerations that can be employed in the development phase of our ML engineering projects.
    Selecting a software development methodology 
    One of the first things to decide is which software development methodology your team will follow. The first family of methodologies, often termed Waterfall, covers project workflows that fit quite naturally with the idea of building something complex (think a building or a car). In Waterfall methodologies, there are distinct and sequential phases of work, each with a clear set of outputs that are needed before moving on to the next phase. For example, a typical Waterfall project may have phases that broadly cover requirements-gathering, analysis, design, development, testing, and deployment (sound familiar?). The key thing is that in a Waterfall-flavored project, when you are in the requirements-gathering phase, you should only be working on gathering requirements, when in the testing phase, you should only be working on testing, and so on. We will discuss the pros and cons of this for ML in the next few paragraphs after introducing another set of methodologies.
    The other set of methodologies, termed Agile, began its life after the introduction of the Agile Manifesto in 2001 (https://agilemanifesto.org/). Agile development embraces flexibility, iteration, and fast feedback, ideas that will feel very natural to anyone with a research or data science background.
    What may not be so familiar to you if you have this type of scientific or academic background is that you can still embrace these concepts within a relatively strict framework that is centered around delivery outcomes. Agile software development methodologies are all about finding the balance between experimentation and delivery. This is often done by introducing the concepts of ceremonies  (such as Scrums  and Sprint  Retrospectives ) and roles  (such as Scrum Master  and Product Owner ).
    Further to this, within Agile development, there are two variants that are extremely popular: Scrum and Kanban. Scrum projects organize work into Sprints, time-boxed blocks of work with clearly defined goals, whereas Kanban is all about maintaining a continuous flow of tasks from an organized backlog into work in progress through to completed work.
    All of these approaches can be mixed and matched across a project. For example, it may make sense to run post-deployment work that has a focus on maintaining an already existing service (sometimes termed a business-as-usual activity), such as further model improvements or software optimizations, in a Kanban framework. It may make sense to do the main delivery of your core body of work in Sprints with very clear outcomes. But you can chop and change and see what fits best for your use cases, your team, and your organization.
    But what makes applying these types of workflows to ML projects different? What do we need to think about in this world of ML that we didn’t before? Well, some of the key points are the following:
    
      You don’t know what you don’t know: You cannot know whether you will be able to solve the problem until you have seen the data. Traditional software engineering is not as critically dependent on the data that will flow through the system as ML engineering is. We can know how to solve a problem in principle, but if the appropriate data does not exist in sufficient quantity or is of poor quality, then we can’t solve the problem in practice.
      Your system is alive: If you build a classic website, with its backend database, shiny frontend, amazing load-balancing, and other features, then realistically, if the resource is there, it can just run forever. Nothing fundamental changes about the website and how it runs over time. Clicks still get translated into actions and page navigation still happens the same way. Now, consider putting some ML-generated advertising content based on typical user profiles in there. What is a typical user profile and does that change with time? With more traffic and more users, do behaviors that we never saw before become the new normal? Your system is learning all the time and that leads to the problems of model drift and distributional shift, as well as more complex update and rollback scenarios.
      Nothing is certain: When building a system that uses rule-based logic, you know exactly what you will get out for a given input: if X, then Y means just that, always. With ML models, it is often much harder to know what the answer is when you ask the question, which is in fact why these algorithms are so powerful.
    But it does mean that you can have unpredictable behavior, either for the reasons discussed previously or simply because the algorithm has learned something that is not obvious about the data to a human observer, or, because ML algorithms can be based on probabilistic and statistical concepts, results come attached to some uncertainty or fuzziness . A classic example is when you apply logistic regression and receive the probability of the data point belonging to one of the classes. It’s a probability so you cannot say with certainty that it is the case; just how likely it is! This is particularly important to consider when the outputs of your ML system will be leveraged by users or other systems to make decisions.
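    To make the fuzziness point concrete, here is a minimal sketch using scikit-learn; the data is synthetic and the 0.5 threshold is just the conventional default, chosen purely for illustration:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification data
X, y = make_classification(n_samples=500, n_features=5, random_state=42)
clf = LogisticRegression().fit(X, y)

# The model returns probabilities, not certainties
probs = clf.predict_proba(X[:3])
print(probs)

# Any hard decision requires choosing a threshold, and that choice is ours
predictions = (probs[:, 1] > 0.5).astype(int)
print(predictions)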
    Given these issues, let’s consider which development methodologies can best help us when we build our ML solutions. In Table 2.2, we can see some of the advantages and disadvantages of Agile and Waterfall approaches for ML development.
    
      
        
          
    Methodology | Pros | Cons
    Agile | Flexibility is expected. Faster dev-to-deploy cycles. | If not well managed, can easily have scope drift. Kanban or Sprints may not work well for some projects.
    Waterfall | Clearer path to deployment. Clear staging and ownership of tasks. | Lack of flexibility. Higher admin overheads.
           
         
       
    
    Table 2.2: Agile versus Waterfall for ML development.
    Let’s move on to the next section!
    Package management (conda and pip) 
    If I told you to write some code that did anything beyond the very basics, you would not write everything from scratch; you would import the appropriate libraries and build on them. But how do you install those libraries, track their versions, and manage their dependencies consistently? This is where pip and conda come in.
    pip is the standard package manager in Python and the one recommended for use by the Python Packaging Authority.
    It retrieves and installs Python packages from PyPI, the Python Package Index. pip is super easy to use and is often the suggested way to install packages in tutorials and books.
    conda is the package and environment  manager that comes with the Anaconda and Miniconda Python distributions. A key strength of conda is that although it comes from the Python ecosystem, and it has excellent capabilities there, it is actually a more general package manager. As such, if your project requires dependencies outside Python (the NumPy and SciPy libraries being good examples), then although pip can install these, it can’t track all the non-Python dependencies, nor manage their versions. With conda, this is solved.
    You can also use pip from within conda environments, so you can mix and match. My typical approach is to use conda to manage the environments I create, use conda to install any packages I think may require non-Python dependencies that perhaps are not captured well within pip, and then use pip most of the time within the created conda environment. Given this, throughout the book, you may see pip or conda installation commands used interchangeably. This is perfectly fine.
    To get started with conda, if you haven’t already, you can download the Anaconda Individual Edition installer from the Anaconda website (https://www.anaconda.com/products/individual) and follow the instructions for your platform.
    The Anaconda distribution ships with conda and a broad selection of commonly used data science packages, so once it is installed you are ready to start creating environments.
    First, if we want to create a conda environment called mleng with Python version 3.10 installed, we simply execute the following in our terminal:
    conda create --name mleng python=3.10
We can then activate the conda environment by running the following:
    conda activate mleng
This means that any new conda or pip commands will install packages in this environment and not system-wide.
    We often want to share the details of our environment with others working on the same project, so it can be useful to export all the package configurations to a .yml file:
    conda env export > environment.yml
The GitHub repository for this book contains a file called mleng-environment.yml for you to create your own instance of the mleng environment. The following command creates an environment with this configuration using this file:
    conda env create --file mleng-environment.yml
This pattern of creating a conda environment from an environment file is a nice way to get your environments set up for running the examples in each of the chapters in the book. So, the Technical requirements  section in each chapter will point to the name of the correct environment YAML file contained in the book’s repository. 
    These commands, coupled with your classic conda or pip install command, will set you up for your project quite nicely!
    conda install <package-name>
Or
    pip install <package-name>
    I think it’s always good practice to have several options for doing something, and in general, this is good engineering practice. So, now that we have covered the classic Python environment and package managers in conda and pip, we will cover one more package manager. This is a tool that I like for its ease of use and versatility, and I think it provides a nice extension to the capabilities of conda and pip: Poetry.
    Poetry 
    Poetry is another package management tool for Python, like pip, but one that also has dependency resolution and some environment management capability. The next steps will explain how to set up and use Poetry for a very basic use case.
    We will build on this with some later examples in the book. First, follow these steps:
    
       First, as usual, we will install Poetry:
        pip install poetry
 
      After Poetrypoetry new command, followed by the name of your project:
        poetry new mleng-with-python
 
      This will create a new directory named mleng-with-python with the necessary files and directories for a Python project. To manage your project’s dependencies, you can add them to the pyproject.toml file in the root directory of your project. This file contains all of the configuration information for your project, including its dependencies and package metadata.
    For example, if you are building an ML project and want to use the scikit-learn library, you would add the following to your pyproject.toml file:
    [tool.poetry.dependencies]
scikit-learn = "*"
 
     
    
      You can then install the dependencies for your project by running the following command. This will install the scikit-learn library and any other dependencies specified in your pyproject.toml file:
        poetry install
 
      To use a dependency in your project, you can simply import it in your Python code like so:
        from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
 
     
    As you can see, getting started with Poetry is very easy. We will return to using Poetry throughout the book in order to give you examples that complement the knowledge of Conda that we will develop. Chapter 4 , Packaging Up , will discuss this in detail and will show you how
    Code version control 
    If you are going to write code for a real project, especially as part of a team, you need a way of tracking and managing changes to that code. This is the job of version control, and the de facto standard tool for it is Git.
    We will not go into how Git works under the hood here (there are whole books on the topic!) but we will focus on understanding the key practical elements of using it:
    
      You already have a GitHub account from earlier in the chapter, so the first thing to do is to create a repository with Python as the language and initialize README.md and .gitignore files. The next thing to do is to get a local copy of this repository by running the following command in Bash, Git Bash, or another terminal:
        git clone <repo-name>
 
      Now that you have done this, go into the README.md file and make some edits (anything will do). Then, run the following commands to tell Git to monitor  this file and to save your changes locally with a message briefly explaining what these are:
        git add README.md
git commit -m "I've made a nice change …"
This now means that your local Git instance has stored what you’ve changed and is ready to share that with the remote repo.
 
     
    
      You can then incorporate these changes into the main branch by doing the following:
        git push origin main
If you now go back to the GitHub site, you will see that the changes have taken place in your remote repository and that the comments you added have accompanied the change.
  
     
    
      Other people in your team can then get the updated changes by running the following:
        git pull origin main
 
     
    These steps are the absolute
    Git strategies 
    The presenceDiscover  and Play ) but if you want to engineer something for deployment (and you are reading this book, so this is likely where your head is at), then it is fundamentally important.
    Great, but what do we mean by a Git strategy?
    Well, let’s imagine that we just try to develop our solution without a shared direction on how to organize the versioning and code.
    ML engineer A  wants to start building some of the data science code into a Spark ML pipeline (more on this later) so creates a branch from main called pipeline1spark:
    git checkout -b pipeline1spark
    They then get to work on the branch and write some nice code in a new file called pipeline.py:

from pyspark.ml import Pipeline
from pyspark.ml.feature import Tokenizer, HashingTF
from pyspark.ml.classification import LogisticRegression

tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(),
                      outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
Great, they’ve made some excellent progress in translating some previous sklearn code into Spark, which was deemed more appropriate for the use case. They then keep working in this branch because it has all of their additions, and they think it’s better to do everything in one place. When they want to push the branch to the remote repository, they run the following commands:
    git push origin pipeline1spark
    ML engineer B comes along, and they want to use A’s pipeline code and build some extra steps on top of it. They know that A has a branch containing this work, so they know enough about Git to create another branch with A’s code in it, which B calls pipeline:
    git pull origin pipeline1spark
git checkout pipeline1spark
git checkout -b pipeline
They then add some code to read the parameters for the model from a variable:
    lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
Cool, engineer B  has made an update that is starting to abstract away some of the parameters. They then push their new branch to the remote repository:
    git push origin pipeline
Finally, ML engineer C  joins the team and wants to get started on the code. Opening up Git and looking at the branches, they see there are three:
    main
pipeline1spark
pipeline
    So, which one should be taken as the most up to date? If they want to make new edits, where should they branch from? It isn’t clear, but more dangerous than that is if they are tasked with pushing deployment code to the execution environment, they may think that main has all the relevant changes. On a far busier project that’s been going on for a while, they may even branch off from main and duplicate some of A and B’s work! In a small project, you would waste time going on this wild goose chase; in a large project with many different lines of work, you would have very little chance of maintaining a good workflow:
    
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
    If these commits from the two branches are merged into the main branch at around the same time, then we will get what is called a merge conflict, and in each case, the engineer will have to choose which piece of code to keep, the current or new example. This would look something like this if engineer A pushed their changes to main first:
<<<<<<< HEAD
lr = LogisticRegression(maxIter=10, regParam=0.001)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
=======
lr = LogisticRegression(maxIter=model_config["maxIter"],
                        regParam=model_config["regParam"])
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
>>>>>>> pipeline
    The delimiters in the code mark out the two conflicting versions: everything between <<<<<<< HEAD and ======= is the code currently on the target branch, and everything between ======= and >>>>>>> pipeline is the incoming change from the pipeline branch. Whoever resolves the conflict has to decide which version (or which combination) survives.
    
      IMPORTANT NOTE
      Although, in this simple case, we could potentially trust the engineers to select the better  code, allowing situations like this to occur very frequently is a huge risk to your project. This not only wastes a huge amount of precious development time but it could also mean that you actually end up with worse code!
     
    The way to avoid confusion and extra work like this is to have a very clear strategy for the use of the version control system in place, such as the one we will now explore.
    The Gitflow workflow 
    The biggest problem in the previous example was not any individual engineer’s work; it was that the team had no agreed way of using the branches in their repository. What was missing was a shared Git strategy.
    One of the most popular of these strategies is the Gitflow workflow . This builds on the basic idea of having branches that are dedicated to features and extends it to incorporate the concept of releases and hotfixes, which are particularly relevant to projects with a continuous deployment element.
    The main idea is we have several types of branches, each with clear and specific reasons for existing:
    
      Main contains your official releases and should only contain the stable version of your code.
      Dev acts as the main point for branching from and merging to for most work in the repository; it contains the ongoing development of the code base and acts as a staging area before main.
      Feature branches should not be merged straight into the main branch; everything should branch off from dev and then be merged back into dev.
      Release branches are created from dev to kick off a build or release process before being merged into main and dev and then deleted.
      Hotfix branches are for removing bugs in deployed or production software. You can branch these from main before merging into main and dev when done.
    This can all be summarized diagrammatically as in Figure 2.9 , which shows how the different branches contribute to the evolution of your code base in the Gitflow workflow:
    Figure 2.9: The Gitflow workflow.
    This diagram is taken from https://lucamezzalira.com/2014/03/10/git-flow-vs-github-flow/; you can read more about the Gitflow workflow at https://www.atlassian.com/git/tutorials/comparing-workflows/gitflow-workflow.
    If your ML project can follow a strategy like this (or something close to it), you will avoid most of the confusion we saw in the earlier example and give every engineer a clear picture of where to branch from and merge to.
    Figure 2.10: Example code changes upon a pull request in GitHub. 
    One important aspect we haven’t discussed yet is the concept of code reviews. These are triggered when you raise a pull request, where you make known your intention to merge into another branch and allow another team member to review your code before this executes. This is the natural way to introduce code review to your workflow. You do this whenever you want to merge your changes into the dev or main branches. The proposed changes can then be made visible to the rest of the team, where they can be debated and iterated on with further commits before completing the merge.
    This enforces code review to improve quality, as well as creating an audit trail and safeguards for updates. Figure 2.10 shows an example of how changes are presented for review when a pull request is raised in GitHub.
    Now that we have discussed some of the best practices for applying version control to your code, let’s explore how to version control the models you produce during your ML project.
    Model version control 
    In any ML engineering project, you will want to version not only your code but also your models. One of the most popular tools for doing this is MLflow, an open-source platform originally created by Databricks and now under the stewardship of the Linux Foundation.
    To install MLflow, run the following command in your chosen Python environment:
    pip install mlflow
    The main aim of MLflow is to provide a platform via which you can log model experiments, artifacts, and performance metrics. It does this through some very simple APIs provided by the Python mlflow library, interfaced to selected storage solutions through a series of centrally developed and community plugins. It also comes with functionality for browsing and comparing these logged runs through a Graphical User Interface (GUI), which will look something like Figure 2.11:
    Figure 2.11: The MLflow tracking server UI with some forecasting runs.
    The library is extremely easy to get started with, so let’s take the forecasting example from Chapter 1, Introduction to ML Engineering, and add some basic MLflow functionality for tracking performance metrics and saving the trained Prophet model:
    
      First, we make the relevant imports, including MLflow’s pyfunc module, which acts as a general interface for saving and loading models that can be written as Python functions. This facilitates working with libraries and tools not natively supported in MLflow (such as the fbprophet library):
        import pandas as pd
from fbprophet import Prophet
from fbprophet.diagnostics import cross_validation
from fbprophet.diagnostics import performance_metrics
import mlflow
import mlflow.pyfunc
 
      To create a more seamless integration with the forecasting models from fbprophet, we define a small wrapper class that inherits from the mlflow.pyfunc.PythonModel object:
        class FbProphetWrapper(mlflow.pyfunc.PythonModel):
    def __init__(self, model):
        self.model = model
        super().__init__()

    def load_context(self, context):
        from fbprophet import Prophet
        return

    def predict(self, context, model_input):
        future = self.model.make_future_dataframe(
            periods=model_input["periods"][0])
        return self.model.predict(future)
We now wrap the functionality for training and prediction into a single helper function called train_predict() to make running multiple times simpler. We will not define all of the details inside this function here but let’s run through the most important steps within it:
 
     
    
      First, we need to let MLflow know that we are now starting a training run we wish to track:
        with mlflow.start_run():
    
 
      Inside this run context, we then define and train the model, using parameters defined elsewhere in the code:
        
model = Prophet(
    yearly_seasonality=seasonality_params['yearly'],
    weekly_seasonality=seasonality_params['weekly'],
    daily_seasonality=seasonality_params['daily']
)
model.fit(df_train)
 
      We then perform some cross-validation to calculate some metrics we would like to log:
        
df_cv = cross_validation(model, initial="730 days",
                         period="180 days", horizon="365 days")
df_p = performance_metrics(df_cv)
 
      We can log these metrics, for example, the Root Mean Squared Error  (RMSE ) here, to our MLflow server:
        
mlflow.log_metric("rmse", df_p.loc[0, "rmse"])
 
      Then finally, we log the trained model itself to our MLflow server, using the wrapper class we defined earlier:
mlflow.pyfunc.log_model("model", python_model=FbProphetWrapper(model))
print(
    "Logged model with URI: runs:/{run_id}/model".format(
        run_id=mlflow.active_run().info.run_id
    )
)
 
      With only a few extra lines, we have started to perform version control on our models and track the statistics of different runs! 
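    Once a model has been logged in this way, it can be loaded back later, for example in a separate scoring script, using its run URI. Here is a minimal sketch; the run ID placeholder stands in for the value printed during logging:

import pandas as pd
import mlflow.pyfunc

# The URI printed by the logging step above; <run_id> is a placeholder
model_uri = "runs:/<run_id>/model"
loaded_model = mlflow.pyfunc.load_model(model_uri)

# Our wrapper's predict() expects a "periods" column telling it how far ahead to forecast
forecast = loaded_model.predict(pd.DataFrame({"periods": [90]}))
print(forecast.tail())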
     
    There are many different ways to save the ML model you have built to MLflow (and in general), which is particularly important when tracking model versions. Some of the main options are as follows:
    
    
      Pickle: Python’s built-in serialization module and often the default way that Python objects, including many ML models, are saved to disk. It is very convenient, but unpickling a file can execute arbitrary code, so you should only ever load pickle files whose origin you trust.
      joblib: joblib is a general-purpose pipelining and serialization library that is particularly efficient for objects containing large NumPy arrays, so is useful for data storage. We will use joblib more in later chapters (a short sketch of the save/load pattern follows this list). It is important to note that joblib suffers from the same security issues as pickle, so knowing the lineage of your joblib files is incredibly important.
      JSON: If pickle and joblib aren’t appropriate, you can serialize your model’s parameters and metadata as JSON. This is human-readable and language-agnostic, but you will often have to write the serialization logic yourself.
      MLeap: MLeap is a serialization format and execution engine that runs on the Java Virtual Machine (JVM). It has integrations with Scala, PySpark, and Scikit-Learn, but you will most often see it used for models built in the Spark ecosystem.
      ONNX: The Open Neural Network Exchange (ONNX) format is aimed at being completely cross-platform, but support for models built with the scikit-learn API is more limited. It is an excellent option if you are building a neural network though.
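    As a quick illustration of the joblib pattern mentioned above (the file name is arbitrary and the model here is just a stand-in):

import joblib
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier

# Train a simple stand-in model
X, y = load_wine(return_X_y=True)
model = RandomForestClassifier(random_state=42).fit(X, y)

# Persist the fitted model to disk and load it back later
joblib.dump(model, "rfc.joblib")
restored = joblib.load("rfc.joblib")  # only load files whose origin you trust
print(restored.predict(X[:5]))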
    In Chapter 3 , From Model to Model Factory , we will export our models to MLflow using some of these formats, but they are all compatible with MLflow and so you should feel comfortable using them as part of your ML engineering workflow.
    The final section of this chapter will introduce some important concepts for planning how you wish to deploy your solution.
    Deploy 
    The final stage is where your solution gets pushed out into the world to do its job, and it is where the concepts of DevOps and MLOps come into play.
    Let’s elaborate on these two core concepts, laying the groundwork for later chapters and exploring how to begin deploying our work.
    Knowing your deployment options 
    In Chapter 5, Deployment Patterns and Tools, we will cover in detail what you need to take your ML engineering solution from the develop to the deploy stage, but to pre-empt that and provide a taster of what is to come, let’s explore the different types of deployment options we have at our disposal:
    
      On-premises deployment: The first option we have is to ignore the public cloud altogether and deploy onto infrastructure owned and managed within your own organization. The big advantage of on-premises deployment is security and peace of mind that none of your data is going to traverse your company firewall. The big downsides are that it requires a larger investment upfront for hardware and that you have to expend a lot of effort to successfully configure and manage that hardware effectively. We will not be discussing on-premises deployment in detail in this book, but all of the concepts we will employ around software development, packaging, environment management, and training and prediction apply just as well on-premises as they do in the cloud.
 
    
      Infrastructure-as-a-Service (IaaS): If you are going to use the cloud, one of the lowest levels of abstraction you can work at is IaaS, where the provider supplies the raw compute, storage, and networking and you manage everything built on top of it. In AWS, Simple Storage Service (S3) and Elastic Compute Cloud (EC2) are good examples of IaaS offerings.
      Platform-as-a-Service (PaaS): PaaS solutions are the next level up in terms of abstraction, where the underlying infrastructure is managed for you. A good example is AWS Lambda functions, which are serverless functions that can scale almost without limit.
    All you are required to do is enter the main piece of code you want to execute inside the function. Another good example is Databricks, which provides a very intuitive UI on top of the Spark cluster infrastructure, with the ability to provision, configure, and scale those clusters on demand.
    Being aware of these different options and their capabilities can help you design your ML solution and ensure that you focus your team’s engineering effort where it is most needed and will be most valuable. If your ML engineer is working on configuring routers, for example, you have definitely gone wrong somewhere.
    But once you have selected the components you’ll use and provisioned the infrastructure, how do you actually get your solution onto it, and how do you keep it running? That is where the next section comes in.
    Understanding DevOps and MLOps 
    A very powerful idea in modern software development is that, rather than large buckets of time being assigned to separate development and deployment phases, you continually integrate small changes and continually deliver or deploy the results. This is known as CI/CD. CI/CD is a core part of DevOps and its ML-focused cousin MLOps, which both aim to bring together the development and operation of software into robust, automated workflows.
    The CI part is mainly focused on the stable incorporation of ongoing changes to the code base while ensuring functionality remains stable. The CD part is all about taking the resultant stable version of the solution and pushing it to the appropriate infrastructure. 
    Figure 2.12  shows a high-level view of this process:
    Figure 2.12: A high-level view of CI/CD processes.
    In order to make CI/CD a reality, you need to incorporate tools that help automate tasks that you would traditionally perform manually in your development and deployment process. For example, if you can automate the running of tests upon merging of code, or the pushing of your code artifacts/models to the appropriate environment, then you are well on your way to CI/CD.
    We can break this out further and think of the different types of tasks that fall into the DevOps or MLOps lifecycles for a solution. Development tasks will typically cover all of the activities that take you from a blank screen on your computer to a working piece of software. This means that development covers activities such as writing and testing code, linting and formatting it, and building the final solution.
    Table 2.3 splits out these typical development tasks, with some details and examples of tools that can be used for each.
    
      
        
          
    Lifecycle Stage | Activity | Details | Tools
    Dev | Testing | Unit tests: tests aimed at testing the functionality of the smallest pieces of code. | pytest or unittest
    Dev | Testing | Integration tests: ensure that interfaces within the code and to other solutions work. | Selenium
    Dev | Testing | Acceptance tests: business-focused tests. | Behave
    Dev | Testing | UI tests: ensuring any frontends behave as expected. |
    Dev | Linting | Raise minor stylistic errors and bugs. | flake8 or bandit
    Dev | Formatting | Enforce well-formatted code automatically. | black or isort
    Dev | Building | The final stage of bringing the solution together. | Docker, twine, or pip
           
         
       
    
    Table 2.3: Details of the development activities carried out in any DevOps or MLOps project.
    Next, we can think about the ML activities within MLOps, which this book will be very concerned with. This covers all of the tasks that a classic Python software engineer would not have to worry about, but that are crucially important to get right for ML engineers like us. This includes the development of capabilities to automatically train the ML models, to run the predictions or inferences the model should generate, and to bring that together inside code pipelines. It also covers the staging and management of the versions of your models, which heavily complements the idea of versioning your application code, as we do using tools like Git. Finally, an ML engineer also has to consider that they have to build out specific monitoring capabilities for the operational mode of their solution, which is not covered in traditional DevOps workflows. For an ML solution, you may have to consider monitoring things like precision, recall, the f1-score, population stability, entropy, and data drift in order to know if the model component of your solution is behaving within a tolerable range. This is very different from classic software monitoring. See Table 2.4 for some more details on these types of activities.
    
      
        
          
    Lifecycle Stage | Activity | Details | Tools
    ML | Training | Train the model. | Any ML package.
    ML | Predicting | Run the predictions or inference steps. | Any ML package.
    ML | Building | Creating the pipelines and application logic in which the model is embedded. | sklearn pipelines, Spark ML pipelines, ZenML.
    ML | Staging | Tag and release the appropriate version of your models and pipelines. | MLflow or Comet.ml.
    ML | Monitoring | Track the solution performance and raise alerts when necessary. | Seldon, Neptune.ai, Evidently.ai, or Arthur.ai.
           
         
       
    
    Table 2.4: Details on the ML-centered activities carried out during an MLOps project.
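    Before moving on to the operations piece, here is a minimal sketch of one of the statistical monitoring checks mentioned above, a population stability index (PSI) computed with NumPy between a reference feature distribution and newly observed data. The function name, the binning scheme, and the 0.2 alert threshold are illustrative conventions, not the API of any particular monitoring tool:

import numpy as np

def population_stability_index(reference: np.ndarray, current: np.ndarray, bins: int = 10) -> float:
    # Bin edges are taken from the reference (training-time) distribution
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_frac = np.histogram(reference, bins=edges)[0] / len(reference)
    cur_frac = np.histogram(current, bins=edges)[0] / len(current)
    # Clip to avoid division by zero / log of zero for empty bins
    ref_frac = np.clip(ref_frac, 1e-6, None)
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example: compare the training-time distribution of a feature with live data
rng = np.random.default_rng(42)
reference = rng.normal(0, 1, 10_000)
current = rng.normal(0.3, 1.2, 10_000)  # the live data has drifted
psi = population_stability_index(reference, current)
print(psi)
if psi > 0.2:  # a commonly quoted, but illustrative, alert threshold
    print("Feature distribution has shifted significantly - raise an alert!")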
    Finally, in either DevOps or MLOps, there is the Ops piece, which refers to Operations. This is all about how the solution will actually run, how it will alert you if there is an issue, and if it can recover successfully. Naturally then, operations will cover activities relating to the final packaging, build, and release of your solution. It also has to cover another type of monitoring, which is different from the performance monitoring of ML models. This monitoring has more of a focus on infrastructure utilization, stability, and scalability, on solution latency, and on the general running of the solution. Some of these activities and example tools are shown in Table 2.5.
    
      
        
          
    Lifecycle Stage | Activity | Details | Tools
    Ops | Releasing | Taking the software you have built and storing it somewhere central for reuse. | Twine, pip, GitHub, or BitBucket.
    Ops | Deploying | Pushing the software you have built to the appropriate target location and environment. | Docker, GitHub Actions, Jenkins, TravisCI, or CircleCI.
    Ops | Monitoring | Tracking the performance and utilization of the underlying infrastructure and general software performance, alerting where necessary. | DataDog, Dynatrace, or Prometheus.
           
         
       
    
    Table 2.5: Details of the activities carried out in order to make a solution operational in a DevOps or MLOps project.
    Now that we have elucidated the core concepts needed across the MLOps lifecycle, in the next section, we will discuss how to implement CI/CD practices so that we can start making this a reality in our ML engineering projects. We will also extend this to cover automated testing of the performance of our ML models.
    Building our first CI/CD example with GitHub Actions 
    We will use GitHub Actions, the CI/CD tooling built into GitHub, for this first example; the full documentation is available at https://docs.github.com/en/actions.
    When using GitHub Actions, you have to create a .yml file that tells GitHub when to perform the required actions and, of course, what actions to perform. This .yml file should be put in a folder called .github/workflows in the root directory of your repository. You will have to create this if it doesn’t already exist. We will do this in a new branch called feature/actions. Create this branch by running:
    git checkout -b feature/actions
Then, create a .yml file called github-actions-basic.yml. In the following steps, we will build up this example .yml file so that it installs our Python dependencies, runs a linter (a solution to check for bugs, syntax errors, and other issues), and then runs some unit tests. This example is adapted from the GitHub Starter Workflows repository (https://github.com/actions/starter-workflows/blob/main/ci/python-package-conda.yml). Open github-actions-basic.yml and then build it up as follows:
    
      First, you define the name of the GitHub Actions workflow and what Git event will trigger it:
        name: Python package
on: [push]
 
      You then list the jobs you want to execute as part of the workflow, as well as their configuration. For example, here we have one job called build, which we want to run on the latest Ubuntu distribution, and we want to attempt the build using several different versions of Python:
        jobs:
  build:
    runs-on: ubuntu-latest
    strategy:
      matrix:
        python-version: ["3.9", "3.10"]
 
      You then define the steps of the job. The uses keyword pulls in standard, reusable GitHub Actions; for example, in the first step, the workflow uses the v3 version of the checkout action, and the second step sets up the Python versions we want to run in the workflow:
steps:
- uses: actions/checkout@v3
- name: Set up Python ${{ matrix.python-version }}
  uses: actions/setup-python@v4
  with:
    python-version: ${{ matrix.python-version }}
 
      The next step installs the project dependencies using pip and a requirements.txt file (but you can use conda of course!):
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install flake8 pytest
    if [ -f requirements.txt ]; then pip install -r requirements.txt; fi
 
      We then run some linting:
- name: Lint with flake8
  run: |
    flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics
    flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
 
      Finally, we run our tests using our favorite Python testing library. For this step, we do not want to run through the entire repository, as it is quite complex, so for this example, we use the working-directory keyword to only run pytest in that directory. 
    Since it contains a simple test function in test_basic.py, this will automatically pass:
- name: Test with pytest
  run: pytest
  working-directory: Chapter02
 
     
    We have now built up the GitHub Actions workflow; all that’s left is to add the .yml file, commit it, and then push it:
git add .github/workflows/github-actions-basic.yml
git commit -m "Basic CI run with dummy test"
git push origin feature/actions
After you have run these commands in the terminal, you can navigate to the GitHub UI and then click on Actions  in the top menu bar. You will then be presented with a view of all action runs for the repository like that in Figure 2.13. 
    Figure 2.13: The GitHub Actions run as viewed from the GitHub UI.
    If you then click on the run, you will be presented with details of all jobs that ran within the Actions  run, as shown in Figure 2.14 .
    Figure 2.14: GitHub Actions run details from the GitHub UI.
    Finally, you can drill down into the individual steps that ran within each job, as shown in Figure 2.15. Clicking on these will also show the outputs from each of the steps. This is extremely useful for analyzing any failures in the run.
    Figure 2.15: The GitHub Actions run steps as shown on the GitHub UI.
    What we have shown so far is an example of CI. For this to be extended to cover CD, we need to include steps that push the produced solution to its target host destination. Examples are building a Python package and publishing it to pip, or creating a pipeline and pushing it to another system, as we will do with an Airflow DAG in Chapter 5, Deployment Patterns and Tools. And that, in a nutshell, is how you start building your CI/CD pipelines. As mentioned, later in the book, we will build workflows specific to our ML solutions.
    Now we will look at how to extend this automated testing to cover the performance of our ML models.
    Continuous model performance testing 
    As ML engineers, we not only have to test our software in the classic sense; we also have to test that our models perform as expected before any change is released.
    The process I will now walk you through shows how you can take some base reference data and start to build up some different flavors of tests to give confidence that your model will perform as expected when you deploy it.
    We have already introduced how to test automatically with pytest and GitHub Actions; the good news is that we can just extend this concept to include the testing of some model performance metrics. To do this, you need a few things in place:
    
      Within the action or tests, you need to retrieve the reference data for performing the model validation. This can be done by pulling from a remote data store like an object store or a database, as long as you provide the appropriate credentials. I would suggest storing these as secrets in Github. Here, we will use a dataset generated in place using the sklearn library as a simple example. 
      You need to retrieve the model or models you wish to test from some location as well. This could be a full-fledged model registry or some other storage mechanism. The same points around access and secrets management as in point 1  apply. Here we will pull a model from the Hugging Face Hub (more on Hugging Face in Chapter 3 ), but this could equally have been an MLflow Tracking instance or some other tool. 
      You need to define the tests you want to run and that you are confident will achieve the desired outcome. You do not want to write tests that are far too sensitive and trigger failed builds for spurious reasons, and you also want to try and define tests that are useful for capturing the types of failures you would want to flag.  
     
    For point 1, here we grab the wine dataset that ships with the sklearn library and make it available to the tests through a pytest fixture:
from typing import Tuple

import numpy as np
import pytest
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split


@pytest.fixture
def test_dataset() -> Tuple[np.ndarray, np.ndarray]:
    # Load the wine dataset and binarize the target (class 2 versus the rest)
    X, y = load_wine(return_X_y=True)
    y = y == 2
    # Hold out a test set with a fixed random seed for reproducibility
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        random_state=42)
    return X_test, y_test
For point 2 , I will use the Hugging Face Hub package to retrieve the stored model. As mentioned in the bullets above, you will need to adapt this to whatever model storage mechanism you are accessing. The repository in this case is public so there is no need to store any secrets; if you did need to do this, please use the GitHub Secrets store.
import joblib
import sklearn.ensemble
from huggingface_hub import hf_hub_download

@pytest.fixture
def model() -> sklearn.ensemble.RandomForestClassifier:
    # Download the pre-trained model artifact from the Hugging Face Hub
    REPO_ID = "electricweegie/mlewp-sklearn-wine"
    FILENAME = "rfc.joblib"
    model = joblib.load(hf_hub_download(REPO_ID, FILENAME))
    return model
Now, we just need to write the tests. Let’s start simple with a test that confirms that the predictions of the model produce the correct object types:
def test_model_inference_types(model, test_dataset):
    assert isinstance(model.predict(test_dataset[0]), np.ndarray)
    assert isinstance(test_dataset[0], np.ndarray)
    assert isinstance(test_dataset[1], np.ndarray)
    We can then write a test to assert that some specific conditions on the performance of the model on the test dataset are met:
from sklearn.metrics import classification_report

def test_model_performance(model, test_dataset):
    metrics = classification_report(y_true=test_dataset[1],
                                    y_pred=model.predict(test_dataset[0]),
                                    output_dict=True)
    assert metrics['False']['f1-score'] > 0.95
    assert metrics['False']['precision'] > 0.9
    assert metrics['True']['f1-score'] > 0.8
    assert metrics['True']['precision'] > 0.8
    The previous test can be thought of as something like a data-driven unit test and will make sure that if you change something in your code base that degrades the model’s performance on the reference data, the build fails before that change is merged.
    This means we are performing some continuous model validation as part of our CI/CD process!
    Figure 2.16: Successfully executing model validation tests as part of a CI/CD process using GitHub Actions.
    More sophisticated tests could cover things like robustness to perturbed inputs, behavior on important data slices, or checks for data drift; one illustrative robustness test is sketched below.
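    As one illustration, a robustness check could perturb the test inputs slightly and assert that the model’s predictions are largely unchanged; the noise scale and agreement threshold here are illustrative and would need tuning for your own data:

import numpy as np

def test_model_robustness_to_noise(model, test_dataset):
    # Small perturbations of the inputs should rarely flip the model's predictions
    X_test, _ = test_dataset
    rng = np.random.default_rng(42)
    X_noisy = X_test + rng.normal(loc=0.0, scale=0.01, size=X_test.shape)
    agreement = np.mean(model.predict(X_test) == model.predict(X_noisy))
    assert agreement > 0.95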
    Continuous model training 
    An important extension of all this is continuous model training. We will talk in detail about training and retraining ML models in Chapter 3, From Model to Model Factory, and about how to deploy ML models in general in Chapter 5, Deployment Patterns and Tools. Given this, we will not cover the details of deploying to different targets here but instead show you how to build continuous training steps into your CI/CD pipelines.
    This is actually simpler than you probably think. As you have hopefully noticed by now, CI/CD is really all about automating a series of steps, which are triggered upon particular events occurring during the development process. Each of these steps can be very simple or more complex, but fundamentally it is always just other programs we are executing in the specified order upon activating the trigger event.
    In this case, since we are concerned with continuous training, we should ask ourselves, when would we want to retrain during code development? Remember that we are ignoring the most obvious cases of retraining on a schedule or upon a drift in model performance or data quality, as these are touched on in later chapters. If we only consider that the code is changing for now, the natural answer is to train only when there is a substantial change to the code. 
    For example, if a trigger was fired every time we committed our code to version control, this would likely result in a lot of costly compute cycles being used for not much gain, as the ML model will likely not perform very differently in each case. We could instead trigger retraining on events that signal a more substantial change, such as a pull request being raised against a protected branch.
    As a reminder, when building CI/CD in GitHub Actions, you create or edit YAML files contained in the .github folder of your Git repository. If we want to trigger a training process upon a pull request, then we can add something like:
    name: Continuous Training Example
on: [pull_request]
And then we need to define the steps for pushing the appropriate training script to the target system and running it. First, this would likely require some fetching of access tokens. Let’s assume this is for AWS and that you have loaded your appropriate AWS credentials as GitHub Secrets; for more information, see Chapter 5 , Deployment Patterns and Tools . We would then be able to retrieve these in the first step of a deploy-trainer job:
    jobs:
  deploy-trainer:
    runs-on: [ubuntu-latest]
    steps:
    - name: Checkout
      uses: actions/checkout@v3
    - name: Configure AWS Credentials
      uses: aws-actions/configure-aws-credentials@v2
      with:
        aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
        aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        aws-region: us-east-2
        role-to-assume: ${{ secrets.AWS_ROLE_TO_ASSUME }}
        role-external-id: ${{ secrets.AWS_ROLE_EXTERNAL_ID }}
        role-duration-seconds: 1200
        role-session-name: TrainingSession
You may then want to copy your repository files to a target S3  destination; perhaps they contain modules that the main training script needs to run. You could then do something like this:
    - name: Copy files to target destination
      run: aws s3 sync . s3://<S3-BUCKET-NAME>
    And finally, you would want to kick off the training job itself. The exact command will depend on the target platform you are deploying to; we will walk through concrete examples in Chapter 5, Deployment Patterns and Tools:
    - name: Run training job
      run: |
        
    And with that, you have all the key pieces you need to run continuous ML model training, complementing the continuous model performance testing covered in the previous section. This is how you bring the DevOps concept of CI/CD to the world of MLOps!