Reader small image

You're reading from  The Machine Learning Solutions Architect Handbook - Second Edition

Product typeBook
Published inApr 2024
PublisherPackt
ISBN-139781805122500
Edition2nd Edition
Right arrow
Author (1)
David Ping
David Ping
author image
David Ping

David Ping is an accomplished author and industry expert with over 28 years of experience in the field of data science and technology. He currently serves as the leader of a team of highly skilled data scientists and AI/ML solutions architects at AWS. In this role, he assists organizations worldwide in designing and implementing impactful AI/ML solutions to drive business success. David's extensive expertise spans a range of technical domains, including data science, ML solution and platform design, data management, AI risk, and AI governance. Prior to joining AWS, David held positions in renowned organizations such as JPMorgan, Credit Suisse, and Intel Corporation, where he contributed to the advancements of science and technology through engineering and leadership roles. With his wealth of experience and diverse skill set, David brings a unique perspective and invaluable insights to the field of AI/ML.
Read more about David Ping

Right arrow

ML versus traditional software

Before I started working in the field of AI/ML, I spent many years building computer software platforms for large financial services institutions. Some of the business problems I worked on had complex rules, such as identifying companies for comparable analysis for investment banking deals or creating a master database for all the different companies’ identifiers from the different data providers. We had to implement hardcoded rules in database-stored procedures and application server backends to solve these problems. We often debated if certain rules made sense or not for the business problems we tried to solve.

As rules changed, we had to reimplement the rules and make sure the changes did not break anything. To test for new releases or changes, we often replied to human experts to exhaustively test and validate all the business logic implemented before the production release. It was a very time-consuming and error-prone process and required a significant amount of engineering, testing against the documented specification, and rigorous change management for deployment every time new rules were introduced, or existing rules needed to be changed. We often replied to users to report business logic issues in production, and when an issue was reported in production, we sometimes had to open up the source code to troubleshoot or explain the logic of how it worked. I remember I often asked myself if there were better ways to do this.

After I started working in the field of AI/ML, I started to solve many similar challenges using ML techniques. With ML, I did not need to come up with complex rules that often require deep data and domain expertise to create or maintain the complex rules for decision making. Instead, I focused on collecting high-quality data and used ML algorithms to learn the rules and patterns from the data directly. This new approach eliminated many of the challenging aspects of creating new rules (for example, a deep domain expertise requirement, or avoiding human bias) and maintaining existing rules. To validate the model before the production release, we could examine model performance metrics such as accuracy. While it still required data science expertise to interpret the model metrics against the nature of the business problems and dataset, it did not require exhaustive manual testing of all the different scenarios. When a model was deployed into production, we would monitor if the model performed as expected by monitoring any significant changes in production data versus the data we have collected for model training. We would collect new unseen data and labels for production data and test the model performance periodically to ensure that its predictive accuracy remains robust when faced with new, previously unseen production data. To explain why a model made a decision the way it did, we did not need to open up the source code to re-examine the hardcoded logic. Instead, we would rely on ML techniques to help explain the relative importance of different input features to understand what factors were most influential in the decision-making by the ML models.

The following figure shows a graphical view of the process differences between developing a piece of software and training an ML model:

Figure 1.1: ML and computer software

Now that you know the difference between ML and traditional software, it is time to dive deep into understanding the different stages in an ML lifecycle.

ML lifecycle

One of the early ML projects that I worked on was a fascinating yet daunting sports predictive analytics problem for a major league brand. I was given a list of predictive analytics outcomes to think about to see if there were ML solutions for the problems. I was a casual viewer of the sport; I didn’t know anything about the analytics to be generated, nor the rules of the games in the detail that was needed. I was provided with some sample data but had no idea what to do with it.

The first thing I started to work on was an immersion in the sport itself. I delved into the intricacies of the game, studying the different player positions and events that make up each game and play. Only after being armed with the newfound domain knowledge did the data start to make sense. Together with the stakeholder, we evaluated the impact of the different analytics outcomes and assessed the modeling feasibility based on the data we had. With a clear understanding of the data, we came up with a couple of top ML analytics with the most business impact to focus on. We also decided how they would be integrated into the existing business workflow, and how they would be measured on their impacts.

Subsequently, I delved deeper into the data to ascertain what information was available and what was lacking. The raw dataset had a lot of irrelevant data points that needed to be removed while the relevant data points needed to be transformed to provide the strongest signals for model training. I processed and prepared the dataset based on a few of the ML algorithms I had considered and conducted experiments to determine the best approach. I lacked a tool to track the different experiment results, so I had to document what I had done manually. After some initial rounds of experimentation, it became evident that the existing data was not sufficient to train a high-performance model. Hence, I decided to build a custom deep learning model to incorporate data of different modalities as the data points had temporal dependencies and required additional spatial information for the modeling. The data owner was able to provide the additional datasets I required, and after more experiments with custom algorithms and significant data preparations and feature engineering, I eventually trained a model that met the business objectives.

After completing the model, another hard challenge began – deploying and operationalizing the model in production and integrating it into the existing business workflow and system architecture. We engaged in many architecture and engineering discussions and eventually built out a deployment architecture for the model.

As you can see from my personal experience, the journey from business idea to ML production deployment involved many steps. A typical lifecycle of an ML project follows a formal structure, which includes several essential stages like business understanding, data acquisition and understanding, data preparation, model building, model evaluation, and model deployment. Since a big component of the lifecycle is experimentation with different datasets, features, and algorithms, the whole process is highly iterative. Furthermore, it is essential to note that there is no guarantee of a successful outcome. Factors such as the availability and quality of data, feature engineering techniques (the process of using domain knowledge to extract useful features from raw data), and the capability of the learning algorithms, among others, can all affect the final results.

Figure 1.2: ML lifecycle

The preceding figure illustrates the key steps in ML projects, and in the subsequent sections, we will delve into each of these steps in greater detail.

Business problem understanding and ML problem framing

The first stage in the lifecycle is business understanding. This stage involves the understanding of the business goals and defining business metrics that can measure the project’s success. For example, the following are some examples of business goals:

  • Cost reduction for operational processes, such as document processing.
  • Mitigation of business or operational risks, such as fraud and compliance.
  • Product or service revenue improvements, such as better target marketing, new insight generation for better decision making, and increased customer satisfaction.

To measure the success, you may use specific business metrics such as the number of hours reduced in a business process, an increased number of true positive frauds detected, a conversion rate improvement from target marketing, or the number of churn rate reductions. This is an essential step to get right to ensure there is sufficient justification for an ML project and that the outcome of the project can be successfully measured.

After you have defined the business goals and business metrics, you need to evaluate if there is an ML solution for the business problem. While ML has a wide scope of applications, it is not always an optimal solution for every business problem.

Data understanding and data preparation

The saying that “data is the new oil” holds particularly true for ML. Without the required data, you cannot move forward with an ML project. That’s why the next step in the ML lifecycle is data acquisition, understanding, and preparation.

Based on the business problems and ML approach, you will need to gather and comprehend the available data to determine if you have the right data and data volume to solve the ML problem. For example, suppose the business problem to address is credit card fraud detection. In that case, you will need datasets such as historical credit card transaction data, customer demographics, account data, device usage data, and networking access data. Detailed data analysis is then necessary to determine if the dataset features and quality are sufficient for the modeling tasks. You also need to decide if the data needs labeling, such as fraud or not-fraud. During this step, depending on the data quality, a significant amount of data wrangling might be performed to prepare and clean the data and to generate the dataset for model training and model evaluation, depending on the data quality.

Model training and evaluation

Using the training and validation datasets established, a data scientist must run a number of experiments using different ML algorithms and dataset features for feature selection and model development. This is a highly iterative process and could require numerous runs of data processing and model development to find the right algorithm and dataset combination for optimal model performance. In addition to model performance, factors such as data bias and model explainability may need to be considered to comply with internal or regulatory requirements.

Prior to deployment into production, the model quality must be validated using the relevant technical metrics, such as the accuracy score. This is usually accomplished using a holdout dataset, also known as a test dataset, to gauge how the model performs on unseen data. It is crucial to understand which metrics are appropriate for model validation, as they vary depending on the ML problems and the dataset used. For example, model accuracy would be a suitable validation metric for a document classification use case if the number of document types is relatively balanced. However, model accuracy would not be a good metric to evaluate the model performance for a fraud detection use case – this is because the number of frauds is small and even if the model predicts not-fraud all the time, the model accuracy could still be very high.

Model deployment

After the model is fully trained and validated to meet the expected performance metric, it can be deployed into production and the business workflow. There are two main deployment concepts here. The first involves the deployment of the model itself to be used by a client application to generate predictions. The second concept is to integrate this prediction workflow into a business workflow application. For example, deploying the credit fraud model would either host the model behind an API for real-time prediction or as a package that can be loaded dynamically to support batch predictions. Moreover, this prediction workflow also needs to be integrated into business workflow applications for fraud detection, which might include the fraud detection of real-time transactions, decision automation based on prediction output, and fraud detection analytics for detailed fraud analytics.

Model monitoring

The ML lifecycle does not end with model deployment. Unlike software, whose behavior is highly deterministic since developers explicitly code its logic, an ML model could behave differently in production from its behavior in model training and validation. This could be caused by changes in the production data characteristics, data distribution, or the potential manipulation of request data. Therefore, model monitoring is an important post-deployment step for detecting model performance degradation (a.k.a model drift) or dataset distribution change in the production environment (a.k.a data drift).

Business metric tracking

The actual business impact should be tracked and measured as an ongoing process to ensure the model delivers the expected business benefits. This may involve comparing the business metrics before and after the model deployment, or A/B testing where a business metric is compared between workflows with or without the ML model. If the model does not deliver the expected benefits, it should be re-evaluated for improvement opportunities. This could also mean framing the business problem as a different ML problem. For example, if churn prediction does not help improve customer satisfaction, then consider a personalized product/service offering to solve the problem.

ML challenges

Over the years, I have worked on many real-world problems using ML solutions and encountered different challenges faced by different industries during ML adoptions.

I often get the same question when working on ML projects: We have a lot of data – can you help us figure out what insights we can generate using ML? I refer to companies with this question as having a business use case challenge. Not being able to identify business use cases for ML is a very big hurdle for many companies. Without a properly identified business problem and its value proposition and benefit, it becomes difficult to initiate an ML project.

In my conversations with different companies across their industries, data-related challenges emerge as a frequent issue. This includes data quality, data inventory, data accessibility, data governance, and data availability. This problem affects both data-poor and data-rich companies and is often exacerbated by data silos, data security, and industry regulations.

The shortage of data science and ML talent is another major challenge I have heard from many companies. Companies, in general, are having a tough time attracting and retaining top ML talents, which is a common problem across all industries. As ML platforms become more complex and the scope of ML projects increases, the need for other ML-related functions starts to surface. Nowadays, in addition to just data scientists, an organization would also need functional roles for ML product management, ML infrastructure engineering, and ML operations management.

Based on my experiences, I have observed that cultural acceptance of ML-based solutions is another significant challenge for broad adoption. There are individuals who perceive ML as a threat to their job functions, and their lack of knowledge in ML makes them hesitant to adopt these new methods in their business workflows.

The practice of ML solutions architecture aims to help solve some of the challenges in ML. In the next section, we will explore ML solutions architecture and its role in the ML lifecycle.

ML solutions architecture

When I initially worked with companies as an ML solutions architect, the landscape was quite different from what it is now. The focus was mainly on data science and modeling, and the problems at hand were small in scope. Back then, most of the problems could be solved using simple ML techniques. The datasets were small, and the infrastructure required was not too demanding. The scope of the ML initiative at these companies was limited to a few data scientists or teams. As an ML architect at that time, I primarily needed to have solid data science skills and general cloud architecture knowledge to get the job done.

In more recent years, the landscape of ML initiatives has become more intricate and multifaceted, necessitating involvement from a broader range of functions and personas at companies. My engagement has expanded to include discussions with business executives about ML strategies and organizational design to facilitate the broad adoption of AI/ML throughout their enterprises. I have been tasked with designing more complex ML platforms, utilizing a diverse range of technologies for large enterprises to meet stringent security and compliance requirements. ML workflow orchestration and operations have become increasingly crucial topics of discussion, and more and more companies are looking to train large ML models with enormous amounts of training data. The number of ML models trained and deployed by some companies has skyrocketed to tens of thousands from a few dozen models in just a few years. Furthermore, sophisticated and security-sensitive customers have sought guidance on topics such as ML privacy, model explainability, and data and model bias. As an ML solutions architect, I’ve noticed that the skills and knowledge required to be successful in this role have evolved significantly.

Trying to navigate the complexities of a business, data, science, and technology landscape can be a daunting task. As an ML solutions architect, I have seen firsthand the challenges that companies face in bringing all these pieces together. In my view, ML solutions architecture is an essential discipline that serves as a bridge connecting the different components of an ML initiative. Drawing on my years of experience working with companies of all sizes and across diverse industries, I believe that an ML solutions architect plays a pivotal role in identifying business needs, developing ML solutions to address these needs, and designing the technology platforms necessary to run these solutions. By collaborating with various business and technology partners, an ML solutions architect can help companies unlock the full potential of their data and realize tangible benefits from their ML initiatives.

The following figure illustrates the core functional areas covered by the ML solutions architecture:

Figure 1.3: ML solutions architecture coverage

In the following sections, we will explore each of these areas in greater detail:

  • Business understanding: Business problem understanding and transformation using AI and ML.
  • Identification and verification of ML techniques: Identification and verification of ML techniques for solving specific ML problems.
  • System architecture of the ML technology platform: System architecture design and implementation of the ML technology platforms.
  • MLOps: ML platform automation technical design.
  • Security and compliance: Security, compliance, and audit considerations for the ML platform and ML models.

So, let’s dive in!

Business understanding and ML transformation

The goal of the business workflow analysis is to identify inefficiencies in the workflows and determine if ML can be applied to help eliminate pain points, improve efficiency, or even create new revenue opportunities.

Picture this: you are tasked with improving a call center’s operations. You know there are inefficiencies that need to be addressed, but you’re not sure where to start. That’s where business workflow analysis comes in. By analyzing the call center’s workflows, you can identify pain points such as long customer wait times, knowledge gaps among agents, and the inability to extract customer insights from call recordings. Once you have identified these issues, you can determine what data is available and which business metrics need to be improved. This is where ML comes in. You can use ML to create virtual assistants for common customer inquiries, transcribe audio recordings to allow for text analysis, and detect customer intent for product cross-sell and up-sell. But sometimes, you need to modify the business process to incorporate ML solutions. For example, if you want to use call recording analytics to generate insights for cross-selling or up-selling products, but there’s no established process to act on those insights, you may need to introduce an automated target marketing process or a proactive outreach process by the sales team.

Identification and verification of ML techniques

Once you have come up with a list of ML options, the next step is to determine if the assumption behind the ML approach is valid. This could involve conducting a simple proof of concept (POC) modeling to validate the available dataset and modeling approach, or technology POC using pre-built AI services, or testing of ML frameworks. For example, you might want to test the feasibility of text transcription from audio files using an existing text transcription service or build a customer propensity model for a new product conversion from a marketing campaign.

It is worth noting that ML solutions architecture does not focus on developing new machine algorithms, a job best suited for applied data scientists or research data scientists. Instead, ML solutions architecture focuses on identifying and applying ML algorithms to address a range of ML problems such as predictive analytics, computer vision, or natural language processing. Also, the goal of any modeling task here is not to build production-quality models but rather to validate the approach for further experimentations by full-time applied data scientists.

System architecture design and implementation

The most important aspect of the ML solutions architect’s role is the technical architecture design of the ML platform. The platform will need to provide the technical capability to support the different phases of the ML cycle and personas, such as data scientists and operations engineers. Specifically, an ML platform needs to have the following core functions:

  • Data explorations and experimentation: Data scientists use ML platforms for data exploration, experimentation, model building, and model evaluation. ML platforms need to provide capabilities such as data science development tools for model authoring and experimentation, data wrangling tools for data exploration and wrangling, source code control for code management, and a package repository for library package management.
  • Data management and large-scale data processing: Data scientists or data engineers will need the technical capability to ingest, store, access, and process large amounts of data for cleansing, transformation, and feature engineering.
  • Model training infrastructure management: ML platforms will need to provide model training infrastructure for different modeling training using different types of computing resources, storage, and networking configurations. It also needs to support different types of ML libraries or frameworks, such as scikit-learn, TensorFlow, and PyTorch.
  • Model hosting/serving: ML platforms will need to provide the technical capability to host and serve the model for prediction generations, for real-time, batch, or both.
  • Model management: Trained ML models will need to be managed and tracked for easy access and lookup, with relevant metadata.
  • Feature management: Common and reusable features will need to be managed and served for model training and model serving purposes.

ML platform workflow automation

A key aspect of ML platform design is workflow automation and continuous integration/continuous deployment (CI/CD), also known as MLOps. ML is a multi-step workflow – it needs to be automated, which includes data processing, model training, model validation, and model hosting. Infrastructure provisioning automation and self-service is another aspect of automation design. Key components of workflow automation include the following:

  • Pipeline design and management: The ability to create different automation pipelines for various tasks, such as model training and model hosting.
  • Pipeline execution and monitoring: The ability to run different pipelines and monitor the pipeline execution status for the entire pipeline and each of the steps in the ML cycle such as data processing and model training.
  • Model monitoring configuration: The ability to monitor the model in production for various metrics, such as data drift (where the distribution of data used in production deviates from the distribution of data used for model training), model drift (where the performance of the model degrades in the production compared with training results), and bias detection (the ML model replicating or amplifying bias towards certain individuals).

Security and compliance

Another important aspect of ML solutions architecture is the security and compliance consideration in a sensitive or enterprise setting:

  • Authentication and authorization: The ML platform needs to provide authentication and authorization mechanisms to manage access to the platform and different resources and services.
  • Network security: The ML platform needs to be configured for different network security controls such as a firewall and an IP address access allowlist to prevent unauthorized access.
  • Data encryption: For security-sensitive organizations, data encryption is another important aspect of the design consideration for the ML platform.
  • Audit and compliance: Audit and compliance staff need the information to help them understand how decisions are made by the predictive models if required, the lineage of a model from data to model artifacts, and any bias exhibited in the data and model. The ML platform will need to provide model explainability, bias detection, and model traceability across the various datastore and service components, among other capabilities.

Various industry technology providers have established best practices to guide the design and implementation of ML infrastructure, which is part of the ML solutions architect’s practices. Amazon Web Services, for example, created Machine Learning Lens to provide architectural best practices across crucial domains like operational excellence, security, reliability, performance, cost optimization, and sustainability. Following these published guidelines can help practitioners implement robust and effective ML solutions.

Summary

In this chapter, I have shared some of my personal experience as an ML solutions architect and provided an overview of core concepts and components involved in the ML lifecycle. We discussed the key responsibilities of the ML solutions architect role throughout the lifecycle. This chapter aimed to give you an understanding of the technical and business domains required to work effectively as an ML solutions architect. With this foundational knowledge, you should now have an appreciation for the breadth of this role and its integral part in delivering successful ML solutions.

In the upcoming chapter, we will dive into various ML use cases across different industries, such as financial services and media and entertainment, to gain further insights into the practical applications of ML.

Join our community on Discord

Join our community’s Discord space for discussions with the author and other readers:

https://packt.link/mlsah

You have been reading a chapter from
The Machine Learning Solutions Architect Handbook - Second Edition
Published in: Apr 2024Publisher: PacktISBN-13: 9781805122500
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
David Ping

David Ping is an accomplished author and industry expert with over 28 years of experience in the field of data science and technology. He currently serves as the leader of a team of highly skilled data scientists and AI/ML solutions architects at AWS. In this role, he assists organizations worldwide in designing and implementing impactful AI/ML solutions to drive business success. David's extensive expertise spans a range of technical domains, including data science, ML solution and platform design, data management, AI risk, and AI governance. Prior to joining AWS, David held positions in renowned organizations such as JPMorgan, Credit Suisse, and Intel Corporation, where he contributed to the advancements of science and technology through engineering and leadership roles. With his wealth of experience and diverse skill set, David brings a unique perspective and invaluable insights to the field of AI/ML.
Read more about David Ping