The Machine Learning Solutions Architect Handbook - Second Edition


Product type Book
Published in Apr 2024
Publisher Packt
ISBN-13 9781805122500
Pages 602 pages
Edition 2nd Edition
Author: David Ping

Table of Contents (19 chapters)

Preface
1. Navigating the ML Lifecycle with ML Solutions Architecture
2. Exploring ML Business Use Cases
3. Exploring ML Algorithms
4. Data Management for ML
5. Exploring Open-Source ML Libraries
6. Kubernetes Container Orchestration Infrastructure Management
7. Open-Source ML Platforms
8. Building a Data Science Environment Using AWS ML Services
9. Designing an Enterprise ML Architecture with AWS ML Services
10. Advanced ML Engineering
11. Building ML Solutions with AWS AI Services
12. AI Risk Management
13. Bias, Explainability, Privacy, and Adversarial Attacks
14. Charting the Course of Your ML Journey
15. Navigating the Generative AI Project Lifecycle
16. Designing Generative AI Platforms and Solutions
Other Books You May Enjoy
Index

Exploring Open-Source ML Libraries

There is a wide range of machine learning (ML) and data science technologies available, encompassing both open-source and commercial products. Different organizations have adopted different approaches when it comes to building their ML platforms. Some have opted for in-house teams that leverage open-source technology stacks, allowing for greater flexibility and customization. Others have chosen commercial products to focus on addressing specific business and data challenges. Additionally, some organizations have adopted a hybrid architecture, combining open-source and commercial tools to harness the benefits of both. As a practitioner in ML solutions architecture, it is crucial to be knowledgeable about the available open-source ML technologies and their applications in building robust ML solutions.

In the upcoming chapters, our focus will be on exploring different open-source technologies for experimentation, model building, and the development...

Join our book community on Discord

https://packt.link/EarlyAccessCommunity


As an ML solutions architecture practitioner, I often receive requests for guidance on designing data management platforms for ML workloads. Although data management platform architecture is typically treated as a separate technical discipline, it plays a crucial role in ML workloads. To create a comprehensive ML platform, ML solutions architects must understand the essential data architecture considerations for machine learning and be familiar with the technical design of a data management platform that caters to the needs of data scientists and automated ML pipelines. In this chapter, we will explore the intersection of data management and ML, discussing key considerations for designing a data management platform specifically tailored for ML. We will delve into the core architecture components of such a platform and examine relevant AWS technologies and services that can be used to build it.

Technical requirements

In this chapter, you will need access to an AWS account and AWS services such as Amazon S3, AWS Lake Formation, AWS Glue, and AWS Lambda. If you do not have an AWS account, follow the instructions on the official AWS website to create one.

Data management considerations for ML

Data management is a broad and complex topic. Many organizations have dedicated teams to manage and govern the various aspects of their data platforms. Historically, data management primarily revolved around fulfilling the requirements of transactional and analytics systems. However, as ML solutions gain prominence, there are additional business and technology factors to consider for data management platforms. The advent of ML introduces new requirements and challenges that necessitate an evolution in data management practices to effectively support these advanced solutions.

To understand where data management intersects with the ML workflow, let's bring back the ML life cycle, as illustrated in the following figure:

Figure 4.1: Intersection of data management and the ML life cycle

At a high level, data management intersects with the ML life cycle in three stages: data understanding and preparation...

Data management architecture for ML

Depending on the scale of your ML initiatives, it is important to consider different data management architecture patterns to effectively support them.

For small-scale ML projects characterized by limited data scope, a small team size, and minimal cross-functional dependencies, a purpose-built data pipeline tailored to the specific project requirements can be a suitable approach. For instance, if your project involves working with structured data sourced from an existing data warehouse and a publicly available dataset, you can develop a straightforward data pipeline that extracts the necessary data from the data warehouse and the public domain and stores it in a dedicated storage location owned by the project team. This extraction can be scheduled as needed to facilitate further analysis and processing. The diagram below illustrates a simplified data management flow designed to support a small-scale ML project...
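As an illustration only, the extract-and-store flow described above can be sketched in Python. The warehouse query and the public-dataset download are simulated with in-memory stand-ins, and a local directory stands in for the project team's dedicated storage location; in a real pipeline these would be a warehouse client, an HTTP download, and an S3 bucket:

```python
import json
from pathlib import Path

def extract_from_warehouse():
    """Stand-in for a query against an existing data warehouse."""
    return [
        {"customer_id": 1, "spend": 120.5},
        {"customer_id": 2, "spend": 87.0},
    ]

def extract_public_dataset():
    """Stand-in for downloading a publicly available dataset."""
    return [
        {"customer_id": 1, "region": "us-east"},
        {"customer_id": 2, "region": "eu-west"},
    ]

def run_pipeline(output_dir: Path) -> Path:
    """Join the two sources on customer_id and land the result in a
    storage location owned by the project team."""
    warehouse = {r["customer_id"]: r for r in extract_from_warehouse()}
    merged = []
    for row in extract_public_dataset():
        merged.append({**warehouse.get(row["customer_id"], {}), **row})
    output_dir.mkdir(parents=True, exist_ok=True)
    out_file = output_dir / "training_snapshot.json"
    out_file.write_text(json.dumps(merged, indent=2))
    return out_file
```

Such a script can then be scheduled (for example, with cron or a workflow tool) to refresh the project's snapshot as needed.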

Data storage and management

ML workloads often require data from diverse sources and in various formats, and the sheer volume of data can be substantial, particularly when dealing with unstructured data. To address these requirements, cloud object data storage solutions like Amazon S3 are commonly employed as the underlying storage medium. Conceptually, cloud object storage can be likened to a file storage system that accommodates files of different formats. Moreover, the storage system allows for the organization of files using prefixes, which serve as virtual folders for enhanced object management. It is important to note that these prefixes do not correspond to physical folder structures. The term "object storage" stems from the fact that each file is treated as an independent object, bundled with metadata and assigned a unique identifier. Object storage boasts features such as virtually unlimited storage capacity, robust object analytics based on metadata, API-based access...
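To make the "virtual folder" idea concrete, here is a small Python sketch, not tied to any particular object store's API, that derives folder-like common prefixes from a flat set of object keys, much as an object store's list operation does when given a delimiter:

```python
def list_common_prefixes(keys, prefix="", delimiter="/"):
    """Group flat object keys by their next path segment, the way an
    object store's list API returns 'common prefixes'. The namespace is
    flat; folders are an illusion derived from the delimiter."""
    prefixes, objects = set(), []
    for key in keys:
        if not key.startswith(prefix):
            continue
        remainder = key[len(prefix):]
        if delimiter in remainder:
            prefixes.add(prefix + remainder.split(delimiter, 1)[0] + delimiter)
        else:
            objects.append(key)
    return sorted(prefixes), objects

keys = [
    "raw/2024/04/events.parquet",
    "raw/2024/05/events.parquet",
    "curated/features.parquet",
]
```

Listing with the empty prefix yields the top-level "folders" `curated/` and `raw/`, while listing under `raw/2024/` yields the monthly partitions, even though no folder objects exist anywhere in storage.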

Data ingestion

The data ingestion component plays a crucial role in acquiring data in structured, semi-structured, and unstructured formats from diverse sources such as databases, knowledge graphs, social media, file storage, and IoT devices. Its primary responsibility is to persist this data in storage solutions such as object storage (e.g., Amazon S3), data warehouses, or other data stores. Effective data ingestion patterns should incorporate both real-time streaming and batch ingestion mechanisms to cater to different types of data sources and ensure timely and efficient data acquisition.

Various data ingestion technologies and tools cater to different ingestion patterns. For streaming ingestion, popular choices include Apache Kafka, Apache Spark Streaming, and Amazon Kinesis/Kinesis Firehose, which enable real-time data ingestion and processing. For batch-oriented ingestion, tools like Secure File Transfer Protocol (SFTP...
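The two patterns can be sketched side by side in plain Python. The generator below is only a stand-in for a streaming source such as a Kafka topic or a Kinesis shard, and the batch function stands in for a chunked file transfer or nightly export job:

```python
import time
from typing import Iterable, Iterator

def stream_source(n: int) -> Iterator[dict]:
    """Stand-in for a streaming source: records arrive one at a time."""
    for i in range(n):
        yield {"event_id": i, "ts": time.time()}

def ingest_streaming(source: Iterable[dict], sink: list) -> None:
    """Streaming pattern: persist each record as it arrives."""
    for record in source:
        sink.append(record)

def ingest_batch(records: list, sink: list, batch_size: int = 100) -> int:
    """Batch pattern: land records in fixed-size chunks and report
    how many chunks were written."""
    batches = 0
    for start in range(0, len(records), batch_size):
        sink.extend(records[start:start + batch_size])
        batches += 1
    return batches
```

The trade-off is latency versus throughput and cost: streaming makes each record available almost immediately, while batching amortizes connection and write overhead across many records.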

Data cataloging

A data catalog plays a crucial role in data governance and enables data analysts and scientists to discover and access data stored in central data storage. It becomes particularly important during the data understanding and exploration phase of the ML life cycle, when scientists need to search and comprehend the data available for their ML projects. When evaluating a data catalog technology, consider the following key factors:

  • Metadata catalog: The technology should support a central data catalog for effective management of data lake metadata. This involves handling metadata such as database names, table schemas, and table tags. The Hive metastore catalog is a popular standard for managing metadata catalogs.
  • Automated data cataloging: The capability to automatically discover and catalog datasets, as well as infer data schemas from various data sources like Amazon S3, relational databases, NoSQL databases, and logs. Typically, this functionality is implemented through a crawler...
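A toy sketch can illustrate what these two capabilities involve: inferring a schema from sample records, as a crawler does when it scans files, and registering the result in a central, metastore-like structure. The `MetadataCatalog` class below is a hypothetical simplification for illustration, not a real Glue or Hive API:

```python
def infer_schema(records):
    """Infer column names and types from sample rows, the way a
    crawler infers a table schema from the files it scans."""
    schema = {}
    for row in records:
        for col, value in row.items():
            schema.setdefault(col, type(value).__name__)
    return schema

class MetadataCatalog:
    """Toy central catalog: database -> table -> schema/tags, loosely
    modeled on a Hive-metastore-style layout."""
    def __init__(self):
        self.databases = {}

    def register_table(self, database, table, schema, tags=None):
        self.databases.setdefault(database, {})[table] = {
            "schema": schema,
            "tags": tags or [],
        }

    def lookup(self, database, table):
        return self.databases[database][table]
```

In practice, a scheduled crawler would repeat the inference step as new data lands and update the catalog entry, so that the schemas scientists browse stay in sync with the data lake.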

Data processing

The data processing functionality of a data lake encompasses the frameworks and compute resources necessary for various data processing tasks, such as data correction, transformation, merging, splitting, and ML feature engineering. Common data processing frameworks include Python shell scripts and Apache Spark. The essential requirements for data processing technology are as follows:

  • Integration and compatibility with the underlying storage technology: The ability to seamlessly work with the native storage system simplifies data access and movement between the storage and processing layers.
  • Integration with the data catalog: The capability to interact with the data catalog's metastore for querying databases and tables within the catalog.
  • Scalability: The capacity to scale compute resources up or down to accommodate changing data volumes and processing velocity requirements.
  • Language and framework support: Support for popular data processing libraries and frameworks...
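The processing tasks listed above can be illustrated with minimal pure-Python functions; in practice a framework such as Apache Spark would perform the same steps in parallel at scale, but the shape of the work, correction, feature engineering, and splitting, is the same:

```python
from statistics import mean, pstdev

def clean(rows):
    """Data correction: drop rows with missing values."""
    return [r for r in rows if all(v is not None for v in r.values())]

def standardize(rows, column):
    """Feature engineering: add a z-scored copy of a numeric column."""
    values = [r[column] for r in rows]
    mu, sigma = mean(values), pstdev(values)
    for r in rows:
        r[column + "_z"] = (r[column] - mu) / sigma if sigma else 0.0
    return rows

def train_test_split(rows, test_ratio=0.2):
    """Splitting: hold out the tail of the dataset for evaluation."""
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]
```

Note that computing the mean and standard deviation over the full dataset is exactly the kind of global aggregation that motivates a distributed framework once the data no longer fits on one machine.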