You're reading from Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Edition: 1st
Author: Behram Irani
Behram Irani is currently a technology leader with Amazon Web Services (AWS), specializing in data, analytics, and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. During his last six years at AWS, Behram has been a thought leader in the data, analytics, and AI/ML space, publishing multiple papers and leading digital transformation efforts for many organizations across the globe. Behram completed his Bachelor of Engineering in Computer Science at the University of Pune and holds an MBA from the University of Florida.

Batch Data Ingestion

In this chapter, we will look at the following key topics:

  • Database migration using AWS DMS
  • SaaS data ingestion using Amazon AppFlow
  • Data ingestion using AWS Glue
  • File and storage migration

So far, we have looked at creating scalable data lakes using Amazon S3 as the storage layer and AWS Glue Data Catalog as the metadata repository. We looked at how you can create layers of a data lake in S3 so that data can be systematically managed for specific personas in your organization. The very first layer we created in S3 was the raw layer, which is meant to store the source system data without any major changes. This also means that we need to first identify all the source systems that we need data from so that we can create a centralized data lake.
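One common way to keep the raw layer systematically organized is a consistent S3 key convention, partitioned by source system, table, and ingestion date. The following is a minimal sketch of such a convention; the exact prefix and partition names here are illustrative, not prescribed by the book:

```python
from datetime import date

def raw_layer_key(source: str, table: str, ingest_date: date, filename: str) -> str:
    """Build an S3 object key for the raw layer of the data lake,
    partitioned by source system, table, and ingestion date.
    The naming convention is illustrative."""
    return (f"raw/{source}/{table}/"
            f"ingest_date={ingest_date.isoformat()}/{filename}")

key = raw_layer_key("oracle_sales", "orders", date(2023, 8, 1), "part-0000.parquet")
# "raw/oracle_sales/orders/ingest_date=2023-08-01/part-0000.parquet"
```

A date-based partition like `ingest_date=` keeps each batch load isolated and lets downstream Glue jobs and queries prune to only the partitions they need.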

The mechanism by which we bring the data over into the raw layer of the data lake in S3 is also termed data ingestion. Data ingestion can either be in batches, where we bring the data over in...

Database migration using AWS DMS

In the prologue, we saw how the types and volumes of data have grown exponentially in recent times. However, a vast amount of data still resides in relational data stores, such as databases and data warehouses. So, let's start with relational data stores as the low-hanging fruit for data migration, and tie it back to our GreatFin corporation's use cases.

Use case for database migration and replication

All lines of business (LOBs) at GreatFin have their transactional data sitting in on-prem databases such as Oracle and SQL Server. They want this data centralized in a data lake so that they can perform self-service analytics and derive insights across all these systems.

Some reports need the latest data for analytics as soon as the source databases commit a transaction. This allows the business to view near-real-time dashboards and make quick decisions.
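A full load followed by ongoing change data capture (CDC) is the DMS pattern that supports this kind of near-real-time replication. The sketch below shows roughly how such a task could be set up with the AWS SDK (boto3); the task identifier, schema name, and ARNs are hypothetical, and the actual call is commented out because it requires real DMS endpoints and a replication instance:

```python
import json

def dms_table_mappings(schema: str) -> str:
    """Build a DMS table-mapping rule that selects every table in a schema."""
    return json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-schema",
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": "include",
        }]
    })

# Creating the task needs real endpoint and instance ARNs:
# import boto3
# dms = boto3.client("dms")
# dms.create_replication_task(
#     ReplicationTaskIdentifier="lob-orders-to-lake",  # hypothetical name
#     SourceEndpointArn=source_arn,
#     TargetEndpointArn=target_arn,
#     ReplicationInstanceArn=instance_arn,
#     MigrationType="full-load-and-cdc",  # full load first, then ongoing CDC
#     TableMappings=dms_table_mappings("sales"),
# )
```

Setting `MigrationType` to `"full-load-and-cdc"` is what gives the dashboards near-real-time data: after the initial copy, committed transactions are streamed to the target continuously.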

At the same time, some LOBs want...

SaaS data ingestion using Amazon AppFlow

We are in an era where many applications are SaaS based. Every SaaS application is different and has its own mechanism for capturing and storing data. Many SaaS applications also offer built-in reporting, but organizations often want a holistic view across the whole data platform, which means joining datasets from multiple such applications to derive the right level of insight from the data.

Let’s try to correlate this with a use case from GreatFin. If you recall from Chapter 2, the marketing department wanted to find top leads for offering a new type of certificate of deposit (CD) account to select a few high-net-worth customers. Let’s use that example to build our SaaS data ingestion use case.
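To give a feel for the AppFlow API, here is a minimal sketch of a scheduled trigger configuration and an on-demand flow run via boto3. The flow name is hypothetical, and the `start_flow` call is commented out because it requires a flow already configured in AppFlow (for example, with a Salesforce source connector):

```python
def scheduled_trigger(rate_expression: str) -> dict:
    """AppFlow triggerConfig for a flow that runs on a schedule."""
    return {
        "triggerType": "Scheduled",
        "triggerProperties": {
            "Scheduled": {"scheduleExpression": rate_expression},
        },
    }

# Running an already-configured flow on demand:
# import boto3
# appflow = boto3.client("appflow")
# appflow.start_flow(flowName="salesforce-leads-to-s3")  # hypothetical flow name
```

A scheduled trigger like `rate(1 days)` is a common fit for batch ingestion, pulling fresh SaaS data into the raw layer once per day without any custom polling code.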

Use case for data migration from a SaaS application

The marketing department ran a campaign to identify top leads who would be a great fit for offering a new CD account. The lead...

Data ingestion using AWS Glue

In our data lake in Chapter 2, we introduced the Glue Data Catalog, which is one of the key components of data lake design. Glue is also a popular ETL tool for data engineers who want to ingest data from source systems and transform it as it flows between the different layers of the data lake. Glue provides complete flexibility to deal with any kind of data engineering complexity. In essence, Glue ETL can help extract data from any source system, transform it, and load it into any target system.

Since this chapter is all about batch data ingestion and we want to keep most of our focus on ingesting data into the data lake in S3, we will focus on those use cases. We have a dedicated chapter for data processing later, where we will revisit Glue ETL.
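To make the ingestion flow concrete, the sketch below shows the shape of a minimal Glue job that reads a catalog table and lands it as Parquet in the raw layer. The bucket, database, and table names are hypothetical, and the Glue calls themselves are shown as comments since they only run inside the Glue job runtime:

```python
def raw_layer_path(bucket: str, source: str, table: str) -> str:
    """Target S3 path in the raw layer of the data lake (naming is illustrative)."""
    return f"s3://{bucket}/raw/{source}/{table}/"

# Inside an AWS Glue job (requires the Glue runtime; not runnable locally):
#
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
#
# glue_ctx = GlueContext(SparkContext.getOrCreate())
# frame = glue_ctx.create_dynamic_frame.from_catalog(
#     database="sales_db", table_name="orders")  # hypothetical catalog names
# glue_ctx.write_dynamic_frame.from_options(
#     frame=frame,
#     connection_type="s3",
#     connection_options={"path": raw_layer_path("greatfin-datalake", "oracle", "orders")},
#     format="parquet",
# )
```

Writing Parquet into the raw layer this way means the output is immediately queryable once the Glue crawler (or the job itself) registers the table in the Data Catalog.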

Use case for data ingestion using modern ETL techniques

The business at GreatFin wants to derive value from all the data available in its existing data stores; some are stored in older-generation...

File and storage migration

A lot of data still resides in files for many reasons. When the data resides in files, we just need an easy transfer mechanism to bring it over into the raw layer of the data lake in S3. In this section, we will explore some of the AWS services that make it easy to transfer files into the AWS ecosystem.

AWS DataSync

AWS DataSync makes it easy to continuously migrate on-prem data into many AWS storage services, including Amazon S3. DataSync uses an agent, deployed on-premises, that does the heavy lifting of the data migration. Before we look at the usage patterns, let's look at a use case at GreatFin that makes DataSync very appealing.
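Once the agent is deployed and source/destination locations are registered, a DataSync task ties them together. The sketch below builds a typical task options dictionary and shows (as hedged comments) how the task itself could be created with boto3; the location ARNs and the hourly schedule are hypothetical:

```python
def datasync_task_options(transfer_changed_only: bool = True) -> dict:
    """Options for a DataSync task: verify data after transfer and,
    for ongoing replication, copy only files that have changed."""
    return {
        "VerifyMode": "POINT_IN_TIME_CONSISTENT",
        "OverwriteMode": "ALWAYS",
        "TransferMode": "CHANGED" if transfer_changed_only else "ALL",
    }

# Creating the task needs real location ARNs (agent already deployed):
# import boto3
# datasync = boto3.client("datasync")
# datasync.create_task(
#     SourceLocationArn=nfs_location_arn,        # on-prem file share
#     DestinationLocationArn=s3_location_arn,    # raw-layer bucket/prefix
#     Options=datasync_task_options(),
#     Schedule={"ScheduleExpression": "rate(1 hour)"},  # continuous replication
# )
```

`TransferMode: "CHANGED"` is what makes continuous replication cheap: each scheduled run copies only new or modified files rather than re-sending multi-terabyte datasets.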

Use case for data migration using AWS DataSync

Multiple LOBs at GreatFin want to save costs by retiring multi-terabyte data stored on their on-prem storage systems. They want to continuously replicate new data as it arrives on their on-prem storage. Also, because of regulatory requirements, they have...

Summary

In this chapter, we looked at how you can migrate data in batches into different AWS storage systems, especially a data lake in S3. Data ingestion is usually the first step in data migration, and it can become very complicated if the right tools are not used for each source and target data store.

We also looked at how you can use DMS and the AWS Schema Conversion Tool (SCT) to migrate and replicate on-prem databases into AWS data stores, and how you can bring data into the data lake built on S3. We then looked at how you can use AppFlow to migrate data from SaaS-based applications into the data lake, and at how the versatility of Glue ETL helps during the initial data ingestion stage. Finally, we looked at the other storage and file transfer services, including DataSync, Transfer Family, and Snow Family.

This brings us to the end of an important chapter where we were able to hydrate data stores in AWS with purpose-built modern data ingestion services. Since this...
