You're reading from Modern Data Architecture on AWS

Product type: Book
Published in: Aug 2023
Publisher: Packt
ISBN-13: 9781801813396
Edition: 1st
Author: Behram Irani
Behram Irani is currently a technology leader with Amazon Web Services (AWS), specializing in data, analytics, and AI/ML. He has spent over 18 years in the tech industry helping organizations, from start-ups to large-scale enterprises, modernize their data platforms. During his last six years at AWS, Behram has been a thought leader in the data, analytics, and AI/ML space, publishing multiple papers and leading digital transformation efforts for many organizations across the globe. Behram completed his Bachelor of Engineering in Computer Science at the University of Pune and holds an MBA from the University of Florida.

Batch Data Ingestion

In this chapter, we will look at the following key topics:

  • Database migration using AWS DMS
  • SaaS data ingestion using Amazon AppFlow
  • Data ingestion using AWS Glue
  • File and storage migration

So far, we have looked at creating scalable data lakes using Amazon S3 as the storage layer and AWS Glue Data Catalog as the metadata repository. We looked at how you can create layers of a data lake in S3 so that data can be systematically managed for specific personas in your organization. The very first layer we created in S3 was the raw layer, which is meant to store the source system data without any major changes. This also means that we need to first identify all the source systems that we need data from so that we can create a centralized data lake.
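One common way to keep the raw layer systematically organized is a consistent S3 key convention, partitioned by source system, table, and ingestion date. The following is a minimal sketch of such a convention; the exact prefix and partition names here are illustrative, not prescribed by the book:

```python
from datetime import date

def raw_layer_key(source: str, table: str, ingest_date: date, filename: str) -> str:
    """Build an S3 object key for the raw layer of the data lake,
    partitioned by source system, table, and ingestion date.
    The naming convention is illustrative."""
    return (f"raw/{source}/{table}/"
            f"ingest_date={ingest_date.isoformat()}/{filename}")

key = raw_layer_key("oracle_sales", "orders", date(2023, 8, 1), "part-0000.parquet")
# "raw/oracle_sales/orders/ingest_date=2023-08-01/part-0000.parquet"
```

A date-based partition like `ingest_date=` keeps each batch load isolated and lets downstream Glue jobs and queries prune to only the partitions they need.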

The mechanism by which we bring the data over into the raw layer of the data lake in S3 is also termed data ingestion. Data ingestion can either be in batches, where we bring the data over in...

Database migration using AWS DMS

In the prologue, we saw how the types and volumes of data have grown exponentially in recent times. However, a vast amount of data still resides in relational data stores, such as databases and data warehouses. So, let's start with relational data stores as the low-hanging fruit for data migration, and tie it back to our GreatFin corporation's use cases.

Use case for database migration and replication

All lines of business (LOBs) at GreatFin have their transactional data sitting in on-prem databases such as Oracle and SQL Server. They want this data centralized in a data lake so that they can perform self-service analytics and derive insights across all these systems.

Some reports need the latest data for analytics as soon as the source databases commit a transaction. This allows the business to view near-real-time dashboards and make quick decisions.
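A full load followed by ongoing change data capture (CDC) is the DMS pattern that supports this kind of near-real-time replication. The sketch below shows roughly how such a task could be set up with the AWS SDK (boto3); the task identifier, schema name, and ARNs are hypothetical, and the actual call is commented out because it requires real DMS endpoints and a replication instance:

```python
import json

def dms_table_mappings(schema: str) -> str:
    """Build a DMS table-mapping rule that selects every table in a schema."""
    return json.dumps({
        "rules": [{
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-schema",
            "object-locator": {"schema-name": schema, "table-name": "%"},
            "rule-action": "include",
        }]
    })

# Creating the task needs real endpoint and instance ARNs:
# import boto3
# dms = boto3.client("dms")
# dms.create_replication_task(
#     ReplicationTaskIdentifier="lob-orders-to-lake",  # hypothetical name
#     SourceEndpointArn=source_arn,
#     TargetEndpointArn=target_arn,
#     ReplicationInstanceArn=instance_arn,
#     MigrationType="full-load-and-cdc",  # full load first, then ongoing CDC
#     TableMappings=dms_table_mappings("sales"),
# )
```

Setting `MigrationType` to `"full-load-and-cdc"` is what gives the dashboards near-real-time data: after the initial copy, committed transactions are streamed to the target continuously.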

At the same time, some LOBs want...

SaaS data ingestion using Amazon AppFlow

We are in an era where many applications are SaaS based. Every SaaS application is different and has its own mechanism for capturing and storing data. Many SaaS applications also offer built-in reporting, but organizations often want a holistic view across the whole data platform, which means joining datasets from multiple such applications to derive the right level of insight from the data.

Let’s try to correlate this with a use case from GreatFin. If you recall from Chapter 2, the marketing department wanted to find top leads for offering a new type of certificate of deposit (CD) account to select a few high-net-worth customers. Let’s use that example to build our SaaS data ingestion use case.
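To give a feel for the AppFlow API, here is a minimal sketch of a scheduled trigger configuration and an on-demand flow run via boto3. The flow name is hypothetical, and the `start_flow` call is commented out because it requires a flow already configured in AppFlow (for example, with a Salesforce source connector):

```python
def scheduled_trigger(rate_expression: str) -> dict:
    """AppFlow triggerConfig for a flow that runs on a schedule."""
    return {
        "triggerType": "Scheduled",
        "triggerProperties": {
            "Scheduled": {"scheduleExpression": rate_expression},
        },
    }

# Running an already-configured flow on demand:
# import boto3
# appflow = boto3.client("appflow")
# appflow.start_flow(flowName="salesforce-leads-to-s3")  # hypothetical flow name
```

A scheduled trigger like `rate(1 days)` is a common fit for batch ingestion, pulling fresh SaaS data into the raw layer once per day without any custom polling code.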

Use case for data migration from a SaaS application

The marketing department ran a campaign to identify top leads who would be a great fit for offering a new CD account. The lead...

Data ingestion using AWS Glue

In our data lake in Chapter 2, we introduced the Glue Data Catalog, which is one of the key components of data lake design. Glue is also a popular ETL tool for data engineers who want to ingest data from source systems and transform it as it flows between the different layers of the data lake. Glue provides complete flexibility to deal with any kind of data engineering complexity. In essence, Glue ETL can help extract data from any source system, transform it, and load it into any target system.

Since this chapter is all about batch data ingestion and we want to keep most of our focus on ingesting data into the data lake in S3, we will focus on those use cases. We have a dedicated chapter for data processing later, where we will revisit Glue ETL.
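To make the ingestion flow concrete, the sketch below shows the shape of a minimal Glue job that reads a catalog table and lands it as Parquet in the raw layer. The bucket, database, and table names are hypothetical, and the Glue calls themselves are shown as comments since they only run inside the Glue job runtime:

```python
def raw_layer_path(bucket: str, source: str, table: str) -> str:
    """Target S3 path in the raw layer of the data lake (naming is illustrative)."""
    return f"s3://{bucket}/raw/{source}/{table}/"

# Inside an AWS Glue job (requires the Glue runtime; not runnable locally):
#
# from awsglue.context import GlueContext
# from pyspark.context import SparkContext
#
# glue_ctx = GlueContext(SparkContext.getOrCreate())
# frame = glue_ctx.create_dynamic_frame.from_catalog(
#     database="sales_db", table_name="orders")  # hypothetical catalog names
# glue_ctx.write_dynamic_frame.from_options(
#     frame=frame,
#     connection_type="s3",
#     connection_options={"path": raw_layer_path("greatfin-datalake", "oracle", "orders")},
#     format="parquet",
# )
```

Writing Parquet into the raw layer this way means the output is immediately queryable once the Glue crawler (or the job itself) registers the table in the Data Catalog.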

Use case for data ingestion using modern ETL techniques

The business at GreatFin wants to derive value from all the data available in its existing data stores; some are stored in older-generation...

File and storage migration

A lot of data still resides in files for many reasons. When the data resides in files, we just need an easy transfer mechanism to bring it over into the raw layer of the data lake in S3. In this section, we will explore some of the AWS services that make it easy to transfer files into the AWS ecosystem.

AWS DataSync

AWS DataSync makes it easy to continuously migrate on-prem data into many AWS storage services, including Amazon S3. DataSync uses an agent, deployed on-premises, that does the heavy lifting of the data migration. Before we look at the usage patterns, let's look at a use case at GreatFin that makes DataSync very appealing.
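Once the agent is deployed and source/destination locations are registered, a DataSync task ties them together. The sketch below builds a typical task options dictionary and shows (as hedged comments) how the task itself could be created with boto3; the location ARNs and the hourly schedule are hypothetical:

```python
def datasync_task_options(transfer_changed_only: bool = True) -> dict:
    """Options for a DataSync task: verify data after transfer and,
    for ongoing replication, copy only files that have changed."""
    return {
        "VerifyMode": "POINT_IN_TIME_CONSISTENT",
        "OverwriteMode": "ALWAYS",
        "TransferMode": "CHANGED" if transfer_changed_only else "ALL",
    }

# Creating the task needs real location ARNs (agent already deployed):
# import boto3
# datasync = boto3.client("datasync")
# datasync.create_task(
#     SourceLocationArn=nfs_location_arn,        # on-prem file share
#     DestinationLocationArn=s3_location_arn,    # raw-layer bucket/prefix
#     Options=datasync_task_options(),
#     Schedule={"ScheduleExpression": "rate(1 hour)"},  # continuous replication
# )
```

`TransferMode: "CHANGED"` is what makes continuous replication cheap: each scheduled run copies only new or modified files rather than re-sending multi-terabyte datasets.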

Use case for data migration using AWS DataSync

Multiple LOBs at GreatFin want to save costs by retiring multi-terabyte data stored on their on-prem storage systems. They want to continuously replicate new data as it arrives on their on-prem storage. Also, because of regulatory requirements, they have...

Summary

In this chapter, we looked at how you can migrate data in batches into different AWS storage systems, especially a data lake in S3. Data ingestion is usually the first step in data migration, and it can become very complicated if the right tools are not used for each source and target data store.

We also looked at how you can use DMS and the AWS Schema Conversion Tool (SCT) to migrate and replicate on-prem databases into AWS data stores, and how you can bring data into the data lake built on S3. We then looked at how you can use AppFlow to migrate data from SaaS-based applications into the data lake, and at how the versatility of Glue ETL helps during the initial data ingestion stage. Finally, we looked at the other storage and file transfer services, including DataSync, Transfer Family, and Snow Family.

This brings us to the end of an important chapter where we were able to hydrate data stores in AWS with purpose-built modern data ingestion services. Since this...
