Ingesting Batch and Streaming Data

Having developed a high-level architecture for our data pipeline, we can now dive deeper into the various components of that architecture. We will start with data ingestion so that, in the hands-on section of this chapter, we can ingest data that we will then use in the hands-on activities of future chapters.

Data engineers are often faced with the challenge of the five Vs of data. These are the variety of data (the diverse types and formats of data); the volume of data (the size of the dataset); the velocity of the data (how quickly the data is generated and needs to be ingested); the veracity or validity of the data (the quality, completeness, and credibility of the data); and, finally, the value of the data (the business value that the data can provide).

In this chapter, we will look at several different types of data sources and examine the various tools available within AWS for ingesting data from these sources. We will also look at how...

Technical requirements

In the hands-on sections of this chapter, we will use AWS Database Migration Service (DMS) to ingest data from a database source, and then we will ingest streaming data using Amazon Kinesis. To ingest data from a database, you need IAM permissions that allow your user to create an RDS database, an EC2 instance, a DMS instance, and a new IAM role and policy.

For the hands-on section on ingesting streaming data, you will need IAM permissions to create a Kinesis Data Firehose instance, as well as permissions to deploy a CloudFormation template. The template that is deployed will create IAM roles, a Lambda function, Amazon Cognito users, and other Cognito resources.

To query the newly ingested data, you will need permission to create an AWS Glue Crawler and permission to use Amazon Athena to query data.
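If you prefer to manage these permissions as code, the following is a minimal boto3 sketch of creating a custom IAM policy along these lines. The policy name and action list are illustrative assumptions (and deliberately broad, as suits a sandbox account); scope them down to match your organization's standards.

import json

import boto3

iam = boto3.client("iam")

# Illustrative policy document covering the kinds of actions used in this
# chapter's hands-on exercises. The action list is an assumption and is
# broader than necessary; narrow it for anything beyond a sandbox.
policy_document = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds:*",
                "ec2:*",
                "dms:*",
                "firehose:*",
                "cloudformation:*",
                "glue:*",
                "athena:*",
                "s3:*",
                "iam:CreateRole",
                "iam:CreatePolicy",
                "iam:AttachRolePolicy",
                "iam:PassRole",
            ],
            "Resource": "*",
        }
    ],
}

# "dataeng-chapter6-policy" is a hypothetical policy name.
iam.create_policy(
    PolicyName="dataeng-chapter6-policy",
    PolicyDocument=json.dumps(policy_document),
)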

You can find the code files of this chapter in the GitHub repository at the following link: https://github.com/PacktPublishing...

Understanding data sources

Over the past decade, the amount and variety of data generated each year have increased significantly. Today, industry analysts talk about the volume of global data generated in a year in terms of zettabytes (ZB), a unit of measurement equal to a billion terabytes (TB). By some estimates, a little over 1 ZB of data existed in the world in 2012, yet by the end of 2025, an estimated 181 ZB of data will be created, captured, copied, and consumed worldwide.

In our pipeline whiteboarding session (covered in Chapter 5, Architecting Data Engineering Pipelines), we identified several data sources that we wanted to ingest and transform to best enable our data consumers. For each data source identified in a whiteboarding session, you need to develop an understanding of the variety, volume, velocity, veracity, and value of the data; we will cover each of these next.

Data variety

In the past decade, the variety of data...

Ingesting data from a relational database

A common source of data for analytical projects is data that comes from a relational database system such as MySQL, PostgreSQL, SQL Server, or an Oracle database. Organizations often have multiple siloed databases, and they want to bring the data from these varied databases into a central location for analytics.

It is common for these projects to include ingesting the historical data that already exists in the database, as well as syncing ongoing new and changed data from the database. A variety of tools can be used to ingest data from database sources, as we will discuss in this section.

AWS DMS

The primary AWS service for ingesting data from a database is AWS DMS, though there are other ways to ingest data from a database source. As a data engineer, you need to evaluate both the source and the target to determine which ingestion tool will be best suited.

AWS DMS is intended for doing either one-off ingestion of historical...
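As a preview of what we will configure in the hands-on section, the following is a minimal sketch of defining a full-load DMS task programmatically with boto3. All identifiers, hostnames, ARNs, and credentials are placeholder assumptions; in the hands-on section, we will perform the equivalent steps in the console.

import json

import boto3

dms = boto3.client("dms")

# Placeholder source endpoint for a MySQL database.
source = dms.create_endpoint(
    EndpointIdentifier="mysql-source",
    EndpointType="source",
    EngineName="mysql",
    ServerName="mydb.example.com",  # hypothetical hostname
    Port=3306,
    Username="admin",
    Password="REPLACE_ME",
    DatabaseName="sakila",  # hypothetical database name
)

# Placeholder target endpoint that writes to an S3 data lake bucket.
target = dms.create_endpoint(
    EndpointIdentifier="s3-target",
    EndpointType="target",
    EngineName="s3",
    S3Settings={
        "BucketName": "my-datalake-bucket",  # hypothetical bucket
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-role",
    },
)

# Replicate every table in every schema; these selection rules can be
# narrowed to specific schemas or tables as needed.
table_mappings = {
    "rules": [
        {
            "rule-type": "selection",
            "rule-id": "1",
            "rule-name": "include-all",
            "object-locator": {"schema-name": "%", "table-name": "%"},
            "rule-action": "include",
        }
    ]
}

dms.create_replication_task(
    ReplicationTaskIdentifier="mysql-to-s3-full-load",
    SourceEndpointArn=source["Endpoint"]["EndpointArn"],
    TargetEndpointArn=target["Endpoint"]["EndpointArn"],
    ReplicationInstanceArn="arn:aws:dms:us-east-1:123456789012:rep:EXAMPLE",
    MigrationType="full-load",  # or "full-load-and-cdc" for ongoing sync
    TableMappings=json.dumps(table_mappings),
)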

Ingesting streaming data

An increasingly common source of data for analytics projects is data that is continually generated and needs to be ingested in near real time. Some common sources of this type of data are as follows:

  • Data from IoT devices (such as smartwatches, smart appliances, and so on)
  • Telemetry data from various types of vehicles (cars, airplanes, and so on)
  • Sensor data (from manufacturing machines, weather stations, and so on)
  • Live gameplay data from mobile games
  • Mentions of the company brand on various social media platforms

For example, Boeing, the aircraft manufacturer, has a system called Airplane Health Management (AHM) that collects in-flight airplane data and relays it in real time to Boeing systems. Boeing processes the information and makes it immediately available to airline maintenance staff via a web portal.
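To make the shape of streaming ingestion concrete, the following is a minimal sketch of a producer publishing mock sensor readings to a Kinesis data stream with boto3; the stream name and record fields are illustrative assumptions.

import json
import random
import time

import boto3

kinesis = boto3.client("kinesis")

# Continuously publish mock sensor readings to a Kinesis data stream.
# "sensor-telemetry" is a hypothetical stream name.
while True:
    reading = {
        "device_id": f"sensor-{random.randint(1, 10)}",
        "temperature_c": round(random.uniform(18.0, 28.0), 2),
        "timestamp": int(time.time()),
    }
    kinesis.put_record(
        StreamName="sensor-telemetry",
        Data=json.dumps(reading).encode("utf-8"),
        PartitionKey=reading["device_id"],  # keeps a device's records on one shard
    )
    time.sleep(1)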

In this section, we will look at several tools and services for ingesting streaming data, as well as things...

Hands-on – ingesting data with AWS DMS

As we discussed earlier in this chapter, AWS DMS can be used to replicate a database into an Amazon S3-based data lake (among other uses). Follow the steps in this section to do the following (a scripted boto3 sketch of steps 3 to 5 appears below):

  1. Deploy a CloudFormation template that configures a MySQL RDS instance and then deploys an EC2 instance to load a demo database into MySQL.
  2. Set up a DMS replication instance and configure endpoints and tasks.
  3. Run the DMS instance in full-load mode.
  4. Run a Glue crawler to add the tables that were newly loaded into S3 to the AWS Glue Data Catalog.
  5. Query the data with Amazon Athena.
  6. Delete the CloudFormation stack to remove the resources that were deployed.

NOTE

The following steps assume the use of your AWS account’s default VPC and security group. You will need to modify the steps as needed if you’re not using the default.
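If you prefer to script steps 3 to 5 rather than use the console, the following is a minimal boto3 sketch. The task ARN, crawler name, database name, table name, and query output location are all placeholder assumptions; substitute the values from your own deployment.

import boto3

# Step 3: start the DMS task in full-load mode (the ARN is a placeholder).
dms = boto3.client("dms")
dms.start_replication_task(
    ReplicationTaskArn="arn:aws:dms:us-east-1:123456789012:task:EXAMPLE",
    StartReplicationTaskType="start-replication",
)

# Step 4: run a Glue crawler to catalog the files DMS wrote to S3.
glue = boto3.client("glue")
glue.start_crawler(Name="dms-datalake-crawler")  # hypothetical crawler name

# Step 5: query the newly cataloged table with Athena
# (database, table, and output location are placeholders).
athena = boto3.client("athena")
athena.start_query_execution(
    QueryString="SELECT * FROM customers LIMIT 10",
    QueryExecutionContext={"Database": "datalakedb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)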

Deploying MySQL and an...

Hands-on – ingesting streaming data

Earlier in this chapter, we looked at two options for ingesting streaming data into AWS, namely Amazon Kinesis and Amazon MSK. In this section, we will use the Amazon Kinesis service to ingest streaming data, and to generate that data we will use the Amazon Kinesis Data Generator (KDG), an open-source solution from AWS for streaming sample data to Kinesis.

In this section, we will perform the following tasks:

  1. Configure Amazon Kinesis Data Firehose to ingest streaming data, and write the data out to Amazon S3.
  2. Configure the Kinesis Data Generator (KDG) to create mock streaming data.

To get started, let’s configure a new Kinesis Data Firehose instance to ingest streaming data and write it out to our Amazon S3 data lake.

Configuring Kinesis Data Firehose for streaming delivery to Amazon S3

Kinesis Data Firehose is designed to enable you to easily ingest data from streaming sources...
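As a programmatic counterpart to the console configuration described in this section, the following is a minimal boto3 sketch that creates a delivery stream which buffers incoming records and delivers them to S3; the stream name, role ARN, bucket ARN, and buffering values are placeholder assumptions.

import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that receives records directly (DirectPut)
# and writes buffered batches to an S3 data lake bucket.
firehose.create_delivery_stream(
    DeliveryStreamName="streaming-data-to-s3",  # hypothetical name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-s3-role",
        "BucketARN": "arn:aws:s3:::my-datalake-bucket",
        "Prefix": "streaming/",
        # Flush to S3 every 5 MiB or every 60 seconds, whichever comes first.
        "BufferingHints": {"SizeInMBs": 5, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)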

Summary

In this chapter, we reviewed several ways to ingest common data types into AWS. We reviewed how AWS DMS and AWS Glue can be used to ingest data from a relational database to S3, and how Amazon Kinesis and Amazon MSK can be used to ingest streaming data.

In the hands-on section of this chapter, we used both the AWS DMS and Amazon Kinesis services to ingest data and then used AWS Glue to add the newly ingested data to the AWS Glue Data Catalog and query the data with Amazon Athena.

In the next chapter, Chapter 7, Transforming Data to Optimize for Analytics, we will review how we can transform the ingested data to optimize it for analytics, a core task for data engineers.

Learn more on Discord

To join the Discord community for this book, where you can share feedback, ask the author questions, and learn about new releases, use the following link:

https://discord.gg/9s5mHNyECd
