You're reading from Data Wrangling on AWS

Product type: Book
Published in: Jul 2023
Publisher: Packt
ISBN-13: 9781801810906
Edition: 1st Edition
Authors (3):
Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has worked in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures and build secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers on implementing complex data architectures.

Sampat Palani

Sam Palani has over 18 years of experience as a developer, data engineer, data scientist, startup cofounder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans 5 countries and the financial services, management consulting, and technology industries. He is currently Sr Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of the business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.


Working with Amazon S3

In previous chapters, we repeatedly discussed big data and data lakes and how organizations use them to store data and extract valuable insights through the data wrangling processes outlined in Chapter 1, using Amazon Web Services (AWS) services such as AWS Glue DataBrew, the AWS SDK for Pandas, and SageMaker Data Wrangler. This chapter delves deeper into the specifics of big data and data lakes.

Specifically, we will be covering the following topics:

  • The definition and concept of big data
  • The characteristics of big data
  • The concept and definition of a data lake
  • Best practices for building a data lake on Amazon Simple Storage Service (Amazon S3)
  • The layout and organization of data on Amazon S3

We will begin by exploring the definition and characteristics of big data.

What is big data?

Big data refers to extremely large datasets that are too complex and diverse to be processed and analyzed using traditional data management and analytics tools. Big data often comes from multiple sources, such as sensors, social media, and e-commerce platforms, and it may include structured, semi-structured, and unstructured data.

The volume, velocity, and variety of big data present significant challenges for data management and analysis. Traditional data storage and processing systems are not designed to handle such large and complex datasets, and they may not be able to provide the performance, scalability, and flexibility required for big data applications.

To overcome these challenges, organizations have turned to big data technologies, such as Apache Hadoop, Apache Spark, and Apache Flink. These technologies are designed to support the storage, processing, and analysis of big data at scale, and they provide a distributed and parallel architecture that...

5 Vs of big data

The 5 Vs of big data are five key characteristics that define the concept of big data. These characteristics help to understand the nature of big data and how it can be effectively analyzed and used. Let’s look at these in more detail, as follows:

  • Volume: Big data involves datasets that are too large to be processed using traditional methods. These datasets can range from a few terabytes to several petabytes in size.

For example, Twitter alone generates over 500 million tweets per day, all of which must be stored, processed, and analyzed. Another example is the data generated by large e-commerce companies such as Amazon, which may include customer purchase history, website clickstream data, and customer service interactions. This data can be collected from sources such as online transactions, mobile apps, social media, and emails. All of this...
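To make the volume figure concrete, the arithmetic can be sketched as follows. This is only a back-of-the-envelope estimate: the 500 million tweets per day comes from the text, while the average size per tweet (ASSUMED_BYTES_PER_TWEET) is an assumption chosen for illustration, not a measured value.

```python
# Back-of-the-envelope estimate of storage volume for a tweet stream.
TWEETS_PER_DAY = 500_000_000     # figure quoted in the text
ASSUMED_BYTES_PER_TWEET = 2_000  # assumption for illustration only


def estimate_volume_bytes(events_per_day: int, bytes_per_event: int, days: int = 1) -> int:
    """Return the total number of bytes produced over the given number of days."""
    return events_per_day * bytes_per_event * days


def to_terabytes(num_bytes: int) -> float:
    """Convert bytes to decimal terabytes (1 TB = 10**12 bytes)."""
    return num_bytes / 10**12


daily_tb = to_terabytes(estimate_volume_bytes(TWEETS_PER_DAY, ASSUMED_BYTES_PER_TWEET))
yearly_tb = to_terabytes(
    estimate_volume_bytes(TWEETS_PER_DAY, ASSUMED_BYTES_PER_TWEET, days=365)
)
print(f"{daily_tb:.1f} TB/day, {yearly_tb:.0f} TB/year")
```

Under this assumption, the stream produces about 1 TB per day and 365 TB per year, which already falls in the terabyte-to-petabyte range described above.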

What is a data lake?

A data lake is a centralized repository that allows organizations to store all of their structured and unstructured data at any scale. This approach to data storage and management provides organizations with a single, unified platform for storing and managing data from a variety of different sources, including social media, sensors, and transactional systems.

Data lakes are designed to support the storage of large amounts of data in its raw format, allowing it to be processed and analyzed at a later stage by various teams within the organization. This gives organizations the flexibility to collect and store data from a wide range of sources without first preprocessing or structuring it in any specific way.
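The idea of storing data raw and imposing structure only when it is read (often called schema-on-read) can be illustrated with a minimal sketch. The records, field names, and the read_with_schema helper below are hypothetical and exist only for this example. In a real data lake, the raw lines would live as objects in storage rather than in a Python list.

```python
import json

# Raw records kept exactly as they arrived; no common structure is enforced.
RAW_EVENTS = [
    '{"source": "sensor", "device_id": "d-1", "temperature": 21.5}',
    '{"source": "social", "user": "alice", "text": "hello"}',
    '{"source": "sensor", "device_id": "d-2", "temperature": 19.0, "humidity": 40}',
]


def read_with_schema(raw_lines, source, fields):
    """Yield one dict per record from `source`, keeping only the requested fields."""
    for line in raw_lines:
        record = json.loads(line)
        if record.get("source") == source:
            # Fields missing from a record come back as None.
            yield {name: record.get(name) for name in fields}


# Each team applies the structure it needs at read time.
sensor_rows = list(read_with_schema(RAW_EVENTS, "sensor", ["device_id", "temperature"]))
print(sensor_rows)
```

Another team could read the same raw records with a different field list, for example to analyze the social media events, without any change to how the data was stored.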

One of the key benefits of using a data lake is that it allows organizations to store and manage data from a variety of different sources, including both structured and unstructured data. This means...

Data lake layouts

A data lake layout refers to the way that data is organized and structured within a data lake. This can include the physical location of the data within the data lake, as well as the logical organization of the data into different categories, such as structured and unstructured data.

In general, data lake layouts are designed to support the efficient storage and management of large amounts of data from a variety of different sources. This can include organizing data by source, by type, or by some other criterion that is relevant to the organization’s data management needs.

Some common elements of data lake layouts include the following:

  • Physical location: The physical location of the data within the data lake, such as on-premises storage or cloud-based storage
  • Logical organization: The logical organization of the data into different categories, such as structured and unstructured data
  • Data lineage: The history of the data, including where...
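As one possible illustration of logical organization, the sketch below builds an object key that groups data by layer, source, and dataset, followed by date partitions in the key=value style commonly used by Hive-compatible tools. The layer, source, and dataset names and the build_object_key helper are hypothetical. This is a sketch of one common convention, not the layout prescribed by the chapter.

```python
from datetime import date


def build_object_key(layer: str, source: str, dataset: str, day: date, filename: str) -> str:
    """Build an object key of the form layer/source/dataset/year=/month=/day=/filename."""
    partitions = f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}"
    return f"{layer}/{source}/{dataset}/{partitions}/{filename}"


# Hypothetical example: raw order data from a web shop for 31 July 2023.
key = build_object_key("raw", "webshop", "orders", date(2023, 7, 31), "part-0000.parquet")
print(key)  # raw/webshop/orders/year=2023/month=07/day=31/part-0000.parquet
```

Keeping the partition values in the key lets query engines that understand this convention skip objects outside the requested date range.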

Challenges and considerations when building a data lake on Amazon S3

When building a data lake, on Amazon S3 or in general, here are some challenges and considerations to be aware of:

  • Data ingestion: The process of bringing data into a data lake can be challenging, particularly when the data comes from multiple sources with varying formats and structures. This can lead to difficulties in ensuring data quality and consistency. Additionally, handling large volumes of data can be a challenge, particularly as the data grows. Another issue is keeping schema changes consistent throughout all downstream applications.
  • Data governance: Maintaining data quality, security, and regulatory compliance can be difficult when dealing with a large volume of data in a data lake. Implementing policies and standards for data classification, quality, and retention, as well as managing access and permissions, including role-based access control (RBAC) and data encryption, can...
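The schema-change issue mentioned under data ingestion can be made concrete with a small check that compares an incoming batch's columns against the schema downstream applications expect. This is a minimal sketch: the column names, type labels, and the detect_schema_drift helper are hypothetical, and a production pipeline would more likely rely on a data catalog or schema registry.

```python
def detect_schema_drift(expected: dict, incoming: dict) -> dict:
    """Compare two {column: type} mappings and report added, removed, and changed columns."""
    added = sorted(set(incoming) - set(expected))
    removed = sorted(set(expected) - set(incoming))
    changed = sorted(
        name for name in set(expected) & set(incoming) if expected[name] != incoming[name]
    )
    return {"added": added, "removed": removed, "changed": changed}


# Hypothetical schemas: what downstream jobs expect vs. what a new batch contains.
EXPECTED = {"order_id": "string", "amount": "double", "created_at": "timestamp"}
INCOMING = {"order_id": "string", "amount": "string", "coupon": "string"}

drift = detect_schema_drift(EXPECTED, INCOMING)
print(drift)
```

Running such a check at ingestion time makes it possible to flag a batch before the change reaches downstream applications.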

Summary

In this chapter, we discussed what big data is and its characteristics, what a data lake is, and why we need data lakes. We then looked at how a data lake can be built on Amazon S3, with an overview of the benefits of data lakes, the different layers of a data lake, and the best practices for building one on Amazon S3. We also covered how to organize and manage data within a data lake on S3 using file formats, partitions, S3 lifecycle management, Amazon S3 Intelligent-Tiering, and so on. Finally, we discussed some challenges and considerations when building a data lake on Amazon S3, such as cost and performance.

In the next chapter, we are going to learn about AWS Glue, a data integration service that lets you bring in data from different data sources and perform extract, transform, and load (ETL) operations on it using Apache Spark and Python.
