You're reading from Data Wrangling on AWS

Product type: Book

Published in: Jul 2023

Publisher: Packt

ISBN-13: 9781801810906

Edition: 1st Edition

Tools

AWS

Concepts

Data Analysis

Authors (3):

Navnit Shukla

Sankar M

Sampat Palani


Working with Amazon S3

In previous chapters, we repeatedly touched on big data and data lakes, and on how organizations use them to store data and extract valuable insights through the data wrangling processes outlined in Chapter 1, using Amazon Web Services (AWS) services such as AWS Glue DataBrew, the AWS SDK for pandas, and Amazon SageMaker Data Wrangler. This chapter delves deeper into the specifics of big data and data lakes.

Specifically, we will be covering the following topics:

The definition and concept of big data
The characteristics of big data
The concept and definition of a data lake
Best practices for building a data lake on Amazon Simple Storage Service (Amazon S3)
The layout and organization of data on Amazon S3

We will begin by exploring the definition and characteristics of big data.

What is big data?

Big data refers to extremely large datasets that are too complex and diverse to be processed and analyzed using traditional data management and analytics tools. Big data often comes from multiple sources, such as sensors, social media, and e-commerce platforms, and it may include structured, semi-structured, and unstructured data.

The volume, velocity, and variety of big data present significant challenges for data management and analysis. Traditional data storage and processing systems are not designed to handle such large and complex datasets, and they may not be able to provide the performance, scalability, and flexibility required for big data applications.

To overcome these challenges, organizations have turned to big data technologies, such as Apache Hadoop, Apache Spark, and Apache Flink. These technologies are designed to support the storage, processing, and analysis of big data at scale, and they provide a distributed and parallel architecture that...

5 Vs of big data

The 5 Vs of big data are five key characteristics that define the concept of big data. They help us understand the nature of big data and how it can be effectively analyzed and used. Let’s look at them in more detail:

Volume: Big data involves datasets that are too large to be processed using traditional methods. These datasets can range from a few terabytes to several petabytes in size.

For example, Twitter alone generates over 500 million tweets per day, which amounts to a large volume of data that must be stored, processed, and analyzed. Another example is the data generated by large e-commerce companies such as Amazon, which may include customer purchase history, website clickstream data, and customer service interactions. This data can be collected from various sources, such as online transactions, mobile apps, social media, and emails. All of this...
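To put the daily figure quoted above into perspective, it can be converted into an average rate. This is only back-of-the-envelope arithmetic on the number in the text; it says nothing about peak load or about the size of each tweet.

```python
TWEETS_PER_DAY = 500_000_000  # figure quoted in the text
SECONDS_PER_DAY = 24 * 60 * 60

# Average rate, assuming tweets were spread evenly across the day.
tweets_per_second = TWEETS_PER_DAY / SECONDS_PER_DAY
print(f"{tweets_per_second:,.0f} tweets per second on average")
```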

What is a data lake?

A data lake is a centralized repository that allows organizations to store all of their structured and unstructured data at any scale. This approach to data storage and management provides organizations with a single, unified platform for storing and managing data from a variety of different sources, including social media, sensors, and transactional systems.

Data lakes are designed to store large amounts of data in its raw format, so that various teams within the organization can process and analyze it at a later stage. This gives organizations the flexibility to collect and store data from a wide range of sources without first preprocessing or structuring it in any specific way.

One of the key benefits of using a data lake is that it allows organizations to store and manage data from a variety of different sources, including both structured and unstructured data. This means...

Data lake layouts

A data lake layout refers to the way that data is organized and structured within a data lake. This can include the physical location of the data within the data lake, as well as the logical organization of the data into different categories, such as structured and unstructured data.

In general, data lake layouts are designed to support the efficient storage and management of large amounts of data from a variety of different sources. This can include organizing data by source, by type, or by some other criterion that is relevant to the organization’s data management needs.

Some common elements of data lake layouts include the following:

Physical location: Where the data is physically stored, such as on-premises storage or cloud-based storage
Logical organization: How the data is grouped into categories, such as structured and unstructured data
Data lineage: The history of the data, including where...
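As a rough illustration of logical organization, the sketch below builds object keys that group data by layer, source, dataset, and date. The layer names, the `build_object_key` helper, and the example source and dataset names are all hypothetical; the `key=value` folder style is a common partitioning convention, not something this chapter prescribes.

```python
from datetime import date

# Hypothetical layer names, chosen for illustration only.
LAYERS = ("raw", "cleansed", "curated")


def build_object_key(layer: str, source: str, dataset: str,
                     event_date: date, filename: str) -> str:
    """Return an object key organized by layer, source, dataset, and date."""
    if layer not in LAYERS:
        raise ValueError(f"unknown layer: {layer!r}")
    # key=value partition folders let many query engines skip
    # partitions that a query does not need to read.
    partition = (
        f"year={event_date.year:04d}/"
        f"month={event_date.month:02d}/"
        f"day={event_date.day:02d}"
    )
    return f"{layer}/{source}/{dataset}/{partition}/{filename}"


key = build_object_key("raw", "webshop", "orders",
                       date(2023, 7, 4), "part-0000.parquet")
print(key)
```

Keeping the key layout in one small function like this makes it easier to apply the same convention consistently across every job that writes to the lake.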

Challenges and considerations when building a data lake on Amazon S3

When building a data lake, whether on Amazon S3 or elsewhere, be aware of the following challenges and considerations:

Data ingestion: The process of bringing data into a data lake can be challenging, particularly when the data comes from multiple sources with varying formats and structures. This can lead to difficulties in ensuring data quality and consistency. Additionally, handling large volumes of data can be a challenge, particularly as the data grows. Another issue is keeping schema changes consistent throughout all downstream applications.
Data governance: Maintaining data quality, security, and regulatory compliance can be difficult when dealing with a large volume of data in a data lake. Implementing policies and standards for data classification, quality, and retention, as well as managing access and permissions, including role-based access control (RBAC) and data encryption, can...
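The schema-change issue mentioned under data ingestion can be made concrete with a small check that compares an incoming schema against the one downstream applications expect. This is a minimal sketch in plain Python; the `diff_schema` helper and both example schemas are hypothetical, and a real pipeline would more likely read schemas from a data catalog or from file metadata.

```python
def diff_schema(expected: dict[str, str],
                incoming: dict[str, str]) -> dict[str, list[str]]:
    """Compare two column-name -> type mappings and report the differences."""
    added = sorted(set(incoming) - set(expected))
    removed = sorted(set(expected) - set(incoming))
    retyped = sorted(
        col for col in set(expected) & set(incoming)
        if expected[col] != incoming[col]
    )
    return {"added": added, "removed": removed, "retyped": retyped}


# Hypothetical schemas, for illustration only.
expected = {"order_id": "string", "amount": "double", "created_at": "timestamp"}
incoming = {"order_id": "string", "amount": "string", "channel": "string"}

drift = diff_schema(expected, incoming)
print(drift)
```

An ingestion job could run a check like this before writing new data and alert or quarantine the batch when any of the three lists is non-empty.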

Summary

In this chapter, we discussed what big data is and its characteristics, what a data lake is, and why we need data lakes. We then looked at how a data lake can be built on Amazon S3, with an overview of the benefits of data lakes, the different layers of a data lake, and the best practices for building one on Amazon S3. We also covered how to organize and manage the data within a data lake on S3, including file formats, partitions, S3 lifecycle management, and Amazon S3 Intelligent-Tiering. Finally, we discussed some challenges and considerations when building a data lake on Amazon S3, such as cost and performance.

In the next chapter, we will learn about AWS Glue, a data integration service that lets you bring in data from different data sources and perform extract, transform, and load (ETL) operations on it using Apache Spark and Python.


Authors (3)

Navnit Shukla

Navnit Shukla is an accomplished Senior Solution Architect with a specialization in AWS analytics. With an impressive career spanning 12 years, he has honed his expertise in databases and analytics, establishing himself as a trusted professional in the field. Currently based in Orange County, CA, Navnit's primary responsibility lies in assisting customers in building scalable, cost-effective, and secure data platforms on the AWS cloud.

Sankar M

Sankar Sundaram has worked in the IT industry since 2007, specializing in databases, data warehouses, and analytics. As a specialized Data Architect, he helps customers build and modernize data architectures, including secure, scalable, and performant data lake, database, and data warehouse solutions. Prior to joining AWS, he worked with multiple customers to implement complex data architectures.

Sampat Palani

Sam Palani has more than 18 years of experience as a developer, data engineer, data scientist, startup cofounder, and IT leader. He holds a master's in Business Administration with a dual specialization in Information Technology. His professional career spans 5 countries across the financial services, management consulting, and technology industries. He is currently Sr Leader for Machine Learning and AI at Amazon Web Services, where he is responsible for multiple lines of the business, product strategy, and thought leadership. Sam is also a practicing data scientist, a writer with multiple publications, a speaker at key industry conferences, and an active open source contributor. Outside work, he loves hiking, photography, experimenting with food, and reading.
