You're reading from Data Engineering with AWS - Second Edition

Product typeBook

Published inOct 2023

PublisherPackt

ISBN-139781804614426

Edition2nd Edition

Concepts

Data Engineering

Author (1)

Gareth Eagar

The challenges of ever-growing datasets

Organizations have many assets, such as physical assets, intellectual property, the knowledge of their employees, and trade secrets. But for too long, organizations did not fully recognize that they had another extremely valuable asset, and they failed to maximize the use of it – the vast quantities of data that they had gathered over time.

That is not to say that organizations ignored these data assets, but rather, due to the expense and complex nature of storing and managing this data, organizations tended to only analyze and keep a subset of their data.

Initially, data may have been stored in a single database, but as organizations and their data requirements grew, the number of databases exponentially increased. Today, with the modern application development approach of microservices, companies commonly have hundreds, or even thousands, of databases. Faced with many data silos, organizations invested in data warehousing systems that would enable them to ingest data from multiple siloed databases into a central location for analytics. But due to the expense of these systems, there were limitations on how much data could be stored, and some datasets would either be excluded or only aggregate data would be loaded into the data warehouse. Data would also only be kept for a limited period of time, as data storage for these systems was expensive, and therefore, it was not economical to keep historical data for long periods. There was also a lack of widely available tools and compute power to enable the analysis of extremely large, comprehensive datasets.

As organizations continued to grow, multiple data warehouses and data marts would be implemented for different business units or groups, and organizations still lacked a centralized, single-source-of-truth repository for their data. Organizations were also faced with new types of data, such as semi-structured or even unstructured data, and analyzing these datasets using traditional tools was a challenge.

As a result, new technologies were invented that were better able to work with very large datasets and different data types. Hadoop was a technology created in the early 2000s at Yahoo as part of a search engine project that wanted to index 1 billion web pages. Over the next few years, Hadoop, and the underlying MapReduce technology, became a popular way for all types of companies to store and process very large datasets. However, running a Hadoop cluster was a complex and expensive operation requiring specialized skills.

The next evolution for big data processing was the development of Spark (later taken on as an Apache project and now known as Apache Spark), a new processing framework for working with big data. Spark showed significant increases in performance when working with large datasets due to the fact that it did most processing in memory, significantly reducing the amount of reading and writing to and from disks. Today, Apache Spark is often regarded as the gold standard for processing large datasets and is used by a wide array of companies.

In parallel with the rise of Apache Spark as a popular big data processing tool was the rise of the concept of data lakes – an approach that uses low-cost object storage as a physical storage layer for a variety of data types, Apache Hive as a central catalog of all the datasets, and makes that data available for processing with a wide variety of tools, including Apache Spark. AWS’s own website uses the following definition of data lakes:

A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. You can store your data as-is, without having to first structure the data, and run different types of analytics—from dashboards and visualizations to big data processing, real-time analytics, and machine learning to guide better decisions.

Data lakes were great for centralizing data but were often run by a centralized team that would look to ingest data from all the data silos in an organization, transform and aggregate the data, and make it available centrally for use by other teams. While this was a definite improvement on having databases and data warehouses spread out with no central repository or governance, there was still room for improvement.

By centralizing all the data and having a single team manage this repository of data, the teams working to transform and extract additional value out of the data were often not the people most familiar with the business context behind the data.

To address this, Zhamak Dehghani (a data consultant working for Thoughtworks at the time) developed a new approach, which eventually became known as a data mesh. While we will cover a brief introduction to data mesh architecture here, we will cover this topic in more detail in Chapter 15, Implementing a Data Mesh Architecture.

With a data mesh architecture, the idea is to make the teams that generate the data responsible for creating an analytics version of the data, and then make that data easily accessible to the rest of the organization without needing to make multiple copies of the data.

By 2022, this concept had gained widespread appeal, and many companies were working to implement a data mesh approach for their data. Some took a limited view of what a data mesh was and considered it primarily a means of sharing data between teams without needing to physically move or copy the data. However, a full data mesh implementation went well beyond the technology of how to share data. A data mesh implementation meant a change to the processes of how operational data was converted into analytical data, and along with it, the personas that were responsible for the data. A data mesh implementation was not just a technical implementation but rather a change to the culture and operation of teams within an organization.

Whereas in many organizations, large-scale analytics had been done by a centralized team, with a data mesh approach, the team that owns an application that generates data must productize that data to make it available to the rest of the business. Much like DevOps had changed how development teams worked to create and support software, the data mesh approach meant that product teams needed to work differently in how they generated and shared analytical data. They would appoint a data product manager who would be responsible for the task of taking operational data and creating an analytics data product from that. For the product they created, they would take ownership of ensuring data quality, the freshness of data, communication around changes to the product (such as schema changes), and so on. They would effectively be product managers, outlining a roadmap for the data analytics product they would create, and they would be responsive to customer feedback on the data product.

With a data mesh, there would still be a central data team, but this team would be responsible for creating a centralized data platform, which all the different data product teams could then use. So rather than being data engineers that transformed data, the centralized team would focus on being data platform engineers, creating a standardized platform that met best practices. They would then make this platform available to individual development teams to use in order to create and share their own data analytic products.

Having looked at how data analytics became an essential tool in organizations, let’s now look at the roles that enable maximizing the value of data for a modern organization.

You have been reading a chapter from

Data Engineering with AWS - Second Edition

Published in: Oct 2023Publisher: PacktISBN-13: 9781804614426

A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.

undefined

Unlock this book and the full library FREE for 7 days

Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of

Start free trial

Renews at €14.99/month. Cancel anytime

Author (1)

Gareth Eagar

Gareth Eagar has over 25 years of experience in the IT industry, starting in South Africa, working in the United Kingdom for a while, and now based in the USA. Having worked at AWS since 2017, Gareth has broad experience with a variety of AWS services, and deep expertise around building data platforms on AWS. While Gareth currently works as a Solutions Architect, he has also worked in AWS Professional Services, helping architect and implement data platforms for global customers. Gareth frequently speaks on data related topics.
Read more about Gareth Eagar

Personalised recommendations for you

Based on your interests and search pattern

Et al.

Ever wonder why speech recognition systems don't understand the Scottish accent, or what would happen if an astronaut only ate mac 'n' cheese, or other spurious reflections you'd have at a bar? We did, then collated those deliberations into absurd research articles with fake figures and methodologies inspired by even more fictionally absurd studies.

BookAug 2023230 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages4

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages1

Generative AI with LangChain

This book is a comprehensive introduction to LLMs and LangChain, demystifying the basic mechanics of LangChain, its functionalities, and the myriad of applications it can be integrated into.

BookDec 2023360 pages5

Mastering Tableau 2023

This book is a comprehensive resource to mastering your Tableau skills and becoming a BI expert. As you progress, you will learn how to build advanced dashboards and improve your storytelling to derive key business insight, as well as make you well-versed with advanced functionalities of Tableau in the business intelligence domain.

BookAug 2023684 pages

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages5

Building AI Applications with ChatGPT APIs

This guide covers all ChatGPT API features for effortless creation of robust AI powered apps. With its help, you’ll be able to leverage ChatGPT’s cutting-edge NLP models to take your app development skills to the next level. You’ll also work on ten exciting projects that will give you the practical know-how that you can apply to your existing applications.

BookSep 2023258 pages2

Data Engineering with AWS

Embark on a journey to master data engineering pipelines on AWS! Our book offers a hands-on experience of AWS services for ingesting, transforming, and consuming data. Whether you're an absolute beginner or someone with basic data engineering experience, this guide is an indispensable resource.

BookOct 2023636 pages5

Modern Data Architecture on AWS

Every organization wants an agile, performant, and cost-effective data platform that meets all their current and future business needs. Purpose-built AWS analytics services and their features play a big part in building such a modern data platform. This book brings to you all the design and architectural patterns that’ll help you achieve this goal.

BookAug 2023420 pages5

Practical Guide to Applied Conformal Prediction in Python

Discover the power of Conformal Prediction with the "Practical Guide to Applied Conformal Prediction in Python." Master the latest techniques to quantify uncertainty in machine learning and computer vision models, and seamlessly apply them to your industry applications.

BookDec 2023240 pages

TinyML Cookbook

With over 70 project-based recipes, the TinyML Cookbook is a practical guide that will help you to get the most out of your microcontrollers. It provides a comprehensive understanding of the theoretical foundations while giving you hands-on experience training ML models for deployment on Arduino Nano 33 BLE Sense, Raspberry Pi Pico, and SparkFun RedBoard Artemis Nano microcontrollers.

BookNov 2023664 pages