Data warehouses and data marts

In an organization, it is not uncommon for day-to-day operations to be performed in, and recorded by, several transactional systems. However, when higher-level business decisions are to be made using the data gathered by these systems, it is easier to collate the necessary information from these sources into a centralized repository of datasets and extract actionable intelligence from it.

A data warehouse is a centralized repository of data that’s been gathered from various sources within an organization. The collated data within this repository is analyzed and can be used to make business decisions. A data mart, on the other hand, is a subset of a data warehouse aligned toward a specific business unit within an organization.

The concept of data warehouses was introduced in the late 1980s. Data warehouses are subject-oriented, integrated, time-variant, and non-volatile. This means that data warehouses are designed to make sense of data around specific subjects, such as sales, marketing, and HR, rather than around ongoing operations. They are also designed to integrate data from several different source systems, such as Enterprise Resource Planning (ERP), Human Resource Management Systems (HRMSs), Customer Relationship Management (CRM), Financial Management Systems (FMSs), and any other operational systems within an organization. The data within a data warehouse is usually structured, but it can be unstructured as well. Data warehouses also allow users to analyze data at different grains of time, such as year, month, and day. Finally, the data in a data warehouse is non-volatile and maintains history: changes in the source systems result in new entries in the data warehouse, so the new state of the data is used while the old state is preserved.
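
To make the time-variant, non-volatile behavior concrete, here is a minimal Python sketch of the append-instead-of-update pattern described above (commonly implemented as a type 2 slowly changing dimension). The field names and sample values are invented for illustration:

```python
from datetime import date

# History-preserving storage: a change closes the current row and appends a
# new one, so both the old and the new state remain queryable.
customer_history = [
    {"customer_id": 42, "city": "Sydney",
     "valid_from": date(2020, 1, 1), "valid_to": None},
]

def apply_change(history, customer_id, new_city, change_date):
    """Close the currently open row and append a new one instead of updating in place."""
    for row in history:
        if row["customer_id"] == customer_id and row["valid_to"] is None:
            row["valid_to"] = change_date  # the old state is preserved
    history.append({"customer_id": customer_id, "city": new_city,
                    "valid_from": change_date, "valid_to": None})

apply_change(customer_history, 42, "Melbourne", date(2022, 6, 1))
# customer_history now holds both the old and the new state of customer 42.
```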

In Inmon’s top-down data warehousing approach, data architects and modelers start by looking at the holistic data landscape of an organization and identifying its main subject areas and entities. Inmon’s data warehouse is normalized and avoids redundancy. This simplifies the data ingestion process, but the warehouse is not optimized for queries. Hence, data marts are built on top of the data warehouse, and users run their queries against these data marts.

In 1996, Ralph Kimball introduced the star schema methodology to the data management world. It follows a bottom-up approach: data marts are created based on business requirements instead of starting from an enterprise data warehouse. While data marts can be based on a star or a snowflake schema, the star schema is generally preferred because it results in faster queries due to fewer joins.
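
As an illustration, here is a minimal, self-contained star-schema sketch in Python, using SQLite from the standard library; the table and column names are assumptions made for this example, not the book's dataset. Note how an analytical query needs only one join per dimension:

```python
import sqlite3

# Build a tiny star schema in memory: one fact table surrounded by dimensions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT, category TEXT);
CREATE TABLE dim_region  (region_id  INTEGER PRIMARY KEY, country TEXT, region_name TEXT);
-- The central fact table holds the measures plus one foreign key per dimension.
CREATE TABLE fact_sales (
    sale_id      INTEGER,
    product_id   INTEGER REFERENCES dim_product(product_id),
    region_id    INTEGER REFERENCES dim_region(region_id),
    quantity     INTEGER,
    sales_amount REAL
);
INSERT INTO dim_product VALUES (1, 'coffee', 'beverages');
INSERT INTO dim_region  VALUES (1, 'Australia', 'APAC');
INSERT INTO fact_sales  VALUES (100, 1, 1, 2, 10.0);
""")

# A typical star-schema query: one join per dimension, then an aggregate.
query = """
SELECT r.region_name, p.category, SUM(f.sales_amount) AS total_sales
FROM fact_sales f
JOIN dim_region  r ON f.region_id  = r.region_id
JOIN dim_product p ON f.product_id = p.product_id
GROUP BY r.region_name, p.category;
"""
print(conn.execute(query).fetchall())  # [('APAC', 'beverages', 10.0)]
```

A snowflake schema would further normalize the dimensions (for example, splitting category out of dim_product into its own table), trading extra joins for less redundancy.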

In a data mart, data can be stored at multiple levels of detail, and the table at the appropriate level is picked for processing a given query. The level of detail at which the facts in a table are defined is known as the grain, or granularity, of the table.

For example, let’s consider a retail sales dataset for a store chain operating in different countries. A customer could buy several products in a single sale, and the same customer could buy multiple units of the same product within that sale. We can also have a table containing region information that can be linked to the sales and product tables.

So, while selecting a grain, it is beneficial to populate the fact table at the most atomic grain. This allows us to be as granular as we want with the information we query. If we define the grain at the sales transaction level, we can query individual sales transactions and get information such as the amount per sale, the payment method, and so on; however, we won’t be able to get the product information for a particular sale. To mitigate this, let’s say we define the grain at the level of a product within a sales transaction. Then we can query product-related information along with the sales information.
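
The following Python sketch contrasts the two grains; the sample rows are invented. The coarser transaction grain can only answer per-sale questions, while the finer product-per-transaction grain can be rolled up to answer both:

```python
from collections import defaultdict

# Grain 1: one row per sales transaction. Product detail is already lost.
transaction_grain = [
    {"sale_id": 1, "payment_method": "card", "sales_amount": 40.0},
]

# Grain 2: one row per product within a sales transaction.
line_item_grain = [
    {"sale_id": 1, "product": "coffee", "quantity": 2, "sales_amount": 10.0},
    {"sale_id": 1, "product": "mug", "quantity": 1, "sales_amount": 30.0},
]

# Per-sale totals are recoverable by rolling up the finer grain ...
per_sale = defaultdict(float)
for row in line_item_grain:
    per_sale[row["sale_id"]] += row["sales_amount"]
print(dict(per_sale))  # {1: 40.0}
# ... but per-product detail can never be recovered from the coarser grain.
```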

Tables at coarser grains are, in effect, pre-computed aggregates of the atomic fact table, and these different levels of pre-computation help us avoid heavy computation at query time. For example, if a user is querying for sales_amount at the region level, it is far cheaper to select the data from a table that already contains the sales_amount and region columns than to aggregate the atomic fact table on the fly.
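
As a sketch of this idea (with invented sample data), the region-level table below is built once at load time, for example by the ETL job, which turns the region query into a cheap lookup:

```python
# Atomic fact rows, one per product per sale (sample data).
line_items = [
    {"region": "EMEA", "sales_amount": 120.0},
    {"region": "EMEA", "sales_amount": 80.0},
    {"region": "APAC", "sales_amount": 200.0},
]

# Load time: pre-compute the coarser-grained, region-level table.
region_sales = {}
for row in line_items:
    region_sales[row["region"]] = region_sales.get(row["region"], 0.0) + row["sales_amount"]

# Query time: a direct lookup instead of a scan-and-aggregate.
print(region_sales["EMEA"])  # 200.0
```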

As we can see, data marts are helpful for working with datasets related to a specific context or business line. However, a centralized data warehouse is beneficial when our analysis needs data aggregated from a variety of sources across the organization to extract actionable intelligence.

A fresh approach to data warehousing came with the introduction of data vaults. This is a hybrid approach that combines the best of the normalized model and the denormalized star schema. This way of modeling data can be quite helpful when working with multi-source systems or data sources whose relationships change constantly, as it makes it easier to ingest data from multiple sources. Also, because of the way the data is modeled, data vaults make it easier to audit and track data.
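
Data vault models are conventionally built from hubs (business keys), links (relationships between hubs), and satellites (the changing descriptive attributes). Here is a rough Python/SQLite sketch; the table and column names are illustrative assumptions:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Hubs hold stable business keys, stamped with load time and source.
CREATE TABLE hub_customer (customer_hk TEXT PRIMARY KEY, customer_id TEXT,
                           load_ts TEXT, record_source TEXT);
CREATE TABLE hub_product  (product_hk TEXT PRIMARY KEY, product_id TEXT,
                           load_ts TEXT, record_source TEXT);
-- A link records a relationship (here, a purchase) between hubs, so a
-- changing relationship only adds link rows and never rewrites the hubs.
CREATE TABLE link_purchase (
    purchase_hk TEXT PRIMARY KEY,
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    product_hk  TEXT REFERENCES hub_product(product_hk),
    load_ts TEXT, record_source TEXT
);
-- Satellites are insert-only: every change adds a row stamped with its
-- load time and source, which is what makes auditing and tracking easy.
CREATE TABLE sat_customer (
    customer_hk TEXT REFERENCES hub_customer(customer_hk),
    name TEXT, address TEXT, load_ts TEXT, record_source TEXT,
    PRIMARY KEY (customer_hk, load_ts)
);
""")
```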

Data must be transformed before it can be loaded into a data warehouse. This creates entry barriers and delays in delivering value to customers. Generally, organizations have multiple sources of data, and all of them must be imported into the data warehouse before the data can be used to make business decisions, or even before it is known whether the data adds value. If the user later discovers that combining the data from certain sources does not deliver the value that was initially expected, time and resources have been wasted. Also, it is not always possible to forecast analytical requirements in a world where businesses have to constantly evolve to stay relevant. What happens if a business user needs historical data that isn’t available in the data warehouse? Around 2015, data lakes emerged to solve these problems.
