Fundamentals of Analytics Engineering

Product type: Book
Published: March 2024
Publisher: Packt
ISBN-13: 9781837636457
Pages: 332
Edition: 1st
Authors (7): Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic, Juan Manuel Perafan, Lasse Benninga, Ricardo Angel Granados Lopez, Taís Laurindo Pereira

Table of Contents (23 chapters)

Preface
Prologue
Part 1: Introduction to Analytics Engineering
Chapter 1: What Is Analytics Engineering?
Chapter 2: The Modern Data Stack
Part 2: Building Data Pipelines
Chapter 3: Data Ingestion
Chapter 4: Data Warehousing
Chapter 5: Data Modeling
Chapter 6: Transforming Data
Chapter 7: Serving Data
Part 3: Hands-On Guide to Building a Data Platform
Chapter 8: Hands-On Analytics Engineering
Part 4: DataOps
Chapter 9: Data Quality and Observability
Chapter 10: Writing Code in a Team
Chapter 11: Automating Workflows
Part 5: Data Strategy
Chapter 12: Driving Business Adoption
Chapter 13: Data Governance
Chapter 14: Epilogue
Index
Other Books You May Enjoy

Data Ingestion

A data platform is useless without any actual data in it. To access your data, combine it with other sources, enrich it, or share it across an organization, you will first need to get that data into your data platform. This is the process we call data ingestion. Data ingestion comes in all sorts and forms. Everyone is familiar with the age-old process of emailing Excel sheets back and forth, but luckily, there are more advanced and consistent ways of adding data to your platform.

Whether clicking your way through a managed ingestion tool such as Fivetran, Stitch, or Airbyte, or writing scripts to handle the parallel processing of multiple real-time data streams in a distributed system such as Spark, learning the steps of a data ingestion pipeline will help you build robust solutions. Building such a solution will help you guarantee the quality of your data, keep your stakeholders happy, and allow you to spend less time debugging and fixing broken code and more time...
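
To make the script route concrete, here is a minimal sketch of a hand-rolled ingestion step in Python. The API endpoint, response shape, and landing directory are illustrative assumptions, not the interface of any specific tool mentioned above:

import json
from datetime import datetime, timezone
from pathlib import Path

import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical source endpoint
LANDING_DIR = Path("landing/orders")           # hypothetical raw landing zone

def extract(since: str) -> list:
    """Pull all records changed since the given timestamp."""
    response = requests.get(API_URL, params={"updated_after": since}, timeout=30)
    response.raise_for_status()
    return response.json()["results"]  # assumed response shape

def load(records: list) -> Path:
    """Write the raw records to the landing zone, one file per run."""
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    run_stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = LANDING_DIR / f"orders_{run_stamp}.json"
    target.write_text(json.dumps(records))
    return target

if __name__ == "__main__":
    rows = extract(since="2024-01-01T00:00:00Z")
    print(f"Landed {len(rows)} records in {load(rows)}")

Even a script this small already encodes several of the decisions the rest of this chapter discusses: where raw data lands, how a run is identified, and what counts as "new" data.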

Digging into the problem of moving data between two systems

We talked in Chapter 1, What Is Analytics Engineering?, about the changing process of extracting, transforming, and loading (ETL) data, but understanding these steps is only part of ingesting data. Whenever you add new data to your data platform, whether that is sales data, currency exchange data, web analytics data, or video footage, you will have to make choices around the frequency, quality, reliability, and retention of that data, among many other things. If you do not think ahead at the beginning, reality will catch up with you when your data provider makes a change, a pipeline accidentally runs twice, or the requirements from the business change. But why do we need to move data from one system to the other in the first place?
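
One practical way to think ahead is to write those choices down explicitly before the first run. The Python dictionary below is purely illustrative; the keys and values are assumptions, not a prescribed format:

INGESTION_CONFIG = {
    "source": "crm_api",              # which system the data comes from
    "frequency": "daily",             # how often do we pull new data?
    "load_strategy": "incremental",   # full reload or only changed records?
    "quality_checks": ["non_empty", "unique_primary_key"],
    "retention_days": 365,            # how long do we keep raw extracts?
    "on_failure": "alert_data_team",  # who hears about it when a run breaks?
}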

You might consider it a given that you have to manipulate data a bit to make it fit your purpose. Maybe you’re used to pivoting tables in an Excel sheet, adjusting formulas...

Understanding the eight essential steps of a data ingestion pipeline

It goes without saying that every data ingestion pipeline is a unique snowflake, special to your organization and requirements. Nevertheless, every pipeline shares a few common characteristics that are essential to setting up a long-term process of moving data from source to destination. We showed in Chapter 1 how essential the process of ELT is to analytics engineering. The data ingestion pipeline is where that process takes place so that, afterward, the data can be used in the data platform.

When talking about ETL, it is easy to say that data ingestion is just those three steps. But behind the acronym lies a far more complex process. Yes, sometimes that process can be as simple as a few clicks in a nice interface, but other times, especially when the origin of the data is unique to your organization, you will have to create a custom pipeline or integration and deal with the additional complexity that comes with it....
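
As a taste of that hidden complexity, consider just the extract step. Even against a hypothetical, well-behaved HTTP API (the endpoint and response shape below are assumptions), you already need to handle pagination and an incremental bookmark:

import requests

API_URL = "https://api.example.com/v1/leads"  # hypothetical paginated endpoint

def extract_all(since: str, page_size: int = 100) -> list:
    """Pull every record changed since the bookmark, page by page."""
    records = []
    page = 1
    while True:
        response = requests.get(
            API_URL,
            params={"updated_after": since, "page": page, "per_page": page_size},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json()["results"]  # assumed response shape
        records.extend(batch)
        if len(batch) < page_size:          # last page reached
            break
        page += 1
    return records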

Managing the quality and scalability of data ingestion pipelines – the three key topics

Apart from going through the steps of setting up a data ingestion pipeline, there are three important topics that are relevant to every step of the pipeline.

Scalability and resilience

As the load on your pipeline increases over time, it comes under more and more pressure to keep up in terms of performance. Even though you might start with a sequential, single-threaded program, as is common when writing in Python, over time you might want to consider turning parts of your pipeline into loosely coupled functions that can scale independently. For example, the extraction might happen on a single machine that you have to increase in size over time, while the transformation and loading steps scale dynamically with serverless functions depending on the load.
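
A hedged sketch of what that decoupling might look like in Python follows; the bucket name and handler signature are illustrative only, and the upload and serverless entry point are replaced by print statements since the actual services depend on your platform:

def extract_to_storage(records: list, bucket: str, key: str) -> dict:
    """Runs on a single (possibly larger) machine and only lands raw data."""
    # stand-in for an object-store upload, for example boto3's put_object
    print(f"uploading {len(records)} records to {bucket}/{key}")
    return {"bucket": bucket, "key": key}

def transform_and_load(event: dict) -> None:
    """Serverless-style handler: one invocation per landed file, scales with load."""
    print(f"transforming and loading {event['bucket']}/{event['key']}")

if __name__ == "__main__":
    event = extract_to_storage([{"id": 1}], bucket="raw-zone", key="leads/2024-03-01.json")
    transform_and_load(event)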

In any case, you will have to implement some sort of error or exception handling to be able to...
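
A common minimal pattern for such error handling is to retry transient failures with a backoff and to let the final failure surface to the scheduler. A sketch, assuming an HTTP source and the requests library:

import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingestion")

def fetch_with_retries(url: str, attempts: int = 3, backoff_seconds: float = 5.0) -> dict:
    """Retry transient failures and surface a clear error when all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the scheduler mark the run as failed
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return {}  # unreachable, kept for type checkers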

Working with data ingestion – an example pipeline

Let’s look at our data ingestion steps in practice. Assume that we do analytics for a factory specializing in Dutch delicacies: Stroopwafels. The CEO of this patisserie paradise has requested better insights into the effectiveness of providing Stroopwafel samples to potential customers. To answer their questions, we need to do the following:

  1. Understand which potential customers (leads) have received samples. This data is available in a CRM tool where data from offline events and online requests is captured.
  2. Understand whether these potential customers have purchased more than once. This data is only available in our highly secure, on-premises enterprise resource planning (ERP) tool.

We will go through the steps to get data from both systems.
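
Before walking through those steps, it helps to picture the end state: once the CRM leads and the ERP orders both land in the platform, the CEO's question reduces to a simple join. The field names below are made up purely for illustration, as neither schema is given here:

crm_leads = [
    {"lead_id": 1, "email": "a@example.com", "received_sample": True},
    {"lead_id": 2, "email": "b@example.com", "received_sample": False},
]
erp_orders = [
    {"email": "a@example.com", "order_id": 101},
    {"email": "a@example.com", "order_id": 102},
]

# Count orders per customer, then flag leads who bought more than once.
orders_per_email = {}
for order in erp_orders:
    orders_per_email[order["email"]] = orders_per_email.get(order["email"], 0) + 1

for lead in crm_leads:
    repeat_buyer = orders_per_email.get(lead["email"], 0) > 1
    print(lead["email"], "got sample:", lead["received_sample"], "repeat buyer:", repeat_buyer)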

Trigger

We have discussed with the CEO that daily updates are enough for the insights. We already have a scheduling tool such as Airflow available and will...
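
A minimal sketch of such a daily trigger, assuming Airflow 2.4 or later and a hypothetical ingestion callable:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_crm_leads() -> None:
    print("extracting CRM leads")  # stand-in for the actual extraction logic

with DAG(
    dag_id="stroopwafel_sample_ingestion",
    start_date=datetime(2024, 3, 1),
    schedule="@daily",  # daily updates are enough, per the CEO
    catchup=False,
):
    PythonOperator(task_id="ingest_crm_leads", python_callable=ingest_crm_leads)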

Summary

In this chapter, we looked at why we need to ingest data in the first place and at the different steps needed to create a data ingestion pipeline that is robust and reliable. We learned that there are eight essential steps to ingesting data, which can be covered both by off-the-shelf ETL tools and by custom scripts, depending on the specific needs of your data ingestion. We also learned that to guarantee the long-term quality of your data ingestion pipeline, you need to consider three key topics: scalability and resilience; monitoring, logging, and alerting; and governance.

With this knowledge, you should be able to capture the data you need from a source system. In the next chapter, we will look at how to load and use this data in your data warehouse and how to pick one for your needs.
