Fundamentals of Analytics Engineering

Product type: Book
Published: March 2024
Publisher: Packt
ISBN-13: 9781837636457
Pages: 332
Edition: 1st
Authors (7): Dumky De Wilde, Fanny Kassapian, Jovan Gligorevic, Juan Manuel Perafan, Lasse Benninga, Ricardo Angel Granados Lopez, Taís Laurindo Pereira

Table of Contents (23 chapters)

Preface
Prologue
Part 1: Introduction to Analytics Engineering
Chapter 1: What Is Analytics Engineering?
Chapter 2: The Modern Data Stack
Part 2: Building Data Pipelines
Chapter 3: Data Ingestion
Chapter 4: Data Warehousing
Chapter 5: Data Modeling
Chapter 6: Transforming Data
Chapter 7: Serving Data
Part 3: Hands-On Guide to Building a Data Platform
Chapter 8: Hands-On Analytics Engineering
Part 4: DataOps
Chapter 9: Data Quality and Observability
Chapter 10: Writing Code in a Team
Chapter 11: Automating Workflows
Part 5: Data Strategy
Chapter 12: Driving Business Adoption
Chapter 13: Data Governance
Chapter 14: Epilogue
Index
Other Books You May Enjoy

Data Ingestion

A data platform is useless without any actual data in it. To access your data, combine it with other sources, enrich it, or share it across an organization, you will first need to get that data into your data platform. This is the process we call data ingestion. Data ingestion comes in all sorts and forms. Everyone is familiar with the age-old process of emailing Excel sheets back and forth, but luckily, there are more advanced and consistent ways of adding data to your platform.

Whether clicking your way through a managed ingestion tool such as Fivetran, Stitch, or Airbyte, or writing scripts to handle the parallel processing of multiple real-time data streams in a distributed system such as Spark, learning the steps of a data ingestion pipeline will help you build robust solutions. Building such a solution will help you guarantee the quality of your data, keep your stakeholders happy, and allow you to spend less time debugging and fixing broken code and more time...
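
To make the script route concrete, here is a minimal sketch of a hand-rolled ingestion step in Python. The API endpoint, response shape, and landing directory are illustrative assumptions, not the interface of any specific tool mentioned above:

import json
from datetime import datetime, timezone
from pathlib import Path

import requests

API_URL = "https://api.example.com/v1/orders"  # hypothetical source endpoint
LANDING_DIR = Path("landing/orders")           # hypothetical raw landing zone

def extract(since: str) -> list:
    """Pull all records changed since the given timestamp."""
    response = requests.get(API_URL, params={"updated_after": since}, timeout=30)
    response.raise_for_status()
    return response.json()["results"]  # assumed response shape

def load(records: list) -> Path:
    """Write the raw records to the landing zone, one file per run."""
    LANDING_DIR.mkdir(parents=True, exist_ok=True)
    run_stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    target = LANDING_DIR / f"orders_{run_stamp}.json"
    target.write_text(json.dumps(records))
    return target

if __name__ == "__main__":
    rows = extract(since="2024-01-01T00:00:00Z")
    print(f"Landed {len(rows)} records in {load(rows)}")

Even a script this small already encodes several of the decisions the rest of this chapter discusses: where raw data lands, how a run is identified, and what counts as "new" data.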

Digging into the problem of moving data between two systems

We talked in Chapter 1, What Is Analytics Engineering?, about the changing process of extracting, transforming, and loading (ETL) data, but understanding these steps is only part of ingesting data. Whenever you add new data to your data platform, whether that is sales data, currency exchange data, web analytics data, or video footage, you will have to make choices around the frequency, quality, reliability, and retention of that data, among many other things. If you do not think ahead at the beginning, reality will catch up with you when your data provider makes a change, a pipeline accidentally runs twice, or the requirements from the business change. But why do we need to move data from one system to the other in the first place?
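
One practical way to think ahead is to write those choices down explicitly before the first run. The Python dictionary below is purely illustrative; the keys and values are assumptions, not a prescribed format:

INGESTION_CONFIG = {
    "source": "crm_api",              # which system the data comes from
    "frequency": "daily",             # how often do we pull new data?
    "load_strategy": "incremental",   # full reload or only changed records?
    "quality_checks": ["non_empty", "unique_primary_key"],
    "retention_days": 365,            # how long do we keep raw extracts?
    "on_failure": "alert_data_team",  # who hears about it when a run breaks?
}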

You might consider it a given that you have to manipulate data a bit to make it fit your purpose. Maybe you’re used to pivoting tables in an Excel sheet, adjusting formulas...

Understanding the eight essential steps of a data ingestion pipeline

It goes without saying that every data ingestion pipeline is a unique snowflake, special to your organization and requirements. Nevertheless, every pipeline shares a few common characteristics that are essential to setting up a long-term process of moving data from source to destination. We showed in Chapter 1 how essential the process of ELT is to analytics engineering. The data ingestion pipeline is where that process takes place so that, afterward, the data can be used in the data platform.

When talking about ETL, it is easy to say that data ingestion is just those three steps. But behind the acronym lies a far more complex process. Yes, sometimes that process can be as simple as a few clicks in a nice interface, but other times, especially when the origin of the data is unique to your organization, you will have to create a custom pipeline or integration and deal with the additional complexity that comes with it....
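
As a taste of that hidden complexity, consider just the extract step. Even against a hypothetical, well-behaved HTTP API (the endpoint and response shape below are assumptions), you already need to handle pagination and an incremental bookmark:

import requests

API_URL = "https://api.example.com/v1/leads"  # hypothetical paginated endpoint

def extract_all(since: str, page_size: int = 100) -> list:
    """Pull every record changed since the bookmark, page by page."""
    records = []
    page = 1
    while True:
        response = requests.get(
            API_URL,
            params={"updated_after": since, "page": page, "per_page": page_size},
            timeout=30,
        )
        response.raise_for_status()
        batch = response.json()["results"]  # assumed response shape
        records.extend(batch)
        if len(batch) < page_size:          # last page reached
            break
        page += 1
    return records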

Managing the quality and scalability of data ingestion pipelines – the three key topics

Apart from going through the steps of setting up a data ingestion pipeline, there are three important topics that are relevant to every step of the pipeline.

Scalability and resilience

As the load on your pipeline increases over time, it comes under more and more pressure to keep up in terms of performance. Even though you might start with a sequential, single-threaded program, as is common when writing in Python, over time you might want to consider turning parts of your pipeline into loosely coupled functions that can scale independently. For example, the extraction might happen on a single machine that you have to increase in size over time, while the transformation and loading steps scale dynamically with serverless functions depending on the load.
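
A hedged sketch of what that decoupling might look like in Python follows; the bucket name and handler signature are illustrative only, and the upload and serverless entry point are replaced by print statements since the actual services depend on your platform:

def extract_to_storage(records: list, bucket: str, key: str) -> dict:
    """Runs on a single (possibly larger) machine and only lands raw data."""
    # stand-in for an object-store upload, for example boto3's put_object
    print(f"uploading {len(records)} records to {bucket}/{key}")
    return {"bucket": bucket, "key": key}

def transform_and_load(event: dict) -> None:
    """Serverless-style handler: one invocation per landed file, scales with load."""
    print(f"transforming and loading {event['bucket']}/{event['key']}")

if __name__ == "__main__":
    event = extract_to_storage([{"id": 1}], bucket="raw-zone", key="leads/2024-03-01.json")
    transform_and_load(event)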

In any case, you will have to implement some sort of error or exception handling to be able to...
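
A common minimal pattern for such error handling is to retry transient failures with a backoff and to let the final failure surface to the scheduler. A sketch, assuming an HTTP source and the requests library:

import logging
import time

import requests

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("ingestion")

def fetch_with_retries(url: str, attempts: int = 3, backoff_seconds: float = 5.0) -> dict:
    """Retry transient failures and surface a clear error when all attempts fail."""
    for attempt in range(1, attempts + 1):
        try:
            response = requests.get(url, timeout=30)
            response.raise_for_status()
            return response.json()
        except requests.RequestException as exc:
            logger.warning("attempt %d/%d failed: %s", attempt, attempts, exc)
            if attempt == attempts:
                raise  # give up and let the scheduler mark the run as failed
            time.sleep(backoff_seconds * attempt)  # simple linear backoff
    return {}  # unreachable, kept for type checkers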

Working with data ingestion – an example pipeline

Let’s look at our data ingestion steps in practice. Assume that we do analytics for a factory specializing in Dutch delicacies: Stroopwafels. The CEO of this patisserie paradise has requested better insights into the effectiveness of providing Stroopwafel samples to potential customers. To answer their questions, we need to do the following:

  1. Understand which potential customers (leads) have received samples. This data is available in a CRM tool where data from offline events and online requests is captured.
  2. Understand whether these potential customers have purchased more than once. This data is only available in our highly secure, on-premises enterprise resource planning (ERP) tool.

We will go through the steps to get data from both systems.
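
Before walking through those steps, it helps to picture the end state: once the CRM leads and the ERP orders both land in the platform, the CEO's question reduces to a simple join. The field names below are made up purely for illustration, as neither schema is given here:

crm_leads = [
    {"lead_id": 1, "email": "a@example.com", "received_sample": True},
    {"lead_id": 2, "email": "b@example.com", "received_sample": False},
]
erp_orders = [
    {"email": "a@example.com", "order_id": 101},
    {"email": "a@example.com", "order_id": 102},
]

# Count orders per customer, then flag leads who bought more than once.
orders_per_email = {}
for order in erp_orders:
    orders_per_email[order["email"]] = orders_per_email.get(order["email"], 0) + 1

for lead in crm_leads:
    repeat_buyer = orders_per_email.get(lead["email"], 0) > 1
    print(lead["email"], "got sample:", lead["received_sample"], "repeat buyer:", repeat_buyer)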

Trigger

We have discussed with the CEO that daily updates are enough for the insights. We already have a scheduling tool such as Airflow available and will...
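
A minimal sketch of such a daily trigger, assuming Airflow 2.4 or later and a hypothetical ingestion callable:

from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def ingest_crm_leads() -> None:
    print("extracting CRM leads")  # stand-in for the actual extraction logic

with DAG(
    dag_id="stroopwafel_sample_ingestion",
    start_date=datetime(2024, 3, 1),
    schedule="@daily",  # daily updates are enough, per the CEO
    catchup=False,
):
    PythonOperator(task_id="ingest_crm_leads", python_callable=ingest_crm_leads)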

Summary

In this chapter, we looked at why we need to ingest data in the first place and at the different steps needed to create a data ingestion pipeline that is robust and reliable. We learned that there are eight essential steps to ingesting data, which can be covered both by off-the-shelf ETL tools and by custom scripts, depending on the specific needs of your data ingestion. We also learned that to guarantee the long-term quality of your data ingestion pipeline, you need to consider three key topics: scalability and resilience; monitoring, logging, and alerting; and governance.

With this knowledge, you should be able to capture the data you need from a source system. In the next chapter, we will look at how to load and use this data in your data warehouse and how to pick one for your needs.
