Chapter 11: Building a Production Data Pipeline

In this chapter, you will build a production data pipeline using the features and techniques that you have learned in this section of the book. The data pipeline will be broken into processor groups, each of which performs a single task. Those groups will be version controlled, and they will use the NiFi variable registry so that they can be deployed to a production environment.

In this chapter, we're going to cover the following main topics:

  • Creating a test and production environment
  • Building a production data pipeline
  • Deploying a data pipeline in production

Creating a test and production environment

In this chapter, we will return to using PostgreSQL for both the extraction and loading of data. The data pipeline will require test and production environments, each of which will have a staging table and a warehouse table. To create the databases and tables, you will use PgAdmin4.

Creating the databases

To use PgAdmin4, perform the following steps:

  1. Browse to http://localhost/pgadmin4/, enter your username and password, and then click the Login button. Once logged in, expand the server icon in the left panel.
  2. To create the databases, right-click on the databases icon and select Create | Database. Name the database test.
  3. Next, you will need to add the tables. To create the staging table, right-click on Tables and select Create | Table. On the General tab, name the table staging. Then, select the Columns tab. Using the plus sign, create the fields shown in the following screenshot (a scripted alternative is sketched after the figure):

    Figure 11.1 – The columns used in the staging...
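
If you prefer to script this setup rather than click through PgAdmin4, the following sketch creates the same objects with psycopg2. The column list is only a placeholder, since the exact fields come from Figure 11.1, and the connection credentials are assumptions; substitute your own. Repeat the same steps for the production database.

    import psycopg2

    # CREATE DATABASE cannot run inside a transaction, so connect to the
    # default database first and turn on autocommit.
    conn = psycopg2.connect(host="localhost", database="postgres",
                            user="postgres", password="postgres")  # assumed credentials
    conn.autocommit = True
    with conn.cursor() as cur:
        cur.execute("CREATE DATABASE test")
    conn.close()

    # Placeholder columns -- replace them with the fields shown in Figure 11.1.
    TABLE_DDL = """
    CREATE TABLE IF NOT EXISTS {name} (
        id   INTEGER,
        name TEXT,
        age  INTEGER
    )
    """

    conn = psycopg2.connect(host="localhost", database="test",
                            user="postgres", password="postgres")
    with conn, conn.cursor() as cur:
        cur.execute(TABLE_DDL.format(name="staging"))
        cur.execute(TABLE_DDL.format(name="warehouse"))
    conn.close()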

Building a production data pipeline

The data pipeline you build will do the following:

  • Read files from the data lake.
  • Insert the files into staging.
  • Validate the staging data.
  • Move staging to the warehouse.

The final data pipeline will look like the following screenshot:

Figure 11.3 – The final version of the data pipeline

We will build the data pipeline processor group by processor group. The first processor group will read the data lake.

Reading the data lake

In the first section of this book, you read files with NiFi, and you will do the same here. This processor group will consist of three processors – GetFile, EvaluateJsonPath, and UpdateCounter – and an output port. Drag the processors and the port to the canvas. In the following sections, you will configure them.
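
Before configuring the processors, it helps to picture what this group does. The following plain-Python sketch mirrors the same logic; it is only an illustration, and the data lake folder path and JSON field names are assumptions for the example, not the book's actual values.

    import json
    from pathlib import Path

    DATA_LAKE = Path("/path/to/datalake")  # assumed location of your data lake folder

    records_read = 0  # plays the role of the UpdateCounter processor

    for file in DATA_LAKE.glob("*.json"):   # GetFile: pick up each file in the folder
        with file.open() as f:
            record = json.load(f)
        # EvaluateJsonPath: pull the fields of interest out of the JSON document
        name = record.get("name")           # assumed field
        age = record.get("age")             # assumed field
        records_read += 1                   # UpdateCounter: track how many records passed through

    print(f"Read {records_read} records from the data lake")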

GetFile

The GetFile processor reads files from a folder, in this case, our data lake. If you were reading a data lake in Hadoop, you would...

Deploying a data pipeline in production

In the previous chapter, you learned how to deploy a data pipeline to production, so I will not go into great depth here, but merely provide a review. To put the new data pipeline into production, perform the following steps:

  1. Browse to your production NiFi instance. I have another instance of NiFi running on port 8080 on localhost.
  2. Drag and drop a processor group onto the canvas and select Import. Choose the latest version of each of the processor groups you just built.
  3. Modify the variables on the processor groups to point to the production database. The table names can stay the same.

You can then run the data pipeline, and you will see the data populate the staging and warehouse tables in the production database.
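
A quick way to confirm that the run worked is to count the rows that landed in each table. This is a minimal sketch, assuming the production database is named production and using placeholder credentials.

    import psycopg2

    conn = psycopg2.connect(host="localhost", database="production",
                            user="postgres", password="postgres")  # assumed credentials
    with conn, conn.cursor() as cur:
        for table in ("staging", "warehouse"):
            cur.execute(f"SELECT COUNT(*) FROM {table}")  # row count per table
            print(table, cur.fetchone()[0])
    conn.close()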

The data pipeline you just built read files from a data lake, put them into a database table, ran a query to validate the table, and then inserted them into the warehouse. You could have built this data pipeline with a handful...

Summary

In this chapter, you learned how to build and deploy a production data pipeline. You learned how to create TEST and PRODUCTION environments and built the data pipeline in TEST. You used the filesystem as a sample data lake and learned how you would read files from the lake and monitor them as they were processed. Instead of loading data into the data warehouse, this chapter taught you how to use a staging database to hold the data so that it could be validated before being loaded into the data warehouse. Using Great Expectations, you were able to build a validation processor group that would scan the staging database to determine whether the data was ready to be loaded into the data warehouse. Lastly, you learned how to deploy the data pipeline into PRODUCTION. With these skills, you can now fully build, test, and deploy production batch data pipelines.
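
To make the validation step concrete, a minimal Great Expectations check of the staging table might look like the sketch below. It assumes Great Expectations' classic pandas-based API, placeholder credentials, and example columns (name, age); your expectation suite should match your actual staging fields.

    import great_expectations as ge
    import pandas as pd
    import psycopg2

    # Pull the staging rows and wrap them in a Great Expectations dataset.
    conn = psycopg2.connect(host="localhost", database="test",
                            user="postgres", password="postgres")  # assumed credentials
    df = ge.from_pandas(pd.read_sql("SELECT * FROM staging", conn))
    conn.close()

    # Example expectations -- the real suite depends on your staging columns.
    df.expect_column_values_to_not_be_null("name")
    df.expect_column_values_to_be_between("age", min_value=0, max_value=120)

    result = df.validate()
    print("Ready for the warehouse:", result["success"])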

In the next chapter, you will learn how to build Apache Kafka clusters. Using Kafka, you will begin to learn how to process...
