Chapter 3: Reading and Writing Files

In the previous chapter, we looked at how to install various tools, such as NiFi, Airflow, PostgreSQL, and Elasticsearch. In this chapter, you will learn how to use these tools. One of the most basic tasks in data engineering is moving data from a text file to a database. In this chapter, you will read data from, and write data to, several text-based formats, such as CSV and JSON.

In this chapter, we're going to cover the following main topics:

  • Reading and writing files in Python
  • Processing files in Airflow
  • NiFi processors for handling files
  • Reading and writing data to databases in Python
  • Databases in Airflow
  • Database processors in NiFi

Writing and reading files in Python

The title of this section may sound strange, as you are probably used to seeing it written as reading and writing, but in this section, you will write data to files first and then read it back. By writing the data yourself, you will understand its structure and know exactly what it is you are trying to read.

To write data, you will use a library named faker. faker allows you to easily create fake data for common fields. You can generate an address by simply calling address(), or a female name using name_female(). This will simplify the creation of fake data while at the same time making it more realistic.

To install faker, you can use pip:

pip3 install faker
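
Before writing any files, you can check what faker produces by calling a few of its methods. The following is a minimal sketch; the fields printed here are arbitrary examples:

from faker import Faker

fake = Faker()

# Print a few fake records to see what faker generates
for _ in range(3):
    print(fake.name_female())
    print(fake.address())
    print(fake.email())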

With faker now installed, you are ready to start writing files. The next section will start with CSV files.

Writing and reading CSVs

The most common file type you will encounter is Comma-Separated Values (CSV). A CSV is a file made up of fields separated by commas. Because commas are...
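
As a sketch of where this section is heading, the following example writes fake records to a CSV file using Python's built-in csv module and then reads them back. The data.csv filename matches the file used later in the chapter, but the field choices here are illustrative assumptions:

import csv
from faker import Faker

fake = Faker()

# Write a header row and a few rows of fake data
with open('data.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name', 'age', 'street', 'city', 'state', 'zip'])
    for _ in range(10):
        writer.writerow([fake.name(), fake.random_int(18, 80),
                         fake.street_address(), fake.city(),
                         fake.state(), fake.zipcode()])

# Read the file back; DictReader maps each row to the header fields
with open('data.csv') as f:
    for row in csv.DictReader(f):
        print(row)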

Building data pipelines in Apache Airflow

Apache Airflow uses Python functions, as well as Bash and other operators, to create tasks that can be combined into a Directed Acyclic Graph (DAG), meaning the tasks flow in one direction, with no cycles, from start to completion. You can specify the order in which the tasks will run and which tasks depend on others; this order and dependency structure are what make it a DAG. You can then schedule your DAG in Airflow, specifying when, and how frequently, it should run, and you can monitor and manage it using the Airflow GUI. Using what you learned in the preceding sections, you will now build a data pipeline in Airflow.

Building a CSV-to-JSON data pipeline

Starting with a simple DAG will help you understand how Airflow works and will make it easier to add functionality later to build a better data pipeline. The DAG you build will print out a message using Bash, then read the CSV and print a list...
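
To give a feel for the shape of such a DAG before the full walkthrough, here is a minimal sketch. It assumes Airflow 1.x-style imports and a data.csv file that the Airflow worker can read; the DAG ID, task IDs, and schedule are illustrative assumptions:

import csv
import datetime as dt

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator

def print_csv():
    # Read the CSV written earlier and print each row as a dictionary
    with open('data.csv') as f:
        for row in csv.DictReader(f):
            print(row)

default_args = {
    'owner': 'airflow',
    'start_date': dt.datetime(2020, 1, 1),
}

with DAG('csv_dag',
         default_args=default_args,
         schedule_interval='@daily') as dag:

    print_message = BashOperator(
        task_id='print_message',
        bash_command='echo "Starting the pipeline"')

    read_csv = PythonOperator(
        task_id='read_csv',
        python_callable=print_csv)

    # The Bash task runs first, then the Python task; this one-way
    # ordering is what makes the workflow a DAG
    print_message >> read_csv

Once the file is placed in Airflow's dags folder, the DAG appears in the GUI, where it can be triggered and monitored.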

Handling files using NiFi processors

In the previous sections, you learned how to read and write CSV and JSON files using Python. Reading files is such a common task that tools such as NiFi have prebuilt processors to handle it. In this section, you will learn how to handle files using NiFi processors.

Working with CSV in NiFi

Working with files in NiFi requires many more steps than the same tasks took in Python. But there are benefits to the extra steps and to using NiFi: someone who does not know how to code can look at your data pipeline and understand what it is doing, and you may find it easier to remember what you were trying to do when you come back to the pipeline in the future. Also, changes to the data pipeline do not require refactoring a lot of code; rather, you can reorder processors via drag and drop.

In this section, you will create a data pipeline that reads in the data.csv file you created in Python. It will run a query...

Summary

In this chapter, you learned how to process CSV and JSON files using Python. Using this new skill, you created a data pipeline in Apache Airflow, writing a Python function to process a CSV file and transform it into JSON. You should now have a basic understanding of the Airflow GUI and how to run DAGs. You also learned how to build data pipelines in Apache NiFi using processors. The process for building more advanced data pipelines is the same, and you will learn the skills needed to do so throughout the rest of this book.

In the next chapter, you will learn how to use Python, Airflow, and NiFi to read and write data to databases. You will learn how to use PostgreSQL and Elasticsearch. Using both will expose you to standard relational databases, which can be queried using SQL, and to NoSQL databases, which allow you to store documents and query them with their own query languages.
