Data Engineering with Python

Product type: Book
Published in: Oct 2020
Publisher: Packt
ISBN-13: 9781839214189
Edition: 1st

Author: Paul Crickard

Paul Crickard authored a book on the Leaflet JavaScript module. He has been programming for over 15 years and has focused on GIS and geospatial programming for 7 years. He spent 3 years working as a planner at an architecture firm, where he combined GIS with Building Information Modeling (BIM) and CAD. Currently, he is the CIO at the 2nd Judicial District Attorney's Office in New Mexico.
Chapter 2: Building Our Data Engineering Infrastructure

In the previous chapter, you learned what data engineers do and their roles and responsibilities. You were also introduced to some of the tools that they use, primarily the different types of databases, programming languages, and data pipeline creation and scheduling tools.

In this chapter, you will install and configure several tools that will help you throughout the rest of this book. You will learn how to install and configure two different databases – PostgreSQL and Elasticsearch – two tools to assist in building workflows – Apache Airflow and Apache NiFi – and two administrative tools – pgAdmin for PostgreSQL and Kibana for Elasticsearch.

With these tools, you will be able to write data engineering pipelines to move data from one source to another and also be able to visualize the results. As you learn how to build pipelines, being able to see the data and how it has transformed will be useful to...

Installing and configuring Apache NiFi

Apache NiFi is the primary tool used in this book for building data engineering pipelines. NiFi allows you to build data pipelines using prebuilt processors that you can configure for your needs. You do not need to write any code to get NiFi pipelines working. It also provides a scheduler to set how frequently you would like your pipelines to run. In addition, it handles backpressure – if one task produces data faster than the next can consume it, you can slow the flow down.

To install Apache NiFi, you will need to download it from https://nifi.apache.org/download.html:

  1. Using curl, you can download NiFi with the following command (the --output flag saves the archive under the filename used in the next step):
    curl https://mirrors.estointernet.in/apache/nifi/1.12.1/nifi-1.12.1-bin.tar.gz --output nifi.tar.gz
  2. Extract the NiFi files from the .tar.gz file using the following command:
    tar xvzf nifi.tar.gz
  3. You will now have a folder named nifi-1.12.1. You can run NiFi by executing the following from inside the folder:
     bin/nifi.sh start...
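Once NiFi starts, it is worth confirming that it came up. As a minimal check, assuming the default NiFi 1.12.1 configuration (HTTP on port 8080), run the following from the same folder:

    bin/nifi.sh status

When the status reports that NiFi is running, the canvas is available at http://localhost:8080/nifi.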

Installing and configuring Apache Airflow

Apache Airflow performs the same role as Apache NiFi; however, it allows you to create your data flows using pure Python. If you are a strong Python developer, this is probably an ideal tool for you. It is currently one of the most popular open source data pipeline tools. What it lacks in a polished GUI – compared to NiFi – it more than makes up for in the power and freedom to create tasks.

Installing Apache Airflow can be accomplished using pip. Before installing it, you can change the location of the Airflow install by exporting AIRFLOW_HOME. If you want Airflow to install to /opt/airflow, export the AIRFLOW_HOME variable, as shown:

export AIRFLOW_HOME=/opt/airflow

The default location for Airflow is ~/airflow, and for this book, this is the location I will use. The next consideration before installing Airflow is to determine which sub-packages you want to install. If you do not specify any, Airflow...
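As a minimal sketch of what the install might look like – the postgres sub-package here is only an illustration, and airflow initdb applies to the 1.10.x releases that were current when this book was published (Airflow 2.x renamed it to airflow db init) – the following installs Airflow and brings up its web UI:

    export AIRFLOW_HOME=~/airflow            # the default location, as noted above
    pip install 'apache-airflow[postgres]'   # Airflow plus the postgres sub-package
    airflow initdb                           # create the metadata database (airflow db init in 2.x)
    airflow webserver                        # serve the web UI, on port 8080 by default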

Installing and configuring Elasticsearch

Elasticsearch is a search engine. In this book, you will use it as a NoSQL database, moving data both into and out of it. To download Elasticsearch, take the following steps:

  1. Use curl to download the files, as shown:
    curl https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-linux-x86_64.tar.gz --output elasticsearch.tar.gz
  2. Extract the files using the following command:
    tar xvzf elasticsearch.tar.gz
  3. You can edit the config/elasticsearch.yml file to name your node and cluster. Later in this book, you will set up an Elasticsearch cluster with multiple nodes. For now, I have changed the following properties:
    cluster.name: DataEngineeringWithPython 
    node.name: OnlyNode
  4. Now, you can start Elasticsearch by running the following:
    bin/elasticsearch
  5. Once Elasticsearch has started, you can see the results at http://localhost:9200. You should see the following output...
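Beyond the browser, a quick check from the command line (assuming the default port of 9200) confirms that the node is answering:

    curl http://localhost:9200                # cluster name, node name, and version as JSON
    curl http://localhost:9200/_cat/health    # a one-line cluster health summary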

Installing and configuring Kibana

Elasticsearch does not ship with a GUI; instead, it exposes an API. To add a GUI to Elasticsearch, you can use Kibana. With Kibana, you can better manage and interact with Elasticsearch. Kibana gives you access to the Elasticsearch API in a GUI but, more importantly, lets you build visualizations and dashboards of the data held in Elasticsearch. To install Kibana, take the following steps:

  1. Using wget, add the key:
    wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
  2. Then, add the repository:
    echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
  3. Lastly, update apt and install Kibana:
    sudo apt-get update
    sudo apt-get install kibana
  4. The configuration files for Kibana are located in /etc/kibana and the application is in /usr/share/kibana/bin. To launch Kibana, run the following:
    bin/kibana
  5. When Kibana...
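Because the apt package also registers a systemd unit, a hedged alternative to launching the binary directly is to run Kibana as a service (assuming a systemd-based Ubuntu install):

    sudo systemctl enable kibana    # start Kibana at boot
    sudo systemctl start kibana     # launch the service now

Once running, the Kibana UI defaults to http://localhost:5601.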

Installing and configuring PostgreSQL

PostgreSQL is an open source relational database, comparable to Oracle and Microsoft SQL Server. It also has a plugin, PostGIS, which adds spatial capabilities to PostgreSQL. In this book, it will be the relational database of choice. PostgreSQL can be installed on Linux as a package:

  1. For a Debian-based system, use apt-get, as shown:
    sudo apt-get install postgresql-11
  2. Once the packages have finished installing, you can start the database with the following:
    sudo pg_ctlcluster 11 main start
  3. The default user, postgres, does not have a password. To add one, connect to the default database:
    sudo -u postgres psql
  4. Once connected, you can alter the user and assign a password:
    ALTER USER postgres PASSWORD 'postgres';
  5. To create a database, you can enter the following command:
    sudo -u postgres createdb dataengineering

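As a quick check that everything worked (assuming the password set above), the following connects to the new database and prints the server version:

    sudo -u postgres psql -d dataengineering -c 'SELECT version();'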
Using the command line is fast, but sometimes, a GUI makes life easier. PostgreSQL...

Installing pgAdmin 4

pgAdmin 4 will make managing PostgreSQL much easier if you are new to relational databases. Its web-based GUI lets you view your data and create tables visually. To install pgAdmin 4, take the following steps:

  1. You need to add the pgAdmin repository to Ubuntu. The following commands add the repository's signing key and entry, then update apt and install pgAdmin 4:
    wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
    sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
    sudo apt update
    sudo apt install pgadmin4 pgadmin4-apache2 -y
  2. You will be prompted to enter an email address for a username and then for a password. You should see the following screen:

    Figure 2.27 – Creating a user for pgAdmin 4

  3. When the install has completed, you can browse to http://localhost/pgadmin4 and you will be presented with the login screen, as shown in the following...

Summary

In this chapter, you learned how to install and configure many of the tools used by data engineers. Having done so, you now have a working environment in which you can build data pipelines. In production, you would not run all these tools on a single machine, but for the next few chapters, this will help you learn and get started quickly. You now have two working databases – Elasticsearch and PostgreSQL – as well as two tools for building data pipelines – Apache NiFi and Apache Airflow.

In the next chapter, you will start to use Apache NiFi and Apache Airflow (Python) to connect to files, as well as Elasticsearch and PostgreSQL. You will build your first pipeline in NiFi and Airflow to move a CSV to a database.
