Data Engineering with Python

Product type: Book
Published in: Oct 2020
Publisher: Packt
ISBN-13: 9781839214189
Edition: 1st

Author: Paul Crickard

Paul Crickard authored a book on the Leaflet JavaScript module. He has been programming for over 15 years and has focused on GIS and geospatial programming for 7 years. He spent 3 years working as a planner at an architecture firm, where he combined GIS with Building Information Modeling (BIM) and CAD. Currently, he is the CIO at the 2nd Judicial District Attorney's Office in New Mexico.
Chapter 2: Building Our Data Engineering Infrastructure

In the previous chapter, you learned what data engineers do and their roles and responsibilities. You were also introduced to some of the tools that they use, primarily the different types of databases, programming languages, and data pipeline creation and scheduling tools.

In this chapter, you will install and configure several tools that will help you throughout the rest of this book. You will learn how to install and configure two different databases – PostgreSQL and Elasticsearch – two tools to assist in building workflows – Apache Airflow and Apache NiFi – and two administrative tools – pgAdmin for PostgreSQL and Kibana for Elasticsearch.

With these tools, you will be able to write data engineering pipelines to move data from one source to another and also be able to visualize the results. As you learn how to build pipelines, being able to see the data and how it has transformed will be useful to...

Installing and configuring Apache NiFi

Apache NiFi is the primary tool used in this book for building data engineering pipelines. NiFi allows you to build data pipelines using prebuilt processors that you can configure for your needs. You do not need to write any code to get NiFi pipelines working. It also provides a scheduler to set how frequently you would like your pipelines to run. In addition, it handles backpressure – if one task produces data faster than the next can consume it, you can slow the flow down.

To install Apache NiFi, you will need to download it from https://nifi.apache.org/download.html:

  1. Using curl, you can download NiFi with the following command (the --output flag saves the archive under the filename used in the next step):
    curl https://mirrors.estointernet.in/apache/nifi/1.12.1/nifi-1.12.1-bin.tar.gz --output nifi.tar.gz
  2. Extract the NiFi files from the .tar.gz file using the following command:
    tar xvzf nifi.tar.gz
  3. You will now have a folder named nifi-1.12.1. You can run NiFi by executing the following from inside the folder:
     bin/nifi.sh start...
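Once NiFi starts, it is worth confirming that it came up. As a minimal check, assuming the default NiFi 1.12.1 configuration (HTTP on port 8080), run the following from the same folder:

    bin/nifi.sh status

When the status reports that NiFi is running, the canvas is available at http://localhost:8080/nifi.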

Installing and configuring Apache Airflow

Apache Airflow performs the same role as Apache NiFi; however, it allows you to create your data flows using pure Python. If you are a strong Python developer, this is probably an ideal tool for you. It is currently one of the most popular open source data pipeline tools. What it lacks in a polished GUI – compared to NiFi – it more than makes up for in the power and freedom to create tasks.

Installing Apache Airflow can be accomplished using pip. Before installing it, you can change the location of the Airflow install by exporting AIRFLOW_HOME. If you want Airflow to install to /opt/airflow, export the AIRFLOW_HOME variable, as shown:

export AIRFLOW_HOME=/opt/airflow

The default location for Airflow is ~/airflow, and for this book, this is the location I will use. The next consideration before installing Airflow is to determine which sub-packages you want to install. If you do not specify any, Airflow...
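As a minimal sketch of what the install might look like – the postgres sub-package here is only an illustration, and airflow initdb applies to the 1.10.x releases that were current when this book was published (Airflow 2.x renamed it to airflow db init) – the following installs Airflow and brings up its web UI:

    export AIRFLOW_HOME=~/airflow            # the default location, as noted above
    pip install 'apache-airflow[postgres]'   # Airflow plus the postgres sub-package
    airflow initdb                           # create the metadata database (airflow db init in 2.x)
    airflow webserver                        # serve the web UI, on port 8080 by default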

Installing and configuring Elasticsearch

Elasticsearch is a search engine. In this book, you will use it as a NoSQL database, moving data both into and out of it. To download Elasticsearch, take the following steps:

  1. Use curl to download the files, as shown:
    curl https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-7.6.0-linux-x86_64.tar.gz --output elasticsearch.tar.gz
  2. Extract the files using the following command:
    tar xvzf elasticsearch.tar.gz
  3. You can edit the config/elasticsearch.yml file to name your node and cluster. Later in this book, you will set up an Elasticsearch cluster with multiple nodes. For now, I have changed the following properties:
    cluster.name: DataEngineeringWithPython 
    node.name: OnlyNode
  4. Now, you can start Elasticsearch by running the following:
    bin/elasticsearch
  5. Once Elasticsearch has started, you can see the results at http://localhost:9200. You should see the following output...
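Beyond the browser, a quick check from the command line (assuming the default port of 9200) confirms that the node is answering:

    curl http://localhost:9200                # cluster name, node name, and version as JSON
    curl http://localhost:9200/_cat/health    # a one-line cluster health summary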

Installing and configuring Kibana

Elasticsearch does not ship with a GUI; instead, it exposes an API. To add a GUI to Elasticsearch, you can use Kibana. With Kibana, you can better manage and interact with Elasticsearch. Kibana gives you access to the Elasticsearch API in a GUI but, more importantly, lets you build visualizations and dashboards of the data held in Elasticsearch. To install Kibana, take the following steps:

  1. Using wget, add the key:
    wget -qO - https://artifacts.elastic.co/GPG-KEY-elasticsearch | sudo apt-key add -
  2. Then, add the repository:
    echo "deb https://artifacts.elastic.co/packages/7.x/apt stable main" | sudo tee -a /etc/apt/sources.list.d/elastic-7.x.list
  3. Lastly, update apt and install Kibana:
    sudo apt-get update
    sudo apt-get install kibana
  4. The configuration files for Kibana are located in /etc/kibana and the application is in /usr/share/kibana/bin. To launch Kibana, run the following:
    bin/kibana
  5. When Kibana...
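Because the apt package also registers a systemd unit, a hedged alternative to launching the binary directly is to run Kibana as a service (assuming a systemd-based Ubuntu install):

    sudo systemctl enable kibana    # start Kibana at boot
    sudo systemctl start kibana     # launch the service now

Once running, the Kibana UI defaults to http://localhost:5601.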

Installing and configuring PostgreSQL

PostgreSQL is an open source relational database, comparable to Oracle and Microsoft SQL Server. It also has a plugin, PostGIS, which adds spatial capabilities to PostgreSQL. In this book, it will be the relational database of choice. PostgreSQL can be installed on Linux as a package:

  1. For a Debian-based system, use apt-get, as shown:
    sudo apt-get install postgresql-11
  2. Once the packages have finished installing, you can start the database with the following:
    sudo pg_ctlcluster 11 main start
  3. The default user, postgres, does not have a password. To add one, connect to the default database:
    sudo -u postgres psql
  4. Once connected, you can alter the user and assign a password:
    ALTER USER postgres PASSWORD 'postgres';
  5. To create a database, you can enter the following command:
    sudo -u postgres createdb dataengineering

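As a quick check that everything worked (assuming the password set above), the following connects to the new database and prints the server version:

    sudo -u postgres psql -d dataengineering -c 'SELECT version();'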
Using the command line is fast, but sometimes, a GUI makes life easier. PostgreSQL...

Installing pgAdmin 4

pgAdmin 4 will make managing PostgreSQL much easier if you are new to relational databases. Its web-based GUI lets you view your data and create tables visually. To install pgAdmin 4, take the following steps:

  1. You need to add the pgAdmin repository to Ubuntu. The following commands add the repository's signing key and entry, then update apt and install pgAdmin 4:
    wget --quiet -O - https://www.postgresql.org/media/keys/ACCC4CF8.asc | sudo apt-key add -
    sudo sh -c 'echo "deb http://apt.postgresql.org/pub/repos/apt/ `lsb_release -cs`-pgdg main" >> /etc/apt/sources.list.d/pgdg.list'
    sudo apt update
    sudo apt install pgadmin4 pgadmin4-apache2 -y
  2. You will be prompted to enter an email address for a username and then for a password. You should see the following screen:

    Figure 2.27 – Creating a user for pgAdmin 4

  3. When the install has completed, you can browse to http://localhost/pgadmin4 and you will be presented with the login screen, as shown in the following...

Summary

In this chapter, you learned how to install and configure many of the tools used by data engineers. Having done so, you now have a working environment in which you can build data pipelines. In production, you would not run all these tools on a single machine, but for the next few chapters, this will help you learn and get started quickly. You now have two working databases – Elasticsearch and PostgreSQL – as well as two tools for building data pipelines – Apache NiFi and Apache Airflow.

In the next chapter, you will start to use Apache NiFi and Apache Airflow (Python) to connect to files, as well as Elasticsearch and PostgreSQL. You will build your first pipeline in NiFi and Airflow to move a CSV to a database.
