Configuring Docker for Airflow

In this book, we will use Airflow to orchestrate data ingestion and to provide logs for monitoring our pipelines.

Airflow can be installed directly on your local machine or on any server using PyPI (https://pypi.org/project/apache-airflow/) or a Docker container (https://hub.docker.com/r/apache/airflow). The official, supported Airflow image is available on Docker Hub and is maintained by the Apache Foundation community.
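For reference, a PyPI install looks roughly like the following. The pinned version and constraints file (here assuming Python 3.8) follow the pattern recommended by the Airflow docs, but treat the exact values as an example; the rest of this recipe uses the Docker route instead:

# Example only – assumes Python 3.8; adjust the constraints URL to your Python version
my-project$ pip install "apache-airflow==2.3.0" \
    --constraint "https://raw.githubusercontent.com/apache/airflow/constraints-2.3.0/constraints-3.8.txt"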

However, some additional steps are needed to configure Airflow. Thankfully, the Apache Foundation also provides a docker-compose file that bundles all the other components Airflow needs to work, so we only have to complete a few more steps.

Getting ready

Let’s start by initializing Docker on our machine. You can use either the Docker Desktop application or the CLI.

Make sure you are inside your project folder for this. Create a folder to store Airflow internal components and the docker-compose.yaml file:

my-project$ mkdir airflow-local
my-project$ cd airflow-local

How to do it…

  1. First, we fetch the docker-compose.yaml file directly from the official Airflow docs:
    my-project/airflow-local$ curl -LfO 'https://airflow.apache.org/docs/apache-airflow/2.3.0/docker-compose.yaml'

You should see output like this:

Figure 1.8 – Airflow container image download progress

Note

Check for the latest stable version of this docker-compose file when you download it, since newer versions may be available after this book is published.

  2. Next, we create the dags, logs, and plugins folders as follows:
    my-project/airflow-local$ mkdir ./dags ./logs ./plugins
  3. Then, we create and set the Airflow user as follows:
    my-project/airflow-local$ echo -e "AIRFLOW_UID=$(id -u)\nAIRFLOW_GID=0" > .env

Note

If you have any error messages related to the AIRFLOW_UID variable, you can create a .env file in the same folder where your docker-compose.yaml file is and define the variable as AIRFLOW_UID=50000.
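Either way, the resulting .env file is only two lines long. The UID will be your own user ID if you used the echo command above; 50000 is the fallback value suggested in the note:

# AIRFLOW_UID will be your own user ID if you used the echo command; 50000 is the fallback
AIRFLOW_UID=50000
AIRFLOW_GID=0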

  4. Then, we initialize the database:
    my-project/airflow-local$ docker-compose up airflow-init

After executing the command, you should see output similar to this:

Creating network "airflow-local_default" with the default driver
Creating volume "airflow-local_postgres-db-volume" with default driver
Pulling postgres (postgres:13)...
13: Pulling from library/postgres
(...)
Status: Downloaded newer image for postgres:13
Pulling redis (redis:latest)...
latest: Pulling from library/redis
bd159e379b3b: Already exists
(...)
Status: Downloaded newer image for redis:latest
Pulling airflow-init (apache/airflow:2.3.0)...
2.3.0: Pulling from apache/airflow
42c077c10790: Pull complete
(...)
Status: Downloaded newer image for apache/airflow:2.3.0
Creating airflow-local_postgres_1 ... done
Creating airflow-local_redis_1    ... done
Creating airflow-local_airflow-init_1 ... done
Attaching to airflow-local_airflow-init_1
(...)
airflow-init_1       | [2022-10-09 09:49:26,250] {manager.py:213} INFO - Added user airflow
airflow-init_1       | User "airflow" created with role "Admin"
(...)
airflow-local_airflow-init_1 exited with code 0
  5. Then, we start the Airflow service:
    my-project/airflow-local$ docker-compose up
  6. Then, we need to check the Docker processes. Using the following CLI command, you will see the running containers:
    my-project/airflow-local$ docker ps

These are the containers we see:

Figure 1.9 – The docker ps command output

In the Docker Desktop application, you can also see the same containers running but with a more friendly interface:

Figure 1.10 – A Docker desktop view of the Airflow containers running
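If you prefer to stay in the terminal, you can also trim the docker ps output down to the columns that matter here. The format string below uses Docker's standard Go-template syntax and is just one possible selection of columns:

# One possible column selection – Docker's --format flag accepts Go templates
my-project/airflow-local$ docker ps --format "table {{.Names}}\t{{.Status}}\t{{.Ports}}"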

  7. Then, we access Airflow in a web browser:

In your preferred browser, type http://localhost:8080/home. The following screen will appear:

Figure 1.11 – The Airflow UI login page
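If the page does not load, a quick sanity check from the terminal is to query the webserver's health endpoint, which returns a short JSON document describing the metadatabase and scheduler status:

# Assumes the default 8080 port mapping from the docker-compose file
my-project/airflow-local$ curl http://localhost:8080/health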

  8. Then, we log in to the Airflow platform. Since it’s a local application used for testing and learning, the default username and password for administrative access are both airflow.

When logged in, the following screen will appear:

Figure 1.12 – The Airflow UI main page

  9. Finally, we stop our containers. They can remain stopped until we reach Chapter 9, when we will explore data ingestion in Airflow:
    my-project/airflow-local$ docker-compose stop
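If you ever want to reset the environment completely instead of just stopping it, the usual Docker Compose cleanup command shown below also removes the containers and the metadata database volume. Note that this erases the users, connections, and run history created so far, so only use it when you want a fresh start:

# Warning: --volumes also deletes the metadata database (users, connections, run history)
my-project/airflow-local$ docker-compose down --volumes --remove-orphans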

How it works…

Airflow is an open source platform for developing, scheduling, and monitoring batch data pipelines. However, it requires other components, such as an internal database to store its metadata, to work correctly. In this example, the docker-compose setup uses PostgreSQL to store the metadata and Redis to broker the messages that are queued for the workers.

All of these components could be installed directly on our machine, one by one. Although that sounds simple, it often is not, because of compatibility issues with the OS, conflicting software versions, and so on.

Docker can create an isolated environment that provides everything Airflow needs to work. With docker-compose, it becomes even simpler, since we can declare dependencies between the components so that a service only starts once the services it depends on are healthy.
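As an illustration of how docker-compose expresses this, the Airflow services in the file we downloaded declare their dependencies along these lines (a shortened, illustrative excerpt; the exact keys and anchors may differ between versions):

    # Illustrative excerpt – in the real file this block is shared by all
    # Airflow services, and each dependency also defines its own healthcheck
    depends_on:
      postgres:
        condition: service_healthy
      redis:
        condition: service_healthy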

You can also open the docker-compose.yaml file we downloaded for this recipe and explore it for yourself; we will cover it in detail in Chapter 9.
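A quick way to get an overview before reading the whole file is to ask Docker Compose to list the services it defines; the config --services subcommand is part of the standard docker-compose CLI:

# Prints one service name per line
my-project/airflow-local$ docker-compose config --services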

See also

If you want to learn more about how this docker-compose file works, see the Docker Compose guide in the official Apache Airflow documentation: https://airflow.apache.org/docs/apache-airflow/stable/howto/docker-compose/index.html.
