You're reading from Data Engineering with Python

Product type: Book
Published in: Oct 2020
Reading level: Beginner
Publisher: Packt
ISBN-13: 9781839214189
Edition: 1st Edition

Author: Paul Crickard

Paul Crickard authored a book on the Leaflet JavaScript module. He has been programming for over 15 years and has focused on GIS and geospatial programming for 7 years. He spent 3 years working as a planner at an architecture firm, where he combined GIS with Building Information Modeling (BIM) and CAD. Currently, he is the CIO at the 2nd Judicial District Attorney's Office in New Mexico.
Building a NiFi cluster

In this book, you have built a Kafka cluster, a ZooKeeper cluster, and a Spark cluster. Instead of increasing the power of a single server, clustering lets you add more machines to increase the processing power of a data pipeline. In this appendix, you will learn how to cluster NiFi so that your data pipelines can run across multiple machines.

We're going to cover the following main topics:

  • The basics of NiFi clustering
  • Building a NiFi cluster
  • Building a distributed data pipeline
  • Managing the distributed data pipeline

The basics of NiFi clustering

Clustering in Apache NiFi follows a Zero-Master Clustering architecture. In this type of clustering, there is no predefined master: every node can perform the same tasks, and the data is split between them. NiFi relies on Apache ZooKeeper when deployed as a cluster.

ZooKeeper will elect a Cluster Coordinator. The Cluster Coordinator is responsible for deciding whether new nodes can join – the nodes connect to the coordinator – and for providing the updated flows to the new nodes.

While it may sound like the Cluster Coordinator is the master, it is not. You can make changes to the data pipelines on any node, and they will be replicated to all the other nodes; in other words, a node that is neither the Cluster Coordinator nor the Primary Node can still submit changes.

The Primary Node is also elected by ZooKeeper. On the Primary Node, you can run isolated processes. An isolated process is a NiFi processor that runs only on the Primary Node. This is important because think what...

Building a NiFi cluster

In this section, you will build a two-node cluster on different machines. Just like with MiNiFi, however, there are some compatibility issues between the newest versions of NiFi and ZooKeeper. To work around these issues and demonstrate the concepts, this chapter will use an older version of NiFi and the pre-bundled ZooKeeper. To build the NiFi cluster, perform the following steps:

  1. As root, or using sudo, open your /etc/hosts file. You will need to assign names to the machines that you will use in your cluster. It is best practice to use hostnames instead of IP addresses. Your hosts file should look like the following example:
    127.0.0.1	localhost
    ::1		localhost
    127.0.1.1	pop-os.localdomain	pop-os
    10.0.0.63	nifi-node-2
    10.0.0.148	nifi-node-1
  2. In the preceding hosts file, I have added the last two lines. The nodes are nifi-node-1 and nifi-node-2, and you can see that they have different IP addresses. Make these changes in the hosts file on each machine...
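Beyond the hosts file, clustering is configured in each node's conf/nifi.properties. The following excerpt is a minimal sketch, assuming a NiFi 1.x release with the embedded ZooKeeper; the port numbers are illustrative, and nifi.cluster.node.address should match the hostname of the node you are editing:

```properties
# conf/nifi.properties (excerpt) - set on every node; values are illustrative
nifi.state.management.embedded.zookeeper.start=true
nifi.cluster.is.node=true
nifi.cluster.node.address=nifi-node-1
nifi.cluster.node.protocol.port=9991
nifi.zookeeper.connect.string=nifi-node-1:2181,nifi-node-2:2181
nifi.web.http.host=nifi-node-1
nifi.web.http.port=8080
```

On nifi-node-2, the node address and web host would instead name that machine, while the ZooKeeper connect string stays the same across the cluster.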

Building a distributed data pipeline

Building a distributed data pipeline is almost exactly the same as building a data pipeline to run on a single machine. NiFi will handle the logistics of passing and recombining the data. A basic data pipeline is shown in the following screenshot:

Figure 16.4 – A basic data pipeline to generate data, extract attributes to JSON, and write to disk

The preceding data pipeline uses the GenerateFlowFile processor to create unique flowfiles. Each flowfile is passed downstream to the AttributesToJSON processor, which extracts the attributes and writes them to the flowfile content. Lastly, the file is written to disk at /home/paulcrickard/output.
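To make the AttributesToJSON step concrete, here is a sketch of what one output file might contain and how you could read it back. The attribute names shown (filename, uuid, fileSize) are standard NiFi core attributes, but the exact set in your files depends on the processor's Attributes List setting, so treat this sample as hypothetical:

```python
import json

# Hypothetical content of one file written by PutFile after AttributesToJSON
# replaced the flowfile content with a JSON map of its attributes.
sample = '{"filename": "1a2b3c.txt", "uuid": "1a2b3c", "fileSize": "0"}'

# Reading it back is ordinary JSON parsing.
attributes = json.loads(sample)
print(attributes["filename"])
```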

Before running the data pipeline, you will need to make sure that you have the output directory for the PutFile processor on each node. Earlier, I said that data pipelines are no different when distributed, but there are some things you must keep in mind, one being that PutFile will write...
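Because PutFile runs on every node, each node needs that directory to exist before the pipeline starts. A quick way to prepare it is a one-off script like the following sketch; a scratch path is used here so it can run anywhere, but on your cluster you would substitute the chapter's /home/paulcrickard/output:

```python
import os

# Create the PutFile target directory if it does not already exist.
# Run this (or the equivalent mkdir -p) on EACH node in the cluster.
output_dir = "/tmp/nifi-demo/output"
os.makedirs(output_dir, exist_ok=True)
print(os.path.isdir(output_dir))
```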

Managing the distributed data pipeline

The preceding data pipeline runs on each node. To compensate for that, you had to create the same path on both nodes for the PutFile processor to work. Earlier, you learned that several processors can result in race conditions – two nodes trying to read the same file at the same time – which will cause problems. To resolve these issues, you can specify that a processor should only run on the Primary Node – as an isolated process.

In the configuration for the PutFile processor, select the Scheduling tab. In the dropdown for Scheduling Strategy, choose On primary node, as shown in the following screenshot:

Figure 16.6 – Running a processor on the Primary Node only

Now, when you run the data pipeline, the files will only be placed on the Primary Node. You can schedule processors such as GetFile or ExecuteSQL to do the same thing.

To see the load of the data pipeline on each node, you...
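One way to inspect per-node activity programmatically is NiFi's REST API, which exposes cluster status. The sketch below assumes a NiFi 1.x cluster with the GET /nifi-api/controller/cluster endpoint available over plain HTTP; the hostname and port are placeholders, and the "queued" field format may vary by version:

```python
import json
import urllib.request


def cluster_url(host: str, port: int = 8080) -> str:
    """Build the NiFi cluster-status endpoint URL for a given node."""
    return f"http://{host}:{port}/nifi-api/controller/cluster"


def node_queues(cluster_json: str) -> dict:
    """Map each node's address to its queued-flowfile summary."""
    body = json.loads(cluster_json)
    return {n["address"]: n.get("queued") for n in body["cluster"]["nodes"]}


# Example against a live cluster (not executed here):
# with urllib.request.urlopen(cluster_url("nifi-node-1")) as resp:
#     print(node_queues(resp.read().decode()))
```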

Summary

In this appendix, you learned the basics of NiFi clustering, how to build a cluster with the embedded ZooKeeper, and how to build distributed data pipelines. NiFi handles most of the distribution of data; you only need to keep in mind the gotchas – such as race conditions and processors that must be configured to run only on the Primary Node. Using a NiFi cluster allows you to manage NiFi on several machines from a single instance. It also allows you to process larger amounts of data and provides some redundancy in case an instance crashes.
