
You're reading from Elasticsearch Indexing

Product type: Book
Published in: Dec 2015
ISBN-13: 9781783987023
Edition: 1st Edition
Tools: Elasticsearch
Concepts: Enterprise Search
Author: Huseyin Akdogan

Chapter 5. Anatomy of an Elasticsearch Cluster

In the previous chapter, we looked at the analysis process and analyzers. We talked about character filters, tokenizers, and token filters. Then, we reviewed an analyzer pipeline. Finally, we saw how to create a custom analyzer. In this chapter, we will explore the anatomy of an Elasticsearch cluster and look closely at its core components. In addition, we will examine the question: how do we configure our cluster correctly? By the end of this chapter, we will have covered the following topics:

What the basic components of an Elasticsearch cluster are
What the key concepts behind the distribution architecture are
What primary and replica shards do
How to choose the right number of shards and replicas

Basic concepts

An Elasticsearch cluster is a physical and logical grouping of the nodes that are allocated to it. Initially, you don't need to do any configuration for your cluster. When a node is started, Elasticsearch creates a directory based on the defined cluster name, and the node is then allocated to this directory. When you create an index, Elasticsearch creates some shards in the background, and replicas as well unless you configure it otherwise. The generated shards are also allocated on the same node.

Elasticsearch is built to scale. When more capacity is needed, it is sufficient to add nodes: the cluster reorganizes itself to take advantage of the extra hardware and redistributes the load. This built-in clustering is one of the most important advantages of Elasticsearch.

In the following section, we will look closely at the basic components of an Elasticsearch cluster.

Node

A node is a single instance of the Elasticsearch server and it can host data. This means that shards of indices are allowed to be allocated on the nodes. By default, each node is considered to be a data node, but you can turn the setting off.

Note

You can make a non-data node by adding node.data: false to the elasticsearch.yml file.

Non-data nodes

There are two types of non-data nodes: dedicated master nodes and client nodes.

Dedicated master nodes

Dedicated master nodes have the settings node.data: false and node.master: true. Such nodes are responsible for managing the cluster. Index and search requests are not sent to these nodes.

Client nodes

Client nodes have the settings node.data: false and node.master: false. They can be used to balance the load because all HTTP communication can be performed through these nodes.
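The combinations of node.data and node.master described above can be summarized in a short sketch. The function name node_role is hypothetical, and the default of true for node.master is an assumption about the Elasticsearch versions this chapter covers; only the default for node.data is stated in the text.

```python
def node_role(node_data: bool = True, node_master: bool = True) -> str:
    """Classify a node from its node.data / node.master settings.

    Hypothetical helper for illustration; it is not part of Elasticsearch.
    """
    if node_data:
        # The node can host shards, whether or not it is master-eligible.
        return "data node"
    if node_master:
        # No shards, but eligible to manage the cluster.
        return "dedicated master node"
    # Neither holds shards nor manages the cluster.
    return "client node"


for data, master in [(True, True), (False, True), (False, False)]:
    settings = f"node.data: {str(data).lower()}, node.master: {str(master).lower()}"
    print(f"{settings} -> {node_role(data, master)}")
```

Reading the table this way makes it clear that node.data decides whether a node holds shards at all, while node.master only distinguishes the two kinds of non-data node.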

Tribe node

Another type of Elasticsearch node is the tribe node. Normally, a node is associated with a single cluster. But sometimes, all the connected clusters may feel...

Shards

When you create an index, Elasticsearch subdivides it into multiple Lucene indices called shards; this process is called sharding. Shards are managed automatically by Elasticsearch, and each shard is a fully functional, independent index. You can define the number of shards. By default, a shard is refreshed once per second, which is how Elasticsearch supports near real-time search. Shards are useful when working with large data: a large index may exceed the disk capacity of a single node, or a single node may be too slow to serve its search requests. Shards solve these and similar problems and allow you to scale your content volume horizontally.

Replicas

By default, Elasticsearch creates five primary shards and one copy of each primary shard when you create an index. These copies are called replicas, so a replica shard is simply a copy of a primary shard. Replica shards are used to improve search performance and to provide failover. If a node crashes, Elasticsearch uses the available replicas of the primary shards that were on that node to avoid any data loss. For this reason, a replica is not allocated on the same node as its primary shard. Hence, choosing the right number of shards and replicas is very important. Unlike primary shards, replicas can be added and removed at any time; the number of primary shards must be specified when the index is created.
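Because replicas can be changed at any time, a minimal sketch of such a change may help. It builds, without sending, a request to the index settings endpoint that sets number_of_replicas. The helper name build_replica_update, the host localhost:9200, and the index name my_index are assumptions made for illustration.

```python
import json
import urllib.request


def build_replica_update(base_url: str, index: str, replicas: int) -> urllib.request.Request:
    """Build (but do not send) a PUT <index>/_settings request.

    Hypothetical helper; the request body sets index.number_of_replicas.
    """
    if replicas < 0:
        raise ValueError("replicas must be zero or greater")
    body = json.dumps({"index": {"number_of_replicas": replicas}}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/{index}/_settings",
        data=body,
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


req = build_replica_update("http://localhost:9200", "my_index", 2)
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
# Against a running cluster, the request could be sent with:
# urllib.request.urlopen(req)
```

No comparable request exists for the number of primary shards, which is why that value has to be chosen when the index is created.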

Explaining the architecture of distribution

When we first start a single node, we have no index and no data; in other words, we have an empty cluster. When we create an index with the default settings, the cluster will take the following view:

As mentioned earlier, Elasticsearch creates five primary shards and a copy of each primary shard by default. But replicas do not appear in the preceding view. Why?

Let's look for the answer to our question by using the Cat API:

curl -XGET 'localhost:9200/_cat/shards'

my_index 4 p STARTED    0 144b 192.168.1.22 Digitek
my_index 4 r UNASSIGNED
my_index 0 p STARTED    0 144b 192.168.1.22 Digitek
my_index 0 r UNASSIGNED
my_index 3 p STARTED    0 144b 192.168.1.22 Digitek
my_index 3 r UNASSIGNED
my_index 1 p STARTED    0 144b 192.168.1.22 Digitek
my_index 1 r UNASSIGNED
my_index 2 p STARTED    0 144b 192.168.1.22 Digitek
my_index 2 r UNASSIGNED

As you can see, there are five shards in our cluster, their states are STARTED, and there are other five...
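The same tally can be made programmatically. The sketch below counts shards by type and state from Cat API output. It assumes the output was requested with only the index, shard, prirep, and state columns (for example with ?h=index,shard,prirep,state), and the function name count_shards is hypothetical.

```python
from collections import Counter

# Sample rows, assuming the request was:
# curl -XGET 'localhost:9200/_cat/shards?h=index,shard,prirep,state'
CAT_SHARDS_OUTPUT = """\
my_index 4 p STARTED
my_index 4 r UNASSIGNED
my_index 0 p STARTED
my_index 0 r UNASSIGNED
my_index 3 p STARTED
my_index 3 r UNASSIGNED
my_index 1 p STARTED
my_index 1 r UNASSIGNED
my_index 2 p STARTED
my_index 2 r UNASSIGNED
"""


def count_shards(cat_output: str) -> Counter:
    """Count shards per (kind, state) pair from _cat/shards text output."""
    counts = Counter()
    for line in cat_output.splitlines():
        fields = line.split()
        if len(fields) < 4:
            continue  # skip blank or incomplete lines
        _index, _shard, prirep, state = fields[:4]
        kind = "primary" if prirep == "p" else "replica"
        counts[(kind, state)] += 1
    return counts


counts = count_shards(CAT_SHARDS_OUTPUT)
for (kind, state), n in sorted(counts.items()):
    print(f"{kind:8} {state:10} {n}")
```

The parser reads only the first four columns, so it also accepts the full default output, where started shards carry extra columns such as the node name.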

Correctly configuring the cluster

While understanding the distribution of shards is essential, understanding the distribution of documents is also critical. Elasticsearch tries to spread documents evenly across shards. This is the desired behavior: a single shard that holds the majority of the data would be unwise.
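Elasticsearch picks the shard for a document by hashing its routing value, which defaults to the document ID, and taking the result modulo the number of primary shards. The sketch below illustrates that idea only: zlib.crc32 is a stand-in for the hash function Elasticsearch actually uses, and the name pick_shard and the document IDs are hypothetical.

```python
import zlib
from collections import Counter


def pick_shard(routing: str, number_of_primary_shards: int) -> int:
    """Return the shard number for a routing value.

    Assumption: zlib.crc32 stands in for Elasticsearch's own hash function,
    so the shard numbers will not match a real cluster.
    """
    return zlib.crc32(routing.encode("utf-8")) % number_of_primary_shards


# Hypothetical document IDs spread over the two shards of my_index.
doc_ids = [f"doc-{i}" for i in range(1000)]
distribution = Counter(pick_shard(doc_id, 2) for doc_id in doc_ids)
for shard, count in sorted(distribution.items()):
    print(f"shard {shard}: {count} documents")
```

Since the shard number depends on the number of primary shards, changing that number would send existing IDs to different shards. This is consistent with the rule that the number of primary shards is fixed when the index is created.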

Let's start two Elasticsearch nodes and create an index by running the following command:

curl -XPUT 'localhost:9200/my_index' -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}'
{"acknowledged":true}

We've created an index that consists of two shards and has no replicas. Now we add a document to the index:

curl -XPOST localhost:9200/my_index/document -d '{
  "title": "The first document"
}'
{"_index":"my_index","_type":"document","_id":"AU_iaqgDlNVjy8IaI4FM","_version":1,"created":true}

We can get the current shard-level stats of my_index by using the following command:

curl -XGET 'localhost:9200/my_index/_stats?level=shards&pretty'
{
...
"shards": {
    ...

Choosing the right number of shards and replicas

If your dataset is limited in size and grows slowly, a single primary shard with one replica can be enough. If your dataset is not limited and grows quickly, the optimal number of shards depends on the target number of nodes.

Actually, a single node can be sufficient for many simple use cases, but given the nature of a distributed architecture, you can use more than one node to increase fault tolerance and prevent data loss. So, we need to find the answer to the first question: how many nodes will we run?

To answer this question, we first need answers to a few others. For example: do we need non-data nodes? If we don't, then, considering the Elasticsearch shard allocation policy, we can say that a node must hold at least one shard, whether a primary or a replica, to act as a data node. In that case, we can follow the following formula:

Max number of data nodes ...
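As a hedged illustration of the reasoning above, that every data node should hold at least one shard, the following sketch counts the shard copies of an index and the data nodes that would be left without any. The function names are hypothetical, and the arithmetic is only our reading of that requirement, not a formula taken from Elasticsearch itself.

```python
def total_shard_copies(primaries: int, replicas: int) -> int:
    """Each primary shard exists once, plus once per configured replica."""
    return primaries * (1 + replicas)


def nodes_without_shards(data_nodes: int, primaries: int, replicas: int) -> int:
    """Data nodes that cannot receive a shard copy because none are left."""
    return max(0, data_nodes - total_shard_copies(primaries, replicas))


# my_index from earlier in the chapter: two primary shards, no replicas.
for data_nodes in (1, 2, 3):
    idle = nodes_without_shards(data_nodes, primaries=2, replicas=0)
    print(f"{data_nodes} data node(s): {idle} without any shard of my_index")
```

Under this reading, adding data nodes beyond the number of shard copies brings no benefit for that index until the replica count is raised.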

Summary

In this chapter, we looked at the basic concepts of an Elasticsearch cluster and at its core components. After this, we discussed how to configure a cluster correctly. Finally, we discussed choosing the right number of shards and replicas. In the next chapter, you will learn tips to improve indexing performance. We will look at the memory settings and at how optimizing the mapping definition improves indexing performance. In addition, we will examine the segment merge policy and look at some relevant cases. Finally, we will look at the bulk API.


Author (1)

Huseyin Akdogan

Hüseyin Akdoğan began his software adventure with the GwBasic programming language. He started learning the Visual Basic language after QuickBasic and developed many applications until 2000, after which he stepped into the world of Web with PHP. After this, he came across Java! In addition to counseling and training activities since 2005, he developed enterprise applications with JavaEE technologies. His areas of expertise are JavaServer Faces, Spring Frameworks, and big data technologies such as NoSQL and Elasticsearch. Along with these, he is also trying to specialize in other big data technologies. Hüseyin also writes articles on Java and big data technologies and works as a technical reviewer of big data books. He was a reviewer of one of the bestselling books, Mastering Elasticsearch – Second Edition.