You're reading from  Elasticsearch Indexing

Product type: Book
Published in: Dec 2015
Publisher
ISBN-13: 9781783987023
Edition: 1st Edition
Author (1)
Huseyin Akdogan

Hüseyin Akdoğan began his software adventure with the GwBasic programming language. He started learning the Visual Basic language after QuickBasic and developed many applications until 2000, after which he stepped into the world of Web with PHP. After this, he came across Java! In addition to counseling and training activities since 2005, he developed enterprise applications with JavaEE technologies. His areas of expertise are JavaServer Faces, Spring Frameworks, and big data technologies such as NoSQL and Elasticsearch. Along with these, he is also trying to specialize in other big data technologies. Hüseyin also writes articles on Java and big data technologies and works as a technical reviewer of big data books. He was a reviewer of one of the bestselling books, Mastering Elasticsearch – Second Edition.

Chapter 5. Anatomy of an Elasticsearch Cluster

In the previous chapter, we looked at the analysis process and analyzers. We talked about character filters, tokenizers, and token filters. Then, we reviewed an analyzer pipeline. Finally, we saw how to create a custom analyzer. In this chapter, we will explore the anatomy of an Elasticsearch cluster and look closely at its core components. In addition, we will examine the question: how do we configure a cluster correctly? By the end of this chapter, we will have covered the following topics:

  • What the basic components of an Elasticsearch cluster are

  • What the key concepts behind the distribution architecture are

  • What primary and replica shards do

  • How to choose the right amount of shards and replicas

Basic concepts


An Elasticsearch cluster is a physical and logical grouping of the nodes that are allocated to it. Initially, you don't need to configure anything for your cluster. When a node is started, Elasticsearch creates a directory based on the defined cluster name, and the node is allocated to this directory. When you create an index, Elasticsearch creates some shards in the background and, unless configured otherwise, replicas as well. The generated shards are also allocated to the same node.

Elasticsearch is built to scale. When more capacity is needed, it is enough to increase the number of nodes. The cluster then reorganizes itself to take advantage of the extra hardware and distributes the load. Elasticsearch handles clustering well, and this ability is one of its most important advantages.

In the following section, we will look closely at the basic components of an Elasticsearch cluster.

Node


A node is a single instance of the Elasticsearch server and it can host data. This means that shards of indices are allowed to be allocated on the nodes. By default, each node is considered to be a data node, but you can turn the setting off.

Note

You can make a non-data node by adding node.data: false to the elasticsearch.yml file.

Non-data nodes

There are two types of non-data nodes: dedicated master nodes and client nodes.

Dedicated master nodes

Dedicated master nodes have the settings node.data: false and node.master: true. Such nodes are responsible for managing the cluster. Index and search requests are not sent to these nodes.

Client nodes

Client nodes have the settings node.data: false and node.master: false. They can be used to balance the load because all HTTP communication can be performed through these nodes.
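
The three roles described so far differ only in two settings, which can be summarized in a small shell helper. This is a minimal sketch: the function name node_role_settings and the role labels data, master, and client are hypothetical, and only the node.master and node.data settings come from the text.

```shell
#!/bin/sh
# Print the elasticsearch.yml lines for a node role described in the text.
# The function name and role labels are hypothetical; only the
# node.master / node.data settings come from the chapter.
node_role_settings() {
  case "$1" in
    data)   printf 'node.data: true\n' ;;
    master) printf 'node.master: true\nnode.data: false\n' ;;
    client) printf 'node.master: false\nnode.data: false\n' ;;
    *)      echo "unknown role: $1" >&2; return 1 ;;
  esac
}

# Example: show the settings for a dedicated master node.
node_role_settings master
```

The printed lines could be appended to a node's elasticsearch.yml file before that node is started.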

Tribe node

Another type of Elasticsearch node is the tribe node. Normally, a node is associated with a single cluster. But sometimes, all the connected clusters may feel...

Shards


When you create an index, Elasticsearch subdivides it into multiple Lucene indices called shards. This process is called sharding. Shards are managed automatically by Elasticsearch, and each shard is in itself a fully functional and independent index. You can define the number of shards. By default, a shard is refreshed once per second, which is how Elasticsearch supports near real-time search. Shards are useful when working with large data: for a large index, the disk capacity of a single node may not be sufficient, or a single node may be too slow to serve search requests. Shards solve these and similar problems and allow you to scale your content volume horizontally.

Replicas


By default, Elasticsearch creates five primary shards and a copy of each primary shard when you create an index. These copies are called replicas, so a replica shard is simply a copy of a primary shard. Replica shards are used to improve search performance and to provide failover. If a node crashes, Elasticsearch uses available replicas of the shards that were on that node to avoid data loss. For this reason, a replica is never allocated on the same node as its primary shard. Hence, choosing the right amount of shards and replicas is very important. Unlike primary shards, replicas can be added and removed at any time. The number of primary shards must be specified when the index is created.
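
Because replicas can be changed at any time, the replica count of an existing index can be adjusted through the update settings API. The following is a minimal sketch, assuming a node listening on localhost:9200; the helper names replica_settings_body and set_replicas are hypothetical.

```shell
#!/bin/sh
# Build the JSON body for the update-settings request.
replica_settings_body() {
  printf '{"index": {"number_of_replicas": %d}}' "$1"
}

# Send the new replica count for an index to a node on localhost:9200.
# The Content-Type header is required by Elasticsearch 6.0 and later;
# it is included here so the sketch is not tied to one release.
set_replicas() {
  curl -XPUT "localhost:9200/$1/_settings" \
    -H 'Content-Type: application/json' \
    -d "$(replica_settings_body "$2")"
}

# Usage (needs a running cluster): set_replicas my_index 2
```

Nothing is sent until set_replicas is called, so the block can be sourced safely without a cluster.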

Explaining the architecture of distribution


When we start a single node, we initially have no index and no data. In other words, we have an empty cluster. When we create an index with the default settings, the cluster will take the following view:

As mentioned earlier, Elasticsearch creates five primary shards and a copy of each primary shard by default. But replicas do not appear in the preceding view. Why?

Let's look for the answer to our question by using the Cat API:

curl -XGET 'localhost:9200/_cat/shards'

my_index 4 p STARTED    0 144b 192.168.1.22 Digitek
my_index 4 r UNASSIGNED
my_index 0 p STARTED    0 144b 192.168.1.22 Digitek
my_index 0 r UNASSIGNED
my_index 3 p STARTED    0 144b 192.168.1.22 Digitek
my_index 3 r UNASSIGNED
my_index 1 p STARTED    0 144b 192.168.1.22 Digitek
my_index 1 r UNASSIGNED
my_index 2 p STARTED    0 144b 192.168.1.22 Digitek
my_index 2 r UNASSIGNED

As you can see, there are five shards in our cluster, their states are STARTED, and there are other five...
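
With more shards, reading the listing row by row becomes tedious, so the type and state columns can be tallied instead. The sketch below assumes the column order of the listing (index, shard number, p or r, state); the function name summarize_shards is hypothetical.

```shell
#!/bin/sh
# Count shard copies per type and state from `_cat/shards` output on stdin.
# Assumes the column order of the listing: index, shard, prirep, state, ...
summarize_shards() {
  awk '{ count[$3 " " $4]++ }
       END { for (k in count) print k, count[k] }' | sort
}

# Sample rows shaped like the listing (two of the five shards).
summarize_shards <<'EOF'
my_index 4 p STARTED    0 144b 192.168.1.22 Digitek
my_index 4 r UNASSIGNED
my_index 0 p STARTED    0 144b 192.168.1.22 Digitek
my_index 0 r UNASSIGNED
EOF
```

Against a live node, the same function can be fed directly: curl -s 'localhost:9200/_cat/shards' | summarize_shards.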

Correctly configuring the cluster


While understanding the distribution of shards is essential, understanding the distribution of documents is also critical. Elasticsearch tries to spread documents evenly across shards. This is sensible behavior: having one shard hold the majority of the data would be unwise.
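
Elasticsearch decides where a document goes by hashing its routing value (the document ID by default) and taking the remainder modulo the number of primary shards. The sketch below imitates that idea only: cksum stands in for Elasticsearch's internal hash function, so the shard numbers it prints will not match a real cluster, and the function name shard_for and the document IDs are hypothetical.

```shell
#!/bin/sh
# Pick a shard for a document ID: hash the ID, then take the remainder
# modulo the number of primary shards. cksum (CRC) is only a stand-in
# for Elasticsearch's internal hash, so real shard numbers will differ.
shard_for() {
  crc=$(printf '%s' "$1" | cksum | cut -d' ' -f1)
  echo $(( crc % $2 ))
}

# Route a few hypothetical document IDs across two primary shards.
for id in doc-1 doc-2 doc-3 doc-4; do
  echo "$id -> shard $(shard_for "$id" 2)"
done
```

The same ID always maps to the same shard as long as the divisor stays the same, which is one way to see why the number of primary shards is fixed when the index is created.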

Let's start two Elasticsearch nodes and create an index by running the following command:

curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 0
  }
}'
{"acknowledged":true}

We've created an index that consists of two shards and has no replicas. Now we add a document to the index:

curl -XPOST localhost:9200/my_index/document -d '{
  "title": "The first document"
}'
{"_index":"my_index","_type":"document","_id":"AU_iaqgDlNVjy8IaI4FM","_version":1,"created":true}

We will get the current shard-level stats of my_index by using the following command:
curl -XGET 'localhost:9200/my_index/_stats?level=shards&pretty'
{
...
"shards": {
    ...

Choosing the right amount of shards and replicas


If your dataset is limited and grows slowly, a single primary shard with one replica can be enough. If your dataset is not limited and grows quickly, the optimal number of shards depends on the target number of nodes.

Actually, a single node can be sufficient for many simple use cases, but to increase fault tolerance, given the nature of distributed architecture, and to prevent data loss, you can use more than one node. So, the first question we need to answer is: how many nodes will we run?

To answer this question, we first need to answer a few others. For example: do we need non-data nodes? If we don't, then, considering the Elasticsearch shard allocation policy, we can say that a node needs at least one shard, primary or replica, to act as a data node. In that case, we can use the following formula:

Max number of data nodes ...
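
As a rough illustration of the arithmetic behind this reasoning, offered as an assumption rather than as the formula itself: if every data node should hold at least one shard copy of an index, the index cannot usefully occupy more data nodes than it has shard copies, and the number of copies is the number of primary shards multiplied by one plus the number of replicas. The function name total_shard_copies is hypothetical.

```shell
#!/bin/sh
# Total shard copies of an index: each primary plus its replicas.
total_shard_copies() {
  echo $(( $1 * (1 + $2) ))
}

# Default settings described earlier: 5 primaries, 1 replica each.
total_shard_copies 5 1   # prints 10
```

For the two-shard, zero-replica index created earlier, the same call with the arguments 2 and 0 gives 2.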

Summary


In this chapter, we looked at the basic concepts of an Elasticsearch cluster and its core components. After this, we discussed how to configure a cluster correctly. Finally, we discussed choosing the right amount of shards and replicas. In the next chapter, you will learn tips to improve indexing performance. We will look at memory settings and at how optimizing the mapping definition improves indexing performance. In addition, we will examine the segment merge policy and look at some cases related to this topic. Finally, we will look at the bulk API.
