Elasticsearch 7.0 Cookbook - Fourth Edition

By Alberto Paro

About this book

Elasticsearch is a Lucene-based distributed search server that allows users to index and search unstructured content with petabytes of data. With this book, you'll be guided through comprehensive recipes on what's new in Elasticsearch 7, and see how to create and run complex queries and analytics.

Packed with recipes on performing index mapping, aggregation, and scripting using Elasticsearch, this fourth edition of Elasticsearch Cookbook will get you acquainted with numerous solutions and quick techniques for performing both everyday and uncommon tasks such as deploying Elasticsearch nodes, integrating other tools with Elasticsearch, and creating different visualizations. You will install Kibana to monitor a cluster and also extend it using a variety of plugins. Finally, you will integrate your Java, Scala, Python, and big data applications such as Apache Spark and Pig with Elasticsearch, and create efficient data applications powered by enhanced functionalities and custom plugins.

By the end of this book, you will have gained in-depth knowledge of implementing Elasticsearch architecture, and you'll be able to manage, search, and store data efficiently and effectively using Elasticsearch.

Publication date: April 2019
Publisher: Packt
Pages: 724
ISBN: 9781789956504

 

Getting Started

In this chapter, we will cover the following recipes:

  • Downloading and installing Elasticsearch
  • Setting up networking
  • Setting up a node
  • Setting up Linux systems
  • Setting up different node types
  • Setting up a coordinator node
  • Setting up an ingestion node
  • Installing plugins in Elasticsearch
  • Removing a plugin
  • Changing logging settings
  • Setting up a node via Docker
  • Deploying on Elasticsearch Cloud Enterprise
 

Technical requirements

Elasticsearch runs on Linux/macOS X/Windows and its only requirement is to have Java 8.x installed. Usually, I recommend using the Oracle JDK. The code that accompanies this book is available at https://github.com/aparo/elasticsearch-7.x-cookbook.

If you don't want to go into the details of installing and configuring your Elasticsearch instance, for a quick start, you can skip to the Setting up a node via Docker recipe at the end of this chapter and fire up Docker Compose, which will install an Elasticsearch instance with Kibana and other tools quickly.

 

Downloading and installing Elasticsearch

Elasticsearch has an active community and the release cycles are very fast.

Because Elasticsearch depends on many common Java libraries (Lucene, Guice, and Jackson are the most famous ones), the Elasticsearch community tries to keep them updated and fixes bugs that are discovered in them and in the Elasticsearch core. The large user base is also a source of new ideas and features for improving Elasticsearch use cases.

For these reasons, if possible, it's best to use the latest available release (usually the most stable one, with the fewest bugs).

Getting ready

To install Elasticsearch, you need a supported operating system (Linux/macOS X/Windows) with a Java virtual machine (JVM) 1.8 or higher installed. The Oracle JDK is preferred; more information on this can be found at http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html. A web browser is required to download the Elasticsearch binary release. At least 1 GB of free disk space is required to install Elasticsearch.

How to do it…

We will start by downloading Elasticsearch from the web. The latest version is always downloadable at https://www.elastic.co/downloads/elasticsearch. The versions that are available for different operating systems are as follows:

  • elasticsearch-{version-number}.zip and elasticsearch-{version-number}.msi are for the Windows operating systems.
  • elasticsearch-{version-number}.tar.gz is for Linux/macOS X, while elasticsearch-{version-number}.deb is for Debian-based Linux distributions (this also covers the Ubuntu family); this is installable with Debian using the dpkg -i elasticsearch-*.deb command.
  • elasticsearch-{version-number}.rpm is for Red Hat-based Linux distributions (this also covers the CentOS family). This is installable with the rpm -i elasticsearch-*.rpm command.
The preceding packages contain everything to start Elasticsearch. This book targets version 7.x or higher. At the time of writing, the latest stable version of Elasticsearch was 7.0.0. To check whether this is still the latest version, visit https://www.elastic.co/downloads/elasticsearch.

Extract the binary content. After downloading the correct release for your platform, the installation involves expanding the archive in a working directory.

Choose a working directory that is free of charset problems and does not have a long path. This prevents problems when Elasticsearch creates its directories to store index data.

For the Windows platform, a good directory in which to install Elasticsearch could be c:\es; on Unix and macOS X, a good choice is /opt/es.

To run Elasticsearch, you need a JVM 1.8 or higher installed. For better performance, I suggest that you use the latest Sun/Oracle version.

If you are a macOS X user and you have installed Homebrew (http://brew.sh/), the first and the second steps are automatically managed by the brew install elasticsearch command.

Let's start Elasticsearch to check if everything is working. To start your Elasticsearch server, just access the directory, and for Linux and macOS X execute the following:

# bin/elasticsearch

Alternatively, you can type the following command line for Windows:

# bin\elasticsearch.bat

Your server should now start up and show logs similar to the following:

[2018-10-28T16:19:41,189][INFO ][o.e.n.Node ] [] initializing ...
[2018-10-28T16:19:41,245][INFO ][o.e.e.NodeEnvironment ] [fyBySLM] using [1] data paths, mounts [[/ (/dev/disk1s1)]], net usable_space [141.9gb], net total_space [465.6gb], types [apfs]
[2018-10-28T16:19:41,246][INFO ][o.e.e.NodeEnvironment ] [fyBySLM] heap size [989.8mb], compressed ordinary object pointers [true]
[2018-10-28T16:19:41,247][INFO ][o.e.n.Node ] [fyBySLM] node name derived from node ID [fyBySLMcR3uqKiYC32P5Sg]; set [node.name] to override
[2018-10-28T16:19:41,247][INFO ][o.e.n.Node ] [fyBySLM] version[6.4.2], pid[50238], build[default/tar/04711c2/2018-09-26T13:34:09.098244Z], OS[Mac OS X/10.14/x86_64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_181/25.181-b13]
[2018-10-28T16:19:41,247][INFO ][o.e.n.Node ] [fyBySLM] JVM arguments [-Xms1g, -Xmx1g,
... truncated ...
[2018-10-28T16:19:42,511][INFO ][o.e.p.PluginsService ] [fyBySLM] loaded module [aggs-matrix-stats]
[2018-10-28T16:19:42,511][INFO ][o.e.p.PluginsService ] [fyBySLM] loaded module [analysis-common]
...truncated...
[2018-10-28T16:19:42,513][INFO ][o.e.p.PluginsService ] [fyBySLM] no plugins loaded
...truncated...
[2018-10-28T16:19:46,776][INFO ][o.e.n.Node ] [fyBySLM] initialized
[2018-10-28T16:19:46,777][INFO ][o.e.n.Node ] [fyBySLM] starting ...
[2018-10-28T16:19:46,930][INFO ][o.e.t.TransportService ] [fyBySLM] publish_address {127.0.0.1:9300}, bound_addresses {[::1]:9300}, {127.0.0.1:9300}
[2018-10-28T16:19:49,983][INFO ][o.e.c.s.MasterService ] [fyBySLM] zen-disco-elected-as-master ([0] nodes joined)[, ], reason: new_master {fyBySLM}{fyBySLMcR3uqKiYC32P5Sg}{-pUWNdRlTwKuhv89iQ6psg}{127.0.0.1}{127.0.0.1:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
...truncated...
[2018-10-28T16:19:50,452][INFO ][o.e.l.LicenseService ] [fyBySLM] license [b2754b17-a4ec-47e4-9175-4b2e0d714a45] mode [basic] - valid
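
Besides reading the logs, a quick way to confirm that the node is answering is to query its HTTP port. The following is a minimal Python sketch, assuming the node listens on the default 127.0.0.1:9200 without authentication; the function names are illustrative and not part of any Elasticsearch client.

```python
import json
from urllib.request import urlopen


def summarize_root_response(body: str) -> dict:
    """Pick the fields that are most useful for a quick sanity check."""
    info = json.loads(body)
    return {
        "node": info.get("name"),
        "cluster": info.get("cluster_name"),
        "version": info.get("version", {}).get("number"),
    }


def check_node(url: str = "http://127.0.0.1:9200", timeout: float = 5.0) -> dict:
    """Fetch the root endpoint of a running node and summarize the answer.

    The URL is an assumption: adjust it if network.host or the HTTP port
    has been changed in elasticsearch.yml.
    """
    with urlopen(url, timeout=timeout) as response:
        return summarize_root_response(response.read().decode("utf-8"))
```

Calling check_node() while the server is running should return the node name, the cluster name, and the version number; a connection error usually means the node is not up yet or is bound to a different address.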

How it works…

The Elasticsearch package generally contains the following directories:

  • bin: This contains the scripts to start and manage Elasticsearch.
    • elasticsearch.bat: This is the main executable script to start Elasticsearch.
    • elasticsearch-plugin.bat: This is a script to manage plugins.
  • config: This contains the Elasticsearch configs. The most important ones are as follows:
    • elasticsearch.yml: This is the main config file for Elasticsearch
    • log4j2.properties: This is the logging config file
  • lib: This contains all the libraries required to run Elasticsearch.
  • logs: This directory is empty at installation time, but in the future, it will contain the application logs.
  • modules: This contains the Elasticsearch default plugin modules.
  • plugins: This directory is empty at installation time, but it's the place where custom plugins will be installed.

During Elasticsearch startup, the following events happen:

  • A node name is generated automatically (for example, fyBySLM) if it is not provided in elasticsearch.yml. The name is randomly generated, so it's a good idea to set it to a meaningful and memorable name instead.
  • A node name hash is generated for this node, for example, fyBySLMcR3uqKiYC32P5Sg.
  • The default installed modules are loaded. The most important ones are as follows:
    • aggs-matrix-stats: This provides support for aggregation matrix stats.
    • analysis-common: This is a common analyzer for Elasticsearch, which extends the language processing capabilities of Elasticsearch.
    • ingest-common: This includes common functionalities for the ingest module.
    • lang-expression/lang-mustache/lang-painless: These are the default supported scripting languages of Elasticsearch.
    • mapper-extras: This provides extra mapper types to be used, such as token_count and scaled_float.
    • parent-join: This provides extra queries, such as has_child and has_parent.
    • percolator: This provides percolator capabilities.
    • rank-eval: This provides support for the experimental rank evaluation APIs. These are used to evaluate hit scoring based on queries.
    • reindex: This provides support for reindex actions (reindex/update by query).
    • x-pack-*: All the xpack modules depend on a subscription for their activation.
  • If there are plugins, they are loaded.
  • If not configured, Elasticsearch binds the following two ports on the localhost 127.0.0.1 automatically:
    • 9300: This port is used for internal internode communication.
    • 9200: This port is used for the HTTP REST API.
  • After starting, if indices are available, they are restored and ready to be used.

If these port numbers are already bound, Elasticsearch automatically increments the port number and tries to bind again until a free port is found (for example, 9201, 9202, and so on).
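
The port fallback described above can be illustrated with a small Python sketch. This is not Elasticsearch's own code, only an approximation of the idea: probe ports in increasing order and stop at the first one that can be bound. The helper names, the attempts limit, and the injectable probe function are assumptions made for the example.

```python
import socket


def can_bind(host: str, port: int) -> bool:
    """Return True if a TCP socket can currently be bound to host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
        return True


def first_free_port(start: int, host: str = "127.0.0.1",
                    attempts: int = 100, probe=can_bind):
    """Try start, start + 1, ... and return the first port the probe accepts.

    Returns None if none of the `attempts` candidate ports is free.
    """
    for port in range(start, start + attempts):
        if probe(host, port):
            return port
    return None
```

If ports 9200 and 9201 are taken on a machine, first_free_port(9200) would return 9202, which mirrors what you see in the logs when a second and a third node are started on the same host.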

There are more events that are fired during Elasticsearch startup. We'll see them in detail in other recipes.

There's more…

During a node's startup, a lot of required services are automatically started. The most important ones are as follows:

  • Cluster services: This helps you manage the cluster state and internode communication and synchronization
  • Indexing service: This helps you manage all the index operations, initializing all active indices and shards
  • Mapping service: This helps you manage the document types stored in the cluster (we'll discuss mapping in Chapter 2, Managing Mapping)
  • Network services: This includes services such as the HTTP REST services (default on port 9200) and the internal Elasticsearch protocol (default on port 9300)
  • Plugin service: This manages the loading of plugins
  • Aggregation services: This provides advanced analytics on stored Elasticsearch documents such as statistics, histograms, and document grouping
  • Ingesting services: This provides support for document preprocessing before ingestion such as field enrichment, NLP processing, types conversion, and automatic field population
  • Language scripting services: This allows you to add new language scripting support to Elasticsearch

See also

The Setting up networking recipe we're going to cover next will help you with the initial network setup. Check the official Elasticsearch download page at https://www.elastic.co/downloads/elasticsearch to get the latest version.

 

Setting up networking

Correctly setting up networking is very important for your nodes and cluster.

There are a lot of different installation scenarios and networking issues. The first step for configuring the nodes to build a cluster is to correctly set the node discovery.

Getting ready

To change configuration files, you will need a working Elasticsearch installation and a simple text editor, as well as your current networking configuration (your IP).

How to do it…

To set up networking, use the following steps:

  1. Using a standard Elasticsearch configuration config/elasticsearch.yml file, your node will be configured to bind on the localhost interface (by default) so that it can't be accessed by external machines or nodes.
  2. To allow another machine to connect to our node, we need to set network.host to our IP (for example, I have 192.168.1.164).
  3. To be able to discover other nodes, we need to list them in the discovery.zen.ping.unicast.hosts parameter. The node then pings each machine in the unicast list and waits for a response. If a node responds, the two can join in a cluster.
  4. In general, from Elasticsearch version 6.x, the node versions are compatible. You must have the same cluster name (the cluster.name option in elasticsearch.yml) to let nodes join with each other.
The best practice is to have all the nodes installed with the same Elasticsearch version (major.minor.release). This suggestion is also valid for third-party plugins.
  5. To customize the network preferences, you need to change some parameters in the elasticsearch.yml file, as follows:
cluster.name: ESCookBook
node.name: "Node1"
network.host: 192.168.1.164
discovery.zen.ping.unicast.hosts: ["192.168.1.164","192.168.1.165[9300-9400]"]
  6. This configuration sets the cluster name to ESCookBook, the node name to Node1, and the network address to 192.168.1.164, and it tries to discover other nodes at the addresses given in the discovery section. We can now start the server and check in the startup logs whether the networking is configured, as follows:
    [2018-10-28T17:42:16,386][INFO ][o.e.c.s.MasterService ] [Node1] zen-disco-elected-as-master ([0] nodes joined)[, ], reason: new_master {Node1}{fyBySLMcR3uqKiYC32P5Sg}{IX1wpA01QSKkruZeSRPlFg}{192.168.1.164}{192.168.1.164:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}
    [2018-10-28T17:42:16,390][INFO ][o.e.c.s.ClusterApplierService] [Node1] new_master {Node1}{fyBySLMcR3uqKiYC32P5Sg}{IX1wpA01QSKkruZeSRPlFg}{192.168.1.164}{192.168.1.164:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true}, reason: apply cluster state (from master [master {Node1}{fyBySLMcR3uqKiYC32P5Sg}{IX1wpA01QSKkruZeSRPlFg}{192.168.1.164}{192.168.1.164:9300}{ml.machine_memory=17179869184, xpack.installed=true, ml.max_open_jobs=20, ml.enabled=true} committed version [1] source [zen-disco-elected-as-master ([0] nodes joined)[, ]]])
    [2018-10-28T17:42:16,403][INFO ][o.e.x.s.t.n.SecurityNetty4HttpServerTransport] [Node1] publish_address {192.168.1.164:9200}, bound_addresses {192.168.1.164:9200}
    [2018-10-28T17:42:16,403][INFO ][o.e.n.Node ] [Node1] started
    [2018-10-28T17:42:16,600][INFO ][o.e.l.LicenseService ] [Node1] license [b2754b17-a4ec-47e4-9175-4b2e0d714a45] mode [basic] - valid

    As you can see from my log output, the transport is bound to 192.168.1.164:9300. The REST HTTP interface is bound to 192.168.1.164:9200.
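
If you prefer to check the bound address programmatically rather than by eye, the publish_address entry can be pulled out of a startup log line. The following Python sketch assumes the log format shown above; the regular expression and the function name are illustrative only and handle plain host:port pairs, not bracketed IPv6 addresses.

```python
import re

# Matches the first "publish_address {host:port}" fragment in a log line.
PUBLISH_RE = re.compile(r"publish_address \{(?P<host>[^{}]+):(?P<port>\d+)\}")


def publish_address(log_line: str):
    """Return (host, port) from a startup log line, or None if absent."""
    match = PUBLISH_RE.search(log_line)
    if match is None:
        return None
    return match.group("host"), int(match.group("port"))
```

Applied to the SecurityNetty4HttpServerTransport line above, the function yields the host 192.168.1.164 and the port 9200.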

    How it works…

    The following are the most important configuration keys for networking management:

    • cluster.name: This sets up the name of the cluster. Only nodes with the same name can join together.
    • node.name: If not defined, this is automatically assigned by Elasticsearch.

    node.name allows you to define a name for the node. If you have a lot of nodes on different machines, it is useful to set their names to something meaningful in order to easily locate them. A meaningful name is easier to remember than a generated one such as fyBySLMcR3uqKiYC32P5Sg.

    You must always set up a node.name if you need to monitor your server. Generally, a node name is the same as a host server name for easy maintenance.

    network.host defines the IP of your machine to be used to bind the node. If your server is on different LANs, or you want to limit the bind on only one LAN, you must set this value with your server IP.

    discovery.zen.ping.unicast.hosts allows you to define a list of hosts (with ports or a port range) to be used to discover other nodes to join the cluster. The preferred port is the transport one, usually 9300.

    The addresses of the hosts list can be a mix of the following:

    • Hostname, for example, myhost1
    • IP address, for example, 192.168.1.12
    • IP address or hostname with the port, for example, myhost1:9300, 192.168.1.2:9300
    • IP address or hostname with a range of ports, for example, myhost1:[9300-9400], 192.168.1.2:[9300-9400]
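
To make the accepted entry shapes concrete, here is a small Python sketch that splits a host entry into a host and an optional port or port range. It is not the parser that Elasticsearch uses; the regular expression and function name are assumptions for illustration, it covers hostnames and IPv4 addresses only, and it accepts the range both with and without the colon because both spellings appear in this recipe.

```python
import re

ENTRY_RE = re.compile(
    r"^(?P<host>[^:\[\]]+)"                 # hostname or IPv4 address
    r"(?::?\[(?P<low>\d+)-(?P<high>\d+)\]"  # optional port range, [9300-9400]
    r"|:(?P<port>\d+))?$"                   # or a single port, :9300
)


def parse_host_entry(entry: str):
    """Return (host, first_port, last_port); ports are None when omitted."""
    match = ENTRY_RE.match(entry.strip())
    if match is None:
        raise ValueError(f"unrecognized host entry: {entry!r}")
    host = match.group("host")
    if match.group("port") is not None:
        port = int(match.group("port"))
        return host, port, port
    if match.group("low") is not None:
        low, high = int(match.group("low")), int(match.group("high"))
        if low > high:
            raise ValueError(f"empty port range in {entry!r}")
        return host, low, high
    return host, None, None
```

For example, parse_host_entry("myhost1:9300") gives the host myhost1 with 9300 as both the first and last port, while a bare hostname leaves both ports unset so that the default transport port applies.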

    See also

    The Setting up a node recipe in this chapter

     

    Setting up a node

    Elasticsearch allows the customization of several parameters in an installation. In this recipe, we'll see the most used ones to define where to store our data and improve overall performance.

    Getting ready

    As described in the Downloading and installing Elasticsearch recipe, you need a working Elasticsearch installation and a simple text editor to change configuration files.

    How to do it…

    The steps required for setting up a simple node are as follows:

    1. Open the config/elasticsearch.yml file with an editor of your choice.
    2. Set up the directories that store your server data, as follows:
    • For Linux or macOS X, add the following path entries (using /opt/data as the base path):
    path.conf: /opt/data/es/conf
    path.data: /opt/data/es/data1,/opt2/data/data2
    path.work: /opt/data/work
    path.logs: /opt/data/logs
    path.plugins: /opt/data/plugins

    • For Windows, add the following path entries (using c:\Elasticsearch as the base path):
    path.conf: c:\Elasticsearch\conf
    path.data: c:\Elasticsearch\data
    path.work: c:\Elasticsearch\work
    path.logs: c:\Elasticsearch\logs
    path.plugins: c:\Elasticsearch\plugins
    3. Set up the parameters to control the standard index shard and replication at creation. These parameters are as follows:
    index.number_of_shards: 1
    index.number_of_replicas: 1

    How it works…

    The path.conf parameter defines the directory that contains your configurations, mainly elasticsearch.yml and log4j2.properties. The default is $ES_HOME/config, where ES_HOME is the installation directory of your Elasticsearch server.

    It's useful to set up the config directory outside your application directory so that you don't need to copy the configuration files every time you update your Elasticsearch server.

    The path.data parameter is the most important one. This allows us to define one or more directories (possibly on different disks) where you can store your index data. When you define more than one directory, they are managed similarly to RAID 0 (their space is summed up), favoring locations with the most free space.
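
The idea of summing the space of several data directories and favoring the one with the most free space can be sketched in a few lines of Python. This only illustrates the principle stated above and is not Elasticsearch's allocation logic; the function names and the injectable free function are assumptions for the example.

```python
import shutil


def free_bytes(path: str) -> int:
    """Free space, in bytes, on the file system that holds `path`."""
    return shutil.disk_usage(path).free


def total_free(paths, free=free_bytes) -> int:
    """Space available across all data paths (they add up, as in RAID 0)."""
    return sum(free(path) for path in paths)


def pick_data_path(paths, free=free_bytes) -> str:
    """Choose the data path that currently has the most free space."""
    if not paths:
        raise ValueError("at least one data path is required")
    return max(paths, key=free)
```

With the two Linux paths from this recipe, pick_data_path(["/opt/data/es/data1", "/opt2/data/data2"]) returns whichever directory sits on the disk with more room at that moment.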

    The path.work parameter is a location in which Elasticsearch stores temporary files.

    The path.logs parameter is where log files are put. How the logs are managed is controlled in log4j2.properties.

    The path.plugins parameter allows you to override the plugins path (the default is $ES_HOME/plugins). It's useful to put system-wide plugins in a shared path (usually using NFS) in case you want a single place where you store your plugins for all of the clusters.

    The main parameters that control indices and shards are index.number_of_shards, which controls the standard number of shards for a newly created index, and index.number_of_replicas, which controls the initial number of replicas.
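
As a quick piece of arithmetic, each primary shard is copied once per replica, so the number of shard copies that the cluster has to place is shards × (1 + replicas). The Python helper below is only an illustration of that calculation; its name is made up for the example.

```python
def total_shard_copies(number_of_shards: int, number_of_replicas: int) -> int:
    """Primaries plus replica copies for a single index."""
    if number_of_shards < 1 or number_of_replicas < 0:
        raise ValueError("need at least one shard and no negative replicas")
    return number_of_shards * (1 + number_of_replicas)
```

With the values used in this recipe (one shard, one replica), an index consists of two shard copies, which is why a second node is needed before the replica can be allocated.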

    See also

     

    Setting up Linux systems

    If you are using a Linux system (generally in a production environment), you need to manage extra setup to improve performance or to resolve production problems with many indices.

    This recipe covers the following two common errors that happen in production:

    • Too many open files that can corrupt your indices and your data
    • Slow performance in search and indexing due to the garbage collector
    Big problems arise when you run out of disk space. In this scenario, some files can get corrupted. To protect your indices from corruption and possible data loss, it is best to monitor the storage space. Default settings prevent index writing and block the cluster if your storage is over 80% full.

    Getting ready

    As we described in the Downloading and installing Elasticsearch recipe in this chapter, you need a working Elasticsearch installation and a simple text editor to change configuration files.

    How to do it…

    To improve the performance on Linux systems, we will perform the following steps:

    1. First, you need to change the current limit for the user that runs the Elasticsearch server. In these examples, we will call this user elasticsearch.
    2. To allow Elasticsearch to manage a large number of files, you need to increase the number of file descriptors (number of files) that a user can manage. To do so, you must edit your /etc/security/limits.conf file and add the following lines at the end:
    elasticsearch - nofile 65536
    elasticsearch - memlock unlimited
    3. Then, a machine restart is required to be sure that the changes have been made.
    4. Recent versions of Ubuntu (that is, version 16.04 or later) can skip the /etc/security/limits.conf file in the init.d scripts. In these cases, you need to edit the relevant file in /etc/pam.d/ and uncomment the following line:
    # session required pam_limits.so
    5. To control memory swapping, you need to set up the following parameter in elasticsearch.yml:
    bootstrap.memory_lock: true
    6. To fix the memory usage size of the Elasticsearch server, we need to set up the same values for Xms and Xmx in $ES_HOME/config/jvm.options (that is, we set 1 GB of memory in this case), as follows:
    -Xms1g
    -Xmx1g
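
To double-check the last step, the heap flags can be read back from jvm.options and compared. The Python sketch below assumes the simple one-flag-per-line layout shown above; the function names are illustrative, and lines that are comments or other JVM flags are simply ignored.

```python
import re

HEAP_RE = re.compile(r"^-(Xms|Xmx)(\d+)([kKmMgG]?)$")
UNIT_BYTES = {"": 1, "k": 1024, "m": 1024 ** 2, "g": 1024 ** 3}


def heap_settings(jvm_options_text: str) -> dict:
    """Map 'Xms'/'Xmx' to their size in bytes for every flag that is set."""
    sizes = {}
    for raw_line in jvm_options_text.splitlines():
        match = HEAP_RE.match(raw_line.strip())
        if match:
            flag, amount, unit = match.groups()
            sizes[flag] = int(amount) * UNIT_BYTES[unit.lower()]
    return sizes


def heap_is_fixed(jvm_options_text: str) -> bool:
    """True when both flags are present and set to the same size."""
    sizes = heap_settings(jvm_options_text)
    return "Xms" in sizes and sizes.get("Xms") == sizes.get("Xmx")
```

Reading the file with open(path).read() and passing the text to heap_is_fixed gives a quick yes/no answer before the node is restarted.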

    How it works…

    The standard limit of file descriptors (https://www.bottomupcs.com/file_descriptors.xhtml) (maximum number of open files for a user) is typically 1,024 or 8,096. When you store a lot of records in several indices, you run out of file descriptors very quickly, so your Elasticsearch server becomes unresponsive and your indices may become corrupted, causing you to lose your data.

    Changing the limit to a very high number means that your Elasticsearch doesn't hit the maximum number of open files.
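
Whether the new limit is actually in effect can be verified from a process started by the same user. The following Python sketch uses the standard resource module, which is available on Unix-like systems only; the 65536 threshold mirrors the value used in this recipe, and the function name and the optional limits argument are assumptions for the example.

```python
import resource  # Unix-only standard library module


def nofile_limit_ok(required: int = 65536, limits=None) -> bool:
    """Check that the soft open-files limit is at least `required`.

    `limits` is an optional (soft, hard) pair, mainly useful for testing;
    by default the limits of the current process are read.
    """
    if limits is None:
        limits = resource.getrlimit(resource.RLIMIT_NOFILE)
    soft, _hard = limits
    if soft == resource.RLIM_INFINITY:
        return True
    return soft >= required
```

Running this as the elasticsearch user after the restart should report True; if it does not, the limits.conf change has not been picked up for that session.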

    The other setting for memory prevents Elasticsearch from swapping memory and gives a performance boost. This setting is required because, during indexing and searching, Elasticsearch creates and destroys a lot of objects in memory. This large number of create/destroy actions fragments the memory and reduces performance. The memory then becomes full of holes and, when the system needs to allocate more memory, it suffers an overhead to find compacted memory. If you don't set bootstrap.memory_lock: true, Elasticsearch dumps the whole process memory on disk and defragments it back in memory, freezing the system. With this setting, the defragmentation step is done all in memory, with a huge performance boost.

     

    Setting up different node types

    Elasticsearch is natively designed for the cloud, so when you need to release a production environment with a huge number of records and you need high availability and good performance, you need to aggregate more nodes in a cluster.

    Elasticsearch allows you to define different types of nodes to balance and improve overall performance.

    Getting ready

    As described in the Downloading and installing Elasticsearch recipe, you need a working Elasticsearch installation and a simple text editor to change the configuration files.

    How to do it…

    For the advanced setup of a cluster, there are some parameters that must be configured to define different node types.

    These parameters are in the config/elasticsearch.yml file, and they can be set with the following steps:

    1. Set up whether the node can be a master or not, as follows:
    node.master: true
    2. Set up whether a node must contain data or not, as follows:
    node.data: true
    3. Set up whether a node can work as an ingest node, as follows:
    node.ingest: true

    How it works…

    The node.master parameter establishes that the node can become a master for the cluster. The default value for this parameter is true. A master node is an arbiter for the cluster; it takes decisions about shard management, keeps the cluster status, and is the main controller of every index action. If your master nodes are overloaded, the whole cluster will suffer performance penalties. The node that coordinates a search distributes it across all data nodes and aggregates/rescores the results to return them to the user. In big data terms, it's the Reduce layer in the Map/Reduce search in Elasticsearch.

    The number of master nodes should always be odd.

    The node.data parameter allows you to store data in the node. The default value for this parameter is true. This node will be a worker that is responsible for indexing and searching data.

    By mixing these two parameters, it's possible to have different node types, as shown in the following table:

    node.master | node.data | Node description
    true        | true      | This is the default node. It can be a master and it contains data.
    false       | true      | This node never becomes a master node; it only holds data. It can be defined as a workhorse for your cluster.
    true        | false     | This node only serves as a master in order to avoid storing any data and to have free resources. This will be the coordinator of your cluster.
    false       | false     | This node acts as a search load balancer (fetching data from nodes, aggregating results, and so on). This kind of node is also called a coordinator or client node.

    The most frequently used node type is the first one, but if you have a very big cluster or special needs, you can change the scopes of your nodes to better serve searches and aggregations.
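
The four combinations in the preceding table can also be written down as a tiny lookup, which is handy when reviewing many configuration files at once. This Python sketch is purely illustrative; the function name and the short labels are made up for the example and only paraphrase the table.

```python
def describe_node(master: bool, data: bool) -> str:
    """Summarize the node type implied by node.master and node.data."""
    if master and data:
        return "default node: master-eligible and holds data"
    if data:
        return "data-only node: holds data, never becomes master"
    if master:
        return "dedicated master node: no data, manages the cluster"
    return "coordinator (client) node: search load balancer"
```

For example, describe_node(False, False) corresponds to the last row of the table, the coordinator node covered in the next recipe.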

    There's more…

    Related to the number of master nodes, there are settings that require at least half of them plus one to be available to ensure that the cluster is in a safe state (no risk of split brain: https://www.elastic.co/guide/en/elasticsearch/reference/6.4/modules-node.html#split-brain). This setting is discovery.zen.minimum_master_nodes, and it must be set to the following equation:

    (master_eligible_nodes / 2) + 1

    To have a High Availability (HA) cluster, you need at least three nodes that are masters with the value of minimum_master_nodes set to 2.
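
The formula above uses integer division, which is easy to get wrong by hand for larger clusters. The following Python sketch computes the value and the number of master-eligible nodes that can be lost while a majority remains; the function names are illustrative.

```python
def minimum_master_nodes(master_eligible_nodes: int) -> int:
    """Quorum size: (master_eligible_nodes / 2) + 1 with integer division."""
    if master_eligible_nodes < 1:
        raise ValueError("at least one master-eligible node is required")
    return master_eligible_nodes // 2 + 1


def tolerated_master_failures(master_eligible_nodes: int) -> int:
    """How many master-eligible nodes can fail while a quorum survives."""
    return master_eligible_nodes - minimum_master_nodes(master_eligible_nodes)
```

Three master-eligible nodes give a quorum of 2 and tolerate one failure. Four nodes raise the quorum to 3 but still tolerate only one failure, which is why an odd number of master-eligible nodes is the usual choice.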

     

    Setting up a coordinator node

    The master nodes that we have seen previously are the most important for cluster stability. To prevent the queries and aggregations from creating instability in your cluster, coordinator (or client/proxy) nodes can be used to provide safe communication with the cluster.

    Getting ready

    You need a working Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in this chapter, and a simple text editor to change configuration files.

    How to do it…

    For the advanced setup of a cluster, there are some parameters that must be configured to define different node types.

    These parameters are in the config/elasticsearch.yml file, and a coordinator node can be set up with the following steps:

    1. Set up the node so that it's not a master, as follows:
    node.master: false
    2. Set up the node to not contain data, as follows:
    node.data: false

    How it works…

    The coordinator node is a special node that works as a proxy/pass-through for the cluster. Its main advantages are as follows:

    • It can easily be killed or removed from the cluster without causing any problems. It's not a master, so it doesn't participate in cluster functionalities and it doesn't contain data, so there are no data relocations/replications due to its failure.
    • It prevents the instability of the cluster due to developers' or users' bad queries. Sometimes, a user executes aggregations that are too large (for example, date histograms with a range of some years and intervals of 10 seconds). Here, the Elasticsearch node could crash. (In its newest version, Elasticsearch has a structure called circuit breaker to prevent similar issues, but there are always borderline cases that can bring instability, using scripting, for example.) The coordinator node is not a master and its overload doesn't cause any problems for cluster stability.
    • If the coordinator or client node is embedded in the application, there are fewer round trips for the data, speeding up the application.
    • You can add them to balance the search and aggregation throughput without generating changes and data relocation in the cluster.
     

    Setting up an ingestion node

    The main goals of Elasticsearch are indexing, searching, and analytics, but it's often required to modify or enhance the documents before storing them in Elasticsearch.

    The following are the most common scenarios in this case: 

    • Preprocessing the log string to extract meaningful data
    • Enriching the content of textual fields with Natural Language Processing (NLP) tools
    • Enriching the content using machine learning (ML) computed fields
    • Adding data modification or transformation during ingestion, such as the following:
      • Converting IP in geolocalization
      • Adding datetime fields at ingestion time
      • Building custom fields (via scripting) at ingestion time
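
As an illustration of the "adding datetime fields at ingestion time" scenario, the sketch below builds the request for an ingest pipeline with a single set processor that stores the ingestion timestamp. It is only a sketch: the pipeline ID add-timestamp, the field name ingested_at, and the helper names are hypothetical, and the code only prepares the request rather than sending it.

```python
import json


def timestamp_pipeline(field: str = "ingested_at") -> dict:
    """Pipeline definition with one `set` processor.

    The field name is a hypothetical example; {{_ingest.timestamp}} is the
    ingest metadata value that holds the time of ingestion.
    """
    return {
        "description": "Add an ingestion timestamp to every document",
        "processors": [
            {"set": {"field": field, "value": "{{_ingest.timestamp}}"}}
        ],
    }


def pipeline_request(pipeline_id: str, pipeline: dict):
    """Return (method, path, body) for storing the pipeline via the REST API."""
    return "PUT", f"/_ingest/pipeline/{pipeline_id}", json.dumps(pipeline)
```

The resulting method, path, and body can be sent with any HTTP client to a node that has ingestion enabled; pipelines are covered in detail in Chapter 12, Using the Ingest module.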

    Getting ready

    You need a working Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe, as well as a simple text editor to change configuration files.

    How to do it…

    To set up an ingest node, you need to edit the config/elasticsearch.yml file and set the ingest property to true, as follows:

    node.ingest: true
    Every time you change your elasticsearch.yml file, a node restart is required.

    How it works…

    The default configuration for Elasticsearch is to set the node as an ingest node (refer to Chapter 12, Using the Ingest module, for more information on the ingestion pipeline).

    As with the coordinator node, using an ingest node is a way to provide functionality to Elasticsearch without compromising cluster safety.

    If you want to prevent a node from being used for ingestion, you need to disable it with node.ingest: false. It's a best practice to disable this in the master and data nodes to prevent ingestion error issues and to protect the cluster. The coordinator node is the best candidate to be an ingest one.

    If you are using NLP, attachment extraction (via the ingest attachment plugin), or logs ingestion, the best practice is to have a pool of coordinator nodes (no master, no data) with ingestion active.

    In previous versions of Elasticsearch, the attachment and NLP plugins ran in the standard data nodes or master nodes. This caused a lot of problems for Elasticsearch due to the following reasons:

    • High CPU usage for NLP algorithms that saturates all CPU on the data node, giving bad indexing and searching performances
    • Instability due to the bad format of attachment and/or Apache Tika bugs (the library used for managing document extraction)
    • NLP or ML algorithms require a lot of CPU or stress the Java garbage collector, decreasing the performance of the node

    The best practice is to have a pool of coordinator nodes with ingestion enabled to provide the best safety for the cluster and ingestion pipeline.
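As an illustration of such a node, the sketch below prints the role settings for a coordinator node that only handles ingestion. It assumes the Elasticsearch 7.0 setting names node.master and node.data next to the node.ingest setting used in this recipe; the function name `ingest_only_settings` is hypothetical.

```shell
# Print the role settings for a dedicated ingest node that holds no data
# and cannot be elected master (setting names assumed for Elasticsearch 7.0).
ingest_only_settings() {
  cat <<'EOF'
node.master: false
node.data: false
node.ingest: true
EOF
}

# Usage (the target path is an assumption; restart the node afterwards):
# ingest_only_settings >> config/elasticsearch.yml
```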

    There's more…

    Now that you know about the four kinds of Elasticsearch nodes, you can easily understand that a robust architecture designed to work with Elasticsearch should be similar to this one:

     

    Installing plugins in Elasticsearch

    One of the main features of Elasticsearch is the possibility to extend it with plugins. Plugins extend Elasticsearch features and functionalities in several ways.

    In Elasticsearch, these are native plugins: JAR files that contain application code and are used for the following purposes:

    • Script engines
    • Custom analyzers, tokenizers, and scoring
    • Custom mapping
    • REST entry points
    • Ingestion pipeline stages
    • Supporting new storages (Hadoop, GCP Cloud Storage)
    • Extending X-Pack (that is, with a custom authorization provider)

    Getting ready

    You need a working Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe, as well as a prompt/shell to execute commands in the Elasticsearch install directory.

    How to do it…

    Elasticsearch provides a script, called elasticsearch-plugin, in the bin/ directory for the automatic download and installation of plugins.

    The steps that are required to install a plugin are as follows:

    1. Call the install command of elasticsearch-plugin with the name of the plugin.

    To install the ingest attachment plugin, which is used to extract text from files, type the following command if you're using Linux:

    bin/elasticsearch-plugin install ingest-attachment

    And for Windows, type the following command:

    elasticsearch-plugin.bat install ingest-attachment
    2. If the plugin needs to change security permissions, a warning is prompted and you need to accept this if you want to continue.
    3. During the node's startup, check that the plugin is correctly loaded.

    In the following screenshot, you can see the installation and the startup of the Elasticsearch server, along with the installed plugin:

    Remember that a plugin installation requires an Elasticsearch server restart.

    How it works…

    The elasticsearch-plugin script (elasticsearch-plugin.bat on Windows) is a wrapper for the Elasticsearch plugin manager. It can be used to install a plugin or to remove one (using the remove command).

    There are several ways to install the plugin, for example:

    • Passing the URL of the plugin (ZIP archive), as follows:
    bin/elasticsearch-plugin install http://mywoderfulserve.com/plugins/awesome-plugin.zip
    • Passing the file path of the plugin (ZIP archive), as follows:
    bin/elasticsearch-plugin install file:///tmp/awesome-plugin.zip
    • Using the install parameter with the GitHub repository of the plugin. The install parameter, which must be given, is formatted in the following way:
    <username>/<repo>[/<version>]

    During the installation process, Elasticsearch plugin manager is able to do the following:

    • Download the plugin
    • Create a plugins directory in ES_HOME/plugins, if it's missing
    • Optionally, ask the user for confirmation if the plugin requires special permissions to be executed
    • Unzip the plugin content in the plugin directory
    • Remove temporary files

    The installation process is completely automatic; no further actions are required. You only need to check that the process ends with an Installed message to be sure that the installation has completed correctly.

    Restarting the server is always required to be sure that the plugin is correctly loaded by Elasticsearch.
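Because the plugin manager unpacks each plugin into its own directory under ES_HOME/plugins, a script can skip the installation when that directory already exists. The following is a minimal sketch under that assumption; the helper names and the default value of ES_HOME are hypothetical.

```shell
# Hypothetical helpers. ES_HOME is assumed to point at the Elasticsearch
# install directory; it defaults to the current directory here.
ES_HOME="${ES_HOME:-.}"

# Succeeds if a directory for the plugin exists under ES_HOME/plugins.
plugin_installed() {
  [ -d "$ES_HOME/plugins/$1" ]
}

# Installs the plugin only when it is missing, then reminds about the restart.
install_plugin_if_missing() {
  if plugin_installed "$1"; then
    echo "$1 is already installed"
  else
    "$ES_HOME/bin/elasticsearch-plugin" install "$1" &&
      echo "Installed $1: restart the node to load it"
  fi
}

# Usage:
# ES_HOME=/opt/elasticsearch
# install_plugin_if_missing ingest-attachment
```

A directory check is only a heuristic: a failed or interrupted installation could leave a partial directory behind, so the startup log remains the authoritative check.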

    There's more…

    If your current Elasticsearch application depends on one or more plugins, a node can be configured to start up only if these plugins are installed and available. To achieve this behavior, you can provide the plugin.mandatory directive in the elasticsearch.yml configuration file.

    For the previous example (ingest-attachment), the config line to be added is as follows:

    plugin.mandatory: ingest-attachment

    There are also some hints to remember while installing plugins. Updating a plugin on only some nodes can cause malfunctions, because different nodes then run different plugin versions. If you have a big cluster, it's safer to test updates in a separate environment first (and remember to upgrade the plugin on all the nodes).

    Upgrading the Elasticsearch server could also break your custom binary plugins because of internal API changes. To prevent this, in Elasticsearch 5.x or higher, a plugin must declare in its manifest the same version as the Elasticsearch server it is installed on.

    Upgrading an Elasticsearch server version means upgrading all the installed plugins.
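Before an upgrade, it can help to list the plugins that were built for a different server version. The sketch below assumes that the manifest of each installed plugin is a plugin-descriptor.properties file containing an `elasticsearch.version=` line; the function names are hypothetical.

```shell
# Assumption: each installed plugin directory holds a
# plugin-descriptor.properties file with an "elasticsearch.version=" line.
plugin_es_version() {
  sed -n 's/^elasticsearch\.version=//p' "$1/plugin-descriptor.properties"
}

# Print every plugin under <plugins_dir> whose declared version differs
# from <expected_version>; exit non-zero if any mismatch is found.
check_plugin_versions() (
  plugins_dir="$1"; expected="$2"; status=0
  for dir in "$plugins_dir"/*/; do
    [ -f "$dir/plugin-descriptor.properties" ] || continue
    found=$(plugin_es_version "$dir")
    if [ "$found" != "$expected" ]; then
      echo "MISMATCH $(basename "$dir"): built for $found, expected $expected"
      status=1
    fi
  done
  exit "$status"
)

# Usage (path and version are example values):
# check_plugin_versions /opt/elasticsearch/plugins 7.0.0
```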


    Removing a plugin

    You have installed some plugins, and now you need to remove one because it's no longer required. Removing an Elasticsearch plugin is easy if everything goes right; otherwise, you will need to remove it manually.

    This recipe covers both cases.

    Getting ready

    You need a working Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe, and a prompt or shell to execute commands in the Elasticsearch install directory. Before removing a plugin, it is safer to stop the Elasticsearch server to prevent errors due to the deletion of a plugin JAR.

    How to do it…

    The steps to remove a plugin are as follows:

    1. Stop your running node to prevent exceptions that are caused due to the removal of a file.
    2. Use the Elasticsearch plugin manager, which comes with its script wrapper (bin/elasticsearch-plugin).

    On Linux and macOS, type the following command:

    bin/elasticsearch-plugin remove ingest-attachment

    On Windows, type the following command:

    elasticsearch-plugin.bat remove ingest-attachment
    3. Restart the server.

    How it works…

    The plugin manager's remove command tries to detect the correct name of the plugin and remove the directory of the installed plugin.

    If there are undeletable files in your plugin directory (or strange astronomical events that hit your server), the plugin script might fail to remove a plugin. In that case, you need to remove it manually by following these steps:

    1. Go into the plugins directory
    2. Remove the directory with your plugin name
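The two manual steps can be wrapped in a small guard so that only a direct child of the plugins directory is ever deleted. This is a minimal sketch; `remove_plugin_dir` is a hypothetical helper, and the node should be stopped before it is used.

```shell
# Hypothetical helper: remove a plugin directory by hand.
# Stop the node first, as the recipe recommends.
remove_plugin_dir() (
  es_home="$1"; name="$2"
  # Refuse empty or path-like names so that only a direct child of
  # plugins/ can ever be deleted.
  case "$name" in
    ""|*/*|.|..) echo "invalid plugin name: '$name'" >&2; exit 1 ;;
  esac
  target="$es_home/plugins/$name"
  [ -d "$target" ] || { echo "no such plugin directory: $target" >&2; exit 1; }
  rm -rf "$target"
)

# Usage (the install path is an example value):
# remove_plugin_dir /opt/elasticsearch ingest-attachment
```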

     

    Changing logging settings

    Standard logging settings work very well for general usage.

    Changing the log level can be useful for checking for bugs or understanding malfunctions due to bad configuration or strange plugin behavior. A verbose log can also be used by the Elasticsearch community to help solve such problems.

    If you need to debug your Elasticsearch server or change how the logging works (for example, to send events to a remote destination), you need to change the log4j2.properties file.

    Getting ready

    You need a working Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe, and a simple text editor to change configuration files.

    How to do it…

    In the config directory in your Elasticsearch install directory, there is a log4j2.properties file that controls the logging settings.

    The steps that are required for changing the logging settings are as follows:

    1. To emit every kind of logging Elasticsearch could produce, find the current root logging level, which is as follows:
    rootLogger.level = info
    2. Change it to the following:
    rootLogger.level = debug
    3. Now, if you start Elasticsearch from the command line (with bin/elasticsearch), you should see a lot of information, like the following, which is not always useful (except to debug unexpected issues):

    How it works…

    The Elasticsearch logging system is based on the log4j library (http://logging.apache.org/log4j/).

    Log4j is a powerful library that's used to manage logging. Covering all of its functionalities is outside the scope of this book; if you need advanced usage, there are a lot of books and articles about it on the internet.
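Raising the root level makes every component verbose. A narrower alternative, not covered by the steps above, is to raise the level of a single logger hierarchy by adding a `logger.<id>.name` and `logger.<id>.level` pair to log4j2.properties. The sketch below only prints such a pair; the function name, the label, and the package in the usage line are example values.

```shell
# Print a per-logger override in log4j2.properties syntax.
# <id> is an arbitrary unique label; <package> is the logger hierarchy.
logger_override() {
  printf 'logger.%s.name = %s\n' "$1" "$2"
  printf 'logger.%s.level = %s\n' "$1" "$3"
}

# Usage (file path and package are example values; restart the node afterwards):
# logger_override discovery org.elasticsearch.discovery debug >> config/log4j2.properties
```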

     

    Setting up a node via Docker

    Docker ( https://www.docker.com/ ) has become a common way to deploy application servers for testing or production.

    Docker is a container system that makes it possible to easily deploy replicable installations of server applications. With Docker, you don't need to set up a host, configure it, download the Elasticsearch server, unzip it, or start the server—everything is done automatically by Docker.

    Getting ready

    How to do it…

    1. If you want to start a vanilla server, just execute the following command:
    docker pull docker.elastic.co/elasticsearch/elasticsearch:7.0.0
    2. An output similar to the following will be shown:
    7.0.0: Pulling from elasticsearch/elasticsearch
    256b176beaff: Already exists
    1af8ca1bb9f4: Pull complete
    f910411dc8e2: Pull complete
    0c0400545052: Pull complete
    6e4d2771ff41: Pull complete
    a14f19907b79: Pull complete
    ea299a414bdf: Pull complete
    a644b305c472: Pull complete
    Digest: sha256:3da16b2f3b1d4e151c44f1a54f4f29d8be64884a64504b24ebcbdb4e14c80aa1
    Status: Downloaded newer image for docker.elastic.co/elasticsearch/elasticsearch:7.0.0
    3. After downloading the Elasticsearch image, we can start a development instance that can be accessed from outside Docker:
    docker run -p 9200:9200 -p 9300:9300 -e "http.host=0.0.0.0" -e "transport.host=0.0.0.0" docker.elastic.co/elasticsearch/elasticsearch:7.0.0

    You'll see the output of the Elasticsearch server starting.

    4. In another window/Terminal, to check if the Elasticsearch server is running, execute the following command:
    docker ps

    The output will be similar to the following:

    CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
    b99b252732af docker.elastic.co/elasticsearch/elasticsearch:7.0.0 "/usr/local/bin/dock…" 2 minutes ago Up 2 minutes 0.0.0.0:9200->9200/tcp, 0.0.0.0:9300->9300/tcp gracious_bassi
    5. The default exported ports are 9200 and 9300.
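The container can take a while to start, so a script that talks to port 9200 may need to wait first. The following is a minimal sketch, assuming that curl is installed on the host and that the server answers plain HTTP without authentication; `wait_for_es` is a hypothetical helper.

```shell
# Poll <url> until it answers or <timeout> seconds have passed.
# Assumes curl is installed on the host.
wait_for_es() (
  url="${1:-http://localhost:9200}"; timeout="${2:-60}"; waited=0
  while [ "$waited" -lt "$timeout" ]; do
    # -s: silent, -f: treat HTTP error codes as failures.
    if curl -sf --max-time 2 "$url" >/dev/null 2>&1; then
      echo "Elasticsearch is answering at $url"
      exit 0
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "No answer from $url after ${timeout}s" >&2
  exit 1
)

# Usage:
# wait_for_es http://localhost:9200 120 && curl -s http://localhost:9200
```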

    How it works…

    The Docker container provides a CentOS-based Linux installation with Elasticsearch installed.

    Elasticsearch Docker installation is easily repeatable and does not require a lot of editing and configuration.

    The default installation can be tuned in several ways, for example:

    1. You can pass a parameter to Elasticsearch via the command line using the -e flag of docker run, as follows:
    docker run -d -e "node.name=NodeName" docker.elastic.co/elasticsearch/elasticsearch:7.0.0
    2. You can customize the default settings by providing a custom Elasticsearch configuration through a volume mount point at /usr/share/elasticsearch/config, as follows:
    docker run -d -v "$PWD/config":/usr/share/elasticsearch/config docker.elastic.co/elasticsearch/elasticsearch:7.0.0
    3. You can persist the data between Docker reboots by configuring a local data mount point to store index data. The path to be used as a mount point is /usr/share/elasticsearch/data, as follows:
    docker run -d -v "$PWD/esdata":/usr/share/elasticsearch/data docker.elastic.co/elasticsearch/elasticsearch:7.0.0
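The three options can be combined in a single command. The sketch below only prints that command so that it can be reviewed before it is run. The function name is hypothetical, and the discovery.type=single-node setting is not part of the recipe: it is added here as an assumption for a one-node test container.

```shell
# Print (do not run) a docker run command that combines the options above.
# discovery.type=single-node is an assumption added for one-node test setups.
IMAGE="docker.elastic.co/elasticsearch/elasticsearch:7.0.0"

es_docker_command() (
  name="$1"; config_dir="$2"; data_dir="$3"
  printf 'docker run -d -p 9200:9200 -p 9300:9300'
  printf ' -e "node.name=%s" -e "discovery.type=single-node"' "$name"
  printf ' -v "%s":/usr/share/elasticsearch/config' "$config_dir"
  printf ' -v "%s":/usr/share/elasticsearch/data' "$data_dir"
  printf ' %s\n' "$IMAGE"
)

# Usage (node name and directories are example values):
# es_docker_command node-1 "$PWD/config" "$PWD/esdata"
```

The mounted config directory replaces the one shipped in the image, so it should hold a complete configuration rather than a single file.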

    There's more…

    The official images are not the only Elasticsearch images available for Docker. There are also several customized images for custom purposes. Some of these are optimized for large cluster deployments or more complex Elasticsearch cluster topologies than the standard ones.

    Docker is very handy for testing several versions of Elasticsearch in a clean way, without installing too much stuff on the host machine.

    In the code repository directory ch01/docker/, there is a docker-compose.yaml file that provides a full environment that will set up the following elements:

    • elasticsearch, which will be available at http://localhost:9200
    • kibana, which will be available at http://localhost:5601
    • cerebro, which will be available at http://localhost:9000

    To install all the applications, you can simply execute docker-compose up -d. All the required binaries will be downloaded and installed in Docker, and they will then be ready to be used.


    Deploying on Elasticsearch Cloud Enterprise

    The Elasticsearch company provides Elasticsearch Cloud Enterprise (ECE), which is the same tool that's used in the Elasticsearch Cloud (https://www.elastic.co/cloud) and is offered for free. This solution, which is available as PaaS on AWS or GCP (Google Cloud Platform), can also be installed on-premises to provide an enterprise solution on top of Elasticsearch.

    If you need to manage multiple elastic deployments across teams or geographies, you can leverage ECE to centralize deployment management for the following functions:

    • Provisioning
    • Monitoring
    • Scaling
    • Replication
    • Upgrades
    • Backup and restoring

    Centralizing the management of deployments with ECE enforces uniform versioning, data governance, backup, and user policies. Increased hardware utilization through better management can also reduce the total cost.

    Getting ready

    As this solution targets large installations of many servers, the minimum testing requirement is an 8 GB RAM node. ECE runs on top of Docker, which must be installed on the nodes.

    ECE supports only some operating systems, such as the following:

    • Ubuntu 16.04 with Docker 18.03
    • Ubuntu 14.04 with Docker 1.11
    • RHEL/CentOS 7+ with Red Hat Docker 1.13

    On other configurations, the ECE could work, but it is not supported in case of issues.

    How to do it…

    Before installing ECE, the following prerequisites are to be checked:

    1. Your user must be able to run Docker. If you get an error because the user is not in the docker group, add it with sudo usermod -aG docker $USER.
    2. In the case of an error when you try to access /mnt/data, give your user permission to access this directory.
    3. You need to add the following line to your /etc/sysctl.conf (a reboot is required): vm.max_map_count = 262144.
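The three prerequisites can be checked before the installer is started. The following is a minimal sketch, assuming a Linux host where `id` and `sysctl` are available; all function names are hypothetical and nothing here is part of the ECE installer.

```shell
# Hypothetical pre-flight checks for the three prerequisites listed above.

# Succeeds if the given vm.max_map_count value meets the required minimum.
map_count_ok() {
  [ "$1" -ge 262144 ]
}

# Succeeds if the space-separated group list contains "docker".
has_docker_group() {
  case " $1 " in
    *" docker "*) return 0 ;;
    *) return 1 ;;
  esac
}

# Print one line per prerequisite that is not met on this host.
ece_preflight() {
  has_docker_group "$(id -nG)" || echo "user is not in the docker group"
  [ -w /mnt/data ] || echo "/mnt/data is not writable by this user"
  map_count_ok "$(sysctl -n vm.max_map_count 2>/dev/null || echo 0)" ||
    echo "vm.max_map_count is below 262144"
}

# Usage:
# ece_preflight
```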
    1. To be able to use ECE, it must first be installed on the first host, as follows:
    bash <(curl -fsSL https://download.elastic.co/cloud/elastic-cloud-enterprise.sh) install

    The installation process should manage these steps automatically, as shown in the following screenshot:

    At the end, the installer should print your credentials so that you can access your cluster, in an output similar to the following:

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    Elastic Cloud Enterprise installation completed successfully
    Ready to copy down some important information and keep it safe?
    Now you can access the Cloud UI using the following addresses:
    http://192.168.1.244:12400
    https://192.168.1.244:12443

    Admin username: admin
    Password: OCqHHqvF0JazwXPm48wfEHTKN0euEtn9YWyWe1gwbs8
    Read-only username: readonly
    Password: M27hoE3z3v6x5xyHnNleE5nboCDK43X9KoNJ346MEqO

    Roles tokens for adding hosts to this installation:
    Basic token (Don't forget to assign roles to new runners in the Cloud UI after installation.)
    eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJiZDI3NjZjZi1iNWExLTQ4YTYtYTRlZi1iYzE4NTlkYjQ5ZmEiLCJyb2xlcyI6W10sImlzcyI6ImN1cnJlbnQiLCJwZXJzaXN0ZW50Ijp0cnVlfQ.lbh9oYPiJjpy7gI3I-_yFBz9T0blwNbbwtWF_-c_D3M

    Allocator token (Simply need more capacity to run Elasticsearch clusters and Kibana? Use this token.)
    eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJjYTk4ZDgyNi1iMWYwLTRkZmYtODBjYS0wYWYwMTM3M2MyOWYiLCJyb2xlcyI6WyJhbGxvY2F0b3IiXSwiaXNzIjoiY3VycmVudCIsInBlcnNpc3RlbnQiOnRydWV9.v9uvTKO3zgaE4nr0SDfg6ePrpperIGtvcGVfZHtmZmY
    Emergency token (Lost all of your coordinators? This token will save your installation.)
    eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiI5N2ExMzg5Yi1jZWE4LTQ2MGItODM1ZC00MDMzZDllNjAyMmUiLCJyb2xlcyI6WyJjb29yZGluYXRvciIsInByb3h5IiwiZGlyZWN0b3IiXSwiaXNzIjoiY3VycmVudCIsInBlcnNpc3RlbnQiOnRydWV9._0IvJrBQ7RkqzFyeFGhSAQxyjCbpOO15qZqhzH2crZQ

    To add hosts to this Elastic Cloud Enterprise installation, include the following parameters when you install the software
    on additional hosts: --coordinator-host 192.168.1.244 --roles-token 'eyJ0eXAiOiJKV1QiLCJhbGciOiJIUzI1NiJ9.eyJzdWIiOiJiZDI3NjZjZi1iNWExLTQ4YTYtYTRlZi1iYzE4NTlkYjQ5ZmEiLCJyb2xlcyI6W10sImlzcyI6ImN1cnJlbnQiLCJwZXJzaXN0ZW50Ijp0cnVlfQ.lbh9oYPiJjpy7gI3I-_yFBz9T0blwNbbwtWF_-c_D3M'

    These instructions use the basic token, but you can substitute one of the other tokens provided. You can also generate your own tokens. For example:
    curl -H 'Content-Type: application/json' -u admin:OCqHHqvF0JazwXPm48wfEHTKN0euEtn9YWyWe1gwbs8 http://192.168.1.244:12300/api/v1/platform/configuration/security/enrollment-tokens -d '{ "persistent": true, "roles": [ "allocator"] }'

    To learn more about generating tokens, see Generate Role Tokens in the documentation.

    System secrets have been generated and stored in /mnt/data/elastic/bootstrap-state/bootstrap-secrets.json.
    Keep the information in the bootstrap-secrets.json file secure by removing the file and placing it into secure storage, for example.

    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
    2. In my case, I can access the installed interface at http://192.168.1.244:12400.

    After logging into the admin interface, you will see your actual cloud state, as follows:

    3. You can now click on Create Deployment to start your first Elasticsearch cluster, as follows:

      4. You need to define a name (for example, book-cluster). The standard options are fine. After clicking Create Deployment, ECE will start to build your cluster, as follows:

      5. After a few minutes, the cluster should be up and running, as follows:

      How it works…

      Elasticsearch Cloud Enterprise allows you to manage a large Elasticsearch cloud service in which instances are created via deployments. By default, the standard deployment will start an Elasticsearch node with 4 GB RAM, 32 GB disk, and a Kibana instance.

      You can define a lot of parameters for Elasticsearch during the deployment, such as the following:

      • The RAM used for instances from 1 GB to 64 GB. The storage is proportional to the memory, so you can go from 1 GB RAM and 128 GB storage to 64 GB RAM and 2 TB storage.
      • If the node requires ML.
      • Master configurations if you have more than six data nodes.
      • The plugins that are required to be installed.

      For Kibana, you can only configure the memory (from 1 GB to 8 GB) and pass extra parameters (usually used for custom maps).

      ECE does all the provisioning and, if you want a monitoring component and other X-Pack features, it's able to autoconfigure your cluster to manage all the required functionalities.

      Elasticsearch Cloud Enterprise is very useful if you need to manage several Elasticsearch/Kibana clusters, because it takes care of all the infrastructure problems.

      A benefit of using a deployed Elasticsearch cluster is that, during deployment, a proxy is installed. This is very handy for managing the debugging of Elasticsearch calls.

      About the Author

      • Alberto Paro

        Alberto Paro is an engineer, project manager, and software developer. He currently works as Big Data Practice Leader in NTTDATA in Italy on big data technologies, native cloud, and NoSQL solutions. He loves to study emerging solutions and applications mainly related to cloud and big data processing, NoSQL, NLP, and neural networks. In 2000, he graduated in computer science engineering from Politecnico di Milano. Then, he worked with many companies mainly using Scala/Java and Python on knowledge management solutions and advanced data mining products using the state-of-the-art big data software. A lot of his time is spent teaching how to effectively use big data solutions, NoSQL datastores, and related technologies.
