You're reading from Elasticsearch 5.x Cookbook - Third Edition
Elasticsearch 5.x introduces a set of powerful functionalities, via the ingest node, that target the problems that arise during document ingestion.
An Elasticsearch node can act as a master, data, or ingest node.
The idea behind splitting the ingest component from the others is to create a more stable cluster, because problems can arise while pre-processing documents.
To keep the cluster stable, the ingest nodes should be isolated from the master and data nodes, in case problems occur, such as a crash caused by an attachment plugin or high load due to complex type manipulation.
The job of ingest nodes is to pre-process documents before sending them to the data nodes. This processing is described by a pipeline definition, and every single step of the pipeline is a processor definition.
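The relationship between a pipeline and its processors can be sketched with a toy Python emulation (purely illustrative, not how Elasticsearch implements it): a pipeline is an ordered list of processors, and each processor transforms the document in turn before it is indexed.

```python
# Toy emulation of an ingest pipeline: each processor is a function
# that takes a document (a dict) and returns the modified document.

def set_processor(field, value):
    """Mimics the 'set' processor: adds or overwrites a field."""
    def apply(doc):
        doc[field] = value
        return doc
    return apply

def run_pipeline(processors, doc):
    """Apply every processor in order, like an ingest pipeline does."""
    for processor in processors:
        doc = processor(doc)
    return doc

pipeline = [set_processor("user", "john")]
print(run_pipeline(pipeline, {"message": "hello"}))
# {'message': 'hello', 'user': 'john'}
```

A real ingest node does the same kind of chained transformation, but with the processors declared in JSON rather than as code.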
You need an up-and-running Elasticsearch installation as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To define an ingestion pipeline, you need to provide a description and some processors, as follows:
We will define a pipeline that adds a field user with the value john:

    {
      "description" : "Add user john field",
      "processors" : [
        {
          "set" : {
            "field": "user",
            "value": "john"
          }
        }
      ]
    }
The power of the pipeline definition is its ability to be created and updated without a node restart (in contrast to Logstash). The definition is stored in the cluster state via the put pipeline API.
After having defined a pipeline, we need to store it in the Elasticsearch cluster.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To store or update an ingestion pipeline in Elasticsearch, we will perform the following steps:
We can store the ingest pipeline via a PUT call:

    curl -XPUT 'http://127.0.0.1:9200/_ingest/pipeline/add-user-john' -d '{
      "description" : "Add user john field",
      "processors" : [
        {
          "set" : {
            "field": "user",
            "value": "john"
          }
        }
      ]
    }'
After having stored your pipeline, it is common to retrieve its content to check its definition. This can be done via the get pipeline API.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To retrieve an ingestion pipeline in Elasticsearch, we will perform the following steps:
We can retrieve the ingest pipeline via a GET call:

    curl -XGET 'http://127.0.0.1:9200/_ingest/pipeline/add-user-john'
The result returned by Elasticsearch, if everything is okay, should be as follows:
    {
      "add-user-john" : {
        "description" : "Add user john field",
        "processors" : [
          {
            "set" : {
              "field" : "user",
              "value" : "john"
            }
          }
        ]
      }
    }
To clean up our Elasticsearch cluster of obsolete or unwanted pipelines, we need to call the delete pipeline API with the ID of the pipeline.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To delete an ingestion pipeline in Elasticsearch, we will perform the following steps:
We can delete the ingest pipeline via a DELETE call:

    curl -XDELETE 'http://127.0.0.1:9200/_ingest/pipeline/add-user-john'
The result returned by Elasticsearch, if everything is okay, should be:
{"acknowledged":true}
The ingest part of every architecture is very sensitive, so the Elasticsearch team provides the ability to simulate your pipelines without having to store them in Elasticsearch.
The simulate pipeline API allows a user to test, improve, and check the functionality of a pipeline without deploying it to the Elasticsearch cluster.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To simulate an ingestion pipeline in Elasticsearch, we will perform the following steps:
We need to execute a call, passing both the pipeline and a sample subset of documents to test the pipeline against:
curl -XPOST 'http://127.0.0.1:9200/_ingest/pipeline/_simulate' -d '{ "pipeline": { "description": "Add user john field...
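As a sketch, a complete simulate request body has the shape below: the pipeline to test plus a `docs` array of sample documents. The sample document here (a `_source` with a `name` field) is a hypothetical placeholder for your own data:

```json
{
  "pipeline": {
    "description": "Add user john field",
    "processors": [
      { "set": { "field": "user", "value": "john" } }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "type",
      "_id": "1",
      "_source": { "name": "docs1" }
    }
  ]
}
```

The response contains one entry per sample document, showing the document as it would look after the pipeline has been applied.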
Elasticsearch provides a large set of ingest processors by default. Their number and functionality can change between minor versions, as they are extended to cover new scenarios.
In this recipe, we will look at the most commonly used ones.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To use several processors in an ingestion pipeline in Elasticsearch, we will perform the following steps:
We execute a simulate pipeline API call using several processors with a sample subset of a document to test the pipeline against:
    curl -XPOST 'http://127.0.0.1:9200/_ingest/pipeline/_simulate?pretty' -d '{ "pipeline": { "description": "Testing some build-processors", "processors": [ { ...
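As a sketch of what such a request body might contain (the field names are hypothetical), a pipeline chaining a set, a rename, and a remove processor could look like this:

```json
{
  "pipeline": {
    "description": "Testing some build-processors",
    "processors": [
      { "set": { "field": "user", "value": "john" } },
      { "rename": { "field": "user", "target_field": "user_name" } },
      { "remove": { "field": "unwanted" } }
    ]
  },
  "docs": [
    { "_source": { "unwanted": true, "message": "hello" } }
  ]
}
```

Each processor receives the document as left by the previous one, so in this example the `user` field added by set is immediately renamed to `user_name`.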
Elasticsearch provides a large number of built-in processors, and the number increases with every release. In the preceding examples, we have seen the set and replace ones. In this recipe, we will cover one of the most used processors for log analysis: the grok processor, which is well known to Logstash users.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To test a grok pattern against some log lines, we will perform the following steps:
We will execute a call passing both the pipeline with our grok processor and a sample subset of a document to test the pipeline against:
    curl -XPOST 'http://127.0.0.1:9200/_ingest/pipeline/_simulate?pretty' -d '{ "pipeline": { "description": "Testing grok pattern", "processors": [...
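Under the hood, a grok pattern such as `%{IP:client}` expands into a named regular expression. As a rough illustration only (this is not the actual grok library, and the simplified IPv4-only regex is an assumption), the idea can be emulated in Python:

```python
import re

# Toy stand-in for the grok IP pattern: grok expands %{IP:client}
# into a named capture group, roughly like this (IPv4 only, simplified).
GROK_IP = r"(?P<client>\d{1,3}(?:\.\d{1,3}){3})"

def extract(pattern: str, line: str) -> dict:
    """Apply a grok-like named regex and return the captured fields."""
    match = re.search(pattern, line)
    return match.groupdict() if match else {}

fields = extract(GROK_IP, "55.3.244.1 GET /index.html 15824 0.043")
print(fields)  # {'client': '55.3.244.1'}
```

The grok processor does the same kind of extraction, turning each named group into a field on the ingested document.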
It was easy to make a cluster unresponsive in Elasticsearch prior to 5.x using the attachment mapper. Metadata extraction from a document is a very CPU-intensive operation, and if you are ingesting a lot of documents, your cluster will be under heavy load.
To prevent this scenario, Elasticsearch introduced the ingest node. An ingest node can be put under very high pressure without causing problems for the rest of the Elasticsearch cluster.
The attachment processor allows us to use the document extraction capabilities of Tika in an ingest node.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
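In Elasticsearch 5.x, the attachment processor ships in the ingest-attachment plugin, installed with `bin/elasticsearch-plugin install ingest-attachment`. As a minimal sketch (the `data` field name is an assumption; it must contain the base64-encoded document), a pipeline using it might look like this:

```json
{
  "description": "Extract attachment information",
  "processors": [
    {
      "attachment": {
        "field": "data"
      }
    }
  ]
}
```

Tika then extracts content and metadata (content type, language, and so on) from the encoded document into an attachment field.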
Another interesting processor is the GeoIP one, which allows us to map an IP address to a GeoPoint and other location data.
You need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe in Chapter 2, Downloading and Setup.
To execute curl via the command line, you need to install curl for your operating system.
To be able to use the ingest GeoIP processor, perform the following steps:
You need to install it as a plugin via:
bin/elasticsearch-plugin install ingest-geoip
The output will be something like the following:

    -> Downloading ingest-geoip from elastic
    [=================================================] 100%
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    @     WARNING: plugin requires additional permissions     @
    @@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@
    * java.lang.RuntimePermission...
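Once the plugin is installed and the node restarted, a minimal pipeline using the geoip processor might look like the following sketch (the source field name `ip` is an assumption; by default the extracted location data is stored in a `geoip` field):

```json
{
  "description": "Add geoip info",
  "processors": [
    {
      "geoip": {
        "field": "ip"
      }
    }
  ]
}
```

The processor looks up the IP address in the bundled GeoIP database and enriches the document with fields such as country, city, and coordinates.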