Preprocessing Documents in Ingest Pipelines

In the previous chapter, we learned all four aggregation families and practiced different types of aggregations with many examples, using Investors Exchange (IEX) and exchange-traded fund (ETF) historical data. We have now completed our study of two key features of Elasticsearch: search and aggregation. In this chapter, we'll switch to the data preparation and enrichment features. You will recall from the Elasticsearch Architecture section of Chapter 1, Overview of Elasticsearch 7, that there are four types of Elasticsearch nodes, and one of them is the ingest node. You can preprocess documents through the predefined pipeline processors before the actual indexing operation starts. All nodes have the ingest role enabled by default; you can disable this capability for a node in its configuration file.
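For reference, the ingest role is controlled per node in elasticsearch.yml. The following one-line setting, valid for Elasticsearch 7.x, disables it:

# elasticsearch.yml: exclude this node from ingest duties
node.ingest: false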

In this chapter, we will cover...

Ingest APIs

Basic ingest CRUD APIs allow you to manage the entire life cycle of an ingest pipeline: creation, update, retrieval, deletion, and execution. A pipeline is formed by a list of supported processors that are executed sequentially. Let's describe each CRUD API as follows:

  • Create/update the ingest pipeline: To define a pipeline, you need to specify a list of processors (which will be executed in order) and a description of what the pipeline does. The PUT request is used to create the pipeline with an identifier. If the pipeline was created previously, the request acts as an update and overwrites the original contents. Let's take an example of creating a pipeline with the range_ratio identifier by using a script processor. The range ratio computes the difference between the high and low prices, and then sets the ratio between...
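The chapter text truncates here, but a minimal sketch of such a request might look like the following. The high and low field names, and dividing the range by the low price, are assumptions made for illustration; the book's actual fields and formula may differ:

PUT _ingest/pipeline/range_ratio
{
  "description": "compute the ratio between the high-low price range and the low price",
  "processors": [{
    "script": {
      "lang": "painless",
      "source": "ctx.range_ratio = (ctx.high - ctx.low) / ctx.low"
    }
  }]
}

The same identifier drives the other CRUD operations: GET _ingest/pipeline/range_ratio retrieves the definition, and DELETE _ingest/pipeline/range_ratio deletes it.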

Accessing data in pipelines

Fields in the _source field can be accessed directly in a pipeline definition, either by using the field name alone or by adding the _source prefix to it. On the other hand, if you are referring to the value of a field, you can use the {{field_name}} template snippet to retrieve it. Let's take an example of using a set processor to add an ingest_timestamp field to the document, which records the timestamp at which the ingest processing occurs. This value is provided by the API, as we mentioned in the result of the _simulate pipeline example. The relevant part of the code is as follows:

"processors": [{
"set": {
"field": "ingest_timestamp",
"value": "{{_ingest.timestamp}}"
}
}
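To see the processor in action without indexing anything, you can run it through the _simulate endpoint. The following request is a minimal sketch; the sample document and its symbol field are illustrative, not taken from the book's listing:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "add an ingest timestamp to the document",
    "processors": [{
      "set": {
        "field": "ingest_timestamp",
        "value": "{{_ingest.timestamp}}"
      }
    }]
  },
  "docs": [
    {"_source": {"symbol": "ACWF"}}
  ]
}

The response echoes each document with the ingest_timestamp field added by the set processor.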

Dynamic mapping fields and field values are also supported in the same...

Processors

Nearly 30 processors are supported in the ingest pipeline. When a document is indexed, the processors execute in the order in which they are declared in the pipeline. A processor is defined by name and configured with its own parameters. Before we introduce each processor, let's examine some of the common parameters, as described in the following table:

Parameter Name | Description
field | The name of the field to be accessed by the processor. Most processors require this parameter.
target_field | The name of the destination field to be written by the processor. The default value depends on the individual processor. About half of the processors support this optional parameter.
ignore_missing | If the field referenced by the field parameter is missing, or its value is null in the indexing document, the execution fails. If this boolean parameter is set...
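To make these common parameters concrete, here is a hedged sketch of a convert processor that uses all three; the rating and rating_numeric field names are assumptions for illustration:

"processors": [{
  "convert": {
    "field": "rating",
    "target_field": "rating_numeric",
    "type": "integer",
    "ignore_missing": true
  }
}]

With ignore_missing set to true, documents without a rating field pass through untouched instead of failing the pipeline.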

Conditional execution in pipelines

As we mentioned in the Processors section, the optional if parameter is designed to let users define conditions for executing a pipeline processor. Let's demonstrate with a simple example. The rating field of the documents in the cf_etf index is a single space string when no rating is given for the ETF in the original source. We can use the remove processor to drop the rating field in such a condition before the indexing operation, as shown in the following code block:

"pipeline": {
"description":"remove the rating field if the rating is equal to a single space string",
"processors":[{
"remove": {
"field": "rating",
"if": "ctx.rating == ' '"
}
}]
}
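You can verify the condition with the _simulate endpoint before indexing. In the following sketch, the two sample documents are illustrative: the first, whose rating is a single space, comes back without the field, while the second passes through unchanged:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "remove the rating field if the rating is equal to a single space string",
    "processors": [{
      "remove": {
        "field": "rating",
        "if": "ctx.rating == ' '"
      }
    }]
  },
  "docs": [
    {"_source": {"symbol": "ACWF", "rating": " "}},
    {"_source": {"symbol": "AGGY", "rating": "A"}}
  ]
}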

You will recall the dividend information...

Handling failures in pipelines

As discussed in the Ingest APIs section of this chapter, a pipeline is formed by a list of supported processors that are executed sequentially. If an exception occurs, the whole process is halted. Let's trigger an exception with an example. The processor in the pipeline removes the rating field from the indexing document. However, the rating field is optional and may not be present. When an error occurs, you can check the root cause in the error field. When the rating field is missing, the remove processor reports that the reason is field [rating] not present as part of path [rating].

If the error can be ignored, you can set the optional ignore_failure parameter to true to silently ignore the failure and continue with the next processor. Another choice is to use the on_failure parameter to catch the exception...
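A minimal sketch of the on_failure approach follows. Nesting a set processor that records the built-in _ingest.on_failure_message metadata is a common pattern; the error_message field name is an assumption for illustration:

"processors": [{
  "remove": {
    "field": "rating",
    "on_failure": [{
      "set": {
        "field": "error_message",
        "value": "{{_ingest.on_failure_message}}"
      }
    }]
  }
}]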

Summary

Time flies so fast! We are in the middle of this book. In this chapter, we have worked with the ingest APIs and practiced most of the pipeline processors. We have also learned how to access the data of documents passing through the pipeline processors. Finally, we have discussed how to handle exceptions when errors occur during pipeline processing.

In the next chapter, we will discuss how to use the aggregation framework for exploratory data analysis. We'll give you a few examples, such as collecting metrics and log data generated by the system for operational data analytics, ingesting financial investment fund data before performing analytic operations, and performing simple sentiment analysis using Elasticsearch.
