
Elasticsearch 8.x Cookbook - Fifth Edition

By Alberto Paro
About this book
Elasticsearch is a Lucene-based distributed search engine at the heart of the Elastic Stack that allows you to index and search unstructured content at petabyte scale. With this updated fifth edition, you'll cover comprehensive recipes on what's new in Elasticsearch 8.x and see how to create and run complex queries and analytics. The recipes will guide you through index mapping, aggregations, queries, and scripting with Elasticsearch. You'll focus on numerous solutions and quick techniques for performing both common and uncommon tasks, such as deploying Elasticsearch nodes, using the ingest module, working with X-Pack, and creating different visualizations. As you advance, you'll learn how to manage various clusters, restore data, and install Kibana to monitor a cluster and extend it using a variety of plugins. Furthermore, you'll understand how to integrate your Java, Scala, Python, and big data applications, such as Apache Spark and Pig, with Elasticsearch, and create efficient data applications powered by enhanced functionalities and custom plugins. By the end of this Elasticsearch cookbook, you'll have gained in-depth knowledge of implementing the Elasticsearch architecture and be able to manage, search, and store data efficiently and effectively using Elasticsearch.
Publication date: May 2022
Publisher: Packt
Pages: 750
ISBN: 9781801079815

 

Chapter 2: Managing Mappings

Mapping is a primary concept in Elasticsearch: it defines how the search engine should process a document and its fields so that they can be used effectively in searches and aggregations.

Search engines perform the following two main operations:

  • Indexing: This action is used to receive a document, process it, and store it in an index.
  • Searching: This action is used to retrieve the data from the index based on a query.

These two operations are strictly connected; an error in the indexing step leads to unwanted or missing search results.

Elasticsearch, by default, uses dynamic mapping at the index level. When indexing, if a mapping is not provided, a default one is created by guessing the structure from the JSON fields that the document is composed of. This new mapping is then automatically propagated to all the cluster nodes: it becomes part of the cluster state.

The default mapping has sensible values, but when you want to change its behavior or customize several other aspects of indexing (mapping objects to special fields, storing, ignoring, completion, and so on), you need to provide your own mapping definition.

In this chapter, we'll look at all the possible mapping field types that document mappings are composed of.

In this chapter, we will cover the following recipes:

  • Using explicit mapping creation
  • Mapping base types
  • Mapping arrays
  • Mapping an object
  • Mapping a document
  • Using dynamic templates in document mapping
  • Managing nested objects
  • Managing a child document with a join field
  • Adding a field with multiple mappings
  • Mapping a GeoPoint field
  • Mapping a GeoShape field
  • Mapping an IP field
  • Mapping an Alias field
  • Mapping a Percolator field
  • Mapping the Rank Feature and Feature Vector fields
  • Mapping the Search as you type field
  • Using the Range Field type
  • Using the Flattened field type
  • Using the Point and Shape field types
  • Using the Dense Vector field type
  • Using the Histogram field type
  • Adding metadata to a mapping
  • Specifying different analyzers
  • Using index components and templates
 

Technical requirements

To follow and test the commands shown in this chapter, you must have a working Elasticsearch cluster installed on your system, as described in Chapter 1, Getting Started.

To simplify how you manage and execute these commands, I suggest that you install Kibana so that you have a more advanced environment to execute Elasticsearch queries.

 

Using explicit mapping creation

If we consider the index as a database in the SQL world, mapping is similar to the create table definition.

Elasticsearch can understand the structure of the document that you are indexing (via reflection) and create the mapping definition automatically. This is called explicit mapping creation (in the official Elasticsearch documentation, this automatic inference is referred to as dynamic mapping).

Getting ready

To execute the code in this recipe, you will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute these commands, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar platforms. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To understand the examples and code in this recipe, basic knowledge of JSON is required.

How to do it…

You can explicitly create a mapping by adding a new document to Elasticsearch. For this, perform the following steps:

  1. Create an index, as shown in the following code:
    PUT test

The output will be as follows:

{ "acknowledged" : true, "shards_acknowledged" : true,
 "index" : "test" }
  2. Put a document in the index, as shown in the following code:
    PUT test/_doc/1
    {"name":"Paul", "age":35}

The output will be as follows:

{
  "_index" : "test", "_id" : "1", "_version" : 1,
  "result" : "created",
  "_shards" : {"total" : 2, "successful" : 1, "failed" : 0 },
  "_seq_no" : 0,  "_primary_term" : 1
}
  3. Get the mapping with the following code:
    GET test/_mapping
  4. The mapping that's auto-created by Elasticsearch should look as follows:
    {
      "test" : {
        "mappings" : {
          "properties" : {
            "age" : { "type" : "long" },
            "name" : {
              "type" : "text",
              "fields" : {
                "keyword" : {"type" : "keyword", "ignore_above" : 256 }
    } } } } } }
  5. To delete the index, you can use the following command:
    DELETE test

The output will be as follows:

{ "acknowledged" : true }

How it works…

The first command line (Step 1) creates an index where we can configure the mappings in the future, if required, and store documents in it.

The second command (Step 2) inserts a document in the index (we'll learn how to create the index in the Creating an index recipe of Chapter 3, Basic Operations, and record indexing in the Indexing a document recipe of Chapter 3, Basic Operations).

Elasticsearch reads the fields of the incoming document, compares them with the current mapping, and processes them as follows (see the example after this list):

  • If the field is already present in the mapping and the value of the field is valid (it matches the correct type), Elasticsearch does not need to change the current mappings.
  • If the field is already present in the mapping but the value of the field is of a different type, it tries to upgrade the field type (for example, from integer to long). If the types are not compatible, it throws an exception, and the indexing process fails.
  • If the field is not present, it tries to auto-detect the type of field. It updates the mappings with a new field mapping. (In the case of a null value, it skips the mapping update until it encounters a concrete type.)
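
As a minimal sketch of the last case, assuming the test index and the document from the previous steps are still in place, indexing a document that carries a new email field extends the mapping automatically:

PUT test/_doc/2
{"name":"Mary", "age":40, "email":"mary@example.com"}

GET test/_mapping

The returned mapping now also contains an email property, inferred as text with a keyword subfield, just like name.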

There's more…

In Elasticsearch, every document has a unique identifier within an index, which is stored in the special _id field of the document.

The _id field can be provided at index time or can be assigned automatically by Elasticsearch if it is missing.
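
As a small sketch of automatic ID assignment (assuming the test index from the previous steps), using POST without specifying an ID lets Elasticsearch generate one:

POST test/_doc
{"name":"John", "age":27}

The response contains the auto-generated value in the _id field.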

When a mapping type is created or changed, Elasticsearch automatically propagates mapping changes to all the nodes in the cluster so that all the shards are aligned to process that particular type.

In Elasticsearch 7.x, there was still a single default mapping type (_doc); the concept of mapping types was removed completely in Elasticsearch 8.x.

See also

Please refer to the following recipes in Chapter 3, Basic Operations:

  • The Creating an index recipe, which is about putting new mappings in an index while it's being created
  • The Putting a mapping in an index recipe, which is about extending a mapping in an index
 

Mapping base types

Using explicit mapping creation (that is, letting Elasticsearch infer the mapping) makes it possible to quickly start ingesting data with a schemaless approach, without being concerned about field types. However, to achieve better results and better indexing performance, you need to define a mapping manually.

Fine-tuning mapping brings some advantages, such as the following:

  • Reducing the index size on disk (by disabling functionalities you don't need for particular fields)
  • Indexing only the interesting fields (for a general speed-up)
  • Precooking data for fast search or real-time analytics (such as aggregations)
  • Correctly defining whether a field must be analyzed in multiple tokens or considered as a single token
  • Defining mapping types such as geo point, suggester, vectors, and so on

Elasticsearch allows you to use base fields with a wide range of configurations.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

To execute this recipe's examples, you will need to create an index named test, where you can put mappings, as explained in the Using explicit mapping creation recipe.

How to do it...

Let's use a semi-real-world example of an order for our eBay-like shop:

  1. First, we must define an order:
Figure 2.1 – Example of an order

  2. Our order record must be converted into an Elasticsearch mapping definition, as follows:
    PUT test/_mapping
    {  "properties" : {
          "id" : {"type" : "keyword"},
          "date" : {"type" : "date"},
          "customer_id" : {"type" : "keyword"},
          "sent" : {"type" : "boolean"},
          "name" : {"type" : "keyword"},
          "quantity" : {"type" : "integer"},
          "price" : {"type" : "double"},
          "vat" : {"type" : "double", "index": false}
    } }

Now, the mapping is ready to be put in the index. We will learn how to do this in the Putting a mapping in an index recipe of Chapter 3, Basic Operations.

How it works...

Each field must be mapped to one of the Elasticsearch base types, and options for how the field must be indexed need to be provided.

The following table is a reference for the mapping types:

Figure 2.2 – Base type mapping

Depending on the data type, it's possible to give explicit directives to Elasticsearch when you're processing the field for better management. The most used options are as follows (a combined example is shown after this list):

  • store (default false): This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space but reduces computation if you need to extract it from a document (that is, in scripting and aggregations). The possible values for this option are true and false. Stored values are always returned as an array of values for consistency.

Stored fields are faster to retrieve than values extracted from the document source.

  • index: This defines whether or not the field should be indexed. The possible values for this parameter are true and false. Fields that are not indexed are not searchable (the default is true).
  • null_value: This defines a default value if the field is null.
  • boost: This is used to change the importance of a field (the default is 1.0).

boost works on a term level only, so it's mainly used in term, terms, and match queries.

  • search_analyzer: This defines an analyzer to be used during the search. If it's not defined, the analyzer of the parent object is used (the default is null).
  • analyzer: This sets the default analyzer to be used (the default is null).
  • norms: This controls the Lucene norms, which are used to improve query scoring. If the field is only used for filtering, it's a best practice to disable norms to reduce resource usage (the default is true for analyzed fields and false for not_analyzed ones).
  • copy_to: This allows you to copy the content of a field to another one to achieve functionalities, similar to the _all field.
  • ignore_above: This allows you to skip indexing the string if it's longer than this value. This is useful for processing fields for exact filtering, aggregations, and sorting. It also prevents a single term token from becoming too big and prevents errors due to the Lucene term's byte-length limit of 32,766. The maximum suggested value is 8191 (https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html).
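
The following sketch combines several of these options in a single mapping; the test-options index and the field names are purely illustrative and not part of the order example:

PUT test-options
{ "mappings": {
    "properties": {
      "code": { "type": "keyword", "store": true, "null_value": "NA",
                "ignore_above": 256, "copy_to": "code_all" },
      "code_all": { "type": "text", "norms": false },
      "internal_note": { "type": "text", "index": false }
    } } }

Here, code is stored and copied into the catch-all code_all field, explicit null values for code are indexed as NA, and internal_note is kept in the source but is not searchable.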

There's more...

From Elasticsearch version 6.x onward, as shown in the Using explicit mapping creation recipe, the automatically inferred mapping for a string is a multifield mapping:

  • The default processing is text. This mapping allows textual queries (that is, term, match, and span queries). In the example provided in the Using explicit mapping creation recipe, this was name.
  • The keyword subfield is used for keyword mapping. This field can be used for exact term matching, aggregations, and sorting. In the example provided in the Using explicit mapping creation recipe, the referred field was name.keyword.

Another important parameter, available only for text mapping, is term_vector (the vector of terms that compose a string). Please refer to the Lucene documentation for further details at https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/index/Terms.html.

term_vector can accept the following values:

  • no: This is the default value; the term vector is skipped.
  • yes: This stores the term vector.
  • with_offsets: This stores the term vector with token offsets (the start and end positions in a block of characters).
  • with_positions: This stores the positions of the tokens in the term vector.
  • with_positions_offsets: This stores the term vector with both positions and offsets.
  • with_positions_payloads: This stores the positions and payloads of the tokens in the term vector.
  • with_positions_offsets_payloads: This stores all the term vector data (positions, offsets, and payloads).

Term vectors allow fast highlighting but consume disk space due to storing additional text information. It's a best practice to only activate it in fields that require highlighting, such as title or document content.
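
As a minimal sketch (the test-highlight index and the content field are illustrative), enabling term vectors for fast highlighting on a text field looks as follows:

PUT test-highlight
{ "mappings": {
    "properties": {
      "content": { "type": "text", "term_vector": "with_positions_offsets" }
    } } }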

See also

You can refer to the following sources for further details on the concepts of this chapter:

  • The online documentation on Elasticsearch provides a full description of all the properties for the different mapping fields at https://www.elastic.co/guide/en/elasticsearch/reference/master/mapping-params.html.
  • The Specifying different analyzers recipe at the end of this chapter shows alternative analyzers to the standard one.
  • For newcomers who want to explore the concepts of tokenization, I would suggest reading the official Elasticsearch documentation at https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-tokenizers.html.
 

Mapping arrays

Array or multi-value fields are very common in data models (such as multiple phone numbers, addresses, names, aliases, and so on), but they're not natively supported in traditional SQL solutions.

In SQL, multi-value fields require you to create accessory tables that must be joined to gather all the values, leading to poor performance when the cardinality of the records is huge.

Elasticsearch, which works natively in JSON, provides support for multi-value fields transparently.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

To use an Array type in our mapping, perform the following steps:

  1. Every field is automatically managed as an array. For example, to store tags for a document, the mapping would be as follows:
    {  "properties" : {
          "name" : {"type" : "keyword"},
          "tag" : {"type" : "keyword", "store" : true},
          ...
    } }
  2. This mapping is valid for indexing both documents. The following is the code for document1:
    {"name": "document1", "tag": "awesome"}
  3. The following is the code for document2:
    {"name": "document2", "tag": ["cool", "awesome", "amazing"] }

How it works…

Elasticsearch transparently manages the array: there is no difference whether you declare a single value or multiple values, due to its Lucene core nature.

Multi-values for fields are managed in Lucene, so you can add several of them to a document with the same field name. For people with a SQL background, this behavior may seem quite strange, but it is a key point in the NoSQL world, as it removes the need for join queries and accessory tables to manage multi-values. An array of embedded objects has the same behavior as simple fields.
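
As a quick sketch (assuming the preceding mapping and the two documents were indexed into an index called test), a term query on the tag field matches both documents, because single values and arrays are indexed in the same way:

GET test/_search
{ "query": { "term": { "tag": "awesome" } } }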

 

Mapping an object

The object type is one of the most common field aggregation structures in documental databases.

An object is a base structure (analogous to a record in SQL): in JSON, it is defined as a set of key/value pairs enclosed in curly brackets ({}).

Elasticsearch extends the traditional use of objects (which are flat in a DBMS), allowing recursively embedded objects.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. Again, I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We can rewrite the mapping code from the previous recipe using an array of items:

PUT test/_mapping
{ "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword", "store" : true},
      "sent" : {"type" : "boolean"},
      "item" : {
        "type" : "object",
        "properties" : {
          "name" : {"type" : "text"},
          "quantity" : {"type" : "integer"},
          "price" : {"type" : "double"},
          "vat" : {"type" : "double"}
} } } }

How it works…

Elasticsearch speaks native JSON, so every complex JSON structure can be mapped in it.

When Elasticsearch parses an object type, it tries to extract the fields and process them according to the defined mapping; if a field is not mapped, it learns the structure of the object using reflection.

The most important attributes of an object are as follows:

  • properties: This is a collection of fields or objects (we can consider them as columns in the SQL world).
  • enabled: This establishes whether or not the object should be processed. If it's set to false, the data contained in the object is not indexed and it cannot be searched (the default is true).
  • dynamic: This allows Elasticsearch to add new field names to the object using reflection on the values of the inserted data. If it's set to false, new fields are silently ignored (they are kept in the source but not indexed). If it's set to strict, an error is raised when a new field is present in the object, and the indexing process fails. The dynamic parameter gives you control over changes to the document's structure (the default is true).

The most used attribute is properties, which allows you to map the fields of the object in Elasticsearch fields.

Disabling the indexing part of the document reduces the index size; however, the data cannot be searched. In other words, you end up with a smaller file on disk, but there is a cost in terms of functionality.
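
A minimal sketch of these two attributes follows; the test-objects index and field names are illustrative:

PUT test-objects
{ "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": { "type": "keyword" },
      "metadata": { "type": "object", "enabled": false }
    } } }

With this mapping, a document containing an unmapped top-level field is rejected, while anything inside metadata is kept in the source but is neither indexed nor searchable.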

See also

Some special objects are described in the following recipes:

  • The Mapping a document recipe
  • The Managing a child document with a join field recipe
  • The Managing nested objects recipe
 

Mapping a document

The document mapping is also referred to as the root object. This has special parameters that control its behavior, and they are mainly used internally to do special processing, such as routing or time-to-live of documents.

In this recipe, we'll look at these special fields and learn how to use them.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We can extend the preceding order example by adding some of the special fields, like so:

PUT test/_mapping
{ "_source": { "store": true },
    "_routing": { "required": true },
    "_index": { "enabled": true },
    "properties": {} }

How it works…

Every special field has parameters and value options, such as the following:

  • _id: This allows you to index only the ID part of the document. All the ID queries will speed up using the ID value (by default, this is not indexed and not stored).
  • _index: This controls whether or not the index must be stored as part of the document. It can be enabled by setting the "enabled": true parameter (enabled=false is the default).
  • _source: This controls how the document's source is stored. Storing the source is very useful, but it adds storage overhead, so if you don't need it, it's better to turn it off (enabled=true is the default).
  • _routing: This defines the shard that will store the document. It supports additional parameters, such as required (true/false). This is used to force the presence of the routing value, raising an exception if it's not provided.

Controlling how to index and process a document is very important and allows you to resolve issues related to complex data types.

Every special field has parameters to set particular configurations, and some of their behaviors could change in different releases of Elasticsearch.
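
As a small sketch of the required routing (assuming the test index has the mapping above; the routing value used here is arbitrary), every indexing request must now carry a routing parameter:

PUT test/_doc/1?routing=customer_100
{ "id": "1", "customer_id": "100" }

Omitting the routing query parameter would cause the request to be rejected, because routing is marked as required.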

See also

Please refer to the Using dynamic templates in document mapping recipe in this chapter and the Putting a mapping in an index recipe of Chapter 3, Basic Operations, to learn more.

 

Using dynamic templates in document mapping

In the Using explicit mapping creation recipe, we saw how Elasticsearch can guess the field type using reflection. In this recipe, we'll see how we can help it improve its guessing capabilities via dynamic templates.

The dynamic template feature is very useful. For example, it may be useful in situations where you need to create several indices with similar types, because it allows you to move the mapping definition from coded initialization routines to automatic creation at document-indexing time. Typical usage is to define types for Logstash log indices.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We can extend the previous mapping by adding document-related settings, as follows:

PUT test/_mapping
{
    "dynamic_date_formats":["yyyy-MM-dd", "dd-MM-yyyy"],\
    "date_detection": true,
    "numeric_detection": true,
    "dynamic_templates":[
      {"template1":{
        "match":"*",
        "match_mapping_type": "long",
        "mapping": {"type":" {dynamic_type}", "store": true}
      }}    ],
    "properties" : {...}
}

How it works…

The root object (document) controls the behavior of its fields and all its child object fields. In document mapping, we can define the following:

  • date_detection: This allows you to extract a date from a string (true is the default).
  • dynamic_date_formats: This is a list of valid date formats. This is used if date_detection is active.
  • numeric_detection: This enables you to convert strings into numbers, if possible (false is the default).
  • dynamic_templates: This is a list of templates that are used to change the automatic mapping inference. If one of these templates is matched, the rules that have been defined in it are used to build the final mapping.

A dynamic template is composed of two parts: the matcher and the mapping.

To match a field to activate the template, you can use several types of matchers, such as the following:

  • match: This allows you to define a match on the field name. The expression is a standard GLOB pattern (http://en.wikipedia.org/wiki/Glob_(programming)).
  • unmatch: This allows you to define the expression to be used to exclude matches (optional).
  • match_mapping_type: This controls the types of the matched fields; for example, string, integer, and so on (optional).
  • path_match: This allows you to match the dynamic template against the full dot notation of the field; for example, obj1.*.value (optional).
  • path_unmatch: This will do the opposite of path_match, excluding the matched fields (optional).
  • match_pattern: This allows you to switch the matchers to regex (regular expression); otherwise, the glob pattern match is used (optional).

The dynamic template mapping part is a standard one but can use special placeholders, such as the following:

  • {name}: This will be replaced with the actual dynamic field name.
  • {dynamic_type}: This will be replaced with the type of the matched field.

The order of the dynamic templates is very important; only the first one that is matched is executed. It is good practice to put the templates with stricter rules first, followed by the more general ones, as shown in the following sketch.
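
The following fragment is a minimal sketch of such ordering (the field patterns are illustrative): a field such as product_code is caught by the first, stricter template and mapped as keyword, while any other string falls through to the second template:

"dynamic_templates": [
  { "codes_as_keywords": {
      "match": "*_code",
      "match_mapping_type": "string",
      "mapping": { "type": "keyword" } } },
  { "strings_as_text": {
      "match": "*",
      "match_mapping_type": "string",
      "mapping": { "type": "text" } } } ]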

There's more...

Dynamic templates are very handy when you need to apply a mapping configuration to all the fields. This can be done by adding a dynamic template, similar to this one:

"dynamic_templates" : [
  { "store_generic" : {
      "match" : "*", "mapping" : { "store" : true }
} } ]  

In this example, all the new fields that are added through automatic mapping will be stored.

See also

  • You can find the default Elasticsearch behavior for creating a mapping in the Using explicit mapping creation recipe and the base way of defining a mapping in the Mapping a document recipe.
  • The glob pattern is available at http://en.wikipedia.org/wiki/Glob_pattern.
 

Managing nested objects

There is a special type of embedded object called a nested object. This resolves a problem related to Lucene's indexing architecture, in which all the fields of embedded objects are viewed as a single object (technically speaking, they are flattened). During the search, in Lucene, it is not possible to distinguish between values and different embedded objects in the same multi-valued array.

If we consider the previous order example, it's not possible to distinguish an item's name and its quantity with the same query since Lucene puts them in the same Lucene document object. We need to index them as different documents and then join them. This entire process is managed through nested objects and nested queries.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

A nested object is defined as a standard object with the nested type.

Regarding the example in the Mapping an object recipe, we can change the type from object to nested, as follows:

PUT test/_mapping
{ "properties" : {
      "id" : {"type" : "keyword"},
      "date" : {"type" : "date"},
      "customer_id" : {"type" : "keyword"},
      "sent" : {"type" : "boolean"},
      "item" : {"type" : "nested",
        "properties" : {
            "name" : {"type" : "keyword"},
            "quantity" : {"type" : "long"},
            "price" : {"type" : "double"},
            "vat" : {"type" : "double"}
} } } }

How it works…

When a document is indexed, if an embedded object has been marked as nested, it's extracted from the original document, indexed as a separate document, and saved in a special index position near the parent document.

In the preceding example, we reused the mapping from the Mapping an object recipe, but we changed the type of the item from object to nested. No other action must be taken to convert an embedded object into a nested one.

The nested objects are special Lucene documents that are saved in the same block of data as their parent – this approach allows for fast joining with the parent document.

Nested objects are not searchable with standard queries, only with nested ones. They are not shown in standard query results.
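
As a hedged preview of the nested query (covered in detail in Chapter 6, Relationships and Geo Queries), the following sketch, assuming the mapping above and a matching order document, finds orders that contain an item whose name is tshirt and whose quantity is 10 within the same nested item:

GET test/_search
{ "query": {
    "nested": {
      "path": "item",
      "query": {
        "bool": { "must": [
          { "term": { "item.name": "tshirt" } },
          { "term": { "item.quantity": 10 } }
        ] } } } } }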

The lives of nested objects are related to their parents: deleting/updating a parent automatically deletes/updates all the nested children. Changing the parent means Elasticsearch will do the following:

  • Mark old documents as deleted.
  • Mark all nested documents as deleted.
  • Index the new document version.
  • Index all nested documents.

There's more...

Sometimes, you must propagate information about the nested objects to their parent or root objects. This is mainly to build simpler queries about the parents (such as terms queries without using nested ones). To achieve this, two special properties of nested objects must be used:

  • include_in_parent: This makes it possible to automatically add the nested fields to the immediate parent.
  • include_in_root: This adds the nested object fields to the root object.

These settings add data redundancy, but they reduce the complexity of some queries, thus improving performance.
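
A minimal sketch of include_in_parent follows (the test-nested index name is illustrative); with this setting, the nested fields are also copied into the parent as flat fields, so a plain term query on item.name works without a nested query:

PUT test-nested
{ "mappings": {
    "properties": {
      "item": {
        "type": "nested",
        "include_in_parent": true,
        "properties": {
          "name": { "type": "keyword" },
          "quantity": { "type": "long" }
        } } } } }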

See also

  • Nested objects require a special query to search for them – this will be discussed in the Using nested queries recipe of Chapter 6, Relationships and Geo Queries.
  • The Managing a child document with a join field recipe shows another way to manage child/parent relationships between documents.
 

Managing a child document with a join field

In the previous recipe, we saw how it's possible to manage relationships between objects with the nested object type. The disadvantage of nested objects is their dependence on their parents. If you need to change the value of a nested object, you need to reindex the parent (this causes a potential performance overhead if the nested objects change too quickly). To solve this problem, Elasticsearch allows you to define child documents.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

In the following example, we have two related objects: an Order and an Item.

Their UML representation is as follows:

Figure 2.3 – UML example of an Order/Item relationship

The final mapping should merge the field definitions of both Order and Item, as well as use a special field (join_field, in this example) that defines the parent/child relationship.

To use join_field, follow these steps:

  1. First, we must define the mapping, as follows:
    PUT test/_mapping
    { "properties": {
        "join_field": {
          "type": "join", "relations": { "order": "item" }
        },
        "id": { "type": "keyword" },
        "date": { "type": "date" },
        "customer_id": { "type": "keyword" },
        "sent": { "type": "boolean" },
        "name": { "type": "text" },
        "quantity": { "type": "integer" },
        "vat": { "type": "double" }
    } }

The preceding mapping is very similar to the one in the previous recipe.

  2. If we want to store the joined records, we will need to save the parent first and then the children, like so:
    PUT test/_doc/1?refresh
    { "id": "1", "date": "2018-11-16T20:07:45Z", "customer_id": "100", "sent": true, "join_field": "order" }
    PUT test/_doc/c1?routing=1&refresh
     { "name": "tshirt", "quantity": 10, "price": 4.3, "vat": 8.5,
       "join_field": { "name": "item", "parent": "1" } }

The child item requires special management because we need to pass the parent's ID as the routing value (1 in the preceding example). Furthermore, we need to specify the relation name and the parent ID in the join_field object.
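
Because the child is stored on the parent's shard, retrieving it directly also requires the same routing value; as a quick sketch:

GET test/_doc/c1?routing=1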

How it works…

When several related entities live in the same index, the mapping needs to contain the union of the fields of all of them.

The relationship between objects must be defined in join_field.

There must only be a single join field per mapping; if you need to define several relationships, you can declare them all inside the relations object.

The child document must be indexed in the same shard as the parent; so, when indexed, an extra parameter must be passed, which is routing (we'll learn how to do this in the Indexing a document recipe in Chapter 3, Basic Operations).

A child document doesn't require the parent document to be reindexed when its values change. Consequently, it's fast in terms of indexing, reindexing (updating), and deleting.

There's more...

In Elasticsearch, we have different ways to manage relationships between objects, as follows:

  • Embedding with type=object: This is implicitly managed by Elasticsearch and it considers the embedding as part of the main document. It's fast, but you need to reindex the main document to change the value of the embedded object.
  • Nesting with type=nested: This allows you to accurately search and filter the parent by using nested queries on children. Everything works for the embedded object except for the query (you must use a nested query to search for them).
  • External children documents: Here, the children are the external document, with a join_field property to bind them to the parent. They must be indexed in the same shard as the parent. The join with the parent is a bit slower than the nested one. This is because the nested objects are in the same data block as the parent in the Lucene index and they are loaded with the parent; otherwise, the child document requires more read operations.

Choosing how to model the relationship between objects depends on your application scenario.

Tip

There is also another approach that can be used – decoupling the join relationship – but it performs poorly on big data documents. You execute the join query in two steps: first, collect the IDs of the children/other documents, and then search for them in a field of their parent.

See also

Please refer to the Using the has_child query, Using the top_children query, and Using the has_parent query recipes of Chapter 6, Relationships and Geo Queries, for more details on child/parent queries.

 

Adding a field with multiple mappings

Often, a field must be processed with several core types or in different ways. For example, a string field must be processed as tokenized for search and not tokenized for sorting. To do this, we need to use the special fields (multifield) property.

The fields property is a very powerful feature of mappings because it allows you to use the same field in different ways.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

To define a multifield property, we need to define a fields dictionary containing the subfields. The subfield with the same name as the parent field is the default one.

If we consider the item from our order example, we can index the name like so:

{ "name": {
    "type": "keyword",
    "fields": {
      "name": {"type": "keyword"},
      "tk": {"type": "text"},
      "code": {"type": "text","analyzer": "code_analyzer"}
} },

If we already have a mapping stored in Elasticsearch and we want to migrate a field to a multifield property, it's enough to save an updated mapping that adds the new subfields, and Elasticsearch merges it automatically. New subfields can be added to the fields property at any moment without problems, but the new subfields will only be available when you're searching/aggregating newly indexed documents.

When you add a new subfield to already indexed data, you need to reindex your records to ensure the new subfield is correctly indexed for all of them.
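
One way to reprocess existing documents in place is the Update By Query API; this is only a sketch (indexing operations are covered in Chapter 3, Basic Operations):

POST test/_update_by_query?conflicts=proceed

This re-indexes every existing document with the current mapping, populating the newly added subfields.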

How it works…

During indexing, when Elasticsearch processes a fields property of the multifield type, it reprocesses the same field for every subfield defined in the mapping.

To access the subfields of a multifield, we must build a new path on the base field, plus use the subfield's name. In the preceding example, we have the following:

  • name: This points to the default subfield (the keyword one).
  • name.tk: This points to the standard analyzed (tokenized) text field.
  • name.code: This points to a field that was analyzed with a code extractor analyzer.

As you may have noticed in the preceding example, we changed the analyzer to introduce a code extractor analyzer that allows you to extract the item code from a string.

By using the multifield, if we index a string such as Good Item to buy - ABC1234, we'll have the following:

  • name = Good Item to buy - ABC1234 (useful for sorting)
  • name.tk= ["good", "item", "to", "buy", "abc1234"] (useful for searching)
  • name.code = ["ABC1234"] (useful for searching and aggregations)

In the case of the code analyzer, if the code is not found in the string, no tokens are generated. This makes it possible to develop solutions that carry out information retrieval tasks at index time and use the results at search time.

There's more...

The fields property is very useful in data processing because it allows you to define several ways to process field data.

For example, if we are working on documental content (such as articles, Word documents, and so on), we can define subfields with analyzers that extract names, places, dates/times, geolocations, and so on.

The subfields of a multifield are standard core type fields – we can perform every process we want on them, such as search, filter, aggregation, and scripting.

See also

To find out more about what Elasticsearch analyzers you can use, please refer to the Specifying different analyzers recipe.

 

Mapping a GeoPoint field

Elasticsearch natively supports the use of geolocation types – special types that allow you to localize your document in geographic coordinates (latitude and longitude) around the world.

Two main types are used in the geographic world: the point and the shape. In this recipe, we'll look at GeoPoint – the base element of geolocation.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

The type of the field must be set to geo_point to define a GeoPoint.

We can extend the order example by adding a new field that stores the location of a customer. This results in the following mapping:

PUT test/_mapping
{ "properties": {
    "id": {"type": "keyword",},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "customer_ip": {"type": "ip"},
    "customer_location": {"type": "geo_point"},
    "sent": {"type": "boolean"}
} }

How it works…

When Elasticsearch indexes a document with a GeoPoint field (latitude and longitude), it processes the coordinates and creates special accessory field data to provide faster query capabilities on them. This is because a special data structure is created to internally manage the latitude and longitude.

Depending on the properties, given the latitude and longitude, it's possible to compute the geohash value (for details, I suggest reading https://www.pubnub.com/learn/glossary/what-is-geohashing/). The indexing process also optimizes these values for special computation, such as distance, ranges, and shape match.

GeoPoint has special parameters that allow you to store additional geographic data:

  • lat_lon (the default is false): This allows you to store the latitude and longitude as the .lat and .lon fields. Storing these values improves the performance of many memory algorithms that are used in distance and shape calculus.

It makes sense to set lat_lon to true so that you store them if there is a single point value for a field. This speeds up searches and reduces memory usage during computation.

  • geohash (the default is false): This allows you to store the computed geohash value.
  • geohash_precision (the default is 12): This defines the precision to be used in geohash calculus.

For example, given a geo point value, [45.61752, 9.08363], it can be stored using one of the following syntaxes:

  • customer_location = [45.61752, 9.08363]
  • customer_location.lat = 45.61752
  • customer_location.lon = 9.08363
  • customer_location.geohash = u0n7w8qmrfj

There's more...

GeoPoint is a special type and can accept several formats as input (an indexing example is shown after this list):

  • lat and lon as properties, as shown here:
    { "customer_location": { "lat": 45.61752, "lon": 9.08363 },
  • lat and lon as a string, as follows:
    "customer_location": "45.61752,9.08363",
  • geohash as a string, as shown here:
    "customer_location": "u0n7w8qmrfj",
  • As a GeoJSON array (note that here, lat and lon are reversed), as shown in the following code snippet:
    "customer_location": [9.08363, 45.61752]
 

Mapping a GeoShape field

An extension of the concept of a point is the shape. Elasticsearch provides the geo_shape type, which allows you to manage arbitrary polygons and other geometries.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To be able to use advanced shape management, Elasticsearch requires two JAR libraries in its classpath (usually the lib directory), as follows:

  • Spatial4J (v0.3)
  • JTS (v1.13)

How to do it…

To map a geo_shape type, a user must explicitly provide some parameters:

  • tree (the default is geohash): This is the name of the PrefixTree implementation – geohash for GeohashPrefixTree and quadtree for QuadPrefixTree.
  • precision: This is used instead of tree_levels to provide a more human value to be used in the tree level. The precision number can be followed by the unit; that is, 10 m, 10 km, 10 miles, and so on.
  • tree_levels: This is the maximum number of layers to be used in the prefix tree.
  • distance_error_pct: This sets the maximum error allowed in a prefix tree (the default is 0.025%, with a maximum of 0.5%).

The customer_location mapping, which we saw in the previous recipe using geo_shape, will be as follows:

"customer_location": {
  "type": "geo_shape",
  "tree": "quadtree",
  "precision": "1m" },

How it works…

When a shape is indexed or searched internally, a path tree is created and used.

A path tree is a list of terms that contain geographic information and are computed to improve performance in evaluating geo calculus.

The path tree also depends on the shape's type: point, linestring, polygon, multipoint, or multipolygon.

See also

To understand the logic behind the GeoShape, some good resources are the Elasticsearch page, which tells you about GeoShape, and the sites of the libraries that are used for geographic calculus (https://github.com/spatial4j/spatial4j and http://central.maven.org/maven2/com/vividsolutions/jts/1.13/, respectively).

 

Mapping an IP field

Elasticsearch is used in a lot of systems to collect and search logs, together with tools such as Kibana (https://www.elastic.co/products/kibana) and Logstash (https://www.elastic.co/products/logstash). To improve searches on IP addresses, Elasticsearch provides the ip type, which can store both IPv4 and IPv6 addresses in an optimized way.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

You need to define the type of field that contains an IP address as ip.

Regarding the preceding order example, we can extend it by adding the customer IP, like so:

"customer_ip": { "type": "ip" }

The IP address must be provided in the standard dotted notation, as follows:

"customer_ip":"19.18.200.201"

How it works…

When Elasticsearch processes a document, if a field is of the ip type, it converts its value into a numerical form and generates tokens for fast value searching.

The ip type has special properties:

  • index (the default is true): This defines whether the field must be indexed. If not, false must be used.
  • doc_values (the default is true): This defines whether the field values should be stored in a column-stride fashion to speed up sorting and aggregations.

The other properties (store, boost, null_value, and include_in_all) work as they do for the other base types.

The advantage of using ip fields over strings is greater speed in range queries and filters, as well as lower resource usage (disk and memory).
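
As a quick sketch (assuming the customer_ip field above and an indexed document), ip fields also accept CIDR notation in term queries:

GET test/_search
{ "query": { "term": { "customer_ip": "19.18.200.0/24" } } }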

 

Mapping an Alias field

It is very common to have many different types in several indices. Because Elasticsearch makes it possible to search across many indices at the same time, you often need to filter on common fields.

In the real world, these fields are not always named the same way in all mappings (generally because they are derived from different entities); it's very common to have a mix of the added_date, timestamp, @timestamp, and date_add fields, all of which refer to the same date concept.

The alias field allows you to define an alias name that is resolved at query time, simplifying queries against all the fields with the same meaning.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

If we take the order example that we saw in the previous recipes, we can add a cost alias for the price value in the item object.

This can be achieved by following these steps:

  1. To add this alias, we need to have a mapping that's similar to the following:
    PUT test/_mapping
    { "properties": {
        "id": {"type": "keyword"},
        "date": {"type": "date"},
        "customer_id": {"type": "keyword"},
        "sent": {"type": "boolean"},
        "item": {
          "type": "object",
          "properties": {
            "name": {"type": "keyword"},
            "quantity": {"type": "long"},
            "price": {"type": "double"},
            "vat": {"type": "double"}
    } } } }
  2. Now, we can index a record, as follows:
    PUT test/_doc/1?refresh
    { "id": "1", "date": "2018-11-16T20:07:45Z",
      "customer_id": "100", "sent": true,
      "item": [ { "name": "tshirt", "quantity": 10, "price": 4.3, "vat": 8.5 } ] }
  3. We can search it using the cost alias, like so:
    GET test/_search
    { "query": { "term": { "item.cost": 4.3 } } }

The result will be the saved document.

How it works…

The alias is a convenient way to use a common name for your search fields without changing the data structure. An alias field doesn't change a document's structure, thus allowing more flexibility for your data models.

The alias is resolved when the search indices in the query are expanded, and there is no performance penalty for using it.

If you try to index a document with a value in an alias field, an exception will be thrown.

The path value of the alias field must contain the full resolution of the target field, which must be concrete and must be known when the alias is defined.

In the case of an alias in a nested object, it must be in the same nested scope as the target.

 

Mapping a Percolator field

The Percolator is a special type of field that makes it possible to store an Elasticsearch query inside the field and use it with the percolate query.

The Percolator can be used to detect all the queries that match a document.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

To map a percolator field, follow these steps:

  1. We want to create a Percolator that matches some text in a body field. We can define the mapping like so:
    PUT test-percolator
    { "mappings": {
        "properties": {
          "query": { "type": "percolator"  },
          "body": { "type": "text" }
    } } }
  2. Now, we can store a document with a percolator query inside it, as follows:
    PUT test-percolator/_doc/1?refresh
    { "query": { "match": { "body": "quick brown fox"  }}}
  3. Now, let's execute a search on it, as shown in the following code:
    GET test-percolator/_search
    { "query": {
        "percolate": {
          "field": "query",
          "document": { "body": "fox jumps over the lazy dog" } } } }
  4. This will result in us retrieving the hits of the stored document, as follows:
    {
       ... truncated...
       "hits" : [
         {
         "_index" : "test-percolator", "_id" : "1",
         "_score" : 0.13076457,
         "_source" : {
             "query" : {
                 "match" : { "body" : "quick brown fox" }
             }
         },
         "fields" : { "_percolator_document_slot" : [0]       } } ] } }

How it works…

The percolator field stores an Elasticsearch query inside it.

Because all the percolator queries are parsed up front and kept ready for execution, all the fields that are referenced in the stored queries must be defined in the mapping of the index.

Since the queries stored in all the percolator documents are executed against every percolated document, for the best performance, each stored query should be optimized so that it executes quickly during the percolate query.
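
If you need to check several candidate documents in a single call, the percolate query also accepts a documents array; here is a minimal sketch based on the test-percolator index from this recipe:

GET test-percolator/_search
{ "query": {
    "percolate": {
      "field": "query",
      "documents": [
        { "body": "the quick brown fox jumps over the lazy dog" },
        { "body": "nothing interesting here" }
      ] } } }

In the response, the _percolator_document_slot field lists the positions (starting from 0) of the candidate documents that matched the stored query; here, only slot 0 matches.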

 

Mapping the Rank Feature and Feature Vector fields

It's common to want to score a document dynamically, depending on the context. For example, if you need to give a higher score to documents inside certain categories, the classic scenario is to boost documents based on a stored value, such as page rank, hits, or categories.

Elasticsearch provides two ways to boost your scores based on stored values. One is the Rank Feature field, while the other is its extension, the Rank Features field, which uses a vector of values.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We want to use the rank_feature type to implement a common PageRank scenario where documents are scored based on stored characteristics. To achieve this, follow these steps:

  1. To be able to score based on a pagerank value and an inverse url length, we can use the following mapping:
    PUT test-rank
    {  "mappings": {
        "properties": {
          "pagerank": { "type": "rank_feature" },
          "url_length": {
            "type": "rank_feature",
            "positive_score_impact": false
    } } } }
  2. Now, we can store a document, as shown here:
    PUT test-rank/_doc/1
    { "pagerank": 5, "url_length": 20 }
  3. Now, we can execute a feature query on the pagerank value to return our record with a similar query, like so:
    GET test-rank/_search
    { "query": { "rank_feature": { "field":"pagerank" }}} 

    Important Note

    To query the special rank_feature/rank_features types, we need to use the dedicated rank_feature query type, which exists only for this purpose.

An evolution of the previous functionality is to define a vector of values using the rank_features type; it is usually used to score by topics, categories, or similar facets. We can implement this functionality by following these steps:

  1. First, we must define the mapping for the categories field:
    PUT test-ranks
    { "mappings": {
        "properties": {
          "categories": { "type": "rank_features"  } } } }
  2. Now, we can store some documents in the index by using the following commands:
    PUT test-ranks/_doc/1
    { "categories": { "sport": 14.2, "economic": 24.3 } }
    PUT test-ranks/_doc/2
    { "categories": { "sport": 19.2, "economic": 23.1 } }
  3. Now, we can search based on the saved feature values, as shown here:
    GET test-ranks/_search
    { "query": { "feature": { "field": "categories.sport"   } } }

How it works…

rank_feature and rank_features are special field types that store numeric feature values, which are mainly used to influence the score of the results.

Important Note

The values that are stored in these fields can only be queried using the rank_feature query. They cannot be used in standard queries or aggregations.

The value numbers in rank_feature and rank_features can only be single positive values (multi-values are not allowed).

In the case of rank_features, the value must be a hash composed of string keys and positive numeric values.

There is a flag that changes the scoring behavior: positive_score_impact. This value is true by default, but if you want the value of the feature to decrease the score, you can set it to false. In the pagerank example, the length of the url reduces the score of the document because the longer the url is, the less relevant the page becomes.
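
The rank_feature query can also be tuned with a mathematical function. For example, the following sketch uses the saturation function with a custom pivot (the pivot value 8 is just an example) on the test-rank index from this recipe:

GET test-rank/_search
{ "query": {
    "rank_feature": {
      "field": "pagerank",
      "saturation": { "pivot": 8 } } } }

Documents with a pagerank above the pivot get a score that approaches 1, while lower values are progressively penalized.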

 

Mapping the Search as you type field

One of the most common scenarios is to provide the Search as you type functionality, which is typical of the Google search engine.

This capability is common in many use cases:

  • Completing titles in media websites
  • Completing product names in e-commerce websites
  • Completing document names or authors in document management systems
  • Suggesting best-associated terms to search on based on the actual knowledge base (collection of documents)

This type provides facilities to achieve this functionality.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion for Elasticsearch.

How to do it…

We want to use the search_as_you_type type to implement a completer (a widget that completes names/values) for titles for our media film streaming platform. To achieve this, follow these steps:

  1. To be able to provide "search as you type" on a title field, we will use the following mapping:
    PUT test-sayt
    { "mappings": {
        "properties": {
          "title": { "type": "search_as_you_type"  }
    } } } 
  2. Now, we can store some documents, as shown here:
    PUT test-sayt/_doc/1
    { "title": "Ice Age" }
    PUT test-sayt/_doc/2
    { "title": "The Polar Express" }
    PUT test-sayt/_doc/3
    { "title": "The Godfather" }
  3. Now, we can execute a match query on the title value to return our records:
    GET test-sayt/_search
    {
      "query": {
        "multi_match": {
          "query": "the p", "type": "bool_prefix",
          "fields": [ "title", "title._2gram", "title._3gram" ]
    } } }

The result will be something similar to the following:

{
  …truncated…
    "hits" : [
      {
        "_index" : "test-sayt", "_id" : "2", "_score" : 2.4208174,
        "_source" : { "title" : "The Polar Express" }
      },
    …truncated…
}

As you can see, more relevant results (those that contain more terms related to the search) score better!

How it works…

Due to the high demand for the Search as you type feature, this special mapping type was created.

This special mapping type is a helper that simplifies the process of creating a field with multiple subfields that can map the indexing requirements and provide an efficient Search as you type capability.

For example, for the title field, the following field and subfields are created:

  • title: The root field, analyzed according to the configured analyzer
  • title._2gram: The title text processed with shingles of size 2
  • title._3gram: The title text processed with shingles of size 3
  • title._index_prefix: The title._3gram field processed with an edge n-gram filter, used for efficient prefix matching

The "search_as_you_type" field can be customized using the max_shingle_size parameter (the default is 3). This parameter allows you to define the maximum size of the gram to be created.

The number of ngram subfields is given by max_shingle_size - 1; usually, the best values are 3 or 4. Larger values only increase the size of the index without generally providing any query quality benefit.
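
For example, a sketch of a mapping with a larger shingle size (the test-sayt-4 index name is just an example) looks like this:

PUT test-sayt-4
{ "mappings": {
    "properties": {
      "title": { "type": "search_as_you_type", "max_shingle_size": 4 } } } }

With this setting, an additional title._4gram subfield is created, at the cost of a larger index.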

See also

Please refer to the Using a match query recipe in Chapter 5, Text and Numeric Queries, to learn more about match queries.

 

Using the Range Fields type

Sometimes, we have values that represent a continuous range of values between an upper and lower bound. Some of the common scenarios of this are as follows:

  • Price range (that is, from $4 to $10)
  • Date interval (that is, from 8 A.M. to 8 P.M., December 2020, Summer 2021, Q3 2020, and so on)

Without a dedicated type, querying for a value that falls inside such an interval is not easy in Elasticsearch; in the worst case, you have to convert the continuous range into discrete values by materializing every value at a fixed step. This approach greatly increases the size of the index and reduces query performance.

Range mappings were created to provide support for continuous values in Elasticsearch. For this reason, when we cannot store a single exact value but only a range, we need to use range types.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it...

We want to use range types to implement stock market values that are defined by low and high prices and the timeframe of the transaction. To achieve this, follow these steps:

  1. To populate our stock, we need to create an index with range fields. Let's use the following mapping:
    PUT test-range
    { "mappings": {
        "properties": {
          "price": { "type": "float_range" },
          "timeframe": { "type": "date_range" }
    } } } 
  2. Now, we can store some documents, as shown here:
    PUT test-range/_bulk
    {"index":{"_index":"test-range","_id":"1"}}
    {"price":{"gte":1.5,"lt":3.2},"timeframe":{"gte":"2022-01-01T12:00:00","lt":"2022-01-01T12:00:01"}}
    {"index":{"_index":"test-range","_id":"2"}}
    {"price":{"gte":1.7,"lt":3.7},"timeframe":{"gte":"2022-01-01T12:00:01","lt":"2022-01-01T12:00:02"}}
    {"index":{"_index":"test-range","_id":"3"}}
    {"price":{"gte":1.3,"lt":3.3},"timeframe":{"gte":"2022-01-01T12:00:02","lt":"2022-01-01T12:00:03"}}
  3. Now, we can execute a query for filtering on price and timeframe values to check the correct indexing of the data:
    GET test-range/_search
    { "query": {
        "bool": {
          "filter": [ 
             { "term": { "price": { "value": 2.4 } } },
              { "term": { "timeframe": { "value": "2022-01-01T12:00:02" } } }
    ] } } }

The result will be something similar to the following:

{
  …truncated…
    "hits" : [
      { "_index" : "test-range", "_id" : "3", "_score" : 0.0,
        "_source" : {
          "price" : { "gte" : 1.3, "lt" : 3.3 },
          "timeframe" : {
            "gte" : "2022-01-01T12:00:02",
            "lt" : "2022-01-01T12:00:03"
    …truncated…
}

How it works…

Not every base type has a range counterpart. The range types supported by Elasticsearch are as follows:

  • integer_range: This is used to store signed 32-bit integer values.
  • float_range: This is used to store signed 32-bit floating-point values.
  • long_range: This is used to store signed 64-bit integer values.
  • double_range: This is used to store signed 64-bit floating-point values.
  • date_range: This is used to store date values as 64-bit integers.
  • ip_range: This is used to store IPv4 and IPv6 values.

These range types are very useful for all cases where the values are not exact.

When you're storing a document in Elasticsearch, the field can be composed using the following parameters:

  • gt or gte for the lower bound of the range
  • lt or lte for the upper bound of the range

    Note

    Range types can be used for querying values, but they have limited support for aggregation: they only support histogram and cardinality aggregations.
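
Besides term queries, range fields can be matched with a range query using the relation parameter (intersects, which is the default, within, or contains). Here is a sketch against the test-range index from this recipe:

GET test-range/_search
{ "query": {
    "range": {
      "price": { "gte": 1.0, "lte": 4.0, "relation": "within" } } } }

This returns the documents whose whole price range falls inside the 1.0 to 4.0 interval.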

See also

  • The Using the range query recipe in Chapter 5, Text and Numeric Queries, for range queries
  • The Executing histogram aggregations recipe in Chapter 7, Aggregation
 

Using the Flattened field type

In many applications, it is possible to define custom metadata or configuration composed of key-value pairs. This use case is not optimal for Elasticsearch: creating a new mapping entry for every key quickly becomes hard to manage as the mappings grow very large.

X-Pack provides a type (free for use) to solve this problem: the flattened field type.

As the name suggests, it takes all the key-value pairs (including nested ones) and indexes them in a flat way, thus solving the problem of mapping explosion.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We want to use Elasticsearch to store configurations with a varying number of fields. To achieve this, follow these steps:

  1. To create our configuration index with a flattened field, we will use the following mapping:
    PUT test-flattened
    { "mappings": {
        "properties": {
          "name": { "type": "keyword" },
          "configs": { "type": "flattened" } } } }
  2. Now, we can store some documents that contain our configuration data:
    PUT test-flattened/_bulk
    {"index":{"_index":"test-flattened","_id":"1"}}
    {"name":"config1","configs":{"key1":"value1","key3":"2022-01-01T12:00:01"}}
    {"index":{"_index":"test-flattened","_id":"2"}}
    {"name":"config2","configs":{"key1":true,"key2":30}}
    {"index":{"_index":"test-flattened","_id":"3"}}
    {"name":"config3","configs":{"key4":"test","key2":30.3}}
  3. Now, we can execute a query that's searching for the text in all the configurations:
    POST test-flattened/_search
    { "query": { "term": { "configs": "test" } } }

Alternatively, we can search for a particular key in the configs object, like so:

POST test-flattened/_search
{ "query": { "term": { "configs.key4": "test" } } }

The result for both queries will be as follows:

{ …truncated…
    "hits" : [
            {
        "_index" : "test-flattened", 
        "_id" : "3",  "_score" : 1.2330425,
        "_source" : {
          "name" : "config3",
          "configs" : { "key4" : "test", "key2" : 30.3    }
    …truncated…

How it works…

This special field type takes a JSON object passed in a document and flattens its key/value pairs so that they can be searched without defining a mapping for the fields in the JSON content.

This helps prevent mapping explosion when the JSON contains a large number of different fields.

During the indexing process, tokens are created for each leaf value of the JSON object using a keyword analyzer. Due to this, numbers, dates, IPs, and other formats are converted into text, and the only queries that can be executed are the ones supported by keyword tokenization. This includes term, terms, terms_set, prefix, range (based on text comparison), match, multi_match, query_string, simple_query_string, and exists.
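
Since prefix is among the supported query types, a quick sketch of searching all leaf values of configs by prefix could look like this:

POST test-flattened/_search
{ "query": { "prefix": { "configs": "val" } } }

This matches config1 because one of its leaf values (value1) starts with the given prefix.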

See also

See Chapter 5, Text and Numeric Queries, for more references on the cited query types.

 

Using the Point and Shape field types

The geoprocessing power of Elasticsearch provides capabilities to a large number of applications. However, it has one limitation: it only works with world (geographic) coordinates.

Using Point and Shape types, X-Pack extends the geo capabilities to every two-dimensional planar coordinate system.

Common scenarios for this use case include mapping and documenting building coordinates and checking if documents are inside a shape.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We want to use Elasticsearch to map a device's coordinates in our shop. To achieve this, follow these steps:

  1. To create our index for storing devices and their location, we will use the following mapping:
    PUT test-point
    { "mappings": {
        "properties": {
          "device": { "type": "keyword" },
          "location": { "type": "point" } } } }
  2. Now, we can store some documents that contain our device's data:
    PUT test-point/_bulk
    {"index":{"_index":"test-point","_id":"1"}}
    {"device":"device1","location":{"x":10,"y":10}}
    {"index":{"_index":"test-point","_id":"2"}}
    {"device":"device2","location":{"x":10,"y":15}}
    {"index":{"_index":"test-point","_id":"3"}}
    {"device":"device3","location":{"x":15,"y":10}}

At this point, we want to create shapes in our shop so that we can divide it into parts and check if the people/devices are inside the defined shape. To do this, follow these steps:

  1. First, let's create an index to store our shapes:
    PUT test-shape
    { "mappings": {
        "properties": { 
          "room": { "type": "keyword" },
          "geometry": { "type": "shape" } } } } 
  2. Now, we can store a document to test the mapping:
    POST test-shape/_doc/1
    { "room":"hall",
      "geometry" : {
        "type" : "polygon",
        "coordinates" : [
          [ [8.0, 8.0], [8.0, 12.0], [12.0, 12.0], [12.0, 8.0], [8.0, 8.0]] ] } }
  3. Now, let's search our devices in our stored shape:
    POST test-point/_search
    { "query": {
        "shape": {
          "location": {
            "indexed_shape": { "index": "test-shape", "id": "1", "path": "geometry" } } } } }

The result of the query will be as follows:

{  …truncated…
    "hits" : [ {
"_index" : "test-point",  "_id" : "1", "_score" : 0.0,
        "_source" : {
          "device" : "device1",
          "location" : { "x" : 10, "y" : 10 }
    …truncated…

How it works…

The point and shape types are used to manage every type of two-dimensional planar coordinate system inside documents. Their usage is similar to geo_point and geo_shape.

The advantage of storing shapes in Elasticsearch is that it simplifies matching coordinates against shapes. This was shown in our query example, where we loaded the shape's geometry from the test-shape index and ran the search against the test-point index.

Managing coordinate systems and shapes is a very large topic that requires knowledge of shape types and geo models since they are strongly bound to data models.
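
The shape can also be passed inline instead of referencing an indexed one. The following sketch checks which devices fall inside an envelope (a rectangle defined by its upper-left and lower-right corners):

POST test-point/_search
{ "query": {
    "shape": {
      "location": {
        "shape": {
          "type": "envelope",
          "coordinates": [ [8.0, 12.0], [12.0, 8.0] ]
        },
        "relation": "intersects" } } } }

With the sample data, only device1 at (10, 10) falls inside this rectangle.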

See also

  • The official documentation for Point types can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/point.html, while the official documentation for Shape types can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/shape.html.
  • The official documentation about Shape Query can be found at https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-shape-query.html.
 

Using the Dense Vector field type

Elasticsearch is often used to store machine learning data for training algorithms. X-Pack provides the Dense Vector field to store vectors that have up to 2,048 dimension values.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We want to use Elasticsearch to store a vector of values for our machine learning models. To achieve this, follow these steps:

  1. To create an index to store a vector of values, we will use the following mapping:
    PUT test-dvector
    { "mappings": {
        "properties": {
          "vector": { "type": "dense_vector", "dims": 4 },
          "model": { "type": "keyword" } } } }
  2. Now, we can store a document to test the mapping:
    POST test-dvector/_doc/1
    { "model":"pipe_flood", "vector" : [8.1, 8.3, 12.1, 7.32] }

How it works...

The Dense Vector field is a helper field for storing vectors in Elasticsearch.

The ingested data for the field must be a list of floating-point values with exactly the number of dimensions provided by the dims property of the mapping (4, in our example).

If the dimension of the vector field is incorrect, an exception is raised, and the document is not indexed.

For example, let's see what happens when we try to index a similar document with the wrong vector dimension:

POST test-dvector/_doc/1
{ "model":"pipe_flood", "vector" : [8.1, 8.3, 12.1] }

We will see an exception like the following, which enforces the correct dimension; the document will not be stored:

{
  "error" : {
    "root_cause" : [
      {
        "type" : "mapper_parsing_exception",
        "reason" : "failed to parse"
      }
    ],
    "type" : "mapper_parsing_exception",
    "reason" : "failed to parse",
    "caused_by" : {
      "type" : "illegal_argument_exception",
      "reason" : "Field [vector] of type [dense_vector] of doc [1] has number of dimensions [3] less than defined in the mapping [4]"
    }
  },
  "status" : 400
}
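
Although this recipe focuses on storage, a common way to exploit stored vectors is scoring by similarity. The following is a minimal sketch using a script_score query with the cosineSimilarity function (the query vector values are arbitrary examples):

GET test-dvector/_search
{ "query": {
    "script_score": {
      "query": { "match_all": {} },
      "script": {
        "source": "cosineSimilarity(params.query_vector, 'vector') + 1.0",
        "params": { "query_vector": [8.0, 8.2, 12.0, 7.3] } } } } }

The + 1.0 offset keeps the score non-negative, since cosine similarity ranges from -1 to 1 and Elasticsearch does not allow negative scores.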
 

Using the Histogram field type

Histograms are a common data type for analytics and machine learning analysis. We can store Histograms in the form of values and counts; they are not indexed, but they can be used in aggregations.

The histogram field type is a special mapping that's available in X-Pack that is commonly used to store the results of Histogram aggregations in Elasticsearch for further processing, such as to compare the aggregation results at different times.

Getting ready

You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

In this recipe, we will simulate a common use case of Histogram data stored in Elasticsearch. Here, we will use a Histogram that records the millimeters of rain per year for our advanced analytics solution. To achieve this, follow these steps:

  1. First, let's create an index for the Histogram by using the following mapping:
    PUT test-histo
    { "mappings": {
        "properties": {
          "histogram": { "type": "histogram" },
          "model": { "type": "keyword" } } } }
  2. Now, we can store a document to test the mapping:
    POST test-histo/_doc/1
    { "model":"show_level", "histogram" : { "values" : [2016, 2017, 2018, 2019, 2020, 2021],  "counts" : [283, 337, 323, 312, 236, 232] } }

How it works…

The histogram field type specializes in storing Histogram data. It must be provided as a JSON object composed of the values and counts fields, both with the same number of items. The only supported aggregations are the following ones; we will look at these in more detail in Chapter 7, Aggregations:

  • Metric aggregations such as min, max, sum, value_count, and avg
  • The percentiles and percentile_ranks aggregations
  • The boxplot aggregation
  • The histogram aggregation

The data is not indexed, but you can still check whether a document has this field populated by using the exists query.
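
For example, a sketch of one of the supported metric aggregations on the stored Histogram data looks like this:

GET test-histo/_search
{ "size": 0,
  "aggs": {
    "rain_avg": { "avg": { "field": "histogram" } } } }

The aggregation uses the stored values and counts pairs to compute a weighted average, without the raw measurements being indexed.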

See also

  • Aggregations will be discussed in more detail in Chapter 7, Aggregations
  • The Using the exist query recipe in Chapter 5, Text and Numeric Queries
 

Adding metadata to a mapping

Sometimes, when we are working with our mapping, we may need to store some additional data to be used for display purposes, ORM facilities, permissions, or simply to track them in the mapping.

Elasticsearch allows you to store every kind of JSON data you want in the mapping with the special _meta field.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

The _meta mapping field can be populated with any data we want in JSON format, like so:

{ "_meta": {
    "attr1": ["value1", "value2"],
    "attr2": { "attr3": "value3" }
  } }

How it works…

When Elasticsearch processes a new mapping and finds a _meta field, it stores it as-is in the global mapping status and propagates the information to all the cluster nodes. The content of the _meta field is only checked to ensure that it is valid JSON; Elasticsearch does not interpret it in any way, so you can populate it with anything you need, as long as it is valid JSON.

_meta is only used for storage purposes; it's neither indexed nor searchable. It can be used to enrich your mapping with custom information that can be consumed by your applications.

It can be used for the following reasons:

  • Storing type metadata:
    {"name": "Address", "description": "This entity store address information"}
  • Storing object relational mapping (ORM)-related information (such as mapping class and mapping transformations):
    {"class": "com.company.package.AwesomeClass", "properties" : { "address":{"class": "com.company.package.Address"}} }
  • Storing type permission information:
    {"read":["user1", "user2"], "write":["user1"]}
  • Storing extra type information (that is, the icon filename, which is used to display the type):
    {"icon":"fa fa-alert" }
  • Storing template parts for rendering web interfaces:
    {"fragment":"<div><h1>$name</h1><p>$description</p></div>" }
 

Specifying different analyzers

In the previous recipes, we learned how to map different fields and objects in Elasticsearch, and we described how easy it is to change the standard analyzer with the analyzer and search_analyzer properties.

In this recipe, we will look at several analyzers and learn how to use them to improve indexing and searching quality.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

How to do it…

Every core type field allows you to specify a custom analyzer for indexing and for searching as field parameters.

For example, if we want the name field to use a standard analyzer for indexing and a simple analyzer for searching, the mapping will be as follows:

{ "name": {
    "type": "string",
    "index_analyzer": "standard",
    "search_analyzer": "simple"
  } }

How it works…

The concept of the analyzer comes from Lucene (the core of Elasticsearch). An analyzer is a Lucene element that is composed of a tokenizer that splits text into tokens, as well as one or more token filters. These filters carry out token manipulation such as lowercasing, normalization, removing stop words, stemming, and so on.

During the indexing phase, when Elasticsearch processes a field that must be indexed, an analyzer is chosen. First, it checks whether one is defined in the field's analyzer parameter; if not, it falls back to the index's default analyzer and, finally, to the standard analyzer.

Choosing the correct analyzer is essential to getting good results during the query phase.

Elasticsearch provides several analyzers in its standard installation. The following table shows the most common ones:

Figure 2.4 – List of the most common general-purpose analyzers

For special language purposes, Elasticsearch supports a set of analyzers aimed at analyzing text in a specific language, such as Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Italian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
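
To quickly compare how different analyzers tokenize the same text, you can use the _analyze API; here is a small example:

POST _analyze
{ "analyzer": "standard", "text": "The Quick Brown Fox!" }

POST _analyze
{ "analyzer": "simple", "text": "The Quick Brown Fox!" }

Both return lowercased tokens in this case, but the simple analyzer splits on every non-letter character, while the standard analyzer uses Unicode text segmentation rules.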

See also

Several Elasticsearch plugins extend the list of available analyzers; the most famous ones are the official ICU and Phonetic analysis plugins.

 

Using index components and templates

Real-world index mappings can be very complex, and parts of them can often be reused across different indices. To simplify this management, mappings can be divided into the following:

  • Components: These will collect the reusable parts of the mapping.
  • Index templates: These aggregate the components in a single template.

Using components is the most manageable way to scale on large index mappings because they can simplify large template management.

Getting ready

You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.

To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.

How to do it…

We want to build an index mapping composed of two reusable components. To achieve this, follow these steps:

  1. First, we will create three components for the timestamp, order, and items. These will store parts of our index mapping:
    PUT _component_template/timestamp-management
    { "template": {
        "mappings": {
          "properties": {
            "@timestamp": { "type": "date"  } } } } }
    PUT _component_template/order-data
    { "template": {
        "mappings": {
          "properties": {
            "id": { "type": "keyword" },
            "date": { "type": "date" },
            "customer_id": { "type": "keyword" },
            "sent": { "type": "boolean" } } } } }
    PUT _component_template/items-data
    { "template": {
        "mappings": {
          "properties": {
            "item": {
              "type": "object",
              "properties": {
                "name": { "type": "keyword" },
                "quantity": { "type": "long" },
                "cost": { "type": "alias", "path": "item.price" },
                "price": { "type": "double" },
                "vat": { "type": "double" } } } } } } }
  2. Now, we can create an index template that can sum them up:
    PUT _index_template/order
    {
      "index_patterns": ["order*"],
      "template": {
        "settings": { "number_of_shards": 1 },
        "mappings": {
          "properties": { "id": { "type": "keyword" } } 
         },
        "aliases": { "order": { } }
      },
      "priority": 200,
      "composed_of": ["timestamp-management", "order-data", "items-data"],
      "version": 1,
      "_meta": { "description": "My order index template" } }

How it works…

The process of using index components to build index templates is very simple: you register as many components as you wish (Step 1 of this recipe) and then aggregate them when you define the template (Step 2). By using this approach, your template is divided into blocks, and the index template is simpler to manage and easier to reuse.

For simple use cases, using components to build index templates is too verbose. This approach shines when you need to manage different logs or document types in Elasticsearch that share common parts, because you can refactor and reuse them very quickly.

Components are simple partial templates that are merged into an index template. The index template parameters are as follows:

  • index_patterns: This is a list of index glob patterns. When an index is created and its name matches one of the patterns, the template is applied.
  • aliases: This is an optional alias definition to be applied to the created index.
  • template: This is the template to be applied to the index.
  • priority: This is an optional priority used to choose between templates when several match the same index; the template with the highest priority wins. The built-in Elastic templates use a priority of 100, so set a value above 100 if you want your custom template to take precedence over them.
  • version: This is an optional incremental number that is managed by the user to keep track of the updates that are made to the template.
  • _meta: This is an optional JSON object that contains metadata for the index.
  • composed_of: This is an optional list of index components that are merged to build the final index mapping.

    Note

    This functionality is available from Elasticsearch version 7.8 and above.
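
To verify that the composed mapping is applied, you can simply create an index whose name matches the pattern and inspect its mapping (the order-2022 index name is just an example):

PUT order-2022
GET order-2022/_mapping

The returned mapping contains the fields from all three components plus those defined directly in the template.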

See also

The Adding metadata to a mapping recipe in this chapter about using the _meta field.

About the Author
  • Alberto Paro

    Alberto Paro is an engineer, manager, and software developer. He currently works as technology architecture delivery associate director of the Accenture Cloud First data and AI team in Italy. He loves to study emerging solutions and applications, mainly related to cloud and big data processing, NoSQL, Natural language processing (NLP), software development, and machine learning. In 2000, he graduated in computer science engineering from Politecnico di Milano. Then, he worked with many companies, mainly using Scala/Java and Python on knowledge management solutions and advanced data mining products, using state-of-the-art big data software. A lot of his time is spent teaching how to effectively use big data solutions, NoSQL data stores, and related technologies.
