Mapping is a primary concept in Elasticsearch that defines how the search engine should process a document and its fields to be effectively used in search and aggregations.
Search engines perform two main operations: indexing (processing documents and storing them in an index) and searching (retrieving documents from the index via queries).
These two operations are strictly connected; an error in the indexing step leads to unwanted or missing search results.
Elasticsearch, by default, uses dynamic mapping at the index level. When indexing, if a mapping is not provided, a default one is created by guessing the structure from the JSON fields that the document is composed of. This new mapping is then automatically propagated to all the cluster nodes: it becomes part of the cluster's state.
The default type mapping has sensible default values, but when you want to change its behavior or customize other aspects of indexing (special fields, storing, ignoring, completion, and so on), you need to provide your own mapping definition.
In this chapter, we'll look at all the possible mapping field types that document mappings are composed of.
In this chapter, we will cover the following recipes:
To follow and test the commands shown in this chapter, you must have a working Elasticsearch cluster installed on your system, as described in Chapter 1, Getting Started.
To simplify how you manage and execute these commands, I suggest that you install Kibana so that you have a more advanced environment to execute Elasticsearch queries.
If we consider the index as a database in the SQL world, mapping is similar to the create table definition.
Elasticsearch can understand the structure of the document that you are indexing (by reflection) and create the mapping definition automatically; in Elasticsearch terminology, this mechanism is called dynamic mapping.
To execute the code in this recipe, you will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute these commands, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar platforms. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
To understand the examples and code in this recipe, basic knowledge of JSON is required.
You can explicitly create a mapping by adding a new document to Elasticsearch. For this, perform the following steps:
PUT test
The output will be as follows:
{ "acknowledged" : true, "shards_acknowledged" : true, "index" : "test" }
PUT test/_doc/1
{"name": "Paul", "age": 35}
The output will be as follows:
{
  "_index" : "test",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : { "total" : 2, "successful" : 1, "failed" : 0 },
  "_seq_no" : 0,
  "_primary_term" : 1
}
GET test/_mapping
{
  "test" : {
    "mappings" : {
      "properties" : {
        "age" : { "type" : "long" },
        "name" : {
          "type" : "text",
          "fields" : {
            "keyword" : { "type" : "keyword", "ignore_above" : 256 }
          }
        }
      }
    }
  }
}
DELETE test
The output will be as follows:
{ "acknowledged" : true }
The first command (Step 1) creates an index where we can configure mappings in the future, if required, and store documents.
The second command (Step 2) inserts a document in the index (we'll learn how to create the index in the Creating an index recipe of Chapter 3, Basic Operations, and record indexing in the Indexing a document recipe of Chapter 3, Basic Operations).
Elasticsearch reads all the default properties for the fields of the mapping and processes each field according to its inferred type.
In Elasticsearch, every document has a unique identifier within an index, which is stored in the special _id field of the document. The _id field can be provided at index time or can be assigned automatically by Elasticsearch if it is missing.
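The guessing step can be sketched in a few lines of Python. This is only an illustration of the inference rules visible in the GET test/_mapping output above, not the actual Elasticsearch implementation:

```python
def infer_field_mapping(value):
    """Guess an Elasticsearch field mapping from a JSON value."""
    if isinstance(value, bool):   # check bool before int: bool subclasses int
        return {"type": "boolean"}
    if isinstance(value, int):
        return {"type": "long"}
    if isinstance(value, float):
        return {"type": "float"}
    if isinstance(value, str):
        # A string becomes a text field with a keyword subfield, exactly
        # as in the GET test/_mapping output above.
        return {
            "type": "text",
            "fields": {"keyword": {"type": "keyword", "ignore_above": 256}},
        }
    raise ValueError(f"type not handled in this sketch: {type(value).__name__}")

def infer_mapping(document):
    """Build a mapping body by inferring the type of every top-level field."""
    return {"properties": {k: infer_field_mapping(v) for k, v in document.items()}}

inferred = infer_mapping({"name": "Paul", "age": 35})
```

Running this on the document from Step 2 reproduces the shape of the mapping returned in Step 3.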
When a mapping type is created or changed, Elasticsearch automatically propagates mapping changes to all the nodes in the cluster so that all the shards are aligned to process that particular type.
In Elasticsearch 7.x, there was a default type (_doc); it was removed in Elasticsearch 8.x and above.
Please refer to the following recipes in Chapter 3, Basic Operations:
Using dynamic mapping creation makes it possible to quickly start ingesting data using a schemaless approach, without being concerned about field types. However, to achieve better results and performance during indexing, it's often necessary to manually define a mapping.
Fine-tuning mapping brings some advantages, such as the following:
Elasticsearch allows you to use base fields with a wide range of configurations.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
To execute this recipe's examples, you will need to create an index named test, where you can put mappings, as explained in the Using explicit mapping creation recipe.
Let's use a semi-real-world example of a shop order for our eBay-like shop:
The order record must be converted into an Elasticsearch mapping definition, as follows:

PUT test/_mapping
{
  "properties" : {
    "id" : {"type" : "keyword"},
    "date" : {"type" : "date"},
    "customer_id" : {"type" : "keyword"},
    "sent" : {"type" : "boolean"},
    "name" : {"type" : "keyword"},
    "quantity" : {"type" : "integer"},
    "price" : {"type" : "double"},
    "vat" : {"type" : "double", "index": false}
  }
}
Now, the mapping is ready to be put in the index. We will learn how to do this in the Putting a mapping in an index recipe of Chapter 3, Basic Operations.
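When an index has many fields, the mapping body can be assembled programmatically before being sent via PUT test/_mapping. A minimal Python sketch (build_mapping and the schema table are hypothetical helpers, not part of any Elasticsearch client):

```python
def build_mapping(schema, extra_options=None):
    """Assemble an Elasticsearch mapping body from a field->type table."""
    extra_options = extra_options or {}
    properties = {
        field: {"type": es_type, **extra_options.get(field, {})}
        for field, es_type in schema.items()
    }
    return {"properties": properties}

# The order record from this recipe, as a compact schema.
order_schema = {
    "id": "keyword", "date": "date", "customer_id": "keyword",
    "sent": "boolean", "name": "keyword", "quantity": "integer",
    "price": "double", "vat": "double",
}
# vat is not searchable (index: false), as in the mapping above.
mapping = build_mapping(order_schema, {"vat": {"index": False}})
```

The resulting dictionary can be serialized to JSON and used as the request body.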
Field types must be mapped to one of the Elasticsearch base types, and options on how the field must be indexed need to be added.
The following table is a reference for the mapping types:
Depending on the data type, it's possible to give explicit directives to Elasticsearch when you're processing the field for better management. The most used options are as follows:
- store (default false): This marks the field to be stored in a separate index fragment for fast retrieval. Storing a field consumes disk space but reduces computation if you need to extract it from a document (that is, in scripting and aggregations). The possible values for this option are true and false. Stored fields are always returned as an array of values for consistency.
- index: This defines whether or not the field should be indexed. The possible values for this parameter are true and false. Fields that are not indexed are not searchable (the default is true).
- null_value: This defines a default value to be used if the field is null.
- boost: This is used to change the importance of a field (the default is 1.0). boost works on a term level only, so it's mainly used in term, terms, and match queries.
- search_analyzer: This defines an analyzer to be used during the search. If it's not defined, the analyzer of the parent object is used (the default is null).
- analyzer: This sets the default analyzer to be used (the default is null).
- norms: This controls the Lucene norms. This parameter is used to score queries better. If the field is only used for filtering, it's a best practice to disable it to reduce resource usage (the default is true for analyzed fields and false for not_analyzed ones).
- copy_to: This allows you to copy the content of a field to another one to achieve functionality similar to the _all field.
- ignore_above: This allows you to skip indexing a string if it's longer than this value. This is useful for processing fields for exact filtering, aggregations, and sorting. It also prevents a single term token from becoming too big and prevents errors due to the Lucene term byte-length limit of 32,766. The maximum suggested value is 8191 (https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html).

From Elasticsearch version 6.x onward, as shown in the Using explicit mapping creation recipe, the type dynamically inferred for a string is a multifield mapping:

- text: This mapping allows textual queries (that is, term, match, and span queries). In the example provided in the Using explicit mapping creation recipe, this was name.
- The keyword subfield is used for keyword mapping. This field can be used for exact term matching, aggregation, and sorting. In the example provided in the Using explicit mapping creation recipe, the referred field was name.keyword.

Another important parameter, available only for text mapping, is term_vector (the vector of terms that compose a string). Please refer to the Lucene documentation for further details at https://lucene.apache.org/core/8_7_0/core/org/apache/lucene/index/Terms.html.
term_vector can accept the following values:

- no: This is the default value; the term vector is skipped.
- yes: This stores the term vector.
- with_offsets: This stores the term vector with token offsets (the start and end positions in the block of characters).
- with_positions: This stores the position of each token in the term vector.
- with_positions_offsets: This stores all the term vector data.
- with_positions_payloads: This stores the positions and payloads of the tokens in the term vector.
- with_positions_offsets_payloads: This stores all the term vector data, plus payloads.

Term vectors allow fast highlighting but consume disk space due to storing additional text information. It's a best practice to only activate them in fields that require highlighting, such as titles or document content.
You can refer to the following sources for further details on the concepts of this chapter:
Array or multi-value fields are very common in data models (such as multiple phone numbers, addresses, names, aliases, and so on), but they're not natively supported in traditional SQL solutions.
In SQL, multi-value fields require you to create accessory tables that must be joined to gather all the values, leading to poor performance when the cardinality of the records is huge.
Elasticsearch, which works natively in JSON, provides support for multi-value fields transparently.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
To use the Array type in our mapping, perform the following steps:

1. Define the fields as usual; no extra markup is needed for a field to accept multiple values:

{
  "properties" : {
    "name" : {"type" : "keyword"},
    "tag" : {"type" : "keyword", "store" : true},
    ...
  }
}

2. Index a document (document1) with a single tag value:

{"name": "document1", "tag": "awesome"}

3. Index another document (document2) with a list of tag values:

{"name": "document2", "tag": ["cool", "awesome", "amazing"]}
Elasticsearch transparently manages the array: there is no difference if you declare a single value or a multi-value due to its Lucene core nature.
Multi-values for fields are managed in Lucene, so you can add them to a document with the same field name. For people with a SQL background, this behavior may be quite strange, but this is a key point in the NoSQL world as it reduces the need for a join query and creates different tables to manage multi-values. An array of embedded objects has the same behavior as simple fields.
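The "a field holds zero or more values" model can be sketched as a simple normalization (index_values is a hypothetical helper showing the idea, not an Elasticsearch API):

```python
def index_values(value):
    """Normalize a JSON field value to the list of values Lucene indexes:
    a scalar and a one-element array are treated identically."""
    if value is None:
        return []
    return value if isinstance(value, list) else [value]

single = index_values("awesome")
multi = index_values(["cool", "awesome", "amazing"])
```

This is why the mapping for tag in the previous step needs no array-specific configuration.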
The object type is one of the most common field aggregation structures in documental databases.
An object is a base structure (analogous to a record in SQL): in JSON types, they are defined as key/value pairs inside the {}
symbols.
Elasticsearch extends the traditional use of objects (which are flat in DBMS), thus allowing for recursive embedded objects.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. Again, I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We can rewrite the mapping code from the previous recipe using an array of items:

PUT test/_mapping
{
  "properties" : {
    "id" : {"type" : "keyword"},
    "date" : {"type" : "date"},
    "customer_id" : {"type" : "keyword", "store" : true},
    "sent" : {"type" : "boolean"},
    "item" : {
      "type" : "object",
      "properties" : {
        "name" : {"type" : "text"},
        "quantity" : {"type" : "integer"},
        "price" : {"type" : "double"},
        "vat" : {"type" : "double"}
      }
    }
  }
}
Elasticsearch speaks native JSON, so every complex JSON structure can be mapped in it.
When Elasticsearch is parsing an object type, it tries to extract the fields and processes them according to its defined mapping; if a field is not defined, it learns the structure of the object using reflection.
The most important attributes of an object are as follows:
- properties: This is a collection of fields or objects (we can consider them as columns in the SQL world).
- enabled: This establishes whether or not the object should be processed. If it's set to false, the data contained in the object is not indexed and cannot be searched (the default is true).
- dynamic: This allows Elasticsearch to add new field names to the object using reflection on the values of the inserted data. If it's set to false, when you try to index an object containing a new field, the field is rejected silently. If it's set to strict, when a new field is present in the object, an error is raised and the indexing process is skipped. The dynamic parameter allows you to be safe about making changes to the document's structure (the default is true).

The most used attribute is properties, which allows you to map the fields of the object to Elasticsearch fields.

Disabling the indexing part of the document reduces the index size; however, the data cannot be searched. In other words, you end up with a smaller file on disk, but at a cost in terms of functionality.
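The three dynamic modes can be summarized in a small sketch (the exception name mirrors the strict_dynamic_mapping_exception error that Elasticsearch raises; the helper itself is hypothetical):

```python
def handle_new_field(dynamic, field_name):
    """Model what happens to a field that is not in the mapping, for each
    value of the dynamic object attribute."""
    if dynamic is True:
        return "mapped"    # the field is added to the mapping by inference
    if dynamic is False:
        return "ignored"   # the field is silently not indexed; doc accepted
    if dynamic == "strict":
        raise ValueError(
            f"strict_dynamic_mapping_exception: field [{field_name}] not allowed"
        )
    raise ValueError(f"unknown dynamic setting: {dynamic!r}")
```

strict is the safest choice for production mappings, because schema drift fails loudly instead of silently losing searchable data.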
Some special objects are described in the following recipes:
The document mapping is also referred to as the root object. This has special parameters that control its behavior, and they are mainly used internally to do special processing, such as routing or time-to-live of documents.
In this recipe, we'll look at these special fields and learn how to use them.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We can extend the preceding order example by adding some of the special fields, like so:
PUT test/_mapping
{
  "_source": { "store": true },
  "_routing": { "required": true },
  "_index": { "enabled": true },
  "properties": {}
}
Every special field has parameters and value options, such as the following:
- _id: This allows you to index only the ID part of the document. All ID queries will speed up using the ID value (by default, this is not indexed and not stored).
- _index: This controls whether or not the index name must be stored as part of the document. It can be enabled by setting the "enabled": true parameter (enabled=false is the default).
- _source: This controls how the document's source is stored. Storing the source is very useful, but it's a storage overhead, so if it is not required, it's better to turn it off (enabled=true is the default).
- _routing: This defines the shard that will store the document. It supports additional parameters, such as required (true/false). This is used to force the presence of the routing value, raising an exception if it's not provided.

Controlling how to index and process a document is very important and allows you to resolve issues related to complex data types.
Every special field has parameters to set particular configurations, and some of their behaviors could change in different releases of Elasticsearch.
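The effect of _routing can be sketched as shard selection. Elasticsearch hashes the routing value (it uses murmur3) modulo the number of primary shards; the sketch below substitutes crc32 just to show the idea:

```python
import zlib

def shard_for(routing_value, num_primary_shards=5):
    """Pick the primary shard for a routing value.

    Elasticsearch actually uses a murmur3 hash; crc32 stands in here only
    to illustrate "hash(routing) % number_of_primary_shards"."""
    return zlib.crc32(routing_value.encode("utf-8")) % num_primary_shards

shard = shard_for("order-15")
```

The key property is determinism: the same routing value always lands on the same shard, which is what lets related documents be co-located.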
Please refer to the Using dynamic templates in document mapping recipe in this chapter and the Putting a mapping in an index recipe of Chapter 3, Basic Operations, to learn more.
In the Using explicit mapping creation recipe, we saw how Elasticsearch can guess the field type using reflection. In this recipe, we'll see how we can help it improve its guessing capabilities via dynamic templates.
The dynamic template feature is very useful. For example, it may be useful in situations where you need to create several indices with similar types because it allows you to move the need to define mappings from coded initial routines to automatic index-document creation. Typical usage is to define types for Logstash log indices.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We can extend the previous mapping by adding document-related settings, as follows:
PUT test/_mapping
{
  "dynamic_date_formats": ["yyyy-MM-dd", "dd-MM-yyyy"],
  "date_detection": true,
  "numeric_detection": true,
  "dynamic_templates": [
    {
      "template1": {
        "match": "*",
        "match_mapping_type": "long",
        "mapping": {"type": "{dynamic_type}", "store": true}
      }
    }
  ],
  "properties": {...}
}
The root object (document) controls the behavior of its fields and all its children object fields. In document mapping, we can define the following:
- date_detection: This allows you to extract a date from a string (true is the default).
- dynamic_date_formats: This is a list of valid date formats. This is used if date_detection is active.
- numeric_detection: This enables you to convert strings into numbers, if possible (false is the default).
- dynamic_templates: This is a list of templates that are used to change the dynamic mapping inference. If one of these templates is matched, the rules defined in it are used to build the final mapping.

A dynamic template is composed of two parts: the matcher and the mapping.
To match a field to activate the template, you can use several types of matchers, such as the following:
- match: This allows you to define a match on the field name. The expression is a standard GLOB pattern (http://en.wikipedia.org/wiki/Glob_(programming)).
- unmatch: This allows you to define the expression to be used to exclude matches (optional).
- match_mapping_type: This controls the types of the matched fields; for example, string, integer, and so on (optional).
- path_match: This allows you to match the dynamic template against the full dot notation of the field; for example, obj1.*.value (optional).
- path_unmatch: This does the opposite of path_match, excluding the matched fields (optional).
- match_pattern: This allows you to switch the matcher to regex (regular expressions); otherwise, the glob pattern match is used (optional).

The dynamic template mapping part is a standard mapping but can use special placeholders, such as the following:
- {name}: This will be replaced with the actual dynamic field name.
- {dynamic_type}: This will be replaced with the type of the matched field.

The order of the dynamic templates is very important; only the first one that matches is executed. It is good practice to order the ones with stricter rules first, followed by the others.
Dynamic templates are very handy when you need to set a mapping configuration to all the fields. This can be done by adding a dynamic template, similar to this one:
"dynamic_templates": [
  {
    "store_generic": {
      "match": "*",
      "mapping": { "store": true }
    }
  }
]
In this example, all the new fields added by dynamic mapping will be stored.
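First-match template resolution can be sketched with glob matching. The helper below is illustrative only: it models just the match, unmatch, and match_mapping_type matchers plus the {dynamic_type} placeholder:

```python
from fnmatch import fnmatch

def resolve_template(templates, field_name, field_type):
    """Return (template_name, mapping) for the first matching dynamic
    template, or None if no template matches."""
    for entry in templates:
        (name, tpl), = entry.items()
        if not fnmatch(field_name, tpl.get("match", "*")):
            continue
        if "unmatch" in tpl and fnmatch(field_name, tpl["unmatch"]):
            continue
        # An absent match_mapping_type matches any detected type.
        if tpl.get("match_mapping_type", field_type) != field_type:
            continue
        mapping = dict(tpl["mapping"])
        if mapping.get("type") == "{dynamic_type}":
            mapping["type"] = field_type      # substitute the placeholder
        return name, mapping
    return None

templates = [
    {"template1": {
        "match": "*",
        "match_mapping_type": "long",
        "mapping": {"type": "{dynamic_type}", "store": True},
    }}
]
result = resolve_template(templates, "counter", "long")
```

Because the loop returns on the first hit, ordering templates from strictest to loosest matters here exactly as it does in Elasticsearch.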
There is a special type of embedded object called a nested object. This resolves a problem related to Lucene's indexing architecture, in which all the fields of embedded objects are viewed as a single object (technically speaking, they are flattened). During the search, in Lucene, it is not possible to distinguish between values and different embedded objects in the same multi-valued array.
If we consider the previous order example, it's not possible to distinguish an item's name and its quantity with the same query since Lucene puts them in the same Lucene document object. We need to index them in different documents and then join them. This entire trip is managed by nested objects and nested queries.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
A nested object is defined as a standard object with the nested type.

Regarding the example in the Mapping an object recipe, we can change the type from object to nested, as follows:
PUT test/_mapping
{
  "properties" : {
    "id" : {"type" : "keyword"},
    "date" : {"type" : "date"},
    "customer_id" : {"type" : "keyword"},
    "sent" : {"type" : "boolean"},
    "item" : {
      "type" : "nested",
      "properties" : {
        "name" : {"type" : "keyword"},
        "quantity" : {"type" : "long"},
        "price" : {"type" : "double"},
        "vat" : {"type" : "double"}
      }
    }
  }
}
When a document is indexed, if an embedded object has been marked as nested, it's extracted from the original document, indexed as a new separate document, and saved in a special index position near the parent document.
In the preceding example, we reused the mapping from the Mapping an object recipe, but we changed the type of the item from object to nested. No other action must be taken to convert an embedded object into a nested one.
Nested objects are special Lucene documents that are saved in the same block of data as their parent; this approach allows for fast joining with the parent document.
Nested objects are not searchable with standard queries, only with nested ones. They are not shown in standard query results.
The lives of nested objects are related to their parents: deleting or updating a parent automatically deletes or updates all the nested children, because the parent and its nested documents are stored and reindexed together as a single block.
Sometimes, you must propagate information about the nested objects to their parent or root objects. This is mainly to build simpler queries about the parents (such as terms queries without using nested ones). To achieve this, two special properties of nested objects must be used:
- include_in_parent: This makes it possible to automatically add the nested fields to the immediate parent.
- include_in_root: This adds the nested object fields to the root object.

These settings add data redundancy, but they reduce the complexity of some queries, thus improving performance.
In the previous recipe, we saw how it's possible to manage relationships between objects with the nested object type. The disadvantage of nested objects is their dependence on their parents. If you need to change the value of a nested object, you need to reindex the parent (this causes a potential performance overhead if the nested objects change too quickly). To solve this problem, Elasticsearch allows you to define child documents.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
In the following example, we have two related objects: an Order and an Item.
Their UML representation is as follows:
The final mapping should merge the field definitions of both Order and Item, as well as use a special field (join_field, in this example) that holds the parent/child relationship.

To use join_field, follow these steps:
PUT test1/_mapping
{
  "properties": {
    "join_field": { "type": "join", "relations": { "order": "item" } },
    "id": { "type": "keyword" },
    "date": { "type": "date" },
    "customer_id": { "type": "keyword" },
    "sent": { "type": "boolean" },
    "name": { "type": "text" },
    "quantity": { "type": "integer" },
    "price": { "type": "double" },
    "vat": { "type": "double" }
  }
}
The preceding mapping is very similar to the one in the previous recipe.
PUT test1/_doc/1?refresh
{
  "id": "1",
  "date": "2018-11-16T20:07:45Z",
  "customer_id": "100",
  "sent": true,
  "join_field": "order"
}

PUT test1/_doc/c1?routing=1&refresh
{
  "name": "tshirt",
  "quantity": 10,
  "price": 4.3,
  "vat": 8.5,
  "join_field": {
    "name": "item",
    "parent": "1"
  }
}
The child item requires special management: we need to pass the routing query parameter with the parent's ID (1 in the preceding example). Furthermore, we need to specify the relation name and the parent's ID in the join_field object.
In the case of multiple object relationships in the same index, the mapping must be computed as the union of all the related objects' fields.
The relationship between objects must be defined in join_field
.
There must only be a single join_field
for mapping; if you need to provide a lot of relationships, you can provide them in the relations
object.
The child document must be indexed in the same shard as the parent; so, when indexed, an extra parameter must be passed, which is routing
(we'll learn how to do this in the Indexing a document recipe in Chapter 3, Basic Operations).
A child document doesn't need to reindex the parent document when we want to change its values. Consequently, it's fast in terms of indexing, reindexing (updating), and deleting.
In Elasticsearch, we have different ways to manage relationships between objects, as follows:

- type=object: This is implicitly managed by Elasticsearch, which considers the embedded object as part of the main document. It's fast, but you need to reindex the main document to change the value of the embedded object.
- type=nested: This allows you to accurately search and filter the parent by using nested queries on children. Everything works as it does for the embedded object, except for querying (you must use a nested query to search for them).
- Child documents: These are separate documents that use a join_field property to bind them to the parent. They must be indexed in the same shard as the parent. The join with the parent is a bit slower than the nested one. This is because the nested objects are in the same data block as the parent in the Lucene index and are loaded with the parent, whereas the child document requires more read operations.

Choosing how to model the relationship between objects depends on your application scenario.
Tip
There is also another approach that can be used, but on big data documents it performs poorly: decoupling the join relationship. You execute the join query in two steps: first, you collect the IDs of the children/other documents, and then you search for them in a field of their parent.
Please refer to the Using the has_child query, Using the top_children query, and Using the has_parent query recipes of Chapter 6, Relationships and Geo Queries, for more details on child/parent queries.
Often, a field must be processed with several core types or in different ways. For example, a string field must be processed as tokenized for search and not tokenized for sorting. To do this, we need to define the fields multifield special property.

The fields property is a very powerful feature of mappings because it allows you to use the same field in different ways.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
To define a multifield property, we need to define a dictionary containing the fields subfields. The subfield with the same name as the parent field is the default one.
If we consider the item from our order example, we can index the name like so:

"name": {
  "type": "keyword",
  "fields": {
    "name": {"type": "keyword"},
    "tk": {"type": "text"},
    "code": {"type": "text", "analyzer": "code_analyzer"}
  }
}
If we already have a mapping stored in Elasticsearch and we want to migrate the fields into a multifield property, it's enough to save a new mapping containing the additional subfields, and Elasticsearch will merge it automatically. New subfields can be added to the fields property without problems at any moment, but the new subfields will only be available when searching/aggregating newly indexed documents.
When you add a new subfield to already indexed data, you need to reindex your record to ensure you have it correctly indexed for all your records.
During indexing, when Elasticsearch processes a fields property of the multifield type, it reprocesses the same field for every subfield defined in the mapping.
To access the subfields of a multifield, we build a new path composed of the base field plus the subfield's name. In the preceding example, we have the following:

- name: This points to the default multifield subfield (the keyword one).
- name.tk: This points to the standard analyzed (tokenized) text field.
- name.code: This points to a field that was analyzed with a code extractor analyzer.

As you may have noticed in the preceding example, we changed the analyzer to introduce a code extractor analyzer that allows you to extract the item code from a string.
By using the multifield, if we index a string such as Good Item to buy - ABC1234, we'll have the following:

- name = "Good Item to buy - ABC1234" (useful for sorting)
- name.tk = ["good", "item", "to", "buy", "abc1234"] (useful for searching)
- name.code = ["ABC1234"] (useful for searching and aggregations)

In the case of the code analyzer, if the code is not found in the string, no tokens are generated. This makes it possible to develop solutions that carry out information retrieval tasks at index time and use these at search time.
The fields property is very useful in data processing because it allows you to define several ways to process the same field data.
For example, if we are working on documental content (such as articles, word documents, and so on), we can define fields as subfield analyzers to extract names, places, date/time, geolocation, and so on.
The subfields of a multifield are standard core type fields – we can perform every process we want on them, such as search, filter, aggregation, and scripting.
To find out more about what Elasticsearch analyzers you can use, please refer to the Specifying different analyzers recipe.
Elasticsearch natively supports the use of geolocation types – special types that allow you to localize your document in geographic coordinates (latitude and longitude) around the world.
Two main types are used in the geographic world: the point and the shape. In this recipe, we'll look at GeoPoint – the base element of geolocation.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
The type of the field must be set to geo_point to define a GeoPoint. We can extend the order example by adding a new field that stores the location of a customer, as follows:

PUT test/_mapping
{
  "properties": {
    "id": {"type": "keyword"},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "customer_ip": {"type": "ip"},
    "customer_location": {"type": "geo_point"},
    "sent": {"type": "boolean"}
  }
}
When Elasticsearch indexes a document with a GeoPoint field (lat_lon), it processes the latitude and longitude coordinates and creates special accessory field data to provide faster query capabilities on these coordinates: a special data structure is created to manage latitude and longitude internally.

Depending on the properties, given the latitude and longitude, it's possible to compute the geohash value (for details, I suggest reading https://www.pubnub.com/learn/glossary/what-is-geohashing/). The indexing process also optimizes these values for special computations, such as distance, ranges, and shape matching.
GeoPoint has special parameters that allow you to store additional geographic data:

lat_lon (the default is false): This allows you to store the latitude and longitude as the .lat and .lon fields. Storing these values improves the performance of many memory algorithms that are used in distance and shape calculus. It makes sense to set lat_lon to true so that you store them if there is a single point value for a field. This speeds up searches and reduces memory usage during computation.
geohash (the default is false): This allows you to store the computed geohash value.
geohash_precision (the default is 12): This defines the precision to be used in geohash calculus.

For example, given a geo point value, [45.61752, 9.08363], it can be stored using one of the following syntaxes:

customer_location = [45.61752, 9.08363]
customer_location.lat = 45.61752
customer_location.lon = 9.08363
customer_location.geohash = u0n7w8qmrfj
GeoPoint is a special type and can accept several formats as input:

lat and lon as properties, as shown here: { "customer_location": { "lat": 45.61752, "lon": 9.08363 } }
lat and lon as a string, as follows: "customer_location": "45.61752,9.08363"
A geohash as a string, as shown here: "customer_location": "u0n7w8qmrfj"
A GeoJSON array (note that here, lat and lon are reversed), as shown in the following code snippet: "customer_location": [9.08363, 45.61752]
An extension of the concept of a point is its shape. Elasticsearch provides the GeoShape type, which allows you to manage arbitrary polygons.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To be able to use advanced shape management, Elasticsearch requires two JAR libraries in its classpath (usually the lib directory): Spatial4J and JTS.
To map a geo_shape type, a user must explicitly provide some parameters:

tree (the default is geohash): This is the name of the PrefixTree implementation – geohash for GeohashPrefixTree and quadtree for QuadPrefixTree.
precision: This is used instead of tree_levels to provide a more human-readable value for the tree level. The precision number can be followed by a unit; that is, 10m, 10km, 10 miles, and so on.
tree_levels: This is the maximum number of layers to be used in the prefix tree.
distance_error_pct: This sets the maximum error that is allowed in a prefix tree (0.025% by default, with a maximum of 0.5%).

The customer_location mapping, which we saw in the previous recipe, will be as follows when using geo_shape:
"customer_location": { "type": "geo_shape", "tree": "quadtree", "precision": "1m" },
When a shape is indexed or searched internally, a path tree is created and used.
A path tree is a list of terms that contain geographic information and are computed to improve performance in evaluating geo calculus.
The path tree also depends on the shape's type: point, linestring, polygon, multipoint, or multipolygon.
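As a hedged sketch of putting this to work (the index name and coordinates are illustrative assumptions), we can index a polygon and then match a point against it with a geo_shape query:

```
PUT test-geoshape
{
  "mappings": {
    "properties": {
      "customer_location": { "type": "geo_shape" }
    }
  }
}

PUT test-geoshape/_doc/1
{
  "customer_location": {
    "type": "polygon",
    "coordinates": [
      [ [9.08, 45.61], [9.09, 45.61], [9.09, 45.62], [9.08, 45.62], [9.08, 45.61] ]
    ]
  }
}

GET test-geoshape/_search
{
  "query": {
    "geo_shape": {
      "customer_location": {
        "shape": {
          "type": "point",
          "coordinates": [9.085, 45.615]
        },
        "relation": "intersects"
      }
    }
  }
}
```

Note that the shapes follow the GeoJSON convention, so the coordinates are given as [lon, lat] pairs.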
To understand the logic behind GeoShape, some good resources are the Elasticsearch reference page on GeoShape and the sites of the libraries that are used for geographic calculus (https://github.com/spatial4j/spatial4j and http://central.maven.org/maven2/com/vividsolutions/jts/1.13/, respectively).
Elasticsearch is used in a lot of systems to collect and search logs, such as Kibana (https://www.elastic.co/products/kibana) and LogStash (https://www.elastic.co/products/logstash). To improve search when using IP addresses, Elasticsearch provides the IPv4 and IPv6 types, which can be used to store IP addresses in an optimized way.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
You need to define the type of the field that contains an IP address as ip.
Regarding the preceding order example, we can extend it by adding the customer IP, like so:
"customer_ip": { "type": "ip" }
The IP must be in the standard dotted notation form, as follows:
"customer_ip":"19.18.200.201"
When Elasticsearch processes a document, if a field is an IP field, it tries to convert its value into a numerical form and generates tokens for fast value searching.
The IP field has special properties:

index (the default is true): This defines whether the field must be indexed. If not, false must be used.
doc_values (the default is true): This defines whether the field values should be stored in a column-stride fashion to speed up sorting and aggregations.

The other properties (store, boost, null_value, and include_in_all) work as they do for the other base types.
The advantage of using IP fields over strings is greater speed in every range query and filter, as well as lower resource usage (disk and memory).
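Since ip fields understand network masks, a term query with CIDR notation can match a whole subnet. Here is a sketch, assuming the customer_ip mapping shown previously:

```
GET test/_search
{
  "query": {
    "term": {
      "customer_ip": "19.18.200.0/24"
    }
  }
}
```

This query matches every document whose customer_ip falls in the 19.18.200.0/24 network, including the 19.18.200.201 example above.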
It is very common to have a lot of different types in several indices. Because Elasticsearch makes it possible to search across many indices at the same time, you often need to filter on common fields. In the real world, these fields are not always named in the same way in all mappings (generally because they are derived from different entities); it's very common to have a mix of the added_date, timestamp, @timestamp, and date_add fields, all of which refer to the same date concept.
The alias field type allows you to define an alias name that is resolved at query time, simplifying the call for all the fields with the same meaning.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
If we take the order example that we saw in the previous recipes, we can add a cost alias for the price value in the item subfield.

This can be achieved by following these steps:

First, we define the mapping, including the cost alias field:

PUT test/_mapping
{
  "properties": {
    "id": {"type": "keyword"},
    "date": {"type": "date"},
    "customer_id": {"type": "keyword"},
    "sent": {"type": "boolean"},
    "item": {
      "type": "object",
      "properties": {
        "name": {"type": "keyword"},
        "quantity": {"type": "long"},
        "price": {"type": "double"},
        "vat": {"type": "double"},
        "cost": {"type": "alias", "path": "item.price"}
      }
    }
  }
}
Then, we store a document:

PUT test/_doc/1?refresh
{
  "id": "1",
  "date": "2018-11-16T20:07:45Z",
  "customer_id": "100",
  "sent": true,
  "item": [
    {
      "name": "tshirt",
      "quantity": 10,
      "price": 4.3,
      "vat": 8.5
    }
  ]
}
Now, we can search using the cost alias, like so:

GET test/_search
{
  "query": {
    "term": {
      "item.cost": 4.3
    }
  }
}
The result will be the saved document.
The alias is a convenient way to use the same name for your search field without the need to change the data structure of your fields. An alias doesn't change a document's structure, thus allowing more flexibility in your data models. The alias is resolved when the search indices in the query are expanded, and there is no performance penalty for its usage.
If you try to index a document with a value in an alias
field, an exception will be thrown.
The path
value of the alias
field must contain the full resolution of the target field, which must be concrete and must be known when the alias is defined.
In the case of an alias in a nested object, it must be in the same nested scope as the target.
The Percolator is a special type of field that makes it possible to store an Elasticsearch query inside the field and use it in a percolate query.
The Percolator can be used to detect all the queries that match a document.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
To map a percolator field, follow these steps:

First, we need a mapping containing a percolator field, plus the fields that are used in the stored queries – in our case, a body field. We can define the mapping like so:

PUT test-percolator
{
  "mappings": {
    "properties": {
      "query": { "type": "percolator" },
      "body": { "type": "text" }
    }
  }
}

Then, we can store a document with a percolator query inside it, as follows:

PUT test-percolator/_doc/1?refresh
{
  "query": {
    "match": {
      "body": "quick brown fox"
    }
  }
}
Now, we can percolate a document against the stored queries, like so:

GET test-percolator/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "body": "fox jumps over the lazy dog"
      }
    }
  }
}
{
  ...truncated...
  "hits" : [
    {
      "_index" : "test-percolator",
      "_id" : "1",
      "_score" : 0.13076457,
      "_source" : {
        "query" : {
          "match" : {
            "body" : "quick brown fox"
          }
        }
      },
      "fields" : {
        "_percolator_document_slot" : [0]
      }
    }
  ]
  ...truncated...
}
The percolator field stores an Elasticsearch query inside it.
Because all the Percolator queries are cached and always active for performance reasons, all the fields that are required by the queries must be defined in the mapping of the document. Since the queries of all the Percolator documents are executed against every percolated document, for the best performance, the queries inside the Percolator must be optimized so that they execute quickly.
It's common to want to score a document dynamically, depending on the context. For example, if you need to give a higher score to documents inside a certain category, the classic scenario is to boost documents based on a value, such as page rank, hits, or categories.
Elasticsearch provides two new ways to boost your scores based on values. One is the Rank Feature field, while the other is its extension, which is to use a vector of values.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We want to use the rank_feature type to implement a common PageRank scenario where documents are scored based on some of their characteristics. To achieve this, follow these steps:

To score by a pagerank value and an inverse url length, we can use the following mapping:

PUT test-rank
{
  "mappings": {
    "properties": {
      "pagerank": { "type": "rank_feature" },
      "url_length": {
        "type": "rank_feature",
        "positive_score_impact": false
      }
    }
  }
}

Then, we can store a document:

PUT test-rank/_doc/1
{
  "pagerank": 5,
  "url_length": 20
}

Now, we can use the pagerank value to return our record with a similar query, like so:

GET test-rank/_search
{
  "query": {
    "rank_feature": {
      "field": "pagerank"
    }
  }
}
Important Note
To query the special rank_feature/rank_features types, we need to use the special rank_feature query type, which is only used for this special case.
The evolution of the previous functionality is to define a vector of values using the rank_features type; usually, it can be used to score by topics, categories, or similar discerning facets. We can implement this functionality by following these steps:

First, we define a mapping with a rank_features categories field:

PUT test-ranks
{
  "mappings": {
    "properties": {
      "categories": { "type": "rank_features" }
    }
  }
}

Then, we store some documents:

PUT test-ranks/_doc/1
{ "categories": { "sport": 14.2, "economic": 24.3 } }

PUT test-ranks/_doc/2
{ "categories": { "sport": 19.2, "economic": 23.1 } }

Now, we can query by one of the categories, like so:

GET test-ranks/_search
{
  "query": {
    "rank_feature": {
      "field": "categories.sport"
    }
  }
}
rank_feature and rank_features are special type fields that are used for storing values and are mainly used to score the results.
Important Note
The values that are stored in these fields can only be queried using the rank_feature query. They cannot be used in standard queries and aggregations.
The numeric values in rank_feature and rank_features can only be single positive values (multi-values are not allowed). In the case of rank_features, the value must be a hash composed of string keys and positive numeric values.
There is a flag that changes the scoring behavior – positive_score_impact. This value is true by default, but if you want the value of the feature to decrease the score, you can set it to false. In the pagerank example, the length of the url reduces the score of the document because the longer the url is, the less relevant it becomes.
One of the most common scenarios is to provide Search as you type functionality, which is typical of the Google search engine. This capability is common in many use cases, and the search_as_you_type type provides facilities to achieve this functionality.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion for Elasticsearch.
We want to use the search_as_you_type type to implement a completer (a widget that completes names/values) for the titles of our media film streaming platform. To achieve this, follow these steps:

To provide "search as you type" on a title field, we will use the following mapping:

PUT test-sayt
{
  "mappings": {
    "properties": {
      "title": { "type": "search_as_you_type" }
    }
  }
}

Then, we store some documents:

PUT test-sayt/_doc/1
{ "title": "Ice Age" }

PUT test-sayt/_doc/2
{ "title": "The Polar Express" }

PUT test-sayt/_doc/3
{ "title": "The Godfather" }

Now, we can query the title value to return our records:

GET test-sayt/_search
{
  "query": {
    "multi_match": {
      "query": "the p",
      "type": "bool_prefix",
      "fields": [ "title", "title._2gram", "title._3gram" ]
    }
  }
}
The result will be something similar to the following:
{
  …truncated…
  "hits" : [
    {
      "_index" : "test-sayt",
      "_id" : "2",
      "_score" : 2.4208174,
      "_source" : {
        "title" : "The Polar Express"
      }
    },
  …truncated…
}
As you can see, the more relevant results (those that match more of the search input) score better!
Due to the high demand for the Search as you type feature, this special mapping type was created.
This special mapping type is a helper that simplifies the process of creating a field with multiple subfields that can map the indexing requirements and provide an efficient Search as you type capability.
For example, for my title field, the following field and subfields are created:

title: This contains the text to be used. It's processed as a standard text field and accepts the standard text parameters, as we saw regarding the text field in the Mapping base types recipe of this chapter.
title._2gram: This contains the text with the shingle token filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-shingle-tokenfilter.html) applied with a size of 2. This aggregates two contiguous terms.
title._3gram: This is the same as title._2gram but uses a size of 3 to aggregate three contiguous terms.
title._index_prefix: This wraps the maximum-size gram (_3gram, in our case) with an Edge N-Gram Token Filter (https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenfilter.html) to be able to provide initial completion.

The search_as_you_type field can be customized using the max_shingle_size parameter (the default is 3). This parameter allows you to define the maximum size of the gram to be created.
The number of n-gram subfields is given by the max_shingle_size - 1 value; usually, the best values are 3 or 4. Larger values only increase the size of the index and don't generally provide query quality benefits.
Please refer to the Using a match query recipe in Chapter 5, Text and Numeric Queries, to learn more about match queries.
Sometimes, we have values that represent a continuous range between an upper and a lower bound, such as a price range or a time interval. For most queries, pointing to a value in the middle of such a range is not easy in Elasticsearch; in the worst case, you would convert the continuous values into discrete ones by extracting all the values at a prefixed interval. This kind of solution largely increases the size of the index and reduces query performance.

Range mappings were created to provide continuous value support in Elasticsearch. For this reason, when it is not possible to store an exact value, but we have a range, we need to use range types.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We want to use range types to implement stock market values that are defined by low and high price values and the timeframe of the transaction. To achieve this, follow these steps:

First, we define the mapping:

PUT test-range
{
  "mappings": {
    "properties": {
      "price": { "type": "float_range" },
      "timeframe": { "type": "date_range" }
    }
  }
}

Then, we store some documents:

PUT test-range/_bulk
{"index":{"_index":"test-range","_id":"1"}}
{"price":{"gte":1.5,"lt":3.2},"timeframe":{"gte":"2022-01-01T12:00:00","lt":"2022-01-01T12:00:01"}}
{"index":{"_index":"test-range","_id":"2"}}
{"price":{"gte":1.7,"lt":3.7},"timeframe":{"gte":"2022-01-01T12:00:01","lt":"2022-01-01T12:00:02"}}
{"index":{"_index":"test-range","_id":"3"}}
{"price":{"gte":1.3,"lt":3.3},"timeframe":{"gte":"2022-01-01T12:00:02","lt":"2022-01-01T12:00:03"}}

Now, we can query for values that fall inside the stored ranges:

GET test-range/_search
{
  "query": {
    "bool": {
      "filter": [
        { "term": { "price": { "value": 2.4 } } },
        { "term": { "timeframe": { "value": "2022-01-01T12:00:02" } } }
      ]
    }
  }
}
The result will be something similar to the following:
{
  …truncated…
  "hits" : [
    {
      "_index" : "test-range",
      "_id" : "3",
      "_score" : 0.0,
      "_source" : {
        "price" : {
          "gte" : 1.3,
          "lt" : 3.3
        },
        "timeframe" : {
          "gte" : "2022-01-01T12:00:02",
          "lt" : "2022-01-01T12:00:03"
        }
  …truncated…
}
Not all base types support ranges. The possible range types that are supported by Elasticsearch are as follows:

integer_range: This is used to store signed 32-bit integer values.
float_range: This is used to store signed 32-bit floating-point values.
long_range: This is used to store signed 64-bit integer values.
double_range: This is used to store signed 64-bit floating-point values.
date_range: This is used to store date values as 64-bit integers.
ip_range: This is used to store IPv4 and IPv6 values.

These range types are very useful for all cases where the values are not exact.

When you're storing a document in Elasticsearch, the range field can be composed using the following parameters:

gt or gte for the lower bound of the range
lt or lte for the upper bound of the range

Note
Range types can be used for querying values, but they have limited support for aggregation: they only support histogram and cardinality aggregations.
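Beyond term queries, range fields also support range queries with a relation parameter that controls how the stored range must relate to the queried one (intersects is the default; within and contains are also available). Here is a sketch against the test-range index:

```
GET test-range/_search
{
  "query": {
    "range": {
      "price": {
        "gte": 1.0,
        "lte": 2.0,
        "relation": "intersects"
      }
    }
  }
}
```

With intersects, any stored price range overlapping [1.0, 2.0] matches; with within, only stored ranges entirely contained in [1.0, 2.0] would match.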
In many applications, it is possible to define custom metadata or configuration composed of key-value pairs. This use case is not optimal for Elasticsearch: creating a new mapping entry for every key is hard to manage, as they evolve into very large mappings.

X-Pack provides a type (free to use) to solve this problem: the flattened field type. As the name suggests, it takes all the key-value pairs (including nested ones) and indexes them in a flat way, thus solving the problem of mapping explosion.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We want to use Elasticsearch to store configurations with a varying number of fields. To achieve this, follow these steps:

To store configurations in a flattened field, we will use the following mapping:

PUT test-flattened
{
  "mappings": {
    "properties": {
      "name": { "type": "keyword" },
      "configs": { "type": "flattened" }
    }
  }
}

Then, we store some documents:

PUT test-flattened/_bulk
{"index":{"_index":"test-flattened","_id":"1"}}
{"name":"config1","configs":{"key1":"value1","key3":"2022-01-01T12:00:01"}}
{"index":{"_index":"test-flattened","_id":"2"}}
{"name":"config2","configs":{"key1":true,"key2":30}}
{"index":{"_index":"test-flattened","_id":"3"}}
{"name":"config3","configs":{"key4":"test","key2":30.3}}

Now, we can search across all the keys of the configs object:

POST test-flattened/_search
{
  "query": {
    "term": { "configs": "test" }
  }
}

Alternatively, we can search for a particular key in the configs object, like so:

POST test-flattened/_search
{
  "query": {
    "term": { "configs.key4": "test" }
  }
}
The result for both queries will be as follows:
{
  …truncated…
  "hits" : [
    {
      "_index" : "test-flattened",
      "_id" : "3",
      "_score" : 1.2330425,
      "_source" : {
        "name" : "config3",
        "configs" : {
          "key4" : "test",
          "key2" : 30.3
        }
  …truncated…
}
This special field type takes a JSON object that's passed in a document and flattens it into key/value pairs that can be searched without defining a mapping for the fields in the JSON content. This helps in cases where the mapping could explode due to the JSON containing a large number of different fields.
During the indexing process, tokens are created for each leaf value of the JSON object using a keyword analyzer. Due to this, numbers, dates, IPs, and other formats are converted into text, and the only queries that can be executed are the ones that are supported by keyword tokenization. These include term, terms, terms_set, prefix, range (based on text), match, multi_match, query_string, simple_query_string, and exists.
See Chapter 5, Text and Numeric Queries, for more references on the cited query types.
Elasticsearch's geoprocessing power is used to provide capabilities to a large number of applications. However, it has one limitation: it only works with world coordinates. Using the Point and Shape types, X-Pack extends the geo capabilities to every two-dimensional planar coordinate system.
Common scenarios for this use case include mapping and documenting building coordinates and checking if documents are inside a shape.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We want to use Elasticsearch to map a device's coordinates in our shop. To achieve this, follow these steps:

First, we define the mapping:

PUT test-point
{
  "mappings": {
    "properties": {
      "device": { "type": "keyword" },
      "location": { "type": "point" }
    }
  }
}

Then, we store the device positions:

PUT test-point/_bulk
{"index":{"_index":"test-point","_id":"1"}}
{"device":"device1","location":{"x":10,"y":10}}
{"index":{"_index":"test-point","_id":"2"}}
{"device":"device2","location":{"x":10,"y":15}}
{"index":{"_index":"test-point","_id":"3"}}
{"device":"device3","location":{"x":15,"y":10}}
At this point, we want to create shapes in our shop so that we can divide it into parts and check if the people/devices are inside the defined shape. To do this, follow these steps:
PUT test-shape
{
  "mappings": {
    "properties": {
      "room": { "type": "keyword" },
      "geometry": { "type": "shape" }
    }
  }
}

POST test-shape/_doc/1
{
  "room": "hall",
  "geometry": {
    "type": "polygon",
    "coordinates": [
      [ [8.0, 8.0], [8.0, 12.0], [12.0, 12.0], [12.0, 8.0], [8.0, 8.0] ]
    ]
  }
}

POST test-point/_search
{
  "query": {
    "shape": {
      "location": {
        "indexed_shape": {
          "index": "test-shape",
          "id": "1",
          "path": "geometry"
        }
      }
    }
  }
}
The result of the query will be as follows:

{
  …truncated…
  "hits" : [
    {
      "_index" : "test-point",
      "_id" : "1",
      "_score" : 0.0,
      "_source" : {
        "device" : "device1",
        "location" : {
          "x" : 10,
          "y" : 10
        }
  …truncated…
}
The point and shape types are used to manage every type of two-dimensional planar coordinate system inside documents. Their usage is similar to geo_point and geo_shape.
The advantage of storing shapes in Elasticsearch is that you can simplify how you match constraints between coordinates and shapes. This was shown in our query example, where we loaded the shape's geometry from the test-shape index and searched the test-point index.
Managing coordinate systems and shapes is a very large topic that requires knowledge of shape types and geo models since they are strongly bound to data models.
Elasticsearch is often used to store machine learning data for training algorithms. X-Pack provides the Dense Vector field to store vectors that have up to 2,048 dimension values.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We want to use Elasticsearch to store a vector of values for our machine learning models. To achieve this, follow these steps:

First, we define the mapping with the vector dimension:

PUT test-dvector
{
  "mappings": {
    "properties": {
      "vector": {
        "type": "dense_vector",
        "dims": 4
      },
      "model": { "type": "keyword" }
    }
  }
}

Then, we store a document:

POST test-dvector/_doc/1
{
  "model": "pipe_flood",
  "vector": [8.1, 8.3, 12.1, 7.32]
}
The Dense Vector field is a helper field for storing vectors in Elasticsearch.
The ingested data for the field must be a list of floating-point values whose dimension exactly matches the dims property of the mapping (4, in our example).
If the dimension of the vector field is incorrect, an exception is raised, and the document is not indexed.
For example, let's see what happens when we try to index a similar document with the wrong feature dimension:
POST test-dvector/_doc/1
{
  "model": "pipe_flood",
  "vector": [8.1, 8.3, 12.1]
}
We will see a similar exception that enforces the right dimension size. Here, the document will not be stored:
{ "error" : { "root_cause" : [ { "type" : "mapper_parsing_exception", "reason" : "failed to parse" } ], "type" : "mapper_parsing_exception", "reason" : "failed to parse", "caused_by" : { "type" : "illegal_argument_exception", "reason" : "Field [vector] of type [dense_vector] of doc [1] has number of dimensions [3] less than defined in the mapping [4]" } }, "status" : 400 }
Histograms are a common data type for analytics and machine learning analysis. We can store Histograms in the form of values and counts; they are not indexed, but they can be used in aggregations.
The histogram field type is a special mapping that's available in X-Pack that is commonly used to store the results of Histogram aggregations in Elasticsearch for further processing, such as comparing the aggregation results at different times.
You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
In this recipe, we will simulate a common use case of Histogram data stored in Elasticsearch: a Histogram that records the millimeters of rain per year for our advanced analytics solution. To achieve this, follow these steps:

First, we define the mapping:

PUT test-histo
{
  "mappings": {
    "properties": {
      "histogram": { "type": "histogram" },
      "model": { "type": "keyword" }
    }
  }
}

Then, we store a document:

POST test-histo/_doc/1
{
  "model": "show_level",
  "histogram": {
    "values": [2016, 2017, 2018, 2019, 2020, 2021],
    "counts": [283, 337, 323, 312, 236, 232]
  }
}
The histogram field type specializes in storing Histogram data. It must be provided as a JSON object composed of the values and counts fields, with the same cardinality of items. The only supported aggregations are the following ones. We will look at these in more detail in Chapter 7, Aggregations:

min, max, sum, value_count, and avg
The percentiles and percentile_ranks aggregations
The boxplot aggregation
The histogram aggregation

The data is not indexed, but you can still check for the existence of a document value in this field with the exists query.
Sometimes, when we are working with our mapping, we may need to store some additional data to be used for display purposes, ORM facilities, permissions, or simply to track them in the mapping.
Elasticsearch allows you to store every kind of JSON data you want in the mapping with the special _meta field.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
The _meta mapping field can be populated with any data we want in JSON format, like so:

{
  "_meta": {
    "attr1": ["value1", "value2"],
    "attr2": {
      "attr3": "value3"
    }
  }
}
When Elasticsearch processes a new mapping and finds a _meta field, it stores it as-is in the global mapping status and propagates the information to all the cluster nodes. The content of the _meta field is only checked to ensure that it's valid JSON; its content is not otherwise taken into consideration by Elasticsearch, so you can populate it with anything you need, as long as it's in JSON format.

_meta is only used for storage purposes; it's neither indexed nor searchable. It can be used to enrich your mapping with custom information that your applications can use.
It can be used for the following reasons:

Entity descriptions: {"name": "Address", "description": "This entity store address information"}
ORM mappings: {"class": "com.company.package.AwesomeClass", "properties" : { "address":{"class": "com.company.package.Address"}} }
Permissions: {"read":["user1", "user2"], "write":["user1"]}
Display information (for example, the icon filename, which is used to display the type): {"icon":"fa fa-alert" }
Template fragments: {"fragment":"<div><h1>$name</h1><p>$description</p></div>" }
In the previous recipes, we learned how to map different fields and objects in Elasticsearch, and we described how easy it is to change the standard analyzer with the analyzer and search_analyzer properties.
In this recipe, we will look at several analyzers and learn how to use them to improve indexing and searching quality.
You will need an up-and-running Elasticsearch installation, as we described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
Every core type field allows you to specify a custom analyzer for indexing and for searching as field parameters.
For example, if we want the name
field to use a standard analyzer for indexing and a simple analyzer for searching, the mapping will be as follows:
{ "name": { "type": "text", "analyzer": "standard", "search_analyzer": "simple" } }
The concept of the analyzer comes from Lucene (the core of Elasticsearch). An analyzer is a Lucene element that is composed of a tokenizer that splits text into tokens, as well as one or more token filters. These filters carry out token manipulation such as lowercasing, normalization, removing stop words, stemming, and so on.
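As a sketch of how a tokenizer and token filters combine, the following hypothetical index (the index and analyzer names are illustrative) defines a custom analyzer built from the standard tokenizer plus the lowercase and stop token filters:

```
PUT my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_lowercase_stop": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "stop"]
        }
      }
    }
  }
}
```

Once defined in the index settings, the analyzer can be referenced by name in any field mapping of that index, just like the built-in ones.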
During the indexing phase, when Elasticsearch processes a field that must be indexed, it chooses an analyzer: first, it checks whether one is defined in the field's analyzer
property; then it looks for a default analyzer defined at the index level; finally, it falls back to the standard analyzer.
Choosing the correct analyzer is essential to getting good results during the query phase.
Elasticsearch provides several analyzers in its standard installation. The following table shows the most common ones:
For special language purposes, Elasticsearch supports a set of analyzers aimed at analyzing text in a specific language, such as Arabic, Armenian, Basque, Brazilian, Bulgarian, Catalan, Chinese, CJK, Czech, Danish, Dutch, English, Finnish, French, Galician, German, Greek, Hindi, Hungarian, Indonesian, Italian, Norwegian, Persian, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish, and Thai.
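To quickly compare analyzers, you can use the _analyze API. As a sketch, the following request runs the english language analyzer, which removes stop words and stems the remaining tokens:

```
POST _analyze
{
  "analyzer": "english",
  "text": "The quick foxes jumped"
}
```

The response should contain stemmed tokens such as quick, fox, and jump, with the stop word The removed.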
Several Elasticsearch plugins extend the list of available analyzers. The most famous ones are as follows:
Real-world index mappings can be very complex, and parts of them can often be reused across different indices. To simplify this management, mappings can be divided into the following:
Using components is the most manageable way to scale on large index mappings because they can simplify large template management.
You will need an up-and-running Elasticsearch installation, as described in the Downloading and installing Elasticsearch recipe of Chapter 1, Getting Started.
To execute the commands in this recipe, you can use any HTTP client, such as curl (https://curl.haxx.se/), Postman (https://www.getpostman.com/), or similar. I suggest using the Kibana console, which provides code completion and better character escaping for Elasticsearch.
We want to build an index mapping composed of two reusable components. To achieve this, follow these steps:
PUT _component_template/timestamp-management
{
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }
      }
    }
  }
}

PUT _component_template/order-data
{
  "template": {
    "mappings": {
      "properties": {
        "id": { "type": "keyword" },
        "date": { "type": "date" },
        "customer_id": { "type": "keyword" },
        "sent": { "type": "boolean" }
      }
    }
  }
}

PUT _component_template/items-data
{
  "template": {
    "mappings": {
      "properties": {
        "item": {
          "type": "object",
          "properties": {
            "name": { "type": "keyword" },
            "quantity": { "type": "long" },
            "cost": { "type": "alias", "path": "item.price" },
            "price": { "type": "double" },
            "vat": { "type": "double" }
          }
        }
      }
    }
  }
}
PUT _index_template/order
{
  "index_patterns": ["order*"],
  "template": {
    "settings": { "number_of_shards": 1 },
    "mappings": {
      "properties": {
        "id": { "type": "keyword" }
      }
    },
    "aliases": {
      "order": { }
    }
  },
  "priority": 200,
  "composed_of": ["timestamp-management", "order-data", "items-data"],
  "version": 1,
  "_meta": {
    "description": "My order index template"
  }
}
The process of using index components to build index templates is very simple: you can register as many components as you wish (Steps 1 and 2 in this recipe) and then aggregate them when you define the template (Step 3). With this approach, your template is divided into blocks, which makes the index template simpler to manage and easier to reuse.
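As a quick sketch of how the template is applied, creating an index whose name matches the order* pattern (the index name here is hypothetical) picks up all the merged components automatically:

```
PUT order-2023

GET order-2023/_mapping
```

The returned mapping should contain the @timestamp field from the timestamp-management component, the order fields and the item object from the other two components, plus the id field defined in the template itself.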
For simple use cases, using components to build an index template is too verbose. This approach shines when you need to manage different logs or documents in Elasticsearch that share common parts, because you can refactor those parts very quickly and reuse them.
Components are simple partial templates that are merged in an index template. Here, the parameters are as follows:
index_patterns: This is a list of index glob patterns. When an index is created, if its name matches one of the patterns, the template is applied.
aliases: This is an optional alias definition to be applied to the created index.
template: This is the template to be applied to the index.
priority: This is an optional priority for applying this template; the template with the highest priority wins. The built-in ELK component templates use a priority of 100, so set a higher value (200 in this example) if you want your custom template to take precedence over them.
version: This is an optional incremental number, managed by the user, to keep track of the updates that are made to the template.
_meta: This is an optional JSON object that contains metadata for the template.
composed_of: This is an optional list of component templates that are merged to build the final index mapping.
Note: This functionality is available from Elasticsearch version 7.8 and above.
See the Adding metadata to a mapping recipe in this chapter for more information about using the _meta field.