Indexing the Data

In this article by Rafał Kuć and Marek Rogoziński, authors of Elasticsearch Server Second Edition, we will learn about Elasticsearch indexing, how to configure our index structure mappings, and also see what field types we are allowed to use.

Elasticsearch indexing

We have our Elasticsearch cluster up and running, and we know how to use the Elasticsearch REST API to index our data, delete it, and retrieve it, as well as how to use search to get our documents. If you are used to SQL databases, you might know that before you can start putting data there, you need to create a structure that describes what your data looks like. Although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure, and thus defining it ourselves, is a better way. In the following few pages, you'll see how to create new indices (and how to delete them). Before we look closer at the available API methods, let's see what the indexing process looks like.

Shards and replicas

The Elasticsearch index is built of one or more shards and each of them contains part of your document set. Each of these shards can also have replicas, which are exact copies of the shard. During index creation, we can specify how many shards and replicas should be created. We can also omit this information and use the default values either defined in the global configuration file (elasticsearch.yml) or implemented in Elasticsearch internals. If we rely on Elasticsearch defaults, our index will end up with five shards and one replica. What does that mean? To put it simply, we will end up having 10 Lucene indices distributed among the cluster.

Are you wondering how we got 10 Lucene indices from five shards and one replica? The term "replica" can be misleading: one replica means that every shard has one copy, so we end up with five primary shards and five copies, which gives 10 Lucene indices in total.

Having a shard and its replica, in general, means that when we index a document, we will modify them both. That's because to have an exact copy of a shard, Elasticsearch needs to inform all the replicas about the change in shard contents. In the case of fetching a document, we can use either the shard or its copy. In a system with many physical nodes, we will be able to place the shards and their copies on different nodes and thus use more processing power (such as disk I/O or CPU). To sum up, the conclusions are as follows:

  • More shards allow us to spread indices to more servers, which means we can handle more documents without losing performance.
  • More shards mean that fewer resources are needed to fetch a particular document, because each shard holds fewer documents than it would in a deployment with fewer shards.
  • More shards mean more work when searching across the index, because results have to be merged from more shards and thus the aggregation phase of the query can be more resource intensive.
  • More replicas result in a fault-tolerant cluster, because when the original shard is not available, its copy will take over its role. With a single replica, the cluster can lose a shard without data loss. With two replicas, we can lose the primary shard and its single replica and everything will still work well.
  • More replicas mean higher query throughput, because a query can be executed on either a shard or any of its copies.

Of course, these are not the only relationships between the number of shards and replicas in Elasticsearch.

So, how many shards and replicas should we have for our indices? That depends. We believe that the defaults are quite good but nothing can replace a good test. Note that the number of replicas is less important because you can adjust it on a live cluster after index creation. You can remove and add them if you want and have the resources to run them. Unfortunately, this is not true when it comes to the number of shards. Once you have your index created, the only way to change the number of shards is to create another index and reindex your data.
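
For example, the number of replicas can be changed on a live index with the update settings API. The following command (a quick illustration, not part of the original excerpt) sets two replicas for the blog index:

curl -XPUT 'http://localhost:9200/blog/_settings' -d '{
  "index" : {
    "number_of_replicas" : 2
  }
}'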

Creating indices

When we created our first document in Elasticsearch, we didn't care about index creation at all. We just used the following command:

curl -XPUT http://localhost:9200/blog/article/1 -d '{
  "title": "New version of Elasticsearch released!",
  "content": "...",
  "tags": ["announce", "elasticsearch", "release"]
}'

This is fine. If such an index does not exist, Elasticsearch automatically creates the index for us. We can also create the index ourselves by running the following command:

curl -XPUT http://localhost:9200/blog/

We just told Elasticsearch that we want to create the index named blog. If everything goes right, you will see the following response from Elasticsearch:

{"acknowledged":true}

When is manual index creation necessary? There are many situations. One of them can be the inclusion of additional settings such as the index structure or the number of shards.

Altering automatic index creation

Sometimes, you can come to the conclusion that automatic index creation is a bad thing. When you have a big system with many processes sending data into Elasticsearch, a simple typo in the index name can destroy hours of script work. You can turn off automatic index creation by adding the following line in the elasticsearch.yml configuration file:

action.auto_create_index: false

Note that action.auto_create_index is more complex than it looks. The value can be set not only to false or true; we can also use index name patterns to specify whether an index with a given name can be created automatically if it doesn't exist. For example, the following definition allows automatic creation of indices with names beginning with a, but disallows the creation of indices starting with an. All other indices are disallowed and must be created manually (because of the -* pattern).

action.auto_create_index: -an*,+a*,-*

Note that the order of pattern definitions matters. Elasticsearch checks the patterns up to the first pattern that matches, so if you move -an* to the end, it won't be used because of +a*, which will be checked first.
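
To illustrate (using hypothetical index names), with the preceding setting in place, indexing a document into an index called apple would create that index automatically (it matches +a*), while indexing into an index called answers would fail, because -an* matches first:

curl -XPUT http://localhost:9200/apple/doc/1 -d '{"name" : "test"}'
curl -XPUT http://localhost:9200/answers/doc/1 -d '{"name" : "test"}'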

Settings for a newly created index

The manual creation of an index is also necessary when you want to set some configuration options, such as the number of shards and replicas. Let's look at the following example:

curl -XPUT http://localhost:9200/blog/ -d '{
  "settings" : {
    "number_of_shards" : 1,
    "number_of_replicas" : 2
  }
}'

The preceding command will result in the creation of the blog index with one shard and two replicas, so it makes a total of three physical Lucene indices. Also, there are other values that can be set in this way.
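
If you want to double-check what Elasticsearch actually stored for the index, you can read the settings back (a simple verification step, assuming the blog index created by the preceding command):

curl -XGET 'http://localhost:9200/blog/_settings?pretty'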

So, we already have our new, shiny index. But there is a problem; we forgot to provide the mappings, which are responsible for describing the index structure. What can we do? Since we have no data at all, we'll go for the simplest approach – we will just delete the index. To do that, we will run a command similar to the preceding one, but instead of using the PUT HTTP method, we use DELETE. So the actual command is as follows:

curl -XDELETE http://localhost:9200/blog

And the response will be the same as the one we saw earlier, as follows:

{"acknowledged":true}

Now that we know what an index is, how to create it, and how to delete it, we are ready to create indices with the mappings we have defined. This is a very important part, because the way data is indexed will affect the search process and the way in which documents are matched.

Mappings configuration

As we mentioned earlier, although Elasticsearch is a schema-less search engine and can figure out the data structure on the fly, we think that controlling the structure and defining it ourselves is a better way. In the following few pages, you'll see how to create mappings that suit your needs and match your data structure.

Type determining mechanism

Before we start describing how to create mappings manually, we want to mention one thing. Elasticsearch can guess the document structure by looking at the JSON that defines the document. In JSON, strings are surrounded by quotation marks, Booleans are defined using specific words, and numbers are just a few digits. This is a simple trick, but it usually works. For example, let's look at the following document:

{   "field1": 10, "field2": "10" }

The preceding document has two fields. The field1 field will be determined as a number (to be precise, as the long type), but field2 will be determined as a string because it is surrounded by quotation marks. Of course, this can be the desired behavior, but sometimes the data source may omit the information about the data type and everything may be present as strings. The solution is to enable more aggressive text checking in the mapping definition by setting the numeric_detection property to true. For example, we can execute the following command during the creation of the index:

curl -XPUT http://localhost:9200/blog/?pretty -d '{
  "mappings" : {
    "article": {
      "numeric_detection" : true
    }
  }
}'
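
With numeric_detection enabled, indexing our earlier example document will cause field2 to be mapped as a numeric type. You can verify this by indexing the document and reading the mappings back (a quick check, not part of the original text):

curl -XPUT http://localhost:9200/blog/article/1 -d '{"field1": 10, "field2": "10"}'
curl -XGET 'http://localhost:9200/blog/_mapping?pretty'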

Unfortunately, the problem still exists if we want the Boolean type to be guessed. There is no option to force the guessing of Boolean types from the text. In such cases, when a change of source format is impossible, we can only define the field directly in the mappings definition.

Another type that causes trouble is a date-based one. Elasticsearch tries to guess dates given as timestamps or strings that match the date format. We can define the list of recognized date formats using the dynamic_date_formats property, which allows us to specify the formats array. Let's look at the following command for creating the index and type:

curl -XPUT 'http://localhost:9200/blog/' -d '{
  "mappings" : {
    "article" : {
      "dynamic_date_formats" : ["yyyy-MM-dd hh:mm"]
    }
  }
}'

The preceding command will result in the creation of an index called blog with the single type called article. We've also used the dynamic_date_formats property with a single date format that will result in Elasticsearch using the date core type for fields matching the defined format. Elasticsearch uses the joda-time library to define date formats, so please visit http://joda-time.sourceforge.net/api-release/org/joda/time/format/DateTimeFormat.html if you are interested in finding out more about them.

Remember that the dynamic_date_formats property accepts an array of values. That means that we can handle several date formats simultaneously.
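
For example, a definition similar to the following (a sketch based on the preceding command) would let Elasticsearch recognize both date-with-time and date-only values:

"dynamic_date_formats" : ["yyyy-MM-dd hh:mm", "yyyy-MM-dd"]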

Disabling field type guessing

Let's think about the following case. First, we index an integer number. Elasticsearch will guess its type and will set it to integer or long. What will happen if we then index a document with a floating point number into the same field? Elasticsearch will just remove the decimal part of the number and store the rest. Another reason for turning guessing off is when we don't want new fields to be added to an existing index—fields that were not known during application development.

To turn off automatic field adding, we can set the dynamic property to false at the type level. For example, if we would like to turn off automatic field type guessing for the article type in the blog index, our command will look as follows:

curl -XPUT 'http://localhost:9200/blog/' -d '{
  "mappings" : {
    "article" : {
      "dynamic" : "false",
      "properties" : {
        "id" : { "type" : "string" },
        "content" : { "type" : "string" },
        "author" : { "type" : "string" }
      }
    }
  }
}'

After creating the blog index using the preceding command, any field that is not mentioned in the properties section will be ignored by Elasticsearch. So any field apart from id, content, and author will just be ignored. Of course, this is only true for the article type in the blog index.
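
For example, indexing the following document (the title field and its value are hypothetical) would store the id, content, and author fields, while the title field would be silently ignored:

curl -XPUT 'http://localhost:9200/blog/article/1' -d '{
  "id" : "1",
  "content" : "...",
  "author" : "John Doe",
  "title" : "This field is not in the mappings and will be ignored"
}'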

Index structure mapping

The schema mapping (or in short, mappings) is used to define the index structure. As you may recall, each index can have multiple types, but we will concentrate on a single type for now—just for simplicity. Let's assume that we want to create an index called posts that will hold data for blog posts. It could have the following structure:

  • Unique identifier
  • Name
  • Publication date
  • Contents

In Elasticsearch, mappings are sent as JSON objects in a file. So, let's create a mapping file that will match the aforementioned needs—we will call it posts.json. Its content is as follows:

{   "mappings": {     "post": {       "properties": {                        "id": {"type":"long", "store":"yes",         "precision_step":"0" },         "name": {"type":"string", "store":"yes",         "index":"analyzed" },         "published": {"type":"date", "store":"yes",         "precision_step":"0" },         "contents": {"type":"string", "store":"no",         "index":"analyzed" }                   }     }   } }

To create our posts index with the preceding file, run the following command (assuming that we stored the mappings in the posts.json file):

curl -XPOST 'http://localhost:9200/posts' -d @posts.json

Note that you can store your mappings in a file with any name you want.

And again, if everything goes well, we see the following response:

{"acknowledged":true}

Now we have our index structure and we can index our data. Let's take a break to discuss the contents of the posts.json file.

Type definition

As you can see, the content of the posts.json file is a JSON object and therefore it starts and ends with curly brackets (if you want to learn more about JSON, please visit http://www.json.org/). All the type definitions inside the mentioned file are nested in the mappings object. You can define multiple types inside the mappings JSON object. In our example, we have a single post type, but if we also wanted to include the user type, the file would look as follows:

{   "mappings": {     "post": {       "properties": {                        "id": { "type":"long", "store":"yes",         "precision_step":"0" },         "name": { "type":"string", "store":"yes",         "index":"analyzed" },         "published": { "type":"date", "store":"yes",         "precision_step":"0" },         "contents": { "type":"string", "store":"no",         "index":"analyzed" }                   }     },     "user": {       "properties": {                        "id": { "type":"long", "store":"yes",         "precision_step":"0" },         "name": { "type":"string", "store":"yes",   "index":"analyzed" }                   }     }   } }

Fields

Each type is defined by a set of properties—fields that are nested inside the properties object. So let's concentrate on a single field now; for example, the contents field, whose definition is as follows:

"contents": { "type":"string", "store":"yes", "index":"analyzed" }

It starts with the name of the field, which is contents in the preceding case. After the name of the field, we have an object defining the behavior of the field. The attributes are specific to the types of fields we are using. Of course, if you have multiple fields for a single type (which is what we usually have), remember to separate them with a comma.

Core types

Each field is given one of the core types provided by Elasticsearch. The core types in Elasticsearch are as follows:

  • String
  • Number
  • Date
  • Boolean
  • Binary

So, now let's discuss each of the core types available in Elasticsearch and the attributes it provides to define their behavior.

Common attributes

Before continuing with the descriptions of all the core types, we would like to discuss some common attributes that you can use to describe all the types (except for the binary one). A sample field definition combining several of these attributes is shown after the following list.

  • index_name: This defines the name of the field that will be stored in the index. If this is not defined, the name will be set to the name of the object that the field is defined with.
  • index: This can take the values analyzed and no; for string-based fields, it can also be set to not_analyzed. If set to analyzed, the field will be indexed and thus searchable. If set to no, you won't be able to search on such a field. In the case of string-based fields, the additional not_analyzed option means that the field will be indexed but not analyzed; the field is written into the index exactly as it was sent to Elasticsearch and only a perfect match will be counted during a search. Note that setting the index property to no will also disable the include_in_all property of such a field.
  • store: This can take the values yes and no and specifies if the original value of the field should be written into the index. The default value is no, which means that you can't return that field in the results (although, if you use the _source field, you can return the value even if it is not stored), but if you have it indexed, you can still search the data on the basis of it.
  • boost: The default value of this attribute is 1. Basically, it defines how important the field is inside the document; the higher the boost, the more important the values in the field.
  • null_value: This attribute specifies a value that should be written into the index in case that field is not a part of an indexed document. The default behavior will just omit that field.
  • copy_to: This attribute specifies a field to which all field values will be copied.
  • include_in_all: This attribute specifies if the field should be included in the _all field. By default, if the _all field is used, all the fields will be included in it.
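
As an illustration (a sketch with hypothetical field names and values, not taken from the original text), a field definition combining several of these attributes could look as follows:

"city" : {
  "type" : "string",
  "index_name" : "town",
  "index" : "not_analyzed",
  "store" : "yes",
  "boost" : 2.0,
  "null_value" : "unknown",
  "include_in_all" : true
}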

String

String is the most basic text type, which allows us to store one or more characters inside it. A sample definition of such a field can be as follows:

"contents" : {"type" :"string", "store" :"no", "index" :"analyzed"}

In addition to the common attributes, the following attributes can also be set for string-based fields (an example definition follows the list):

  • term_vector: This attribute can take the values no (the default one), yes, with_offsets, with_positions, and with_positions_offsets. It defines whether or not to calculate the Lucene term vectors for that field. If you are using highlighting, you will need to calculate the term vector.
  • omit_norms: This attribute can take the value true or false. The default value is false for string fields that are analyzed and true for string fields that are indexed but not analyzed. When this attribute is set to true, it disables the Lucene norms calculation for that field (and thus you can't use index-time boosting), which can save memory for fields used only in filters (and thus not being taken into consideration when calculating the score of the document).
  • analyzer: This attribute defines the name of the analyzer used for indexing and searching. It defaults to the globally-defined analyzer name.
  • index_analyzer: This attribute defines the name of the analyzer used for indexing.
  • search_analyzer: This attribute defines the name of the analyzer used for processing the part of the query string that is sent to a particular field.
  • norms.enabled: This attribute specifies whether the norms should be loaded for a field. By default, it is set to true for analyzed fields (which means that the norms will be loaded for such fields) and to false for non-analyzed fields.
  • norms.loading: This attribute takes the values eager and lazy. The first value means that the norms for such fields are always loaded. The second value means that the norms will be loaded only when needed.
  • position_offset_gap: This attribute defaults to 0 and specifies the gap in the index between instances of the given field with the same name. Setting this to a higher value may be useful if you want position-based queries (like phrase queries) to match only inside a single instance of the field.
  • index_options: This attribute defines the indexing options for the postings list—the structure holding the terms. The possible values are docs (only document numbers are indexed), freqs (document numbers and term frequencies are indexed), positions (document numbers, term frequencies, and their positions are indexed), and offsets (document numbers, term frequencies, their positions, and offsets are indexed). The default value for this property is positions for analyzed fields and docs for fields that are indexed but not analyzed.
  • ignore_above: This attribute defines the maximum size of the field in characters. Fields whose size is above the specified value will be ignored by the analyzer.
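
For example (a sketch with a hypothetical title field), a string field prepared for highlighting with a custom analyzer could be defined as follows:

"title" : {
  "type" : "string",
  "store" : "yes",
  "index" : "analyzed",
  "term_vector" : "with_positions_offsets",
  "analyzer" : "standard"
}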

Number

This core type gathers all the numeric field types that are available for use. The following types are available in Elasticsearch (we specify them by using the type property):

  • byte: This type defines a byte value; for example, 1
  • short: This type defines a short value; for example, 12
  • integer: This type defines an integer value; for example, 134
  • long: This type defines a long value; for example, 123456789
  • float: This type defines a float value; for example, 12.23
  • double: This type defines a double value; for example, 123.45

You can learn more about the mentioned Java types at http://docs.oracle.com/javase/tutorial/java/nutsandbolts/datatypes.html.

A sample definition of a field based on one of the numeric types is as follows:

"price" : {"type" :"float", "store" :"yes", "precision_step" :"4"}

In addition to the common attributes, the following ones can also be set for the numeric fields:

  • precision_step: This attribute specifies the number of terms generated for each value in a field. The lower the value, the higher the number of terms generated. For fields with a higher number of terms per value, range queries will be faster at the cost of a slightly larger index. The default value is 4.
  • ignore_malformed: This attribute can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values (see the sketch after this list).
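
As an illustration (a sketch reusing the price field from the earlier sample), a numeric field combining both attributes could be defined as follows:

"price" : {
  "type" : "float",
  "store" : "yes",
  "precision_step" : "4",
  "ignore_malformed" : true
}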

Boolean

The boolean core type is designed for indexing Boolean values (true or false). A sample definition of a field based on the boolean type can be as follows:

"allowed" : { "type" : "boolean", "store": "yes" }

Binary

The binary field is a Base64 representation of the binary data stored in the index. You can use it to store data that is normally written in binary form, such as images. Fields based on this type are by default stored and not indexed, so you can only retrieve them and cannot perform search operations on them. The binary type only supports the index_name property. The sample field definition based on the binary field may look like the following:

"image" : { "type" : "binary" }

Date

The date core type is designed to be used for date indexing. It follows a specific format that can be changed and is stored in UTC by default.

The default date format understood by Elasticsearch is quite universal and allows us to specify the date and, optionally, the time; for example, 2012-12-24T12:10:22.

A sample definition of a field based on the date type is as follows:

"published" : { "type" : "date", "store" : "yes", "format" :   "YYYY-mm-dd" }

A sample document that uses the preceding field is as follows:

{   "name" : "Sample document",   "published" : "2012-12-22" }

In addition to the common attributes, the following ones can also be set for the fields based on the date type:

  • format: This attribute specifies the format of the date. The default value is dateOptionalTime. For a full list of formats, please visit http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/mapping-date-format.html.
  • precision_step: This attribute specifies the number of terms generated for each value in that field. The lower the value, the higher the number of terms generated, and thus the faster the range queries (but with a higher index size). The default value is 4.
  • ignore_malformed: This attribute can take the value true or false. The default value is false. It should be set to true in order to omit badly formatted values.

Summary

In this article, we learned how Elasticsearch indexing works. We learned how to create our own mappings that define the index structure, how to create indices using them, and what field types we are allowed to use.
