Low-Level Index Control

by Rafał Kuć and Marek Rogoziński | October 2013 | Open Source

In this article by Rafał Kuć and Marek Rogoziński, the authors of Mastering ElasticSearch, we will cover the following topics:

  • How to use different scoring formulae and what they can bring
  • How to use different posting formats and what they can bring
  • How to handle Near Real Time searching, real-time GET, and what searcher reopening means
  • Looking deeper into multilingual data handling
  • Configuring transaction log to our needs and see how it affects our deployments
  • Segments merging, different merge policies, and merge scheduling


Altering Apache Lucene scoring

With the release of Apache Lucene 4.0 in 2012, all the users of this great, full text search library were given the opportunity to alter the default TF/IDF-based algorithm. The Lucene API was changed to allow easier modification and extension of the scoring formula. However, that was not the only change made to Lucene when it comes to document score calculation. Lucene 4.0 was shipped with additional similarity models, which basically allow us to use a different scoring formula for our documents. In this section we will take a deeper look at what Lucene 4.0 brings and how those features were incorporated into ElasticSearch.

Setting per-field similarity

Since ElasticSearch 0.90, we are allowed to set a different similarity for each of the fields we have in our mappings. For example, let's assume that we have the following simple mapping that we use, in order to index blog posts (stored in the posts_no_similarity.json file):

{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }

What we would like to do is use the BM25 similarity model for the name and contents fields. In order to do that, we need to extend our field definitions and add the similarity property with the value of the chosen similarity name. Our changed mappings (stored in the posts_similarity.json file) would appear as shown in the following code:

{ "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "similarity" : "BM25" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "BM25" } } } } }

And that's all, nothing more is needed. After the preceding change, Apache Lucene will use the BM25 similarity to calculate the score factor for the name and contents fields.

In the case of the Divergence from randomness and Information based similarity models, we need to configure additional properties to specify the behavior of those similarities. How to do that is covered in the next part of the current section.
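To give you a feel for what such a configuration looks like, here is a minimal sketch of a Divergence from randomness setup, assuming a custom similarity registered in the index settings and assigned to the contents field; the custom_dfr name and the chosen basic_model, after_effect, and normalization values are illustrative assumptions, not values taken from the book's example:

{
 "settings" : {
  "index" : {
   "similarity" : {
    "custom_dfr" : {
     "type" : "DFR",
     "basic_model" : "g",
     "after_effect" : "l",
     "normalization" : "h2",
     "normalization.h2.c" : "3.0"
    }
   }
  }
 },
 "mappings" : {
  "post" : {
   "properties" : {
    "contents" : { "type" : "string", "store" : "no", "index" : "analyzed", "similarity" : "custom_dfr" }
   }
  }
 }
}

Once defined this way, the custom similarity is referenced by name in the mapping, just like the built-in BM25 similarity shown earlier.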

Default codec properties

When using the default codec we are allowed to configure the following properties (a configuration sketch follows the list):

  • min_block_size: It specifies the minimum block size Lucene term dictionary uses to encode blocks. It defaults to 25.
  • max_block_size: It specifies the maximum block size Lucene term dictionary uses to encode blocks. It defaults to 48.
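As a hedged illustration of how these properties might be used, the following sketch registers a postings format of the default type under the illustrative name custom_default and assigns it to a field, following the same pattern that is used later in this section for the bloom filter-based codec; the block size values are arbitrary examples, not recommendations:

{
 "settings" : {
  "index" : {
   "codec" : {
    "postings_format" : {
     "custom_default" : {
      "type" : "default",
      "min_block_size" : "20",
      "max_block_size" : "60"
     }
    }
   }
  }
 },
 "mappings" : {
  "post" : {
   "properties" : {
    "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_default" }
   }
  }
 }
}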

Direct codec properties

The direct codec allows us to configure the following properties:

  • min_skip_count: It specifies the minimum number of terms with a shared prefix to allow writing of a skip pointer. It defaults to 8.
  • low_freq_cutoff: The codec will use a single array object to hold postings and positions that have document frequency lower than this value. It defaults to 32.

Memory codec properties

By using the memory codec we are allowed to alter the following properties:

  • pack_fst: It is a Boolean option that defaults to false and specifies if the memory structure that holds the postings should be packed into the FST. Packing into FST will reduce the memory needed to hold the data.
  • acceptable_overhead_ratio: It is a compression ratio of the internal structure specified as a float value which defaults to 0.2. When using the 0 value, there will be no additional memory overhead but the returned implementation may be slow. When using the 0.5 value, there can be a 50 percent memory overhead, but the implementation will be fast. Values higher than 1 are also possible, but may result in high memory overhead.

Pulsing codec properties

When using the pulsing codec we can use the same properties as with the default codec and, in addition to them, one more property, which is described as follows:

  • freq_cut_off: It defaults to 1 and specifies the document frequency at which the postings list will be written into the term dictionary: terms with a document frequency equal to or less than the value of freq_cut_off will be processed this way.

Bloom filter-based codec properties

If we want to configure a bloom filter based codec, we can use the bloom_filter type and set the following properties:

  • delegate: It specifies the name of the codec we want to wrap, with the bloom filter.
  • ffp: It is a value between 0 and 1.0 which specifies the desired false positive probability. We are allowed to set multiple probabilities depending on the number of documents per Lucene segment. For example, the default value of 10k=0.01, 1m=0.03 specifies that a probability of 0.01 will be used when the number of documents per segment is larger than 10,000 and a probability of 0.03 will be used when the number of documents per segment is larger than one million.

For example, we could configure our custom bloom filter based codec to wrap a direct posting format as shown in the following code (stored in posts_bloom_custom.json file):

{ "settings" : { "index" : { "codec" : { "postings_format" : { "custom_bloom" : { "type" : "bloom_filter", "delegate" : "direct", "ffp" : "10k=0.03, 1m=0.05" } } } } }, "mappings" : { "post" : { "properties" : { "id" : { "type" : "long", "store" : "yes", "precision_step" : "0" }, "name" : { "type" : "string", "store" : "yes", "index" : "analyzed", "postings_format" : "custom_bloom" }, "contents" : { "type" : "string", "store" : "no", "index" : "analyzed" } } } } }

NRT, flush, refresh, and transaction log

In an ideal search solution, when new data is indexed, it is instantly available for searching. At first glance this is exactly how ElasticSearch works, even in multiserver environments. But this is not the truth (or at least not the whole truth) and we will show you why. Let's index an example document to a newly created index by using the following command:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test" }'

Now, we will replace this document and immediately we will try to find it. In order to do this, we'll use the following command chain:

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl localhost:9200/test/test/_search?pretty

The preceding commands will probably result in a response very similar to the following one:

{"ok":true,"_index":"test","_type":"test","_id":"1","_version":2}{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 5, "successful" : 5, "failed" : 0 }, "hits" : { "total" : 1, "max_score" : 1.0, "hits" : [ { "_index" : "test", "_type" : "test", "_id" : "1", "_score" : 1.0, "_source" : { "title": "test" } } ] } }

The first line is the response to the indexing command, the first of the two commands. As you can see, everything is correct, so the second command, the search query, should return the document with the title field set to test2. However, as you can see, it returned the first version of the document instead. What happened?

Before we answer this question, we should take a step back and discuss how the underlying Apache Lucene library makes newly indexed documents available for searching.

Updating index and committing changes

Segments are independent indices, which means that, in order for queries running in parallel with indexing to see newly created segments, Lucene needs, from time to time, to add those segments to the set of segments used for searching. Apache Lucene does that by creating subsequent (because of the write-once nature of the index) segments_N files, which list the segments present in the index. This process is called committing. Lucene can do it in a secure way: we are sure that either all changes hit the index or none of them do. If a failure happens, we can be sure that the index will be in a consistent state.

Let's return to our example. The first operation adds the document to the index, but doesn't run the commit command to Lucene. This is exactly how it works. However, a commit is not enough for the data to be available for searching. The Lucene library uses an abstraction class called Searcher to access the index. After a commit operation, the Searcher object should be reopened in order to be able to see the newly created segments. This whole process is called refresh. For performance reasons, ElasticSearch tries to postpone costly refreshes; by default, a refresh is not performed after indexing a single document (or a batch of them), but the Searcher is refreshed every second. This is quite often, but sometimes applications require the refresh operation to be performed more often than once every second. When this happens, you should consider using another technology, or the requirements should be verified. If required, it is possible to force a refresh by using the ElasticSearch API. For example, in our case we can add the following command:

curl -XGET localhost:9200/test/_refresh

If we add the preceding command before the search, ElasticSearch would respond as we had expected.
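To make this concrete, here is a hedged sketch of the full command chain from our example, with a refresh forced between indexing and searching (the index, type, and document are the ones used in the preceding commands):

curl -XPOST localhost:9200/test/test/1 -d '{ "title": "test2" }' ; curl -XGET localhost:9200/test/_refresh ; curl localhost:9200/test/test/_search?pretty

This time the search should return the document with the title field set to test2.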

Changing the default refresh time

The time between automatic Searcher refreshes can be changed by using the index.refresh_interval parameter, either in the ElasticSearch configuration file or by using the update settings API. For example:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "5m" } }'

The preceding command will change the automatic refresh to be done every 5 minutes. Please remember that data indexed between refreshes won't be visible to queries.

As we said, the refresh operation is costly when it comes to resources. The longer the refresh interval is, the faster your indexing will be. If you are planning a very heavy indexing procedure and you don't need your data to be visible until the indexing ends, you can consider disabling the refresh operation by setting the index.refresh_interval parameter to -1 and setting it back to its original value after the indexing is done.
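A minimal sketch of that procedure might look as follows, assuming the test index and assuming the original interval was the default value of 1s:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "-1" } }'
# heavy indexing happens here
curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "refresh_interval" : "1s" } }'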

The transaction log

Apache Lucene can guarantee index consistency and all-or-nothing indexing, which is great. But this fact cannot ensure us that there will be no data loss when a failure happens while writing data to the index (for example, when there isn't enough space on the device, the device is faulty, or there aren't enough file handlers available to create new index files). Another problem is that frequent commits are costly in terms of performance (as you recall, a single commit will trigger a new segment creation, and this can trigger segment merging). ElasticSearch solves those issues by implementing the transaction log. The transaction log holds all uncommitted transactions and, from time to time, ElasticSearch creates a new log for subsequent changes. When something goes wrong, the transaction log can be replayed to make sure that none of the changes were lost. All of these tasks happen automatically, so the user may not be aware of the fact that a commit was triggered at a particular moment. In ElasticSearch, the moment when the information from the transaction log is synchronized with the storage (which is the Apache Lucene index) and the transaction log is cleared is called flushing.

Please note the difference between the flush and refresh operations. In most cases, refresh is exactly what you want: it is all about making new data available for searching. The flush operation, on the other hand, is used to make sure that all the data is correctly stored in the index and that the transaction log can be cleared.

In addition to automatic flushing, a flush can be forced manually using the flush API. For example, we can flush all the data stored in the transaction logs of all indices by running the following command:

curl -XGET localhost:9200/_flush

Or we can run the flush command for a particular index, which in our case is the one called library:

curl -XGET localhost:9200/library/_flush
curl -XGET localhost:9200/library/_refresh

In the second example we used flush together with refresh, which, after flushing the data, opens a new searcher.

The transaction log configuration

If the default behavior of the transaction log is not enough, ElasticSearch allows us to configure how it is handled. The following parameters can be set in the elasticsearch.yml file, as well as by using the index settings update API, to control transaction log behavior:

  • index.translog.flush_threshold_period: It defaults to 30 minutes (30m) and controls the time after which a flush will be forced automatically, even if no new data was written to the transaction log. In some cases this can cause a lot of I/O operations, so sometimes it's better to flush more often, with less data being stored in the log.
  • index.translog.flush_threshold_ops: It specifies the maximum number of operations after which the flush operation will be performed. It defaults to 5000.
  • index.translog.flush_threshold_size: It specifies the maximum size of the transaction log. If the size of the transaction log is equal to or greater than the value of this parameter, the flush operation will be performed. It defaults to 200 MB.
  • index.translog.disable_flush: This option disables automatic flushing. By default flushing is enabled, but sometimes it is handy to disable it temporarily, for example, during the import of a large amount of documents.

All of the mentioned parameters are specified for an index of our choice, but they define the behavior of the transaction log for each of the index shards.
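For instance, a hedged sketch of how those settings might look in the elasticsearch.yml file; the values are purely illustrative and not a recommendation:

index.translog.flush_threshold_period: 10m
index.translog.flush_threshold_ops: 10000
index.translog.flush_threshold_size: 500mb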

Of course, in addition to setting the preceding parameters in the elasticsearch.yml file, they can also be set by using the Settings Update API. For example:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : true } }'

The preceding command was run before the import of a large amount of data, which gave us a performance boost for indexing. However, one should remember to turn flushing back on when the import is done.
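Turning it back on is simply a matter of setting the same property back to false; a minimal sketch for our test index would be:

curl -XPUT localhost:9200/test/_settings -d '{ "index" : { "translog.disable_flush" : false } }'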

Near Real Time GET

The transaction log gives us one more feature for free, that is, the real-time GET operation, which provides the possibility of returning the latest version of a document, including versions that haven't been committed yet. The real-time GET operation fetches data from the index, but first it checks whether a newer version of the document is available in the transaction log. If such a not-yet-flushed version exists, the data from the index is ignored and the newer version of the document, the one from the transaction log, is returned. In order to see how it works, you can replace the search operation in our example with the following command:

curl -XGET localhost:9200/test/test/1?pretty

ElasticSearch should return a result similar to the following:

{ "_index" : "test", "_type" : "test", "_id" : "1", "_version" : 2, "exists" : true, "_source" : { "title": "test2" } }

If you look at the result, you can see that it is just as we expected, and no trick with refreshing was required to obtain the newest version of the document.


Looking deeper into data handling

When starting to work with ElasticSearch, you can be overwhelmed by the different ways of searching and the different query types it provides. Each of these query types behaves differently, and we are not talking only about obvious differences, such as the one you would see when comparing a range search and a prefix search. It is crucial to know about these differences to understand how the queries work, especially when doing a little more than just using the default ElasticSearch instance, for example, when handling multilingual information.

Input is not always analyzed

Before we start discussing query analysis, let's create an index by using the following command:

curl -XPUT localhost:9200/test -d '{ "mappings" : { "test" : { "properties" : { "title" : { "type" : "string", "analyzer" : "snowball" } } } } }'

As you can see, the index is pretty simple. The document contains only one field, processed by the snowball analyzer. Now, let's index a simple document. We do it by running the following command:

curl -XPUT localhost:9200/test/test/1 -d '{ "title" : "the quick brown fox jumps over the lazy dog" }'

We now have our big index, so we can bombard it with queries. Look closely at the following two commands:

curl localhost:9200/test/_search?pretty -d '{ "query" : { "term" : { "title" : "jumps" } } }'
curl localhost:9200/test/_search?pretty -d '{ "query" : { "match" : { "title" : "jumps" } } }'

The first query will not return our document, but the second query will, surprise! You probably already know (or suspect) what the reason for such behavior is and that it is connected to analyzing. Let's compare what we, in fact, have in the index and what we are searching for. To do that, we will use the Analyze API by running the following command:

curl 'localhost:9200/test/_analyze?text=the+quick+brown+fox+jumps+over+the+lazy+dog&pretty&analyzer=snowball'

The _analyze endpoint allows us to see what ElasticSearch does with the input that is given in the text parameter. It also gives us the possibility to define which analyzer should be used (the analyzer parameter).

Other features of the analyze API are available at http://www.elasticsearch.org/guide/reference/api/admin-indices-analyze/.

The response returned by ElasticSearch for the preceding request will look similar to the following:

{ "tokens" : [ { "token" : "quick", "start_offset" : 4, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "brown", "start_offset" : 10, "end_offset" : 15, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "fox", "start_offset" : 16, "end_offset" : 19, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "jump", "start_offset" : 20, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "over", "start_offset" : 26, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "lazi", "start_offset" : 35, "end_offset" : 39, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "dog", "start_offset" : 40, "end_offset" : 43, "type" : "<ALPHANUM>", "position" : 9 } ] }

As you can see, the snowball analyzer removed the stop word the and turned some of the words into forms that are not proper English words, such as jump and lazi. Producing such forms is not as bad as it looks, as long as all forms of the same word are converted into the same form. If that happens, the goal of stemming is achieved: ElasticSearch will match the words from the query with the words stored in the index, independently of their form. But now let's return to our queries. The term query just searches for the given term (jumps in our case), but there is no such term in the index (there is jump). In the case of the match query, the given text is first passed to the analyzer, which converts jumps into jump, and the converted form is then used in the query.

Now let's look at the second example:

curl localhost:9200/test/_search?pretty -d '{ "query" : { "prefix" : { "title" : "lazy" } } }'
curl localhost:9200/test/_search?pretty -d '{ "query" : { "match_phrase_prefix" : { "title" : "lazy" } } }'

In the preceding case both queries look similar but, again, the first query returns nothing (because lazy is not equal to the lazi term stored in the index), while the second query, which is analyzed, will return our document.

Example usage

All of this is interesting and you should remember the fact that some of the queries are analyzed and some are not. However, the most interesting part is how we can use all of this consciously to improve search-based applications.

Let's imagine searching the contents of books. It is possible that our users sometimes search by the name of a character, a place name, or perhaps a fragment of a quote. We don't have any natural language analysis functionality in our application, so we don't know the meaning of the phrase entered by the user. However, with some degree of probability we can assume that the most interesting result will be the one that exactly matches the phrase entered by the user. The second in importance will probably be the documents that contain exactly the same words, in the same form, as the user's input, followed by the documents containing words with the same meaning but in a different language form.

To show another example, let's use a command that creates a simple index with a title field analyzed in two different ways (the lang field will be used a bit later):

curl -XPUT localhost:9200/test -d '{
 "mappings" : {
  "test" : {
   "properties" : {
    "lang" : { "type" : "string" },
    "title" : {
     "type" : "multi_field",
     "fields" : {
      "i18n" : { "type" : "string", "index" : "analyzed", "analyzer" : "english" },
      "org" : { "type" : "string", "index" : "analyzed", "analyzer" : "standard" }
     }
    }
   }
  }
 }
}'

We have a single title field, but it is analyzed in two different ways because of multi_field: with the standard analyzer (the title.org field), and with the english analyzer (the title.i18n field), which will try to change the input to its base form. If we index an example document with the following command:

curl -XPUT localhost:9200/test/test/1 -d '{ "title" : "The quick brown fox jumps over the lazy dog." }'

We will have the jumps term indexed in the title.org field and the jump term indexed in the title.i18n field. Now let's run the following query:

curl localhost:9200/test/_search?pretty -d '{ "query" : { "multi_match" : { "query" : "jumps", "fields" : ["title.org^1000", "title.i18n"] } } }'

Our document will be given a higher score for a perfect match, thanks to boosting and matching the jumps term in the title.org field. A score is also given for the hit in title.i18n, but the impact of this field on the overall score is much smaller, because we didn't specify a boost for it and thus the default value of 1 is used.

Changing the analyzer during indexing

The next thing worth mentioning when it comes to handling multilingual data is the possibility of dynamically changing the analyzer during indexing. Let's modify the previous mapping by adding the _analyzer part to it:

curl -XPUT localhost:9200/test -d '{
 "mappings" : {
  "test" : {
   "_analyzer" : { "path" : "lang" },
   "properties" : {
    "lang" : { "type" : "string" },
    "title" : {
     "type" : "multi_field",
     "fields" : {
      "i18n" : { "type" : "string", "index" : "analyzed" },
      "org" : { "type" : "string", "index" : "analyzed", "analyzer" : "standard" }
     }
    }
   }
  }
 }
}'

The change we just made allows ElasticSearch to determine the analyzer based on the contents of the document being processed. The path parameter is the name of the document field that contains the name of the analyzer. The second change is the removal of the analyzer definition from the title.i18n field definition. Now our indexing command will look like this:

curl -XPUT localhost:9200/test/test/1 -d '{ "title" : "The quick brown fox jumps over the lazy dog.", "lang" : "english" }'

In the preceding example, ElasticSearch will take the value of the lang field and use it as the name of the analyzer for that document. It can be useful when you want different analysis for different documents (for example, some documents should have stop words removed and some shouldn't).
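As a hedged sketch, a second document could point at a different analyzer simply by carrying a different value in its lang field; the standard value used here is illustrative, any analyzer known to the index would do:

curl -XPUT localhost:9200/test/test/2 -d '{ "title" : "The quick brown fox jumps over the lazy dog.", "lang" : "standard" }'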

Changing the analyzer during searching

Changing the analyzer is also possible at query time, by specifying the analyzer property. For example, let's look at the following query:

curl localhost:9200/test/_search?pretty -d '{ "query" : { "multi_match" : { "query" : "jumps", "fields" : ["title.org^1000", "title.i18n"], "analyzer": "english" } } }'

Thanks to the analyzer property present in the preceding query, ElasticSearch will use the analyzer that we've explicitly specified.

The pitfall and default analysis

Combining the mechanisms of replacing the analyzer per document at index time and at query time is a very powerful feature, but it can also introduce hard-to-spot errors. One of them is a situation where the analyzer is not defined. In such cases, ElasticSearch will choose the so-called default analyzer, but sometimes this is not what you expect, because the default analyzer can, for example, be redefined by plugins. In such cases, it is worth defining what the default ElasticSearch analysis should look like. To do this, we just define an analyzer as usual, but instead of giving it a custom name we use the name default.

As an alternative, you can define the default_index analyzer and the default_search analyzer, which will be used as the default analyzers for index-time and search-time analysis, respectively.
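As a hedged sketch, such defaults might be defined in the index settings as follows; the index name and the whitespace and simple analyzer types are purely illustrative:

curl -XPUT localhost:9200/docs -d '{
 "settings" : {
  "index" : {
   "analysis" : {
    "analyzer" : {
     "default_index" : { "type" : "whitespace" },
     "default_search" : { "type" : "simple" }
    }
   }
  }
 }
}'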

Segment merging under control

You also know that each of the shards and replicas is an actual Apache Lucene index, which is built of multiple segments (at least one segment). If you recall, the segments are written once and read many times, apart from the information about deleted documents, which is held in one of the files and can be changed. After some time, when certain conditions are met, the contents of some segments can be copied to a bigger segment, and the original segments are discarded and thus deleted from the disk. Such an operation is called segment merging.

You may ask yourself, why bother about segment merging? There are a few reasons. First of all, the more segments the index is built of, the slower the searches will be and the more memory Lucene needs to use. In addition to this, segments are immutable, so information is never deleted from them. If you happen to delete many documents from your index, those documents are only marked as deleted, not deleted physically, until a merge happens. When segment merging happens, the documents that are marked as deleted are not written into the new segment, and in this way they are removed, which decreases the final segment size.

Many small changes can result in a large number of small segments, which can lead to problems with a large number of open files. We should always be prepared to handle such situations, for example, by having the appropriate open files limit set.

So, just to quickly summarize, segment merging takes place and, from the user's point of view, it will result in two effects:

  • It will reduce the number of segments to allow faster searching when a few segments are merged into a single one
  • It will reduce the size of the index because of removing the deleted documents when the merge is finalized

However, you have to remember that segment merging comes with a price; the price of I/O (input/output) operations, which on slower systems can affect performance. Because of this, ElasticSearch allows us to choose the merge policy and the store level throttling.

Choosing the right merge policy

Although segments merging is Apache Lucene's duty, ElasticSearch allows us to configure which merge policy we would like to use. There are three policies that we are currently allowed to use:

  • tiered (the default one)
  • log_byte_size
  • log_doc

Each of the preceding policies has its own parameters, which define its behavior and whose default values we can override (please look at the section dedicated to the policy of your choice to see what those parameters are).

In order to tell ElasticSearch, which merge policy we want to use, we should set index.merge.policy.type to the desired type, shown as follows:

index.merge.policy.type: tiered

Once the index is created with the specified merge policy type, the policy type can't be changed. However, all the properties defining merge policy behavior can be changed using the index update API.

Let's now look at the different merge policies and what functionality they provide. After this, we will discuss all the configuration options provided by the policies.

The tiered merge policy

This is the default merge policy that ElasticSearch uses. It merges segments of approximately similar size, taking into account the maximum number of segments allowed per tier. It is also possible to differentiate the number of segments that are allowed to be merged at once from how many segments are allowed to be present per tier. During indexing, this merge policy will compute how many segments are allowed to be present in the index, which is called the budget. If the number of segments the index is built of is higher than the computed budget, the tiered policy will first sort the segments in decreasing order of their size (taking into account the deleted documents). After that, it will find the merge that has the lowest cost. The merge cost is calculated in such a way that merges reclaiming more deletes and having a smaller size are favored.

If a merge would produce a segment that is larger than the value specified by the index.merge.policy.max_merged_segment property, the policy will merge fewer segments to keep the segment size under the budget. This means that, for indices that have large shards, the default value of the index.merge.policy.max_merged_segment property may be too low and will result in the creation of many segments, slowing down your queries. Depending on the volume of your data, you should monitor your segments and adjust the merge policy settings to match your needs.
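As a hedged sketch of such an adjustment, the maximum merged segment size could be raised on a live index through the update settings API; the test index name and the 10gb value are purely illustrative:

curl -XPUT localhost:9200/test/_settings -d '{ "index.merge.policy.max_merged_segment" : "10gb" }'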

The log byte size merge policy

This merge policy will, over time, produce an index built of segments whose sizes fall into logarithmic levels: there will be a few large segments, then a few segments that are about merge factor times smaller, and so on. You can imagine that there will be a few segments of the same size level as long as the number of segments at that level is lower than the merge factor; when one more segment at that level is encountered, all the segments within that level are merged. The number of segments an index will contain is proportional to the logarithm of its size in bytes. This merge policy is generally able to keep a low number of segments in your index while minimizing the cost of segment merging.

The log doc merge policy

It is similar to the log_byte_size merge policy, but instead of operating on the actual segment size in bytes, it operates on the number of documents in the index. This merge policy will perform well when the documents are similar in terms of size or if you want segments of similar size in terms of the number of documents.

Merge policies configuration

We now know how merge policies work, but we lack the knowledge about the configuration options. So now, let's discuss each of the merge policies and see what options are exposed to us. Please remember that the default values will usually be OK for most of the deployments and they should be changed only when needed.

The tiered merge policy

When using the tiered merge policy the following options can be altered:

  • index.merge.policy.expunge_deletes_allowed: It defaults to 10 and specifies the percentage of deleted documents a segment must contain in order for it to be considered for merging when running expungeDeletes.
  • index.merge.policy.floor_segment: It is a property that enables us to prevent frequent flushing of very small segments. Segments smaller than the size defined by this property are treated by the merge mechanism as if they had a size equal to the value of this property. It defaults to 2 MB.
  • index.merge.policy.max_merge_at_once: It specifies the maximum number of segments that will be merged at the same time during indexing. By default it is set to 10. Setting the value of this property to higher values can result in multiple segments being merged at once, which will need more I/O resources.
  • index.merge.policy.max_merge_at_once_explicit: It specifies the maximum number of segments that will be merged at the same time during optimize operation or expungeDeletes. By default it is set to 30. This setting will not affect the maximum number of segments that will be merged during indexing.
  • index.merge.policy.max_merged_segment: It defaults to 5 GB and it specifies the maximum size of a single segment that will be produced during segment merging when indexing. This setting is an approximate value, because the merged segment size is calculated by summing the size of segments that are going to be merged minus the size of the deleted documents in those segments.
  • index.merge.policy.segments_per_tier: It specifies the allowed number of segments per tier. Smaller values of this property result in fewer segments, which means more merging and lower indexing performance. It defaults to 10 and should be set to a value higher than or equal to index.merge.policy.max_merge_at_once, or you'll be facing too many merges and performance issues.
  • index.reclaim_deletes_weight: It defaults to 2.0 and specifies how strongly merges that reclaim deletes are favored. When setting this value to 0.0, reclaiming deletes will not affect merge selection. The higher the value, the more favored the merges that reclaim deletes will be.
  • index.compound_format: It is a Boolean value that specifies whether the index should be stored in a compound format or not. It defaults to false. If set to true, Lucene will store all the files that build the index in a single file. This is sometimes useful for systems constantly running out of file handlers, but it will decrease searching and indexing performance.
  • index.merge.async: It is a Boolean value specifying if the merge should be done asynchronously. It defaults to true.
  • index.merge.async_interval: When the index.merge.async value is set to true (so the merging is done asynchronously), this property specifies the interval between merges. The default value of this property is 1s. Please note that the value of this property needs to be kept low for merging to actually happen and for the reduction of index segments to take place.

The log byte size merge policy

When using the log_byte_size merge policy the following options can be configured:

  • merge_factor: It specifies how often segments are merged during indexing. With a smaller merge_factor value, the searches are faster, less memory is used, but that comes with the cost of slower indexing. With larger merge_factor values, it is the opposite—the indexing is faster (because of less merging being done), but the searches are slower and more memory is used. By default, the merge_factor is given the value of 10. It is advised to use larger values of merge_factor for batch indexing and lower values of this parameter for normal index maintenance.
  • min_merge_size: It defines the size (total size of the segment files in bytes) of the smallest segment possible. If a segment is lower in size than the number specified by this property, it will be merged if the merge_factor property allows us to do that. This property defaults to 1.6 MB and is very useful to avoid having many very small segments. However, one should remember that setting this property to a large value will increase the merging cost.
  • max_merge_size: It defines the maximum size (total size of the segment files in bytes) of the segment that can be merged with other segments. By default it is not set, so there is no limit on the maximum size a segment can be in order to be merged.
  • max_merge_docs: It defines the maximum number of documents a segment can have in order to be merged with other segments. By default it is not set, so there is no limit on the maximum number of documents a segment can have.
  • calibrate_size_by_deletes: It is a Boolean value, which is set to true and specifies if the size of deleted documents should be taken into consideration when calculating segment size.
  • index.compound_format: It is a Boolean value that specifies whether the index should be stored in a compound format. It defaults to false. Please refer to the tiered merge policy for the explanation of what this parameter does.

The properties we just mentioned should be prefixed with the index.merge.policy prefix. So if we would like to set the min_merge_size property, we should use the index.merge.policy.min_merge_size property.

In addition to this, the log_byte_size merge policy accepts the index.merge.async property and the index.merge.async_interval property just like tiered merge policy does.

The log doc merge policy

When using the log_doc merge policy the following options can be configured:

  • merge_factor: It is same as the property that is present in the log_byte_size merge policy, so please refer to that policy for explanation.
  • min_merge_docs: It defines the minimum number of documents for the smallest segment. If a segment contains a lower document count than the number specified by this property it will be merged if the merge_factor property allows this. This property defaults to 1000 and is very useful to avoid having many very small segments. However, one should remember that setting this property to a large value will increase the merging cost.
  • max_merge_docs: It defines the maximum number of documents a segment can have in order to be merged with other segments. By default it is not set, so there is no limit on the maximum number of documents a segment can have.
  • calibrate_size_by_deletes: It is a Boolean value which defaults to true and specifies if the size of deleted documents should be taken into consideration when calculating the segment size.
  • index.compound_format: It is a Boolean value that specifies whether the index should be stored in a compound format. It defaults to false. Please refer to the tiered merge policy for the explanation of what this parameter does.

Similar to the previous merge policy, the previously mentioned properties should be prefixed with the index.merge.policy prefix. So if we would like to set the min_merge_docs property, we should use the index.merge.policy.min_merge_docs property.

In addition to this, the log_doc merge policy accepts the index.merge.async property and the index.merge.async_interval property, just like tiered merge policy does.

Scheduling

In addition to having control over how the merge policy behaves, ElasticSearch allows us to define how merges are executed once a merge is needed. There are two merge schedulers available, with the default being the ConcurrentMergeScheduler.

The concurrent merge scheduler

This is a merge scheduler that will use multiple threads in order to perform segment merging. The scheduler will create a new thread whenever one is needed, until the maximum number of threads is reached. If the maximum number of threads is reached and a new thread is needed (because a segment merge needs to be performed), all indexing will be paused until at least one merge is completed.

In order to control the maximum threads allowed, we can alter the index.merge.scheduler.max_thread_count property. By default, it is set to the value calculated by the following equation:

maximum_value(1, minimum_value(3, available_processors / 2))

So, if our system has eight processors available, the maximum number of threads that the concurrent merge scheduler is allowed to use will be equal to 3.
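If you want to override that default, a hedged sketch of the relevant property in the elasticsearch.yml file would be the following (the value 2 is purely illustrative):

index.merge.scheduler.max_thread_count: 2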

The serial merge scheduler

A simple merge scheduler that uses the same thread for merging that is used for indexing. As a result, a merge stops all the other document processing happening in that thread, which in this case means stopping indexing.

Setting the desired merge scheduler

In order to set the desired merge scheduler, one should set the index.merge.scheduler.type property to the value of concurrent or serial. For example, in order to use the concurrent merge scheduler, one should set the following property:

index.merge.scheduler.type: concurrent

In order to use the serial merge scheduler, one should set the following property:

index.merge.scheduler.type: serial

When talking about merge policies and merge schedulers, it is nice to be able to visualize them. If you need to see how merges are done in the underlying Apache Lucene library, we suggest visiting Mike McCandless' blog post at http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html.

In addition to this, there is a plugin called SegmentSpy that allows us to see what is happening to the segments. Please refer to the following URL for more information:

https://github.com/polyfractal/elasticsearch-segmentspy

Summary

In this article, we've learned how to use different scoring formulae and what they bring to the table. We've also seen how to use different posting formats and how we benefit from using them. In addition to this, we now know how to handle Near Real Time searching and real-time GET requests and what searcher reopening means for ElasticSearch. We've discussed multilingual data handling and we've configured transaction log to our needs. Finally, we've learned about segments merging, merge policies, and scheduling.


About the Authors


Marek Rogoziński

Marek Rogoziński is a software architect and consultant with more than 10 years of experience. He has specialized in solutions based on open source search engines such as Solr and Elasticsearch, and also the software stack for Big Data analytics including Hadoop, HBase, and Twitter Storm.

He is also the cofounder of the solr.pl site, which publishes information and tutorials about Solr and the Lucene library. He is also the co-author of some books published by Packt Publishing.

Currently, he holds the position of the Chief Technology Officer in a new company, designing architecture for a set of products that collect, process, and analyze large streams of input data.

Rafał Kuć

Rafał Kuć is a born team leader and software developer. He currently works as a consultant and a software engineer at Sematext Group, Inc., where he concentrates on open source technologies such as Apache Lucene and Solr, Elasticsearch, and Hadoop stack. He has more than 12 years of experience in various branches of software, from banking software to e-commerce products. He focuses mainly on Java but is open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with the problems they face with Solr and Lucene. Also, he has been a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.

Rafał began his journey with Lucene in 2002, and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then, Solr came along and this was it. He started working with Elasticsearch in the middle of 2010. Currently, Lucene, Solr, Elasticsearch, and information retrieval are his main points of interest.

Rafał is also the author of Apache Solr 3.1 Cookbook, and the update to it, Apache Solr 4 Cookbook. Also, he is the author of the previous edition of this book and Mastering ElasticSearch. All these books have been published by Packt Publishing.
