Learning Elasticsearch


Product type Book
Published in Jun 2017
Publisher Packt
ISBN-13 9781787128453
Pages 404
Edition 1st Edition
Author Abhishek Andhavarapu

Table of Contents (11 chapters)

  • Preface
  • Introduction to Elasticsearch
  • Setting Up Elasticsearch and Kibana
  • Modeling Your Data and Document Relations
  • Indexing and Updating Your Data
  • Organizing Your Data and Bulk Data Ingestion
  • All About Search
  • More Than a Search Engine (Geofilters, Autocomplete, and More)
  • How to Slice and Dice Your Data Using Aggregations
  • Production and Beyond
  • Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting)

Organizing Your Data and Bulk Data Ingestion

In this chapter, you'll learn how to manage indices in Elasticsearch. Previous chapters focused on operating on a single document; here, you'll learn about the various APIs Elasticsearch offers to support bulk operations. These can be very effective when it comes to rebuilding an entire index or batching requests together in a single call. Because of how data is stored internally in Elasticsearch, the number of shards and the mapping of existing fields cannot be changed after index creation. You'll learn about the Reindex API, which can rebuild an index with the correct settings. Using Elasticsearch for time-based data is a very common usage pattern, and we will discuss different ways to manage time-based indices. By the end of this chapter, we will have covered the following...

Bulk operations

In this section, we will discuss the various bulk operations Elasticsearch supports. Batching multiple requests together saves network round trips, and the requests in a batch can be executed in parallel. Elasticsearch has a dedicated thread pool for bulk operations; the number of requests it can process in parallel depends on the number of CPU cores in the node.

The following are the different bulk operations supported:

  • Bulk API: This can be used to batch multiple index and delete operations and execute them in a single API call.
  • Multi Get API: This can be used to retrieve multiple documents by their IDs in a single call.
  • Update by query: This can be used to update all the documents that match a query.
  • Delete by query: This can be used to delete all the documents that match a query.
...
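To make the Bulk API concrete, the sketch below builds a `_bulk` request body, assuming a hypothetical `products` index with a `product` type (types still exist in the 5.x era this book covers). The body is newline-delimited JSON: each action metadata line is followed by a source line for index actions, while delete actions carry no source.

```python
import json

# Hypothetical actions: one index operation (with a source document)
# and one delete operation (no source line).
actions = [
    ({"index": {"_index": "products", "_type": "product", "_id": "1"}},
     {"name": "Learning Elasticsearch", "price": 44.99}),
    ({"delete": {"_index": "products", "_type": "product", "_id": "2"}},
     None),
]

lines = []
for action, source in actions:
    lines.append(json.dumps(action))
    if source is not None:
        lines.append(json.dumps(source))
# The NDJSON body must end with a trailing newline.
bulk_body = "\n".join(lines) + "\n"

# This body would be sent as:
#   POST /_bulk   (Content-Type: application/x-ndjson)
print(bulk_body)
```

Note that the whole batch is one HTTP request, but each action in it succeeds or fails independently; the response reports a status per item.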

Reindex API

Before Elasticsearch 5.0, to change the settings or the mapping of an index, you had to create a new index and reindex the data. Reindexing a large index is usually a lot of work, which involves reading the data from a source such as a SQL database, transforming the data into Elasticsearch documents, and loading the data into Elasticsearch. For large applications, batch processing engines such as Hadoop are used to reindex the data. Depending on how big the index is or how complicated the ETL (Extract, Transform, Load) process is, reindexing can be very expensive. To solve this, the Reindex API was introduced. The original JSON document used for indexing is stored in the _source field, which the Reindex API uses to reindex the documents. The Reindex API can be used for the following:

  • To change the mapping/settings of an existing index
  • To combine documents...
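A minimal sketch of a Reindex API request body follows, assuming hypothetical `products_v1` and `products_v2` indices: the new index is created beforehand with the corrected mapping/settings, and `_reindex` copies the stored `_source` of each document into it.

```python
import json

# Sketch of a _reindex body: copy every document from the (hypothetical)
# products_v1 index into products_v2, which was created with the new mapping.
reindex_body = {
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
}

# This body would be sent as: POST /_reindex
print(json.dumps(reindex_body, indent=2))
```

The `source` section also accepts a `query`, so a subset of documents can be reindexed instead of the whole index.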

Ingest Node

Traditionally, Logstash is used to preprocess your data before indexing into Elasticsearch. Using Logstash, you can define pipelines to extract, transform, and index your data into Elasticsearch.

In Elasticsearch 5.0, the ingest node was introduced. Using an ingest node, you can define pipelines that modify documents before they are indexed. A pipeline is a series of processors, each working on one or more fields in the document. The most commonly used Logstash filters are available as processors: for example, using a grok processor to extract data from an Apache log line into structured fields, extracting fields from JSON, changing the date format, calculating the geo-distance from a location, and so on. The possibilities are endless. Elasticsearch supports many processors out of the box, and you can also develop your own processors using any JVM-supported language.

By default...
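As an illustration, the sketch below defines an ingest pipeline with two common processors: grok, to parse a raw Apache log line held in the `message` field, and date, to normalize the parsed timestamp. The pipeline id `apache_logs` and the field names are hypothetical.

```python
import json

# Sketch of an ingest pipeline definition (hypothetical id: apache_logs).
pipeline = {
    "description": "Parse Apache access logs",
    "processors": [
        # grok: extract structured fields from the raw log line in "message".
        {"grok": {"field": "message",
                  "patterns": ["%{COMMONAPACHELOG}"]}},
        # date: parse the extracted timestamp into a proper date field.
        {"date": {"field": "timestamp",
                  "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}},
    ],
}

# Registered as: PUT /_ingest/pipeline/apache_logs
# Used at index time: PUT /logs/log/1?pipeline=apache_logs
print(json.dumps(pipeline, indent=2))
```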

Organizing your data

In this section, we will discuss how to divide your data into multiple indices. Elasticsearch provides index aliases, which make it very easy to query multiple indices at once. It also supports index templates to configure automatic index creation. We will also discuss how to deal with time-based data, such as logs, which is a common Elasticsearch use case.

Index alias

An index alias is a pointer to one or more indices. A search operation executed against an alias runs across all the indices the alias points to: the coordinating node executes the request on each of them, collects the results, and sends them back to the client. The index operation, on the other hand, cannot be executed on an alias...
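Aliases are managed through the `_aliases` endpoint, whose `actions` array is applied atomically. The sketch below swaps an alias from an old monthly index to a new one (index names are hypothetical), so clients searching `logs` never see a gap.

```python
import json

# Sketch of an atomic alias swap via POST /_aliases.
# The "logs" alias is moved from last month's index to the current one
# (both index names are hypothetical).
alias_actions = {
    "actions": [
        {"remove": {"index": "logs-2017-05", "alias": "logs"}},
        {"add":    {"index": "logs-2017-06", "alias": "logs"}},
    ]
}

print(json.dumps(alias_actions, indent=2))
```

Because both actions are applied in a single atomic step, there is no moment at which the alias points to neither index.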

Shrink API

The Shrink API is used to shrink an existing index into a new index with fewer shards. If the data in an index is no longer changing, the index can be optimized for search and aggregation by reducing the number of shards. The number of shards in the destination index must be a factor of the number in the original index; for example, an index with 6 primary shards can be shrunk into 3, 2, or 1 shards. When working with time-based data, such as logs, data is only indexed into the current index, and older indices are mostly read-only. The Shrink API doesn't reindex the documents; it simply relinks the index segments to the new index.

To shrink an index, the index should be marked as read-only, and either a primary or a replica of every shard of the index should be moved to a single node. We can force the allocation of the shards to one node and mark the index as read-only as shown...
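The two preparation steps and the shrink call itself can be sketched as follows, with hypothetical index (`logs-2017-05`) and node (`node-1`) names:

```python
import json

# Step 1 body: block writes and force all shards of the source index onto
# one node (node name is hypothetical).
#   PUT /logs-2017-05/_settings
prepare_settings = {
    "settings": {
        "index.blocks.write": True,
        "index.routing.allocation.require._name": "node-1",
    }
}

# Step 2 body: shrink into a new one-shard index.
#   POST /logs-2017-05/_shrink/logs-2017-05-shrunk
shrink_body = {
    "settings": {"index.number_of_shards": 1},
}

print(json.dumps(prepare_settings, indent=2))
print(json.dumps(shrink_body, indent=2))
```

Once the shrunken index is green and verified, the original can be deleted and an alias pointed at the new index.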

Summary

In this chapter, we discussed the various bulk operations Elasticsearch supports. You also learned about the Reindex and Shrink APIs, which can be used to change the configuration of an existing index, such as the number of shards or the mapping, without rebuilding it from the original data source.

We covered how to organize your data in Elasticsearch using aliases and index templates, and how to use an ingest node to preprocess your data before indexing. You learned how to use an ingest node to transform unstructured log data into JSON documents and automatically index them into a month-based index.

In the next chapter, we will discuss different ways of querying Elasticsearch.
