Learning Elasticsearch


Product type Book
Published in Jun 2017
Publisher Packt
ISBN-13 9781787128453
Pages 404
Edition 1st Edition
Author Abhishek Andhavarapu

Table of Contents (11 chapters)

  • Preface
  • Introduction to Elasticsearch
  • Setting Up Elasticsearch and Kibana
  • Modeling Your Data and Document Relations
  • Indexing and Updating Your Data
  • Organizing Your Data and Bulk Data Ingestion
  • All About Search
  • More Than a Search Engine (Geofilters, Autocomplete, and More)
  • How to Slice and Dice Your Data Using Aggregations
  • Production and Beyond
  • Exploring Elastic Stack (Elastic Cloud, Security, Graph, and Alerting)

Organizing Your Data and Bulk Data Ingestion

In this chapter, you'll learn how to manage indices in Elasticsearch. Previous chapters focused on operating on a single document; here, you'll learn about the various APIs Elasticsearch offers to support bulk operations. These can be very effective when it comes to rebuilding an entire index or batching requests together in a single call. Because of how data is stored internally in Elasticsearch, the number of shards and the mapping of existing fields cannot be changed after index creation. You'll learn about the Reindex API, which can rebuild an index with the correct settings. Using Elasticsearch for time-based data is a very common usage pattern, and we will discuss different ways to manage time-based indices. By the end of this chapter, we will have covered the following...

Bulk operations

In this section, we will discuss the various bulk operations Elasticsearch supports. Batching multiple requests together saves network round trips, and the requests in a batch can be executed in parallel. Elasticsearch has a dedicated thread pool for bulk operations; the number of requests it can process in parallel depends on the number of CPU cores in the node.

The following are the different bulk operations supported:

  • Bulk API: This can be used to batch multiple index and delete operations and execute them in a single API call.
  • Multi Get API: This can be used to retrieve multiple documents by their IDs in a single call.
  • Update by query: This can be used to update all the documents that match a query.
  • Delete by query: This can be used to delete all the documents that match a query.
...
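To make the Bulk API concrete, the sketch below builds a `_bulk` request body, assuming a hypothetical `products` index with a `product` type (types still exist in the 5.x era this book covers). The body is newline-delimited JSON: each action metadata line is followed by a source line for index actions, while delete actions carry no source.

```python
import json

# Hypothetical actions: one index operation (with a source document)
# and one delete operation (no source line).
actions = [
    ({"index": {"_index": "products", "_type": "product", "_id": "1"}},
     {"name": "Learning Elasticsearch", "price": 44.99}),
    ({"delete": {"_index": "products", "_type": "product", "_id": "2"}},
     None),
]

lines = []
for action, source in actions:
    lines.append(json.dumps(action))
    if source is not None:
        lines.append(json.dumps(source))
# The NDJSON body must end with a trailing newline.
bulk_body = "\n".join(lines) + "\n"

# This body would be sent as:
#   POST /_bulk   (Content-Type: application/x-ndjson)
print(bulk_body)
```

Note that the whole batch is one HTTP request, but each action in it succeeds or fails independently; the response reports a status per item.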

Reindex API

Before Elasticsearch 5.0, to change the settings or the mapping of an index, you had to create a new index and reindex the data. Reindexing a large index is usually a lot of work, which involves reading the data from a source such as a SQL database, transforming the data into Elasticsearch documents, and loading the data into Elasticsearch. For large applications, batch processing engines such as Hadoop are used to reindex the data. Depending on how big the index is or how complicated the ETL (Extract, Transform, Load) process is, reindexing can be very expensive. To solve this, the Reindex API was introduced. The original JSON document used for indexing is stored in the _source field, which the Reindex API uses to reindex the documents. The Reindex API can be used for the following:

  • To change the mapping/settings of an existing index
  • To combine documents...
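A minimal sketch of a Reindex API request body follows, assuming hypothetical `products_v1` and `products_v2` indices: the new index is created beforehand with the corrected mapping/settings, and `_reindex` copies the stored `_source` of each document into it.

```python
import json

# Sketch of a _reindex body: copy every document from the (hypothetical)
# products_v1 index into products_v2, which was created with the new mapping.
reindex_body = {
    "source": {"index": "products_v1"},
    "dest": {"index": "products_v2"},
}

# This body would be sent as: POST /_reindex
print(json.dumps(reindex_body, indent=2))
```

The `source` section also accepts a `query`, so a subset of documents can be reindexed instead of the whole index.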

Ingest Node

Traditionally, Logstash is used to preprocess your data before indexing into Elasticsearch. Using Logstash, you can define pipelines to extract, transform, and index your data into Elasticsearch.

In Elasticsearch 5.0, the ingest node was introduced. Using an ingest node, you can define pipelines that modify documents before they are indexed. A pipeline is a series of processors, each working on one or more fields in the document. The most commonly used Logstash filters are available as processors: for example, using a grok processor to extract data from an Apache log line into structured fields, extracting fields from JSON, changing the date format, calculating the geo-distance from a location, and so on. The possibilities are endless. Elasticsearch supports many processors out of the box, and you can also develop your own processors using any JVM-supported language.

By default...
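As an illustration, the sketch below defines an ingest pipeline with two common processors: grok, to parse a raw Apache log line held in the `message` field, and date, to normalize the parsed timestamp. The pipeline id `apache_logs` and the field names are hypothetical.

```python
import json

# Sketch of an ingest pipeline definition (hypothetical id: apache_logs).
pipeline = {
    "description": "Parse Apache access logs",
    "processors": [
        # grok: extract structured fields from the raw log line in "message".
        {"grok": {"field": "message",
                  "patterns": ["%{COMMONAPACHELOG}"]}},
        # date: parse the extracted timestamp into a proper date field.
        {"date": {"field": "timestamp",
                  "formats": ["dd/MMM/yyyy:HH:mm:ss Z"]}},
    ],
}

# Registered as: PUT /_ingest/pipeline/apache_logs
# Used at index time: PUT /logs/log/1?pipeline=apache_logs
print(json.dumps(pipeline, indent=2))
```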

Organizing your data

In this section, we will discuss how to divide your data into multiple indices. Elasticsearch provides index aliases, which make it very easy to query multiple indices at once. It also supports index templates to configure automatic index creation. We will also discuss how to deal with time-based data, such as logs, which is a common Elasticsearch use case.

Index alias

An index alias is a pointer to one or more indices. A search operation executed against an alias runs across all the indices the alias points to: the coordinating node executes the request on each of them, collects the results, and sends them back to the client. The index operation, on the other hand, cannot be executed on an alias...
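Aliases are managed through the `_aliases` endpoint, whose `actions` array is applied atomically. The sketch below swaps an alias from an old monthly index to a new one (index names are hypothetical), so clients searching `logs` never see a gap.

```python
import json

# Sketch of an atomic alias swap via POST /_aliases.
# The "logs" alias is moved from last month's index to the current one
# (both index names are hypothetical).
alias_actions = {
    "actions": [
        {"remove": {"index": "logs-2017-05", "alias": "logs"}},
        {"add":    {"index": "logs-2017-06", "alias": "logs"}},
    ]
}

print(json.dumps(alias_actions, indent=2))
```

Because both actions are applied in a single atomic step, there is no moment at which the alias points to neither index.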

Shrink API

The Shrink API is used to shrink an existing index into a new index with fewer shards. If the data in an index is no longer changing, the index can be optimized for search and aggregation by reducing the number of shards. The number of shards in the destination index must be a factor of the number in the original index; for example, an index with 6 primary shards can be shrunk into 3, 2, or 1 shards. When working with time-based data, such as logs, data is only indexed into the current index, and older indices are mostly read-only. The Shrink API doesn't reindex the documents; it simply relinks the index segments to the new index.

To shrink an index, the index should be marked as read-only, and either a primary or a replica of every shard of the index should be moved to a single node. We can force the allocation of the shards to one node and mark the index as read-only as shown...
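The two preparation steps and the shrink call itself can be sketched as follows, with hypothetical index (`logs-2017-05`) and node (`node-1`) names:

```python
import json

# Step 1 body: block writes and force all shards of the source index onto
# one node (node name is hypothetical).
#   PUT /logs-2017-05/_settings
prepare_settings = {
    "settings": {
        "index.blocks.write": True,
        "index.routing.allocation.require._name": "node-1",
    }
}

# Step 2 body: shrink into a new one-shard index.
#   POST /logs-2017-05/_shrink/logs-2017-05-shrunk
shrink_body = {
    "settings": {"index.number_of_shards": 1},
}

print(json.dumps(prepare_settings, indent=2))
print(json.dumps(shrink_body, indent=2))
```

Once the shrunken index is green and verified, the original can be deleted and an alias pointed at the new index.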

Summary

In this chapter, we discussed the various bulk operations Elasticsearch supports. You also learned about the Reindex and Shrink APIs, which can be used to change the configuration of an existing index, such as the number of shards or the mapping, without rebuilding it from the original data source.

We covered how to organize your data in Elasticsearch using aliases and index templates, and how to use an ingest node to preprocess your data before indexing. You learned how to use an ingest node to transform unstructured log data into JSON documents and automatically index them into a month-based index.

In the next chapter, we will discuss different ways of querying Elasticsearch.
