
You're reading from  Elasticsearch 7 Quick Start Guide

Product type: Book
Published in: Oct 2019
Publisher: Packt
ISBN-13: 9781789803327
Edition: 1st Edition
Authors (2):
Anurag Srivastava

Anurag Srivastava is a senior technical lead in a multinational software company. He has more than 12 years' experience in web-based application development. He is proficient in designing architecture for scalable and highly available applications. He has handled development teams and multiple clients from all over the globe over the past 10 years of his professional career. He has significant experience with the Elastic Stack (Elasticsearch, Logstash, and Kibana) for creating dashboards using system metrics data, log data, application data, and relational databases. He has authored three other books: Mastering Kibana 6.x, Kibana 7 Quick Start Guide, and Learning Kibana 7 - Second Edition, all published by Packt.

Douglas Miller

Douglas Miller is an expert in helping fast-growing companies to improve performance and stability, and in building search platforms using Elasticsearch. Clients (including Walgreens, Nike, Boeing, and Dish Networks) have seen increased sales, faster performance, and a lower total cost of ownership for their Elasticsearch clusters.

Prepping Your Data – Text Analysis and Mapping

In the last chapter, we explained the distributed model of Elasticsearch and covered the different APIs that it supports. Here, we will discuss Elasticsearch analyzers and mapping, both of which are important aspects of data preparation, because we need to shape our data to get relevant search results. Elasticsearch is flexible when it comes to text analysis: it provides many built-in analyzers to choose from, and we can also create our own.

Mapping can be dynamic or explicit. In dynamic mapping, Elasticsearch infers the datatype of each field, and the inferred type is sometimes not the one we want; in explicit mapping, we define the mapping ourselves before indexing the actual data.
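
To make the difference concrete, here is a minimal Python sketch of how a dynamic guess can go wrong. It is a simplified imitation of the default dynamic mapping rules, not Elasticsearch's own code: it ignores date detection and the keyword sub-field that Elasticsearch adds to strings. The function name `guess_field_type` and the sample document are made up for illustration.

```python
def guess_field_type(value):
    """Roughly imitate default dynamic mapping for one JSON value (simplified)."""
    # bool must be tested before int, because bool is a subclass of int in Python
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "float"
    if isinstance(value, dict):
        return "object"
    if isinstance(value, str):
        # A numeric-looking string is still a string, so it is guessed as text
        return "text"
    return "unknown"


# Hypothetical document: the price arrives as a string, not as a number
doc = {"name": "widget", "price": "19.99", "quantity": 3, "in_stock": True}

# What a dynamic guess would produce for each field
dynamic_guess = {field: guess_field_type(value) for field, value in doc.items()}

# An explicit mapping states the intended type up front instead
explicit_mapping = {
    "mappings": {
        "properties": {
            "name": {"type": "text"},
            "price": {"type": "float"},
            "quantity": {"type": "long"},
            "in_stock": {"type": "boolean"},
        }
    }
}

print(dynamic_guess)
```

In this sketch the price field is guessed as text because the value is a quoted string, whereas the explicit mapping declares it as a float before any data is indexed.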

In this chapter, we are going to cover the following:

  • What is an analyzer?
  • Anatomy of an analyzer
  • How to use an analyzer
  • Normalizers...

What is an analyzer?

Elasticsearch defines analysis as the process of converting text into tokens, which are then added to the inverted index and used for searching. All analysis is performed by an analyzer. For example, at index time, the built-in english analyzer converts a sentence into distinct words; these distinct words are the tokens. Take the following sentence as an example:

Hello World! This is my first program and these are my first lines.

This will be added to the inverted index as the following:

[hello, world, my, first, program, line]

The analyzer lowercases the text, removes common stop words (here, this, is, and, these, and are), and reduces words to their stems, so lines becomes line. An analyzer can be specified in the mapping of a text field, as shown in the following request:

PUT my_index 
{
"mappings": {
"_doc": {
"properties"...

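As a rough illustration of this kind of request, the following Python sketch builds a complete mapping body and prints it in console form. It is not the book's listing: the field name `title` and the choice of the english analyzer are assumptions made for the example, and the body uses the typeless form that Elasticsearch 7 accepts by default, in which the `_doc` level is omitted.

```python
import json

# "my_index" comes from the request in the text; the field name is hypothetical
index_name = "my_index"
field_name = "title"

# Typeless mapping body (no "_doc" level), assuming Elasticsearch 7 defaults
body = {
    "mappings": {
        "properties": {
            # The analyzer is set on the text field itself (assumed: english)
            field_name: {"type": "text", "analyzer": "english"}
        }
    }
}

# Render the request the way it would be typed into the Kibana console
request = f"PUT {index_name}\n{json.dumps(body, indent=2)}"
print(request)
```
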
Anatomy of an analyzer

An analyzer is a package that contains three building blocks: character filters, tokenizers, and token filters. A user can create a custom analyzer by combining these building blocks to get the functionality needed. The building blocks work as follows:

  • Character filters receive the original text as a stream of characters. They can transform the stream by adding, removing, or changing characters. For example, a character filter can change the & character to the word and. An analyzer may have no character filters, or many, but they are always applied in order.
  • Tokenizers receive the stream of characters and break it down into tokens. The output will then be a stream of tokens. For example, a whitespace tokenizer splits the text on whitespace, so Hello World! becomes [Hello, World!]; lowercasing and punctuation removal are left to other components. It also records the order of the...
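
The flow through these building blocks can be sketched in plain Python. This is a toy model under simple assumptions, not Elasticsearch's implementation: the function names are made up, the character filter only replaces &, the tokenizer only splits on whitespace, and a single lowercase step stands in for the token filters.

```python
def char_filter(text):
    """Character filter: rewrite characters before tokenization."""
    return text.replace("&", "and")


def whitespace_tokenizer(text):
    """Tokenizer: split on whitespace and record each token's position."""
    return [(position, token) for position, token in enumerate(text.split())]


def lowercase_filter(tokens):
    """Token filter: change tokens without touching their positions."""
    return [(position, token.lower()) for position, token in tokens]


def analyze(text):
    """Apply the stages in order: character filter, tokenizer, token filter."""
    return lowercase_filter(whitespace_tokenizer(char_filter(text)))


tokens = analyze("Salt & Pepper")
print(tokens)
```

Each stage only sees the output of the stage before it, which is why the order of character filters and token filters matters when we build a custom analyzer.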

Mapping

Elasticsearch mapping defines the structure of a document: the fields it contains and their datatypes. A mapping has two kinds of fields: user-defined fields (or properties) and meta fields. User-defined fields are the fields we supply ourselves when mapping or indexing a document, while meta fields carry the metadata associated with a document, for example, _index, _id, and _source.
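
The distinction between the two kinds of fields can be seen in the shape of a stored document. The Python sketch below uses a made-up search hit (the index name, ID, and field values are illustrative only) and separates the meta fields, which start with an underscore, from the user-defined fields held inside _source.

```python
# A made-up search hit; all values are illustrative only
hit = {
    "_index": "my_index",
    "_id": "1",
    "_source": {"title": "Hello World", "views": 10},
}

# Meta fields sit at the top level and start with an underscore
meta_field_names = sorted(name for name in hit if name.startswith("_"))

# User-defined fields are the keys of the original JSON kept in _source
user_field_names = sorted(hit["_source"])

print(meta_field_names)
print(user_field_names)
```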

Datatypes

Each field in an Elasticsearch document has a datatype, and the datatypes fall into three categories, which are as follows.

The simple datatype

...

Summary

In this chapter, we covered text analysis and mapping. We went through the anatomy of an analyzer in Elasticsearch and introduced character filters, tokenizers, and token filters. We also covered how to use an analyzer and looked at the different types of analyzers. After analyzers, we covered normalizers and tokenizers. Finally, we covered token filters and character filters.

In the next chapter, we will explore data searches in depth. We will cover URI search and body search.
