Chapter 4. Analysis and Analyzers

In the previous chapter, we looked at the basic concepts and definitions of mapping. We talked about metadata fields and data types. Then, we discussed the relationship between mapping and relevant search results. Finally, we tried to gain a good grasp of what schema-less means in Elasticsearch.

In this chapter, we will review the analysis process and analyzers. We will examine tokenizers and look closely at character and token filters. In addition, we will review how to add analyzers to an Elasticsearch configuration. By the end of this chapter, we will have covered the following topics:

  • What is the analysis process?

  • What are built-in analyzers?

  • What do tokenizers, character filters, and token filters do?

  • What is text normalization?

  • How do we create custom analyzers?

Introducing analysis


As mentioned in Chapter 1, Introduction to Efficient Indexing, a huge amount of data is produced at every moment in today's world of information technology, on various platforms such as social media and by medium and large-sized companies that provide services in communication, health, security, and other areas. Moreover, such data is initially in an unstructured form.

This point of view on big data takes into account three basic needs:

  • Recording data with high performance

  • Accessing data with high performance

  • Analyzing data

Big data solutions are mostly related to the aforementioned three basic needs.

Data should be recorded with high performance so that it can also be accessed with high performance; however, this alone is not enough. To get the real meaning of the data, it must be analyzed.

Thanks to data analysis, well-established search engines like Google and many social media platforms like Facebook and Twitter are...

Process of analysis


We mentioned in Chapter 1, Introduction to Efficient Indexing, and Chapter 2, What is an Elasticsearch Index?, that all of Apache Lucene's data is stored in the inverted index. This means that the data is transformed. The process of transforming data is called analysis. The analysis process relies on two basic pillars: tokenizing and normalizing.

The first step of the analysis process is to break the text into tokens for the inverted index, using a tokenizer, after the text has been processed by the character filters. Then, these tokens (that is, terms) are normalized to make them easily searchable.
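To make this concrete, the _analyze API lets us observe both steps. The following is a minimal illustration (the sample text is arbitrary); the standard analyzer first breaks the text into tokens and then lowercases them:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Reading about Analyzers'

The response contains the terms reading, about, and analyzers, each with its position and offsets.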

These inverted index processes are performed by analyzers. Generally, an analyzer is composed of a tokenizer and one or more token filters. At indexing time, when Elasticsearch processes a field that must be indexed, it checks whether an analyzer is defined, because an analyzer can be specified at several levels.

The check order is as follows:

  1. At field level

  2. At type level

  3. At...

Built-in analyzers


Elasticsearch comes with several analyzers in its standard installation. Some of them are described below:

  • Standard Analyzer: This uses the Standard Tokenizer to divide text. Other components are the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter. It normalizes tokens, lowercases tokens, and also removes unwanted tokens. By default, Elasticsearch applies the standard analyzer.

  • Simple Analyzer: This uses the Letter Tokenizer to divide text. Another component is the Lower Case Tokenizer. It lowercases tokens.

  • Whitespace Analyzer: This uses the Whitespace Tokenizer to divide text at spaces.

  • Stop Analyzer: This uses the Letter Tokenizer to divide text. Other components are the Lower Case Tokenizer and the Stop Token Filter. It removes stop words from token streams.

  • Pattern Analyzer: This uses a regular expression to divide text. It accepts lowercase and stopwords settings.

  • Language Analyzers: A set of analyzers that analyze the text for a specific...
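To see how these analyzers differ, you can run the same text through two of them with the _analyze API (a quick sketch; the sample text is arbitrary):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Brown-Foxes Jumped'
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'Brown-Foxes Jumped'

The standard analyzer returns the lowercased terms brown, foxes, and jumped, whereas the whitespace analyzer only splits at spaces and returns Brown-Foxes and Jumped unchanged.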

What's text normalization?


Text normalization is the process of transforming text into a common form. This is necessary in order to remove insignificant differences between otherwise identical words.

Let's look at the word déjà-vu as an example.

The word deja-vu is not equal to déjà-vu in a string comparison. Even Déjà-vu is not equal to déjà-vu. Similarly, Michè'le is not equal to Michèle. None of these words (that is, tokens) are equal, because Elasticsearch makes the comparison at the byte level. This means that for two tokens to be considered the same, they need to consist of exactly the same bytes.

However, these words have similar meanings. In other words, the same thing is being sought when one user searches for the word déjà-vu and another searches for deja-vu or deja vu. It should also be noted that the Unicode standard allows you to create equivalent text in multiple ways.

For example, take letters é (Latin Capital letter e with grave) and é (Latin Capital letter e with acute...

ICU analysis plugin


Elasticsearch has an ICU analysis plugin. You can use this plugin to work with the forms mentioned in the previous section, thus ensuring that all of your tokens are in the same form. Note that the plugin must be compatible with the version of Elasticsearch on your machine:

bin/plugin install elasticsearch/elasticsearch-analysis-icu/2.7.0

After installation, the plugin registers itself by default under icu_normalizer or icuNormalizer. You can see an example of its usage as follows:

curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "nfkc_normalizer" ]
        }
      }
    }
  }
}'

The preceding configuration lets us normalize all tokens into the NFKC normalization form.
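As a quick check (a sketch that assumes the my_index index above was created successfully), you can send some text through the new analyzer and confirm that the returned tokens come back in the normalized form:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_normalizer&pretty' -d 'déjà vu'

For example, NFKC folds compatibility characters such as the ﬁ ligature into their plain form (fi), so visually equivalent inputs produce identical tokens in the index.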

Note

If you want more information about the ICU, refer to http://site.icu...

An Analyzer Pipeline


If we have a good grasp of the analysis process described so far, the pipeline of an analyzer should be as shown in the following picture:

Text to be analyzed is first processed by the character filters. Then, the tokenizer divides the text and tokens are obtained. In the final step, the token filters modify the tokens.
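The following sketch shows such a pipeline in a single custom analyzer (the index and analyzer names here are only placeholders): the html_strip character filter removes markup, the standard tokenizer divides the remaining text, and the lowercase token filter modifies the resulting tokens:

curl -XPUT localhost:9200/pipeline_example -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pipeline_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'

curl -XGET 'localhost:9200/pipeline_example/_analyze?analyzer=my_pipeline_analyzer&pretty' -d '<b>Some TEXT</b>'

The second request should return only the terms some and text: the markup is stripped before tokenization, and the tokens are lowercased afterwards.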

Specifying the analyzer for a field in the mapping


You can define an analyzer for a field with both the index_analyzer and the search_analyzer members in the mapping process. Also, Elasticsearch allows you to use different analyzers in separate fields.

The following command shows us a mapping in which analyzers are defined for the fields:

curl -XPUT localhost:9200/blog -d '{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "string", "index_analyzer": "simple"
        },
        "content": {
          "type": "string", "index_analyzer": "whitespace", "search_analyzer": "standard"
        }
      }
    }
  }
}'
{"acknowledged":true}

With the preceding configuration, we defined the simple analyzer for the title field and the whitespace analyzer for the content field. Also, the search analyzer of the content field refers to the standard analyzer.
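If you want to verify which analyzer a field actually uses, the _analyze API accepts a field parameter (a small sketch against the blog index we just created):

curl -XGET 'localhost:9200/blog/_analyze?field=title&pretty' -d 'My Boss and I'

Because the title field is mapped with the simple analyzer, the text is divided at non-letter characters and lowercased; running the same request with field=content applies that field's index analyzer (whitespace) instead.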

Now, we will add a document to the blog index as follows:

curl -XPOST localhost:9200/blog/article -d '{
  "title": "My boss's...

Summary


In this chapter, we looked at the analysis process and reviewed the building blocks of an analyzer. After this, we learned what character filters, tokenizers, and token filters are, and how to specify different analyzers in separate fields. Finally, we saw how to create a custom analyzer. In the next chapter, you'll discover the anatomy of an Elasticsearch cluster: what a shard is, what a replica shard is, what function a replica shard performs, and so on. In addition, we will examine questions such as how do we configure a cluster correctly? and how do we determine the correct number of shards and replicas? We will also look at some relevant cases related to this topic.
