Anatomy of an Analyzer

In Chapter 4, Mapping APIs, we learned about the mapping API; we also mentioned that the analyzer is one of the mapping parameters. In Chapter 1, Overview of Elasticsearch 7, we introduced analyzers and gave an example of the standard analyzer. The building blocks of an analyzer are character filters, tokenizers, and token filters. Together, these components determine how efficiently and accurately a search finds its targets and how relevance is scored; to choose a well-suited analyzer, you must understand the true meaning of your data. In this chapter, we will drill down into the anatomy of the analyzer and demonstrate the use of different analyzers in depth. During an index operation, the contents of a document are processed by an analyzer, and the generated tokens are used to build the inverted index. During a search operation, the query content is processed by a search analyzer to generate tokens...
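To make the index-time versus search-time distinction concrete, here is a minimal sketch of a mapping that sets both the analyzer and search_analyzer parameters on a text field; the index name my_index and the field name description are illustrative, not taken from the book:

    PUT /my_index
    {
      "mappings": {
        "properties": {
          "description": {
            "type": "text",
            "analyzer": "standard",
            "search_analyzer": "standard"
          }
        }
      }
    }

If search_analyzer is omitted, the same analyzer is used at both index time and search time.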

An analyzer's components

The purpose of an analyzer is to generate terms from a document and to build the inverted index (for example, a list of unique words and the IDs of the documents they appear in, or a list of word frequencies). An analyzer must have exactly one tokenizer and, optionally, as many character filters and token filters as you want. Whether built-in or custom, an analyzer is simply a pipeline of these three kinds of building blocks, as illustrated in the following diagram:

Recall from Chapter 1, Overview of Elasticsearch 7 (you can refer to the Analyzer section), that the standard analyzer is composed of a standard tokenizer and a lowercase token filter. The standard tokenizer provides grammar-based tokenization, while the lowercase token filter normalizes tokens to lowercase. Let's suppose that the input string is an HTML text string...
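You can observe these two building blocks directly with the _analyze API; the following is a minimal sketch, and the sample text is our own choice:

    POST /_analyze
    {
      "tokenizer": "standard",
      "filter": ["lowercase"],
      "text": "In Elasticsearch 7.0"
    }

The response lists the tokens ["in", "elasticsearch", "7.0"], the same output the built-in standard analyzer produces for this text later in this chapter.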

Character filters

The main function of a character filter is to convert the original input text into a stream of characters and preprocess it before passing it as input to the tokenizer. Three built-in character filters are supported: html_strip, mapping, and pattern_replace. We'll practice each one using the same input text string as in the previous section.

The html_strip filter

This character filter removes HTML tags (for more information about HTML tags and entities, you can refer to https://www.w3schools.com/html/default.asp) and replaces HTML entities with the corresponding decoded UTF-8 characters. By default, the text content itself is kept unchanged, but HTML comments are removed entirely. Let's suppose...
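A minimal sketch of exercising html_strip through the _analyze API follows; the HTML snippet is our own illustration, and the keyword tokenizer is used only so that the filtered text comes back as a single token:

    POST /_analyze
    {
      "tokenizer": "keyword",
      "char_filter": ["html_strip"],
      "text": "<p>I&apos;m so <b>happy</b>!</p>"
    }

In the response, the <p> and <b> tags are stripped, and the &apos; entity is decoded to an apostrophe.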

Tokenizers

The tokenizer in the analyzer receives the output character stream from the character filters and splits it into a token stream, which is the input to the token filters. The tokenizers supported in Elasticsearch fall into three categories, described as follows:

  • Word-oriented tokenizer: This splits the character stream into individual words.
  • Partial word tokenizer: This breaks the character stream into small fragments of a given length, for partial word matching.
  • Structured text tokenizer: This splits the character stream into known structured tokens, such as keywords, email addresses, and zip codes.

We'll give an example for each built-in tokenizer and compile the results into the following tables. Let's first take a look at the word-oriented tokenizers:

Word-oriented tokenizers

Tokenizer: standard
Input text: "POST https://api.iextrading.com/1.0/stock/acwf...
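As a quick illustration of a word-oriented tokenizer in isolation (a minimal sketch; the sample text is ours, not the table's), you can run the standard tokenizer without any filters:

    POST /_analyze
    {
      "tokenizer": "standard",
      "text": "In Elasticsearch 7.0"
    }

Because no token filter is involved, the tokens keep their original case: ["In", "Elasticsearch", "7.0"].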

Token filters

The main function of a token filter is to add, modify, or delete tokens in the stream produced by the tokenizer. There are approximately 50 built-in token filters. We'll cover some popular token filters in the following table; you can learn about the rest at https://www.elastic.co/guide/en/elasticsearch/reference/7.x/analysis-tokenfilters.html. Each example in the following table uses a standard tokenizer and the specified token filter. Note that no character filter is applied:

Token filter: asciifolding
Input text: "Ÿőű'ľľ ľőνė Ȅľȁśťĩćŝėȁŕćĥ 7.0"
Description: This transforms terms whose letters, numbers, and Unicode symbols are not in the first 127...
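The asciifolding row can be reproduced with the _analyze API; this is a minimal sketch using the input text quoted in the table:

    POST /_analyze
    {
      "tokenizer": "standard",
      "filter": ["asciifolding"],
      "text": "Ÿőű'ľľ ľőνė Ȅľȁśťĩćŝėȁŕćĥ 7.0"
    }

In the response, each accented Latin letter is folded to its plain ASCII equivalent where one exists.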

Built-in analyzers

In this section, we are going to introduce the built-in analyzers. Each built-in analyzer contains one tokenizer and zero or more token filters. The parameters of the underlying token filters can be applied to the analyzer, just as in the previous section. No additional character filters or token filters are added during testing. We'll cover all the supported analyzers and compile the test results into the following table. The input text for all tests will be In Elasticsearch 7.0:

Analyzer     Tokenizer    Token filter                              Output tokens
standard     standard     lowercase + stop (disabled by default)    ["in", "elasticsearch", "7.0"]
simple       lowercase    (none)                                    ["in", "elasticsearch"]
whitespace   whitespace   (none)                                    ["In", "Elasticsearch", "7.0"]
stop         lowercase    stop                                      ["elasticsearch...
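Any row of this table can be reproduced with the _analyze API; a minimal sketch for the whitespace analyzer follows:

    POST /_analyze
    {
      "analyzer": "whitespace",
      "text": "In Elasticsearch 7.0"
    }

The tokens come back as ["In", "Elasticsearch", "7.0"], unchanged in case, because the whitespace analyzer applies no token filter.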

Custom analyzers

Elasticsearch gives you a way to customize your analyzer. The first step is to define the analyzer in the index settings; you can then use it in the mappings. The analyzer can be defined either in a single index or in an index template that applies to multiple indices matching an index pattern. Recall that an analyzer must have exactly one tokenizer and, optionally, many character filters and token filters. Let's create a custom analyzer to extract the tokens that we will use in the next chapter; it contains the following components (a configuration sketch follows the list):

  • tokenizer: Use the char_group tokenizer with separators such as whitespace, digits, punctuation (except hyphens), end-of-line characters, symbols, and more.
  • token filter: Use the pattern_replace, lowercase, stemmer, stop, length, and unique filters.
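The following is a hedged sketch of such a definition; the separator characters, the pattern_replace and length settings, and the custom names are our assumptions, not the book's exact configuration:

    PUT /cf_etf_toy
    {
      "settings": {
        "analysis": {
          "tokenizer": {
            "cf_char_group": {
              "type": "char_group",
              "tokenize_on_chars": ["whitespace", "digit", ".", ",", ";", ":", "!", "?", "(", ")"]
            }
          },
          "filter": {
            "cf_pattern_replace": {
              "type": "pattern_replace",
              "pattern": "^-+|-+$",
              "replacement": ""
            },
            "cf_length": {
              "type": "length",
              "min": 2
            }
          },
          "analyzer": {
            "cf_custom_analyzer": {
              "type": "custom",
              "tokenizer": "cf_char_group",
              "filter": ["cf_pattern_replace", "lowercase", "stemmer", "stop", "cf_length", "unique"]
            }
          }
        }
      }
    }

Note that hyphens are deliberately absent from tokenize_on_chars, so hyphenated terms survive tokenization. With this in place, the analyzer can be referenced from a field mapping in the same index.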

Since the description text will be indexed differently, we need to...

Normalizers

A normalizer behaves like an analyzer, except that it is guaranteed to generate a single token. There is no built-in normalizer. When customizing a normalizer, only character filters and token filters that operate on a per-character basis are allowed, and no tokenizer is involved. Defining a normalizer is similar to defining an analyzer, except that it uses the normalizer keyword instead of analyzer. Let's delete the cf_etf_toy index and recreate it with lowercase_normalizer, which contains a lowercase token filter:
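A minimal sketch of those two requests follows; the structure of the normalizer definition is standard, and any unrelated settings are omitted:

    DELETE /cf_etf_toy

    PUT /cf_etf_toy
    {
      "settings": {
        "analysis": {
          "normalizer": {
            "lowercase_normalizer": {
              "type": "custom",
              "filter": ["lowercase"]
            }
          }
        }
      }
    }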

Then, we apply lowercase_normalizer to a sample text using the _analyze API:
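A hedged sketch of that request (the sample text is our own, not the book's):

    GET /cf_etf_toy/_analyze
    {
      "normalizer": "lowercase_normalizer",
      "text": "Vanguard Total Stock Market ETF"
    }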

In the response body, you can see that only one token is generated.

Summary

Terrific! We now understand the anatomy of the analyzer and have walked through the complete analysis process. We have practiced different character filters, tokenizers, and token filters. We learned how to create a custom analyzer and test it with the _analyze API. Normalizers were also briefly introduced and practiced.

In the next chapter, we will focus on the search API. Its basic functionality allows you to perform a search query and get back the search hits that match the query. Elasticsearch supports the suggest API to help you improve the user experience. The explain API is also included; it computes the score explanation for a query and a specific document, which can provide useful feedback when you are investigating relevance issues. We will also discuss the query domain-specific language (DSL) and the highlighting feature in depth.
