Chapter 4. Analysis and Analyzers

In the previous chapter, we looked at the basic concepts and definitions of mapping. We talked about metadata fields and data types. Then, we discussed the relationship between mapping and relevant search results. Finally, we tried to gain a good grasp of what schema-less means in Elasticsearch.

In this chapter, we will review the analysis process and analyzers. We will examine tokenizers and look closely at character and token filters. In addition, we will review how to add analyzers to an Elasticsearch configuration. By the end of this chapter, we will have covered the following topics:

  • What is the analysis process?

  • What are built-in analyzers?

  • What do tokenizers, character filters, and token filters do?

  • What is text normalization?

  • How do we create custom analyzers?

Introducing analysis


As mentioned in Chapter 1, Introduction to Efficient Indexing, a huge amount of data is produced at every moment in today's world of information technology, on various platforms such as social media and by medium and large-sized companies that provide services in communication, health, security, and other areas. Moreover, such data is initially in an unstructured form.

This point of view on big data takes into account three basic needs:

  • Recording data with high performance

  • Accessing data with high performance

  • Analyzing data

Big data solutions are mostly related to the aforementioned three basic needs.

Data should be recorded with high performance so that it can also be accessed with high performance; however, this alone is not enough. To get the real meaning of the data, it must be analyzed.

Thanks to data analysis, well-established search engines like Google and many social media platforms like Facebook and Twitter are...

Process of analysis


We mentioned in Chapter 1, Introduction to Efficient Indexing, and Chapter 2, What is an Elasticsearch Index?, that all of Apache Lucene's data is stored in the inverted index. This means that the data is transformed. The process of transforming data is called analysis. The analysis process relies on two basic pillars: tokenizing and normalizing.

The first step of the analysis process is to break the text into tokens for the inverted index, using a tokenizer, after the text has been processed by the character filters. Then, these tokens (that is, terms) are normalized to make them easily searchable.
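To make this concrete, the _analyze API lets us observe both steps. The following is a minimal illustration (the sample text is arbitrary); the standard analyzer first breaks the text into tokens and then lowercases them:

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Reading about Analyzers'

The response contains the terms reading, about, and analyzers, each with its position and offsets.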

These inverted index processes are performed by analyzers. Generally, an analyzer is composed of a tokenizer and one or more token filters. At indexing time, when Elasticsearch processes a field that must be indexed, it checks whether an analyzer is defined, because an analyzer can be specified at several levels.

The check order is as follows:

  1. At field level

  2. At type level

  3. At...

Built-in analyzers


Elasticsearch comes with several analyzers in its standard installation. Some of them are described below:

  • Standard Analyzer: This uses the Standard Tokenizer to divide text. Other components are the Standard Token Filter, Lower Case Token Filter, and Stop Token Filter. It normalizes tokens, lowercases tokens, and also removes unwanted tokens. By default, Elasticsearch applies the standard analyzer.

  • Simple Analyzer: This uses the Letter Tokenizer to divide text. Another component is the Lower Case Tokenizer. It lowercases tokens.

  • Whitespace Analyzer: This uses the Whitespace Tokenizer to divide text at spaces.

  • Stop Analyzer: This uses the Letter Tokenizer to divide text. Other components are the Lower Case Tokenizer and the Stop Token Filter. It removes stop words from token streams.

  • Pattern Analyzer: This uses a regular expression to divide text. It accepts lowercase and stopwords settings.

  • Language Analyzers: A set of analyzers that analyze the text for a specific...
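To see how these analyzers differ, you can run the same text through two of them with the _analyze API (a quick sketch; the sample text is arbitrary):

curl -XGET 'localhost:9200/_analyze?analyzer=standard&pretty' -d 'Brown-Foxes Jumped'
curl -XGET 'localhost:9200/_analyze?analyzer=whitespace&pretty' -d 'Brown-Foxes Jumped'

The standard analyzer returns the lowercased terms brown, foxes, and jumped, whereas the whitespace analyzer only splits at spaces and returns Brown-Foxes and Jumped unchanged.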

What's text normalization?


Text normalization is the process of transforming text into a common form. This is necessary in order to remove insignificant differences between otherwise identical words.

Let's look at the word déjà-vu as an example.

The word deja-vu is not equal to déjà-vu in a string comparison. Even Déjà-vu is not equal to déjà-vu. Similarly, Michè'le is not equal to Michèle. None of these words (that is, tokens) are equal, because Elasticsearch makes the comparison at the byte level. This means that for two tokens to be considered the same, they need to consist of exactly the same bytes.

However, these words have similar meanings. In other words, the same thing is being sought when one user searches for the word déjà-vu and another searches for deja-vu or deja vu. It should also be noted that the Unicode standard allows you to create equivalent text in multiple ways.

For example, take letters é (Latin Capital letter e with grave) and é (Latin Capital letter e with acute...

ICU analysis plugin


Elasticsearch has an ICU analysis plugin. You can use this plugin to work with the forms mentioned in the previous section, thus ensuring that all of your tokens are in the same form. Note that the plugin must be compatible with the version of Elasticsearch on your machine:

bin/plugin install elasticsearch/elasticsearch-analysis-icu/2.7.0

After installation, the plugin registers itself by default under icu_normalizer or icuNormalizer. You can see an example of its usage as follows:

curl -XPUT localhost:9200/my_index -d '{
  "settings": {
    "analysis": {
      "filter": {
        "nfkc_normalizer": {
          "type": "icu_normalizer",
          "name": "nfkc"
        }
      },
      "analyzer": {
        "my_normalizer": {
          "tokenizer": "icu_tokenizer",
          "filter":  [ "nfkc_normalizer" ]
        }
      }
    }
  }
}'

The preceding configuration lets us normalize all tokens into the NFKC normalization form.
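As a quick check (a sketch that assumes the my_index index above was created successfully), you can send some text through the new analyzer and confirm that the returned tokens come back in the normalized form:

curl -XGET 'localhost:9200/my_index/_analyze?analyzer=my_normalizer&pretty' -d 'déjà vu'

For example, NFKC folds compatibility characters such as the ﬁ ligature into their plain form (fi), so visually equivalent inputs produce identical tokens in the index.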

Note

If you want more information about the ICU, refer to http://site.icu...

An Analyzer Pipeline


If we have a good grasp of the analysis process described so far, the pipeline of an analyzer should be as shown in the following picture:

Text to be analyzed is first processed by the character filters. Then, the tokenizer divides the text and tokens are obtained. In the final step, the token filters modify the tokens.
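The following sketch shows such a pipeline in a single custom analyzer (the index and analyzer names here are only placeholders): the html_strip character filter removes markup, the standard tokenizer divides the remaining text, and the lowercase token filter modifies the resulting tokens:

curl -XPUT localhost:9200/pipeline_example -d '{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_pipeline_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer": "standard",
          "filter": [ "lowercase" ]
        }
      }
    }
  }
}'

curl -XGET 'localhost:9200/pipeline_example/_analyze?analyzer=my_pipeline_analyzer&pretty' -d '<b>Some TEXT</b>'

The second request should return only the terms some and text: the markup is stripped before tokenization, and the tokens are lowercased afterwards.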

Specifying the analyzer for a field in the mapping


You can define an analyzer for a field with both the index_analyzer and the search_analyzer members in the mapping process. Also, Elasticsearch allows you to use different analyzers in separate fields.

The following command shows us a mapping in which analyzers are defined for the fields:

curl -XPUT localhost:9200/blog -d '{
  "mappings": {
    "article": {
      "properties": {
        "title": {
          "type": "string", "index_analyzer": "simple"
        },
        "content": {
          "type": "string", "index_analyzer": "whitespace", "search_analyzer": "standard"
        }
      }
    }
  }
}'
{"acknowledged":true}

With the preceding configuration, we defined the simple analyzer for the title field and the whitespace analyzer for the content field. Also, the search analyzer of the content field refers to the standard analyzer.
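If you want to verify which analyzer a field actually uses, the _analyze API accepts a field parameter (a small sketch against the blog index we just created):

curl -XGET 'localhost:9200/blog/_analyze?field=title&pretty' -d 'My Boss and I'

Because the title field is mapped with the simple analyzer, the text is divided at non-letter characters and lowercased; running the same request with field=content applies that field's index analyzer (whitespace) instead.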

Now, we will add a document to the blog index as follows:

curl -XPOST localhost:9200/blog/article -d '{
  "title": "My boss's...

Summary


In this chapter, we looked at the analysis process and reviewed the building blocks of an analyzer. After this, we learned what character filters, tokenizers, and token filters are, and how to specify different analyzers in separate fields. Finally, we saw how to create a custom analyzer. In the next chapter, you'll discover the anatomy of an Elasticsearch cluster: what a shard is, what a replica shard is, what function a replica shard performs, and so on. In addition, we will examine questions such as how do we configure a cluster correctly? and how do we determine the correct number of shards and replicas? We will also look at some relevant cases related to this topic.
