You're reading from Learning Elasticsearch

Product type: Book

Published in: Jun 2017

Publisher: Packt

ISBN-13: 9781787128453

Edition: 1st Edition

Tools

Elasticsearch Kibana

Concepts

Enterprise Search

Author (1)

Abhishek Andhavarapu

How does search work?

In the previous section, we discussed how to create, update, and delete documents. In this section, we will briefly discuss how search works internally and explain the basic query APIs. Mostly, I want to talk about the inverted index and Apache Lucene. All the data in Elasticsearch is stored internally in Apache Lucene as an inverted index. Although the data is stored in Apache Lucene, Elasticsearch is what makes it distributed and provides the easy-to-use APIs. We will discuss the Search API in detail in Chapter 6, All About Search.

Importance of information retrieval

As computing power increases and the cost of storage decreases, the amount of day-to-day data we deal with is growing exponentially. But without a way to retrieve and query that information, the data we collect doesn't help.

Information retrieval systems are very important to make sense of the data. Imagine how hard it would be to find some information on the Internet without Google or other search engines out there. Information is not knowledge without information retrieval systems.

Simple search query

Let's say we have a User table as shown here:

Id	Name	Age	Gender	Email
1	Luke	100	M	luke@gmail.com
2	Leia	100	F	leia@gmail.com

Now, we want to query for all the users with the name Luke. A SQL query to achieve this would be something like this:

select * from user where name like '%luke%'
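To try the SQL side of this comparison without a database server, here is a minimal sketch using Python's built-in sqlite3 module. The choice of SQLite and the in-memory database are assumptions made only for illustration; the table contents come from the User table above.

```python
import sqlite3

# In-memory SQLite database, used here purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "user" ('
    "id INTEGER PRIMARY KEY, name TEXT, age INTEGER, gender TEXT, email TEXT)"
)
conn.executemany(
    'INSERT INTO "user" VALUES (?, ?, ?, ?, ?)',
    [
        (1, "Luke", 100, "M", "luke@gmail.com"),
        (2, "Leia", 100, "F", "leia@gmail.com"),
    ],
)

# SQLite's LIKE is case-insensitive for ASCII letters by default,
# so the lowercase pattern still matches "Luke".
rows = conn.execute(
    'SELECT * FROM "user" WHERE name LIKE ?', ("%luke%",)
).fetchall()
print(rows)  # [(1, 'Luke', 100, 'M', 'luke@gmail.com')]
```

Note that a pattern with a leading wildcard generally forces the database to inspect every row, which is the kind of scan an inverted index is designed to avoid.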

To do a similar task in Elasticsearch, you can use the search API and execute the following command:

GET http://127.0.0.1:9200/chapter1/user/_search?q=name:luke

Let's inspect the request:

INDEX	chapter1
TYPE	user
FIELD	name

Just as the SQL query returns the matching rows of the User table, the Elasticsearch query returns the matching JSON documents:

{
  "id": 1,
  "name": "Luke",
  "age": 100,
  "gender": "M",
  "email": "luke@gmail.com"
}

Querying using the URL parameters can be used for simple queries as shown above. For more practical queries, you should pass the query represented as JSON in the request body. The same query passed in the request body is shown here:

POST http://127.0.0.1:9200/chapter1/user/_search
{
  "query": {
    "term": {
      "name": "luke"
    }
  }
}
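The two request styles above can be put together from a script. The following is a minimal sketch using only the Python standard library; the helper names uri_search_url and body_search_request are hypothetical, and the host, index, and type are the ones used in this chapter. Nothing is sent over the network here; the requests are only constructed.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

# Address of the local node used in this chapter's examples.
BASE_URL = "http://127.0.0.1:9200"


def uri_search_url(index, doc_type, field, value):
    """Build the URL for a simple query passed as a URL parameter."""
    # safe=":" keeps the field:value separator readable in the URL.
    query = urlencode({"q": f"{field}:{value}"}, safe=":")
    return f"{BASE_URL}/{index}/{doc_type}/_search?{query}"


def body_search_request(index, doc_type, field, value):
    """Build a POST request carrying the same query as JSON in the body."""
    body = {"query": {"term": {field: value}}}
    return Request(
        url=f"{BASE_URL}/{index}/{doc_type}/_search",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


url = uri_search_url("chapter1", "user", "name", "luke")
request = body_search_request("chapter1", "user", "name", "luke")
print(url)
print(request.get_method(), request.full_url)
print(request.data.decode("utf-8"))
```

With a node running locally, passing the request object to urllib.request.urlopen would execute the search; that step is left out so the sketch runs without a cluster.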

The Search API is very flexible and supports different kinds of filtering, sorting, pagination, and aggregations.

Inverted index

Before we talk more about search, I want to talk about the inverted index. Knowing about the inverted index will help you understand the limitations and strengths of Elasticsearch compared with traditional database systems. The inverted index at its core is what makes Elasticsearch different from other NoSQL stores, such as MongoDB, Cassandra, and so on.

We can compare an inverted index to an old library card catalog. When you need a book or some information in a library, you use the card catalog, usually at the entrance of the library, to find the book. An inverted index is similar to the card catalog. Imagine that you were to build a system like Google to search for the web pages mentioning your search keywords. We have three web pages with Yoda quotes from Star Wars, and you are searching for all the documents with the word fear.

Document1: Fear leads to anger

Document2: Anger leads to hate

Document3: Hate leads to suffering

In a library, without a card catalog to find the book you need, you would have to go to every shelf row by row, look at each book title, and see whether it's the book you need. Without an index, a computer-based information retrieval system has to do the same.

Without the inverted index, the application has to go through each web page and check whether the word exists in the web page. An inverted index is similar to the following table. It is like a map with the term as the key and the list of documents the term appears in as the value.

Term	Document
Fear	1
Anger	1,2
Hate	2,3
Suffering	3
Leads	1,2,3
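The table above can be reproduced with a few lines of code. This is a minimal sketch, not how Lucene actually builds its index: it lowercases the text, splits on whitespace, and skips the word "to" so that the result matches the table. Treating "to" as a stop word and the function name build_inverted_index are assumptions made for this example.

```python
from collections import defaultdict

# The three example documents, keyed by document ID.
DOCS = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

# Assumption: "to" is skipped because the table above does not list it.
STOP_WORDS = {"to"}


def build_inverted_index(docs):
    """Map each term to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            if term not in STOP_WORDS:
                index[term].add(doc_id)
    return index


index = build_inverted_index(DOCS)
for term in ("fear", "anger", "hate", "suffering", "leads"):
    print(term, sorted(index[term]))

# Finding every document that mentions "fear" is a single dictionary lookup.
print(sorted(index["fear"]))  # [1]
```

Adding a new document only requires adding its ID to the entry of each term it contains, which mirrors adding a new book to the card catalog.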

Once we construct an index, as shown in the preceding table, finding all the documents with the term fear is just a lookup. Just as a new book is added to the card catalog when the library receives it, we keep building the inverted index as we encounter new web pages. The preceding inverted index takes care of simple use cases, such as searching for a single term. But in reality, we query for much more complicated things, and we don’t use the exact words. Now let’s say we encountered a document containing the following:

Yosemite national park may be closed for the weekend due to forecast of substantial rainfall

We want to visit Yosemite National Park, and we are looking for the weather forecast in the park. But when we phrase the query in human language, we might search for something like weather in yosemite or rain in yosemite. With the current approach, we cannot answer this query well, because the key query terms weather and rain do not appear in the document exactly; the document contains rainfall, not rain, as shown:

Document	Query
rainfall	rain

To be able to answer queries like this and to improve the search quality, we employ various techniques, such as stemming and synonyms, which are discussed in the following sections.

Stemming

Stemming is the process of reducing a derived word to its root word. For example, rain, raining, rained, and rainfall have the common root word "rain". When a document is indexed, the root word is stored in the index instead of the actual word. Without stemming, we would end up storing rain, raining, and rained as separate terms in the index, and a query for one form would miss documents containing the others, so search relevance would be very low. The query terms also go through the stemming process, and the root words are looked up in the index. Stemming increases the likelihood of users finding what they are looking for. When we query for rain in yosemite, even though the document originally had rainfall, the inverted index will contain the term rain.

We can configure stemming in Elasticsearch using Analyzers. We will discuss how to set up and configure analyzers in Chapter 3, Modeling Your Data and Document Relations.
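To make the idea concrete, here is a deliberately naive stemming sketch. It is not the algorithm Elasticsearch or Lucene uses; the suffix list (including "fall") is a hypothetical one, chosen only so that the words from this section reduce to the same root. Real stemming algorithms are considerably more careful.

```python
# Hypothetical suffix list, chosen to cover the examples in this section.
SUFFIXES = ("fall", "ing", "ed", "s")


def stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Apply the same stemming to documents at index time and to queries."""
    return {stem(word) for word in text.split()}


document = (
    "Yosemite national park may be closed for the weekend "
    "due to forecast of substantial rainfall"
)
query = "rain in yosemite"

print([stem(w) for w in ("rain", "raining", "rained", "rainfall")])
# ['rain', 'rain', 'rain', 'rain']

# After stemming, the query and the document share the term "rain".
common = analyze(document) & analyze(query)
print(sorted(common))  # ['rain', 'yosemite']
```

The important design point is that the same analysis runs on both sides: the document at index time and the query at search time.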

Synonyms

Just as rain and raining can be treated as the same term, weekend and Sunday can be treated as meaning the same thing. The document might not contain Sunday, but if the information retrieval system can also search for synonyms, it will significantly improve the search quality. Human language deals with a lot of things, such as tense, gender, and number. Stemming and synonyms not only improve the search quality but also reduce the index size by removing the differences between similar words.

More examples:

Pen, Pen[s] -> Pen

Eat, Eating -> Eat
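Synonym handling can be sketched in the same spirit: map each synonym to one canonical term, both when indexing and when querying. The mapping below and the function name normalize are hypothetical and exist only for this example; they are not an Elasticsearch API.

```python
# Hypothetical synonym map: each key is rewritten to its canonical term.
SYNONYMS = {"sunday": "weekend"}


def normalize(text):
    """Lowercase the text and replace known synonyms with the canonical term."""
    return [SYNONYMS.get(token, token) for token in text.lower().split()]


document_terms = set(normalize("Yosemite national park may be closed for the weekend"))
query_terms = set(normalize("yosemite closed Sunday"))

print(normalize("closed Sunday"))  # ['closed', 'weekend']

# Every query term is now found in the document, although the
# document never mentions Sunday.
print(query_terms <= document_terms)  # True
```

Because both forms collapse to a single term, the index also stores one entry instead of two.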

Phrase search

As users, we almost always search for phrases rather than single words. The inverted index in the previous section works well for individual terms but not for phrases. Continuing the previous example, if we want to find all the documents containing the phrase anger leads to, the previous index is not sufficient. The inverted index for the terms anger and leads is shown below:

Term	Document
Anger	1,2
Leads	1,2,3

From the preceding table, the words anger and leads exist in both document1 and document2, but the table cannot tell us in which order they appear. To support phrase search, we need to record the position of the word in the document along with the document itself. The inverted index with word positions, written as document:position, is shown here:

Term	Document:Position
Fear	1:1
Anger	1:3, 2:1
Hate	2:3, 3:1
Suffering	3:3
Leads	1:2, 2:2, 3:2

Now, since we have the information regarding the position of the word, we can search if a document has the terms in the same order as the query.

Term	Document:Position
anger	1:3, 2:1
leads	1:2, 2:2

Since document2 has anger as the first word and leads as the second word, the same order as the query, document2 would be a better match than document1. With the inverted index, any query on the documents is just a simple lookup. This is just an introduction to the inverted index; in real life, it's much more complicated, but the fundamentals remain the same. When documents are indexed into Elasticsearch, they are processed into the inverted index.
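The positional index and the phrase check described above can be sketched as follows. This is an illustration under stated assumptions, not Lucene's implementation: positions start at 1 and the word "to" is skipped, which is what reproduces the positions in the tables above. The function names are hypothetical.

```python
from collections import defaultdict

DOCS = {
    1: "Fear leads to anger",
    2: "Anger leads to hate",
    3: "Hate leads to suffering",
}

# Assumption: "to" is skipped, which yields the positions shown in the tables.
STOP_WORDS = {"to"}


def tokenize(text):
    return [word for word in text.lower().split() if word not in STOP_WORDS]


def build_positional_index(docs):
    """Map term -> {document ID -> list of 1-based positions}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for position, term in enumerate(tokenize(text), start=1):
            index[term][doc_id].append(position)
    return index


def phrase_search(index, phrase):
    """Return IDs of documents where the phrase terms occur consecutively."""
    terms = tokenize(phrase)
    if not terms:
        return []
    postings = [index.get(term, {}) for term in terms]
    # Only documents that contain every term can match the phrase.
    candidates = set(postings[0])
    for posting in postings[1:]:
        candidates &= set(posting)
    matches = []
    for doc_id in sorted(candidates):
        for start in postings[0][doc_id]:
            # Term i of the phrase must sit exactly i positions after the first.
            if all(
                start + offset in postings[offset][doc_id]
                for offset in range(1, len(terms))
            ):
                matches.append(doc_id)
                break
    return matches


index = build_positional_index(DOCS)
print(dict(index["anger"]))  # {1: [3], 2: [1]}
print(phrase_search(index, "anger leads to"))  # [2]
```

Document1 contains both terms but in the opposite order (leads at position 2, anger at position 3), so only document2 satisfies the phrase.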

Apache Lucene

Apache Lucene is one of the most mature implementations of the inverted index. Lucene is an open source, high-performance full-text search library written entirely in Java. Any application that requires text search can use Lucene to add full-text search capabilities. Elasticsearch uses Apache Lucene to create and manage its inverted index. To learn more about Apache Lucene, please visit http://lucene.apache.org/core/.

We will talk about how distributed search works in Elasticsearch in the next section.

The term index is used both for the Apache Lucene inverted index and for an Elasticsearch index. For the remainder of the book, unless otherwise specified, the term index refers to an Elasticsearch index.

Author (1)

Abhishek Andhavarapu

Abhishek Andhavarapu is a software engineer at eBay who enjoys working on highly scalable distributed systems. He has a master's degree in Distributed Computing and has worked on multiple enterprise Elasticsearch applications, which are currently serving hundreds of millions of requests per day. He began his journey with Elasticsearch in 2012 to build an analytics engine to power dashboards and quickly realized that Elasticsearch is like nothing out there for search and analytics. He has been a strong advocate since then and wrote this book to share the practical knowledge he gained along the way.