
You're reading from  Lucene 4 Cookbook

Product type: Book
Published in: Jun 2015
Reading level: Expert
ISBN-13: 9781782162285
Edition: 1st Edition
Authors (2):
Edwood Ng

Edwood Ng is a technologist with over a decade of experience in building scalable solutions from proprietary implementations to client-facing web-based applications. Currently, he's the director of DevOps at Wellframe, leading infrastructure and DevOps operations. His background in search engines began at Endeca Technologies in 2004, where he was a technical consultant helping numerous clients to architect and implement faceted search solutions. After Endeca, he drew on his knowledge and began designing and building Lucene-based solutions. His first Lucene implementation that went to production was the search engine behind http://UpDown.com. From there on, he continued to create search applications using Lucene extensively to deliver robust and scalable systems for his clients. Edwood is a supporter of open source software. He has also contributed the sfI18NGettextPluralPlugin plugin to the Symfony project.

Vineeth Mohan

Vineeth Mohan is an architect and developer. He currently works as the CTO at Factweavers Technologies and is also an Elasticsearch-certified trainer. He loves to spend time studying emerging technologies and applications related to data analytics, data visualizations, machine learning, natural language processing, and developments in search analytics. He began coding during his high school days, which later ignited his interest in computer science, and he pursued engineering at Model Engineering College, Cochin. He was recruited by the search giant Yahoo! during his college days. After 2 years of work at Yahoo! on various big data projects, he joined a start-up that dealt with search and analytics. Finally, he started his own big data consulting company, Factweavers. Under his leadership and technical expertise, Factweavers is one of the early adopters of Elasticsearch and has been engaged with projects related to end-to-end big data solutions and analytics for the last few years. There, he got the opportunity to learn various big-data-based technologies, such as Hadoop, and high-performance data ingress systems and storage. Later, he moved to a start-up in his hometown, where he chose Elasticsearch as the primary search and analytic engine for the project assigned to him. Later in 2014, he founded Factweavers Technologies along with Jalaluddeen; it is a consultancy that aims at providing Elasticsearch-based solutions. He is also an Elasticsearch-certified corporate trainer who conducts trainings in India. To date, he has worked on numerous projects that are mostly based on Elasticsearch and has trained numerous multinationals on Elasticsearch.

Chapter 6. Querying and Filtering Data

In this chapter, we will cover the following recipes:

  • Performing advanced filtering

  • Creating a custom filter

  • Searching with QueryParser

  • TermQuery and TermRangeQuery

  • BooleanQuery

  • PrefixQuery and WildcardQuery

  • PhraseQuery and MultiPhraseQuery

  • FuzzyQuery

  • NumericRangeQuery

  • DisjunctionMaxQuery

  • RegexpQuery

  • SpanQuery

  • CustomScoreQuery

Introduction


When it comes to search applications, usability is always a key element that either makes or breaks the user's impression. Lucene does an excellent job of giving you the essential tools to build and search an index. In this chapter, we will look into some more advanced techniques to query and filter data. We will arm you with more knowledge to put into your toolbox so that you can leverage your Lucene knowledge to build a user-friendly search application.

Performing advanced filtering


Before we start, let us revisit these questions: what is a filter and what is it for? In simple terms, a filter is used to narrow the search space or, in other words, to search within a search. Filter and Query may seem to provide the same functionality, but there is a significant difference between the two. Scores are calculated in querying to rank results based on their relevancy to the search terms, while a filter has no effect on scores. It's not uncommon for users to prefer navigating through a hierarchy of filters in order to land on the relevant results. You may often find yourself in a situation where it is necessary to refine a result set so that users can continue to search or navigate within a subset. With the ability to apply filters, we can easily provide such search refinements. Another situation is data security, where some parts of the data in the index are protected. You may need to include an additional filter behind the scenes that...
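The difference between matching and scoring can be sketched in plain Java. This is only an illustrative model, not Lucene's implementation: the class FilterSketch and its search method are hypothetical names, document IDs are array indexes, and a BitSet stands in for a filter. The filter decides which documents survive, while the scores produced by the query pass through unchanged.

```java
import java.util.BitSet;
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative model only (hypothetical class, not a Lucene API):
// docIds are array indexes, and a score of 0 means "the query did not match".
final class FilterSketch {
    // Returns docId -> score for documents that match the query AND pass the filter.
    static Map<Integer, Double> search(double[] queryScores, BitSet filter) {
        Map<Integer, Double> hits = new LinkedHashMap<>();
        for (int docId = 0; docId < queryScores.length; docId++) {
            if (queryScores[docId] > 0 && filter.get(docId)) {
                hits.put(docId, queryScores[docId]); // the filter never changes the score
            }
        }
        return hits;
    }
}
```

A document that scores well but is not in the filter's bit set simply disappears from the results; the remaining documents keep the ranking the query gave them.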

Creating a custom filter


Now that we've seen numerous examples of Lucene's built-in Filters, we are ready for a more advanced topic: custom filters. There are a few important components we need to go over before we start: FieldCache, SortedDocValues, and DocIdSet. We will be using these items in our example to help you gain practical knowledge on the subject.

The FieldCache, as you have already learned, is a cache that stores field values in memory in an array structure. It's a very simple data structure, as the slots in the array basically correspond to DocIds. This is also the reason why FieldCache only works for a single-valued field: a slot in an array can only hold a single value. Since this is just an array, the lookup time is constant and very fast.

The SortedDocValues has two internal data mappings for value lookup: a dictionary that maps an ordinal value to a field value, and a mapping from a DocId to an ordinal value (for the field value). In the dictionary data structure, the values are deduplicated...
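To make the two mappings more concrete, here is a small plain-Java model. It is a sketch under our own assumptions rather than Lucene's actual data structure: SortedValuesSketch and its method names are hypothetical, a sorted list plays the role of the dictionary, and an int array indexed by DocId (much like the FieldCache array described earlier) holds the ordinals.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.TreeSet;

// Illustrative model of the two mappings (hypothetical class, not Lucene's code).
final class SortedValuesSketch {
    private final List<String> dictionary; // ordinal -> value, sorted and deduplicated
    private final int[] docToOrd;          // docId -> ordinal, -1 when the doc has no value

    SortedValuesSketch(String[] valuePerDoc) {
        TreeSet<String> unique = new TreeSet<>();
        for (String value : valuePerDoc) {
            if (value != null) unique.add(value);
        }
        dictionary = new ArrayList<>(unique);
        docToOrd = new int[valuePerDoc.length];
        for (int docId = 0; docId < valuePerDoc.length; docId++) {
            String value = valuePerDoc[docId];
            docToOrd[docId] = (value == null) ? -1 : Collections.binarySearch(dictionary, value);
        }
    }

    int getOrd(int docId) { return docToOrd[docId]; }          // constant-time array lookup
    String lookupOrd(int ord) { return dictionary.get(ord); }  // ordinal -> field value
    int getValueCount() { return dictionary.size(); }          // number of distinct values
}
```

Two documents that share a value share an ordinal, so comparing ordinals is enough to tell whether their field values are equal or how they sort.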

Searching with QueryParser


QueryParser is an interpreter tool that transforms a search string into a series of Query clauses. It's not absolutely necessary to use QueryParser to perform a search, but it's a great feature that empowers users by allowing the use of search modifiers. A user can specify a phrase match by putting quotes (") around a phrase. A user can also control whether a certain term or phrase is required by putting a plus ("+") sign in front of the term or phrase, or use a minus ("-") sign to indicate that the term or phrase must not exist in results. For Boolean searches, the user can use AND and OR to control whether all terms or phrases are required.

To do a field-specific search, you can use a colon (":") to specify a field for a search (for example, content:humpty would search for the term "humpty" in the field "content"). For wildcard searches, you can use the standard wildcard character asterisk ("*") to match 0 or more characters, or a question mark ("?") for matching...
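The modifiers described above can be illustrated with a toy parser. This is not how QueryParser is implemented; it is a minimal sketch that only understands "+term", "-term", and "field:term", and it ignores quoted phrases, AND/OR, and wildcards. The names MiniQuerySyntax, Clause, and parse are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Toy parser for a tiny subset of the syntax (hypothetical class, not QueryParser).
// Quoted phrases, AND/OR, and wildcards are deliberately left out.
final class MiniQuerySyntax {
    static final class Clause {
        final String occur; // "MUST", "MUST_NOT", or "SHOULD"
        final String field;
        final String text;

        Clause(String occur, String field, String text) {
            this.occur = occur;
            this.field = field;
            this.text = text;
        }

        @Override
        public String toString() {
            return occur + " " + field + ":" + text;
        }
    }

    static List<Clause> parse(String input, String defaultField) {
        List<Clause> clauses = new ArrayList<>();
        for (String token : input.trim().split("\\s+")) {
            if (token.isEmpty()) continue;
            String occur = "SHOULD"; // no modifier: the term is optional
            if (token.startsWith("+")) {
                occur = "MUST";
                token = token.substring(1);
            } else if (token.startsWith("-")) {
                occur = "MUST_NOT";
                token = token.substring(1);
            }
            // "field:term" selects a field; otherwise fall back to the default field.
            int colon = token.indexOf(':');
            String field = colon > 0 ? token.substring(0, colon) : defaultField;
            String text = colon > 0 ? token.substring(colon + 1) : token;
            clauses.add(new Clause(occur, field, text));
        }
        return clauses;
    }
}
```

For instance, parsing +humpty -dumpty title:wall with content as the default field yields a required clause on content:humpty, a prohibited clause on content:dumpty, and an optional clause on title:wall.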

TermQuery and TermRangeQuery


A TermQuery is a very simple query that matches documents containing a specific term. The TermRangeQuery is, as its name implies, a term range with a lower and upper boundary for matching.

How to do it...

Here are a couple of examples of TermQuery and TermRangeQuery:

Query query = new TermQuery(new Term("content", "humpty"));
Query rangeQuery = new TermRangeQuery("content", new BytesRef("a"), new BytesRef("c"), true, true);

The first line is a simple query that matches the term humpty in the content field. The second line is a range query that matches documents whose content field contains a term that sorts between a and c, with both boundaries included.
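What "sorts between a and c" means is easy to misread, so here is a small plain-Java sketch. It models a field's terms as a sorted set, which is an assumption made for illustration and not how Lucene stores terms; TermRangeSketch and termsInRange are hypothetical names. Java compares strings by UTF-16 code units, which agrees with byte order for the plain ASCII terms used here.

```java
import java.util.SortedSet;
import java.util.TreeSet;

// Models a field's terms as a sorted set (hypothetical class, for illustration only).
final class TermRangeSketch {
    // Returns the terms that fall between lower and upper in sort order.
    static SortedSet<String> termsInRange(TreeSet<String> terms, String lower, String upper,
                                          boolean includeLower, boolean includeUpper) {
        return terms.subSet(lower, includeLower, upper, includeUpper);
    }
}
```

With the terms a, all, be, c, cat, and humpty, the inclusive range from a to c selects a, all, be, and c. The term cat is left out because it sorts after c, which is worth remembering when choosing an upper boundary.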

BooleanQuery


A BooleanQuery is a combination of other queries in which you can specify whether each subquery must, must not, or should match. These options provide the foundation to build up to logical operators of AND, OR, and NOT, which you can use in QueryParser. Here is a quick review on QueryParser syntax for BooleanQuery:

  • "+" means required; for example, a search string +humpty dumpty equates to must match humpty and should match "dumpty"

  • "-" means must not match; for example, a search string -humpty dumpty equates to must not match humpty and should match dumpty

  • AND, OR, and NOT are pseudo Boolean operators. Under the hood, Lucene uses BooleanClause.Occur to model these operators. The options for Occur are MUST, MUST_NOT, and SHOULD. In an AND query, both terms must match. In an OR query, each term should match. Lastly, in a NOT query, the term must not exist (MUST_NOT). For example, humpty AND dumpty means must match both humpty and dumpty, humpty OR dumpty means should match either or both...
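The three Occur options above can be modeled with a short plain-Java sketch. It is a simplification under our own assumptions, not Lucene's code: BooleanSketch is a hypothetical class, a document is reduced to a set of terms, and scoring is ignored. It follows the default matching rules: every MUST term has to be present, no MUST_NOT term may be present, and when there is no MUST clause, at least one SHOULD term has to match.

```java
import java.util.Set;

// Simplified model of BooleanQuery matching (hypothetical class, scoring ignored).
final class BooleanSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    // terms[i] is combined with occurs[i]; docTerms is the set of terms in one document.
    static boolean matches(Set<String> docTerms, String[] terms, Occur[] occurs) {
        boolean hasMust = false;
        boolean shouldMatched = false;
        for (int i = 0; i < terms.length; i++) {
            boolean present = docTerms.contains(terms[i]);
            switch (occurs[i]) {
                case MUST:
                    hasMust = true;
                    if (!present) return false; // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;  // a prohibited term is present
                    break;
                default: // SHOULD
                    shouldMatched |= present;
            }
        }
        // Without a MUST clause, at least one SHOULD clause has to match,
        // so a query made only of MUST_NOT clauses matches nothing.
        return hasMust || shouldMatched;
    }
}
```

In a real BooleanQuery, matching SHOULD clauses also raise the score of a document; this sketch only answers whether the document matches at all.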

PrefixQuery and WildcardQuery


PrefixQuery, as the name implies, matches documents with terms starting with a specified prefix. WildcardQuery allows you to use wildcard characters for wildcard matching.

A PrefixQuery behaves like a WildcardQuery that has a single wildcard character at the end of the search string. When QueryParser encounters a wildcard search, it returns either a PrefixQuery or a WildcardQuery, depending on the wildcard character's location. PrefixQuery is simpler and more efficient than WildcardQuery, so it's preferable to use PrefixQuery whenever possible. That's exactly what QueryParser does.

How to do it...

Here is a code snippet to demonstrate both Query types:

PrefixQuery query = new PrefixQuery(new Term("content", "hum"));
WildcardQuery query2 = new WildcardQuery(new Term("content", "*um*"));

How it works…

Both queries would return the same results from our setup. The PrefixQuery will match anything that starts with hum and the WildcardQuery would match...
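The difference between the two kinds of term matching can be sketched in plain Java. This is an illustration only: WildcardSketch is a hypothetical class, it works on single terms rather than on an index, and it translates the wildcard pattern into a java.util.regex expression, which is not how Lucene evaluates wildcards. An asterisk stands for zero or more characters and a question mark for exactly one.

```java
import java.util.regex.Pattern;

// Term-level illustration of prefix and wildcard matching (hypothetical class).
final class WildcardSketch {
    static boolean prefixMatch(String term, String prefix) {
        return term.startsWith(prefix);
    }

    // '*' matches zero or more characters, '?' matches exactly one; all else is literal.
    static boolean wildcardMatch(String term, String pattern) {
        StringBuilder regex = new StringBuilder();
        for (char c : pattern.toCharArray()) {
            if (c == '*') {
                regex.append(".*");
            } else if (c == '?') {
                regex.append('.');
            } else {
                regex.append(Pattern.quote(String.valueOf(c)));
            }
        }
        return Pattern.matches(regex.toString(), term); // the whole term has to match
    }
}
```

At the term level, the prefix hum accepts humpty but not dumpty, while the pattern *um* accepts both. A leading wildcard gives up the shortcut of jumping straight to the terms that share a prefix, which is one reason the prefix form is the cheaper of the two.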

PhraseQuery and MultiPhraseQuery


A PhraseQuery matches a particular sequence of terms, while a MultiPhraseQuery gives you an option to match multiple terms in the same position. For example, MultiPhraseQuery supports a phrase such as humpty (dumpty OR together), in which it matches humpty in position 0 and dumpty or together in position 1.

How to do it...

Here is a code snippet to demonstrate both Query types:

PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "humpty"));
query.add(new Term("content", "together"));
MultiPhraseQuery query2 = new MultiPhraseQuery();
Term[] terms1 = new Term[1];
terms1[0] = new Term("content", "humpty");
Term[] terms2 = new Term[2];
terms2[0] = new Term("content", "dumpty");
terms2[1] = new Term("content", "together");
query2.add(terms1);
query2.add(terms2);

How it works…

The first Query, PhraseQuery, searches for the phrase humpty together. The second Query, MultiPhraseQuery, searches for the phrase humpty (dumpty OR together). The first Query would...
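Positional matching of this kind can be modeled in a few lines of plain Java. The sketch below is our own simplification, not Lucene's scorer: PhraseSketch is a hypothetical class, the document is an already tokenized array, and only exact matching (a slop of zero) is covered. A plain phrase is the special case in which every position accepts a single term.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Exact (slop 0) positional matching over a tokenized document (hypothetical class).
final class PhraseSketch {
    // alternatives.get(i) holds the terms accepted at phrase position i.
    static boolean matches(String[] tokens, List<Set<String>> alternatives) {
        if (alternatives.isEmpty()) return false;
        for (int start = 0; start + alternatives.size() <= tokens.length; start++) {
            boolean ok = true;
            for (int i = 0; i < alternatives.size() && ok; i++) {
                ok = alternatives.get(i).contains(tokens[start + i]);
            }
            if (ok) return true; // every position matched, starting at 'start'
        }
        return false;
    }

    // Convenience helper to build the set of terms allowed at one position.
    static Set<String> anyOf(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }
}
```

Against the tokens of "humpty dumpty sat on a wall", the phrase humpty (dumpty OR together) matches at position 0, whereas the plain phrase humpty together does not.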

FuzzyQuery


A FuzzyQuery matches terms based on similarity, using the Damerau-Levenshtein algorithm. We are not going into the details of the algorithm, as it is outside the scope of this chapter. What we need to know is that a fuzzy match is measured in the number of edits between terms. FuzzyQuery allows a maximum of 2 edits. For example, humptX is one edit away from humpty, and humpXX is two edits away. There is also a requirement that the number of edits must be less than the minimum term length (of either the input term or the candidate term). As another example, ab and abcd would not match because the number of edits between the two terms is 2, which is not less than the length of ab, which is also 2.
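For readers who want to check the edit counts by hand, here is a compact plain-Java sketch. It computes the optimal string alignment distance, a common restricted variant of Damerau-Levenshtein in which an insertion, a deletion, a substitution, or a swap of two adjacent characters each count as one edit. Lucene does not compute fuzzy matches this way internally, and EditDistanceSketch with its methods is a hypothetical name chosen for this illustration.

```java
// Edit-distance illustration (hypothetical class, not Lucene's fuzzy matching code).
final class EditDistanceSketch {
    // Optimal string alignment distance: insert, delete, substitute, and
    // transposition of two adjacent characters each cost one edit.
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1); // adjacent swap
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // Applies the two rules from the text: at most maxEdits edits, and fewer
    // edits than the length of the shorter term.
    static boolean fuzzyMatch(String a, String b, int maxEdits) {
        int edits = distance(a, b);
        return edits <= maxEdits && edits < Math.min(a.length(), b.length());
    }
}
```

With a maximum of 2 edits, humpXX and humpty are accepted, while ab and abcd are rejected by the length rule even though they are only 2 edits apart.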

How to do it...

Here is a code snippet to demonstrate FuzzyQuery:

FuzzyQuery query = new FuzzyQuery(new Term("content", "humpXX"));

How it works…

This Query will return sentences one, two, and four from our setup, as humpXX matches humpty within the two edits. In QueryParser, FuzzyQuery can be triggered...

NumericRangeQuery


NumericRangeQuery is similar to NumericRangeFilter in that you can specify lower bound and upper bound values for matching. To ensure search quality, make sure the numeric type (IntField, FloatField, LongField, or DoubleField) matches between the search and the indexed field.

How to do it…

Here is a code snippet to demonstrate NumericRangeQuery:

Query query = NumericRangeQuery.newIntRange("num", 0, 200, true, true);

How it works…

This example will return sentences one and two from our setup. Note that we need to specify a numeric type (newIntRange) when creating a query. It accepts five parameters: the name of the field, the lower bound, the upper bound, whether the lower bound is inclusive, and whether the upper bound is inclusive.
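The effect of the two inclusive flags can be spelled out as a simple predicate. This is only a plain-Java sketch of the range check itself; IntRangeSketch and inRange are hypothetical names, and the real query works against specially indexed numeric terms rather than testing values one by one.

```java
// Range predicate illustration (hypothetical class, not part of Lucene).
final class IntRangeSketch {
    // Mirrors the meaning of the last four parameters: bounds and their inclusiveness.
    static boolean inRange(int value, int min, int max,
                           boolean minInclusive, boolean maxInclusive) {
        boolean aboveMin = minInclusive ? value >= min : value > min;
        boolean belowMax = maxInclusive ? value <= max : value < max;
        return aboveMin && belowMax;
    }
}
```

With both flags set to true, as in the example above, a document whose num value is exactly 0 or exactly 200 is still part of the range.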

DisjunctionMaxQuery


This query type sounds a little funny and can be confusing at first. It's like a BooleanQuery that contains a number of subqueries. Instead of combining the scores from each subquery, it returns the maximum score from one of the subqueries. The reason for this is that a search may match on multiple fields, and a match on a more important field (which should have a higher score) can sometimes end up with a score equivalent to a match on less important fields due to the number of matching terms. For example, let's say we have a bookstore application, in which we have a book title and body in our index, and let's say we are searching for The Old Man and the Sea. A match on the title would return a very high score. However, it's possible that there may be another book with a similar title but with more matches in the body (for example, Young Man and the Sea) that would return a higher combined (title plus body) score.

In this case, a perfectly matched title may not be returned...
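The arithmetic behind the bookstore example can be sketched in plain Java. DisMaxSketch is a hypothetical class, and the field scores used below are made-up numbers chosen only to illustrate the effect. The disMax method takes the best field score and adds the remaining field scores multiplied by a tie breaker, which is the way a DisjunctionMaxQuery combines its subqueries.

```java
// Score combination illustration (hypothetical class; the scores fed in are made up).
final class DisMaxSketch {
    // Plain sum of the per-field scores.
    static double sum(double... fieldScores) {
        double total = 0;
        for (double score : fieldScores) total += score;
        return total;
    }

    // Best field score plus tieBreaker times the sum of the remaining field scores.
    static double disMax(double tieBreaker, double... fieldScores) {
        double max = 0;
        double total = 0;
        for (double score : fieldScores) {
            max = Math.max(max, score);
            total += score;
        }
        return max + tieBreaker * (total - max);
    }
}
```

Suppose the exact-title book scores 4.0 on title and 0.5 on body, and the similar-title book scores 3.0 on title and 2.0 on body. Summing gives 4.5 against 5.0, so the wrong book wins. Taking the maximum gives 4.0 against 3.0, and a tie breaker of 0.25 gives 4.125 against 3.5, so the exact title comes first either way.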

RegexpQuery


Lucene also offers regular expression support in Query. Lucene's flavor of RegExp is fast based on benchmark testing. However, it can be slow if the expression begins with ".*". For more information about Lucene's RegExp syntax, refer to http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/RegexpQuery.html.

How to do it…

Here is a code snippet:

RegexpQuery query = new RegexpQuery(new Term("content", ".um.*"));

RegexpQuery accepts a Term as an argument, where the Term contains the regular expression. In this test case, we try to match any term that contains the letters "um" preceded by exactly one leading character. The expression will return sentences one, two, and four from our setup.
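To see which terms an expression such as .um.* accepts, we can try it against a handful of terms in plain Java. Keep in mind that this sketch uses java.util.regex, whose syntax differs from Lucene's RegExp in several places; this particular expression happens to mean the same thing in both. RegexpSketch and matchingTerms are hypothetical names, and the expression has to match the whole term, not just a part of it.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Term-level illustration using java.util.regex (hypothetical class, not Lucene's RegExp).
final class RegexpSketch {
    // Keeps the terms that the expression matches in full (no partial matches).
    static List<String> matchingTerms(List<String> terms, String expression) {
        Pattern pattern = Pattern.compile(expression);
        List<String> matched = new ArrayList<>();
        for (String term : terms) {
            if (pattern.matcher(term).matches()) matched.add(term);
        }
        return matched;
    }
}
```

Given the terms humpty, dumpty, um, sum, and wall, the expression keeps humpty, dumpty, and sum. The term um is rejected because nothing precedes the letters "um".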

SpanQuery


SpanQuery offers the ability to restrict matching by Term positions. It shines when you want to match multiple terms that are close to each other, but not exactly matched as a phrase. It's similar to PhraseQuery with a slop set greater than zero, but it gives you more options to control how matching is done. For instance, say we want to search for the terms "humpty" and "wall" from our test setup so that we can match on the sentence "Humpty Dumpty sat on a wall". We can perform this search using either PhraseQuery or SpanQuery with a slop set to 4; both Queries would match. So let's say we switch the two Terms around. Now, we have the Terms in this order: "wall" and "humpty". PhraseQuery will fail to find a match because the Terms are out of order. SpanQuery has an option to match Terms out of order, so we can still find a match even when the Terms' order is reversed.
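The ordered and unordered behavior described above can be modeled with a small plain-Java sketch. It is a simplification under our own assumptions: SpanNearSketch and near are hypothetical names, only two terms are supported, and the slop is counted as the number of positions that lie strictly between the two terms.

```java
// Two-term proximity illustration (hypothetical class, not Lucene's span machinery).
final class SpanNearSketch {
    // True when some occurrence of 'first' and some occurrence of 'second' have at
    // most 'slop' positions between them. With inOrder, 'first' must come first.
    static boolean near(String[] tokens, String first, String second,
                        int slop, boolean inOrder) {
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) continue;
            for (int j = 0; j < tokens.length; j++) {
                if (i == j || !tokens[j].equals(second)) continue;
                if (inOrder && j < i) continue; // 'second' appears before 'first'
                int gap = Math.abs(j - i) - 1;  // positions strictly between the terms
                if (gap <= slop) return true;
            }
        }
        return false;
    }
}
```

For the tokens of "humpty dumpty sat on a wall", humpty followed by wall matches in order with a slop of 4 but not with a slop of 3. The reversed pair, wall and humpty, only matches when the order requirement is switched off.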

SpanQuery has other useful functionalities as well. We will explore each SpanQuery type in detail:

  • SpanTermQuery: This...

CustomScoreQuery


When consideration of all the built-in features is exhausted and you still need more flexibility, it may be time to explore how to build a custom scoring mechanism to customize the search results ranking. Lucene has a CustomScoreQuery class that allows you to do just that. We can provide our own score provider by extending from this class, along with CustomScoreProvider. By extending CustomScoreProvider, we can override score calculation with our own implementation.

How to do it…

Let's take a look at how it's done. We will build a CustomScoreQuery that favors documents with terms that are anagrams of the querying Terms. For each anagram found, we increase the score by 1. The search results ranking will be augmented so that the documents with anagrammed Terms are ranked higher.
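Before looking at the Lucene classes, the scoring idea itself can be sketched in plain Java. This is not the recipe's implementation: AnagramSketch, isAnagram, and boostedScore are hypothetical names, and the choice to not count a term as an anagram of itself is our own assumption.

```java
import java.util.Arrays;
import java.util.Collection;

// Anagram boost illustration (hypothetical class, independent of Lucene).
final class AnagramSketch {
    // Two different terms are anagrams when their sorted letters are identical.
    // Assumption: an identical term does not count as an anagram of itself.
    static boolean isAnagram(String a, String b) {
        if (a.equals(b) || a.length() != b.length()) return false;
        char[] x = a.toCharArray();
        char[] y = b.toCharArray();
        Arrays.sort(x);
        Arrays.sort(y);
        return Arrays.equals(x, y);
    }

    // Adds 1 to the base score for every document term that is an anagram of a query term.
    static float boostedScore(float baseScore, Collection<String> queryTerms,
                              Collection<String> docTerms) {
        float score = baseScore;
        for (String docTerm : docTerms) {
            for (String queryTerm : queryTerms) {
                if (isAnagram(queryTerm, docTerm)) score += 1f;
            }
        }
        return score;
    }
}
```

A document containing silent and enlist would gain 2 points for the query term listen, which is the kind of ranking adjustment a custom score provider can apply on top of the regular score.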

Here is the AnagramQuery class:

public class AnagramQuery extends CustomScoreQuery {
  private final String field;
  private final Set<String> terms = new HashSet<String>();
  public AnagramQuery...