You're reading from Lucene 4 Cookbook

Product type: Book
Published in: Jun 2015
Reading level: Expert
ISBN-13: 9781782162285
Edition: 1st Edition
Languages: Java
Tools: Lucene
Concepts: Enterprise Search

Authors (2):

Edwood Ng

Vineeth Mohan

Chapter 6. Querying and Filtering Data

In this chapter, we will cover the following recipes:

Performing advanced filtering
Creating a custom filter
Searching with QueryParser
TermQuery and TermRangeQuery
BooleanQuery
PrefixQuery and WildcardQuery
PhraseQuery and MultiPhraseQuery
FuzzyQuery
NumericRangeQuery
DisjunctionMaxQuery
RegexpQuery
SpanQuery
CustomScoreQuery

Introduction

When it comes to search applications, usability is always a key element that either makes or breaks the user's impression. Lucene does an excellent job of giving you the essential tools to build and search an index. In this chapter, we will look into some more advanced techniques to query and filter data. We will add to your toolbox so that you can leverage your Lucene knowledge to build a user-friendly search application.

Performing advanced filtering

Before we start, let us revisit two questions: what is a filter, and what is it for? In simple terms, a filter is used to narrow the search space or, in other words, to search within a search. Filter and Query may seem to provide the same functionality, but there is a significant difference between the two. Scores are calculated in querying to rank results based on their relevance to the search terms, while a filter has no effect on scores. It's not uncommon for users to prefer navigating through a hierarchy of filters in order to land on the relevant results. You may often find yourself in a situation where it is necessary to refine a result set so that users can continue to search or navigate within a subset. With the ability to apply filters, we can easily provide such search refinements. Another situation is data security, where some parts of the data in the index are protected. You may need to include an additional filter behind the scenes that...

Creating a custom filter

Now that we've seen numerous examples of Lucene's built-in Filters, we are ready for a more advanced topic: custom filters. There are a few important components we need to go over before we start: FieldCache, SortedDocValues, and DocIdSet. We will be using these items in our example to help you gain practical knowledge of the subject.

The FieldCache, as you have already learned, is a cache that stores field values in memory in an array structure. It's a very simple data structure, as the slots in the array correspond to DocIds. This is also the reason why FieldCache only works for a single-valued field: a slot in an array can only hold a single value. Since this is just an array, the lookup time is constant and very fast.
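To make the array layout concrete, here is a minimal sketch in plain Java. It is not Lucene's FieldCache API; the class FieldCacheSketch and its methods are hypothetical names used only to illustrate the one-slot-per-DocId idea described above.

```java
// Illustrative only: a plain array standing in for a field cache.
// FieldCacheSketch is a hypothetical class, not part of Lucene.
final class FieldCacheSketch {
    // One slot per document; the array index is the DocId.
    private final String[] valueByDocId;

    FieldCacheSketch(int maxDoc) {
        this.valueByDocId = new String[maxDoc];
    }

    // Storing a second value for the same DocId overwrites the first,
    // which is why this layout only suits single-valued fields.
    void put(int docId, String value) {
        valueByDocId[docId] = value;
    }

    // Constant-time lookup: a single array access.
    String get(int docId) {
        return valueByDocId[docId];
    }

    public static void main(String[] args) {
        FieldCacheSketch cache = new FieldCacheSketch(3);
        cache.put(0, "humpty");
        cache.put(0, "dumpty"); // replaces "humpty" in slot 0
        System.out.println(cache.get(0)); // prints dumpty
    }
}
```

The overwrite in main shows why a multi-valued field cannot be represented this way: the second value simply replaces the first.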

SortedDocValues has two internal mappings for value lookup: a dictionary that maps an ordinal to a field value, and a mapping from a DocId to the ordinal of that document's field value. In the dictionary data structure, the values are deduplicated...
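The two mappings can be sketched with two arrays. This is a simplified stand-in written in plain Java, not Lucene's implementation; the method names only echo the style of the Lucene API, and the sketch assumes every document has exactly one non-null value.

```java
import java.util.Arrays;
import java.util.TreeSet;

// Illustrative only: a hypothetical two-array model of sorted doc values.
final class SortedDocValuesSketch {
    private final String[] dictionary; // ordinal -> value (sorted, deduplicated)
    private final int[] docToOrd;      // DocId -> ordinal

    SortedDocValuesSketch(String[] valueByDocId) {
        // A TreeSet both sorts the values and removes duplicates.
        TreeSet<String> unique = new TreeSet<>(Arrays.asList(valueByDocId));
        dictionary = unique.toArray(new String[0]);
        docToOrd = new int[valueByDocId.length];
        for (int docId = 0; docId < valueByDocId.length; docId++) {
            // The dictionary is sorted, so binary search finds the ordinal.
            docToOrd[docId] = Arrays.binarySearch(dictionary, valueByDocId[docId]);
        }
    }

    int getOrd(int docId) { return docToOrd[docId]; }

    String lookupOrd(int ord) { return dictionary[ord]; }

    int getValueCount() { return dictionary.length; }

    public static void main(String[] args) {
        SortedDocValuesSketch values =
            new SortedDocValuesSketch(new String[] {"red", "blue", "red", "green"});
        // Documents 0 and 2 share one dictionary entry.
        System.out.println(values.getOrd(0) == values.getOrd(2)); // prints true
        System.out.println(values.lookupOrd(values.getOrd(1)));   // prints blue
    }
}
```

Because equal values share an ordinal, comparing two documents by field value reduces to comparing two integers.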

Searching with QueryParser

QueryParser is an interpreter tool that transforms a search string into a series of Query clauses. It's not absolutely necessary to use QueryParser to perform a search, but it's a great feature that empowers users by allowing the use of search modifiers. A user can specify a phrase match by putting quotes (") around a phrase. A user can also control whether a certain term or phrase is required by putting a plus ("+") sign in front of the term or phrase, or use a minus ("-") sign to indicate that the term or phrase must not exist in results. For Boolean searches, the user can use AND and OR to control whether all terms or phrases are required.

To do a field-specific search, you can use a colon (":") to specify a field for a search (for example, content:humpty would search for the term "humpty" in the field "content"). For wildcard searches, you can use the standard wildcard character asterisk ("*") to match 0 or more characters, or a question mark ("?") for matching...

TermQuery and TermRangeQuery

A TermQuery is a very simple query that matches documents containing a specific term. The TermRangeQuery is, as its name implies, a term range with a lower and upper boundary for matching.

How to do it...

Here are a couple of examples of TermQuery and TermRangeQuery:

query = new TermQuery(new Term("content", "humpty"));
query = new TermRangeQuery("content", new BytesRef("a"), new BytesRef("c"), true, true);

The first line is a simple query that matches the term humpty in the content field. The second line is a range query that matches documents whose content field has a term sorting between a and c, with both bounds inclusive.
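A term range is a comparison on sort order, not on the first letter. The sketch below uses String.compareTo as a stand-in for Lucene's byte-wise term comparison (the two agree for plain ASCII terms); TermRangeSketch and inRange are hypothetical names, not Lucene API.

```java
// Illustrative only: models the range test with String ordering.
final class TermRangeSketch {
    static boolean inRange(String term, String lower, String upper,
                           boolean includeLower, boolean includeUpper) {
        int vsLower = term.compareTo(lower);
        int vsUpper = term.compareTo(upper);
        boolean aboveLower = includeLower ? vsLower >= 0 : vsLower > 0;
        boolean belowUpper = includeUpper ? vsUpper <= 0 : vsUpper < 0;
        return aboveLower && belowUpper;
    }

    public static void main(String[] args) {
        // "apple" sorts between "a" and "c".
        System.out.println(inRange("apple", "a", "c", true, true)); // prints true
        // "cat" sorts after "c", so it falls outside the range.
        System.out.println(inRange("cat", "a", "c", true, true));   // prints false
    }
}
```

Note that an inclusive upper bound of c admits the term c itself but not longer terms that start with c.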

BooleanQuery

A BooleanQuery is a combination of other queries in which you can specify whether each subquery must, must not, or should match. These options provide the foundation to build up to logical operators of AND, OR, and NOT, which you can use in QueryParser. Here is a quick review on QueryParser syntax for BooleanQuery:

"+" means required; for example, a search string +humpty dumpty equates to must match humpty and should match "dumpty"
"-" means must not match; for example, a search string -humpty dumpty equates to must not match humpty and should match dumpty
AND, OR, and NOT are pseudo Boolean operators. Under the hood, Lucene uses BooleanClause.Occur to model these operators. The options for occur are MUST, MUST_NOT, and SHOULD. In an AND query, both terms must match (MUST). In an OR query, each term should match (SHOULD). Lastly, in a NOT query, the term must not exist (MUST_NOT). For example, humpty AND dumpty means must match both humpty and dumpty, humpty OR dumpty means should match either or both...
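The matching rules above can be sketched without Lucene. The following plain-Java model is an assumption-laden simplification: it treats a document as a set of terms, ignores scoring, and uses hypothetical names (BooleanClauseSketch, Clause) that only mirror the MUST, SHOULD, and MUST_NOT options.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: decides whether a document matches, without scoring.
final class BooleanClauseSketch {
    enum Occur { MUST, SHOULD, MUST_NOT }

    static final class Clause {
        final String term;
        final Occur occur;
        Clause(String term, Occur occur) { this.term = term; this.occur = occur; }
    }

    static boolean matches(Set<String> docTerms, List<Clause> clauses) {
        boolean hasMust = false;
        boolean shouldMatched = false;
        for (Clause clause : clauses) {
            boolean present = docTerms.contains(clause.term);
            switch (clause.occur) {
                case MUST:
                    hasMust = true;
                    if (!present) return false; // a required term is missing
                    break;
                case MUST_NOT:
                    if (present) return false;  // a prohibited term is present
                    break;
                default: // SHOULD
                    shouldMatched |= present;
                    break;
            }
        }
        // With no MUST clause, at least one SHOULD clause has to match.
        return hasMust || shouldMatched;
    }

    public static void main(String[] args) {
        // Models the search string: +humpty dumpty
        List<Clause> query = Arrays.asList(
            new Clause("humpty", Occur.MUST), new Clause("dumpty", Occur.SHOULD));
        Set<String> doc = new HashSet<>(Arrays.asList("humpty", "sat", "wall"));
        System.out.println(matches(doc, query)); // prints true
    }
}
```

In this model, SHOULD clauses become optional as soon as a MUST clause is present; in a real query they would still contribute to the score.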

PrefixQuery and WildcardQuery

PrefixQuery, as the name implies, matches documents with terms starting with a specified prefix. WildcardQuery allows you to use wildcard characters for wildcard matching.

A PrefixQuery is similar to a WildcardQuery that has only one wildcard character, at the end of the search string. When QueryParser parses a wildcard search, it returns either a PrefixQuery or a WildcardQuery, depending on the wildcard character's location. PrefixQuery is simpler and more efficient than WildcardQuery, so it's preferable to use PrefixQuery whenever possible. That's exactly what QueryParser does.
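The choice between the two query types can be sketched as a check on where the wildcard sits. This is a plain-Java illustration, not QueryParser's actual code; WildcardSketch and its methods are hypothetical, and escape characters are deliberately not handled.

```java
import java.util.regex.Pattern;

// Illustrative only: picks a cheap prefix test when the pattern allows it.
final class WildcardSketch {
    // True when the only wildcard is a single trailing '*', such as "hum*".
    static boolean isPrefixPattern(String pattern) {
        return pattern.length() > 1
            && pattern.indexOf('*') == pattern.length() - 1
            && pattern.indexOf('?') < 0;
    }

    // Translates '*' to "any characters" and '?' to "exactly one character".
    static Pattern toRegex(String wildcard) {
        StringBuilder regex = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') regex.append(".*");
            else if (c == '?') regex.append('.');
            else regex.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(regex.toString());
    }

    static boolean matches(String wildcard, String term) {
        if (isPrefixPattern(wildcard)) {
            // Prefix case: a simple startsWith is enough.
            return term.startsWith(wildcard.substring(0, wildcard.length() - 1));
        }
        return toRegex(wildcard).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(isPrefixPattern("hum*"));    // prints true
        System.out.println(isPrefixPattern("*um*"));    // prints false
        System.out.println(matches("*um*", "dumpty")); // prints true
    }
}
```

The prefix branch never builds a pattern at all, which hints at why the simpler query type is preferred when the search string permits it.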

How to do it...

Here is a code snippet to demonstrate both Query types:

PrefixQuery query = new PrefixQuery(new Term("content", "hum"));
WildcardQuery query2 = new WildcardQuery(new Term("content", "*um*"));

How it works…

Both queries would return the same results from our setup. The PrefixQuery will match anything that starts with hum and the WildcardQuery would match...

PhraseQuery and MultiPhraseQuery

A PhraseQuery matches a particular sequence of terms, while a MultiPhraseQuery gives you an option to match multiple terms in the same position. For example, MultiPhraseQuery supports a phrase such as humpty (dumpty OR together), in which it matches humpty in position 0 and dumpty or together in position 1.
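Position-based matching can be sketched as sliding a list of "slots" over a tokenized document, where each slot holds the terms allowed at that position. This plain-Java model is hypothetical (PhrasePositionSketch is not a Lucene class) and assumes exact adjacency with no slop; a phrase is simply the case where every slot holds one term.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: exact positional matching with alternatives per position.
final class PhrasePositionSketch {
    // Convenience for building the set of terms allowed at one position.
    static Set<String> slot(String... terms) {
        return new HashSet<>(Arrays.asList(terms));
    }

    static boolean matches(List<String> docTerms, List<Set<String>> slots) {
        if (slots.isEmpty()) return false;
        // Try every starting position that leaves room for all slots.
        for (int start = 0; start + slots.size() <= docTerms.size(); start++) {
            boolean allMatch = true;
            for (int i = 0; i < slots.size(); i++) {
                if (!slots.get(i).contains(docTerms.get(start + i))) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) return true;
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("humpty", "dumpty", "sat", "on", "a", "wall");
        // humpty (dumpty OR together)
        List<Set<String>> multi = Arrays.asList(slot("humpty"), slot("dumpty", "together"));
        // humpty together
        List<Set<String>> phrase = Arrays.asList(slot("humpty"), slot("together"));
        System.out.println(matches(doc, multi));  // prints true
        System.out.println(matches(doc, phrase)); // prints false
    }
}
```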

How to do it...

Here is a code snippet to demonstrate both Query types:

PhraseQuery query = new PhraseQuery();
query.add(new Term("content", "humpty"));
query.add(new Term("content", "together"));
MultiPhraseQuery query2 = new MultiPhraseQuery();
Term[] terms1 = new Term[1];
terms1[0] = new Term("content", "humpty");
Term[] terms2 = new Term[2];
terms2[0] = new Term("content", "dumpty");
terms2[1] = new Term("content", "together");
query2.add(terms1);
query2.add(terms2);

How it works…

The first Query, PhraseQuery, searches for the phrase humpty together. The second Query, MultiPhraseQuery, searches for the phrase humpty (dumpty OR together). The first Query would...

FuzzyQuery

A FuzzyQuery matches terms based on similarity, using the Damerau-Levenshtein algorithm. We are not going into the details of the algorithm as it is outside the scope of our topic. What we need to know is that a fuzzy match is measured by the number of edits between terms. FuzzyQuery allows a maximum of 2 edits. For example, humptX and humpty are one edit apart, and humpXX and humpty are two edits apart. There is also a requirement that the number of edits must be less than the minimum term length (of either the input term or the candidate term). As another example, ab and abcd would not match because the number of edits between the two terms is 2, which is not less than the length of ab, which is also 2.
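For readers who want to check the edit counts by hand, here is a small dynamic-programming sketch. It computes the restricted Damerau-Levenshtein distance (the optimal string alignment variant), which is a textbook formulation and not how Lucene evaluates fuzzy matches internally. The class EditDistanceSketch and the fuzzyMatches rule are hypothetical and simply encode the two conditions stated above.

```java
// Illustrative only: textbook edit distance with adjacent transpositions.
final class EditDistanceSketch {
    static int distance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i; // delete all of a
        for (int j = 0; j <= b.length(); j++) d[0][j] = j; // insert all of b
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(
                    Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1), // deletion, insertion
                    d[i - 1][j - 1] + cost);                    // substitution
                // Swapping two adjacent characters counts as a single edit.
                if (i > 1 && j > 1
                        && a.charAt(i - 1) == b.charAt(j - 2)
                        && a.charAt(i - 2) == b.charAt(j - 1)) {
                    d[i][j] = Math.min(d[i][j], d[i - 2][j - 2] + 1);
                }
            }
        }
        return d[a.length()][b.length()];
    }

    // Encodes the two conditions from the text: at most maxEdits edits,
    // and fewer edits than the length of the shorter term.
    static boolean fuzzyMatches(String a, String b, int maxEdits) {
        int edits = distance(a, b);
        return edits <= maxEdits && edits < Math.min(a.length(), b.length());
    }

    public static void main(String[] args) {
        System.out.println(distance("humptX", "humpty"));        // prints 1
        System.out.println(distance("humpXX", "humpty"));        // prints 2
        System.out.println(fuzzyMatches("ab", "abcd", 2));       // prints false
    }
}
```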

How to do it...

Here is a code snippet to demonstrate FuzzyQuery:

FuzzyQuery query = new FuzzyQuery(new Term("content", "humpXX"));

How it works…

This Query will return sentences one, two, and four from our setup, as humpXX matches humpty within the two edits. In QueryParser, FuzzyQuery can be triggered...

NumericRangeQuery

NumericRangeQuery is similar to NumericRangeFilter in that you can specify lower bound and upper bound values for matching. To get correct results, make sure the numeric type (IntField, FloatField, LongField, or DoubleField) of the query matches that of the indexed field.

How to do it…

Here is a code snippet to demonstrate NumericRangeQuery:

Query query = NumericRangeQuery.newIntRange("num", 0, 200, true, true);

How it works…

This example will return sentences one and two from our setup. Note that we need to specify a numeric type (newIntRange) when creating a query. It accepts five parameters: the name of the field, the lower bound, the upper bound, whether the lower bound is inclusive, and whether the upper bound is inclusive.

DisjunctionMaxQuery

This query type sounds a little funny and can be confusing at first. It's like a BooleanQuery that contains a number of subqueries, but instead of combining the scores from each subquery, it returns the maximum score among them. The reason for this is that a search may match on multiple fields, and a match on a more important field (which should have a higher score) can sometimes score no higher than matches on less important fields, due to the number of matching terms. For example, let's say we have a bookstore application with a book title and body in our index, and we are searching for The Old Man and the Sea. A match on the title would return a very high score. However, there may be another book with a similar title, for example, Young Man and the Sea, that has more matches in the body and would therefore return a higher combined (title plus body) score.

In this case, a perfectly matched title may not be returned...
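The difference between summing and taking the maximum is easy to see with numbers. The per-field scores below are made up for illustration, and DisMaxSketch is a hypothetical class; the tie-breaker term follows the documented behavior of DisjunctionMaxQuery, where the best subquery score is increased by a fraction of the other subquery scores.

```java
// Illustrative only: compares sum-of-scores with max-plus-tie-breaker.
final class DisMaxSketch {
    static double sumScore(double[] fieldScores) {
        double sum = 0.0;
        for (double score : fieldScores) sum += score;
        return sum;
    }

    static double disMaxScore(double[] fieldScores, double tieBreaker) {
        if (fieldScores.length == 0) return 0.0;
        int best = 0;
        for (int i = 1; i < fieldScores.length; i++) {
            if (fieldScores[i] > fieldScores[best]) best = i;
        }
        double others = 0.0;
        for (int i = 0; i < fieldScores.length; i++) {
            if (i != best) others += fieldScores[i];
        }
        // Best field wins; the remaining fields add only a fraction.
        return fieldScores[best] + tieBreaker * others;
    }

    public static void main(String[] args) {
        // Hypothetical {title, body} scores for two books.
        double[] exactTitle = {0.9, 0.1};   // strong title match, weak body
        double[] similarTitle = {0.6, 0.6}; // weaker title, more body matches
        // Summing ranks the similar title first.
        System.out.println(sumScore(similarTitle) > sumScore(exactTitle));
        // Taking the maximum ranks the exact title first.
        System.out.println(disMaxScore(exactTitle, 0.0) > disMaxScore(similarTitle, 0.0));
    }
}
```

A tie-breaker of 0 uses the maximum alone; a small positive value lets additional matching fields separate documents whose best-field scores are otherwise equal.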

RegexpQuery

Lucene also offers regular expression support in Query. Lucene's flavor of RegExp is fast, based on benchmark testing. However, it can be slow if the expression begins with ".*". For more information about Lucene's RegExp syntax, refer to http://lucene.apache.org/core/4_10_2/core/org/apache/lucene/search/RegexpQuery.html.

How to do it…

Here is a code snippet:

RegexpQuery query = new RegexpQuery(new Term("content", ".um.*"));

RegexpQuery accepts a Term as an argument, where the Term contains the regular expression. In this test case, we try to match any term that has the letters um preceded by exactly one leading character. The expression will return sentences one, two, and four from our setup.
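To see which terms the expression accepts, the sketch below applies the same pattern with java.util.regex. This is only an approximation: Lucene has its own regular expression syntax, but for a simple pattern like .um.* the two agree, and in both cases the expression has to match the whole term. RegexpTermSketch is a hypothetical helper, not Lucene API.

```java
import java.util.regex.Pattern;

// Illustrative only: whole-term matching with java.util.regex.
final class RegexpTermSketch {
    static boolean termMatches(String regex, String term) {
        // matches() requires the pattern to cover the entire term.
        return Pattern.compile(regex).matcher(term).matches();
    }

    public static void main(String[] args) {
        System.out.println(termMatches(".um.*", "humpty")); // prints true
        System.out.println(termMatches(".um.*", "dumpty")); // prints true
        System.out.println(termMatches(".um.*", "thumb"));  // prints false
    }
}
```

The last call fails because two characters precede um in thumb, while the expression allows exactly one.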

SpanQuery

SpanQuery offers the ability to restrict matching by Term positions. It shines when you want to match multiple terms that are close to each other, but not exactly matched as a phrase. It's similar to PhraseQuery with a slop set greater than zero, but it gives you more options to control how matching is done. For instance, say we want to search for the terms "humpty" and "wall" from our test setup so that we can match on the sentence "Humpty Dumpty sat on a wall". We can perform this search using either PhraseQuery or SpanQuery with a slop set to 4; both Queries would match. Now let's say we switch the two Terms around, so that they are in the order "wall" and "humpty". PhraseQuery will fail to find a match because the Terms are out of order. SpanQuery has an option to match Terms out of order, so we can still find a match even when the Terms' order is reversed.
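The slop and ordering behavior described above can be modeled for two terms with a few lines of plain Java. This is a simplified, hypothetical model (SpanNearSketch is not a Lucene class): it counts the positions strictly between the two terms as the gap and ignores nested spans and scoring.

```java
import java.util.Arrays;
import java.util.List;

// Illustrative only: proximity matching for two terms by position.
final class SpanNearSketch {
    static boolean near(List<String> docTerms, String first, String second,
                        int slop, boolean inOrder) {
        for (int i = 0; i < docTerms.size(); i++) {
            if (!docTerms.get(i).equals(first)) continue;
            for (int j = 0; j < docTerms.size(); j++) {
                if (i == j || !docTerms.get(j).equals(second)) continue;
                // When order matters, 'second' has to appear after 'first'.
                if (inOrder && j < i) continue;
                int gap = Math.abs(j - i) - 1; // positions between the two terms
                if (gap <= slop) return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("humpty", "dumpty", "sat", "on", "a", "wall");
        System.out.println(near(doc, "humpty", "wall", 4, true));  // prints true
        System.out.println(near(doc, "wall", "humpty", 4, true));  // prints false
        System.out.println(near(doc, "wall", "humpty", 4, false)); // prints true
    }
}
```

Four terms sit between humpty and wall in the sentence, which is why a slop of 4 is the smallest value that matches.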

SpanQuery has other useful functionalities as well. We will explore each SpanQuery type in detail:

SpanTermQuery: This...

CustomScoreQuery

When consideration of all the built-in features is exhausted and you still need more flexibility, it may be time to explore how to build a custom scoring mechanism to customize the search results ranking. Lucene has a CustomScoreQuery class that allows you to do just that. We can provide our own score provider by extending from this class, along with CustomScoreProvider. By extending CustomScoreProvider, we can override score calculation with our own implementation.

How to do it…

Let's take a look at how it's done. We will build a CustomScoreQuery that favors documents with terms that are anagrams of the querying Terms. For each anagram found, we increase the score by 1. The search results ranking will be augmented so that the documents with anagrammed Terms are ranked higher.
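Before looking at the Lucene classes, it may help to see the anagram rule on its own. The sketch below is plain Java and does not use CustomScoreQuery or CustomScoreProvider; AnagramSketch and its methods are hypothetical, and it assumes that a term identical to the query term does not count as an anagram.

```java
import java.util.Arrays;
import java.util.Collection;

// Illustrative only: the scoring idea, separated from Lucene.
final class AnagramSketch {
    // Two terms are anagrams when their sorted characters are equal.
    static String key(String term) {
        char[] chars = term.toCharArray();
        Arrays.sort(chars);
        return new String(chars);
    }

    // Assumption: an identical term is not treated as an anagram.
    static boolean isAnagram(String a, String b) {
        return !a.equals(b) && key(a).equals(key(b));
    }

    // Adds 1 to the base score for each anagram found in the document.
    static float boostedScore(float baseScore, Collection<String> queryTerms,
                              Collection<String> docTerms) {
        int anagrams = 0;
        for (String docTerm : docTerms) {
            for (String queryTerm : queryTerms) {
                if (isAnagram(queryTerm, docTerm)) anagrams++;
            }
        }
        return baseScore + anagrams;
    }

    public static void main(String[] args) {
        float score = boostedScore(1.0f,
            Arrays.asList("listen"), Arrays.asList("silent", "enlist", "wall"));
        System.out.println(score); // prints 3.0
    }
}
```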

Here is the AnagramQuery class:

public class AnagramQuery extends CustomScoreQuery {
  private final String field;
  private final Set<String> terms = new HashSet<String>();
  public AnagramQuery...

Authors (2)

Edwood Ng

Edwood Ng is a technologist with over a decade of experience in building scalable solutions from proprietary implementations to client-facing web-based applications. Currently, he's the director of DevOps at Wellframe, leading infrastructure and DevOps operations. His background in search engines began at Endeca Technologies in 2004, where he was a technical consultant helping numerous clients to architect and implement faceted search solutions. After Endeca, he drew on his knowledge and began designing and building Lucene-based solutions. His first Lucene implementation that went to production was the search engine behind http://UpDown.com. From there on, he continued to create search applications using Lucene extensively to deliver robust and scalable systems for his clients. Edwood is a supporter of open source software. He has also contributed the plugin sfI18NGettextPluralPlugin to the Symphony project.

Vineeth Mohan

Vineeth Mohan is an architect and developer. He currently works as the CTO at Factweavers Technologies and is also an Elasticsearch-certified trainer. He loves to spend time studying emerging technologies and applications related to data analytics, data visualizations, machine learning, natural language processing, and developments in search analytics. He began coding during his high school days, which later ignited his interest in computer science, and he pursued engineering at Model Engineering College, Cochin. He was recruited by the search giant Yahoo! during his college days. After 2 years of work at Yahoo! on various big data projects, he joined a start-up that dealt with search and analytics. Finally, he started his own big data consulting company, Factweavers. Under his leadership and technical expertise, Factweavers is one of the early adopters of Elasticsearch and has been engaged with projects related to end-to-end big data solutions and analytics for the last few years. There, he got the opportunity to learn various big-data-based technologies, such as Hadoop, and high-performance data ingress systems and storage. Later, he moved to a start-up in his hometown, where he chose Elasticsearch as the primary search and analytic engine for the project assigned to him. Later in 2014, he founded Factweavers Technologies along with Jalaluddeen; it is a consultancy that aims at providing Elasticsearch-based solutions. He is also an Elasticsearch-certified corporate trainer who conducts trainings in India. To date, he has worked on numerous projects that are mostly based on Elasticsearch and has trained numerous multinationals on Elasticsearch.
