You're reading from Apache Solr Search Patterns

Product type: Book
Published in: Apr 2015
Reading level: Intermediate
ISBN-13: 9781783981847
Edition: 1st Edition
Languages: Java
Tools: Solr
Concepts: Enterprise Search
Author (1): Jayant Kumar

Chapter 3. Solr Internals and Custom Queries

In this chapter, we will see how the relevance scorer works on the inverted index. We will understand how AND and OR clauses work in a query and look at how query filters and the minimum should match parameter work internally. We will understand how the eDisMax query parser works. We will then implement our own query language as a Solr plugin and use it to perform a proximity search. This chapter will give us an insight into customizing the query logic and creating custom query parsers as plugins in Solr. This chapter will cover the following topics:

How a scorer works on an inverted index
How OR and AND clauses work
How the eDisMax query parser works
The minimum should match parameter
How filters work
Using Bibliographic Retrieval Services (BRS) queries instead of DisMax
Proximity search using SWAN (Same, With, Adj, Near) queries
Creating a parboiled parser
Building a Solr plugin for SWAN queries
Integrating the SWAN plugin in Solr

Working of a scorer on an inverted index

We have, so far, understood what an inverted index is and how relevance calculation works. Let us now understand how a scorer works on an inverted index. Suppose we have an index with the following three documents:

Figure: 3 Documents

To index the documents, we have applied WhitespaceTokenizer along with the EnglishMinimalStemFilterFactory class. The tokenizer breaks each sentence into tokens by splitting on whitespace, and EnglishMinimalStemFilterFactory converts plural English words to their singular forms. The index thus created would be similar to the following:

Figure: An inverted index
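The indexing steps described above can be sketched in plain Java. This is only an illustration, not Solr's implementation: the three documents are hypothetical stand-ins (the originals are in a figure that is not reproduced here), and the `stem` method is a simplified approximation of what a minimal English stemmer does, not the actual filter.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class InvertedIndexSketch {

    // Simplified stand-in for a minimal English stemmer (assumption):
    // "...ies" becomes "...y", and a single trailing "s" is dropped.
    static String stem(String token) {
        if (token.length() > 4 && token.endsWith("ies")) {
            return token.substring(0, token.length() - 3) + "y";
        }
        if (token.length() > 3 && token.endsWith("s") && !token.endsWith("ss")) {
            return token.substring(0, token.length() - 1);
        }
        return token;
    }

    // Builds term -> (document ID -> term frequency) from whitespace-split text.
    static Map<String, Map<Integer, Integer>> build(List<String> docs) {
        Map<String, Map<Integer, Integer>> index = new TreeMap<>();
        for (int i = 0; i < docs.size(); i++) {
            int docId = i + 1; // document IDs start at 1
            for (String raw : docs.get(i).trim().split("\\s+")) {
                if (raw.isEmpty()) {
                    continue;
                }
                index.computeIfAbsent(stem(raw), t -> new TreeMap<>())
                     .merge(docId, 1, Integer::sum);
            }
        }
        return index;
    }

    public static void main(String[] args) {
        // Hypothetical documents, chosen only for illustration.
        List<String> docs = List.of(
            "apples are red",
            "oranges are orange",
            "strawberries are not orange");
        Map<String, Map<Integer, Integer>> index = build(docs);
        // Each term maps to the documents that contain it and how often.
        index.forEach((term, postings) -> System.out.println(term + " -> " + postings));
    }
}
```

With these sample documents, the term orange points to documents 2 and 3, with a higher term frequency in document 2.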

A search for the term orange returns documents 2 and 3. Running the query with debugging enabled shows that the two documents receive different scores and that document 2 is ranked higher than document 3. The term frequency of orange in document 2 is higher than that in document 3.

However, this does not affect the score much as the number of terms in the document is small...

Working of OR and AND clauses

Let us see how the collector and scorer work together to calculate the results for both OR and AND clauses. Let us first focus on the OR clause. Considering the earlier index, suppose we perform the following search:

orange OR strawberry OR not

Figure: A search for orange OR strawberry OR not

Based on the terms in the query, document 1 is rejected by the Boolean filtering logic. We now need to introduce the concept of an accumulator. The purpose of the accumulator is to loop through each term in the query and pass the documents that contain the term to the collector.

In the present case, when the accumulator looks for documents containing orange, it gets documents 2 and 3. The output of the accumulator in this case is 2x1, 3x1, where 2 and 3 are the document IDs and 1 is the number of times the term orange occurs in both the documents. Next, it will process the term strawberry, for which it will get document ID 3. Now, the accumulator outputs 3x1 that adds to our...
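The accumulator and collector interplay for an OR query can be sketched as follows. This is a simplified model under stated assumptions, not Lucene's actual classes: the class and method names are hypothetical, the postings for the term not are assumed, and the collector simply sums the counts it receives, whereas real scoring involves more than a sum.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class OrAccumulatorSketch {

    // Collector: keeps a running total per document ID.
    static class Collector {
        final Map<Integer, Integer> totals = new TreeMap<>();

        void collect(int docId, int count) {
            totals.merge(docId, count, Integer::sum);
        }
    }

    // Accumulator: loops over the query terms and passes every
    // (document ID, count) posting of each term to the collector.
    static Map<Integer, Integer> accumulateOr(
            Map<String, Map<Integer, Integer>> postings, List<String> queryTerms) {
        Collector collector = new Collector();
        for (String term : queryTerms) {
            postings.getOrDefault(term, Map.of()).forEach(collector::collect);
        }
        return collector.totals;
    }

    public static void main(String[] args) {
        // Postings for orange and strawberry follow the text (2x1, 3x1 and 3x1);
        // the posting for "not" is an assumption made for this sketch.
        Map<String, Map<Integer, Integer>> postings = Map.of(
            "orange", Map.of(2, 1, 3, 1),
            "strawberry", Map.of(3, 1),
            "not", Map.of(3, 1));
        Map<Integer, Integer> totals =
            accumulateOr(postings, List.of("orange", "strawberry", "not"));
        // Document 1 never reaches the collector because it matches no term.
        System.out.println(totals);
    }
}
```

A document that matches more of the OR terms accumulates a larger total, while a document that matches none of them never appears in the result.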

The eDisMax query parser

Let us understand the working of the eDisMax query parser. We will also look at the minimum should match parameter and filters in this section.

Working of the eDisMax query parser

Let us first refresh our memory about the different query modes in Solr. Query modes are ways to call different query parsers to process a query. Solr has several query modes, which are selected by the defType parameter in the Solr URL. The query mode can also be changed by specifying a different defType parameter for the requestHandler property in the solrconfig.xml file. The popularly used query modes available in Solr are DisMax (Disjunction Max) and eDisMax (Extended Disjunction Max), in addition to the default (no defType) query mode. There are many other query modes available, such as lucene, maxscore, and surround, but these are less used.
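As a small illustration of selecting a query mode through the URL, the following sketch builds a select URL that carries the defType parameter. The host, port, and core name (collection1) are assumptions for this example, and the helper class is hypothetical; only the q and defType parameter names come from Solr itself.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class DefTypeUrlSketch {

    // Builds <baseUrl>/select?q=...&defType=... with URL-encoded values.
    static String selectUrl(String baseUrl, String userQuery, String defType) {
        Map<String, String> params = new LinkedHashMap<>(); // keeps insertion order
        params.put("q", userQuery);
        params.put("defType", defType);

        StringJoiner queryString = new StringJoiner("&");
        for (Map.Entry<String, String> e : params.entrySet()) {
            queryString.add(e.getKey() + "="
                + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8));
        }
        return baseUrl + "/select?" + queryString;
    }

    public static void main(String[] args) {
        // Assumed local Solr instance and core name.
        String base = "http://localhost:8983/solr/collection1";
        System.out.println(selectUrl(base, "samsung galaxy", "dismax"));
        System.out.println(selectUrl(base, "samsung galaxy", "edismax"));
    }
}
```

Switching between DisMax and eDisMax in this sketch only changes the value passed for defType; the rest of the request stays the same.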

The query parser used by DisMax can process simple phrases entered by the user and search for individual terms across different fields using...

Using BRS queries instead of DisMax

Now that we know the internals of how DisMax queries work and how scoring happens in Solr, let's look at creating our own query syntax and parser for customizing our search. The question here is what is missing in eDisMax. Note that eDisMax provides a simple search syntax where we do not have to worry about the fields and the results are sorted by relevance. However, suppose the requirement is exactly the opposite: the end user is an advanced user who knows the fields and what he or she is searching for. One such example is a search involving patents. The syntax for such a search is specified by BRS. In addition to fielded and Boolean search, BRS also provides a proximity search with clauses such as SAME (in the same paragraph), WITH (in the same sentence), ADJ (adjacent with order), and NEAR (adjacent without order), along with parenthetical grouping. An example of a BRS query is as follows:

((galaxy ADJ samsung) SAME note) AND (mobile OR tablet)
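To make the difference between ADJ and NEAR concrete, here is a minimal sketch that checks the two relationships on a whitespace-tokenized string. It is not a BRS parser and does not use Lucene; the class and method names are hypothetical, and SAME and WITH are left out because they need paragraph and sentence boundaries.

```java
import java.util.Arrays;
import java.util.List;

class ProximitySketch {

    // Splits text into tokens on whitespace.
    static List<String> tokenize(String text) {
        return Arrays.asList(text.trim().split("\\s+"));
    }

    // ADJ: `second` occurs at the position immediately after `first`.
    static boolean adj(List<String> tokens, String first, String second) {
        for (int i = 0; i + 1 < tokens.size(); i++) {
            if (tokens.get(i).equals(first) && tokens.get(i + 1).equals(second)) {
                return true;
            }
        }
        return false;
    }

    // NEAR: the two terms are adjacent in either order.
    static boolean near(List<String> tokens, String a, String b) {
        return adj(tokens, a, b) || adj(tokens, b, a);
    }

    public static void main(String[] args) {
        List<String> tokens = tokenize("samsung galaxy note");
        // Order matters for ADJ but not for NEAR.
        System.out.println("samsung ADJ galaxy: " + adj(tokens, "samsung", "galaxy"));
        System.out.println("galaxy ADJ samsung: " + adj(tokens, "galaxy", "samsung"));
        System.out.println("galaxy NEAR samsung: " + near(tokens, "galaxy", "samsung"));
    }
}
```

For the text samsung galaxy note, the ordered check succeeds only for samsung followed by galaxy, while the unordered check succeeds for the pair in either order.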

BRS provides...

Building a custom query parser

Let us look at how we can build our own query parser. We will build a proximity query parser known as the SWAN query parser, where SWAN stands for Same, With, Adjacent, and Near. This query parser will use the SWAN relationships between terms to process the query and fetch the results.

Proximity search using SWAN queries

Solr provides position-aware queries via phrase slop queries. An example of a phrase slop is "samsung galaxy"~4, which means that samsung and galaxy must occur within 4 word positions of each other. However, this does not cover the SWAN queries that we are looking for. Lucene supports position-aware queries through SpanQueries. The classes that implement span queries in Lucene are:

SpanTermQuery(Term term): Represents the building blocks of SpanQuery. The SpanTermQuery class is used for creating SpanOrQuery, SpanNotQuery, or SpanNearQuery.
SpanOrQuery(SpanQuery clauses): Can contain multiple SpanQuery clauses. We can use the addClause...

Summary

We went through the internals of Solr and Lucene. We saw how a scorer works on an inverted index and how the DisMax parser works in finding and ranking a result set. We understood the algorithms that work within Lucene when we use the OR or AND clause in Solr. We saw how filters work during the scoring of documents within a result set, and the importance and usage of the minimum should match parameter.

In the second half of the chapter, we built our own query parser. We went through the concepts of the parboiled parser API and built a SWAN query parser using the API. We also understood what is required during indexing and search to integrate and use the SWAN query parser as a plugin in Solr. This also clarified the concepts involved in building a custom query parser plugin for Solr.

In the next chapter, we will be looking at the use of Solr in processing and handling big data problems.

Author (1)

Jayant Kumar

Jayant Kumar is an experienced software professional with a bachelor of engineering degree in computer science and more than 14 years of experience in architecting and developing large-scale web applications. Jayant is an expert on search technologies and PHP and has been working with Lucene and Solr for more than 11 years now. He is the key person responsible for introducing Lucene as a search engine on www.naukri.com, the most successful job portal in India. Jayant is also the author of the book Apache Solr PHP Integration, Packt Publishing, which has been very successful. Jayant has played many different important roles throughout his career, including software developer, team leader, project manager, and architect, but his primary focus has been on building scalable solutions on the Web. Currently, he is associated with the digital division of HT Media as the chief architect responsible for the job site www.shine.com. Jayant is an avid blogger and his blog can be visited at http://jayant7k.blogspot.in. His LinkedIn profile is available at http://www.linkedin.com/in/jayantkumar.