
You're reading from  Apache Solr Search Patterns

Product type: Book
Published in: Apr 2015
Reading level: Intermediate
ISBN-13: 9781783981847
Edition: 1st Edition
Author: Jayant Kumar

Jayant Kumar is an experienced software professional with a bachelor of engineering degree in computer science and more than 14 years of experience in architecting and developing large-scale web applications. Jayant is an expert on search technologies and PHP and has been working with Lucene and Solr for more than 11 years now. He is the key person responsible for introducing Lucene as a search engine on www.naukri.com, the most successful job portal in India. Jayant is also the author of the book Apache Solr PHP Integration, Packt Publishing, which has been very successful. Jayant has played many different important roles throughout his career, including software developer, team leader, project manager, and architect, but his primary focus has been on building scalable solutions on the Web. Currently, he is associated with the digital division of HT Media as the chief architect responsible for the job site www.shine.com. Jayant is an avid blogger and his blog can be visited at http://jayant7k.blogspot.in. His LinkedIn profile is available at http://www.linkedin.com/in/jayantkumar.

Chapter 3. Solr Internals and Custom Queries

In this chapter, we will see how the relevance scorer works on the inverted index. We will understand how AND and OR clauses work in a query and look at how query filters and the minimum match parameter work internally. We will understand how the eDisMax query parser works. We will then implement our own query language as a Solr plugin and use it to perform a proximity search. This chapter will give us an insight into the customization of the query logic and the creation of custom query parsers as plugins in Solr. This chapter will cover the following topics:

  • How a scorer works on an inverted index

  • How OR and AND clauses work

  • How the eDisMax query parser works

  • The minimum should match parameter

  • How filters work

  • Using Bibliographic Retrieval Services (BRS) queries instead of DisMax

  • Proximity search using SWAN (Same, With, Adj, Near) queries

  • Creating a parboiled parser

  • Building a Solr plugin for SWAN queries

  • Integrating the SWAN plugin in Solr

Working of a scorer on an inverted index


We have, so far, understood what an inverted index is and how relevance calculation works. Let us now understand how a scorer works on an inverted index. Suppose we have an index with the following three documents:

3 Documents

To index the documents, we have applied WhitespaceTokenizer along with the EnglishMinimalStemFilterFactory class. The tokenizer breaks each sentence into tokens by splitting on whitespace, and EnglishMinimalStemFilterFactory converts plural English words to their singular forms. The index thus created would be similar to the following:

An inverted index
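As a rough illustration of the indexing step described above, the following Python sketch builds a positional inverted index. The three documents are invented for this example (the chapter's own documents are shown only as a figure), and `minimal_stem` is a crude stand-in for an English minimal stemmer, not the Lucene implementation.

```python
from collections import defaultdict

# Hypothetical documents, made up for this sketch.
DOCS = {
    1: "apples are red",
    2: "fresh oranges",
    3: "strawberries and oranges",
}


def minimal_stem(token):
    """Crude plural-to-singular rule; an assumption, not Lucene's stemmer."""
    if token.endswith("ies") and len(token) > 4:
        return token[:-3] + "y"
    if token.endswith("s") and not token.endswith(("ss", "us")) and len(token) > 3:
        return token[:-1]
    return token


def build_inverted_index(docs):
    """Map each stemmed term to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        # Splitting on whitespace mirrors what a whitespace tokenizer does.
        for position, token in enumerate(text.split()):
            term = minimal_stem(token)
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


INDEX = build_inverted_index(DOCS)
```

With these made-up documents, looking up `INDEX["orange"]` returns entries for documents 2 and 3 only, which is the kind of lookup a term search performs.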

A search for the term orange will return documents 2 and 3. If we run the query with debugging enabled, we can see that the two documents have different scores and that document 2 is ranked higher than document 3. The term frequency of orange in document 2 is higher than that in document 3.

However, this does not affect the score much as the number of terms in the document is small...
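To make the interplay between term frequency and document length concrete, here is a small sketch of the classic Lucene TF-IDF factors (square-root term frequency, logarithmic inverse document frequency, and a length norm). It is a simplification: query normalization, the coordination factor, boosts, and the lossy encoding of norms are all left out, and the frequencies and document lengths used below are invented numbers rather than values from the chapter's index.

```python
import math


def tf(freq):
    """Term frequency factor: square root of the raw count."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency factor."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_terms):
    """Shorter fields get a larger norm."""
    return 1.0 / math.sqrt(num_terms)


def term_score(freq, doc_freq, num_docs, num_terms):
    """Simplified single-term score: tf * idf^2 * lengthNorm."""
    return tf(freq) * idf(doc_freq, num_docs) ** 2 * length_norm(num_terms)


# Invented example: the term occurs in 2 of 3 documents.
score_a = term_score(freq=2, doc_freq=2, num_docs=3, num_terms=4)
score_b = term_score(freq=1, doc_freq=2, num_docs=3, num_terms=3)
```

In this made-up case `score_a` is larger than `score_b`: the higher term frequency outweighs the slightly longer document, but only by a modest margin because both documents are short.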

Working of OR and AND clauses


Let us see how the collector and scorer work together to calculate the results for both OR and AND clauses. Let us first focus on the OR clause. Considering the earlier index, suppose we perform the following search:

orange OR strawberry OR not

A search for orange OR strawberry OR not

On the basis of the terms in the query, Doc Id 1 was rejected during the Boolean filtering logic. We need to introduce the concept of an accumulator here. The purpose of the accumulator is to loop through each term in the query and pass the documents that contain the term to the collector.

In the present case, when the accumulator looks for documents containing orange, it gets documents 2 and 3. The output of the accumulator in this case is 2x1, 3x1, where 2 and 3 are the document IDs and 1 is the number of times the term orange occurs in each of these documents. Next, it will process the term strawberry, for which it will get document ID 3. Now, the accumulator outputs 3x1 that adds to our...
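The accumulator and collector described above can be modelled with a few lines of Python. This is a toy model, not Lucene's scorer classes: the posting lists for orange and strawberry follow the text, the posting for `not` is invented purely for illustration, and the collected value is a simple sum of term counts rather than a real relevance score.

```python
from collections import Counter

# term -> {doc_id: term frequency}. The entry for "not" is an assumption.
POSTINGS = {
    "orange": {2: 1, 3: 1},
    "strawberry": {3: 1},
    "not": {2: 1},
}


def accumulate(terms, postings):
    """Walk the query term by term and emit (doc_id, freq) pairs."""
    for term in dict.fromkeys(terms):  # de-duplicate, keep order
        for doc_id, freq in sorted(postings.get(term, {}).items()):
            yield doc_id, freq


def search_or(terms, postings):
    """OR: a document is collected if any term matches."""
    collected = Counter()
    for doc_id, freq in accumulate(terms, postings):
        collected[doc_id] += freq
    return dict(collected)


def search_and(terms, postings):
    """AND: keep only documents that were emitted once for every term."""
    unique_terms = list(dict.fromkeys(terms))
    collected, hits = Counter(), Counter()
    for doc_id, freq in accumulate(unique_terms, postings):
        collected[doc_id] += freq
        hits[doc_id] += 1
    return {d: c for d, c in collected.items() if hits[d] == len(unique_terms)}


or_result = search_or(["orange", "strawberry", "not"], POSTINGS)
and_result = search_and(["orange", "strawberry"], POSTINGS)
```

Document 1 never reaches the collector in the OR case because none of the query terms point to it, and the AND case keeps only document 3, the one document that contains both orange and strawberry.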

The eDisMax query parser


Let us understand the working of the eDisMax query parser. We will also look at the minimum should match parameter and filters in this section.

Working of the eDisMax query parser

Let us first refresh our memory about the different query modes in Solr. Query modes are ways to call different query parsers to process a query. The query mode is specified by the defType parameter in the Solr URL. The default can also be changed by specifying a different defType parameter for the requestHandler property in the solrconfig.xml file. The popularly used query modes available in Solr are DisMax (Disjunction Max) and eDisMax (Extended Disjunction Max), in addition to the default (no defType) query mode. There are many other query modes available, such as lucene, maxscore, and surround, but these are less used.
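As a small illustration of selecting a query mode through the URL, the sketch below assembles a select request with `defType=edismax`. The host, core name, and field names are placeholders chosen for this example; `q`, `defType`, `qf`, `mm`, and `debugQuery` are standard Solr request parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, user_query, def_type="edismax", **extra):
    """Build a Solr select URL that picks the query parser via defType."""
    params = {"q": user_query, "defType": def_type}
    params.update(extra)
    return base_url + "/select?" + urlencode(params)


url = build_select_url(
    "http://localhost:8983/solr/collection1",  # hypothetical host and core
    "samsung galaxy",
    qf="title^2 description",  # hypothetical fields and boost
    mm="2",                    # minimum should match
    debugQuery="true",         # ask Solr to explain the scores
)
print(url)
```

Dropping the `def_type` argument to a different value, such as `dismax` or `lucene`, is all that is needed to send the same user input through another parser.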

The query parser used by DisMax can process simple phrases entered by the user and search for individual terms across different fields using...

Using BRS queries instead of DisMax


Now that we know the internals of how DisMax queries work and how scoring happens in Solr, let's look at creating our own query syntax and parser for customizing our search. The question here is what is missing in eDisMax. Note that eDisMax provides a simple search syntax where we do not have to worry about the fields, and the results are sorted by relevance. However, suppose the requirement is exactly the opposite: the end user is an advanced user who knows the fields and what he or she is searching for. One such example is a search involving patents. The syntax for such a search is specified by BRS. In addition to fielded and Boolean search, BRS also provides a proximity search with clauses such as SAME (in the same paragraph), WITH (in the same sentence), ADJ (adjacent with order), and NEAR (adjacent without order), along with parenthetical grouping. An example of a BRS query is as follows:

((galaxy ADJ samsung) SAME note) AND (mobile OR tablet)

BRS provides...

Building a custom query parser


Let us look at how we can build our own query parser. We will build a proximity query parser known as SWAN query where SWAN stands for Same, With, Adjacent, and Near. This query parser would use the SWAN relationships between terms to process the query and fetch the results.
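To give a feel for what such a parser has to produce, here is a hand-written recursive-descent sketch in Python that turns a SWAN-style query string into a nested tree. It is not the parboiled-based parser the chapter goes on to build; it also assumes, for simplicity, that all operators share one precedence level and associate to the left, with parentheses used for grouping.

```python
import re

OPERATORS = {"AND", "OR", "SAME", "WITH", "ADJ", "NEAR"}
TOKEN_RE = re.compile(r"\(|\)|[^\s()]+")


def tokenize(query):
    """Split a query into parentheses, operators, and terms."""
    return TOKEN_RE.findall(query)


def parse(query):
    """Return a nested (operator, left, right) tuple tree for the query."""
    tokens = tokenize(query)
    pos = 0

    def parse_operand():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of query")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = parse_expr()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return node
        if tok == ")" or tok in OPERATORS:
            raise ValueError(f"unexpected token {tok!r}")
        return tok

    def parse_expr():
        nonlocal pos
        left = parse_operand()
        # Assumption: equal precedence, left-associative operators.
        while pos < len(tokens) and tokens[pos] in OPERATORS:
            op = tokens[pos]
            pos += 1
            left = (op, left, parse_operand())
        return left

    tree = parse_expr()
    if pos != len(tokens):
        raise ValueError(f"unexpected token {tokens[pos]!r}")
    return tree


tree = parse("((galaxy ADJ samsung) SAME note) AND (mobile OR tablet)")
```

A plugin would then walk a tree like this and translate each node into the corresponding Lucene query object; that translation step is outside the scope of this sketch.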

Proximity search using SWAN queries

Solr provides position-aware queries via phrase slop queries. An example of a phrase slop query is "samsung galaxy"~4, which means that samsung and galaxy must occur within 4 word positions of each other. However, this does not cover the SWAN queries that we are looking for. Lucene supports position-aware queries through SpanQueries. The classes that implement span queries in Lucene are:

  • SpanTermQuery(Term term): Represents the building blocks of SpanQuery. The SpanTermQuery class is used for creating SpanOrQuery, SpanNotQuery, or SpanNearQuery.

  • SpanOrQuery(SpanQuery clauses): Can contain multiple SpanQuery clauses. We can use the addClause...
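The position-based matching that span queries rely on can be approximated with a toy model over term positions. The sketch below is a simplified Python illustration, not Lucene's span implementation: `near` accepts two position lists, a `slop` (the number of positions allowed between the two terms), and an `in_order` flag, and `adj` is defined here as the ordered, zero-slop case. The sample sentence is invented.

```python
def positions(tokens, term):
    """Return every position at which term occurs in the token list."""
    return [i for i, tok in enumerate(tokens) if tok == term]


def near(positions_a, positions_b, slop=0, in_order=False):
    """True if some pair of occurrences has at most `slop` positions between them."""
    for a in positions_a:
        for b in positions_b:
            if in_order and b <= a:
                continue  # B must come after A when order matters
            gap = abs(b - a) - 1
            if 0 <= gap <= slop:
                return True
    return False


def adj(positions_a, positions_b):
    """ADJ in this sketch: A immediately followed by B."""
    return near(positions_a, positions_b, slop=0, in_order=True)


# Invented sample text for the example.
tokens = "samsung launched the galaxy note".split()
samsung = positions(tokens, "samsung")
galaxy = positions(tokens, "galaxy")
note = positions(tokens, "note")
```

In this sample, `adj(galaxy, note)` holds while `adj(note, galaxy)` does not, and `near(samsung, galaxy, slop=2)` holds because exactly two words separate the terms.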

Summary


We went through the internals of Solr and Lucene. We saw how a scorer works on an inverted index and how the DisMax parser works in finding and ranking a result set. We understood the algorithms that work within Lucene when we use the OR or AND clause in Solr. We saw how filters work during the scoring of documents within a result set and the importance and usage of the minimum match parameter.

In the second half of the chapter, we built our own query parser. We went through the concepts of the parboiled parser API and built a SWAN query parser using the API. We also understood what is required during indexing and search to integrate and use the SWAN query parser as a plugin in Solr. This also clarified the concepts involved in building a custom query parser plugin for Solr.

In the next chapter, we will be looking at the use of Solr in processing and handling big data problems.
