Chapter 4. Querying Solr

In this chapter, we will cover the following topics:

  • Understanding and using the Lucene query language

  • Using position-aware queries

  • Using boosting with autocomplete

  • Phrase queries with shingles

  • Handling user queries without errors

  • Handling hierarchies with nested documents

  • Sorting data on the basis of a function value

  • Controlling the number of terms needed to match

  • Affecting document score using function queries

  • Using simple nested queries

  • Using the Solr document's query join functionality

  • Handling typos with n-grams

  • Rescoring query results

Introduction


Creating a simple query is not a hard task, but creating a complex one, with faceting, local params, parameter dereferencing, and phrase queries, can be challenging. In addition, you must remember to write your queries with performance in mind. This is why something that seems simple at first sight, such as writing a good, complex query, can turn out to be far more demanding. This chapter will try to guide you through some of the tasks you might encounter during your everyday work with Solr.
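
To give a taste of two of these features, here is a minimal, hedged sketch of local params combined with parameter dereferencing (the qq parameter name is only an illustration):

    q={!edismax qf='title description' v=$qq}&qq=solr cookbook

The {!edismax ...} local params select and configure the query parser, while v=$qq tells Solr to read the actual query text from the qq request parameter.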

Understanding and using the Lucene query language


As you know, Solr is built using the Apache Lucene library. Because of this, some of the query parsers available in Solr allow us to fully leverage the Lucene query language, which gives us great flexibility and insight into how our queries work and which documents they match. In this recipe, we will discuss an example usage of the Lucene query language by looking at a book search site that lets its users define complex Boolean queries containing phrases.

How to do it...

Let's perform the following steps to achieve this:

  1. The first step is to prepare our index to handle data. To do this, we add the following entries to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="description" type="text_general" indexed="true" stored="true" />
    <field name="published" type="int" indexed...

Using position-aware queries


Most of the queries exposed by Lucene and Solr are not position-aware, which means that the query doesn't care about where in the document the matched words occur. Of course, we have phrase queries that we can use for phrase searching, and we can even introduce a phrase slop, but this is not always enough. Sometimes, we might want to search for words based on their positions in the matched documents. Let's assume that we allow our users to search in book titles and descriptions and to specify how the searched words should be positioned relative to each other. Solr provides us with such functionality, and this recipe will show you how to use it.

How to do it...

Let's start with a simple index structure. For the purpose of this recipe, we will use the following fields:

  1. Add the following sections to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true...

Using boosting with autocomplete


Autocomplete is very good when it comes to the user search experience. It is especially useful for showing users the data that we want to promote or the data that is of the most value to them. In general, in e-commerce, deploying the autocomplete functionality means more profit. However, there are situations where we want to promote certain products or documents, for example, the currently top-selling books or the most important financial reports. This recipe will show you how to boost certain documents when using the n-gram-based autocomplete functionality.

How to do it...

Let's perform the following steps to boost certain documents using the n-gram-based autocomplete function:

  1. We start with creating the index structure for our use case; we just put the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true"/>
    <field name="title" type="text_general" indexed="true" stored...

Phrase queries with shingles


Imagine that you have an application that searches within millions of documents generated by a law company. One of the requirements is to boost the documents that have either the whole search phrase or a part of the phrase in their title. So, is it possible to achieve this using Solr? Yes, and this recipe will show you how to do this.

How to do it...

Let's follow these steps to achieve this:

  1. Let's start with our index structure; we configure it by adding the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
  2. The second step is to create example data that looks like this:

    <doc>
      <field name="id">1</field>
      <field name="title">Financial report 2014</field>
     </doc>
     <doc>
      <field name="id">2</field>
      <field name="title">Financial marketing report 2014...

Handling user queries without errors


When building an application that uses Solr, we usually pass the query that the user entered to Solr. Sometimes, we even allow users to send complex queries that contain Lucene special characters. Because of this, there are situations where a user provides a malformed query, and Solr throws an exception when running it. We can alter this behavior by using a relatively new query parser called simple. This recipe will show you how to do this.

Getting ready

Before continuing to read this recipe, I suggest reading the Understanding and using the Lucene query language recipe from this chapter.

How to do it...

Let's look into how to handle user queries without errors using the following steps:

  1. We start by creating a simple index structure that will allow us to easily illustrate the example. To do this, we place the following section in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title"...

Handling hierarchies with nested documents


In the real world, data is not flat; it contains many hierarchies that we need to handle. Sometimes it is not possible to flatten the data, yet we still want to avoid cross matches and false matches. For example, let's assume that we have articles and comments on these articles, as on news sites or blogs. Imagine that we want to search for articles and comments at the same time. To do this, we will use Solr nested documents, and this recipe will show you how.

How to do it...

To handle hierarchies with nested documents, follow these steps:

  1. We start by defining the index structure. To do this, we add the following fields to our schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="content" type="text_general" indexed="true" stored="true"/>
    <field name="author" type="text_general" indexed="true" stored...

Sorting data on the basis of a function value


Suppose we have a search application that stores information about companies. Every company is described by a name and two floating-point numbers that represent its geographical location. One day, your boss comes to your room and says that he wants the search results to be sorted by the distance from the user's location. What's more, he wants the search engine to return the distance from the user's location to each of the returned companies. This recipe will show you how to achieve this requirement.

How to do it...

Let's perform the following steps to sort data on the basis of a function value:

  1. For this recipe, we will begin with the following index structure (add the following entries to your schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="location" type="location" indexed="true" stored...

Controlling the number of terms needed to match


Imagine a situation where you have an e-commerce bookstore and you want to build a search algorithm that brings the best results to your customers. However, you notice that many of your customers tend to enter queries with too many words, which results in an empty result list. So, you decide to make a query that requires only two of the words the user entered to be matched. This recipe will show you how to do it.

Getting ready

Before we continue, it is crucial to mention that the following method can only be used with the dismax or edismax query parser. For the list of available query parsers, refer to http://wiki.apache.org/solr/QueryParser.

How to do it...

Follow these steps to control the number of terms needed to match:

  1. Let's begin with creating our index structure. For our simple use case, we will only have documents with the identifier (the id field) and title (the title field). We define the index structure...
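
The key here is the mm (minimum should match) parameter of the dismax and edismax parsers; a minimal sketch of a query requiring only two of the entered words to match could look like this:

    q=solr cookbook third edition recipes&defType=edismax&qf=title&mm=2

The mm parameter also accepts conditional expressions such as 3<75%, which apply different requirements depending on how many terms the user typed.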

Affecting document score using function queries


There are many situations where you would like to influence how the score of the documents is calculated. For example, as an e-commerce bookstore, you would like to boost books on the basis of how many times they were purchased. You still want relevant results, but you would also like to influence them by adding yet another factor to their score. Is this possible? Yes, and this recipe will show you how to do it.

How to do it...

Let's see how the document score is affected using function queries and the following steps:

  1. Let's start by defining the index structure by adding the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="sold" type="int" indexed="true" stored="true" />
  2. The second step will be the example data, which looks like this:

    <add>
     <...
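
One way to add such a factor, sketched below under the assumption that we query with edismax, is the bf (boost function) parameter, which adds the value of a function query to each document's score:

    q=solr&defType=edismax&qf=title&bf=log(sum(sold,1))

The sum(sold,1) part guards against taking the logarithm of zero for books that haven't sold yet; edismax also offers the multiplicative boost parameter if you prefer to scale the score instead of adding to it.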

Using simple nested queries


Imagine a situation where you need a query nested inside another query. For example, you want to run a query using the standard request handler, but you need to embed a query that is parsed by the dismax query parser inside it. Specifically, we would like to find all the books that have a certain phrase in their title and boost the ones that have a part of that phrase present. This recipe will show you how to do this.

How to do it...

Let's start with a simple index that has the following structure:

  1. You need to put the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
  2. The next step is data indexing. Our example data looks as follows:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="title">Revised solrcookbook</field>
     </doc>
     <doc>
      <field name="id">2</field...

Using the Solr document's query join functionality


When using Solr, you are probably used to a flat structure of documents without any relationships. However, there are situations where flattening relationships comes at a cost we can't bear. Because of this, Solr 4.0 comes with a join functionality that lets us use some basic relationships. For example, imagine that our index consists of books and workbooks, and we want to use this relationship. This recipe will show you how to do this.

How to do it...

Let's perform the following steps:

  1. First of all, let's assume that we have the following index structure (just place the following entries in your schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="text_general" indexed="true" stored="true" multiValued="false"/>
    <field name="type" type="string" indexed="true" stored="true"/>
    <field name="book" type="string" indexed="true" stored="true...

Handling typos with n-grams


Sometimes, there are situations where you would like your users to get search results even though they made a typo, perhaps even more than one. In Solr, there are multiple ways to do this: use the spellchecker component and try to correct the user's mistake, use fuzzy queries, or use the n-gram approach. This recipe will concentrate on the third approach and show you how to use n-grams to handle user typos.

How to do it...

For this recipe, let's assume that our index is built of four fields: identifier, name, description, and description_ngram, which will be processed with the n-gram filter.

  1. So, let's start with the definition of our index structure that can look like this (we will place the following entries in the schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="text_general" indexed="true" stored="true"/...

Rescoring query results


Imagine a situation in which your score calculation is affected by numerous function queries, which makes the calculation very CPU-intensive. This is not a problem for small result sets, but it is for large ones. Starting from Solr 4.9, this great search engine gives us the possibility of reranking results. This means that Solr will fetch some results for our initial query and apply another query to those results only; the query that is applied modifies the score of the documents. This recipe will show you how this can be done.

How to do it...

Let's say that we have a use case where we want to show the latest books added to our index and boost them on the basis of some additional query. To do this, we will need to take the following steps:

  1. Let's start with a simple index structure. Our index will be built of three fields that look as follows (please put the following entries to the schema.xml file):

    <field name="id" type="string" indexed="true" stored="true...