Apache Solr: Spellchecker, Statistics, and Grouping Mechanism

by Rafał Kuć | July 2011 | Open Source

There are many features of Solr that we don't use every day. In the previous article by Rafał Kuć, author of Apache Solr 3.1 Cookbook, we took a look at Solr functionalities such as highlighting, sorting results, ignoring words, and so on. In this article, we will look at the spellchecker, statistics, and grouping mechanisms, which may not be in everyday use but can come in handy in many situations. The author will show you how to overcome some typical problems that can be solved using these Solr functionalities.

Specifically, we will cover:

  • Computing statistics for the search results
  • Checking user's spelling mistakes
  • Using "group by" like functionalities in Solr

 


Computing statistics for the search results

Imagine a situation where you want to compute some basic statistics about the documents in the results list. For example, in an e-commerce shop you may want to show the minimum and the maximum price of the documents that were found for a given query. Of course, you could fetch all the documents and compute the statistics yourself, but Solr can do it for you, and this recipe will show you how to use that functionality.

How to do it...

Let's start with the index structure (just add this to the fields section of your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="price" type="float" indexed="true" stored="true" />

The example data file looks like this:

<add>
<doc>
<field name="id">1</field>
<field name="name">Book 1</field>
<field name="price">39.99</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">Book 2</field>
<field name="price">30.11</field>
</doc>
<doc>
<field name="id">3</field>
<field name="name">Book 3</field>
<field name="price">27.77</field>
</doc>
</add>

Let's assume that we want our statistics to be computed for the price field. To do that, we send the following query to Solr:

http://localhost:8983/solr/select?q=name:book&stats=true&stats.field=price

The response Solr returned should be like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">0</int>
<lst name="params">
<str name="q">name:book</str>
<str name="stats">true</str>
<str name="stats.field">price</str>
</lst>
</lst>
<result name="response" numFound="3" start="0">
<doc>
<str name="id">1</str>
<str name="name">Book 1</str>
<float name="price">39.99</float>
</doc>
<doc>
<str name="id">2</str>
<str name="name">Book 2</str>
<float name="price">30.11</float>
</doc>
<doc>
<str name="id">3</str>
<str name="name">Book 3</str>
<float name="price">27.77</float>
</doc>
</result>
<lst name="stats">
<lst name="stats_fields">
<lst name="price">
<double name="min">27.77</double>
<double name="max">39.99</double>
<double name="sum">97.86999999999999</double>
<long name="count">3</long>
<long name="missing">0</long>
<double name="sumOfSquares">3276.9851000000003</double>
<double name="mean">32.62333333333333</double>
<double name="stddev">6.486118510583508</double>
</lst>
</lst>
</lst>
</response>

As you can see, in addition to the standard results list, there was an additional section available. Now let's see how it works.

How it works...

The index structure is pretty straightforward. It contains three fields—one for holding the unique identifier (the id field), one for holding the name (the name field), and one for holding the price (the price field).

The file that contains the example data is simple too, so I'll skip discussing it.

The query is interesting. In addition to the q parameter, we have two new parameters. The first one, stats=true, tells Solr that we want to use the StatsComponent, the component which will calculate the statistics for us. The second parameter, stats.field=price, tells the StatsComponent which field to use for the calculation. In our case, we told Solr to use the price field.

Now let's look at the result returned by Solr. As you can see, the StatsComponent added an additional section to the results. This section contains the statistics generated for the field we told Solr we want statistics for. The following statistics are available:

  • min: The minimum value that was found in the field for the documents that matched the query
  • max: The maximum value that was found in the field for the documents that matched the query
  • sum: Sum of all values in the field for the documents that matched the query
  • count: How many non-null values were found in the field for the documents that matched the query
  • missing: How many documents that matched the query didn't have any value in the specified field
  • sumOfSquares: Sum of all values squared in the field for the documents that matched the query
  • mean: The average for the values in the field for the documents that matched the query
  • stddev: The standard deviation for the values in the field for the documents that matched the query

You should also remember that you can specify multiple stats.field parameters to calculate statistics for different fields in a single query.
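To see where these numbers come from, here is a minimal sketch (plain Python, no Solr required) that reproduces the statistics Solr returned for the three example prices:

```python
import math

# The price values from the three example documents.
prices = [39.99, 30.11, 27.77]

count = len(prices)
total = sum(prices)
mean = total / count
sum_of_squares = sum(p * p for p in prices)
# Solr reports the sample standard deviation (n - 1 in the denominator).
stddev = math.sqrt(sum((p - mean) ** 2 for p in prices) / (count - 1))

print(min(prices), max(prices), round(mean, 2), round(stddev, 2))
# → 27.77 39.99 32.62 6.49
```

The values match the stats section of the sample response: min 27.77, max 39.99, sum 97.87, sumOfSquares 3276.9851, mean ~32.62, and stddev ~6.49.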

Please be careful when using this component on multivalued fields, as it can become a performance bottleneck.


Checking user's spelling mistakes

Most modern search sites have some kind of mechanism to correct the user's spelling mistakes. Some of those sites have a sophisticated mechanism, while others have just a basic one. Either way, if all the major search engines have one, there is a good chance that your client or boss will want one too. Is there a way to integrate such functionality into Solr? Yes, there is, and this recipe will show you how to do it.

How to do it...

Let's start with the index structure (just add this to the fields section of your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />

The example data file looks like this:

<add>
<doc>
<field name="id">1</field>
<field name="name">Solr cookbook</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">Mechanics cookbook</field>
</doc>
<doc>
<field name="id">3</field>
<field name="name">Other book</field>
</doc>
</add>

Our spell-check mechanism will work on the basis of the name field. Now, let's add the appropriate search component to the solrconfig.xml file.

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
<lst name="spellchecker">
<str name="name">default</str>
<str name="field">name</str>
<str name="spellcheckIndexDir">./spellchecker</str>
<str name="buildOnCommit">true</str>
</lst>
</searchComponent>

In addition, we would like to have it integrated into our search handler, so we make the default search handler definition like this (add this to your solrconfig.xml file):

<requestHandler name="standard" class="solr.SearchHandler" default="true">
<arr name="last-components">
<str>spellcheck</str>
</arr>
</requestHandler>

Now let's check how it works. To do that, we will send a query that contains a spelling mistake—we will send the words othar boak instead of other book. The query doing that should look like this:

http://localhost:8983/solr/select?q=name:(othar boak)&spellcheck=true&spellcheck.collate=true

The Solr response for that query looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">4</int>
</lst>
<result name="response" numFound="0" start="0"/>
<lst name="spellcheck">
<lst name="suggestions">
<lst name="othar">
<int name="numFound">1</int>
<int name="startOffset">6</int>
<int name="endOffset">11</int>
<arr name="suggestion">
<str>other</str>
</arr>
</lst>
<lst name="boak">
<int name="numFound">1</int>
<int name="startOffset">12</int>
<int name="endOffset">16</int>
<arr name="suggestion">
<str>book</str>
</arr>
</lst>
<str name="collation">name:(other book)</str>
</lst>
</lst>
</response>

As you can see from the preceding response, Solr corrected the spelling mistakes we made. Now let's see how it works.

How it works...

The index structure is pretty straightforward. It contains two fields: one for holding the unique identifier (the id field) and the other for holding the name (the name field). The file that contains the example data is simple too, so I'll skip discussing it.

The spellchecker component configuration is something we should look at a bit closer. It starts with the name of the component (the name attribute which, in our case, is spellcheck). Then we have the class attribute which specifies the class that implements the component.

Under the <lst name="spellchecker"> XML tag, we have the actual component configuration. The name of the spellchecker (<str name="name">) is an option which is not mandatory when we use a single spellchecker component. We used the default name. The field parameter (<str name="field">) specifies the field on the basis of which we will get the mistakes corrected. The <str name="spellcheckIndexDir"> tag specifies the directory (relative to the directory where your index directory is stored) in which the spellchecker component index will be held. In our case, the spellchecker component index will be named spellchecker and will be written in the same directory as the actual Solr index. The last parameter (<str name="buildOnCommit">) tells Solr to build the spellchecker index every time the commit operation is performed. Remember that it is crucial to build the index of the spellchecker with every commit, because the spellchecker is using its own index to generate the corrections.

The request handler we defined will be used by Solr as the default one (attribute default="true"). As you can see, we told Solr that we want to use the spellchecker component by adding a single string with the name of the component in the last-components array tag.

Now let's look at the query. We sent the words othar and boak in the query parameter (q). We told Solr to activate the spellchecker component by adding the spellcheck=true parameter to the query, and we asked for a new query to be constructed for us by adding the spellcheck.collate=true parameter. And that's actually all when it comes to the query.

Finally, we come to the results returned by Solr. As you can see, no documents were found for the words othar and boak, which is what we expected. However, there is a spellchecker component section added to the results list (the <lst name="spellcheck"> tag). For each word there is a suggestion returned by Solr (the <lst name="boak"> tag is the suggestion for the boak word). As you can see, the spellchecker component informs us about the number of suggestions found (the <int name="numFound"> tag), about the start and end offset of the suggestion (<int name="startOffset"> and <int name="endOffset">), and about the actual suggestions (the <arr name="suggestion"> array). The only suggestion that Solr returned was the word book (<str>book</str> under the suggestion array). The same goes for the second word.

There is an additional section in the spellchecker component results generated by the spellcheck.collate=true parameter, <str name="collation">name:(other book)</str>. It tells us what query Solr suggested to us. We can either show the query to the user or send it automatically to Solr and show our user the corrected results list—this one is up to you.
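Extracting the suggestions and the collated query from such a response is straightforward on the client side. The following sketch (plain Python, using a trimmed copy of the sample response above rather than a live Solr instance) shows one hypothetical way to do it:

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the spellchecker section from the sample response above.
response = """<response>
<lst name="spellcheck">
<lst name="suggestions">
<lst name="othar"><int name="numFound">1</int>
<arr name="suggestion"><str>other</str></arr></lst>
<lst name="boak"><int name="numFound">1</int>
<arr name="suggestion"><str>book</str></arr></lst>
<str name="collation">name:(other book)</str>
</lst>
</lst>
</response>"""

root = ET.fromstring(response)
suggestions = root.find("lst[@name='spellcheck']/lst[@name='suggestions']")

# Collect the first suggestion for each misspelled word.
corrections = {}
for entry in suggestions.findall("lst"):
    word = entry.get("name")
    corrections[word] = entry.find("arr[@name='suggestion']/str").text

# The collated query can be shown to the user or re-sent to Solr as-is.
collation = suggestions.find("str[@name='collation']").text
print(corrections, collation)
# → {'othar': 'other', 'boak': 'book'} name:(other book)
```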

Using "group by" like functionalities in Solr

Imagine a situation where you have a number of companies in your index. Each company is described by its unique identifier, name, and main office identifier. The problem is that you would like to show only one company with the given main office identifier. In other words, you would like to group by that data. Is this possible in Solr? Yes and this recipe will show you how to do it.

Getting ready

There is one thing that you need to know. The described functionality is not available in Solr 3.1 and lower. To use this functionality, you need to get Solr from the trunk of the Lucene/Solr SVN repository.

How to do it...

Let's start with the index structure (just add this to your schema.xml file to the fields section):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="mainOfficeId" type="int" indexed="true" stored="true" />

The example data file looks like this:

<add>
<doc>
<field name="id">1</field>
<field name="name">Company 1</field>
<field name="mainOfficeId">1</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">Company 2</field>
<field name="mainOfficeId">2</field>
</doc>
<doc>
<field name="id">3</field>
<field name="name">Company 3</field>
<field name="mainOfficeId">1</field>
</doc>
</add>

Let's assume that our hypothetical user sends a query for the word company. In the search results, we want to show only one document with the same mainOfficeId field value. To do that, we send the following query to Solr:

http://localhost:8983/solr/select?q=name:company&group=true&group.field=mainOfficeId

The response that was returned from Solr was as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">1</int>
<lst name="params">
<str name="group.field">mainOfficeId</str>
<str name="group">true</str>
<str name="q">name:company</str>
</lst>
</lst>
<lst name="grouped">
<lst name="mainOfficeId">
<int name="matches">3</int>
<arr name="groups">
<lst>
<int name="groupValue">1</int>
<result name="doclist" numFound="2" start="0">
<doc>
<str name="id">1</str>
<int name="mainOfficeId">1</int>
<str name="name">Company 1</str>
</doc>
</result>
</lst>
<lst>
<int name="groupValue">2</int>
<result name="doclist" numFound="1" start="0">
<doc>
<str name="id">2</str>
<int name="mainOfficeId">2</int>
<str name="name">Company 2</str>
</doc>
</result>
</lst>
</arr>
</lst>
</lst>
</response>

As you can see, the results list is a little bit different from the one that we are used to. Let's see how it works.

How it works...

The index structure is pretty straightforward. It contains three fields: one for holding the unique identifier (the id field), one for holding the name (the name field), and one for holding the identifier of the main office (the mainOfficeId field). The file that contains the example data is simple too, so I'll skip discussing it.

Now let's look at the query. We send the company word in the query parameter (q). In addition to that, we have two new additional parameters. The group=true parameter tells Solr that we want to use the grouping mechanism. In addition to that we need to tell Solr what field should be used for grouping—to do that we use the group.field parameter, which in our case is set to the mainOfficeId field.

So let's have a look at how Solr behaves with the given example query. Take a look at the results list. Instead of the standard search results, we got everything grouped under the <lst name="grouped"> XML tag. For every field (or query) passed to the grouping component (in our case by the group.field parameter), Solr creates an additional section, in our case the <lst name="mainOfficeId"> XML tag. The next thing we see is the <int name="matches"> tag, which tells us how many documents were found for the given query. Finally, we have the grouped results under the <arr name="groups"> XML tag.

For every unique value of the mainOfficeId field we have a group returned by Solr. The <int name="groupValue"> tells us about the value for which the group is constructed. In our example, we have two groups, one for the 1 value and the other for the 2 value. Documents that are in a group are described by the <result name="doclist"> XML tag. The numFound attribute of that tag tells how many documents are in the given group and the start attribute tells us from which index the documents are shown. Then for every document in the group, a <doc> XML tag is created. The <doc> tag contains the information about the fields just like the usual Solr result list.
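The shape of that response can be emulated client-side. This plain-Python sketch (an illustration only, not Solr code; the `group_by_field` helper is hypothetical) shows how the three example documents collapse into the two groups seen above:

```python
from collections import OrderedDict

# The three example company documents.
docs = [
    {"id": "1", "name": "Company 1", "mainOfficeId": 1},
    {"id": "2", "name": "Company 2", "mainOfficeId": 2},
    {"id": "3", "name": "Company 3", "mainOfficeId": 1},
]

def group_by_field(documents, field, limit=1):
    """Group documents by a field value, keeping at most `limit` docs per
    group, mirroring the shape of a group=true&group.field=... response."""
    groups = OrderedDict()
    for doc in documents:
        groups.setdefault(doc[field], []).append(doc)
    return {
        "matches": len(documents),
        "groups": [
            {"groupValue": value,
             "doclist": {"numFound": len(members), "docs": members[:limit]}}
            for value, members in groups.items()
        ],
    }

grouped = group_by_field(docs, "mainOfficeId")
print(grouped["matches"], [g["groupValue"] for g in grouped["groups"]])
# → 3 [1, 2]
```

As in the Solr response, matches is 3, the group for mainOfficeId 1 reports numFound=2 but returns a single document, and the group for mainOfficeId 2 reports numFound=1.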

There is one thing you should know when using the grouping functionality: at the time of writing, it does not support distributed search, and grouping is not supported for multivalued fields.

There's more...

Fetching more than one document in a group

There are situations where you would like to get more than the default single document for every group. To do that, add the group.limit parameter with the desired value. For example, to show a maximum of four documents in every group, send the following query to Solr:

http://localhost:8983/solr/select?q=name:company&group=true&group.field=mainOfficeId&group.limit=4
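When building such URLs programmatically, it is safer to let a library handle parameter encoding than to concatenate strings. A minimal sketch using Python's standard library (the host and core path are assumptions matching the examples above; note that Solr accepts the percent-encoded colon):

```python
from urllib.parse import urlencode

# Base select handler used throughout these examples (assumed local instance).
base = "http://localhost:8983/solr/select"

params = {
    "q": "name:company",
    "group": "true",
    "group.field": "mainOfficeId",
    "group.limit": 4,
}
url = base + "?" + urlencode(params)
print(url)
```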

Summary

In this article we took a look at some additional Solr functionalities: the spellchecker, statistics, and grouping mechanisms.



About the Author


Rafał Kuć

Rafał Kuć is a born team leader and software developer. Working as a consultant and software engineer at Sematext Group, Inc., he concentrates on open source technologies such as Apache Lucene, Solr, ElasticSearch, and the Hadoop stack. He has more than 11 years of experience in various areas of software, from banking software to e-commerce products. He is mainly focused on Java, but open to every tool and programming language that will make achieving his goal easier and faster. He is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people resolve their problems with Solr and Lucene. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.

Rafał began his journey with Lucene in 2002 and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and this was it. He started working with ElasticSearch in the middle of 2010. Currently, Lucene, Solr, ElasticSearch, and information retrieval are his main points of interest.

Rafał is also an author of Solr 3.1 Cookbook, the update to it—Solr 4.0 Cookbook, and is a co-author of ElasticSearch Server all published by Packt Publishing.


