
You're reading from  Solr Cookbook - Third Edition

Product type: Book
Published in: Jan 2015
Reading level: Intermediate
ISBN-13: 9781783553150
Edition: 1st Edition
Author (1)
Rafal Kuc

Rafał Kuć is a software engineer, trainer, speaker and consultant. He is working as a consultant and software engineer at Sematext Group Inc. where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains—from banking software to e–commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him to achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days. Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest. Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.

Changing similarity


Most of the time, the default way of calculating the score of your documents is what you need. However, sometimes you need more from Solr than the standard behavior. For example, you might want shorter documents to be more valuable than longer ones. Let's assume that you want to change the default behavior and use a different score calculation algorithm for the description field of your index. This recipe will show you how to do that.

Getting ready

Before choosing one of the score calculation algorithms available in Solr, it's good to read a bit about them. The detailed description of all the algorithms is beyond the scope of this recipe and the book (although a simple description is mentioned later in the recipe), but I suggest visiting the Solr wiki page (or Javadocs) and reading basic information about the available implementations.

How to do it...

For the purpose of this recipe, let's assume we have the following index structure (just add the following entries to your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general_dfr" indexed="true" stored="true" />

The string and text_general types are available in the default schema.xml file provided with the example Solr distribution. However, we want DFRSimilarity to be used to calculate the score for the description field. In order to do this, we introduce a new type, which is defined as follows (just add the following entries to your schema.xml file):

<fieldType name="text_general_dfr" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
 </similarity>
</fieldType>

Also, to use the per-field similarity, you have to add the following entry to your schema.xml file:

<similarity class="solr.SchemaSimilarityFactory"/>

That's all. Now, let's have a look and see how this works.

How it works...

The index structure presented earlier is simple, with only three fields. What matters here is that the description field uses our own custom field type called text_general_dfr.

The most interesting part is the definition of the new text_general_dfr field type. As you can see, apart from the index and query analyzers, there is an additional section called similarity. It specifies which similarity implementation is used to calculate the score for fields of that type. As with other Solr components, the class attribute names the factory class that creates the desired similarity implementation, in our case, solr.DFRSimilarityFactory. If needed, you can also specify additional parameters that configure the behavior of the chosen similarity class. In the previous example, we specified four additional parameters (basicModel, afterEffect, normalization, and c), all of which define the DFRSimilarity behavior.
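To illustrate how these parameters can vary, here is a minimal sketch of a second field type that configures DFRSimilarity differently. The type name text_general_dfr_h3 and the simplified analyzer are assumptions made for this example, and the values are illustrative rather than tuned. Note that the H3 normalization is configured with mu instead of c:

```xml
<!-- Hypothetical field type; the name and the reduced analyzer chain are assumptions for illustration -->
<fieldType name="text_general_dfr_h3" class="solr.TextField" positionIncrementGap="100">
 <analyzer>
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <similarity class="solr.DFRSimilarityFactory">
  <!-- A different basic model and after effect than in the recipe -->
  <str name="basicModel">I(F)</str>
  <str name="afterEffect">B</str>
  <!-- H3 normalization takes the mu parameter rather than c -->
  <str name="normalization">H3</str>
  <float name="mu">900</float>
 </similarity>
</fieldType>
```

Check the DFRSimilarityFactory Javadocs for the full set of accepted values before relying on any particular combination.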

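The solr.SchemaSimilarityFactory class has to be declared as the global similarity for the similarity sections defined in field types to take effect. Field types that do not declare their own similarity keep using the default one.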
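The placement of the two similarity entries can be sketched as follows. This is only an outline: the schema name, the version value, and the fields and types wrapper elements are assumptions based on a typical Solr 4.x schema.xml, and your file might be organized differently:

```xml
<schema name="example" version="1.5">
 <fields>
  <!-- field definitions go here, including description with type text_general_dfr -->
 </fields>
 <types>
  <!-- field type definitions go here, including text_general_dfr with its nested similarity section -->
 </types>
 <!-- Global similarity: a direct child of schema, outside any fieldType -->
 <similarity class="solr.SchemaSimilarityFactory"/>
</schema>
```

The point of the outline is that the per-field similarity is nested inside a fieldType, while the SchemaSimilarityFactory entry sits at the top level of the schema.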

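Although the recipe is not about all the similarities available, I wanted to list them. Note that each similarity might require and use different configuration parameters (all of them are described in the provided Javadocs). The similarity factories shipped with Solr 4.x include solr.DefaultSimilarityFactory, solr.SweetSpotSimilarityFactory, solr.BM25SimilarityFactory, solr.DFRSimilarityFactory, solr.IBSimilarityFactory, solr.LMDirichletSimilarityFactory, solr.LMJelinekMercerSimilarityFactory, and solr.SchemaSimilarityFactory.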

There's more...

In addition to per-field similarity definition, you can also configure the global similarity.

Changing the global similarity

Apart from specifying the similarity class on a per-field basis, you can choose a similarity other than the default one in a global way. For example, if you want to use BM25Similarity as the default similarity, you should add the following entry to your schema.xml file:

<similarity class="solr.BM25SimilarityFactory"/>

As with the per-field similarity, you need to provide the name of the factory class that is responsible for creating the appropriate similarity class.
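The global similarity can also take parameters. The following sketch configures BM25 with explicit k1 and b values; the value chosen for b is an illustrative assumption, not a recommendation. A higher b strengthens length normalization, which is one way to approach the case mentioned at the start of the recipe, where shorter documents should be valued more than longer ones:

```xml
<similarity class="solr.BM25SimilarityFactory">
 <!-- k1 controls term frequency saturation (1.2 is the Lucene default) -->
 <float name="k1">1.2</float>
 <!-- b controls how strongly document length affects the score (Lucene default is 0.75; 0.9 is illustrative) -->
 <float name="b">0.9</float>
</similarity>
```

With b set to 0, document length has no influence on the score, so it is worth comparing a few values against your own data.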
