
Apache Solr 4 Cookbook

By Rafał Kuć
About this book

Apache Solr is a blazing fast, scalable, open source Enterprise search server built upon Apache Lucene. Solr is wildly popular because it supports complex search criteria, faceting, result highlighting, query-completion, query spell-checking, and relevancy tuning, amongst other numerous features.

"Apache Solr 4 Cookbook" will show you how to get the most out of your search engine. Full of practical recipes and examples, this book will show you how to set up Apache Solr, tune and benchmark performance as well as index and analyze your data to provide better, more precise, and useful search data.

"Apache Solr 4 Cookbook" will make your search better, more accurate and faster with practical recipes on essential topics such as SolrCloud, querying data, search faceting, text and data analysis, and cache configuration.

With numerous practical chapters centered on important Solr techniques and methods, Apache Solr 4 Cookbook is an essential resource for developers who wish to take their knowledge and skills further. Thoroughly updated and improved, this Cookbook also covers the changes in Apache Solr 4 including the awesome capabilities of SolrCloud.

Publication date: January 2013
Publisher: Packt
Pages: 328
ISBN: 9781782161325

 

Chapter 1. Apache Solr Configuration

In this chapter we will cover:

  • Running Solr on Jetty

  • Running Solr on Apache Tomcat

  • Installing a standalone ZooKeeper

  • Clustering your data

  • Choosing the right directory implementation

  • Configuring spellchecker to not use its own index

  • Solr cache configuration

  • How to fetch and index web pages

  • How to set up the extracting request handler

  • Changing the default similarity implementation

 

Introduction


Setting up an example Solr instance is not a hard task, at least when setting up the simplest configuration. The simplest way is to run the example provided with the Solr distribution, which shows how to use the embedded Jetty servlet container.

If you don't have any experience with Apache Solr, please refer to the Apache Solr tutorial, which can be found at http://lucene.apache.org/solr/tutorial.html, before reading this book.

Tip

During the writing of this chapter, I used Solr version 4.0 and Jetty version 8.1.5, and those are the versions used in the recipes that follow. If another version of Solr is mandatory for a feature to run, it will be mentioned.

We have a simple configuration, simple index structure described by the schema.xml file, and we can run indexing.

In this chapter you'll see how to configure and use the more advanced Solr modules; you'll see how to run Solr in different containers and how to prepare your configuration to different requirements. You will also learn how to set up a new SolrCloud cluster and migrate your current configuration to the one supporting all the features of SolrCloud. Finally, you will learn how to configure Solr cache to meet your needs and how to pre-sort your Solr indexes to be able to use early query termination techniques efficiently.

 

Running Solr on Jetty


The simplest way to run Apache Solr on a Jetty servlet container is to run the provided example configuration based on embedded Jetty. But that is not what we want to do here. In this recipe, I would like to show you how to configure and run Solr on a standalone Jetty container.

Getting ready

First of all you need to download the Jetty servlet container for your platform. You can install it with a package manager (such as apt-get), or you can download it yourself from http://jetty.codehaus.org/jetty/.

How to do it...

The first thing is to install the Jetty servlet container, which is beyond the scope of this book, so we will assume that you have Jetty installed in the /usr/share/jetty directory or you copied the Jetty files to that directory.

Let's start by copying the solr.war file to the webapps directory of the Jetty installation (so the whole path would be /usr/share/jetty/webapps). In addition to that, we need a temporary directory for Jetty, so let's create the temp directory in the Jetty installation directory.

Next we need to copy and adjust the solr.xml file from the context directory of the Solr example distribution to the context directory of the Jetty installation. The final file contents should look like the following code:

<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
  <Set name="contextPath">/solr</Set>
  <Set name="war"><SystemProperty name="jetty.home"/>/webapps/solr.war</Set>
  <Set name="defaultsDescriptor"><SystemProperty name="jetty.home"/>/etc/webdefault.xml</Set>
  <Set name="tempDirectory"><Property name="jetty.home" default="."/>/temp</Set>
</Configure>

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

Now we need to copy the jetty.xml, webdefault.xml, and logging.properties files from the etc directory of the Solr distribution to the configuration directory of Jetty, so in our case to the /usr/share/jetty/etc directory.

The next step is to copy the Solr configuration files to the appropriate directory. I'm talking about files such as schema.xml, solrconfig.xml, solr.xml, and so on. Those files should be in the directory specified by the solr.solr.home system variable (in my case this was the /usr/share/solr directory). Please remember to preserve the directory structure you'll see in the example deployment, so for example, the /usr/share/solr directory should contain the solr.xml (and in addition zoo.cfg in case you want to use SolrCloud) file with the contents like so:

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr> 

All the other configuration files should go to the /usr/share/solr/collection1/conf directory (place the schema.xml and solrconfig.xml files there along with any additional configuration files your deployment needs). Your cores may have other names than the default collection1, so please be aware of that.

The last thing about the configuration is to update the /etc/default/jetty file and add -Dsolr.solr.home=/usr/share/solr to the JAVA_OPTIONS variable of that file. The whole line with that variable could look like the following:

JAVA_OPTIONS="-Xmx256m -Djava.awt.headless=true -Dsolr.solr.home=/usr/share/solr/" 

If you didn't install Jetty with apt-get or similar software, you may not have the /etc/default/jetty file. In that case, add the -Dsolr.solr.home=/usr/share/solr parameter to the Jetty startup command.

We can now run Jetty to see if everything is OK. To start a Jetty instance that was installed, for example, using the apt-get command, use the following command:

/etc/init.d/jetty start

You can also run Jetty with a java command. Run the following command in the Jetty installation directory:

java -Dsolr.solr.home=/usr/share/solr -jar start.jar

If there were no exceptions during the startup, we have a running Jetty with Solr deployed and configured. To check if Solr is running, try going to the following address with your web browser: http://localhost:8983/solr/.

You should see the Solr front page with cores, or a single core, mentioned. Congratulations! You just successfully installed, configured, and ran the Jetty servlet container with Solr deployed.
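
You can also verify this from the command line. The following is a quick sketch using curl and the cores administration API (it assumes the default port and the adminPath shown in the solr.xml example earlier):

curl "http://localhost:8983/solr/admin/cores?action=STATUS&wt=json"

The response should list your collection1 core along with some basic index statistics.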

How it works...

For the purpose of this recipe, I assumed that we needed a single core installation with only schema.xml and solrconfig.xml configuration files. Multicore installation is very similar – it differs only in terms of the Solr configuration files.
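
For reference, a multicore solr.xml could look like the following sketch (the users and products core names are made-up examples; each core needs its own instanceDir containing a conf directory with its schema.xml and solrconfig.xml files):

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="users">
    <core name="users" instanceDir="users" />
    <core name="products" instanceDir="products" />
  </cores>
</solr>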

The first thing we did was copy the solr.war file and create the temp directory. The WAR file is the actual Solr web application. The temp directory will be used by Jetty to unpack the WAR file.

The solr.xml file we placed in the context directory enables Jetty to define the context for the Solr web application. As you can see in its contents, we set the context to be /solr, so our Solr application will be available under http://localhost:8983/solr/. We also specified where Jetty should look for the WAR file (the war property), where the web application descriptor file (the defaultsDescriptor property) is, and finally where the temporary directory will be located (the tempDirectory property).

The next step is to provide configuration files for the Solr web application. Those files should be in the directory specified by the system solr.solr.home variable. I decided to use the /usr/share/solr directory to ensure that I'll be able to update Jetty without the need of overriding or deleting the Solr configuration files. When copying the Solr configuration files, you should remember to include all the files and the exact directory structure that Solr needs. So in the directory specified by the solr.solr.home variable, the solr.xml file should be available – the one that describes the cores of your system.

The solr.xml file is pretty simple – there should be the root element called solr. Inside it there should be a cores tag (with the adminPath variable set to the address where Solr's cores administration API is available and the defaultCoreName attribute that says which is the default core). The cores tag is a parent for the cores definition – each core should have its own core tag with a name attribute specifying the core name and the instanceDir attribute specifying the directory where the core-specific files will be available (such as the conf directory).

If you installed Jetty with the apt-get command or similar, you will need to update the /etc/default/jetty file to include the solr.solr.home variable for Solr to be able to see its configuration directory.

After all those steps we are ready to launch Jetty. If you installed Jetty with apt-get or a similar software, you can run Jetty with the first command shown in the example. Otherwise you can run Jetty with a java command from the Jetty installation directory.

After running the example query in your web browser you should see the Solr front page as a single core. Congratulations! You just successfully configured and ran the Jetty servlet container with Solr deployed.

There's more...

There are a few tasks you can do to counter some problems when running Solr within the Jetty servlet container. Here are the most common ones that I encountered during my work.

I want Jetty to run on a different port

Sometimes it's necessary to run Jetty on a port other than the default one. We have two ways to achieve that:

  • Adding an additional startup parameter, jetty.port. The startup command would look like the following command:

    java -Djetty.port=9999 -jar start.jar
    
  • Changing the jetty.xml file – to do that you need to change the following line:

    <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>

    To:

    <Set name="port"><SystemProperty name="jetty.port" default="9999"/></Set>

Buffer size is too small

Buffer overflow is a common problem when our queries are getting too long and too complex – for example, when we use many logical operators or long phrases. When the standard header buffer is not enough, you can resize it to meet your needs. To do that, add the following line to the Jetty connector definition in the jetty.xml file. Of course, the value shown in the example can be changed to the one that you need:

<Set name="headerBufferSize">32768</Set>

After adding the value, the connector definition should look more or less like the following snippet:

<Call name="addConnector">
  <Arg>
    <New class="org.mortbay.jetty.bio.SocketConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8080"/></Set>
      <Set name="maxIdleTime">50000</Set>
      <Set name="lowResourceMaxIdleTime">1500</Set>
      <Set name="headerBufferSize">32768</Set>
    </New>
  </Arg>
</Call>
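
Note that the snippet above uses the old org.mortbay connector classes. If your jetty.xml uses the newer org.eclipse.jetty classes (as Jetty 7 and 8 do), the rough equivalent is the requestHeaderSize setting. The following is only a sketch assuming a SelectChannelConnector; the exact connector class and defaults may differ in your installation:

<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <Set name="maxIdleTime">50000</Set>
      <Set name="requestHeaderSize">32768</Set>
    </New>
  </Arg>
</Call>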
 

Running Solr on Apache Tomcat


Sometimes you need to choose a servlet container other than Jetty. Maybe because your client has other applications running on another servlet container, maybe because you just don't like Jetty. Whatever your requirements are that put Jetty out of the scope of your interest, the first thing that comes to mind is a popular and powerful servlet container – Apache Tomcat. This recipe will give you an idea of how to properly set up and run Solr in the Apache Tomcat environment.

Getting ready

First of all we need an Apache Tomcat servlet container. It can be found at the Apache Tomcat website – http://tomcat.apache.org. I concentrated on the Tomcat Version 7.x because at the time of writing of this book it was mature and stable. The version that I used during the writing of this recipe was Apache Tomcat 7.0.29, which was the newest one at the time.

How to do it...

To run Solr on Apache Tomcat we need to follow these simple steps:

  1. Firstly, you need to install Apache Tomcat. The Tomcat installation is beyond the scope of this book so we will assume that you have already installed this servlet container in the directory specified by the $TOMCAT_HOME system variable.

  2. The second step is preparing the Apache Tomcat configuration files. To do that we need to add the following attribute to the connector definition in the server.xml configuration file:

    URIEncoding="UTF-8"

    The portion of the modified server.xml file should look like the following code snippet:

    <Connector port="8080" protocol="HTTP/1.1"
                   connectionTimeout="20000"
                   redirectPort="8443"
                   URIEncoding="UTF-8" />
  3. The third step is to create a proper context file. To do that, create a solr.xml file in the $TOMCAT_HOME/conf/Catalina/localhost directory. The contents of the file should look like the following code:

    <Context path="/solr" docBase="/usr/share/tomcat/webapps/solr.war" debug="0" crossContext="true">
       <Environment name="solr/home" type="java.lang.String" value="/usr/share/solr/" override="true"/>
    </Context>
  4. The next thing is the Solr deployment. To do that, we need the apache-solr-4.0.0.war file, which contains the files and libraries necessary to run Solr. Copy it to the Tomcat webapps directory and rename it to solr.war.

  5. The last thing we need to do is add the Solr configuration files. The files that you need to copy are files such as schema.xml, solrconfig.xml, and so on. Those files should be placed in the directory specified by the solr/home variable (in our case /usr/share/solr/). Please don't forget that you need to ensure the proper directory structure. If you are not familiar with the Solr directory structure please take a look at the example deployment that is provided with the standard Solr package.

  6. Please remember to preserve the directory structure you'll see in the example deployment, so for example, the /usr/share/solr directory should contain the solr.xml (and in addition zoo.cfg in case you want to use SolrCloud) file with the contents like so:

    <?xml version="1.0" encoding="UTF-8" ?>
    <solr persistent="true">
      <cores adminPath="/admin/cores" defaultCoreName="collection1">
        <core name="collection1" instanceDir="collection1" />
      </cores>
    </solr> 
  7. All the other configuration files should go to the /usr/share/solr/collection1/conf directory (place the schema.xml and solrconfig.xml files there along with any additional configuration files your deployment needs). Your cores may have other names than the default collection1, so please be aware of that.

  8. Now we can start the servlet container, by running the following command:

    bin/catalina.sh start
    
  9. In the log file you should see a message like this:

    Info: Server startup in 3097 ms
    
  10. To ensure that Solr is running properly, you can run a browser and point it to an address where Solr should be visible, like the following:

    http://localhost:8080/solr/

If you see the page with links to administration pages of each of the cores defined, that means that your Solr is up and running.
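
You can also check this from the command line by querying one of the cores; the following sketch assumes the default collection1 core:

curl "http://localhost:8080/solr/collection1/select?q=*:*&rows=0&wt=json"

A numFound value in the response means the core is up and answering queries (it will be 0 until you index some data).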

How it works...

Let's start from the second step as the installation part is beyond the scope of this book. As you probably know, Solr uses UTF-8 file encoding. That means that we need to ensure that Apache Tomcat will be informed that all requests and responses made should use that encoding. To do that, we modified the server.xml file in the way shown in the example.

The Catalina context file (called solr.xml in our example) says that our Solr application will be available under the /solr context (the path attribute). We also specified the WAR file location (the docBase attribute). We also said that we are not using debug mode (the debug attribute), and by setting the crossContext attribute we allowed Solr to access other web application contexts. The last thing is to specify the directory where Solr should look for the configuration files. We do that by adding the solr/home environment variable with the value attribute set to the path of the directory where we have put the configuration files.

The solr.xml file is pretty simple – there should be the root element called solr. Inside it there should be the cores tag (with the adminPath variable set to the address where the Solr cores administration API is available and the defaultCoreName attribute describing which is the default core). The cores tag is a parent for cores definition – each core should have its own core tag with a name attribute specifying the core name and the instanceDir attribute specifying the directory where the core-specific files will be available (such as the conf directory).

The shell command that is shown starts Apache Tomcat. There are some other options of the catalina.sh (or catalina.bat) script; the descriptions of these options are as follows:

  • stop: This stops Apache Tomcat

  • restart: This restarts Apache Tomcat

  • debug: This starts Apache Tomcat in debug mode

  • run: This runs Apache Tomcat in the current window, so you can see the output on the console from which you run Tomcat.

After running the example address in the web browser, you should see a Solr front page with a core (or cores if you have a multicore deployment). Congratulations! You just successfully configured and ran the Apache Tomcat servlet container with Solr deployed.

There's more...

There are some other tasks that are common problems when running Solr on Apache Tomcat.

Changing the port on which we see Solr running on Tomcat

Sometimes it is necessary to run Apache Tomcat on a port other than 8080, which is the default one. To do that, you need to modify the port attribute of the connector definition in the server.xml file located in the $TOMCAT_HOME/conf directory. If you would like your Tomcat to run on port 9999, this definition should look like the following code snippet:

<Connector port="9999" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8" />

While the original definition looks like the following snippet:

<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
               URIEncoding="UTF-8" />
 

Installing a standalone ZooKeeper


You may know that in order to run SolrCloud—the distributed Solr installation—you need to have Apache ZooKeeper installed. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. SolrCloud uses ZooKeeper to synchronize configuration and cluster states (such as elected shard leaders), and that's why it is crucial to have a highly available and fault tolerant ZooKeeper installation. If you have a single ZooKeeper instance and it fails, then your SolrCloud cluster will stop working properly too. So, this recipe will show you how to install ZooKeeper so that it's not a single point of failure in your cluster configuration.

Getting ready

The installation instructions in this recipe cover ZooKeeper Version 3.4.3, but they should be usable for any minor release of Apache ZooKeeper. To download ZooKeeper please go to http://zookeeper.apache.org/releases.html. This recipe will show you how to install ZooKeeper in a Linux-based environment. You also need Java installed.

How to do it...

Let's assume that we decided to install ZooKeeper in the /usr/share/zookeeper directory of our server and we want to have three servers (with IP addresses 192.168.1.1, 192.168.1.2, and 192.168.1.3) hosting the distributed ZooKeeper installation.

  1. After downloading the ZooKeeper installation, we create the necessary directory:

    sudo mkdir /usr/share/zookeeper 
    
  2. Then we unpack the downloaded archive to the newly created directory. We do that on three servers.

  3. Next we need to change our ZooKeeper configuration file and specify the servers that will form the ZooKeeper quorum, so we edit the /usr/share/zookeeper/conf/zoo.cfg file and we add the following entries:

    clientPort=2181
    dataDir=/usr/share/zookeeper/data
    tickTime=2000
    initLimit=10
    syncLimit=5
    server.1=192.168.1.1:2888:3888
    server.2=192.168.1.2:2888:3888
    server.3=192.168.1.3:2888:3888
  4. And now, we can start the ZooKeeper servers with the following command:

    /usr/share/zookeeper/bin/zkServer.sh start
    
  5. If everything went well you should see something like the following:

    JMX enabled by default
    Using config: /usr/share/zookeeper/bin/../conf/zoo.cfg
    Starting zookeeper ... STARTED
    

And that's all. Of course you can also add the ZooKeeper service to start automatically during your operating system startup, but that's beyond the scope of the recipe and the book itself.

How it works...

Let's skip the first part, because creating the directory and unpacking the ZooKeeper server there is quite simple. What I would like to concentrate on are the configuration values of the ZooKeeper server. The clientPort property specifies the port on which our SolrCloud servers should connect to ZooKeeper. The dataDir property specifies the directory where ZooKeeper will hold its data. So far, so good, right? Now for the more advanced properties: the tickTime property, specified in milliseconds, is the basic time unit for ZooKeeper. The initLimit property specifies how many ticks the initial synchronization phase can take. Finally, the syncLimit property specifies how many ticks can pass between sending a request and receiving an acknowledgement.

There are also three additional properties present: server.1, server.2, and server.3. These three properties define the addresses of the ZooKeeper instances that will form the quorum. Each of these values consists of three parts separated by colon characters. The first part is the IP address of the ZooKeeper server, and the second and third parts are the ports used by the ZooKeeper instances to communicate with each other.
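
One more thing worth remembering: in a multi-server setup, each ZooKeeper instance also needs a myid file in its data directory whose content matches its server.N number, and once the quorum is up you can check each node's role. A short sketch, using the paths and addresses assumed in this recipe:

# on the machine configured as server.1 (use 2 and 3 on the other machines)
mkdir -p /usr/share/zookeeper/data
echo "1" > /usr/share/zookeeper/data/myid

# check whether this instance is the leader or a follower
/usr/share/zookeeper/bin/zkServer.sh status

# ZooKeeper's "ruok" health check; a healthy instance answers "imok"
echo ruok | nc 192.168.1.1 2181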

 

Clustering your data


After the release of Apache Solr 4.0, many users will want to leverage SolrCloud distributed indexing and querying capabilities. It's not hard to upgrade your current cluster to SolrCloud, but there are some things you need to take care of. With the help of the following recipe you will be able to easily upgrade your cluster.

Getting ready

Before continuing further it is advised to read the Installing a standalone ZooKeeper recipe in this chapter. It shows how to set up a ZooKeeper cluster in order to be ready for production use.

How to do it...

In order to use your old index structure with SolrCloud, you will need to add the following field to your fields definition (add the following fragment to the schema.xml file, to its fields section):

<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/> 

Now let's switch to the solrconfig.xml file – starting with the replication handlers. First, you need to ensure that you have the replication handler set up. Remember that you shouldn't add master- or slave-specific configurations to it. So the replication handler's configuration should look like the following code:

<requestHandler name="/replication" class="solr.ReplicationHandler" /> 

In addition to that, you will need to have the administration panel handlers present, so the following configuration entry should be present in your solrconfig.xml file:

<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />

The last request handler that should be present is the real-time get handler, which should be defined as follows (the following should also be added to the solrconfig.xml file):

<requestHandler name="/get" class="solr.RealTimeGetHandler">
  <lst name="defaults">
    <str name="omitHeader">true</str>
  </lst>
</requestHandler>

The next thing SolrCloud needs in order to properly operate is the transaction log configuration. The following fragment should be added to the solrconfig.xml file:

<updateLog>
  <str name="dir">${solr.data.dir:}</str>
</updateLog>

The last thing is the solr.xml file. It should be pointing to the default cores administration address – the cores tag should have the adminPath property set to the /admin/cores value. The example solr.xml file could look like the following code:

<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1" host="localhost" hostPort="8983" zkClientTimeout="15000">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>

And that's all, your Solr instances configuration files are now ready to be used with SolrCloud.
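
To illustrate how such a configuration can be used, the following sketch starts the first SolrCloud node with the embedded Jetty example, pointing it at the standalone ZooKeeper quorum set up in the previous recipe. The numShards value and the configuration name are just example choices, and bootstrap_confdir together with collection.configName uploads the local configuration to ZooKeeper on the first startup:

java -DzkHost=192.168.1.1:2181,192.168.1.2:2181,192.168.1.3:2181 \
     -Dbootstrap_confdir=./solr/collection1/conf \
     -Dcollection.configName=collection1 \
     -DnumShards=2 \
     -jar start.jar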

How it works...

So now let's see why all those changes are needed in order to use our old configuration files with SolrCloud.

The _version_ field is used by Solr to enable document versioning and optimistic locking, which ensures that you won't have the newest version of your document overwritten by mistake. Because of that, SolrCloud requires the _version_ field to be present in your index structure. Adding that field is simple – you just need to place another field definition that is stored and indexed, and based on the long type. That's all.

As for the replication handler, you should remember not to add slave or master specific configuration, only the simple request handler definition, as shown in the previous example. The same applies to the administration panel handlers: they need to be available under the default URL address.

The real-time get handler is responsible for getting the updated documents right away, even if no commit or the softCommit command is executed. This handler allows Solr (and also you) to retrieve the latest version of the document without the need for re-opening the searcher, and thus even if the document is not yet visible during usual search operations. The configuration is very similar to the usual request handler configuration – you need to add a new handler with the name property set to /get and the class property set to solr.RealTimeGetHandler. In addition to that, we want the handler to be omitting response headers (the omitHeader property set to true).
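
For example, once the handler is defined you can ask for documents by their unique key, even before a commit; the following sketch assumes the default collection1 core and documents with identifiers 1, 2, and 3:

curl "http://localhost:8983/solr/collection1/get?id=1"
# several documents at once
curl "http://localhost:8983/solr/collection1/get?ids=1,2,3"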

One of the last things that is needed by SolrCloud is the transaction log, which enables real-time get operations to be functional. The transaction log keeps track of all the uncommitted changes and enables a real-time get handler to retrieve those. In order to turn on transaction log usage, one should add the updateLog tag to the solrconfig.xml file and specify the directory where the transaction log directory should be created (by adding the dir property as shown in the example). In the configuration previously shown, we tell Solr that we want to use the Solr data directory as the place to store the transaction log directory.

Finally, Solr needs you to keep the default address for the core administrative interface, so you should remember to have the adminPath property set to the value shown in the example (in the solr.xml file). This is needed in order for Solr to be able to manipulate cores.

 

Choosing the right directory implementation


One of the most crucial properties of Apache Lucene, and thus Solr, is the Lucene directory implementation. The directory interface provides an abstraction layer for Lucene on all the I/O operations. Although choosing the right directory implementation seems simple, it can affect the performance of your Solr setup in a drastic way. This recipe will show you how to choose the right directory implementation.

How to do it...

In order to use the desired directory, all you need to do is choose the right directory factory implementation and inform Solr about it. Let's assume that you would like to use NRTCachingDirectory as your directory implementation. In order to do that, you need to place (or replace if it is already present) the following fragment in your solrconfig.xml file:

<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />

And that's all. The setup is quite simple, but what directory factories are available to use? When this book was written, the following directory factories were available:

  • solr.StandardDirectoryFactory

  • solr.SimpleFSDirectoryFactory

  • solr.NIOFSDirectoryFactory

  • solr.MMapDirectoryFactory

  • solr.NRTCachingDirectoryFactory

  • solr.RAMDirectoryFactory

So now let's see what each of those factories provide.

How it works...

Before we get into the details of each of the presented directory factories, I would like to comment on the directory factory configuration parameter. All you need to remember is that the name attribute of the directoryFactory tag should be set to DirectoryFactory and the class attribute should be set to the directory factory implementation of your choice.

If you want Solr to make the decision for you, you should use solr.StandardDirectoryFactory. This is a filesystem-based directory factory that tries to choose the best implementation based on your current operating system and the Java virtual machine used. If you are implementing a small application, which won't use many threads, you can use solr.SimpleFSDirectoryFactory, which stores the index files on your local filesystem, but doesn't scale well with a high number of threads. solr.NIOFSDirectoryFactory scales well with many threads, but it doesn't work well on Microsoft Windows platforms (it's much slower) because of a JVM bug, so you should remember that.

solr.MMapDirectoryFactory was the default directory factory for Solr for the 64-bit Linux systems from Solr 3.1 till 4.0. This directory implementation uses virtual memory and a kernel feature called mmap to access index files stored on disk. This allows Lucene (and thus Solr) to directly access the I/O cache. This is desirable and you should stick to that directory if near real-time searching is not needed.

If you need near real-time indexing and searching, you should use solr.NRTCachingDirectoryFactory. It is designed to store some parts of the index in memory (small chunks) and thus speed up some near real-time operations greatly.

The last directory factory, solr.RAMDirectoryFactory, is the only one that is not persistent. The whole index is stored in RAM, and thus you'll lose your index after a restart or a server crash. Also, you should remember that replication won't work when using solr.RAMDirectoryFactory. One would ask, why should I use that factory? Imagine a volatile index for an autocomplete functionality or for unit tests of your queries' relevancy; in general, anything where you don't need persistent and replicated data. However, please remember that this directory is not designed to hold large amounts of data.

 

Configuring spellchecker to not use its own index


If you are used to the way the spellchecker worked in the previous Solr versions, you may remember that it required its own index to give you spelling corrections. That approach had some disadvantages, such as the need for rebuilding the index, and replication between master and slave servers. With Solr Version 4.0 a new spellchecker implementation was introduced – solr.DirectSolrSpellChecker. It allows you to use your main index to provide spelling suggestions and doesn't need to be rebuilt after every commit. So now, let's see how to use that new spellchecker implementation in Solr.

How to do it...

First of all, let's assume we have a field in the index called title, in which we hold titles of our documents. What's more, we don't want the spellchecker to have its own index and we would like to use that title field to provide spelling suggestions. In addition to that, we would like to decide when we want a spelling suggestion. In order to do that, we need to do two things:

  1. First, we need to edit our solrconfig.xml file and add the spellchecking component, whose definition may look like the following code:

    <searchComponent name="spellcheck" class="solr.SpellCheckComponent">
      <str name="queryAnalyzerFieldType">title</str>
      <lst name="spellchecker">
        <str name="name">direct</str>
        <str name="field">title</str>
        <str name="classname">solr.DirectSolrSpellChecker</str>
        <str name="distanceMeasure">internal</str>
        <float name="accuracy">0.8</float>
        <int name="maxEdits">1</int>
        <int name="minPrefix">1</int>
        <int name="maxInspections">5</int>
        <int name="minQueryLength">3</int>
        <float name="maxQueryFrequency">0.01</float>
      </lst>
    </searchComponent>
  2. Now we need to add a proper request handler configuration that will use the previously mentioned search component. To do that, we need to add the following section to the solrconfig.xml file:

    <requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
      <lst name="defaults">
        <str name="df">title</str>
        <str name="spellcheck.dictionary">direct</str>
        <str name="spellcheck">on</str>
        <str name="spellcheck.extendedResults">true</str>       
        <str name="spellcheck.count">5</str>     
        <str name="spellcheck.collate">true</str>
        <str name="spellcheck.collateExtendedResults">true</str>      
      </lst>
      <arr name="last-components">
        <str>spellcheck</str>
      </arr>
    </requestHandler>
  3. And that's all. In order to get spelling suggestions, we need to run the following query:

    /spell?q=disa
  4. In response we will get something like the following code:

    <?xml version="1.0" encoding="UTF-8"?>
    <response>
    <lst name="responseHeader">
      <int name="status">0</int>
      <int name="QTime">5</int>
    </lst>
    <result name="response" numFound="0" start="0">
    </result>
    <lst name="spellcheck">
      <lst name="suggestions">
        <lst name="disa">
          <int name="numFound">1</int>
          <int name="startOffset">0</int>
          <int name="endOffset">4</int>
          <int name="origFreq">0</int>
          <arr name="suggestion">
            <lst>
              <str name="word">data</str>
              <int name="freq">1</int>
            </lst>
          </arr>
        </lst>
        <bool name="correctlySpelled">false</bool>
        <lst name="collation">
          <str name="collationQuery">data</str>
          <int name="hits">1</int>
          <lst name="misspellingsAndCorrections">
            <str name="disa">data</str>
          </lst>
        </lst>
      </lst>
    </lst>
    </response>

If you check your data folder, you will see that there is no separate directory responsible for holding the spellchecker index. So now, let's see how that works.

How it works...

Now let's get into some specifics about how the previous configuration works, starting from the search component configuration. The queryAnalyzerFieldType property tells Solr which field configuration should be used to analyze the query passed to the spellchecker. The name property sets the name of the spellchecker which will be used in the handler configuration later. The field property specifies which field should be used as the source for the data used to build spelling suggestions. As you probably figured out, the classname property specifies the implementation class, which in our case is solr.DirectSolrSpellChecker, enabling us to omit having a separate spellchecker index. The next parameters visible in the configuration specify how the Solr spellchecker should behave and that is beyond the scope of this recipe (however, if you would like to read more about them, please go to the following URL address: http://wiki.apache.org/solr/SpellCheckComponent).

The last thing is the request handler configuration. Let's concentrate on all the properties that start with the spellcheck prefix. First we have spellcheck.dictionary, which in our case specifies the name of the spellchecking component we want to use (please note that the value of the property matches the value of the name property in the search component configuration). We tell Solr that we want the spellchecking results to be present (the spellcheck property with the value set to on), and we also tell Solr that we want to see the extended results format (spellcheck.extendedResults set to true). In addition to the mentioned configuration properties, we also said that we want to have a maximum of five suggestions (the spellcheck.count property), and we want to see the collation and its extended results (spellcheck.collate and spellcheck.collateExtendedResults both set to true).
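
It is also worth knowing that the values placed in the defaults section can be overridden per request. For example, the following sketch asks for ten suggestions and disables collation for a single query (assuming the default collection1 core):

curl "http://localhost:8983/solr/collection1/spell?q=disa&spellcheck.count=10&spellcheck.collate=false"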

There's more...

Let's see one more thing – the ability to have more than one spellchecker defined in a request handler.

More than one spellchecker

If you would like to have more than one spellchecker handling your spelling suggestions, you can configure your request handler to use multiple dictionaries. For example, if you would also like to use the spellcheckers named word and better (you have to have them configured in your spellcheck search component), you could add multiple spellcheck.dictionary parameters to your request handler. This is how your request handler configuration would look:

<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">title</str>
    <str name="spellcheck.dictionary">direct</str>
    <str name="spellcheck.dictionary">word</str>
    <str name="spellcheck.dictionary">better</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>       
    <str name="spellcheck.count">5</str>     
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>      
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
 

Solr cache configuration


As you may already know, caches play a major role in a Solr deployment. And I'm not talking about some exterior cache – I'm talking about the three Solr caches:

  • Filter cache: This is used for storing filter (the fq query parameter) results and, mainly, enum type facets

  • Document cache: This is used for storing Lucene documents which hold stored fields

  • Query result cache: This is used for storing results of queries

There is a fourth cache – Lucene's internal cache – which is a field cache, but you can't control its behavior. It is managed by Lucene and created when it is first used by the Searcher object.

With the help of these caches we can tune the behavior of the Solr searcher instance. In this task we will focus on how to configure your Solr caches to suit most needs. There is one thing to remember – Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.

Getting ready

Before you start tuning Solr caches you should get some information about your Solr instance. That information is as follows:

  • Number of documents in your index

  • Number of queries per second made to that index

  • Number of unique filter (the fq parameter) values in your queries

  • Maximum number of documents returned in a single query

  • Number of different queries and different sorts

All these numbers can be derived from Solr logs.

How to do it...

For the purpose of this task I assumed the following numbers:

  • Number of documents in the index: 1,000,000

  • Number of queries per second: 100

  • Number of unique filters: 200

  • Maximum number of documents returned in a single query: 100

  • Number of different queries and different sorts: 500

Let's open the solrconfig.xml file and tune our caches. All the changes should be made in the query section of the file (the section between <query> and </query> XML tags).

  1. First goes the filter cache:

    <filterCache
       class="solr.FastLRUCache"
       size="200"
       initialSize="200"
       autowarmCount="100"/>
  2. Second goes the query result cache:

    <queryResultCache
       class="solr.FastLRUCache"
       size="500"
       initialSize="500"
       autowarmCount="250"/>
  3. Third we have the document cache:

    <documentCache
       class="solr.FastLRUCache"
       size="11000"
       initialSize="11000" />

    Of course the above configuration is based on the example values.

  4. Further, let's set our result window to match our needs – we sometimes need to fetch 20–30 more results than those initially requested during query execution. So we change the appropriate value in the solrconfig.xml file to something like this:

    <queryResultWindowSize>200</queryResultWindowSize>

And that's all!

How it works...

Let's start with a little bit of explanation. First of all, we use the solr.FastLRUCache implementation instead of solr.LRUCache. The so-called FastLRUCache tends to be faster when Solr does fewer puts into the caches and more gets from them. This is the opposite of LRUCache, which tends to be more efficient when there are more put operations than get operations. That's why we use it.

This could be the first time you see cache configuration, so I'll explain what the cache configuration parameters mean:

  • class: You probably figured that out by now. Yes, this is the class implementing the cache.

  • size: This is the maximum size that the cache can have.

  • initialSize: This is the initial size that the cache will have.

  • autowarmCount: This is the number of cache entries that will be copied to the new instance of the same cache when Solr invalidates the Searcher object – for example, during a commit operation.

As you can see, I tend to use the same number of entries for size and initialSize, and half of those values for autowarmCount. The size and initialSize properties can be set to the same size in order to avoid the underlying Java object resizing, which consumes additional processing time.

There is one thing you should be aware of. Some of the Solr caches (documentCache actually) operate on internal identifiers called docid. Those caches cannot be automatically warmed. That's because docid is changing after every commit operation and thus copying docid is useless.

Please keep in mind that the settings for the size of the caches are usually good for the moment you set them. But during the life cycle of your application your data may change, your queries may change, and your users' behavior may, and probably will, change. That's why you should keep track of the cache usage with the use of the Solr administration pages, JMX, or specialized software such as Scalable Performance Monitoring from Sematext (see more at http://sematext.com/spm/index.html), see how the utilization of each of the caches changes over time, and make proper changes to the configuration.
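
If you want a quick look without external tools, the cache statistics (lookups, hits, evictions, and so on) are also exposed by the administration handlers; the following sketch assumes the default collection1 core:

curl "http://localhost:8983/solr/collection1/admin/mbeans?stats=true&cat=CACHE&wt=json"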

There's more...

There are a few additional things that you should know when configuring your caches.

Using a filter cache with faceting

If you use the term enumeration faceting method (parameter facet.method=enum) Solr will use the filter cache to check each term. Remember that if you use this method, your filter cache size should have at least the size of the number of unique facet values in all your faceted fields. This is crucial and you may experience performance loss if this cache is not configured the right way.
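
For example, a faceting query that goes through the filter cache could look like the following sketch (the category field name is just an assumed example field from your schema):

curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0&facet=true&facet.field=category&facet.method=enum"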

When we have no cache hits

When your Solr instance has a low cache hit ratio you should consider not using caches at all (to see the hit ratio you can use the administration pages of Solr). Cache insertion is not free – it costs CPU time and resources. So if you see that you have a very low cache hit ratio, you should consider turning your caches off – it may speed up your Solr instance. Before you turn off the caches please ensure that you have the right cache setup – a small hit ratio can be a result of bad cache configuration.

When we have more "puts" than "gets"

When your Solr instance uses put operations more than get operations you should consider using the solr.LRUCache implementation. It's confirmed that this implementation behaves better when there are more insertions into the cache than lookups.

Filter cache

This cache is responsible for holding information about the filters and the documents that match the filter. Actually this cache holds an unordered set of document IDs that match the filter. If you don't use the faceting mechanism with a filter cache, you should at least set its size to the number of unique filters that are present in your queries. This way it will be possible for Solr to store all the unique filters with their matching document IDs and this will speed up the queries that use filters.

Query result cache

The query result cache holds the ordered set of internal IDs of documents that match the given query and the sort specified. That's why if you use caches you should add as many filters as you can and keep your query (the q parameter) as clean as possible. For example, pass only the search box content of your search application to the query parameter. If the same query is run more than once and the cache has enough capacity to hold the entry, it will be used to return the IDs of the documents that match the query, and thus no Lucene query (Solr uses Lucene to index and query the indexed data) will be made, saving precious I/O operations for the queries that are not in the cache – this will boost your Solr instance's performance.
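
To illustrate, the two requests below return the same documents, but the second keeps the user's input in the q parameter and moves the repeatable constraints to filters, so both the filter cache and the query result cache can be reused (the field names are assumed examples):

# everything placed in q - every variation becomes a different cache entry
curl "http://localhost:8983/solr/collection1/select?q=laptop+AND+category:electronics+AND+inStock:true"

# user input in q, repeatable constraints as filters
curl "http://localhost:8983/solr/collection1/select?q=laptop&fq=category:electronics&fq=inStock:true"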

The maximum size of this cache that I tend to set is the number of unique queries and sorts that are handled by my Solr instance in the time between invalidations of the Searcher object. This tends to be enough in most cases.

Document cache

The document cache holds the Lucene documents that were fetched from the index. Basically, this cache holds the stored fields of all the documents that are gathered from the Solr index. The size of this cache should always be greater than the number of concurrent queries multiplied by the maximum results you get from Solr. This cache can't be automatically warmed – that is because every commit is changing the internal IDs of the documents. Remember that the cache can be memory consuming in case you have many stored fields, so there will be times when you just have to live with evictions.

Query result window

The last thing is the query result window. This parameter tells Solr how many documents to fetch from the index in a single Lucene query, so it is a kind of superset of the documents requested. In our example, we tell Solr that we want a maximum of one hundred documents as a result of a single query, while our query result window tells Solr to always gather two hundred documents. Then, when we need some more documents that follow the first hundred, they will be fetched from the cache, and therefore we will be saving our resources. The size of the query result window is mostly dependent on the application and how it is using Solr. If you tend to do a lot of paging, you should consider using a higher query result window value.

Tip

You should remember that the size of caches shown in this task is not final, and you should adapt them to your application needs. The values and the method of their calculation should only be taken as a starting point to further observation and optimization of the process. Also, please remember to monitor your Solr instance memory usage as using caches will affect the memory that is used by the JVM.

See also

There is another way to warm your caches if you know the most common queries that are sent to your Solr instance – auto-warming queries. Please refer to the Improving Solr performance right after a startup or commit operation recipe in Chapter 6, Improving Solr Performance. For information on how to cache whole pages of results please refer to the Caching whole result pages recipe in Chapter 6, Improving Solr Performance.

 

How to fetch and index web pages


There are many ways to index web pages. We could download them, parse them, and index them with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem – how to fetch them? We could possibly create our own software to do that, but that takes time and resources. That's why this recipe will cover how to fetch and index web pages using Apache Nutch.

Getting ready

For the purpose of this task we will be using Version 1.5.1 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org.

How to do it...

Let's assume that the website we want to fetch and index is http://lucene.apache.org.

  1. First of all we need to install Apache Nutch. To do that we just need to extract the downloaded archive to the directory of our choice; for example, I installed it in the directory /usr/share/nutch. Of course this is a single server installation and it doesn't include the Hadoop filesystem, but for the purpose of the recipe it will be enough. This directory will be referred to as $NUTCH_HOME.

  2. Then we'll open the file $NUTCH_HOME/conf/nutch-default.xml and set the value http.agent.name to the desired name of your crawler (we've taken SolrCookbookCrawler as a name). It should look like the following code:

    <property>
    <name>http.agent.name</name>
    <value>SolrCookbookCrawler</value>
    <description>HTTP 'User-Agent' request header.</description>
    </property>
  3. Now let's create empty directories called crawl and urls in the $NUTCH_HOME directory. After that we need to create the seed.txt file inside the created urls directory with the following contents:

    http://lucene.apache.org
  4. Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace the +. line at the bottom of the file with +^http://([a-z0-9]*\.)*lucene.apache.org/, so the appropriate entry looks like the following code:

    +^http://([a-z0-9]*\.)*lucene.apache.org/

    One last thing before fetching the data is Solr configuration.

  5. We start with copying the index structure definition file (called schema-solr4.xml) from the $NUTCH_HOME/conf/ directory to your Solr installation configuration directory (which in my case was /usr/share/solr/collection1/conf/). We also rename the copied file to schema.xml.

We also create an empty stopwords_en.txt file in the same configuration directory, or use the one provided with Solr if you want stop word removal.

Now we need to make two corrections to the schema.xml file we've copied:

  • The first one is the correction of the version attribute in the schema tag. We need to change its value from 1.5.1 to 1.5, so the final schema tag would look like this:

    <schema name="nutch" version="1.5">
  • Then we change the boost field type (in the same schema.xml file) from string to float, so the boost field definition would look like this:

    <field name="boost" type="float" stored="true" indexed="false"/>

Now we can start crawling and indexing by running the following command from the $NUTCH_HOME directory:

bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50

Depending on your Internet connection and your machine configuration, the crawl can take a while. When it finishes, you should see a message similar to the following one:

crawl finished: crawl-20120830171434

This means that the crawl is completed and the data was indexed to Solr.

How it works...

After installing Nutch and Solr, the first thing we did was set our crawler name; Nutch does not allow an empty name, so we had to choose one. The nutch-default.xml file defines many more properties than the one mentioned, but for now this is the only one we need to know about.
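If you want your crawler to identify itself more fully, the same nutch-default.xml file also contains related properties such as http.agent.description, http.agent.url, and http.agent.email; the values below are only placeholders that you would replace with your own details:

<property>
<name>http.agent.description</name>
<value>Solr Cookbook example crawler</value>
</property>
<property>
<name>http.agent.email</name>
<value>crawler@example.com</value>
</property>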

In the next step, we created two directories: crawl, intended to hold the crawl data, and urls, which stores the addresses we want to crawl. The seed.txt file we created inside the urls directory lists those addresses, one address per line.

The crawl-urlfilter.txt file contains the filter rules that Nutch uses to decide which URLs it is allowed to crawl. In our example, we told Nutch to accept every URL that begins with http://lucene.apache.org.

The schema.xml file we copied from the Nutch configuration directory is prepared for indexing with Solr. However, the one targeting Solr 4.0 is a bit buggy, at least in the Nutch 1.5.1 distribution, which is why we needed to make the changes mentioned previously.

Finally, we ran the Nutch crawl command. Its first parameter, urls, points to the directory that holds the seed addresses; the crawl data itself is written to a timestamped crawl directory, as the final message shows. The -solr switch specifies the address of the Solr server responsible for indexing the crawled data and is mandatory if you want the data indexed with Solr; we decided to index to a Solr instance running on the same server. The -depth parameter specifies how deep to follow the discovered links; in our example we go at most three links away from the seed page. The -topN parameter specifies how many documents are retrieved from each level, which we set to 50.
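If you would rather keep the crawl data in the crawl directory we created earlier instead of a timestamped one, the command can also be invoked with the -dir switch; this is only a variant of the same call, based on the options the Nutch 1.5.x crawl command accepts:

bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3 -topN 50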

There's more...

There is one more thing worth knowing when you start a journey in the land of Apache Nutch.

Multiple thread crawling

The crawl command of the Nutch command-line utility has another option – it can be configured to run crawling with multiple threads. To achieve that you add the following parameter:

-threads N

So if you would like to crawl with 20 threads, you should run the crawl command like so:

bin/nutch crawl urls -dir crawl -solr http://localhost:8983/solr/ -depth 3 -topN 50 -threads 20

See also

If you are looking for more information about Apache Nutch, please refer to http://nutch.apache.org and its Wiki section.

 

How to set up the extracting request handler


Sometimes indexing prepared text files (such as XML, CSV, JSON, and so on) is not enough. There are numerous situations where you need to extract data from binary files. For example, one of my clients wanted to index PDF files – actually their contents. To do that, we either need to parse the data in some external application or set up Solr to use Apache Tika. This task will guide you through the process of setting up Apache Tika with Solr.

How to do it...

In order to set up the extracting request handler, we need to follow these simple steps:

  1. First let's edit our Solr instance solrconfig.xml and add the following configuration:

    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler" >
     <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">attr_</str>
      <str name="captureAttr">true</str>
     </lst>
    </requestHandler>
  2. Next create the extract folder anywhere on your system (I created that folder in the directory where Solr is installed), and place the apache-solr-cell-4.0.0.jar file from the dist directory of the Solr distribution archive in it. After that, copy all the libraries from the contrib/extraction/lib/ directory of the same archive to the extract directory you just created (a short shell sketch of these copy operations is shown at the end of this list).

  3. In addition to that, we need the following entry added to the solrconfig.xml file:

    <lib dir="../../extract" regex=".*\.jar" />
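
Just as a reference, the copy operations described in step 2 could be performed like this; it is only a sketch, where /path/to/extract stands for whatever directory you created, and the commands are assumed to be run from the root of the unpacked Solr distribution archive:

mkdir -p /path/to/extract
cp dist/apache-solr-cell-4.0.0.jar /path/to/extract/
cp contrib/extraction/lib/*.jar /path/to/extract/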

And that's actually all that you need to do in terms of configuration.

To simplify the example, I decided to choose the following index structure (place it in the fields section in your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" /> 
<field name="text" type="text_general" indexed="true" stored="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>

To test the indexing process, I've created a PDF file called book.pdf using PDFCreator, which contained only the following text: This is a Solr cookbook. To index that file, I've used the following command:

curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"

You should see the following response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">578</int>
</lst>
</response>

How it works...

Binary file parsing in Solr is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents; it handles not only binary files but also HTML and XML files. To use Apache Tika, we add a handler based on the solr.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example.

In addition to the handler definition, we need to specify where Solr should look for the additional libraries we placed in the extract directory we created. The dir attribute of the lib tag should point to the path of that directory, and the regex attribute is the regular expression telling Solr which files to load.

Let's now discuss the default configuration parameters. The fmap.content parameter tells Solr to which field the content of the parsed document should go; in our case, the parsed content will be placed in the field named text. The next parameter, lowernames, set to true, tells Solr to lowercase all the field names coming from Tika. The uprefix parameter is very important: it tells Solr how to handle fields that are not defined in the schema.xml file. The name of such a field returned by Tika is prefixed with the value of this parameter and then sent to Solr. For example, if Tika returned a field named creator, and we don't have such a field in our index, Solr would index it under the field named attr_creator, which matches a dynamic field. The last parameter, captureAttr, tells Solr to index the attributes of the Tika XHTML elements into separate fields named after those elements.
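For example, if you know up front that you want a particular piece of Tika metadata in a specific field rather than relying on the uprefix mechanism, you can add an explicit fmap mapping to the handler defaults. The following line is only an illustration; it assumes Tika returns a last_modified field for your documents and maps it to the attr_modified dynamic field from our schema:

<str name="fmap.last_modified">attr_modified</str>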

Next we have the command that sends the PDF file to Solr. We send the file to the /update/extract handler with two parameters. First, we define a unique identifier; it's useful to be able to pass one while sending the document, because most binary documents won't contain an identifier of their own. To pass the identifier we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
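The literal mechanism is not limited to the identifier; any literal.<fieldname> parameter is indexed as-is alongside the extracted content. For example, the following call additionally tags the document with a value in the attr_source field (a hypothetical field name chosen to match the attr_* dynamic field from our schema):

curl "http://localhost:8983/solr/update/extract?literal.id=2&literal.attr_source=upload&commit=true" -F "myfile=@book.pdf"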

See also

To see how to index binary files please refer to the Indexing PDF files and Extracting metadata from binary files recipes in Chapter 2, Indexing Your Data.

 

Changing the default similarity implementation


Most of the time, the default way of calculating the score of your documents is what you need. But sometimes you need more than the standard behavior. Let's assume that you would like to use a different score calculation algorithm for the description field of your index. The current version of Solr allows you to do that, and this recipe will show you how to leverage this functionality.

Getting ready

Before choosing one of the score calculation algorithms available in Solr, it's good to read a bit about them. A description of all the algorithms is beyond the scope of this recipe and the book, but I would suggest going to the Solr Wiki pages (or looking at the Javadocs) and reading the basic information about the available implementations.

How to do it...

For the purpose of this recipe, let's assume we have the following index structure (just add the following entries to the fields section of your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general_dfr" indexed="true" stored="true" />

The string and text_general types are available in the default schema.xml file provided with the example Solr distribution. However, we want DFRSimilarity to be used to calculate the score for the description field. In order to do that, we introduce a new type, defined as follows (just add the following entries to the types section of your schema.xml file):

<fieldType name="text_general_dfr" class="solr.TextField" positionIncrementGap="100">
 <analyzer type="index">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <analyzer type="query">
  <tokenizer class="solr.StandardTokenizerFactory"/>
  <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
  <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
  <filter class="solr.LowerCaseFilterFactory"/>
 </analyzer>
 <similarity class="solr.DFRSimilarityFactory">
  <str name="basicModel">P</str>
  <str name="afterEffect">L</str>
  <str name="normalization">H2</str>
  <float name="c">7</float>
 </similarity>
</fieldType>

Also, to be able to use per-field similarity, we have to add the following entry to the schema.xml file:

<similarity class="solr.SchemaSimilarityFactory"/>

And that's all. Now let's have a look and see how that works.

How it works...

The index structure presented in this recipe is pretty simple as there are only three fields. The one thing we are interested in is that the description field uses our own custom field type called text_general_dfr.

The thing we are mostly interested in is the new field type definition called text_general_dfr. As you can see, apart from the index and query analyzers, there is an additional section: similarity. It is responsible for specifying which similarity implementation should be used to calculate the score for the given field. As with other Solr components, the class attribute specifies the factory class that provides the desired similarity implementation, which in our case is solr.DFRSimilarityFactory. If needed, you can also specify additional parameters that configure the behavior of the chosen similarity class. In the previous example, we specified four of them: basicModel, afterEffect, normalization, and c, which together define the DFRSimilarity behavior.

solr.SchemaSimilarityFactory is required to be able to specify the similarity for each field.

There's more...

In addition to per-field similarity definition, you can also configure the global similarity:

Changing the global similarity

Apart from specifying the similarity class on a per-field basis, you can also choose a similarity other than the default one globally. For example, if you would like to use BM25Similarity as the default, you should add the following entry to your schema.xml file:

<similarity class="solr.BM25SimilarityFactory"/>

As with the per-field similarity, you need to provide the name of the factory class that is responsible for creating the appropriate similarity implementation.
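If needed, the BM25 factory can also be tuned with parameters, just as we configured DFRSimilarityFactory earlier; the following sketch uses the standard BM25 free parameters k1 and b, and the values shown are only illustrative:

<similarity class="solr.BM25SimilarityFactory">
 <float name="k1">1.2</float>
 <float name="b">0.75</float>
</similarity>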

About the Author
  • Rafał Kuć

    Rafał Kuć is a software engineer, trainer, speaker and consultant. He is working as a consultant and software engineer at Sematext Group Inc. where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains, from banking software to e-commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him to achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days.

    Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest.

    Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.
