In this chapter we will cover:
Running Solr on Jetty
Running Solr on Apache Tomcat
Installing a standalone ZooKeeper
Clustering your data
Choosing the right directory implementation
Configuring spellchecker to not use its own index
Solr cache configuration
How to fetch and index web pages
How to set up the extracting request handler
Changing the default similarity implementation
Setting up an example Solr instance is not a hard task, at least for the simplest configuration. The easiest way is to run the example provided with the Solr distribution, which uses the embedded Jetty servlet container.
If you don't have any experience with Apache Solr, please refer to the Apache Solr tutorial, which can be found at http://lucene.apache.org/solr/tutorial.html, before reading this book.
Tip
During the writing of this chapter I used Solr version 4.0 and Jetty version 8.1.5, and those are the versions covered in the tips throughout this chapter. If another version of Solr is required for a feature to run, it will be mentioned explicitly.
We have a simple configuration and a simple index structure described by the schema.xml file, and we can run indexing.
In this chapter you'll see how to configure and use the more advanced Solr modules; you'll see how to run Solr in different containers and how to prepare your configuration to different requirements. You will also learn how to set up a new SolrCloud cluster and migrate your current configuration to the one supporting all the features of SolrCloud. Finally, you will learn how to configure Solr cache to meet your needs and how to pre-sort your Solr indexes to be able to use early query termination techniques efficiently.
The simplest way to run Apache Solr on a Jetty servlet container is to run the provided example configuration based on embedded Jetty. But that's not what we want here. In this recipe, I would like to show you how to configure and run Solr on a standalone Jetty container.
First of all you need to download the Jetty servlet container for your platform. You can get it from a package manager (such as apt-get), or you can download it yourself from http://jetty.codehaus.org/jetty/.
The first thing is to install the Jetty servlet container, which is beyond the scope of this book, so we will assume that you have Jetty installed in the /usr/share/jetty directory or that you copied the Jetty files there.
Let's start by copying the solr.war file to the webapps directory of the Jetty installation (so the whole path would be /usr/share/jetty/webapps). In addition to that, we need a temporary directory for Jetty, so let's create the temp directory in the Jetty installation directory.
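These two steps can be sketched as shell commands. The paths are only illustrative: a scratch directory stands in for /usr/share/jetty and for the unpacked Solr distribution so the sketch is harmless to run; substitute your real paths.

```shell
# Minimal sketch of the copy steps. JETTY_HOME would normally be
# /usr/share/jetty; SOLR_DIST stands for the unpacked Solr distribution.
JETTY_HOME=$(mktemp -d)
SOLR_DIST=$(mktemp -d)
mkdir -p "$JETTY_HOME/webapps" "$SOLR_DIST/dist"
touch "$SOLR_DIST/dist/apache-solr-4.0.0.war"   # placeholder for the real WAR

# Copy the WAR under the name solr.war and create Jetty's temp directory
cp "$SOLR_DIST/dist/apache-solr-4.0.0.war" "$JETTY_HOME/webapps/solr.war"
mkdir -p "$JETTY_HOME/temp"
ls "$JETTY_HOME/webapps"
```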
Next we need to copy and adjust the solr.xml file from the contexts directory of the Solr example distribution to the contexts directory of the Jetty installation. The final file contents should look like the following code:
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure.dtd">
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
  <Set name="contextPath">/solr</Set>
  <Set name="war"><SystemProperty name="jetty.home"/>/webapps/solr.war</Set>
  <Set name="defaultsDescriptor"><SystemProperty name="jetty.home"/>/etc/webdefault.xml</Set>
  <Set name="tempDirectory"><Property name="jetty.home" default="."/>/temp</Set>
</Configure>
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.
Now we need to copy the jetty.xml, webdefault.xml, and logging.properties files from the etc directory of the Solr distribution to the configuration directory of Jetty; in our case this is the /usr/share/jetty/etc directory.
The next step is to copy the Solr configuration files to the appropriate directory. I'm talking about files such as schema.xml, solrconfig.xml, solr.xml, and so on. Those files should be placed in the directory specified by the solr.solr.home system variable (in my case this was the /usr/share/solr directory). Please remember to preserve the directory structure you'll see in the example deployment; for example, the /usr/share/solr directory should contain the solr.xml file (and in addition zoo.cfg, in case you want to use SolrCloud) with contents like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
All the other configuration files should go to the /usr/share/solr/collection1/conf directory (place the schema.xml and solrconfig.xml files there, along with any additional configuration files your deployment needs). Your cores may have names other than the default collection1, so please be aware of that.
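The resulting solr.solr.home layout can be sketched as follows. A scratch directory stands in for /usr/share/solr so the sketch can run safely; the file names are the ones this recipe uses.

```shell
# Sketch of the directory structure Solr expects under solr.solr.home.
# SOLR_HOME would normally be /usr/share/solr; a scratch directory is
# used here for illustration only.
SOLR_HOME=$(mktemp -d)
mkdir -p "$SOLR_HOME/collection1/conf"
touch "$SOLR_HOME/solr.xml"                        # core definitions
touch "$SOLR_HOME/collection1/conf/schema.xml"     # index structure
touch "$SOLR_HOME/collection1/conf/solrconfig.xml" # core configuration
find "$SOLR_HOME" -mindepth 1 | sort
```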
The last thing about the configuration is to update the /etc/default/jetty file and add -Dsolr.solr.home=/usr/share/solr to the JAVA_OPTIONS variable in that file. The whole line with that variable could look like the following:
JAVA_OPTIONS="-Xmx256m -Djava.awt.headless=true -Dsolr.solr.home=/usr/share/solr/"
If you didn't install Jetty with apt-get or similar software, you may not have the /etc/default/jetty file. In that case, add the -Dsolr.solr.home=/usr/share/solr parameter to the Jetty startup command.
We can now run Jetty to see if everything is OK. To start a Jetty that was installed, for example, using the apt-get command, use the following command:
/etc/init.d/jetty start
You can also run Jetty with the java command. Run the following in the Jetty installation directory:
java -Dsolr.solr.home=/usr/share/solr -jar start.jar
If there were no exceptions during startup, we have a running Jetty with Solr deployed and configured. To check if Solr is running, point your web browser to the following address: http://localhost:8983/solr/.
You should see the Solr front page with cores, or a single core, mentioned. Congratulations! You just successfully installed, configured, and ran the Jetty servlet container with Solr deployed.
For the purpose of this recipe, I assumed that we needed a single-core installation with only the schema.xml and solrconfig.xml configuration files. A multicore installation is very similar; it differs only in terms of the Solr configuration files.
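For instance, a two-core setup would only change the solr.xml file shown earlier; the second core's name and instanceDir below are hypothetical examples, and each core needs its own conf directory with its own schema.xml and solrconfig.xml.

```xml
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
    <!-- hypothetical second core -->
    <core name="collection2" instanceDir="collection2" />
  </cores>
</solr>
```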
The first thing we did was copy the solr.war file and create the temp directory. The WAR file is the actual Solr web application; the temp directory will be used by Jetty to unpack the WAR file.
The solr.xml file we placed in the contexts directory lets Jetty define the context for the Solr web application. As you can see in its contents, we set the context path to /solr, so our Solr application will be available under http://localhost:8983/solr/. We also specified where Jetty should look for the WAR file (the war property), where the web application descriptor file is (the defaultsDescriptor property), and finally where the temporary directory will be located (the tempDirectory property).
The next step is to provide the configuration files for the Solr web application. Those files should be placed in the directory specified by the solr.solr.home system variable. I decided to use the /usr/share/solr directory to ensure that I'll be able to update Jetty without the need to override or delete the Solr configuration files. When copying the Solr configuration files, you should remember to include all the files and the exact directory structure that Solr needs. So, in the directory specified by the solr.solr.home variable, the solr.xml file, the one that describes the cores of your system, should be available.
The solr.xml file is pretty simple: there should be a root element called solr, and inside it a cores tag (with the adminPath attribute set to the address where Solr's cores administration API is available, and the defaultCoreName attribute that says which core is the default one). The cores tag is a parent for the core definitions: each core should have its own core tag with a name attribute specifying the core name and an instanceDir attribute specifying the directory where the core-specific files will be available (such as the conf directory).
If you installed Jetty with the apt-get command or similar, you will need to update the /etc/default/jetty file to include the solr.solr.home variable for Solr to be able to see its configuration directory.
After all those steps we are ready to launch Jetty. If you installed Jetty with apt-get or similar software, you can run it with the first command shown in the example. Otherwise, you can run Jetty with the java command from the Jetty installation directory.
After opening the example address in your web browser, you should see the Solr front page with a single core. Congratulations! You just successfully configured and ran the Jetty servlet container with Solr deployed.
There are a few tasks you can do to counter some problems when running Solr within the Jetty servlet container. Here are the most common ones that I encountered during my work.
Sometimes it's necessary to run Jetty on a port other than the default one. We have two ways to achieve that:
Adding an additional startup parameter, jetty.port. The startup command would then look like the following:
java -Djetty.port=9999 -jar start.jar
Changing the jetty.xml file. To do that you need to change the following line:
<Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
To:
<Set name="port"><SystemProperty name="jetty.port" default="9999"/></Set>
Buffer overflow is a common problem when our queries get too long and too complex, for example, when we use many logical operators or long phrases. When the standard header buffer is not enough, you can resize it to meet your needs. To do that, add the following line to the Jetty connector definition in the jetty.xml file. Of course, the value shown in the example can be changed to whatever you need:
<Set name="headerBufferSize">32768</Set>
After adding the value, the connector definition should look more or less like the following snippet:
<Call name="addConnector">
<Arg>
<New class="org.mortbay.jetty.bio.SocketConnector">
<Set name="port"><SystemProperty name="jetty.port" default="8080"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="lowResourceMaxIdleTime">1500</Set>
<Set name="headerBufferSize">32768</Set>
</New>
</Arg>
</Call>
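Note that the snippet above uses the old org.mortbay connector class from earlier Jetty releases. On Jetty 8 (the version used in this chapter), the connector class and property names differ; a hedged sketch of the equivalent configuration, assuming Jetty 8's SelectChannelConnector and its requestHeaderSize property, might look like this:

```xml
<Call name="addConnector">
  <Arg>
    <New class="org.eclipse.jetty.server.nio.SelectChannelConnector">
      <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
      <!-- enlarged request header buffer, in bytes -->
      <Set name="requestHeaderSize">32768</Set>
    </New>
  </Arg>
</Call>
```

Check the jetty.xml shipped with your Jetty version for the exact connector class it configures.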
Sometimes you need to choose a servlet container other than Jetty. Maybe because your client has other applications running on another servlet container, maybe because you just don't like Jetty. Whatever your requirements are that put Jetty out of the scope of your interest, the first thing that comes to mind is a popular and powerful servlet container – Apache Tomcat. This recipe will give you an idea of how to properly set up and run Solr in the Apache Tomcat environment.
First of all we need an Apache Tomcat servlet container. It can be found at the Apache Tomcat website – http://tomcat.apache.org. I concentrated on the Tomcat Version 7.x because at the time of writing of this book it was mature and stable. The version that I used during the writing of this recipe was Apache Tomcat 7.0.29, which was the newest one at the time.
To run Solr on Apache Tomcat we need to follow these simple steps:
Firstly, you need to install Apache Tomcat. The Tomcat installation is beyond the scope of this book, so we will assume that you have already installed this servlet container in the directory specified by the $TOMCAT_HOME system variable.
The second step is preparing the Apache Tomcat configuration files. To do that, we need to add the following attribute to the connector definition in the server.xml configuration file:
URIEncoding="UTF-8"
The relevant portion of the modified server.xml file should look like the following code snippet:
<Connector port="8080" protocol="HTTP/1.1"
  connectionTimeout="20000"
  redirectPort="8443"
  URIEncoding="UTF-8" />
The third step is to create a proper context file. To do that, create a solr.xml file in the $TOMCAT_HOME/conf/Catalina/localhost directory. The contents of the file should look like the following code:
<Context path="/solr" docBase="/usr/share/tomcat/webapps/solr.war" debug="0" crossContext="true">
  <Environment name="solr/home" type="java.lang.String" value="/usr/share/solr/" override="true"/>
</Context>
The next thing is the Solr deployment. To do that we need the apache-solr-4.0.0.war file, which contains the necessary files and libraries to run Solr; it is to be copied to the Tomcat webapps directory and renamed to solr.war.
The last thing we need to do is add the Solr configuration files. The files that you need to copy are files such as schema.xml, solrconfig.xml, and so on. Those files should be placed in the directory specified by the solr/home variable (in our case /usr/share/solr/). Please don't forget that you need to ensure the proper directory structure. If you are not familiar with the Solr directory structure, please take a look at the example deployment that is provided with the standard Solr package. Remember to preserve the directory structure you'll see there; for example, the /usr/share/solr directory should contain the solr.xml file (and in addition zoo.cfg, in case you want to use SolrCloud) with contents like the following:
<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
All the other configuration files should go to the /usr/share/solr/collection1/conf directory (place the schema.xml and solrconfig.xml files there, along with any additional configuration files your deployment needs). Your cores may have names other than the default collection1, so please be aware of that.
Now we can start the servlet container by running the following command:
bin/catalina.sh start
In the log file you should see a message like this:
Info: Server startup in 3097 ms
To ensure that Solr is running properly, you can run a browser and point it to an address where Solr should be visible, like the following:
http://localhost:8080/solr/
If you see the page with links to administration pages of each of the cores defined, that means that your Solr is up and running.
Let's start from the second step, as the installation part is beyond the scope of this book. As you probably know, Solr uses UTF-8 encoding. That means we need to ensure that Apache Tomcat will be informed that all requests and responses should use that encoding. To do that, we modified the server.xml file in the way shown in the example.
The Catalina context file (called solr.xml in our example) says that our Solr application will be available under the /solr context (the path attribute). We also specified the WAR file location (the docBase attribute), said that we are not using debug mode (the debug attribute), and allowed Solr to access other contexts (the crossContext attribute). The last thing is to specify the directory where Solr should look for its configuration files. We do that by adding the solr/home environment entry with the value attribute set to the path of the directory where we put the configuration files.
The solr.xml file is pretty simple: there should be a root element called solr, and inside it a cores tag (with the adminPath attribute set to the address where the Solr cores administration API is available, and the defaultCoreName attribute describing which core is the default one). The cores tag is a parent for the core definitions: each core should have its own core tag with a name attribute specifying the core name and an instanceDir attribute specifying the directory where the core-specific files will be available (such as the conf directory).
The shell command shown starts Apache Tomcat. There are some other options of the catalina.sh (or catalina.bat) script; they are as follows:
stop: This stops Apache Tomcat
restart: This restarts Apache Tomcat
debug: This starts Apache Tomcat in debug mode
run: This runs Apache Tomcat in the current window, so you can see the output on the console from which you ran it
After running the example address in the web browser, you should see a Solr front page with a core (or cores if you have a multicore deployment). Congratulations! You just successfully configured and ran the Apache Tomcat servlet container with Solr deployed.
There are some other tasks that are common problems when running Solr on Apache Tomcat.
Sometimes it is necessary to run Apache Tomcat on a port other than 8080, which is the default one. To do that, you need to modify the port variable of the connector definition in the server.xml file located in the $TOMCAT_HOME/conf directory. If you would like your Tomcat to run on port 9999, the definition should look like the following code snippet:
<Connector port="9999" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />
For comparison, the original definition looks like the following snippet:
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />
You may know that in order to run SolrCloud, the distributed Solr installation, you need to have Apache ZooKeeper installed. ZooKeeper is a centralized service for maintaining configuration information, naming, and providing distributed synchronization. SolrCloud uses ZooKeeper to synchronize configuration and cluster states (such as elected shard leaders), and that's why it is crucial to have a highly available and fault-tolerant ZooKeeper installation. If you have only a single ZooKeeper instance and it fails, your SolrCloud cluster will stop working properly too. So, this recipe will show you how to install ZooKeeper so that it's not a single point of failure in your cluster configuration.
The installation instructions in this recipe cover ZooKeeper version 3.4.3, but they should be usable for any minor release of Apache ZooKeeper. To download ZooKeeper, please go to http://zookeeper.apache.org/releases.html. This recipe shows how to install ZooKeeper in a Linux-based environment. You also need Java installed.
Let's assume that we decided to install ZooKeeper in the /usr/share/zookeeper directory of our servers and that we want to have three servers (with IP addresses 192.168.1.1, 192.168.1.2, and 192.168.1.3) hosting the distributed ZooKeeper installation.
After downloading the ZooKeeper archive, we create the necessary directory:
sudo mkdir /usr/share/zookeeper
Then we unpack the downloaded archive to the newly created directory. We do that on all three servers.
Next we need to change our ZooKeeper configuration file and specify the servers that will form the ZooKeeper quorum, so we edit the /usr/share/zookeeper/conf/zoo.cfg file and add the following entries:
clientPort=2181
dataDir=/usr/share/zookeeper/data
tickTime=2000
initLimit=10
syncLimit=5
server.1=192.168.1.1:2888:3888
server.2=192.168.1.2:2888:3888
server.3=192.168.1.3:2888:3888
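One more detail before starting the servers: each node in the quorum needs a myid file in its dataDir whose content is the number from that node's server.N entry in zoo.cfg. The sketch below uses a scratch directory in place of /usr/share/zookeeper/data so it can be run safely anywhere:

```shell
# Each ZooKeeper node identifies itself by the number in dataDir/myid.
# On 192.168.1.1 the file contains 1, on 192.168.1.2 it contains 2,
# and on 192.168.1.3 it contains 3. A scratch directory stands in for
# /usr/share/zookeeper/data here.
DATA_DIR=$(mktemp -d)
echo 1 > "$DATA_DIR/myid"
cat "$DATA_DIR/myid"
```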
And now, we can start the ZooKeeper servers with the following command:
/usr/share/zookeeper/bin/zkServer.sh start
If everything went well, you should see something like the following:
JMX enabled by default
Using config: /usr/share/zookeeper/bin/../conf/zoo.cfg
Starting zookeeper ... STARTED
And that's all. Of course you can also add the ZooKeeper service to start automatically during your operating system startup, but that's beyond the scope of the recipe and the book itself.
Let's skip the first part, because creating the directory and unpacking the ZooKeeper archive there is quite simple. What I would like to concentrate on are the configuration values of the ZooKeeper server. The clientPort property specifies the port on which our SolrCloud servers should connect to ZooKeeper, and the dataDir property specifies the directory where ZooKeeper will hold its data. So far, so good, right? Now for the more advanced properties: the tickTime property, specified in milliseconds, is the basic time unit for ZooKeeper. The initLimit property specifies how many ticks the initial synchronization phase can take; with tickTime set to 2000 and initLimit set to 10, this phase may take up to 20 seconds. Finally, the syncLimit property specifies how many ticks can pass between sending a request and receiving an acknowledgement.
There are also three additional properties present: server.1, server.2, and server.3. They define the addresses of the ZooKeeper instances that will form the quorum. Each entry consists of three values separated by colon characters: the first part is the IP address of the ZooKeeper server, and the second and third parts are the ports used by the ZooKeeper instances to communicate with each other.
After the release of Apache Solr 4.0, many users will want to leverage the distributed indexing and querying capabilities of SolrCloud. It's not hard to upgrade your current cluster to SolrCloud, but there are some things you need to take care of. With the help of the following recipe, you will be able to upgrade your cluster easily.
Before continuing further it is advised to read the Installing a standalone ZooKeeper recipe in this chapter. It shows how to set up a ZooKeeper cluster in order to be ready for production use.
In order to use your old index structure with SolrCloud, you will need to add the following field to your field definitions (add the following fragment to the fields section of your schema.xml file):
<field name="_version_" type="long" indexed="true" stored="true" multiValued="false"/>
Now let's switch to the solrconfig.xml file, starting with the replication handlers. First, you need to ensure that you have the replication handler set up. Remember that you shouldn't add master- or slave-specific configuration to it. So, the replication handler's configuration should look like the following code:
<requestHandler name="/replication" class="solr.ReplicationHandler" />
In addition to that, you will need to have the administration panel handlers present, so the following configuration entry should be present in your solrconfig.xml file:
<requestHandler name="/admin/" class="solr.admin.AdminHandlers" />
The last request handler that should be present is the real-time get handler, which should be defined as follows (this should also be added to the solrconfig.xml file):
<requestHandler name="/get" class="solr.RealTimeGetHandler">
  <lst name="defaults">
    <str name="omitHeader">true</str>
  </lst>
</requestHandler>
The next thing SolrCloud needs in order to operate properly is the transaction log configuration. The following fragment should be added to the solrconfig.xml file:
<updateLog>
  <str name="dir">${solr.data.dir:}</str>
</updateLog>
The last thing is the solr.xml file. It should be pointing to the default cores administration address: the cores tag should have the adminPath attribute set to the /admin/cores value. An example solr.xml file could look like the following code:
<solr persistent="true">
  <cores adminPath="/admin/cores" defaultCoreName="collection1"
      host="localhost" hostPort="8983" zkClientTimeout="15000">
    <core name="collection1" instanceDir="collection1" />
  </cores>
</solr>
And that's all; your Solr instance configuration files are now ready to be used with SolrCloud.
So now let's see why all those changes are needed in order to use our old configuration files with SolrCloud.
The _version_ field is used by Solr to enable document versioning and optimistic locking, which ensures that you won't have the newest version of your document overwritten by mistake. Because of that, SolrCloud requires the _version_ field to be present in your index structure. Adding that field is simple: you just need to place another field definition that is stored and indexed, and based on the long type. That's all.
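As a sketch of the optimistic locking this enables (the document ID, field values, and version number below are hypothetical), an update can carry the _version_ value the client last read; Solr then rejects the update if the stored version no longer matches:

```xml
<add>
  <doc>
    <field name="id">doc1</field>
    <field name="title">Updated title</field>
    <!-- the update succeeds only if the stored _version_ still equals this value -->
    <field name="_version_">1234567890123456789</field>
  </doc>
</add>
```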
As for the replication handler, you should remember not to add slave or master specific configuration, only the simple request handler definition, as shown in the previous example. The same applies to the administration panel handlers: they need to be available under the default URL address.
The real-time get handler is responsible for getting updated documents right away, even before a commit or softCommit is executed. This handler allows Solr (and also you) to retrieve the latest version of a document without re-opening the searcher, and thus even if the document is not yet visible during usual search operations. The configuration is very similar to a usual request handler configuration: you need to add a new handler with the name property set to /get and the class property set to solr.RealTimeGetHandler. In addition to that, we want the handler to omit response headers (the omitHeader property set to true).
One of the last things needed by SolrCloud is the transaction log, which enables real-time get operations to be functional. The transaction log keeps track of all the uncommitted changes and enables the real-time get handler to retrieve them. In order to turn on transaction log usage, one should add the updateLog tag to the solrconfig.xml file and specify the directory where the transaction log directory should be created (by adding the dir property, as shown in the example). In the configuration previously shown, we tell Solr that we want to use the Solr data directory as the place to store the transaction log directory.
Finally, Solr needs you to keep the default address for the core administrative interface, so you should remember to have the adminPath attribute (in the solr.xml file) set to the value shown in the example. This is needed for Solr to be able to manipulate cores.
One of the most crucial properties of Apache Lucene, and thus Solr, is the Lucene directory implementation. The directory interface provides an abstraction layer for Lucene on all the I/O operations. Although choosing the right directory implementation seems simple, it can affect the performance of your Solr setup in a drastic way. This recipe will show you how to choose the right directory implementation.
In order to use the desired directory, all you need to do is choose the right directory factory implementation and inform Solr about it. Let's assume that you would like to use NRTCachingDirectory as your directory implementation. In order to do that, you need to place (or replace, if it is already present) the following fragment in your solrconfig.xml file:
<directoryFactory name="DirectoryFactory" class="solr.NRTCachingDirectoryFactory" />
And that's all. The setup is quite simple, but what directory factories are available to use? When this book was written, the following directory factories were available:
solr.StandardDirectoryFactory
solr.SimpleFSDirectoryFactory
solr.NIOFSDirectoryFactory
solr.MMapDirectoryFactory
solr.NRTCachingDirectoryFactory
solr.RAMDirectoryFactory
So now let's see what each of those factories provide.
Before we get into the details of each of the presented directory factories, I would like to comment on the directory factory configuration parameters. All you need to remember is that the name attribute of the directoryFactory tag should be set to DirectoryFactory, and the class attribute should be set to the directory factory implementation of your choice.
If you want Solr to make the decision for you, you should use solr.StandardDirectoryFactory. This is a filesystem-based directory factory that tries to choose the best implementation based on your operating system and the Java virtual machine used. If you are implementing a small application that won't use many threads, you can use solr.SimpleFSDirectoryFactory, which stores the index files on your local filesystem but doesn't scale well with a high number of threads. solr.NIOFSDirectoryFactory scales well with many threads, but because of a JVM bug it doesn't work well (it's much slower) on Microsoft Windows platforms, so you should keep that in mind.
solr.MMapDirectoryFactory was the default directory factory for Solr on 64-bit Linux systems from Solr 3.1 through 4.0. This directory implementation uses virtual memory and a kernel feature called mmap to access index files stored on disk. This allows Lucene (and thus Solr) to directly access the I/O cache, which is desirable, and you should stick to this directory if near real-time searching is not needed.
If you need near real-time indexing and searching, you should use solr.NRTCachingDirectoryFactory. It is designed to store some parts of the index in memory (small chunks) and thus speed up some near real-time operations greatly.
The last directory factory, solr.RAMDirectoryFactory, is the only one that is not persistent. The whole index is stored in RAM, so you'll lose your index after a restart or server crash. Also, remember that replication won't work when using solr.RAMDirectoryFactory. One could ask, why should I use that factory? Imagine a volatile index for autocomplete functionality, or for unit tests of your queries' relevancy: anything for which you don't need persistent and replicated data. However, please remember that this directory is not designed to hold large amounts of data.
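Following the same pattern shown earlier, switching such a volatile core to the RAM-based directory is a one-line change in solrconfig.xml:

```xml
<directoryFactory name="DirectoryFactory" class="solr.RAMDirectoryFactory" />
```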
If you are used to the way the spellchecker worked in previous Solr versions, you may remember that it required its own index to give you spelling corrections. That approach had some disadvantages, such as the need to rebuild the index and to replicate it between master and slave servers. With Solr version 4.0, a new spellchecker implementation was introduced: solr.DirectSolrSpellChecker. It allows you to use your main index to provide spelling suggestions and doesn't need to be rebuilt after every commit. So now, let's see how to use this new spellchecker implementation in Solr.
First of all, let's assume we have a field in the index called title, in which we hold the titles of our documents. What's more, we don't want the spellchecker to have its own index, and we would like to use the title field to provide spelling suggestions. In addition to that, we would like to decide when we want spelling suggestions. In order to do all that, we need to do two things:
First, we need to edit our solrconfig.xml file and add the spellchecking component, whose definition may look like the following code:
<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <str name="queryAnalyzerFieldType">title</str>
  <lst name="spellchecker">
    <str name="name">direct</str>
    <str name="field">title</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <str name="distanceMeasure">internal</str>
    <float name="accuracy">0.8</float>
    <int name="maxEdits">1</int>
    <int name="minPrefix">1</int>
    <int name="maxInspections">5</int>
    <int name="minQueryLength">3</int>
    <float name="maxQueryFrequency">0.01</float>
  </lst>
</searchComponent>
Now we need to add a proper request handler configuration that will use the previously mentioned search component. To do that, we need to add the following section to the solrconfig.xml file:
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">title</str>
    <str name="spellcheck.dictionary">direct</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
And that's all. In order to get spelling suggestions, we need to run the following query:
/spell?q=disa
In response we will get something like the following code:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">5</int>
  </lst>
  <result name="response" numFound="0" start="0">
  </result>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="disa">
        <int name="numFound">1</int>
        <int name="startOffset">0</int>
        <int name="endOffset">4</int>
        <int name="origFreq">0</int>
        <arr name="suggestion">
          <lst>
            <str name="word">data</str>
            <int name="freq">1</int>
          </lst>
        </arr>
      </lst>
      <bool name="correctlySpelled">false</bool>
      <lst name="collation">
        <str name="collationQuery">data</str>
        <int name="hits">1</int>
        <lst name="misspellingsAndCorrections">
          <str name="disa">data</str>
        </lst>
      </lst>
    </lst>
  </lst>
</response>
If you check your data folder, you will see that there is no separate directory holding a spellchecker index. So, now let's see how that works.
Now let's get into some specifics about how the previous configuration works, starting from the search component configuration. The queryAnalyzerFieldType
property tells Solr which field configuration should be used to analyze the query passed to the spellchecker. The name
property sets the name of the spellchecker which will be used in the handler configuration later. The field
property specifies which field should be used as the source for the data used to build spelling suggestions. As you probably figured out, the classname
property specifies the implementation class, which in our case is solr.DirectSolrSpellChecker
, enabling us to omit having a separate spellchecker index. The next parameters visible in the configuration specify how the Solr spellchecker should behave and that is beyond the scope of this recipe (however, if you would like to read more about them, please go to the following URL address: http://wiki.apache.org/solr/SpellCheckComponent).
The last thing is the request handler configuration. Let's concentrate on all the properties that start with the spellcheck
prefix. First we have spellcheck.dictionary
, which in our case specifies the name of the spellchecking component we want to use (please note that the value of the property matches the value of the name
property in the search component configuration). We tell Solr that we want the spellchecking results to be present (the spellcheck
property with the value set to on
), and we also tell Solr that we want to see the extended results format (spellcheck.extendedResults
set to true
). In addition to the mentioned configuration properties, we also said that we want to have a maximum of five suggestions (the spellcheck.count
property), and we want to see the collation and its extended results (spellcheck.collate
and spellcheck.collateExtendedResults
both set to true
).
Let's see one more thing – the ability to have more than one spellchecker defined in a request handler.
If you would like to have more than one spellchecker handling your spelling suggestions you can configure your handler to use multiple search components. For example, if you would like to use search components (spellchecking ones) named word
and better
(you have to have them configured), you could add multiple spellcheck.dictionary
parameters to your request handler. This is how your request handler configuration would look:
<requestHandler name="/spell" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="df">title</str>
    <str name="spellcheck.dictionary">direct</str>
    <str name="spellcheck.dictionary">word</str>
    <str name="spellcheck.dictionary">better</str>
    <str name="spellcheck">on</str>
    <str name="spellcheck.extendedResults">true</str>
    <str name="spellcheck.count">5</str>
    <str name="spellcheck.collate">true</str>
    <str name="spellcheck.collateExtendedResults">true</str>
  </lst>
  <arr name="last-components">
    <str>spellcheck</str>
  </arr>
</requestHandler>
As you may already know, caches play a major role in a Solr deployment. And I'm not talking about some external cache – I'm talking about the three Solr caches:
The filter cache
The query result cache
The document cache
There is a fourth cache – Lucene's internal cache – which is a field cache, but you can't control its behavior. It is managed by Lucene and created when it is first used by the Searcher object.
With the help of these caches we can tune the behavior of the Solr searcher instance. In this task we will focus on how to configure your Solr caches to suit most needs. There is one thing to remember – Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
Before you start tuning Solr caches you should get some information about your Solr instance. That information is as follows:
Number of documents in your index
Number of queries per second made to that index
Number of unique filter (the fq parameter) values in your queries
Maximum number of documents returned in a single query
Number of different queries and different sorts
For the purpose of this task I assumed the following numbers:
Number of documents in the index: 1,000,000
Number of queries per second: 100
Number of unique filters: 200
Maximum number of documents returned in a single query: 100
Number of different queries and different sorts: 500
Let's open the solrconfig.xml
file and tune our caches. All the changes should be made in the query section of the file (the section between <query>
and </query>
XML tags).
First goes the filter cache:
<filterCache class="solr.FastLRUCache" size="200" initialSize="200" autowarmCount="100"/>
Second goes the query result cache:
<queryResultCache class="solr.FastLRUCache" size="500" initialSize="500" autowarmCount="250"/>
Third we have the document cache:
<documentCache class="solr.FastLRUCache" size="11000" initialSize="11000" />
Of course the above configuration is based on the example values.
Furthermore, let's set our result window to match our needs – during paging we sometimes need 20–30 more results than the initial query returned. So we change the appropriate value in the solrconfig.xml file to something like this:
<queryResultWindowSize>200</queryResultWindowSize>
And that's all!
Let's start with a little bit of explanation. First of all, we use the solr.FastLRUCache implementation instead of solr.LRUCache. FastLRUCache tends to be faster when Solr puts less into the caches and gets more out of them. This is the opposite of LRUCache, which tends to be more efficient when there are more put than get operations. That's why we use it.
This could be the first time you have seen a cache configuration, so I'll explain what the cache configuration parameters mean:
class: You probably figured that out by now. Yes, this is the class implementing the cache.
size: This is the maximum size the cache can have.
initialSize: This is the initial size the cache will have.
autowarmCount: This is the number of cache entries that will be copied to the new instance of the same cache when Solr invalidates the Searcher object – for example, during a commit operation.
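Putting those parameters together, an annotated version of the filter cache entry (using the example values from this recipe) might look like the following sketch:

```xml
<!-- class: the cache implementation to use
     size: the maximum number of entries the cache can hold
     initialSize: the number of entries allocated up front
     autowarmCount: entries copied to the new cache instance
                    when the Searcher object is invalidated -->
<filterCache class="solr.FastLRUCache"
             size="200"
             initialSize="200"
             autowarmCount="100"/>
```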
As you can see, I tend to use the same number of entries for size and initialSize, and half of that value for autowarmCount. The size and initialSize properties can be set to the same value in order to avoid resizing of the underlying Java object, which consumes additional processing time.
There is one thing you should be aware of. Some of the Solr caches (documentCache
actually) operate on internal identifiers called docid
. Those caches cannot be automatically warmed, because docid values change after every commit operation, and thus copying them is useless.
Please keep in mind that cache sizes are usually good only for the moment you set them. During the life cycle of your application your data may change, your queries may change, and your users' behavior may, and probably will, change. That's why you should keep track of cache usage with the Solr administration pages, JMX, or specialized software such as Scalable Performance Monitoring from Sematext (see more at http://sematext.com/spm/index.html), see how the utilization of each cache changes over time, and make the proper changes to the configuration.
There are a few additional things that you should know when configuring your caches.
If you use the term enumeration faceting method (parameter facet.method=enum
) Solr will use the filter cache to check each term. Remember that if you use this method, your filter cache size should be at least the number of unique facet values in all your faceted fields. This is crucial and you may experience performance loss if this cache is not configured the right way.
When your Solr instance has a low cache hit ratio you should consider not using caches at all (to see the hit ratio you can use the administration pages of Solr). Cache insertion is not free – it costs CPU time and resources. So if you see that you have a very low cache hit ratio, you should consider turning your caches off – it may speed up your Solr instance. Before you turn off the caches please ensure that you have the right cache setup – a small hit ratio can be a result of bad cache configuration.
When your Solr instance uses put operations more than get operations you should consider using the solr.LRUCache
implementation. It's confirmed that this implementation behaves better when there are more insertions into the cache than lookups.
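Switching implementations only requires changing the class attribute – for example, a filter cache using solr.LRUCache (the sizes here are the example values from this recipe, not recommendations):

```xml
<filterCache class="solr.LRUCache" size="200" initialSize="200" autowarmCount="100"/>
```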
This cache is responsible for holding information about the filters and the documents that match the filter. Actually this cache holds an unordered set of document IDs that match the filter. If you don't use the faceting mechanism with a filter cache, you should at least set its size to the number of unique filters that are present in your queries. This way it will be possible for Solr to store all the unique filters with their matching document IDs and this will speed up the queries that use filters.
The query result cache holds the ordered set of internal IDs of documents that match the given query and the sort specified. That's why if you use caches you should add as many filters as you can and keep your query (the q
parameter) as clean as possible. For example, pass only the search box content of your search application to the query parameter. If the same query is run more than once and the cache has enough capacity to hold the entry, the cached entry will be used to return the IDs of the matching documents, so no Lucene query will be made (Solr uses Lucene to index and query the indexed data), saving precious I/O operations for the queries that are not yet cached – this will boost your Solr instance's performance.
The maximum size of this cache that I tend to set is the number of unique queries and their sorts that are handled by my Solr in the time between the Searcher
object's invalidation. This tends to be enough in most cases.
The document cache holds the Lucene documents that were fetched from the index. Basically, this cache holds the stored fields of all the documents that are gathered from the Solr index. The size of this cache should always be greater than the number of concurrent queries multiplied by the maximum results you get from Solr. This cache can't be automatically warmed – that is because every commit is changing the internal IDs of the documents. Remember that the cache can be memory consuming in case you have many stored fields, so there will be times when you just have to live with evictions.
The last thing is the query result window. This parameter tells Solr how many documents to fetch from the index in a single Lucene query – a kind of superset of the documents actually requested. In our example, we tell Solr that we want a maximum of one hundred documents as a result of a single query, while our query result window tells Solr to always gather two hundred documents. Then, when we need some more documents following the first hundred, they will be fetched from the cache, saving our resources. The size of the query result window mostly depends on the application and how it uses Solr. If you tend to do a lot of paging, you should consider using a higher query result window value.
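The window setting sits in the same query section as the caches. As a sketch, it can be paired with queryResultMaxDocsCached, a related solrconfig.xml setting that caps how many documents a single query result cache entry may hold – treat the values below as assumptions to tune:

```xml
<!-- Gather up to 200 document IDs per Lucene query so that paging
     through the following pages can be served from the cache. -->
<queryResultWindowSize>200</queryResultWindowSize>
<!-- Cap on the number of documents cached per query result entry
     (an assumed value; align it with your deepest expected page). -->
<queryResultMaxDocsCached>200</queryResultMaxDocsCached>
```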
Tip
You should remember that the size of caches shown in this task is not final, and you should adapt them to your application needs. The values and the method of their calculation should only be taken as a starting point to further observation and optimization of the process. Also, please remember to monitor your Solr instance memory usage as using caches will affect the memory that is used by the JVM.
There is another way to warm your caches if you know the most common queries that are sent to your Solr instance – auto-warming queries. Please refer to the Improving Solr performance right after a startup or commit operation recipe in Chapter 6, Improving Solr Performance. For information on how to cache whole pages of results please refer to the Caching whole result pages recipe in Chapter 6, Improving Solr Performance.
There are many ways to index web pages. We could download them, parse them, and index them with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem – how to fetch them? We could possibly create our own software to do that, but that takes time and resources. That's why this recipe will cover how to fetch and index web pages using Apache Nutch.
For the purpose of this task we will be using Version 1.5.1 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org.
Let's assume that the website we want to fetch and index is http://lucene.apache.org.
First of all we need to install Apache Nutch. To do that we just need to extract the downloaded archive to the directory of our choice; for example, I installed it in the directory
/usr/share/nutch
. Of course this is a single server installation and it doesn't include the Hadoop filesystem, but for the purpose of the recipe it will be enough. This directory will be referred to as $NUTCH_HOME.
Then we'll open the $NUTCH_HOME/conf/nutch-default.xml file and set the value of http.agent.name to the desired name of your crawler (we've taken SolrCookbookCrawler as the name). It should look like the following code:
<property>
  <name>http.agent.name</name>
  <value>SolrCookbookCrawler</value>
  <description>HTTP 'User-Agent' request header.</description>
</property>
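The nutch-default.xml file also defines companion agent properties such as http.agent.description and http.agent.url; filling them in is optional but polite, as they identify your crawler to the sites you fetch. A sketch with illustrative values (the values below are assumptions, not defaults):

```xml
<property>
  <name>http.agent.description</name>
  <value>Solr Cookbook example crawler</value>
</property>
<property>
  <name>http.agent.url</name>
  <value>http://example.com/crawler-info</value>
</property>
```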
Now let's create empty directories called crawl and urls in the $NUTCH_HOME directory. After that we need to create the seed.txt file inside the created urls directory with the following contents:
http://lucene.apache.org
Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace the +. at the bottom of the file with +^http://([a-z0-9]*\.)*lucene.apache.org/, so the appropriate entry should look like the following code:
+^http://([a-z0-9]*\.)*lucene.apache.org/
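For context, the rest of crawl-urlfilter.txt is a list of regular-expression rules evaluated top to bottom, where a leading - rejects a URL and a leading + accepts it. A sketch of how such a file might look (the exclusion patterns below are illustrative; check the defaults shipped with your Nutch distribution):

```text
# skip URLs that point to typical binary content (illustrative patterns)
-\.(gif|jpg|png|ico|css|zip|gz|pdf)$
# skip URLs containing characters often used in session IDs
-[?*!@=]
# accept everything from the lucene.apache.org domain
+^http://([a-z0-9]*\.)*lucene.apache.org/
```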
One last thing before fetching the data is Solr configuration.
We start by copying the index structure definition file (called schema-solr4.xml) from the $NUTCH_HOME/conf/ directory to your Solr installation configuration directory (which in my case was /usr/share/solr/collection1/conf/). We also rename the copied file to schema.xml.
We also create an empty stopwords_en.txt file, or use the one provided with Solr if we want stop word removal.
Now we need to make two corrections to the schema.xml
file we've copied:
The first one is the correction of the version attribute in the schema tag. We need to change its value from 1.5.1 to 1.5, so the final schema tag would look like this:
<schema name="nutch" version="1.5">
Then we change the boost field type (in the same schema.xml file) from string to float, so the boost field definition would look like this:
<field name="boost" type="float" stored="true" indexed="false"/>
Now we can start crawling and indexing by running the following command from the $NUTCH_HOME
directory:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50
Depending on your Internet connection and your machine configuration you should finally see a message similar to the following one:
crawl finished: crawl-20120830171434
This means that the crawl is completed and the data was indexed to Solr.
After installing Nutch and Solr, the first thing we did was set our crawler name. Nutch does not allow empty names, so we must choose one. The nutch-default.xml file defines more properties than the mentioned one, but at this point that is the only one we need to know about.
In the next step, we created two directories; one (crawl
) which will hold the crawl data and the second one (urls
) to store the addresses we want to crawl. The seed.txt file we created contains the addresses we want to crawl, one address per line.
The crawl-urlfilter.txt
file contains information about the filters that will be used to check the URLs that Nutch will crawl. In the example, we told Nutch to accept every URL that begins with http://lucene.apache.org
.
The schema.xml
file we copied from the Nutch configuration directory is prepared to be used when Solr is used for indexing. But the one for Solr 4.0 is a bit buggy, at least in the Nutch 1.5.1 distribution, and that's why we needed to make the changes previously mentioned.
We finally came to the point where we ran the Nutch command. We specified that we wanted to store the crawled data in the crawl
directory (first parameter), and the addresses to crawl data from are in the urls
directory (second parameter). The -solr switch lets you specify the address of the Solr server that will be responsible for indexing the crawled data, and is mandatory if you want the data indexed with Solr. We decided to index the data to Solr installed on the same server. The -depth parameter specifies how deep to follow the links; in our example, we said we want to follow a maximum of three links away from the main page. The -topN parameter specifies how many documents will be retrieved from each level, which we set to 50.
There is one more thing worth knowing when you start a journey in the land of Apache Nutch.
The crawl
command of the Nutch command-line utility has another option – it can be configured to run crawling with multiple threads. To achieve that you add the following parameter:
-threads N
So if you would like to crawl with 20 threads you should run the crawl command like so:
bin/nutch crawl urls -solr http://localhost:8983/solr/ -depth 3 -topN 50 -threads 20
If you seek more information about Apache Nutch, please refer to http://nutch.apache.org and go to the Wiki section.
Sometimes indexing prepared text files (such as XML, CSV, JSON, and so on) is not enough. There are numerous situations where you need to extract data from binary files. For example, one of my clients wanted to index PDF files – actually their contents. To do that, we either need to parse the data in some external application or set up Solr to use Apache Tika. This task will guide you through the process of setting up Apache Tika with Solr.
In order to set up the extracting request handler, we need to follow these simple steps:
First let's edit our Solr instance
solrconfig.xml
and add the following configuration:
<requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
Next create the extract folder anywhere on your system (I created that folder in the directory where Solr is installed), and place the apache-solr-cell-4.0.0.jar file from the dist directory in it (you can find it in the Solr distribution archive). After that you have to copy all the libraries from the contrib/extraction/lib/ directory to the extract directory you created before.
In addition to that, we need the following entry added to the solrconfig.xml file:
<lib dir="../../extract" regex=".*\.jar" />
And that's actually all that you need to do in terms of configuration.
To simplify the example, I decided to choose the following index structure (place it in the fields
section in your schema.xml
file):
<field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
<field name="text" type="text_general" indexed="true" stored="true"/>
<dynamicField name="attr_*" type="text_general" indexed="true" stored="true" multiValued="true"/>
To test the indexing process, I've created a PDF file book.pdf
using PDFCreator which contained the following text only: This is a Solr cookbook
. To index that file, I've used the following command:
curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"
You should see the following response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">578</int>
  </lst>
</response>
Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files. To add a handler that uses Apache Tika, we need to add a handler based on the solr.extraction.ExtractingRequestHandler
class to our solrconfig.xml
file as shown in the example.
In addition to the handler definition, we need to specify where Solr should look for the additional libraries we placed in the extract
directory that we created. The dir
attribute of the lib
tag should be pointing to the path of the created directory. The regex
attribute is the regular expression telling Solr which files to load.
Let's now discuss the default configuration parameters. The fmap.content
parameter tells Solr which field the extracted content of the parsed document should go to. In our case, the parsed content will go to the field named text. The next parameter, lowernames, set to true, tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix
, is very important. It tells Solr how to handle fields that are not defined in the schema.xml
file. The name of the field returned from Tika will be added to the value of the parameter and sent to Solr. For example, if Tika returned a field named creator
, and we don't have such a field in our index, then Solr would try to index it under a field named attr_creator
which is a dynamic field. The last parameter tells Solr to index Tika XHTML elements into separate fields named after those elements.
Next we have a command that sends a PDF file to Solr. We are sending a file to the /update/extract
handler with two parameters. First we define a unique identifier. It's useful to be able to do that while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
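Additional literal.* parameters work the same way, so document metadata that Tika cannot extract can be supplied at indexing time. For example, assuming a category field existed in the index (it is not part of the schema shown earlier), the call might look like this:

```shell
# Send book.pdf to the extracting handler, setting the identifier and a
# hypothetical 'category' literal field, and commit right away.
curl "http://localhost:8983/solr/update/extract?literal.id=2&literal.category=books&commit=true" \
     -F "myfile=@book.pdf"
```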
To see how to index binary files please refer to the Indexing PDF files and Extracting metadata from binary files recipes in Chapter 2, Indexing Your Data.
Most of the time, the default way of calculating the score of your documents is what you need. But sometimes you need more from Solr than just the standard behavior. Let's assume that you would like to change the default behavior and use a different score calculation algorithm for the description
field of your index. The current version of Solr allows you to do that and this recipe will show you how to leverage this functionality.
Before choosing one of the score calculation algorithms available in Solr, it's good to read a bit about them. The description of all the algorithms is beyond the scope of the recipe and the book, but I would suggest going to the Solr Wiki pages (or look at Javadocs) and read the basic information about available implementations.
For the purpose of the recipe let's assume we have the following index structure (just add the following entries to your schema.xml
file to the fields
section):
<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="name" type="text_general" indexed="true" stored="true"/>
<field name="description" type="text_general_dfr" indexed="true" stored="true" />
The string
and text_general
types are available in the default schema.xml
file provided with the example Solr distribution. But we want DFRSimilarity
to be used to calculate the score for the description
field. In order to do that, we introduce a new type, which is defined as follows (just add the following entries to your schema.xml
file to the types
section):
<fieldType name="text_general_dfr" class="solr.TextField" positionIncrementGap="100">
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.DFRSimilarityFactory">
    <str name="basicModel">P</str>
    <str name="afterEffect">L</str>
    <str name="normalization">H2</str>
    <float name="c">7</float>
  </similarity>
</fieldType>
Also, to use per-field similarity we have to add the following entry to your schema.xml
file:
<similarity class="solr.SchemaSimilarityFactory"/>
And that's all. Now let's have a look and see how that works.
The index structure presented in this recipe is pretty simple as there are only three fields. The one thing we are interested in is that the description
field uses our own custom field type called text_general_dfr
.
The thing we are mostly interested in is the new field type definition called text_general_dfr
. As you can see, apart from the index and query analyzer there is an additional section – similarity
. It is responsible for specifying which similarity implementation to use to calculate the score for a given field. You are probably used to defining field types, filters, and other things in Solr, so you probably know that the class
attribute is responsible for specifying the class implementing the desired similarity implementation which in our case is solr.DFRSimilarityFactory
. Also, if there is a need, you can specify additional parameters that configure the behavior of your chosen similarity class. In the previous example, we've specified four additional parameters: basicModel
, afterEffect
, normalization
, and c
, which all define the DFRSimilarity
behavior.
solr.SchemaSimilarityFactory
is required to be able to specify the similarity for each field.
In addition to per-field similarity definition, you can also configure the global similarity:
Apart from specifying the similarity class on a per-field basis, you can choose a similarity other than the default one globally. For example, if you would like to use BM25Similarity
as the default one, you should add the following entry to your schema.xml
file:
<similarity class="solr.BM25SimilarityFactory"/>
As well as with the per-field similarity, you need to provide the name of the factory class that is responsible for creating the appropriate similarity class.