In this chapter, we will cover:
Running Solr on Jetty
Running Solr on Apache Tomcat
Using the Suggester component
Handling multiple languages in a single index
Indexing fields in a dynamic way
Making multilingual data searchable with multicore deployment
Solr cache configuration
How to fetch and index web pages
Getting the most relevant results with early query termination
How to set up the Extracting Request Handler
Setting up an example Solr instance is not a hard task, at least when setting up the simplest configuration. The simplest way is to run the example provided with the Solr distribution, which shows how to use the embedded Jetty servlet container.
So far so good. We have a simple configuration, a simple index structure described by the schema.xml file, and we can run indexing.
In this chapter, we will go a little further. You'll see how to configure and use the more advanced Solr modules. You'll see how to run Solr in different containers and how to prepare your configuration to meet different requirements. Finally, you will learn how to configure Solr cache to meet your needs and how to pre-sort your Solr indexes to be able to use early query termination techniques efficiently.
If you don't have any experience with Apache Solr, please refer to the Apache Solr tutorial that can be found at http://lucene.apache.org/solr/tutorial.html before reading this book.
The simplest way to run Apache Solr on a Jetty servlet container is to run the provided example configuration based on embedded Jetty, but that's not what we are after here. In this recipe, I would like to show you how to configure and run Solr on a standalone Jetty container.
First of all, you need to download the Jetty servlet container for your platform. You can get your package from an automatic installer (like apt-get), or you can download it yourself from http://jetty.codehaus.org/jetty/. Of course, you also need solr.war and the other configuration files that come with Solr (you can get them from the example distribution that comes with Solr).
There are a few common mistakes that people make when setting up Jetty with Solr, but if you follow these instructions, the configuration process will be simple, fast, and will work flawlessly.
The first thing is to install the Jetty servlet container. For now, let's assume that you have Jetty installed.
Now we need to copy the jetty.xml
and webdefault.xml
files from the example/etc
directory of the Solr distribution to the configuration directory of Jetty. In my Debian Linux distribution, it's /etc/jetty
. After that, we have our Jetty installation configured.
The third step is to deploy the Solr web application by simply copying the solr.war
file to the webapps
directory of Jetty.
The next step is to copy the Solr configuration files to the appropriate directory. I'm talking about files like schema.xml
, solrconfig.xml
, and so on. Those files should be in the directory specified by the jetty.home
system variable (in my case, this was the /usr/share/jetty
directory). Please remember to preserve the directory structure you see in the example.
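To sketch the expected structure (this is only an illustration; I assume jetty.home points to /usr/share/jetty and a single-core setup mirroring the example distribution, so your paths may differ), the layout could look as follows:

```
/usr/share/jetty/
  solr/
    conf/
      schema.xml
      solrconfig.xml
```

The important part is that the conf directory with the configuration files keeps the same relative position it has in the example distribution.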
We can now run Jetty to see if everything is OK. To start a Jetty instance that was installed, for example, using the apt-get command, use the following command:
/etc/init.d/jetty start
If there were no exceptions during start up, we have a running Jetty with Solr deployed and configured. To check if Solr is running, try going to the following address with your web browser: http://localhost:8983/solr/
.
You should see the Solr front page with cores, or a single core, mentioned. Congratulations, you have just successfully installed, configured, and run the Jetty servlet container with Solr deployed.
For the purpose of this recipe, I assumed that we needed a single core installation with only schema.xml
and solrconfig.xml
configuration files. Multicore installation is very similar—it differs only in terms of the Solr configuration files.
Sometimes Solr needs access to some additional libraries. If you need those, just create a directory called lib in the same directory that contains the conf folder and put the additional libraries there. It is handy when you are working not only with the standard Solr package, but also want to include your own code as a standalone Java library.
The third step is to provide configuration files for the Solr web application. Those files should be in the directory specified by the system variable jetty.home
or solr.solr.home
. I decided to use the jetty.home
directory, but whenever you need to put Solr configuration files in a different directory than Jetty, just ensure that you set the solr.solr.home
property properly. When copying Solr configuration files, you should remember to include all the files and the exact directory structure that Solr needs. For the record, you need to ensure that all the configuration files are stored in the conf
directory for Solr to recognize them.
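Whenever the configuration files live outside the Jetty home directory, the solr.solr.home property can be passed at start up. The following is only a sketch, and the path is an example:

```
java -Dsolr.solr.home=/usr/share/solr -jar start.jar
```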
After all those steps, we are ready to launch Jetty. The example command has been run from the Jetty installation directory.
After running the example query in your web browser, you should see the Solr front page with a single core. Congratulations, you have just successfully configured and run the Jetty servlet container with Solr deployed.
There are a few things you can do to counter common problems when running Solr within the Jetty servlet container. Here are the most common ones that I encountered during my work.
Sometimes it's necessary to run Jetty on a port other than the default one. We have two ways to achieve that:
Adding an additional start up parameter, jetty.port. The start up command would then look like this:
java -Djetty.port=9999 -jar start.jar
Changing the jetty.xml file—to do that, you need to change the following line:
<Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>
to:
<Set name="port"><SystemProperty name="jetty.port" default="9999"/></Set>
Buffer overflow is a common problem when our queries get too long and too complex—for example, when using many logical operators or long phrases. When the standard HTTP header buffer is not enough, you can resize it to meet your needs. To do that, add the following line to the Jetty connector definition in the jetty.xml file. Of course, the value shown in the example can be changed to the one that you need:
<Set name="headerBufferSize">32768</Set>
After adding the value, the connector definition should look more or less like this:
<Call name="addConnector">
<Arg>
<New class="org.mortbay.jetty.bio.SocketConnector">
<Set name="port"><SystemProperty name="jetty.port" default="8080"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="lowResourceMaxIdleTime">1500</Set>
<Set name="headerBufferSize">32768</Set>
</New>
</Arg>
</Call>
Tip
Downloading the example code
You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.
Sometimes you need to choose a servlet container other than Jetty. Maybe it's because your client has other applications running on another servlet container, or maybe you just don't like Jetty. Whatever the requirements that put Jetty out of the scope of your interest, the first thing that comes to mind is another popular and powerful servlet container—Apache Tomcat. This recipe will show you how to properly set up and run Solr in the Apache Tomcat environment.
First of all, we need an Apache Tomcat servlet container. It can be found at the Apache Tomcat website—http://tomcat.apache.org. I concentrated on Tomcat version 6.x because of two things—version 5 is pretty old right now, and version 7 is the opposite—it's too young in my opinion. That is why I decided to show you how to deploy Solr on Tomcat version 6.0.29, which was the newest one while writing this book.
To run Solr on Apache Tomcat, we need to perform the following six simple steps:
Firstly, you need to install Apache Tomcat. The Tomcat installation is beyond the scope of this book, so we will assume that you have already installed this servlet container in the directory specified by the
$TOMCAT_HOME
system variable. The next step is preparing the Apache Tomcat configuration files. To do that, we need to add the following attribute to the connector definition in the server.xml configuration file:
URIEncoding="UTF-8"
The modified portion of the server.xml file should look like this:
<Connector port="8080" protocol="HTTP/1.1"
  connectionTimeout="20000"
  redirectPort="8443"
  URIEncoding="UTF-8" />
Create a proper context file. To do that, create a solr.xml file in the $TOMCAT_HOME/conf/Catalina/localhost directory. The contents of the file should look like this:
<Context path="/solr">
  <Environment name="solr/home" type="java.lang.String" value="/home/solr/configuration/" override="true"/>
</Context>
The next thing is the Solr deployment. To do that, we need a solr.war file, which contains the necessary files and libraries to run Solr, to be copied to the Tomcat webapps directory. If you need some additional libraries for Solr to see, you should add them to the $TOMCAT_HOME/lib directory.
The last thing we need to do is add the Solr configuration files. The files that you need to copy are files like schema.xml, solrconfig.xml, and so on. Those files should be placed in the directory specified by the solr/home variable (in our case, /home/solr/configuration/). Please don't forget that you need to ensure the proper directory structure. If you are not familiar with the Solr directory structure, please take a look at the example deployment that is provided with the standard Solr package.
Now we can start the servlet container by running the following command:
bin/catalina.sh start
In the log file, you should see a message like this:
INFO: Server startup in 3097 ms
To ensure that Solr is running properly, you can run a browser, and point it to an address where Solr should be visible, like the following: http://localhost:8080/solr/
.
If you see the page with links to administration pages of each of the cores defined, that means that your Solr is up and running.
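You could also check this from the command line. The following is just a sketch, assuming the default port 8080 and the standard request handler from the example configuration:

```
curl "http://localhost:8080/solr/select?q=*:*"
```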
Let's start from the second step, as the installation part is beyond the scope of this book. As you probably know, Solr uses UTF-8 encoding. That means that we need to ensure that Apache Tomcat is informed that all requests and responses should use that encoding. To do that, we modify the server.xml file in the way shown in the example.
The Catalina context file (called solr.xml
in our example) says that our Solr application will be available under the /solr
context (the path
attribute), the war file will be placed in the /home/tomcat/webapps/
directory. solr/home
is also defined, and that is where we need to put our Solr configuration files. The shell command that is shown starts Apache Tomcat. I should also mention some other options of the catalina.sh (or catalina.bat) script:
stop—stops Apache Tomcat
restart—restarts Apache Tomcat
debug—starts Apache Tomcat in debug mode
After running the example address in a web browser, you should see a Solr front page with a core (or cores if you have a multicore deployment). Congratulations, you have just successfully configured and run the Apache Tomcat servlet container with Solr deployed.
Here are solutions to some common problems encountered when running Solr on Apache Tomcat.
Sometimes it is necessary to run Apache Tomcat on a port other than the 8080 default one. To do that, you need to modify the port
variable of the connector definition in the server.xml
file located in the $TOMCAT_HOME/conf
directory. If you would like your Tomcat to run on port 9999, this definition should look like this:
<Connector port="9999" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />
While the original definition looks like this:
<Connector port="8080" protocol="HTTP/1.1"
connectionTimeout="20000"
redirectPort="8443"
URIEncoding="UTF-8" />
Nowadays, it's common for web pages to offer search suggestions (or autocomplete, as I tend to call it), just like many "big" search engines such as Google, Microsoft, and others do. Recently, Solr developers came up with a new component called Suggester. It provides flexible ways to add suggestions to your application using Solr. This recipe will guide you through the process of configuring and using this new component.
First we need to add the search component definition in the solrconfig.xml
file. That definition should look like this:
<searchComponent class="solr.SpellCheckComponent" name="suggester">
  <lst name="spellchecker">
    <str name="name">suggester</str>
    <str name="classname">org.apache.solr.spelling.suggest.Suggester</str>
    <str name="lookupImpl">org.apache.solr.spelling.suggest.tst.TSTLookup</str>
    <str name="field">name</str>
    <float name="threshold">0.05</float>
  </lst>
</searchComponent>
Now we can define an appropriate request handler. To do that, we modify the solrconfig.xml
file and add the following lines:
<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggester">
  <lst name="defaults">
    <str name="spellcheck">true</str>
    <str name="spellcheck.dictionary">suggester</str>
    <str name="spellcheck.count">10</str>
  </lst>
  <arr name="components">
    <str>suggester</str>
  </arr>
</requestHandler>
Now if all went well (Solr server started without an error), we can make a Solr query like this:
http://localhost:8983/solr/suggester/?q=a
You should see the response as follows:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">16</int>
  </lst>
  <lst name="spellcheck">
    <lst name="suggestions">
      <lst name="a">
        <int name="numFound">8</int>
        <int name="startOffset">0</int>
        <int name="endOffset">1</int>
        <arr name="suggestions">
          <str>a</str>
          <str>asus</str>
          <str>ati</str>
          <str>ata</str>
          <str>adata</str>
          <str>all</str>
          <str>allinone</str>
          <str>apple</str>
        </arr>
      </lst>
    </lst>
  </lst>
</response>
After reading the aforementioned search component configuration, you may wonder why we use solr.SpellCheckComponent
as our search component implementation class. Actually, the Suggester component relies on spellchecker and reuses most of its classes and logic. That is why Solr developers decided to reuse spellchecker code.
Anyway, back to our configuration. We have some interesting configuration options. We have the component name (name
variable), we have the logic implementation class (classname
variable), the word lookup implementation (lookupImpl
variable), the field on which suggestions will be based (field
variable), and a threshold
parameter which defines the minimum fraction of documents where a term should appear to be visible in the response.
Following the definition, there is a request handler definition present. Of course, it is not mandatory, but useful. You don't need to pass all the required parameters with your query all the time. Instead, you just write those parameters to the solrconfig.xml
file along with request handler definition. We called our example request handler /suggester
and with that name, it'll be exposed in the servlet container. We have three parameters saying that, when using the defined request handler, Solr should always include suggestions (the spellcheck
parameter set to true
), it should use the dictionary named suggester
that is actually our component (spellcheck.dictionary
parameter), and that the maximum numbers of suggestions should be 10 (spellcheck.count
parameter).
As mentioned before, the Suggester component reuses most of the spellchecker logic, so most of the spellchecker parameters can be used to configure the behavior of the Suggester component. Remember that, because you never know when you'll need some of the non-standard parameters.
There are a few things that are good to know about when using the Suggester component. The following are the three most common things that people tend to ask about:
There are some situations when you'll need to get suggestions not from a defined field in an index, but from a file. To do that, you need to have a text dictionary (encoded in UTF-8) that looks as follows:
Suggestion 1
Suggestion 2
Suggestion 3
Second, you need to add another parameter to the Suggester component definition named sourceLocation
that specifies the dictionary location (the file that will be used for the suggestions base):
<str name="sourceLocation">suggest_dictionary.txt</str>
So the whole Suggester component definition would look like this:
<searchComponent class="solr.SpellCheckComponent" name="suggester">
<lst name="spellchecker">
<str name="name">suggester</str>
<str name="classname"> org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl"> org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name=" sourceLocation">suggest_dictionary.txt</str>
</lst>
</searchComponent>
If you have field-based suggestions, you'll probably want to rebuild your suggestion dictionary after every change in your index. Of course, you can do that manually by invoking the appropriate command (the spellcheck.build
parameter with the true
value), but we don't want to do it this way—we want Solr to do it for us. To do that, just add another parameter to your Suggester component configuration. The new parameter should look like this:
<str name="buildOnCommit">true</str>
A common concern is how to remove misspellings and uncommon words from suggestions. It's not only your data—most data has some kind of mistakes or errors—and the process of data cleaning is time consuming and difficult. Solr has an answer for you. We can add another parameter to our Suggester component configuration, and we won't see any of the mistakes and uncommon words in the suggestions. Add the following parameter:
<float name="threshold">0.05</float>
This parameter takes values from 0 to 1. It tells the Suggester component the minimum fraction of the total documents in which a word must appear in order to be included as a suggestion.
If you don't need a sophisticated Suggester component like the one described, you should take a look at the How to implement an autosuggest feature using faceting recipe in Chapter 6, Using Faceting Mechanism.
There are many examples where multilingual applications and multilingual searches are mandatory—for example, libraries having books in multiple languages. This recipe will cover how to set up Solr, so we can make our multilingual data searchable using a single query and a single response. This task doesn't cover how to analyze your data and automatically detect the language used. That is beyond the scope of this book.
First of all, you need to identify what languages your applications will use. For example, my latest application uses two languages—English and German.
After that, we need to know which fields need to be separate. For example—a field with an ISBN number or an identifier field can be shared, because they don't need to be indexed in a language-specific manner, but titles and descriptions should be separate. Let's assume that our example documents consist of four fields:
ID
ISBN
Title
Description
We have two languages—English and German—and we want all documents to be searchable with one query within one index.
First of all, we need to define some language-specific field types. To add the field type for English, we need to add the following lines to the schema.xml
file:
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType>
Next, we need to add the field type for fields containing a German title and description:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="German2"/> </analyzer> </fieldType>
Now we need to define the document's fields using the previously defined field types. To do that, we add the following lines to the schema.xml
file in the field section.
<field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="isbn" type="string" indexed="true" stored="true" required="true" /> <field name="title_en" type="text_en" indexed="true" stored="true" /> <field name="description_en" type="text_en" indexed="true" stored="true" /> <field name="title_de" type="text_de" indexed="true" stored="true" /> <field name="description_de" type="text_de" indexed="true" stored="true" />
And that's all we need to do with the configuration. After indexing, we can start to query our Solr server. The next thing is to query Solr for documents that are in English or German. To do that, we send the following query to Solr:
q=title_en:harry+OR+description_en:harry+OR+title_de:harry+OR+description_de:harry
First of all, we added a text_en
field type to analyze the English title and description. We tell Solr to split the data on whitespace, split on case change, lowercase the resulting tokens, and finally to stem the text with an appropriate algorithm—in this case, one of the English stemming algorithms available in Solr and Lucene. Let's stop for a minute to explain what stemming is. All you need to know, for now, is that stemming is the process of reducing inflected or derived words to their stem, root, or base form. It's good to use a stemming algorithm in full text search engines because we can return the same set of documents for words like 'cat' and 'cats'.
Of course, we don't want to use English stemming algorithms with German words. That's why we've added another field type—text_de
. It differs from the text_en
field in one matter—it uses solr.SnowballPorterFilterFactory
with the attribute specifying German language instead of solr.PorterStemFilterFactory
.
The last thing we do is to add field definitions. We have six fields. Two of those six are id
(unique identifier) and isbn
. The next two are English title and description, and the last two are German title and description.
As you can see in the example query, we can get documents regardless of where the data is. To achieve that, I used the OR
logical operator to get the documents that contain the searched word in any of the specified fields. If we have a match in English, we will get the desired documents; if we have a match in German, we will also get the desired documents.
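If the long list of OR clauses becomes unwieldy, the dismax query parser could be used to search all four fields at once. This is only a sketch of an alternative, not part of the recipe:

```
q=harry&defType=dismax&qf=title_en+description_en+title_de+description_de
```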
If you don't need to get all multilingual results with a single query, you can look at the Making multilingual data searchable with multicore deployment recipe description provided in this chapter. If you want to sort the data we talked about, please refer to Chapter 10, Dealing With Problems and the recipe How to sort non-English languages properly.
Sometimes you don't know the names or the number of fields you will have to store in your index. You can only determine the types that you'll use and nothing else—either because it is a requirement, or because the data comes from different sources, or the data is so complex that no one knows exactly how many and what fields need to be present in the index. There is a cure for that problem—dynamic fields. This recipe will guide you through the definition and usage of dynamic fields.
Within the scope of this book, we just need to do one thing—prepare the schema.xml
file, so that Solr will be able to determine where to put our data. The example schema.xml
file which comes with Solr provides definitions of some dynamic fields. Here are the example dynamic fields:
<dynamicField name="*_i" type="int" indexed="true" stored="true"/> <dynamicField name="*_s" type="string" indexed="true" stored="true"/> <dynamicField name="*_l" type="long" indexed="true" stored="true"/> <dynamicField name="*_t" type="text" indexed="true" stored="true"/> <dynamicField name="*_b" type="boolean" indexed="true" stored="true"/> <dynamicField name="*_f" type="float"indexed="true" stored="true"/> <dynamicField name="*_d" type="double" indexed="true" stored="true"/>
Having those definitions in the schema.xml
file, we can update data without the need for a static field definition. Here is an example document that can be sent to Solr:
<add>
  <doc>
    <field name="name_s">Solr Cookbook</field>
    <field name="name_t">Solr Cookbook</field>
    <field name="price_d">19.99</field>
    <field name="quantity_i">12</field>
    <field name="available_b">true</field>
  </doc>
</add>
When defining dynamic fields, you tell Solr to match (at indexing time) every incoming field name against the patterns you wrote. For example, the pattern *_i will match field names like quantity_i and grade_i, but not gradei.
However, you must know one thing—when several patterns match a field name, Solr will choose the one with the longest matching pattern. What's more, Solr will always choose a static field over a dynamic field.
Let's get back to our definitions. We have a few dynamic fields defined in the schema.xml
file. As you can see, dynamic field definition is just a little bit different from a static field definition. Definitions begin with an appropriate XML tag dynamicField
and have the same attributes as a static field definition. One thing that's uncommon is the name—it's a pattern that will be used to match field names. There is one more thing to remember—you can only use a single wildcard character, *, at the start or the end of a dynamic field name. Any other usage will result in a Solr start up error.
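To illustrate the wildcard rule, here are some hypothetical definitions (they are not part of the example schema):

```xml
<!-- valid: the wildcard is at the end of the name -->
<dynamicField name="attr_*" type="string" indexed="true" stored="true"/>
<!-- valid: the wildcard is at the start of the name -->
<dynamicField name="*_txt" type="text" indexed="true" stored="true"/>
<!-- invalid: a pattern like "attr_*_s", with the wildcard in the middle, would cause a start up error -->
```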
As you see in the example document, we have five fields filled. The field named name_s
will be matched to the pattern *_s
, the field named name_t
will be matched to the pattern *_t
, and the rest will behave similarly.
Of course, this is just a simple example of a document. Documents may consist of hundreds of fields, both static and dynamic. Also, dynamic field definitions may be much more sophisticated than the ones in the preceding example. You should just remember that using dynamic fields is perfectly valid, and you should use them whenever your deployment needs them.
If you need your data not only to be indexed dynamically, but also to be copied from one field to another, please refer to the recipe entitled Copying contents of one field to another in Chapter 3, Analyzing your Text Data.
You don't always need to handle multiple languages in a single index—perhaps because your application is available in multiple languages but shows only one of them at a time, or because of some other requirement. Whatever your reason, this recipe will show you how to handle separable data in a single instance of the Solr server through the use of a multicore deployment.
First of all, you need to create the solr.xml
file and place it in your $SOLR_HOME
directory. Let's assume that our application will handle two languages—English and German. The sample solr.xml
file might look like this:
<?xml version="1.0" encoding="UTF-8" ?> <solr> <cores adminPath="/admin/cores/"> <core name="en" instanceDir="cores/en"> <property name="dataDir" value="cores/en/data" /> </core> <core name="de" instanceDir="cores/de"> <property name="dataDir" value="cores/de/data" /> </core> </cores> </solr>
Let's create the directories mentioned in the solr.xml file. For the purpose of the example, I assumed that $SOLR_HOME points to the /usr/share/solr directory. We need to create the following directories:
$SOLR_HOME/cores
$SOLR_HOME/cores/en
$SOLR_HOME/cores/de
$SOLR_HOME/cores/en/conf
$SOLR_HOME/cores/en/data
$SOLR_HOME/cores/de/conf
$SOLR_HOME/cores/de/data
We will use the sample solrconfig.xml
file provided with the example deployment of multicore Solr version 3.1. Just copy this file to the conf
directory of both cores. For the record, the file should contain:
<?xml version="1.0" encoding="UTF-8" ?> <config> <updateHandler class="solr.DirectUpdateHandler2" /> <requestDispatcher handleSelect="true" > <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" /> </requestDispatcher> <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" /> <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" /> <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" /> <admin> <defaultQuery>solr</defaultQuery> </admin> </config>
Now we should prepare a simple schema.xml
file. To make it simple, let's just add two field types to the example Solr schema.xml
.
To the schema.xml
file that will describe the index containing English documents, let's add the following field type (just add it in the types section of the schema.xml
file):
<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.PorterStemFilterFactory"/> </analyzer> </fieldType>
To the schema.xml
file, describing the index containing the document in German, let's add the following field type:
<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100"> <analyzer> <tokenizer class="solr.WhitespaceTokenizerFactory"/> <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/> <filter class="solr.LowerCaseFilterFactory"/> <filter class="solr.SnowballPorterFilterFactory" language="German2"/> </analyzer> </fieldType>
The field definition for the English schema.xml
should look like this:
<field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="isbn" type="string" indexed="true" stored="true" required="true" /> <field name="title" type="text_en" indexed="true" stored="true" /> <field name="description" type="text_en" indexed="true" stored="true" />
The field definition for the German schema.xml
should look like this:
<field name="id" type="string" indexed="true" stored="true" required="true" /> <field name="isbn" type="string" indexed="true" stored="true" required="true" /> <field name="title" type="text_de" indexed="true" stored="true" /> <field name="description" type="text_de" indexed="true" stored="true" />
Now you should copy the files you've just created to the appropriate directories:
German schema.xml file to $SOLR_HOME/cores/de/conf
English schema.xml file to $SOLR_HOME/cores/en/conf
That's all in terms of the configuration. You can now start your Solr instance as you always do.
Now all the index update requests should be made to the following addresses:
English documents should go to
http://localhost:8983/solr/en/update
German documents should go to
http://localhost:8983/solr/de/update
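For example, sending a single English document could look like this (the document values are made up for illustration):

```
curl 'http://localhost:8983/solr/en/update?commit=true' -H 'Content-Type: text/xml' --data-binary '<add><doc><field name="id">1</field><field name="isbn">1234567890</field><field name="title">Harry</field><field name="description">An example description</field></doc></add>'
```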
Querying Solr is similar. For example, I've made two queries, the first for the documents in English and the second for the documents in German:
http://localhost:8983/solr/en/select?q=harry
http://localhost:8983/solr/de/select?q=harry
First of all, we create the solr.xml
file to tell Solr that the deployment will consist of one or more cores. What is a core? Multiple cores let you have multiple separate indexes inside a single Solr server instance. Of course you can run multiple Solr servers, but every one of them would have its own process (actually a servlet container process), its own memory space assigned, and so on. The multicore deployment lets you use multiple indexes inside a single Solr instance, a single servlet container process, and with the same memory space.
Following that, we have two cores defined. Every core is defined in its own core tag and has attributes defining its properties, such as the core home directory (the instanceDir attribute) or where the data will be stored (the dataDir attribute). You can have multiple cores in one instance of Solr—in theory an almost unlimited number—but in practice you shouldn't use too many.
There are some things about the solr.xml file that need to be discussed further. First of all, the adminPath attribute of the cores tag—it defines where the core admin interface will be available. With the value shown in the example, the core admin will be available at the following address: http://localhost:8983/solr/admin/cores.
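For reference, a minimal solr.xml for this deployment might look like the following sketch (the core names and the directory layout are assumptions based on the en and de cores used in this recipe; check your own paths before using it):

```xml
<solr persistent="true">
  <cores adminPath="/admin/cores">
    <!-- one core per language; instanceDir points at the core home directory -->
    <core name="en" instanceDir="cores/en"/>
    <core name="de" instanceDir="cores/de"/>
  </cores>
</solr>
```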
The field type definition for each of the cores is pretty straightforward. The file that describes the index for English documents uses an English stemmer for text data, and the file that describes the index for German documents uses a German stemmer for text data.
The only difference in field definition is the type that the description and title fields use—for the German schema.xml
they use the text_de
field type, and for the English schema.xml
they use the text_en
field type.
As for the queries, you must know one thing. When using a multicore deployment with more than one core, the address under which Solr offers its handlers is different from a single-core setup—Solr adds the core name before the handler name. So if you have a handler named /simple in the core named example, it will be available under the context /solr/example/simple, not /solr/simple. Knowing that, you'll know where to point the applications that use your multicore Solr deployment.
There is one more thing—you need to remember that every core has a separate index. That means that you can't combine results from different cores, at least not automatically. For example, you can't automatically get results mixing documents in English and German; you must do it yourself or choose a different architecture for your Solr deployment.
If you need more information about cores, maybe some of the following information will be helpful. If you are looking for more information about the core admin interface commands, please refer to the Solr wiki pages found at http://wiki.apache.org/solr/CoreAdmin.
As you may already know, cache plays a major role in a Solr deployment. And I'm not talking about some external cache—I'm talking about the three Solr caches: the filter cache, the query result cache, and the document cache.
There is a fourth cache—Lucene's internal field cache—but you can't control its behavior; it is managed by Lucene and created when it is first used by the Searcher object.
With the help of these caches, we can tune the behavior of the Solr searcher instance. In this recipe, we will focus on how to configure your Solr caches to suit most needs. There is one thing to remember—Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.
Before you start tuning Solr caches, you should gather some information about your Solr instance. That information is:
The number of documents in your index
The number of queries per second made to that index
The number of unique filter (fq parameter) values in your queries
The maximum number of documents returned in a single query
The number of different queries and different sorts
All those numbers can be derived from the Solr logs and by using the Solr admin interface.
For the purpose of this task, I assumed the following numbers:
Number of documents in the index: 1,000,000
Number of queries per second: 100
Number of unique filters: 200
Maximum number of documents returned in a single query: 100
Number of different queries and different sorts: 500
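Given those assumptions, the document cache size used later in this recipe (11,000) can be derived as a quick sketch—the concurrent result count plus some headroom (the 10% headroom figure is my assumption, not a Solr rule):

```shell
QPS=100          # queries per second, from the assumptions above
MAX_DOCS=100     # maximum documents returned by a single query
# document cache should exceed QPS * MAX_DOCS; add ~10% headroom
SIZE=$(( QPS * MAX_DOCS + QPS * MAX_DOCS / 10 ))
echo "documentCache size: $SIZE"
```

This prints `documentCache size: 11000`, which matches the documentCache configuration shown below.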
Let's open the solrconfig.xml
file and tune our caches. All the changes should be made in the query section of the file (the section between the <query>
and </query>
XML tags).
First goes the filter cache:
<filterCache class="solr.FastLRUCache" size="200" initialSize="200" autowarmCount="100"/>
Second goes the query result cache:
<queryResultCache class="solr.FastLRUCache" size="500" initialSize="500" autowarmCount="250"/>
Third, we have the document cache:
<documentCache class="solr.FastLRUCache" size="11000" initialSize="11000" />
Of course, the preceding configuration is based on the example values.
Furthermore, let's set our result window to match our needs—we sometimes need to get more results than a single query returns, for example when paging. So, we change the appropriate value in solrconfig.xml to something like this:
<queryResultWindowSize>200</queryResultWindowSize>
And that's all.
Let's start with a small explanation. First of all, we use the solr.FastLRUCache implementation instead of solr.LRUCache. This is a new cache implementation introduced in Solr 1.4. FastLRUCache tends to be faster when Solr gets more from the caches than it puts into them—the opposite of LRUCache, which tends to be more efficient when there are more put operations than get operations. That's why we use it.
This may be the first time you have seen cache configuration, so I'll explain what the cache configuration parameters mean:
class—you probably figured that out by now; this is the class implementing the cache
size—this is the maximum size that the cache can have
initialSize—this is the initial size that the cache will have
autowarmCount—this is the number of cache entries that will be copied to the new instance of the same cache when Solr invalidates the Searcher object—for example, during a commit operation
As you can see, I tend to use the same number of entries for size
and initialSize
, and half of those values for the autowarmCount
.
There is one thing you should be aware of. Some of the Solr caches (the document cache, actually) operate on internal identifiers called docids. Those caches cannot be automatically warmed, because docids change after every commit operation, and thus copying them would be useless.
Now let's take a look at the cache types and what they are used for.
So first we have the filter cache. This cache is responsible for holding information about filters and the documents that match them—actually, an unordered set of document IDs matching each filter. If you don't use the faceting mechanism with the filter cache, you should set its size to at least the number of unique filters present in your queries. That way Solr can store all the unique filters with their matching document IDs, which will speed up queries that use filters.
The next cache is the query result cache. It holds the ordered set of internal IDs of the documents that match a given query and the specified sort. That's why, if you use caches, you should add as many filters as you can and keep your query (the q parameter) as clean as possible—for example, pass only the search box content of your search application to the query parameter. If the same query is run more than once and the cache is large enough to hold the entry, the information available in the cache will be used. This allows Solr to save precious I/O operations for repeated queries, resulting in a performance boost.
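To illustrate what keeping q clean means in practice, compare two hypothetical ways of writing the same request (the category field and the default /select handler are assumptions for illustration, not part of this recipe's schema):

```shell
BASE="http://localhost:8983/solr/select"
# everything crammed into q: each user query + constraint combination
# becomes a separate query result cache entry
echo "${BASE}?q=harry+AND+category:books"
# user input in q, the fixed constraint in fq: the filter is cached once
# in the filter cache and reused across many different q values
echo "${BASE}?q=harry&fq=category:books"
```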
Tip
I tend to set the maximum size of this cache to the number of unique queries and sorts that Solr handles between Searcher object invalidations. This tends to be enough in most cases.
The last type of cache is the document cache. It holds the Lucene documents that were fetched from the index—basically, the stored fields of all the documents gathered from the Solr index. The size of this cache should always be greater than the number of concurrent queries multiplied by the maximum number of results you get from Solr. This cache can't be automatically warmed, because every commit changes the internal IDs of the documents. Remember that this cache can be memory-consuming if you have many stored fields.
The last is the query result window. This parameter tells Solr how many documents to fetch from the index for a single Lucene query—a kind of superset of the documents actually requested. In our example, we want a maximum of one hundred documents as the result of a single query, but the query result window tells Solr to always gather two hundred. Then, when we need some of the documents that follow the first hundred, they will be fetched from the cache, saving resources. The size of the query result window mostly depends on the application and how it uses Solr: if you tend to do a lot of paging, you should consider a higher query result window value.
Tip
You should remember that the size of caches shown in this task is not final and you should adapt them to your application needs. The values and the method of their calculation should be only taken as a starting point to further observation and optimizing process. Also, please remember to monitor your Solr instance memory usage as using caches will affect the memory that is used by the JVM.
There are a few things that you should know when configuring your caches.
If you use the term enumeration faceting method (the facet.method=enum parameter), Solr will use the filter cache to check each term. Remember that if you use this method, your filter cache size should be at least the number of unique facet values across all your faceted fields. This is crucial, and you may experience a performance loss if this cache is not configured the right way.
When your Solr instance has a low cache hit ratio, you should consider not using caches at all (you can see the hit ratio on the Solr administration pages). Cache insertion is not free—it costs CPU time and resources. So if you see a very low cache hit ratio, consider turning your caches off; it may speed up your Solr instance. Before you turn the caches off, though, make sure they were set up correctly—a small hit ratio can be the result of a bad cache configuration.
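The hit ratio itself is just hits divided by lookups—both counters are visible in the cache statistics on the Solr admin pages. A quick sketch (the numbers below are made up for illustration):

```shell
# hypothetical counters read from the cache statistics page
LOOKUPS=10000
HITS=9200
awk -v h="$HITS" -v l="$LOOKUPS" 'BEGIN { printf "hit ratio: %.2f\n", h / l }'
```

A ratio near 0.9 suggests the cache is earning its keep; a ratio near zero suggests reviewing the configuration or disabling the cache.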
There is another way to warm your caches if you know the most common queries sent to your Solr instance—auto-warming queries.
To see how to configure them, you should refer to Chapter 7, Improving Solr Performance, and the recipe Improving Solr performance right after start up or commit operation.
To see how to use the administration pages of the Solr server, you should refer to Chapter 4, Solr Administration
For information on how to cache whole pages of results, please refer to Chapter 7, the recipe Caching whole result pages.
There are many ways to index web pages. We could download them, parse them, and index with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem—how do you fetch them? We could possibly create our own software to do that, but that takes time and resources. That's why this recipe will cover how to fetch and index web pages using Apache Nutch.
For the purpose of this recipe we will be using version 1.2 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org.
First of all, we need to install Apache Nutch. To do that, we just need to extract the downloaded archive to a directory of our choice; for example, I installed it in the /nutch directory. This directory will be referred to as $NUTCH_HOME.
Open the file $NUTCH_HOME/conf/nutch-default.xml
and set the value http.agent.name
to the desired name of your crawler. It should look like this:
<property>
<name>http.agent.name</name>
<value>SolrCookbookCrawler</value>
<description>HTTP 'User-Agent' request header.</description>
</property>
Now let's create an empty directory called crawl
in the $NUTCH_HOME
directory. Then create the nutch
directory in the $NUTCH_HOME/crawl
directory.
The next step is to create a directory urls
in the $NUTCH_HOME/crawl/nutch
directory.
Now add a file named site to the $NUTCH_HOME/crawl/nutch directory. For the purpose of this book, we will be crawling the Solr and Lucene pages, so this file should contain the following: http://lucene.apache.org.
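The directory and seed file steps can be sketched as shell commands (the path layout follows the $NUTCH_HOME convention described above; adjust it to your installation):

```shell
NUTCH_HOME=${NUTCH_HOME:-.}                        # adjust to your Nutch install path
mkdir -p "$NUTCH_HOME/crawl/nutch"                 # crawl database directories
echo "http://lucene.apache.org" > "$NUTCH_HOME/crawl/nutch/site"   # seed URL list
cat "$NUTCH_HOME/crawl/nutch/site"
```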
Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace MY.DOMAIN.NAME with http://lucene.apache.org, so that the appropriate entry looks like this:
+^http://lucene.apache.org/
One last thing before fetching the data is the Solr configuration. The only thing we need to do is copy the $NUTCH_HOME/conf/schema.xml file to the $SOLR_HOME/conf directory.
Now we can start fetching web pages.
Run the following command from the $NUTCH_HOME
directory:
bin/nutch crawl crawl/nutch/site -dir crawl -depth 3 -topN 50
Depending on your Internet connection and your machine configuration, after some time you should see the following message:
crawl finished: crawl
This means that the crawl is completed and the data is fetched. Now we should invert the fetched data to be able to index anchor text to the indexed pages. To do that, we invoke the following command:
bin/nutch invertlinks crawl/linkdb -dir crawl/segments
Sometime later, you'll see a message that will inform you that the invert process is completed:
LinkDb: finished at 2010-10-18 21:35:44, elapsed: 00:00:15
We can now send our data to Solr (you can find the appropriate schema.xml file in the conf directory of the Nutch distribution). To do that, you should run the following command:
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*
After a period of time, depending on the size of your crawl database, you should see a message informing you that the indexing process was finished:
SolrIndexer: finished at 2010-10-18 21:39:28, elapsed: 00:00:26
After installing Nutch and Solr, the first thing we do is set our crawler name. Nutch does not allow empty names, so we must choose one. The nutch-default.xml file defines more properties than the one mentioned, but at this time that is the only one we need to know about.
The next step is the creation of directories where the crawl database will be stored. It doesn't have to be exactly the same directory as the example crawl
. You can place it on a different partition or another hard disk drive.
The site file we created in the $NUTCH_HOME/crawl/nutch directory should contain information about the sites from which we want information to be fetched. In the example, we have only one site—http://lucene.apache.org.
The crawl-urlfilter.txt
file contains information about the filters that will be used to check the URLs that Nutch will crawl. In the example, we told Nutch to accept every URL that begins with http://lucene.apache.org.
Next, we start with some "Nutch magic". First of all, we run the crawling command. The crawl command of the Nutch command-line utility takes the following parameters:
The file with the addresses to fetch
The directory where the fetch database will be stored
How deep to follow the links—in our example, we told Nutch to go a maximum of three links from the main page
How many documents to get at each depth level—in our example, we told Nutch to get a maximum of 50 documents per level
The next big thing is the link inversion process. This process is performed to generate the link database, so that Nutch can index anchor text with the associated pages. The invertlinks command of the Nutch command-line utility was run with two parameters:
Output directory where the newly created link database should be created
Directory where the data segments were written during the crawl process
The last command that was run was the one that pushed the data into Solr. This process uses the javabin format and the /update handler, so remember to have both of these configured in your Solr instance. The solrindex command of the Nutch command-line utility was run with the following parameters:
The address of the Solr server instance
The directory containing the crawl database created by the crawl command
The directory containing the link database created by the invertlinks command
The list of segments that contain the crawl data
There is one more thing worth knowing when you start a journey in the land of Apache Nutch.
The crawl
command of the Nutch command-line utility has another option—it can be configured to run crawling with multiple threads. To achieve that, you add the parameter:
-threads N
So if you would like to crawl with 10 threads, you should run the crawl command as follows:
bin/nutch crawl crawl/nutch/site -dir crawl -depth 3 -topN 50 -threads 10
If you seek more information about Apache Nutch, please refer to http://nutch.apache.org and go to the wiki section.
When we have millions of documents, large indexes, and many shards, there are situations where you don't need to show all the results for a given query. It is very probable that you only want to show your user the top N results. That's when you can use early termination techniques to terminate long-running queries after a set amount of time. But using early termination techniques is a bit tricky—there are a few things that need to be addressed before you can use them. One of those things is getting the most relevant results. The tool for sorting a Lucene index used to be available only in Apache Nutch, but that is history now, because the Lucene version of this tool was committed to the SVN repository. This recipe will guide you through the process of index pre-sorting, and explain why and how to use this new feature of Lucene and Solr to get the most relevant results.
During the writing of this book, the IndexSorter
tool was only available in branch_3x
of the Lucene and Solr SVN repository. After downloading the appropriate version, compiling and installing it, we can begin using this tool.
IndexSorter is an index post-processing tool, which means that it should be used after the data is indexed. Let's assume that we have our data indexed. For the purpose of showing how to use the tool, I modified my schema.xml file to consist only of these fields (add the following to the fields section of your schema.xml file):
<field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/> <field name="isbn" type="string" indexed="true" stored="true" multiValued="false" /> <field name="title" type="text" indexed="true" stored="true" multiValued="false" /> <field name="description" type="text" indexed="true" stored="true" multiValued="false" /> <field name="author" type="string" indexed="true" stored="true" multiValued="false" /> <field name="value" type="string" indexed="true" stored="true" multiValued="false" />
Let's assume that we have a requirement to show our data sorted by the value field (the field contains float values; the higher the value, the more important the document), but our index is so big that a single query takes more time than a client is willing to wait for the results. That's why we need to pre-sort the index by the required field. To do that, we will use the tool named IndexSorter. There is one more thing before you can run the IndexSorter tool—the compiled Lucene JAR files need to be available on your classpath.
So let's run the following command:
java -cp lucene-libs/* org.apache.lucene.index.IndexSorter solr/data/index solr/data/new_index value
After some time, we should see a message like the following one:
IndexSorter: done, 9112 total milliseconds
This message means that everything went well and our index is sorted by the value field. The sorted index is written to the solr/data/new_index directory, and the old index is not altered in any way. To use the new index, you should replace the contents of the old index directory (that is, solr/data/index) with the contents of the solr/data/new_index directory.
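The swap can be sketched as follows (do it while Solr is not running; the mkdir line only creates stand-in directories so the sketch is runnable as-is—skip it on a real installation):

```shell
SOLR_DATA=${SOLR_DATA:-solr/data}                    # adjust to your Solr data directory
mkdir -p "$SOLR_DATA/index" "$SOLR_DATA/new_index"   # stand-ins for demonstration only
mv "$SOLR_DATA/index" "$SOLR_DATA/index.unsorted"    # keep the unsorted index as a backup
mv "$SOLR_DATA/new_index" "$SOLR_DATA/index"         # the sorted index takes its place
ls "$SOLR_DATA"
```

Keeping the unsorted copy around makes it easy to roll back if the sorted index misbehaves.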
I think that the field definitions do not need to be explained. The only thing worth looking at is the value field, which is the field we will be sorting on.
But how does this tool work? Basically, it sorts your Lucene index in a static way. What does that mean? Let's start with some explanation. When indexing documents with Solr (and Lucene, of course), each document is automatically given an internal identification number—a document ID—and documents with low internal IDs are visited by Lucene first. During the indexing process, we have no way to set the internal document IDs. So what happens when we use TimeLimitingCollector (and therefore end a query after a set amount of time) in combination with sorting by the value field over millions of documents? We get some of the data, but not all of it, because the query ends after a set amount of time. Then Solr sorts that data and returns it to the application or the user. You can imagine that, because the data set is not complete, the end user can get seemingly random results—Solr, and therefore Lucene, will have collected the documents with low IDs first.
To avoid that and get the most relevant results, we can use the IndexSorter tool to reorder the documents so that the ones we are interested in get the low internal IDs. That is what the IndexSorter tool is for—sorting our index on the basis of a defined field. Why do we only want to return the first batch of documents? Because when we have millions of documents, the user usually wants to see the most relevant ones, not all of them.
One thing to remember is that the sorting is static. You cannot change it during query execution. So if you need sorting on multiple fields, you should consider multicore deployment where one core holds unsorted data, and other cores hold indexes sorted using the IndexSorter
tool. Therefore, you'll be able to use the early termination techniques and get the most relevant data sorted on the basis of different fields.
To see how to use the early termination technique with Solr, refer to Chapter 7, the recipe How to get the first top documents fast when having millions of them.
Sometimes indexing prepared text files (XML, CSV, JSON, and so on) is not enough. There are numerous situations where you need to extract data from binary files. For example, one of my clients wanted to index PDF files—actually, their contents. To do that, we either need to parse the data in some external application or set up Solr to use Apache Tika. This recipe will guide you through the process of setting up Apache Tika with Solr.
First, let's edit our Solr instance solrconfig.xml
and add the following configuration:
<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
  <lst name="defaults">
    <str name="fmap.content">text</str>
    <str name="lowernames">true</str>
    <str name="uprefix">attr_</str>
    <str name="captureAttr">true</str>
  </lst>
</requestHandler>
Next, create a lib folder next to the conf directory (the directory where you place your Solr configuration files) and place the apache-solr-cell-3.1-SNAPSHOT.jar file from the dist directory of the official Solr distribution package there. After that, copy all the libraries from the contrib/extraction/lib/ directory to the lib directory you created before.
And that's actually all that you need to do in terms of configuration.
To simplify the example, I decided to choose the standard schema.xml
file distributed with Solr.
To test the indexing process, I created a PDF file named book.pdf using PDFCreator, which contained only the following text: This is a Solr cookbook. To index that file, I used the following command:
curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"
You should see the following response:
<?xml version="1.0" encoding="UTF-8"?> <response> <lst name="responseHeader"> <int name="status">0</int> <int name="QTime">578</int> </lst> </response>
Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files, but also HTML and XML files. To add a handler which uses Apache Tika, we need to add a handler based on the org.apache.solr.handler.extraction.ExtractingRequestHandler
class to our solrconfig.xml
file, as shown in the example.
So we added a new request handler with some default parameters. Those parameters tell Solr how to handle the data that Tika returns. The fmap.content parameter tells Solr which field the content of the parsed document should be put into; in our case, the parsed content will go to the field named text. The lowernames parameter set to true tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix, is very important: it tells Solr how to handle fields that are not defined in the schema.xml file—the value of the parameter will be prepended to the name of the field returned from Tika before it is sent to Solr. For example, if Tika returned a field named creator and we didn't have such a field in our index, Solr would index it under a field named attr_creator, which is a dynamic field. The last parameter tells Solr to index Tika XHTML elements into separate fields named after those elements.
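For reference, the attr_ prefix only works because the schema defines a matching dynamic field. In the example schema.xml distributed with Solr, the definition looks something like the following (the exact field type name varies between Solr versions, so treat this as a sketch):

```xml
<!-- catches any field whose name starts with attr_, such as attr_creator -->
<dynamicField name="attr_*" type="textgen" indexed="true" stored="true" multiValued="true"/>
```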
Next, we have the command that sends a PDF file to Solr. We send the file to the /update/extract handler with two parameters. First, we define a unique identifier—it's useful to be able to pass it while sending the document, because most binary documents won't contain an identifier in their contents. To pass the identifier, we use the literal.id parameter. The second parameter tells Solr to perform a commit right after document processing.
To see how to index binary files, please take a look at Chapter 2, Indexing Your Data, the recipes: Indexing PDF files, Indexing Microsoft Office files, and Extracting metadata from binary files.