Apache Solr 3.1 Cookbook

By Rafał Kuć
About this book

Apache Solr is a fast, scalable, modern, open source, and easy-to-use search engine. It allows you to develop a professional search engine for your ecommerce site, web application, or back office software. Setting up Solr is easy, but configuring it to get the most out of your site is the difficult bit.

The Solr 3.1 Cookbook will make your everyday work easier by using real-life examples that show you how to deal with the most common problems that can arise while using the Apache Solr search engine. Why waste your time searching the Internet for solutions when you can have all the answers in one place?

This cookbook will show you how to get the most out of your search engine. Each chapter covers a different aspect of working with Solr, from analyzing your text data through querying to performance improvement and developing your own modules. The practical recipes will help you quickly solve common problems with data analysis, show you how to use faceting to collect data, and help you speed up the performance of Solr. You will learn about functionalities that most newbies are unaware of, such as sorting results by a function value, highlighting matched words, and computing statistics to make your work with Solr easy and stress free.

Publication date:
July 2011
Publisher
Packt
Pages
300
ISBN
9781849512183

 

Chapter 1. Apache Solr Configuration

In this chapter, we will cover:

  • Running Solr on Jetty

  • Running Solr on Apache Tomcat

  • Using the Suggester component

  • Handling multiple languages in a single index

  • Indexing fields in a dynamic way

  • Making multilingual data searchable with multicore deployment

  • Solr cache configuration

  • How to fetch and index web pages

  • Getting the most relevant results with early query termination

  • How to set up the Extracting Request Handler

 

Introduction


Setting up an example Solr instance is not a hard task, at least when setting up the simplest configuration. The simplest way is to run the example provided with the Solr distribution, which shows how to use the embedded Jetty servlet container.

So far so good. We have a simple configuration, simple index structure described by the schema.xml file, and we can run indexing.

In this chapter, we will go a little further. You'll see how to configure and use the more advanced Solr modules. You'll see how to run Solr in different containers and how to prepare your configuration to meet different requirements. Finally, you will learn how to configure Solr cache to meet your needs and how to pre-sort your Solr indexes to be able to use early query termination techniques efficiently.

If you don't have any experience with Apache Solr, please refer to the Apache Solr tutorial that can be found at http://lucene.apache.org/solr/tutorial.html before reading this book.

Note

During the writing of this chapter, I used Solr version 3.1 and Jetty version 6.1.26, and those are the versions assumed in the recipes that follow. If another version of Solr is mandatory for a feature to run, it will be mentioned.

 

Running Solr on Jetty


The simplest way to run Apache Solr on a Jetty servlet container is to run the provided example configuration based on the embedded Jetty, but that is not what this recipe is about. In this recipe, I would like to show you how to configure and run Solr on a standalone Jetty container.

Getting ready

First of all, you need to download the Jetty servlet container for your platform. You can install it with a package manager (such as apt-get) or you can download it yourself from http://jetty.codehaus.org/jetty/. Of course, you also need solr.war and the other configuration files that come with Solr (you can get them from the example distribution that comes with Solr).

How to do it...

There are a few common mistakes that people make when setting up Jetty with Solr, but if you follow these instructions, the configuration process will be simple, fast, and will work flawlessly.

The first thing is to install the Jetty servlet container. For now, let's assume that you have Jetty installed.

Now we need to copy the jetty.xml and webdefault.xml files from the example/etc directory of the Solr distribution to the configuration directory of Jetty. In my Debian Linux distribution, it's /etc/jetty. After that, we have our Jetty installation configured.

The third step is to deploy the Solr web application by simply copying the solr.war file to the webapps directory of Jetty.

The next step is to copy the Solr configuration files to the appropriate directory. I'm talking about files like schema.xml, solrconfig.xml, and so on. Those files should be in the directory specified by the jetty.home system variable (in my case, this was the /usr/share/jetty directory). Please remember to preserve the directory structure you see in the example.

We can now run Jetty to see if everything is OK. To start a Jetty instance that was installed with a package manager such as apt-get, use the following command:

/etc/init.d/jetty start

If there were no exceptions during start up, we have a running Jetty with Solr deployed and configured. To check if Solr is running, try going to the following address with your web browser: http://localhost:8983/solr/.

You should see the Solr front page with cores, or a single core, mentioned. Congratulations, you have just successfully installed, configured, and run the Jetty servlet container with Solr deployed.

How it works...

For the purpose of this recipe, I assumed that we needed a single core installation with only schema.xml and solrconfig.xml configuration files. Multicore installation is very similar—it differs only in terms of the Solr configuration files.

Sometimes Solr needs to see some additional libraries. If so, just create a directory called lib next to the conf directory and put the additional libraries there. This is handy when you are not only working with the standard Solr package, but also want to include your own code as a standalone Java library.

The third step is to provide configuration files for the Solr web application. Those files should be in the directory specified by the system variable jetty.home or solr.solr.home. I decided to use the jetty.home directory, but whenever you need to put Solr configuration files in a different directory than Jetty, just ensure that you set the solr.solr.home property properly. When copying Solr configuration files, you should remember to include all the files and the exact directory structure that Solr needs. For the record, you need to ensure that all the configuration files are stored in the conf directory for Solr to recognize them.
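
To make this concrete, here is a minimal sketch of a Solr home layout and of how to point Jetty at it with the solr.solr.home property when starting Jetty directly with start.jar. The /usr/share/solr location and the custom JAR name are assumptions made only for this illustration; adjust them to your own installation:

# hypothetical Solr home layout
# /usr/share/solr/conf/schema.xml
# /usr/share/solr/conf/solrconfig.xml
# /usr/share/solr/lib/my-custom-filters.jar   (optional, your own code)

# point Jetty at that Solr home when starting it with start.jar
java -Dsolr.solr.home=/usr/share/solr -jar start.jar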

After all those steps, we are ready to launch Jetty. The example command has been run from the Jetty installation directory.

After running the example query in your web browser, you should see the Solr front page with a single core. Congratulations, you have just successfully configured and run the Jetty servlet container with Solr deployed.

There's more...

There are a few things you can do to counter common problems when running Solr within the Jetty servlet container. Here are the ones I encountered most often during my work.

I want Jetty to run on a different port

Sometimes it's necessary to run Jetty on a port other than the default one. There are two ways to achieve that:

  1. Adding an additional start up parameter, jetty.port. The start up command would look like this:

    java -Djetty.port=9999 -jar start.jar
    
  2. Changing the jetty.xml file—to do that, you need to change the following line:

    <Set name="port"><SystemProperty name="jetty.port" default="8983"/></Set>

    to:

    <Set name="port"><SystemProperty name="jetty.port" default="9999"/></Set>

Buffer size is too small

Buffer overflow is a common problem when our queries get too long and too complex—for example, when using many logical operators or long phrases. When the standard header buffer is not enough, you can resize it to meet your needs. To do that, you add the following line to the Jetty connector in the jetty.xml file. Of course, the value shown in the example can be changed to the one that you need:

<Set name="headerBufferSize">32768</Set>

After adding the value, the connector definition should look more or less like this:

<Call name="addConnector">
<Arg>
<New class="org.mortbay.jetty.bio.SocketConnector">
<Set name="port"><SystemProperty name="jetty.port" default="8080"/></Set>
<Set name="maxIdleTime">50000</Set>
<Set name="lowResourceMaxIdleTime">1500</Set>
<Set name="headerBufferSize">32768</Set>
</New>
</Arg>
</Call>

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.PacktPub.com. If you purchased this book elsewhere, you can visit http://www.PacktPub.com/support and register to have the files e-mailed directly to you.

 

Running Solr on Apache Tomcat


Sometimes you need to choose a servlet container other than Jetty. Maybe it's because your client has other applications running on another servlet container, maybe it's because you just don't like Jetty. Whatever your requirements are that put Jetty out of the scope of your interest, the first thing that comes to mind is a popular and powerful servlet container—Apache Tomcat. This recipe will give you an idea of how to properly set up and run Solr in the Apache Tomcat environment.

Getting ready

First of all, we need an Apache Tomcat servlet container. It can be found at the Apache Tomcat website—http://tomcat.apache.org. I concentrated on Tomcat version 6.x for two reasons—version 5 is pretty old right now, and version 7 is the opposite—it's too young in my opinion. That is why I decided to show you how to deploy Solr on Tomcat version 6.0.29, which was the newest one at the time of writing this book.

How to do it...

To run Solr on Apache Tomcat, we need to perform the following six simple steps:

  1. Firstly, you need to install Apache Tomcat. The Tomcat installation is beyond the scope of this book, so we will assume that you have already installed this servlet container in the directory specified by the $TOMCAT_HOME system variable.

  2. The next step is preparing the Apache Tomcat configuration files. To do that, we need to add the following attribute to the connector definition in the server.xml configuration file:

    URIEncoding="UTF-8"

    The portion of the modified server.xml file should look like this:

    <Connector port="8080" protocol="HTTP/1.1"
                   connectionTimeout="20000"
                   redirectPort="8443"
                URIEncoding="UTF-8" />
  3. Create a proper context file. To do that, create a solr.xml file in the $TOMCAT_HOME/conf/Catalina/localhost directory. The contents of the file should look like this:

    <Context path="/solr">
    <Environment name="solr/home" type="java.lang.String" value="/home/solr/configuration/" override="true"/>
    </Context>
  4. The next thing is the Solr deployment. To do that, we need to copy the solr.war file, which contains the files and libraries necessary to run Solr, to the Tomcat webapps directory. If Solr needs to see some additional libraries, you should add them to the $TOMCAT_HOME/lib directory.

  5. The last thing we need to do is add the Solr configuration files. The files you need to copy are schema.xml, solrconfig.xml, and so on. Those files should be placed in the directory specified by the solr/home variable (in our case, /home/solr/configuration/). Please don't forget to ensure the proper directory structure. If you are not familiar with the Solr directory structure, please take a look at the example deployment that is provided with the standard Solr package.

  6. Now we can start the servlet container by running the following command:

    bin/catalina.sh start
    

    In the log file, you should see a message like this:

    Info: Server startup in 3097 ms
    

To ensure that Solr is running properly, open a web browser and point it to the address where Solr should be visible, for example: http://localhost:8080/solr/.

If you see the page with links to the administration pages of each of the defined cores, your Solr is up and running.
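
Before moving on, here is a minimal sketch of what the /home/solr/configuration directory from step 5 might contain; the layout is an assumption modeled on the example deployment shipped with Solr, and the data directory is created by Solr itself:

/home/solr/configuration/conf/schema.xml
/home/solr/configuration/conf/solrconfig.xml
/home/solr/configuration/data/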

How it works...

Let's start from the second step, as the installation part is beyond the scope of this book. As you probably know, Solr uses UTF-8 file encoding. That means that we need to ensure that Apache Tomcat will be informed that all requests and responses made should use that encoding. To do that, we modify the server.xml in the way shown in the example.

The Catalina context file (called solr.xml in our example) says that our Solr application will be available under the /solr context (the path attribute), and that the war file will be placed in the /home/tomcat/webapps/ directory. solr/home is also defined, and that is where we need to put our Solr configuration files. The shell command that is shown starts Apache Tomcat. I should also mention some other options of the catalina.sh (or catalina.bat) script:

  • stop—stops Apache Tomcat

  • restart—restarts Apache Tomcat

  • debug—starts Apache Tomcat in debug mode

After opening the example address in a web browser, you should see a Solr front page with a core (or cores if you have a multicore deployment). Congratulations, you have just successfully configured and run the Apache Tomcat servlet container with Solr deployed.

There's more...

Here are solutions to some common problems encountered when running Solr on Apache Tomcat.

Changing the port on which we see Solr running on Tomcat

Sometimes it is necessary to run Apache Tomcat on a port other than the 8080 default one. To do that, you need to modify the port variable of the connector definition in the server.xml file located in the $TOMCAT_HOME/conf directory. If you would like your Tomcat to run on port 9999, this definition should look like this:

<Connector port="9999" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
            URIEncoding="UTF-8" />

While the original definition looks like this:

<Connector port="8080" protocol="HTTP/1.1"
               connectionTimeout="20000"
               redirectPort="8443"
            URIEncoding="UTF-8" />
 

Using the Suggester component


Nowadays, it's common for web pages to offer search suggestions (or autocomplete, as I tend to call it), just like the "big" search engines such as Google and Microsoft do. Lately, Solr developers came up with a new component called Suggester. It provides flexible ways to add suggestions to your application using Solr. This recipe will guide you through the process of configuring and using this new component.

How to do it...

First we need to add the search component definition in the solrconfig.xml file. That definition should look like this:

<searchComponent class="solr.SpellCheckComponent" name="suggester">
<lst name="spellchecker">
<str name="name">suggester</str>
<str name="classname"> org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl"> org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name="field">name</str>
<str name="threshold">2</str>
</lst>
</searchComponent>

Now we can define an appropriate request handler. To do that, we modify the solrconfig.xml file and add the following lines:

<requestHandler class="org.apache.solr.handler.component.SearchHandler" name="/suggester">
<lst name="defaults">
<str name="spellcheck">true</str>
<str name="spellcheck.dictionary">suggester</str>
<str name="spellcheck.count">10</str>
</lst>
<arr name="components">
<str>suggester</str>
</arr>
</requestHandler>

Now if all went well (Solr server started without an error), we can make a Solr query like this:

http://localhost:8983/solr/suggester/?q=a

You should see the response as follows:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">16</int>
</lst>
<lst name="spellcheck">
<lst name="suggestions">
<lst name="a">
<int name="numFound">8</int>
<int name="startOffset">0</int>
<int name="endOffset">1</int>
<arr name="suggestions">
<str>a</str>
<str>asus</str>
<str>ati</str>
<str>ata</str>
<str>adata</str>
<str>all</str>
<str>allinone</str>
<str>apple</str>
</arr>
</lst>
</lst>
</lst>
</response>

How it works...

After reading the aforementioned search component configuration, you may wonder why we use solr.SpellCheckComponent as our search component implementation class. Actually, the Suggester component relies on spellchecker and reuses most of its classes and logic. That is why Solr developers decided to reuse spellchecker code.

Anyway, back to our configuration. We have some interesting configuration options: the component name (the name variable), the logic implementation class (the classname variable), the word lookup implementation (the lookupImpl variable), the field on which suggestions will be based (the field variable), and a threshold parameter, which defines the minimum fraction of documents in which a term must appear for it to be visible in the response.

Following the definition, there is a request handler definition present. Of course, it is not mandatory, but it is useful—you don't need to pass all the required parameters with your query every time. Instead, you just write those parameters to the solrconfig.xml file along with the request handler definition. We called our example request handler /suggester and under that name it'll be exposed in the servlet container. We have three parameters saying that, when using the defined request handler, Solr should always include suggestions (the spellcheck parameter set to true), it should use the dictionary named suggester that is actually our component (the spellcheck.dictionary parameter), and the maximum number of suggestions should be 10 (the spellcheck.count parameter).

As mentioned before, the Suggester component reuses most of the spellchecker logic, so most of the spellchecker parameters can be used to configure the behavior of the Suggester component. Remember that, because you never know when you'll need one of the non-standard parameters.
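
For example, the defaults configured in the request handler can be overridden at query time with the standard spellchecker parameters. The following request is only an illustrative sketch that asks our example /suggester handler for five suggestions instead of the configured ten:

http://localhost:8983/solr/suggester/?q=ap&spellcheck.count=5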

There's more...

There are a few things that are good to know about when using the Suggester component. The following are the three most common things that people tend to ask about:

Suggestions from a static dictionary

There are some situations when you'll need to get suggestions not from a defined field in an index, but from a file. To do that, you need to have a text dictionary (encoded in UTF-8) that looks as follows:

Suggestion 1
Suggestion 2
Suggestion 3

Second, you need to add another parameter, named sourceLocation, to the Suggester component definition; it specifies the dictionary location (the file that will be used as the base for suggestions):

<str name="sourceLocation">suggest_dictionary.txt</str>

So the whole Suggester component definition would look like this:

<searchComponent class="solr.SpellCheckComponent" name="suggester">
<lst name="spellchecker">
<str name="name">suggester</str>
<str name="classname"> org.apache.solr.spelling.suggest.Suggester</str>
<str name="lookupImpl"> org.apache.solr.spelling.suggest.tst.TSTLookup</str>
<str name=" sourceLocation">suggest_dictionary.txt</str>
</lst>
</searchComponent>

Rebuilding the suggestion word base after commit

If you have field-based suggestions, you'll probably want to rebuild your suggestion dictionary after every change in your index. Of course, you can do that manually by invoking the appropriate command (the spellcheck.build parameter set to true), but we don't want to do it this way—we want Solr to do it for us. To do that, just add another parameter to your Suggester component configuration. The new parameter should look like this:

<str name="buildOnCommit">true</str>

Removing uncommon words from suggestions

How to remove language mistakes and uncommon words from suggestions is a common concern. It's not only your data—most data has some kind of mistake or error—and the process of data cleaning is time consuming and difficult. Solr has an answer for you. We can add another parameter to our Suggester component configuration and we won't see any of the mistakes and uncommon words in the suggestions. Add the following parameter:

<float name="threshold">0.05</float>

This parameter takes values from 0 to 1. It tells the Suggester component the minimum fraction of all documents in which a word must appear for it to be included as a suggestion.

See also

If you don't need a sophisticated Suggester component like the one described, you should take a look at the How to implement an autosuggest feature using faceting recipe in Chapter 6, Using Faceting Mechanism.

 

Handling multiple languages in a single index


There are many examples where multilingual applications and multilingual searches are mandatory—for example, libraries having books in multiple languages. This recipe will cover how to set up Solr, so we can make our multilingual data searchable using a single query and a single response. This task doesn't cover how to analyze your data and automatically detect the language used. That is beyond the scope of this book.

How to do it...

First of all, you need to identify what languages your applications will use. For example, my latest application uses two languages—English and German.

After that, we need to know which fields need to be separate. For example—a field with an ISBN number or an identifier field can be shared, because they don't need to be indexed in a language-specific manner, but titles and descriptions should be separate. Let's assume that our example documents consist of four fields:

  • ID

  • ISBN

  • Title

  • Description

We have two languages—English and German—and we want all documents to be searchable with one query within one index.

First of all, we need to define some language-specific field types. To add the field type for English, we need to add the following lines to the schema.xml file:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" 
generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>

Next, we need to add the field type for fields containing a German title and description:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
</analyzer>
</fieldType>

Now we need to define the document's fields using the previously defined field types. To do that, we add the following lines to the fields section of the schema.xml file:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="isbn" type="string" indexed="true" stored="true" required="true" />
<field name="title_en" type="text_en" indexed="true" stored="true" />
<field name="description_en" type="text_en" indexed="true" stored="true" />
<field name="title_de" type="text_de" indexed="true" stored="true" />
<field name="description_de" type="text_de" indexed="true" stored="true" />

And that's all we need to do with the configuration. After indexing, we can query our Solr server for documents that are in English or German. To do that, we send the following query to Solr:

q=title_en:harry+OR+description_en:harry+OR+title_de:harry+OR+description_de:harry

How it works...

First of all, we added a text_en field type to analyze the English title and description. We tell Solr to split the data on whitespace, split on case change, lowercase the text parts, and finally stem the text with the appropriate algorithm—in this case, one of the English stemming algorithms available in Solr and Lucene. Let's stop for a minute to explain what stemming is. All you need to know, for now, is that stemming is the process of reducing inflected or derived words to their stem, root, or base form. It's good to use a stemming algorithm in a full text search engine because it lets us return the same set of documents for words like 'cat' and 'cats'.

Of course, we don't want to use English stemming algorithms on German words. That's why we've added another field type—text_de. It differs from the text_en field type in one respect—it uses solr.SnowballPorterFilterFactory with an attribute specifying the German language instead of solr.PorterStemFilterFactory.

The last thing we do is to add field definitions. We have six fields. Two of those six are id (unique identifier) and isbn. The next two are English title and description, and the last two are German title and description.
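
To make this concrete, here is a sketch of how an English document could be indexed with this schema; the field values, including the ISBN, are made up for illustration:

<add>
<doc>
<field name="id">1</field>
<field name="isbn">1234567890</field>
<field name="title_en">Harry and the search engine</field>
<field name="description_en">A short story about searching for things.</field>
</doc>
</add>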

As you can see in the example query, we can get documents regardless of where the data is. To achieve that, I used the OR logical operator to get the documents that contain the searched word in any of the specified fields. If we have a match in English, we will get the desired documents; if we get a match in German, we will also get the desired documents.

See also

If you don't need to get all multilingual results with a single query, you can look at the Making multilingual data searchable with multicore deployment recipe description provided in this chapter. If you want to sort the data we talked about, please refer to Chapter 10, Dealing With Problems and the recipe How to sort non-English languages properly.

 

Indexing fields in a dynamic way


Sometimes you don't know the names and number of fields you have to store in your index. You can only determine the types that you'll use and nothing else—either because it is a requirement, or because the data comes from different sources, or because the data is so complex that no one knows how many and which fields need to be present in the index. There is a cure for that problem—dynamic fields. This recipe will guide you through the definition and usage of dynamic fields.

How to do it...

Within the scope of this book, we just need to do one thing—prepare the schema.xml file so that Solr will be able to determine where to put our data. The example schema.xml file which comes with Solr provides definitions of some dynamic fields. Here are some example dynamic fields:

<dynamicField name="*_i" type="int" indexed="true" stored="true"/>
<dynamicField name="*_s" type="string" indexed="true" stored="true"/>
<dynamicField name="*_l" type="long" indexed="true" stored="true"/>
<dynamicField name="*_t" type="text" indexed="true" stored="true"/>
<dynamicField name="*_b" type="boolean" indexed="true" stored="true"/>
<dynamicField name="*_f" type="float"indexed="true" stored="true"/>
<dynamicField name="*_d" type="double" indexed="true" stored="true"/>

Having those definitions in the schema.xml file, we can update data without the need for a static field definition. Here is an example document that can be sent to Solr:

<add>
<doc>
<field name="name_s">Solr Cookbook</field>
<field name="name_t">Solr Cookbook</field>
<field name="price_d">19.99</field>
<field name="quantity_i">12</field>
<field name="available_b">true</field>
</doc>
</add>

How it works...

When defining dynamic fields, you tell Solr to match (at indexing time) every field name that fits the pattern you wrote. For example, the pattern *_i will match field names like quantity_i and grade_i, but not gradei.

However, you must know one thing—when a field name matches more than one pattern, Solr chooses the dynamic field with the longest matching pattern. What's more, Solr will always choose a static field over a dynamic field.

Let's get back to our definitions. We have a few dynamic fields defined in the schema.xml file. As you can see, a dynamic field definition is just a little bit different from a static field definition. Definitions begin with the dynamicField XML tag and have the same attributes as a static field definition. The one uncommon thing is the name—it's a pattern that will be used to match field names. There is one more thing to remember—you can only use a single wildcard character, *, at the start or end of a dynamic field name. Any other usage will result in a Solr start up error.
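
For example, the wildcard can also be placed at the beginning of the pattern. The definition below is only a sketch modeled on the attr_* dynamic field found in the example schema shipped with Solr; the field type name must be one that is actually defined in your schema:

<dynamicField name="attr_*" type="text" indexed="true" stored="true" multiValued="true"/>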

As you see in the example document, we have five fields filled. The field named name_s will be matched to the pattern *_s, the field named name_t will be matched to the pattern *_t, and the rest will behave similarly.

Of course, this is just a simple example of a document. Documents may consist of hundreds of fields, both static and dynamic, and dynamic field definitions may be much more sophisticated than the ones in the preceding example. You should just remember that using dynamic fields is perfectly legitimate and you should use them whenever your deployment needs them.

See also

If you need your data not only to be indexed dynamically, but also to be copied from one field to another, please refer to the recipe entitled Copying contents of one field to another in Chapter 3, Analyzing your Text Data.

 

Making multilingual data searchable with multicore deployment


You don't always need to handle multiple languages in a single index—perhaps because your application is available in multiple languages but shows only one of them at a time, or because of some other requirement. Whatever your reason is, this recipe will guide you on how to handle separable data in a single instance of the Solr server through the use of a multicore deployment.

How to do it...

First of all, you need to create the solr.xml file and place it in your $SOLR_HOME directory. Let's assume that our application will handle two languages—English and German. The sample solr.xml file might look like this:

<?xml version="1.0" encoding="UTF-8" ?>
<solr>
<cores adminPath="/admin/cores/">
<core name="en" instanceDir="cores/en">
<property name="dataDir" value="cores/en/data" />
</core>
<core name="de" instanceDir="cores/de">
<property name="dataDir" value="cores/de/data" />
</core>
</cores>
</solr>

Let's create the directories mentioned in the solr.xml file. For the purpose of the example, I assumed that $SOLR_HOME points to the /usr/share/solr/ directory. We need to create the following directories:

  • $SOLR_HOME/cores

  • $SOLR_HOME/cores/en

  • $SOLR_HOME/cores/de

  • $SOLR_HOME/cores/en/conf

  • $SOLR_HOME/cores/en/data

  • $SOLR_HOME/cores/de/conf

  • $SOLR_HOME/cores/de/data

We will use the sample solrconfig.xml file provided with the example deployment of multicore Solr version 3.1. Just copy this file to the conf directory of both cores. For the record, the file should contain:

<?xml version="1.0" encoding="UTF-8" ?>
<config>
<updateHandler class="solr.DirectUpdateHandler2" />
<requestDispatcher handleSelect="true" >
<requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
</requestDispatcher>
<requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
<requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />
<admin>
<defaultQuery>solr</defaultQuery>
</admin>
</config>

Now we should prepare simple schema.xml files. To keep things simple, we will just add the two field types below to the example Solr schema.xml—one to each core's copy.

To the schema.xml file that will describe the index containing English documents, let's add the following field type (just add it in the types section of the schema.xml file):

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.PorterStemFilterFactory"/>
</analyzer>
</fieldType>

To the schema.xml file describing the index containing German documents, let's add the following field type:

<fieldType name="text_de" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.SnowballPorterFilterFactory" language="German2"/>
</analyzer>
</fieldType>

The field definition for the English schema.xml should look like this:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="isbn" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_en" indexed="true" stored="true" />
<field name="description" type="text_en" indexed="true" stored="true" />

The field definition for the German schema.xml should look like this:

<field name="id" type="string" indexed="true" stored="true" required="true" />
<field name="isbn" type="string" indexed="true" stored="true" required="true" />
<field name="title" type="text_de" indexed="true" stored="true" />
<field name="description" type="text_de" indexed="true" stored="true" />

Now you should copy the files you've just created to the appropriate directories:

  • German schema.xml file to $SOLR_HOME/cores/de/conf

  • English schema.xml file to $SOLR_HOME/cores/en/conf

That's all in terms of the configuration. You can now start your Solr instance as you always do.

Now all the index update requests should be made to the following addresses:

  • English documents should go to http://localhost:8983/solr/en/update

  • German documents should go to http://localhost:8983/solr/de/update

It is similar when querying Solr. For example, I've made two queries, the first for documents in English and the second for documents in German:

http://localhost:8983/solr/en/select?q=harry
http://localhost:8983/solr/de/select?q=harry

How it works...

First of all, we create the solr.xml file to tell Solr that the deployment will consist of one or more cores. What is a core? Multiple cores let you have multiple separate indexes inside a single Solr server instance. Of course you can run multiple Solr servers, but every one of them would have its own process (actually a servlet container process), its own memory space assigned, and so on. The multicore deployment lets you use multiple indexes inside a single Solr instance, a single servlet container process, and with the same memory space.

Following that, we have two cores defined. Every core is defined in its own core tag and has attributes defining its properties, such as the core home directory (the instanceDir attribute) or where the data will be stored (the dataDir property). You can have multiple cores in one instance of Solr—in theory an almost unlimited number—but in practice, don't use too many.

There are some things about the solr.xml file that need to be discussed further. First of all, there is the adminPath attribute of the cores tag—it defines where the core admin interface will be available. With the value shown in the example, the core admin will be available at the following address: http://localhost:8983/solr/admin/cores.
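
For example, the core admin interface can be used to check the status of the defined cores or to reload a core's configuration without restarting the servlet container. The requests below are a sketch based on the standard CoreAdmin commands:

http://localhost:8983/solr/admin/cores?action=STATUS
http://localhost:8983/solr/admin/cores?action=RELOAD&core=en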

The field type definition for each of the cores is pretty straightforward. The file that describes the index for English documents uses an English stemmer for text data, and the file that describes the index for German documents uses a German stemmer for text data.

The only difference in field definition is the type that the description and title fields use—for the German schema.xml they use the text_de field type, and for the English schema.xml they use the text_en field type.

As for the queries, you must know one thing. When using multicore with more than one core, the address under which Solr offers its handlers is different from the one used when not using cores. Solr adds the core name before the handler name. So if you have a handler named /simple in the core named example, it will be available under the context /solr/example/simple, not /solr/simple. Once you know that, you'll know where to point your applications that use Solr with a multicore deployment.

There is one more thing—you need to remember that every core has a separate index. That means that you can't combine results from different cores, at least not automatically. For example, you can't automatically get results with a combination of documents in English and German; you must do it by yourself or choose a different architecture for your Solr deployment.

There's more...

If you need more information about cores, maybe some of the following information will be helpful.

More information about core admin interface

If you are looking for more information about the core admin interface commands, please refer to the Solr wiki pages found at http://wiki.apache.org/solr/CoreAdmin.

See also

If you need to handle multiple languages in a single index, please refer to the Handling multiple languages in a single index recipe in this chapter.

 

Solr cache configuration


As you may already know, caches play a major role in a Solr deployment. And I'm not talking about some external cache—I'm talking about the three Solr caches:

  • Filter cache—used for storing filter (query parameter fq) results and enum type facets mainly

  • Document cache—used for storing Lucene documents that hold stored fields

  • Query result cache—used for storing results of queries

There is a fourth cache—Lucene's internal cache—the field cache, but you can't control its behavior—it is managed by Lucene and created when it is first used by the Searcher object.

With the help of these caches, we can tune the behavior of the Solr searcher instance. In this recipe, we will focus on how to configure your Solr caches to suit most needs. There is one thing to remember—Solr cache sizes should be tuned to the number of documents in the index, the queries, and the number of results you usually get from Solr.

Getting ready

Before you start tuning Solr caches, you should get some information about your Solr instance. That information is:

  • Number of documents in your index

  • Number of queries per second made to that index

  • Number of unique filter (fq parameter) values in your queries

  • Maximum number of documents returned in a single query

  • Number of different queries and different sorts

All those numbers can be derived from the Solr logs and by using the Solr admin interface.

How to do it...

For the purpose of this task, I assumed the following numbers:

  • Number of documents in the index: 1,000,000

  • Number of queries per second: 100

  • Number of unique filters: 200

  • Maximum number of documents returned in a single query: 100

  • Number of different queries and different sorts: 500

Let's open the solrconfig.xml file and tune our caches. All the changes should be made in the query section of the file (the section between the <query> and </query> XML tags).

First goes the filter cache:

<filterCache
   class="solr.FastLRUCache"
   size="200"
   initialSize="200"
   autowarmCount="100"/>

Second goes the query result cache:

<queryResultCache
   class="solr.FastLRUCache"
   size="500"
   initialSize="500"
   autowarmCount="250"/>

Third, we have the document cache:

<documentCache
   class="solr.FastLRUCache"
   size="11000"
   initialSize="11000" />

Of course, the preceding configuration is based on the example values.

Furthermore, let's set our result window to match our needs—during query execution we sometimes need to get 20–30 more results than the ones we initially requested. So, we change the appropriate value in the solrconfig.xml file to something like this:

<queryResultWindowSize>200</queryResultWindowSize>

And that's all.

How it works...

Let's start with a small explanation. First of all, we use the solr.FastLRUCache implementation instead of solr.LRUCache. This is a new type of cache implementation introduced in Solr 1.4. The so-called FastLRUCache tends to be faster when Solr gets from the cache more than it puts into it. This is the opposite of LRUCache, which tends to be more efficient when there are more "puts" than "gets" operations. That's why we use it.

This may be the first time you have seen cache configuration, so I'll explain what cache configuration parameters mean:

  • class—you probably figured that out by now. Yes, this is the class implementing the cache

  • size—this is the maximum size that the cache can have.

  • initialSize—this is the initial size that the cache will have.

  • autowarmCount—this is the number of cache entries that will be copied to the new instance of the same cache when Solr invalidates the Searcher object—for example, during commit operation.

As you can see, I tend to use the same number of entries for size and initialSize, and half of those values for the autowarmCount.

There is one thing you should be aware of. Some of the Solr caches (the document cache, actually) operate on internal identifiers called docids. Those caches cannot be automatically warmed, because docids change after every commit operation and thus copying them is useless.

Now let's take a look at the cache types and what they are used for.

Filter cache

So first we have the filter cache. This cache is responsible for holding information about the filters and the documents that match them. Actually, this cache holds an unordered set of document ids that match the filter. If you don't use the faceting mechanism with the filter cache, you should set its size to at least the number of unique filters that are present in your queries. This way, Solr will be able to store all the unique filters with their matching document ids, and this will speed up queries that use filters.

Query result cache

The next cache is the query result cache. It holds the ordered set of internal ids of documents that match the given query and the specified sort. That's why, if you use caches, you should add as many filters as you can and keep your query (the q parameter) as clean as possible (for example, pass only the search box content of your search application to the query parameter). If the same query is run more than once and the cache is large enough to hold the entry, the information available in the cache will be used. This allows Solr to save precious I/O operations for repeated queries—resulting in a performance boost.
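
As an illustration of keeping the q parameter clean, compare the two requests sketched below (the field names and values are made up for this example); the second form lets both the query result cache and the filter cache do their job:

http://localhost:8983/solr/select?q=harry+AND+category:books+AND+language:en
http://localhost:8983/solr/select?q=harry&fq=category:books&fq=language:en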

Tip

I tend to set the maximum size of this cache to the number of unique queries and sorts that Solr handles between Searcher object invalidations. This tends to be enough in most cases.

Document cache

The last type of cache is the document cache. It holds the Lucene documents that were fetched from the index. Basically, this cache holds the stored fields of all the documents that are gathered from the Solr index. The size of this cache should always be greater than the number of concurrent queries multiplied by the maximum number of results you get from Solr. This cache can't be automatically warmed, because every commit changes the internal IDs of the documents. Remember that this cache can consume a lot of memory if you have many stored fields.

Query result window

The last is the query result window. This parameter tells Solr how many documents to fetch from the index for a single Lucene query—a kind of superset of the documents actually requested. In our example, we tell Solr that we want a maximum of one hundred documents as the result of a single query, while our query result window tells Solr to always gather two hundred documents. Then, when we need more documents following the first hundred, they will be fetched from the cache, saving resources. The size of the query result window mostly depends on the application and how it uses Solr. If your users tend to page through many results, you should consider using a higher query result window value.
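
For example, with queryResultWindowSize set to 200, a paging sequence such as the one sketched below (the parameter values are illustrative) can be served from the query result cache after the first request, because the cached window already covers the second page:

http://localhost:8983/solr/select?q=harry&start=0&rows=100
http://localhost:8983/solr/select?q=harry&start=100&rows=100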

Tip

You should remember that the cache sizes shown in this task are not final and you should adapt them to your application's needs. The values and the method of their calculation should only be taken as a starting point for further observation and optimization. Also, please remember to monitor your Solr instance's memory usage, as using caches will affect the memory used by the JVM.

There's more...

There are a few things that you should know when configuring your caches.

Using filter cache with faceting

If you use the term enumeration faceting method (the facet.method=enum parameter), Solr will use the filter cache to check each term. Remember that if you use this method, your filter cache size should be at least the number of unique facet values across all your faceted fields. This is crucial, and you may experience performance loss if this cache is not configured the right way.
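
A faceting request that exercises the filter cache in this way might look like the sketch below; the category field name is an assumption made for this example:

http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=category&facet.method=enum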

When we have no cache hits

When your Solr instance has a low cache hit ratio, you should consider not using caches at all (to see the hit ratio, you can use the administration pages of Solr). Cache insertion is not free—it costs CPU time and resources. So if you see that you have a very low cache hit ratio, you should consider turning your caches off—it may speed up your Solr instance. Before you turn off the caches, please ensure that you had the right cache setup—a small hit ratio can be the result of a bad cache configuration.

When we have more "puts" than "gets"

When your Solr instance performs more put operations than get operations, you should consider using the solr.LRUCache implementation. It's confirmed that this implementation behaves better when there are more insertions into the cache than lookups.

See also

There is another way to warm your caches, if you know the most common queries that are sent to your Solr instance—auto warming queries.

  • To see how to configure them, you should refer to Chapter 7, Improving Solr Performance and the recipe Improving Solr performance right after start up or commit operation

  • To see how to use the administration pages of the Solr server, you should refer to Chapter 4, Solr Administration

  • For information on how to cache whole pages of results, please refer to Chapter 7, the recipe Caching whole result pages.

 

How to fetch and index web pages


There are many ways to index web pages. We could download them, parse them, and index them with the use of Lucene and Solr. The indexing part is not a problem, at least in most cases. But there is another problem—how do you fetch them? We could possibly create our own software to do that, but that takes time and resources. That's why this recipe will cover how to fetch and index web pages using Apache Nutch.

Getting ready

For the purpose of this recipe we will be using version 1.2 of Apache Nutch. To download the binary package of Apache Nutch, please go to the download section of http://nutch.apache.org.

How to do it...

First of all, we need to install Apache Nutch. To do that, we just need to extract the downloaded archive to a directory of our choice; for example, I installed it in the /nutch directory. This directory will be referred to as $NUTCH_HOME.

Open the file $NUTCH_HOME/conf/nutch-default.xml and set the value http.agent.name to the desired name of your crawler. It should look like this:

<property>
<name>http.agent.name</name>
<value>SolrCookbookCrawler</value>
<description>HTTP 'User-Agent' request header.</description>
</property>

Now let's create an empty directory called crawl in the $NUTCH_HOME directory. Then create the nutch directory in the $NUTCH_HOME/crawl directory.

The next step is to create a directory urls in the $NUTCH_HOME/crawl/nutch directory.

Now add a file named site to the $NUTCH_HOME/crawl/nutch directory. For the purpose of this book, we will be crawling Solr and Lucene pages, so this file should contain the following: http://lucene.apache.org.

Now we need to edit the $NUTCH_HOME/conf/crawl-urlfilter.txt file. Replace MY.DOMAIN.NAME with http://lucene.apache.org, so that the appropriate entry looks like this:

+^http://lucene.apache.org/

One last thing before fetching the data is the Solr configuration. The only thing that we need to do is copy the $NUTCH_HOME/conf/schema.xml file to the $SOLR_HOME/conf directory.

Now we can start fetching web pages.

Run the following command from the $NUTCH_HOME directory:

bin/nutch crawl crawl/nutch/site -dir crawl -depth 3 -topN 50

Depending on your Internet connection and your machine configuration, you should finally see the following message:

crawl finished: crawl

This means that the crawl is completed and the data is fetched. Now we should invert the links in the fetched data so that anchor text can be indexed with the pages it points to. To do that, we invoke the following command:

bin/nutch invertlinks crawl/linkdb -dir crawl/segments

Some time later, you'll see a message informing you that the inversion process is complete:

LinkDb: finished at 2010-10-18 21:35:44, elapsed: 00:00:15

We can now send our data to Solr (you can find the appropriate schema.xml file in the Nutch distribution, at conf/schema.xml). To do that, you should run the following command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*

After a period of time, depending on the size of your crawl database, you should see a message informing you that the indexing process was finished:

SolrIndexer: finished at 2010-10-18 21:39:28, elapsed: 00:00:26

How it works...

After installing Nutch and Solr, the first thing we do is set our crawler's name. Nutch does not allow empty names, so we must choose one. The nutch-default.xml file defines many more properties than the one mentioned, but at this point, the crawler name is the only one we need to know about.

The next step is the creation of directories where the crawl database will be stored. It doesn't have to be exactly the same directory as the example crawl. You can place it on a different partition or another hard disk drive.

The seed file we created in the $NUTCH_HOME/crawl/nutch/site directory should contain the addresses of the sites from which we want data to be fetched. In the example, we have only one site: http://lucene.apache.org.

The crawl-urlfilter.txt file contains information about the filters that will be used to check the URLs that Nutch will crawl. In the example, we told Nutch to accept every URL that begins with http://lucene.apache.org.

Next, we start with some "Nutch magic". First of all, we run the crawling command. The crawl command of the Nutch command-line utility needs several parameters, which are as follows (the annotated command after this list shows how they map to the command line):

  • The location of the seed file with the addresses to fetch.

  • Directory where the fetch database will be stored.

  • How deep to follow the links (the -depth parameter): in our example, we told Nutch to follow links at most three levels from the main page.

  • How many documents to fetch at each level of depth (the -topN parameter): in our example, we told Nutch to fetch a maximum of 50 documents per level.
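
Putting these together, the crawl command used earlier reads as follows (the comments are added here only for illustration):

# crawl/nutch/site - the seed URL location prepared earlier
# -dir crawl       - the directory where the fetch database is stored
# -depth 3         - follow links at most three levels from the seed pages
# -topN 50         - fetch at most 50 documents per depth level
bin/nutch crawl crawl/nutch/site -dir crawl -depth 3 -topN 50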

The next big thing is the link inversion process. It generates the link database, so that Nutch can index the anchor text of incoming links together with the pages they point to. The invertlinks command of the Nutch command-line utility was run with two parameters (see the annotated command after this list):

  • Output directory where the newly created link database should be created

  • Directory where the data segments were written during the crawl process
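
The command from the earlier step, annotated with those two parameters (comments added for illustration):

# crawl/linkdb        - output directory for the newly created link database
# -dir crawl/segments - the segments written during the crawl process
bin/nutch invertlinks crawl/linkdb -dir crawl/segments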

The last command pushed the data into Solr. This process uses the javabin format and the /update handler, so remember to have both of these configured in your Solr instance. The solrindex command of the Nutch command-line utility was run with the following parameters (see the annotated command after this list):

  • Address of the Solr server instance

  • Directory containing the crawl database created by the crawl command

  • Directory containing the link database created by the invertlinks command

  • List of segments that contain crawl data
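
Annotated the same way, the indexing command looks as follows (comments added for illustration):

# http://127.0.0.1:8983/solr/ - address of the Solr server instance
# crawl/crawldb               - crawl database created by the crawl command
# crawl/linkdb                - link database created by the invertlinks command
# crawl/segments/*            - segments holding the fetched crawl data
bin/nutch solrindex http://127.0.0.1:8983/solr/ crawl/crawldb crawl/linkdb crawl/segments/*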

There's more...

There is one more thing worth knowing when you start a journey in the land of Apache Nutch.

Multiple thread crawling

The crawl command of the Nutch command-line utility has another option—it can be configured to run crawling with multiple threads. To achieve that, you add the parameter:

-threads N

So if you would like to crawl with 10 threads, you should run the crawl command as follows:

bin/nutch crawl crawl/nutch/site -dir crawl -depth 3 -topN 50 -threads 10

See also

If you seek more information about Apache Nutch, please refer to http://nutch.apache.org and go to the wiki section.

 

Getting the most relevant results with early query termination


When we have millions of documents, large indexes, and many shards, there are situations where you don't need to show all the results for a given query. It is very probable that you only want to show your user the top N results. This is when you can use early termination techniques to stop long-running queries after a set amount of time. However, using early termination is a bit tricky; there are a few things that need to be addressed before you can use it, and one of them is getting the most relevant results. The tool for sorting a Lucene index used to be available only in Apache Nutch, but that is history now, because the Lucene version of this tool was committed to the SVN repository. This recipe will guide you through the process of index pre-sorting, explain why to use this new feature of Lucene and Solr, and show how to get the most relevant results with the help of this tool.

Getting ready

During the writing of this book, the IndexSorter tool was only available in branch_3x of the Lucene and Solr SVN repository. After downloading the appropriate version, compiling and installing it, we can begin using this tool.

How to do it...

IndexSorter is an index post-processing tool. This means that it should be used after the data is indexed. Let's assume that we have our data indexed. For the purpose of showing how to use the tool, I modified my schema.xml file to contain only the following fields (add them to the fields section of your schema.xml file):

<field name="id" type="string" indexed="true" stored="true"  multiValued="false" required="true"/>
<field name="isbn" type="string" indexed="true" stored="true" multiValued="false" />
<field name="title" type="text" indexed="true" stored="true" multiValued="false" />
<field name="description" type="text" indexed="true" stored="true" multiValued="false" />
<field name="author" type="string" indexed="true" stored="true" multiValued="false" />
<field name="value" type="string" indexed="true" stored="true" multiValued="false" />

Let's assume that we have a requirement to show our data sorted by the value field (the field contains float values; the higher the value, the more important the document is), but our index is so big that a single query takes more time than a client is willing to wait for the results. That's why we need to pre-sort the index by the required field. To do that, we will use the tool named IndexSorter. There is one more thing before you can run the IndexSorter tool: the Lucene JAR files that contain it must be on the Java classpath (in the following command, they are assumed to be in a directory named lucene-libs).

So let's run the following command:

java -cp "lucene-libs/*" org.apache.lucene.index.IndexSorter solr/data/index solr/data/new_index value

After some time, we should see a message like the following one:

IndexSorter: done, 9112 total milliseconds

This message means that everything went well and our index is sorted by the value field. The sorted index is written to the solr/data/new_index directory and the old index is not altered in any way. To use the new index, you should replace the contents of the old index directory (that is, solr/data/index) with the contents of the solr/data/new_index directory.
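
Replacing the index contents can be done with a few shell commands. This is only a sketch; it assumes the paths used above, that Solr is not running while the files are being moved, and the index_old directory name is just an example backup location:

# keep the original index around in case something goes wrong
mv solr/data/index solr/data/index_old
mkdir solr/data/index

# put the sorted index files in place of the old ones
mv solr/data/new_index/* solr/data/index/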

How it works...

I think that the field definitions do not need to be explained. The only thing worth looking at is the value field which is the field on which we will be sorting.

But how does this tool work? Basically, it sorts your Lucene index in a static way. What does that mean? Let's start with some explanation. When documents are indexed with Solr (and Lucene, of course), they are automatically given an internal identification number, a document ID. Documents with a low internal ID are chosen by Lucene first. During the indexing process, we have no way to set the internal document IDs. So what happens when we use TimeLimitingCollector (and therefore end a query after a set amount of time) in combination with sorting by the value field on an index with millions of documents? We get some amount of data, but not all of it, because we end the query after a set amount of time. Then Solr sorts that data and returns it to the application or the user. You can imagine that, because the data set is not complete, the end user can get seemingly random results. This is because Solr, and therefore Lucene, will choose the documents with low IDs first.

To avoid this and get the most relevant results, we can use the IndexSorter tool to reorder the index so that the documents we are most interested in are stored with low internal IDs. That is what the IndexSorter tool is for: sorting our index on the basis of a defined field. Why do we only want to return the first documents? When we have millions of documents, the user usually wants to see the most relevant ones, not all of them.

One thing to remember is that the sorting is static; you cannot change it during query execution. So if you need sorting on multiple fields, you should consider a multicore deployment where one core holds the unsorted data and the other cores hold indexes sorted with the IndexSorter tool on different fields. That way, you'll be able to use the early termination techniques and still get the most relevant data sorted on the basis of different fields.
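
As a rough illustration of how such a pre-sorted core could be queried, the sketch below limits search time with Solr's timeAllowed parameter (given in milliseconds) and asks only for the top documents. The core name sorted_by_value is hypothetical, and the full early termination technique is covered in the Chapter 7 recipe referenced in the See also section below:

# ask the pre-sorted core for the top 10 documents found within 100 milliseconds
curl "http://localhost:8983/solr/sorted_by_value/select?q=solr&rows=10&timeAllowed=100"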

See also

To see how to use the early termination technique with Solr, refer to Chapter 7, the recipe How to get the first top documents fast when having millions of them.

 

How to set up Extracting Request Handler


Sometimes indexing prepared text files (XML, CSV, JSON, and so on) is not enough. There are numerous situations where you need to extract data from binary files. For example, one of my clients wanted to index PDF files, or rather their contents. To do that, we either need to parse the data in some external application or set up Solr to use Apache Tika. This recipe will guide you through the process of setting up Apache Tika with Solr.

How to do it...

First, let's edit our Solr instance solrconfig.xml and add the following configuration:

<requestHandler name="/update/extract" class="org.apache.solr.handler.extraction.ExtractingRequestHandler">
<lst name="defaults">
<str name="fmap.content">text</str>
<str name="lowernames">true</str>
<str name="uprefix">attr_</str>
<str name="captureAttr">true</str>
</lst>
</requestHandler>

Next, create the lib folder next to the conf directory (the directory where you place your Solr configuration files) and place the apache-solr-cell-3.1-SNAPSHOT.jar file from the dist directory (relative to the root of the official Solr distribution package) there. After that, you have to copy all the libraries from the contrib/extraction/lib/ directory to the lib directory you created before.
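
The copying described above comes down to a few commands. A minimal sketch, assuming $SOLR_HOME points to the directory that holds your conf directory and $SOLR_DIST (a name used here only for illustration) points to the root of the unpacked Solr distribution:

# Solr picks up additional JARs from the lib directory next to conf
mkdir $SOLR_HOME/lib

# the Solr Cell JAR that provides ExtractingRequestHandler
cp $SOLR_DIST/dist/apache-solr-cell-3.1-SNAPSHOT.jar $SOLR_HOME/lib/

# Apache Tika and its dependencies
cp $SOLR_DIST/contrib/extraction/lib/*.jar $SOLR_HOME/lib/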

And that's actually all that you need to do in terms of configuration.

To simplify the example, I decided to choose the standard schema.xml file distributed with Solr.

To test the indexing process, I've created a PDF file book.pdf using PDFCreator, which contained only the following text: This is a Solr cookbook. To index that file, I've used the following command:

curl "http://localhost:8983/solr/update/extract?literal.id=1&commit=true" -F "myfile=@book.pdf"

You should see the following response:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">578</int>
</lst>
</response>

How it works...

Binary file parsing is implemented using the Apache Tika framework. Tika is a toolkit for detecting and extracting metadata and structured text from various types of documents, not only binary files but also HTML and XML files. To make Solr use Apache Tika, we need to add a request handler based on the org.apache.solr.handler.extraction.ExtractingRequestHandler class to our solrconfig.xml file, as shown in the example.

So we added a new request handler with some default parameters. Those parameters tell Solr how to handle the data that Tika returns. The fmap.content parameter tells Solr which field the content of the parsed document should be put into; in our case, the parsed content will go to the field named text. The lowernames parameter set to true tells Solr to lowercase all the field names that come from Tika. The next parameter, uprefix, is very important. It tells Solr how to handle fields that are not defined in the schema.xml file. The value of this parameter will be prepended to the name of the field returned by Tika, and the result will be used as the field name in Solr. For example, if Tika returned a field named creator and we didn't have such a field in our index, then Solr would try to index it under a field named attr_creator, which is a dynamic field. The last parameter, captureAttr, tells Solr to index the attributes of the Tika XHTML elements into separate fields named after those elements.

Next, we have a command that sends a PDF file to Solr. We are sending the file to the /update/extract handler with two parameters. First, we define a unique identifier. It's useful to be able to pass it while sending the document, because most binary documents won't have an identifier in their contents. To pass the identifier, we use the literal.id parameter. The second parameter tells Solr to perform a commit right after the document has been processed.
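
To verify that the extracted content is searchable, you can query the field it was mapped to (text, according to the defaults shown above). The query below is only a sketch and assumes the document was indexed with the command shown earlier:

# search the extracted content for a word from the indexed PDF
curl "http://localhost:8983/solr/select?q=text:cookbook"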

See also

To see how to index binary files, please take a look at Chapter 2, Indexing Your Data, the recipes: Indexing PDF files, Indexing Microsoft Office files, and Extracting metadata from binary files.

About the Author
  • Rafał Kuć

    Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as a consultant and software engineer at Sematext Group Inc., where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains, from banking software to e-commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him to achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days.

    Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest.

    Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.
