Indexing Data in Solr 1.4 Enterprise Search Server: Part 1

by David Smiley | August 2009 | Open Source

In this two-part article series by David Smiley, we will learn how to index data in Solr 1.4 Enterprise Search Server. We're going to review the four main mechanisms that Solr offers:

  • Solr's native XML
  • CSV (Character Separated Value)
  • Direct Database and XML Import through Solr's DataImportHandler
  • Rich documents through Solr Cell


Let's get started.

Communicating with Solr

There are a few dimensions to the options available for communicating with Solr:

Direct HTTP or a convenient client API

Applications interact with Solr over HTTP. This can either be done directly (by hand, using an HTTP client of your choice), or it can be facilitated by a Solr integration API such as SolrJ or Solr Flare, which in turn uses HTTP.

An exception to HTTP is offered by SolrJ, which can optionally be used in an embedded fashion with Solr (so-called Embedded Solr) to avoid network and inter-process communication altogether. However, unless you are sure you really want to embed Solr within another application, this option is discouraged in favor of writing a custom Solr updating request handler.

Data streamed remotely or from Solr's filesystem

Even though an application will be communicating with Solr over HTTP, it does not have to send Solr data over this channel. Solr supports what it calls remote streaming. Instead of giving Solr the data directly, it is given a URL that it will resolve. It might be an HTTP URL, but more likely it is a filesystem-based one, applicable when the data is already on Solr's machine. Finally, in the case of Solr's DataImportHandler, the data can be fetched from a database.

Data formats

The following are the different data formats:

  • Solr-XML: Solr has a specific XML schema it uses to specify documents and their fields. It also supports instructions to delete documents and to perform commits and optimizes.
  • Solr-binary: Analogous to Solr-XML, it is an efficient binary representation of the same structure. This is only supported by the SolrJ client API.
  • CSV: CSV is a character separated value format (often a comma).
  • Rich documents like PDF, XLS, DOC, and PPT: The text data extracted from these formats is directed to a particular field in your Solr schema.
  • Finally, Solr's DataImportHandler (DIH) contrib add-on is a powerful capability that can communicate with both databases and XML sources (for example, web services). It supports configurable relational and schema mapping options and supports custom transformation additions if needed. The DIH uniquely supports delta updates if the source data has modification dates.

We'll use the XML, CSV, and DIH options in bringing the MusicBrainz data into Solr from its database to demonstrate Solr's capability. Most likely, an application would use just one format.

Before these approaches are described, we'll discuss curl and remote streaming, which are foundational topics.

Using curl to interact with Solr

Solr receives commands (and possibly the associated data) through HTTP POST.

Solr lets you use HTTP GET too (for example, through your web browser). However, this is an inappropriate HTTP verb if it causes something to change on the server, as happens with indexing. For more information on this concept, read about REST at http://en.wikipedia.org/wiki/Representational_State_Transfer.

One way to send an HTTP POST is through the Unix command line program curl (also available on Windows through Cygwin). Even if you don't use curl, it is very important to know how we're going to use it, because the concepts will be applied no matter how you make the HTTP messages.

There are several ways to tell Solr to index data, and all of them are through HTTP POST:

  • Send the data as the entire POST payload (only applicable to Solr's XML format). curl does this with --data-binary (or a similar option) and an appropriate content-type header reflecting that it's XML.
  • Send some name-value pairs akin to an HTML form submission. With curl, such pairs are preceded by -F. If you're giving data to Solr to be indexed (as opposed to having Solr fetch it from a database), then there are a few ways to do that:
    • Put the data into the stream.body parameter. If it's small, perhaps less than a megabyte, then this approach is fine. The limit is configured with the multipartUploadLimitInKB setting in solrconfig.xml.
    • Refer to the data through either a local file on the Solr server using the stream.file parameter or a URL that Solr will fetch it from through the stream.url parameter. These choices are a feature that Solr calls remote streaming.

Here is an example of the first choice. Let's say we have an XML file named artists.xml in the current directory. We can post it to Solr using the following command line:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
  --data-binary @artists.xml

If it succeeds, then you'll have output that looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int><int name="QTime">128</int>
</lst>
</response>

To use the stream.body feature for the example above, you would do this:

curl http://localhost:8983/solr/update -F stream.body=@artists.xml

In both cases, the @ character instructs curl to read the data from the named file, rather than sending the literal text @artists.xml. If the XML is short, then you can just as easily specify it literally on the command line:

curl http://localhost:8983/solr/update -F stream.body=' <commit />'

Notice the leading space in the value. This was intentional: in this example, curl would otherwise treat @ and < as special characters that we don't want interpreted. In such cases, it might be more appropriate to use --form-string instead of -F. However, it's more typing, and I'm feeling lazy.
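
If you'd rather not rely on the leading-space trick, here is a sketch of the same commit using --form-string, which never interprets @ or < specially:

curl http://localhost:8983/solr/update --form-string stream.body='<commit />'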

Remote streaming

In the examples above, we've given Solr the data to index in the HTTP message. Alternatively, the POST request can give Solr a pointer to the data in the form of either a file path accessible to Solr or an HTTP URL to it.

The file path is accessed by the Solr server on its machine, not the client, and the server must have the necessary operating system permissions to read the file.

However, just as before, the originating request does not return a response until Solr has finished processing it. If you're sending a large CSV file, then it is practical to use remote streaming. Otherwise, if the file is of a decent size or is already at some known URL, then you may find remote streaming faster and/or more convenient, depending on your situation.

Here is an example of Solr accessing a local file:

curl http://localhost:8983/solr/update -F stream.file=/tmp/artists.xml

To use a URL instead, the parameter would change to stream.url, and we'd specify a URL. Note that in either case we're passing a name-value parameter (stream.file and the path, for instance), not the actual data.
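
For completeness, a stream.url sketch might look like the following (the URL is just a hypothetical location where artists.xml happens to be hosted):

curl http://localhost:8983/solr/update -F stream.url=http://example.com/artists.xml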

Remote streaming must be enabled
In order to use remote streaming (stream.file or stream.url), you must enable it in solrconfig.xml. It is disabled by default and is configured on a line that looks like this:

<requestParsers enableRemoteStreaming="true" multipartUploadLimitInKB="2048" />

Sending XML to Solr

Solr's native XML syntax is very simple. You can tell Solr to add documents to an index, to commit changes, to optimize the index, and to delete documents. Here is a sample XML file you can HTTP POST to Solr:

<add allowDups="false">
  <doc boost="2.0">
    <field name="id">5432a</field>
    <field name="type">...</field>
    <field name="a_name" boost="0.5">...</field>
    <!-- the date/time syntax MUST look just like this (ISO-8601) -->
    <field name="begin_date">2007-12-31T09:40:00Z</field>
  </doc>
  <doc>
    <field name="id">5432a</field>
    <field name="type">...</field>
    <field name="begin_date">2007-12-31T09:40:00Z</field>
  </doc>
  <!-- more doc elements here as needed -->
</add>

The allowDups attribute defaults to false to guarantee the uniqueness of values in the field that you have designated as the unique field in the schema (assuming you have such a field). If you were to add another document that has the same value for the unique field, then this document would override the previous document, whether it is pending a commit or already committed. You will not get an error.

If you are sure that you will be adding a document that is not a duplicate, then you can set allowDups to true to get a performance improvement.

Boosting affects the scores of matching documents in order to affect ranking in score-sorted search results. Providing a boost value, whether at the document or field level, is optional. The default value is 1.0, which is effectively a non-boost. Technically, documents are not boosted, only fields are. The effective boost value of a field is that specified for the document multiplied by that specified for the field; in the sample above, the a_name field gets an effective boost of 2.0 * 0.5 = 1.0, while the other fields get 2.0.

Specifying boosts here is called index-time boosting, which is rarely done as compared to the more flexible query-time boosting. Index-time boosting is less flexible because such boosting decisions must be made at index time and apply to all queries.
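
By contrast, a query-time boost is specified in the query itself using Lucene's ^ syntax. Here is a rough sketch (the t_a_name and t_name fields are from the MusicBrainz track data indexed later in this article):

# weight matches on the artist name twice as heavily as matches on the track name
curl http://localhost:8983/solr/select --data-urlencode 'q=t_a_name:pumpkins^2 OR t_name:sleep'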

Deleting documents

You can delete a document by its unique field (we delete two documents here):

<delete><id>artist:11604</id><id>artist:11603</id></delete>

Or, you can delete all of the documents that match a particular Lucene/Solr query:

<delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>

The contents of the delete tag can be any number of ID and query tags if you want to batch many deletions into one message to Solr.

The query syntax is not discussed in this article, but I'll explain this somewhat complicated query anyway. Let's suppose that all of your documents have a timestamp field whose value is the time each was indexed, and that you have an update strategy that bulk loads all of the data on a daily basis. If the loading process leaves documents in the index that shouldn't be there anymore, then we can delete them immediately after a bulk load. This query would delete all of the documents not indexed within the last 12 hours. Twelve was chosen somewhat arbitrarily, but it needs to be less than 24 (the update process interval) and greater than the longest time it might conceivably take to bulk load all the data.
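
Like the add examples, these delete messages are simply POSTed to the update URL; for instance, using curl with the delete-by-query above:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
  --data-binary '<delete><query>timestamp:[* TO NOW-12HOUR]</query></delete>'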


Commit, optimize, and rollback

Data sent to Solr is not immediately searchable, nor do deletions take immediate effect. Like a database, changes must be committed first. Unlike a database, there are no distinct sessions (that is, transactions) between each client; instead, there is in effect one global modification state. This means that if more than one Solr client were to submit modifications and commit them at similar times, it is possible for part of one client's set of changes to be committed before that client told Solr to commit. Usually, you will have just one process responsible for updating Solr. But if not, then keep this in mind.

To commit changes using the XML syntax, simply send this to Solr:

<commit />

Depending on Solr's auto-warming configuration and cache state prior to committing, a commit can take a non-trivial amount of time: on the order of seconds, perhaps up to a minute or longer in extreme cases. The amount of data committed has little impact on this delay. Databases, by contrast, generally commit very quickly.

All uncommitted changes can be withdrawn by sending Solr the rollback command:

<rollback />

Lucene's index is internally composed of one or more segments. Modifications get committed to the last segment. Lucene will, on occasion, either start a new segment or merge segments together into one. When Lucene has just one segment, it is in an optimized state, because additional segments degrade query performance. It is recommended to explicitly optimize the index at an opportune time, such as after a bulk load of data and/or at a daily interval during off-peak hours if there are only sporadic updates to the index. You can do this by simply sending this XML:

<optimize />

Both commit and optimize take two additional boolean options that default to true:

<commit waitFlush="true" waitSearcher="true"/>

If you were to set these to false, then commit and optimize return immediately, even though the operation hasn't actually finished yet. So if you wrote a script that committed with these at their false values and then executed a query against Solr, then you may find that the query will not reflect the changes (yet). By waiting for the data to flush to the disk (waitFlush) and waiting for a new searcher to be ready to respond to changes (waitSearcher), this circumstance is avoided.
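
For instance, a commit that returns immediately (at the risk of querying before the changes are visible) could be sent like this:

curl http://localhost:8983/solr/update -H 'Content-type:text/xml; charset=utf-8' \
  --data-binary '<commit waitFlush="false" waitSearcher="false"/>'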

A convenient alternative to sending these commands as XML is to simply add commit, optimize, or rollback as boolean request parameters when communicating with Solr. You'll see an example of this with CSV in the next section. Request parameters can be put on the URL and/or sent as form parameters, if applicable. These three request parameters are honored by Solr whether you send Solr its native XML format, CSV, or rich documents. waitFlush and waitSearcher are not supported in this manner.
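
For example, the artists.xml upload from earlier could be committed in the same request by putting the parameter on the URL:

curl 'http://localhost:8983/solr/update?commit=true' -H 'Content-type:text/xml; charset=utf-8' \
  --data-binary @artists.xml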

Sending CSV to Solr

If you have data in a CSV format or if it is more convenient for you to get CSV than XML, then you may prefer the CSV option to the XML format. Solr's CSV options are fairly flexible.

To get some CSV data out of a local PostgreSQL database for the MusicBrainz tracks, I ran this command:

psql -U postgres -d musicbrainz_db -c "COPY (
select 'Track:' || t.id as id, 'Track' as type, t.name as t_name,
t.length/1000 as t_duration, a.id as t_a_id, a.name as t_a_name,
albumjoin.sequence as t_num, r.id as t_r_id, r.name as t_r_name,
array_to_string(r.attributes,' ') as t_r_attributes, albummeta.tracks as t_r_tracks
from (track t inner join albumjoin on t.id = albumjoin.track
inner join album r on albumjoin.album = r.id left join albummeta on
 albumjoin.album = albummeta.id) inner join
artist a on t.artist = a.id
) to '/tmp/tracks' CSV HEADER"

And it generated output that looks like this (first three lines):

id,type,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,
t_r_attributes,t_r_tracks
Track:183326,Track,In the Arms of Sleep,254,11650,The Smashing
Pumpkins,4,22471,Mellon Collie and the Infinite Sadness (disc 2:
Twilight to Starlight),0 1 100,14
Track:183328,Track,Tales of a Scorched Earth,228,11650,The Smashing
Pumpkins,6,22471,Mellon Collie and the Infinite Sadness (disc 2:
Twilight to Starlight),0 1 100,14

To get Solr to import the CSV file, I typed this at the command line:

curl http://localhost:8983/solr/update/csv -F f.t_r_attributes.split=true \
  -F f.t_r_attributes.separator=' ' -F overwrite=false -F commit=true \
  -F stream.file=/tmp/tracks

When I actually did this, I had PostgreSQL on one machine and Solr on another. I used the Unix mkfifo command to create an in-memory data pipe at /tmp/tracks. This way, I didn't have to actually generate a huge CSV file; I could essentially stream it directly from PostgreSQL into Solr. The details of this approach and of PostgreSQL are out of the scope of this article.
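
For the curious, here is a rough sketch of that named-pipe approach. It assumes the psql client runs on the Solr machine and uses its client-side \copy so that the pipe is local; the dbhost name is hypothetical, and the elided select stands for the full query shown above:

# create a named pipe; no large CSV file ever hits the disk
mkfifo /tmp/tracks

# start the export in the background; it blocks until something reads the pipe
psql -U postgres -h dbhost -d musicbrainz_db \
  -c "\copy (select ...) to '/tmp/tracks' csv header" &

# ask Solr to read the pipe, which lets the export flow straight through
curl http://localhost:8983/solr/update/csv -F commit=true -F stream.file=/tmp/tracks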

Configuration options

The configuration options for Solr's CSV capability are set by posting name-value pairs over HTTP, in the same format that HTML forms post their data. As explained earlier, you could technically put them on the URL of an HTTP GET along with a stream.url or stream.file parameter; however, this is a bad practice. Also note that Solr's CSV capability doesn't support index-time boosting, but that is an uncommon requirement.

The following are the names of each configuration option with an explanation. For the MusicBrainz track CSV file, I was able to use the defaults with the exception of specifying how to parse the multi-valued t_r_attributes field and disabling unique key processing for performance.

  • separator: The character that separates each value on a line. Defaults to ,.
  • header: Set to true if the first line lists the field names (this is the default).
  • fieldnames: If the first line doesn't have the field names, then you'll have to use this instead to indicate what they are. The names are comma separated. If no name is specified for a column, then its data is skipped. (A combined example using this option appears after this list.)
  • skip: The fields in the CSV file not to import.
  • skipLines: The number of lines to skip in the input file. Defaults to 0.
  • trim: If true, then removes leading and trailing whitespace as a final step, even if quoting is used to explicitly specify a space. Defaults to false. Solr already does an initial pass trim, but quoting may leave spaces.
  • encapsulator: This character is used to encapsulate (that is surround, quote) values in order to preserve the field separator as a field value instead of mistakenly parsing it as the next field. This character itself is escaped by doubling it. It defaults to the double quote, unless escape is specified. Example:
    11604, foo, "The ""second"" word is quoted.", bar
  • escape: If this character (a backslash in this example) is found in the input text, then the next character is taken literally in place of this escape character, and it isn't otherwise treated specially by the file's syntax. Example:
    11604, foo, The second\, word is followed by a comma., bar
  • keepEmpty: Specifies whether blank (zero-length) fields should be indexed as such or omitted. It defaults to false.
  • overwrite: It indicates whether to enforce the unique key constraint of the schema by overwriting existing documents with the same ID. It defaults to true. Disable this to increase performance, if you are sure you are passing new documents.
  • split: This is a field-specific option used to split what would normally be one value into multiple values. Another set of CSV configuration options (separator, and so on) can be specified for this field to instruct Solr on how to do that. See the previous tracks MusicBrainz example on how this is used.
  • map: This is another field-specific option used to replace input values with other values. It can be used to remove values too. The value should include a colon, which separates the left side (the text to be replaced) from the right side (the replacement). If we were to use this feature on the tracks of the MusicBrainz data, then it could be used to map the numeric codes in t_r_attributes to more meaningful values. Here's an example of such an attempt:

    -F keepEmpty=false -F f.t_r_attributes.map=0:
    -F f.t_r_attributes.map=1:Album -F f.t_r_attributes.map=2:Single

    This causes 0 to be removed, because it seems to be useless data, as nearly all tracks have it, and we map 1 to Album and 2 to Single.
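
Tying a few of these options together, here is a sketch of importing a hypothetical headerless variant of the tracks file (the file name is an assumption for illustration; the field list matches the CSV header shown earlier):

curl http://localhost:8983/solr/update/csv -F header=false \
  -F fieldnames=id,type,t_name,t_duration,t_a_id,t_a_name,t_num,t_r_id,t_r_name,t_r_attributes,t_r_tracks \
  -F f.t_r_attributes.split=true -F f.t_r_attributes.separator=' ' \
  -F overwrite=false -F commit=true -F stream.file=/tmp/tracks_noheader.csv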

