Indexing Data in Solr 1.4 Enterprise Search Server: Part2

(For more resources on Solr, see here.)

Direct database and XML import

The capability for Solr to get data directly from a database or HTTP GET accessible XML is distributed with Solr as a contrib module, and it is known as the DataImportHandler (DIH in short). The complete reference documentation for this capability is here at, and it's rather thorough. In this article, we'll only walk through an example to see how it can be used with the MusicBrainz data set.

In short, the DIH offers the following capabilities:

  • Imports data from databases through JDBC (Java Database Connectivity)
  • Imports XML data from a URL (HTTP GET) or a file
  • Can combine data from different tables or sources in various ways
  • Extraction/Transformation of the data
  • Import of updated (delta) data from a database, assuming a last-updated date
  • A diagnostic/development web page
  • Extensible to support alternative data sources and transformation steps

As the MusicBrainz data is in a database, the most direct method to get data into Solr is definitely through the DIH using JDBC.

Getting started with DIH

DIH is not a direct part of Solr. Hence it might not be included in your Solr setup. It amounts to a JAR file named something like apache-solr-dataimporthandler-1.4.jar, which is probably already embedded within the solr.war file. You can use jar -tf solr.war to see. Alternatively, it may be placed in <solr-home>/lib, which is alongside the conf directory we've been working with. For database connectivity, we need to ensure that the JDBC driver is on the Java classpath. Placing it in <solr-home>/lib is a convenient way to do this.

The DIH needs to be registered with Solr in solrconfig.xml. Here is how it is done:

<requestHandler name="/dataimport"
<lst name="defaults">
<str name="config">mb-dih-artists-jdbc.xml</str>

mb-dih-artists-jdbc.xml (mb being short for MusicBrainz) is a file in <solr-home>/conf, which is used to configure DIH. It is possible to specify some configuration aspects in this request handler configuration instead of the dedicated configuration file. However, I recommend that it all be in the DIHconfig file, as in our example here.

Given below is an mb-dih-artists-jdbc.xml file with a rather long SQL query:

<dataSource name="jdbc" driver="org.postgresql.Driver"
user="musicbrainz" readOnly="true" autoCommit="false" />
<entity name="artist" dataSource="jdbc" pk="id" query="
select as id, as a_name, a.sortname as a_name_sort,
a.begindate as a_begin_date, a.enddate as a_end_date,
a.type as a_type
array(select from artistalias aa
where aa.ref = )
, '|') as a_alias
array(select from v_artist_members am
where = order by
, '|') as a_member_name
array(select from v_artist_members am
where = order by
, '|') as a_member_id,
(select re.releasedate from release re inner join
album r on re.album = where r.artist =
order by releasedate desc limit 1) as
from artist a
<field column = "id" template="Artist:${}" />
<field column = "type" template="Artist" />
<field column = "a_begin_date"
dateTimeFormat="yyyy-MM-dd" />
<field column = "a_end_date"
dateTimeFormat="yyyy-MM-dd" />
<field column = "a_alias" splitBy="|" />
<field column = "a_member_name" splitBy="|"/>
<field column = "a_member_id" splitBy="|" />

The DIH development console

Before describing the configuration details, we're going to take a look at the DIH development console. It is accessed by going to this URL (modifications may be needed for your host, port, core, and so on):

The development console looks like the following screenshot:

 Solr 1.4 Enterprise Search Server

The screen is divided into two panes: on the left is the DIH control form, which includes an editable version of the DIH configuration file and on the right is the command output as raw XML. The screen works quite simply. The form essentially results in submitting a URL to the right pane. There's no real server-side logic to this interface beyond the standard DIH command invocations being executed on the right. The last section on DIH in this article goes into more detail on submitting a command to the DIH.

DIH DataSources of type JdbcDataSource

The DIH configuration file starts with the declaration of one or more data sources using the element <dataSource/>, which refers to either a database, a file, or an HTTP URL, depending on the type attribute. It defaults to a value of JdbcDataSource. Those familiar with JDBC should find the driver and url attributes with accompanying user and password straightforward—consult the documentation for your driver/database for further information. readOnly is a boolean that will set a variety of other JDBC options appropriately when set to true. And batchSize is an alias for the JDBC fetchSize and defaults to 500. There are numerous JDBC oriented attributes that can be set as well. I would not normally recommend learning about a feature by reading source code, but this is an exception. For further information, read

Efficient JDBC configuration

Many database drivers in the default configurations (including those for PostgreSQL and MySQL) fetch all of the query results into the memory instead of on-demand or using a batch/fetch size. This may work well for typical database usage like OLTP (Online Transaction Processing systems), but is completely unworkable for ETL (Extract Transform and Load) usage such as this. Configuring the driver to stream the data requires driver-specific configuration options. You may need to consult relevant documentation for the JDBC driver. For PostgreSQL, set autoCommit to false. For MySQL, set batchSize to -1(The DIH detects the -1 and replaces it with Java's Integer.MIN_VALUE, which triggers the special behavior in MySQL's JDBC driver). For Microsoft SQL Server, set responseBuffering to adaptive. Further information about specific databases is at :

DIH documents, entities

After the declaration of <dataSource/> element(s) is the <document/> element. In turn, this element contains one or more <entity/> elements. In this sample configuration, we're only getting artists. However, if we wanted to have more than one type in the same index, then another could be added. The dataSource attribute references a correspondingly named element earlier. It is only necessary if there are multiple to choose from, but we've put it here explicitly anyway.

The main piece of an entity used with a JDBC data source is the query attribute, which is the SQL query to be evaluated. You'll notice that this query involves some sub-queries, which are made into arrays and then transformed into strings joined by spaces. The particular functions used to do these sorts of things are generally database specific. This is done to shoe-horn multi-valued data into a single row in the results. It may create a more complicated query, but it does mean that the database does all of the heavy lifting so that all of the data Solr needs for an artist is in the row. An alternative with DIH is to declare other entities within the entity. If you aren't using a database or if you wish to mix in another data source (even if it's of a different type), then you will be forced to do that. See the Solr DIH Wiki page for examples:

The DIH also supports a delta query, which is a query that selects time-stamped data with dates after the last queried date. This won't be covered here, but you can find more information at the previous URL.

DIH fields and transformers

Within the <entity/> are some <field/>elements that declare how the columns in the query map to Solr. The field element must have a column attribute that matches the corresponding named column in the SQL query. The name attribute is the Solr schema field name that the column is going into. If it is not specified (and it never is for our example), then it defaults to the column name.

Use the SQL as a keyword as we've done to use the same names as the Solr schema instead of the database schema. This reduces the number of explicit mappings needed in <field/> elements and shortens existing ones.

When a column in the result can be placed directly into Solr without further processing, there is no need to specify the field declaration, because it is implied.

An attribute of the entity declaration that we didn't mention yet is transformer. This declares a comma-separated list of transformers that manipulate the transfer of data from the JDBC resultset into a Solr field. These transformers evaluate a field, if it has an attribute it uses to do its job. More than one might operate on a given field. Therefore, the order in which the transformers are declared in matters. Here are the attributes we've used:

  • template: It is used by TemplateTransformer and declares text, which might include variable name substitutions using ${name} syntax. To access an existing field, use the entityname.columnname syntax.
  • splitBy: It is used by RegexTransformer and splits a single string value into a multi-value by looking for the specified character.
  • dateTimeFormat: It is used by DateFormatTransformer. This is a Java date/time format pattern If the type of the field in the schema is a date, then it is necessary to ensure Solr can interpret the format. Alternatively, ensure that the string matches the ISO-8601 format, which looks like this: 1976-10-23T23:59:59.000Z. As in all cases in Solr, when specifying dates you can use its so-called "DateMath" syntax such as appending /DAY to tell Solr to round the date to a day.

(For more resources on Solr, see here.)

Importing with DIH

Unlike the other importing mechanisms, the DIH returns immediately, while the import continues asynchronously. To get the current status of the DIH, go to this URL http://localhost:8983/solr/dataimport, and you'll get output like the following:

<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">15</int>
<lst name="initArgs">
<lst name="defaults">
<str name="config">mb-dih-artists-jdbc.xml</str>
<str name="status">idle</str>
<str name="importResponse"/>
<lst name="statusMessages"/>
<str name="WARNING">This response format is experimental. It is
likely to change in the future.</str>

Commands are given to DIH as request parameters just as everything else is in Solr. We could tell the DIH to do a full-import just by going to this URL:http://localhost:8983/solr/dataimport?command=full-import. On the command line we would use:

curl http://localhost:8983/solr/dataimport
-F command=full-import

It uses HTTP POST, which is better, as discussed much earlier.

Other boolean parameters named clean, commit, and optimize may accompany the parameter that defaults to true, if not present. Clean is specific to DIH, and it means that before running the import, it will remove all of the existing data.

Two other useful commands are reload-config and abort. The first will reload the DIH configuration file, which is useful for picking up small changes. The second will cancel any existing imports in progress.

While most of this article assumes that the content you want to index in Solr is in a neatly structured data format of some kind, such as in a database table, a selection of XML files, or CSV, the reality is that we store information in the much messier world of binary formats such as PDF, Microsoft Office, or even images and music files. Your author Eric Pugh, became involved with the Solr community when a client required a search system that could ingest the thousands of PDF and Microsoft Word documents that they had produced over the years. The outgrowth of that effort, Solr Cell, distributed as a contrib module to Solr, provides a very powerful framework for indexing various binary formats.

Solr Cell is technically called the Extracting Request Handler, however the name came about because:

Grant [Ingersoll] was writing the javadocs for the code and needed an entry for the <title> tag and wrote out "Solr Content Extraction Library", as the contrib directory is named "extraction". This then lead to an "acronym": Solr CEL, which then gets mashed to: Solr Cell! Hence, the project name is "Solr Cell"! It's also appropriate because a Solar Cell's job is to convert the raw energy of the Sun to electricity, and this contrib's module is responsible for converting the "raw" content of a document to something usable by Solr.

We'll look at how to leverage Solr Cell for extracting karaoke song lyrics from MIDI files. Just think, you can build a Solr powered index of all your favorite karaoke songs! The complete reference material for Solr Cell is available at

Extracting binary content

Every binary format is different, and all of them provide different types of metadata, as well as different methods of extracting content. The heavy lifting of providing a single API to an ever expanding list of binary/structured formats is left up to Apache Tika:

    Apache Tika is a toolkit for detecting and extracting metadata and structured text content from various documents using existing parser libraries.

Tika supports a wide variety of formats, from the predictable to the unexpected. Some of the key formats supported are Adobe PDF, Microsoft Office including Word, Excel, PowerPoint, and Visio, and Outlook. Other formats that are supported include extracting metadata from images such as JPG, GIF, and PNG, as well as from various audio formats such as MP3, MIDI, and Wave audio. Tika itself does not attempt to parse the individual binary formats. Instead, it delegates the parsing to various third party libraries, while providing a high level stream of SAX events as the documents are parsed.

Solr Cell is a fairly thin wrapper consisting of a SolrContentHandler that consumes the SAX events and builds the SolrInputDocument from the fields that are specified to be extracted from the binary document.

There are some not so obvious things to keep in mind when indexing binary documents:

  • You can supply any kind of supported document to Tika, and the AutoDetectParser will attempt to discover the correct mime type of the document. Alternatively, you can supply a stream.type parameter to specify which parser to use.
  • The default SolrContentHandler that is used by Solr Cell is fairly simplistic. You may find that you need to perform extra massaging of the data being indexed beyond what Solr Cell offers to reduce the junk data being indexed. Subclass createFactory() method of ExtractingRequestHandler to return your own custom SolrContentHandler.
  • Remember that as you are indexing you are potentially sending large binary files over the wire that must then be parsed in server memory, which can be very slow. If you are looking to only index metadata, then it may make sense to write your own parser using Tika directly, extract the metadata, and post that across to the server.
  • You need to supply to Solr the various dependent JAR files that Tika requires to parse the documents. Put them with the Solr Cell JAR (named something like apache-solr-cell-1.4.jar) in <solr-home>/lib.

Tika has only recently become a full fledged project and has already had a couple of releases. You can learn more about Tika from the web site at

Configuring Solr

In /examples/cores/karaoke/conf/solrconfig.xml lies the request handler for parsing binary documents:

<requestHandler name="/update/extract"
<lst name="defaults">
<str name="map.Last-Modified">last_modified</str>
<str name="uprefix">metadata_</str>

Here we can see that the Tika metadata attribute Last-Modified is being mapped to the Solr field last_modified, assuming we are provided that Tika attribute. The parameter uprefix is specifying the prefix to use when storing any Tika fields that don't have a corresponding matching Solr field.

In order to use Solr Cell, we placed the Solr Cell JAR in the ./examples/cores/karaoke/lib/ directory, because it is not included in solr.war. The JAR files placed in this lib directory are available only to the karaoke core. To share across all cores add to ./examples/cores/lib/ and by specifying it as the shared lib in./examples/cores/solr.xml:

<solr persistent="false" sharedLib="lib">

For this example, we are parsing .kar karaoke files that are recorded in the MIDI format using the standard Java package However, we have also put other JAR dependencies of Solr Cell such as pdfbox, poi, and icu4j in ./lib.

(For more resources on Solr, see here.)

Extracting karaoke lyrics

We are now ready to extract karaoke lyrics by posting MIDI files to our Solr /karaoke/update/extract handler. Some classic ABBA tunes for your enjoyment are available in the ./examples/appendix/karaoke/songs/ directory, gratefully sourced from FreeKaraoke at In order to index the song Angel Eyes from the command line using curl, the simplest command to run is:

>> curl 'http://localhost:8983/solr/karaoke/update/extract?map.
content=text' -F "file=@angeleyes.kar"

Don't forget to commit your changes:

>> curl http://localhost:8983/solr/karaoke/update/ -H "Content-Type:
text/xml" --data-binary '<commit waitFlush="false"/>'

You can also trigger a commit when indexing content by appending commit=true to the URL, however this is an inefficient approach if you are indexing many documents

We have a single map.content=text parameter that specifies the default field for content extracted from the source. In this case, angeleyes.kar should be stored in the Solr field text. Now go look for the results at http://localhost:8983/solr/karaoke/select/?q=*:*. You should see:

<result name="response" numFound="1" start="0">
<arr name="text">
Angel Eyes by Abba sequenced by Michael Boyce
tinker@worldnet.att.netSoft karaoke@KMIDI KARAOKE
FILEWords@LENGL@TAngel Eyes@TABBALast night I was taking a walk
along the river/And I saw him together with a young girl/And the
look that he gave made me shiver/'Cause he always used ...

You've now indexed information about the song and the lyrics in the text field that forms the textual content of the MIDI file. However, what about the metadata, for the MIDI file that Tika also exposes? Well, this is where dynamic fields come in very handy. Every binary format has a set of metadata that to a varying extent overlaps with other formats. Fortunately it is very easy to specify to Solr Cell how you would want to map metadata by using the uprefix property. We specify that all of the metadata_* fields should be created using dynamic fields in schema.xml:

<dynamicField name="metadata_*" type="string" indexed="true"
stored="true" multiValued="false"/>

Since handling metadata properly is something we want to standardize on, we add to the configuration element in solrconfig.xml:

<str name="map.Last-Modified">last_modified</str>
<str name="uprefix">metadata_</str>

Notice that the & values in the URL are escaped with backslashes: . Forgetting to escape special characters is a common issue when working with curl.

When you search for all documents, you should see indexed metadata for Angel Eyes, prefixed with metadata_:

<str name="metadata_Content-Type">audio/midi</str>
<str name="metadata_divisionType">PPQ</str>
<str name="metadata_patches">0</str>
<str name="metadata_stream_content_type">
<str name="metadata_stream_name">angeleyes.kar</str>
<str name="metadata_stream_size">55677</str>
<str name="metadata_stream_source_info">file</str>
<str name="metadata_tracks">16</str>

Obviously, in most use cases, every time you index the same file you don't want to get a new document. If your schema has a uniqueKey field defined such as id, then you can provide a specific ID by passing a literal value using Each time you index the file using the same ID, it will delete and insert that document. However, that implies that you have the ability to manage IDs through some third party system like a database. If you want to use the metadata, such as the stream_name provided by Tika to provide the key, then you just need to map that field using map.stream_name=id. To make the example work, update ./examples/cores/karaoke/schema.xml to specify <uniqueKey>id</uniqueKey>.

>> curl 'http://localhost:8983/solr/karaoke/update/extract?map.
content=text&map.stream_name=id' -F "file=@angeleyes.kar"

This of course assumes that you've defined <uniqueKey>id</uniqueKey> to be of type string, not a number.

Indexing richer documents

Indexing karaoke lyrics from MIDI files is also a fairly trivial example. We basically just strip out all of the contents, and store them in the Solr text field. However, indexing other types of documents, such as PDFs, can be a bit more complicated. Let's look at Take a Chance on Me, a complex PDF file that explains what a Monte Carlo simulation is, while making lots of puns about the lyrics and titles of songs from ABBA. View ./examples/appendix/karaoke/mccm.pdf, and you will see a complex PDF document with multiple fonts, background images, complex mathematical equations, Greek symbols, and charts. However, indexing that content is as simple as the prior example:

>> curl 'http://localhost:8983/solr/karaoke/update/extract?map.
content=text&map.stream_name=id&commit=true' -F "file=@mccm.pdf"

If you do a search for the document using the filename as the id via http://localhost:8983/solr/karaoke/select/?q=id:mccm.pdf, then you'll also see that the last_modified field that we mapped in solrconfig.xml is being populated. Tika provides a Last-Modified field for PDFs, but not for MIDI files:

<arr name="id">
<arr name="last_modified">
<str>Sun Mar 03 15:55:09 EST 2002</str>
<arr name="text">
Take A Chance On Me

So with these richer documents, how can we get a handle on the metadata and content that is available? Passing extractOnly=true on the URL will output what Solr Cell has extracted, including metadata fields, without actually indexing them:

<str name="mccm.pdf">&lt;?xml version="1.0" encoding="UTF-8"?&gt;
&lt;html xmlns=""&gt;
&lt;title&gt;Take A Chance On Me&lt;/title&gt;
Take A Chance On Me
Monte Carlo Condensed Matter
A very brief guide to Monte Carlo simulation.
<lst name="mccm.pdf_metadata">
<arr name="stream_source_info"><str>file</str></arr>
<arr name="subject"><str>Monte Carlo Condensed Matter</str></arr>
<arr name="Last-Modified"><str>Sun Mar 03 15:55:09 EST
<arr name="creator"><str>PostScript PDriver module 4.49</str></arr>
<arr name="title"><str>Take A Chance On Me</str></arr>
<arr name="stream_content_type"><str>application/
<arr name="created"><str>Sun Mar 03 15:53:14 EST 2002</str></arr>
<arr name="stream_size"><str>378454</str></arr>
<arr name="stream_name"><str>mccm.pdf</str></arr>

At the top in an XML node called <str name="mccm.pdf"/> is the content extracted from the PDF as an XHTML document. As it is XHTML wrapped in another separate XML document, the various <and> tags have been escaped: <div>. If you cut and paste the contents of <str/> node into a text editor and convert the &lt; to < and &gt; to >, then you can see the structure of the XHTML document that is indexed.

Below the contents of the PDF, you can also see a wide variety of PDF document-specific metadata fields, including subject, title, and creator, as well as metadata fields added by Solr Cell for all imported formats, including stream_source_info, stream_content_type, stream_size, and the already-seen stream_name.

So why would we want to see the XHTML structure of the content? The answer is in order to narrow down our results. We can use XPath queries through the ext.xpath parameter to select a subset of the data to be indexed. To make up an arbitrary example, let's say that after looking at mccm.html we know we only want the second paragraph of content to be indexed:

>> curl 'http://localhost:8983/solr/karaoke/update/extract?map.
p[1]' -F "file=@mccm.pdf"

We now have only the second paragraph, which is the summary of what the document Take a Chance on Me is about.

Binary file size

Take a Chance on Me is a 372 KB file stored at ./examples/appendix/karaoke/mccm.pdf, and it highlights one of the challenges of using Solr Cell. If you are indexing a thousand PDF documents that each average 372 KB, then you are shipping 372 megabytes over the wire, assuming the data is not already on Solr's file system. However, if you extract the contents of the PDF on the client side and only send that over the web, then what is sent to the Solr text field is just 5.1 KB. Look at ./examples/appendix/karaoke/mccm.txt to see the actual text extracted from mccm.pdf. Generously assuming that the metadata adds an extra 1 KB of information, then you have a total content sent over the wire of 6.1 megabytes ((5.1 KB + 1.0 KB) * 1000).

Solr Cell offers a quick way to start indexing that vast amount of information stored in previously inaccessible binary formats without resorting to custom code per binary format. However, depending on the files, you may be needlessly transmitting a lot of data, only to extract a small portion of text. Moreover, you may find that the logic provided by Solr Cell for parsing and selecting just the data you want may not be rich enough. For these cases you may be better off building a dedicated client-side tool that does all of the parsing and munging you require.


At this point, you should have a schema that you believe will suit your needs, and you should know how to get your data into it. From Solr's native XML to CSV to databases to rich documents, Solr offers a variety of possibilities to ingest data into the index. In the end, usually one or two mechanisms will be used. In addition, you can usually expect the need to write some code, perhaps just a simple bash or ant script to implement the automation of getting data from your source system into Solr.

Further resources on this subject:

  • Integrating Solr: Ruby on Rails Integration [article]
  • Faceting in Solr 1.4 Enterprise Search Server [article]
  • Indexing Data in Solr 1.4 Enterprise Search Server: Part1 [article]

  • You've been reading an excerpt of:

    Solr 1.4 Enterprise Search Server

    Explore Title