Solr 1.4 Enterprise Search Server

By David Smiley , Eric Pugh

About this book

If you are a developer building a high-traffic web site, you need to have a terrific search engine. Sites like Netflix.com and Zappos.com employ Solr, an open source enterprise search server, which uses and extends the Lucene search library. This is the first book in the market on Solr and it will show you how to optimize your web site for high volume web traffic with full-text search capabilities along with loads of customization options. So, let your users gain a terrific search experience.

This book is a comprehensive reference guide for every feature Solr has to offer. It takes the reader from first setup through development to deployment. It also comes with complete running examples to demonstrate its use and show how to integrate it with other languages and frameworks.

This book first gives you a quick overview of Solr, and then gradually takes you from basic to advanced features that enhance your search. It starts off by discussing Solr and helping you understand how it fits into your architecture—where all databases and document/web crawlers fall short, and Solr shines. The main part of the book is a thorough exploration of nearly every feature that Solr offers. To keep this interesting and realistic, we use a large open source set of metadata about artists, releases, and tracks courtesy of the MusicBrainz.org project. Using this data as a testing ground for Solr, you will learn how to import this data in various ways from CSV to XML to database access. You will then learn how to search this data in a myriad of ways, including Solr's rich query syntax, "boosting" match scores based on record data and other means, about searching across multiple fields with different boosts, getting facets on the results, auto-complete user queries, spell-correcting searches, highlighting queried text in search results, and so on.

After this thorough tour, we'll demonstrate working examples of integrating a variety of technologies with Solr such as Java, JavaScript, Drupal, Ruby, XSLT, PHP, and Python.

Finally, we'll cover various deployment considerations to include indexing strategies and performance-oriented configuration that will enable you to scale Solr to meet the needs of a high-volume site.

Publication date:
August 2009
Publisher
Packt
Pages
336
ISBN
9781847195883

 

Chapter 1. Quick Starting Solr

Welcome to Solr! You've made an excellent choice in picking a technology to power your searching needs. In this chapter, we're going to cover the following topics:

  • An overview of what Solr and Lucene are all about

  • What makes Solr different from other database technologies

  • How to get Solr, what's included, and what is where

  • Running Solr and importing sample data

  • A quick tour of the interface and key configuration files

 

An introduction to Solr


Solr is an open source enterprise search server. It is a mature product powering search for public sites like CNet, Zappos, and Netflix, as well as intranet sites. It is written in Java, and that language is used to further extend or modify Solr. However, because Solr is a server that communicates using standards such as HTTP and XML, knowledge of Java is very useful but not strictly a requirement. In addition to the standard ability to return a list of search results for some query, it has numerous other features such as result highlighting, faceted navigation (like the category drill-downs found on most e-commerce sites), query spell correction, query auto-suggest, and "more like this" for finding similar documents.

Lucene, the underlying engine

Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. Lucene was developed and open sourced by Doug Cutting in 2000 and has evolved and matured since then with a strong online community. Being just a code library, Lucene is not a server and certainly isn't a web crawler either. This is an important fact. There aren't even any configuration files. In order to use Lucene directly, one writes code to store and query an index stored on a disk. The major features found in Lucene are as follows:

  • A text-based inverted index persistent storage for efficient retrieval of documents by indexed terms

  • A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched

  • A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matches

  • A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the most likely candidates first, with flexible means to affect the scoring

  • A highlighter feature to show words found in context

  • A query spellchecker based on indexed content
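To make the first two features above concrete, here is a toy sketch (plain Java, not Lucene's actual API) of an analyzer that reduces text to lowercase terms, and an inverted index mapping each term to the documents containing it. The document titles are made up for illustration.

```java
import java.util.*;

// A toy illustration of two core Lucene ideas: an "analyzer" that turns
// text into terms, and an inverted index mapping each term to the set of
// documents that contain it. Lucene's real implementations are far richer.
public class ToyIndex {
    // Minimal analyzer: split on non-letters and lowercase each token.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String t : text.split("[^A-Za-z]+")) {
            if (!t.isEmpty()) terms.add(t.toLowerCase());
        }
        return terms;
    }

    // Build term -> set of document ids containing that term.
    static Map<String, Set<Integer>> index(List<String> docs) {
        Map<String, Set<Integer>> inverted = new HashMap<>();
        for (int id = 0; id < docs.size(); id++) {
            for (String term : analyze(docs.get(id))) {
                inverted.computeIfAbsent(term, k -> new TreeSet<>()).add(id);
            }
        }
        return inverted;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList(
            "The Beatles: Abbey Road",
            "Abbey Road Studios, London");
        Map<String, Set<Integer>> inverted = index(docs);
        // Looking up a term is a single map access, not a scan of every document.
        System.out.println("abbey -> " + inverted.get("abbey"));   // both documents
        System.out.println("beatles -> " + inverted.get("beatles"));
    }
}
```

The efficiency of term lookup against this structure, rather than scanning documents, is what "persistent storage for efficient retrieval of documents by indexed terms" refers to.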

Note

For even more information on the query spellchecker, check out the Lucene In Action book (LINA for short) by Erik Hatcher and Otis Gospodnetić.

Solr, the Server-ization of Lucene

With the definition of Lucene behind us, Solr can be described succinctly as the server-ization of Lucene. However, it is definitely not a thin wrapper around the Lucene libraries. Some of Solr's features, such as faceting, are distinct from Lucene at the surface, but the line between the two blurs not far into the implementation. Without further ado, here is the major feature-set in Solr:

  • HTTP request processing for indexing and querying documents.

  • Several caches for faster query responses.

  • A web-based administrative interface including:

    • Runtime performance statistics including cache hit/miss rates.

    • A query form to search the index.

    • A schema browser with histograms of popular terms along with some statistics.

    • Detailed breakdown of scoring mathematics and text analysis phases.

  • Configuration files for the schema and the server itself (in XML).

    • Solr adds to Lucene's text analysis library and makes it configurable through XML.

    • Introduces the notion of a field type (this is important yet surprisingly not in Lucene). Types are present for dates and special sorting concerns.

  • A disjunction-max query handler that is more suited to typical end-user queries and applications than Lucene's underlying raw query syntax.

  • Faceting of query results.

  • A spell check plugin used for making alternative query suggestions (that is, "did you mean ___")

  • A more like this plugin to list documents that are similar to a chosen document.

  • A distributed Solr server model with accompanying scripts for larger scale deployments.
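One of the features above deserves a tiny illustration: a disjunction-max query combines per-field scores by letting the best-matching field dominate, while a tiebreaker gives partial credit for matches in the other fields. The following is a hedged sketch of that combining formula only, not Solr's implementation, and the scores are invented.

```java
// A sketch of how a disjunction-max query combines per-field scores:
// the maximum sub-score dominates, plus a tiebreaker fraction of the
// remaining sub-scores. The numbers below are made up for illustration.
public class DisMaxSketch {
    static double disMax(double[] fieldScores, double tieBreaker) {
        double max = 0, sum = 0;
        for (double s : fieldScores) {
            max = Math.max(max, s);
            sum += s;
        }
        // Best field's score, plus a fraction of the other fields' scores.
        return max + tieBreaker * (sum - max);
    }

    public static void main(String[] args) {
        // Made-up scores for one query matched against two fields
        // (say, an artist field and an album field).
        double[] scores = {2.0, 0.5};
        System.out.println(disMax(scores, 0.0)); // pure max: ties ignored
        System.out.println(disMax(scores, 0.1)); // slightly above the max
    }
}
```

With a tiebreaker of 0, only the single best field counts; with a small positive tiebreaker, a document matching in several fields edges out one with the same best-field score alone.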

These features will be covered in more detail in later chapters.

 

Comparison to database technology


Knowledge of relational databases (often abbreviated RDBMS or just database for short) is an increasingly common skill that developers possess. A database and a [Lucene] search index aren't dramatically different conceptually. So let's start off by assuming that you know database basics, and I'll describe how a search index is different.

Note

This comparison puts aside the possibility that your database has built-in text indexing features. The point here is only to help you understand Solr.

The biggest difference is that a Lucene index is like a single-table database without any support for relational queries (JOINs). Yes, it sounds crazy, but remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed to be searched. One redeeming aspect of the single table schema is that fields can be multi-valued.

Other notable differences are as follows:

  • Updates: Entire documents can be deleted and added again but not updated.

  • Substring Search versus Text Search: Using a database, the poor man's search would be a substring search such as SELECT * FROM mytable WHERE name LIKE '%Books%'. That would match "CookBooks" as well as "My Books". Lucene instead fundamentally searches on terms (words). Depending on analysis configuration, this can mean that various forms of the word (example: book, singular) are found too, even phonetic (sounds-like) matches are possible. Using advanced ngram analysis techniques, it can do partial words too, although this is uncommon.

  • Scored Results and Boosting: Much of the power of Lucene is in its ability to score each matched document according to how well the search matched it. For example, if multiple words are searched for and are optional (a boolean OR search), then Lucene scores documents that matched more terms higher than those that just matched one. There are a variety of other factors too, and it's possible to adjust weightings of different fields. By comparison, a database has no concept of this: a record either matches or it doesn't. Of course, Lucene can sort on field values if that is needed.

  • Slow commits: Solr is highly optimized for search speed, and that speed is largely attributable to caches. When a commit is done to finalize documents that were just added, all of the caches need to be rebuilt, which could take between seconds and a minute, depending on various factors.
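The substring-versus-term distinction above can be sketched in a few lines of Java. The method names and example titles are made up; the point is that a LIKE '%Books%' predicate matches any containing string, while a term search matches only whole analyzed terms.

```java
import java.util.*;

// Contrasts a database-style substring match (LIKE '%Books%') with a
// term-based match over analyzed text, per the comparison above.
public class SubstringVsTerms {
    static boolean likeMatch(String text, String fragment) {
        // SELECT * FROM mytable WHERE name LIKE '%fragment%':
        // a raw substring scan (case sensitive here).
        return text.contains(fragment);
    }

    static boolean termMatch(String text, String queryTerm) {
        // Term search: the text is analyzed into whole lowercase terms first.
        Set<String> terms = new HashSet<>(
            Arrays.asList(text.toLowerCase().split("[^a-z0-9]+")));
        return terms.contains(queryTerm.toLowerCase());
    }

    public static void main(String[] args) {
        System.out.println(likeMatch("CookBooks", "Books")); // true: substring hit
        System.out.println(termMatch("CookBooks", "Books")); // false: "cookbooks" != "books"
        System.out.println(termMatch("My Books", "Books"));  // true: whole-term hit
    }
}
```

Stemming or phonetic analysis, as mentioned above, would additionally let a term search for "books" match "book", but that is beyond this sketch.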

 

Getting started


Solr is a Java based web application, but you don't need to be particularly familiar with Java in order to use it. With most topics, this book assumes little to no such knowledge on your part. However, if you wish to extend Solr, then you will definitely need to know Java. I also assume a basic familiarity with the command line, whether it is DOS or any Unix shell.

Before truly getting started with Solr, let's get the prerequisites out of the way. Note that if you are using Mac OS X, then you should have the needed pieces already (though you may need the developer tools add-on). If any of the -version test commands mentioned below fail, then you don't have that piece installed. URLs are provided for convenience, but it is up to you to install the software according to the instructions provided at the relevant sites.

A Java Development Kit (JDK) v1.5 or later: You can download the JDK from http://java.sun.com/javase/. Typing java -version will tell you which version of Java you are using if any, and you should type javac -version to ensure that you have the development kit too. You only need the JRE to run Solr, but you will need the JDK to compile it from source and to extend it.

Apache Ant: Any recent version should do and is available at http://ant.apache.org/. If you never modify Solr and just stick to a recent official release, then you can skip this. Note that the software provided with this book uses Ant as well. Therefore, you'll want Ant if you wish to follow along. Typing ant -version should demonstrate that you have it installed.

Subversion or Git for source control of Solr: http://subversion.tigris.org/getting.html or http://git-scm.com/. This isn't strictly necessary, but it's recommended for working with Solr's source code. If you choose to use a command line based distribution of either, then svn -version or git --version should work. Further instructions in this book are based on the command line, because it is a universal access method.

Any Java EE servlet engine app-server: This is a Java web server. Solr includes one already, Jetty, and we'll be using this throughout the book. In a later chapter, "Solr in the real world", deploying to an alternative is discussed.

The last official release or fresh code from source control

Let's finally get started and get Solr running. The official site for Solr is at http://lucene.apache.org/solr, where you can download the latest official release. Solr 1.3 was released on September 15th, 2008. Solr 1.4 is expected around the same time a year later and thus is probably available as you read this. This book was written in-between these releases and so it contains many but not all of 1.4's features. An alternative to downloading an official release is getting the latest code from source control (that is version control). In either case, the directory structure is conveniently identical and both include the source code. For many open source projects, the choice is almost always the last official release and not the latest source.

However, Solr's committers have made unit and integration testing a priority, as evidenced by the testing infrastructure and test code-coverage of over 70 percent (http://hudson.zones.apache.org/hudson/view/Solr/job/Solr-trunk/clover/), which is very good. Many projects have none at all. As a result, the latest source release is very stable, and it also makes changes to Solr easier, given that so many tests are in place to give confidence that Solr is working properly—so far as the tests test it, of course. And unlike a database, which is almost never modified to suit the needs of a project, Solr is modified often. Also note that there are a good many feature additions provided as source code patches within Solr's JIRA (its issue tracking system). The decision is of course up to you. If you are satisfied with the feature-set in the latest release and/or you don't think you'll be modifying Solr at all, then the latest release is fine. One way to gauge what (completed) features are not yet in the latest official release is to visit Solr's JIRA at http://issues.apache.org/jira/browse/SOLR, and then click on Roadmap. Also, the Wiki at http://wiki.apache.org/solr/ should have features that are not yet in the latest release version marked as such.

Tip

Choose to get Solr through source control even if you are going to stick with the last official release. When/if you make changes to Solr, it will then be easier to see what those differences are. Switching to a different release becomes much easier too.

We're going to get the code through Subversion by checking out the trunk (a source control term for the latest code). If you are using an IDE or some GUI tool for Subversion, then feel free to use that. The command line will suffice too. You should be able to successfully execute the following:

svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr_svn

That will result in Solr being checked out into the solr_svn directory. If you prefer one of the official releases, then use one of the following URLs instead of the one above: http://svn.apache.org/repos/asf/lucene/solr/tags/ (put that into your web browser to see the choices). So-called nightlies are also available if you don't want to use Subversion but want recent code.

Testing and building Solr

If you prefer a downloadable pre-built Solr instead of building from a Subversion checkout, then you can skip this section.

Tip

Ant basics

Apache Ant is a cross-platform build scripting tool specified with XML. It is largely Java oriented. An Ant script is conventionally named build.xml and sits in the root of a project. It contains a set of named ant targets that you can run. To list them along with their descriptions, type ant -p for a nice report. To run a target, simply supply it to ant as the first argument, such as ant compile. Targets often internally invoke other targets, and you'll see this in the output. In the end, ant should report BUILD SUCCESSFUL on success and BUILD FAILED otherwise. Note that ant uses the term 'build' universally, even when 'build' is not an apt description of what a target actually did.

Testing and building Solr is easy. Before building Solr, we'll run its test suite to ensure that there are no failing tests: execute the test target in Solr's installation directory by typing ant test. It should complete without any errors; on my old machine, it took about ten minutes to run. If there are errors (extremely rare), then you'll have to switch to a different version or wait a short while for a fix. Now, to build a ready-to-install Solr, just type ant dist. This fills the dist directory with some JAR files and a WAR file. If you are not familiar with Java, these files are a packaging mechanism for compiled code and related resources. They are technically ZIP files with a different file extension, so you can use any ZIP tool to view their contents. The most important one is the WAR file, which we'll be using next.

Solr's installation directory structure

In this section, we'll orient you to Solr's directory structure. This is not Solr's home directory, but a different place that we'll mention after this.

  • build: Only appears after Solr is built to house compiled code before being packaged. You won't need to look in here.

  • client: Contains convenient language-specific APIs for talking to Solr as an alternative to using your own code to send XML over HTTP. As of this writing, this only contains a couple of Ruby choices. The Java client called SolrJ is actually in src/solrj. More information on using clients to communicate with Solr is in Chapter 8.

  • dist: The built Solr JAR files and WAR file are here, as well as the dependencies. This directory is created and filled when Solr is built.

  • example: This is an installation of the Jetty servlet engine (a Java web server) including some sample data and Solr configuration. The interesting child directories are:

    • example/etc: Jetty's configuration. Among other things, here you can change the web port used from the pre-supplied 8983 to 80 (HTTP default).

    • example/multicore: Houses multiple Solr home directories in a Solr multicore setup. This will be discussed in Chapter 7.

    • example/solr: A Solr home directory for the default setup that we'll be using.

    • example/webapps: Solr's WAR file is deployed here.

  • lib: All of Solr's API dependencies. The larger pieces are Lucene, some Apache commons utilities, and Stax for efficient XML processing.

  • site: This is for managing what is published on the Solr web site. You won't need to go in here.

  • src: Various source code. It's broken down into a few notable directories:

    • src/java: Solr's source code, written in Java.

    • src/scripts: Unix bash shell scripts, particularly useful in larger production deployments employing multiple Solr servers.

    • src/solrj: Solr's Java client.

    • src/test: Solr's test source code and test files.

    • src/webapp: Solr's web administration interface, including Java Servlets (source code form) and JSPs. This is mostly what constitutes the WAR file. The JSPs for the admin interface are under here in web/admin/, if you care to tweak any to your needs.

If you are a Java developer, you may have noticed that the Java source in Solr is not located in one place. It's in src/java for the majority of Solr, src/common for the parts of Solr that are common to both the server side and Solrj client side, src/test for the test code, and src/webapp/src for the servlet-specific code. I am merely pointing this out to help you find code, not to be critical. Solr's files are well organized.

Solr's home directory

A Solr home directory contains Solr's configuration and data (a Lucene index) for a running Solr instance. Solr includes a sample one at example/solr, which we'll be using in-place throughout most of the book. Technically, example/multicore is also a valid Solr home, but for a multicore setup, which will be discussed much later. You know you're looking at a Solr home directory when it contains either a solr.xml file (formerly multicore.xml in Solr 1.3), or both a conf and a data directory, though strictly speaking these might not be the actual requirements.

Note

The data directory might not be present yet if you haven't started Solr. Solr will create it on startup if it's missing, assuming it isn't configured to be named differently.

Solr's home directory is laid out like this:

  • bin: Suggested directory to place Solr replication scripts, if you have a more advanced setup.

  • conf: Configuration files. The two I mention below are very important, but it will also contain some other .txt and .xml files, which are referenced by these two files for different things such as special text analysis steps.

  • conf/schema.xml: This is the schema for the index including field type definitions with associated analyzer chains.

  • conf/solrconfig.xml: This is the primary Solr configuration file.

  • conf/xslt: This directory contains various XSLT files that can be used to transform Solr's XML query responses into formats such as Atom/RSS.

  • data: Contains the actual Lucene index data. It's binary data, so you won't be doing anything with it except perhaps deleting it occasionally.

  • lib: Optional placement of extra Java JAR files that Solr will load on startup, allowing you to externalize plugins from the Solr distribution (the WAR file) for convenience. If you extend Solr without modifying Solr itself, then those modifications can be deployed in a JAR file here.
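As a small aside, the rule of thumb given earlier for recognizing a Solr home directory (a solr.xml file, or both conf and data) can be expressed in a few lines of Java. This is only a heuristic for orientation, not Solr's actual detection logic, and the demo directory name is arbitrary.

```java
import java.io.File;

// A heuristic check matching the rule of thumb from the text: a directory
// looks like a Solr home if it contains solr.xml, or both a conf and a
// data directory. Not Solr's real detection logic.
public class SolrHomeCheck {
    static boolean looksLikeSolrHome(File dir) {
        if (new File(dir, "solr.xml").isFile()) return true;
        return new File(dir, "conf").isDirectory()
            && new File(dir, "data").isDirectory();
    }

    public static void main(String[] args) {
        // Build a throwaway directory layout to demonstrate the check.
        File home = new File(System.getProperty("java.io.tmpdir"), "solr-home-demo");
        new File(home, "conf").mkdirs();
        new File(home, "data").mkdirs();
        System.out.println(looksLikeSolrHome(home)); // true
    }
}
```

Remember the note above: data may be absent before Solr's first startup, so its absence alone doesn't mean you are looking at the wrong directory.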

It's really important to know how Solr finds its home directory. This is covered next.

How Solr finds its home

In the next section, you'll start Solr. When Solr starts up, about the first thing it does is load its configuration from its home directory. Where that is exactly can be specified in several different ways.

Solr first checks for a Java system property named solr.solr.home. There are a few ways to set a Java system property, but a universal one, no matter which servlet engine you use, is through the command line where Java is invoked. You could explicitly set Solr's home like so when you start Jetty: java -Dsolr.solr.home=solr/ -jar start.jar, or you could use Java Naming and Directory Interface (JNDI) to bind the directory path to java:comp/env/solr/home. As with Java system properties, there are multiple ways to do this. Some are app-server dependent, but a universal one is to add the following to the WAR file's web.xml located in src/webapp/web/WEB-INF (you'll find this there already but commented out).

<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>solr/</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>

As this is a change to web.xml, you'll need to re-run ant dist-war to repackage it and then redeploy it. Doing this with the Jetty supplied with Solr is insufficient because JNDI itself isn't set up. I'm not going to get into this further, because if you know what JNDI is and want to use it, then you'll surely figure out how to do it for your particular app-server.

Finally, if Solr's home isn't configured as a Java system property or through JNDI, then it defaults to solr/. In the examples above, I used that particular path too. We're going to simply stick with this path for the rest of this book, because this is a development, not production, setting.
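The lookup order just described (system property, then JNDI, then the solr/ default) can be sketched as follows. The JNDI step is faked with a plain map here, because a real lookup requires a container; only the ordering is the point.

```java
import java.util.Collections;
import java.util.Map;

// A sketch of the home-directory lookup order described in the text:
// solr.solr.home system property, then the JNDI binding, then "solr/".
// The JNDI lookup is simulated with a map for illustration.
public class HomeResolution {
    static String resolveHome(String systemProperty, Map<String, String> jndi) {
        if (systemProperty != null) return systemProperty;   // -Dsolr.solr.home=...
        String bound = jndi.get("java:comp/env/solr/home");  // JNDI binding
        if (bound != null) return bound;
        return "solr/";                                      // the default
    }

    public static void main(String[] args) {
        Map<String, String> noJndi = Collections.emptyMap();
        System.out.println(resolveHome(null, noJndi));             // solr/ (default)
        System.out.println(resolveHome("/opt/solr/home", noJndi)); // property wins
        System.out.println(resolveHome(null,
            Collections.singletonMap("java:comp/env/solr/home", "/srv/solr")));
    }
}
```

The earlier-listed mechanisms always take precedence, which is why a production deployment that sets the system property need not worry about the default relative path.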

Tip

In a production environment, you will almost certainly configure Solr's home rather than let it fall back to the default solr/. You will also probably use an absolute path instead of a relative one, which wouldn't work if you accidentally start your app-server from a different directory.

When troubleshooting setting Solr's home, be sure to look at the very first Solr log messages when Solr starts:

Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config getInstanceDir

INFO: Solr home defaulted to 'null' (could not find system property or JNDI)

Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config setInstanceDir

INFO: Solr home set to 'solr/'

This shows that Solr was left to default to solr/. You'll see this output when you start Solr, as described in the next section.

Deploying and running Solr

The file we're going to deploy is the file ending in .war in the dist directory (dist/apache-solr-1.4.war). The WAR file in particular is important, because this single file represents an entire Java web application. It includes Solr's JAR file, all of Solr's dependencies (which amount to other JAR files), Java Server Pages (JSPs) (which are rendered to a web browser when the WAR is deployed), and various configuration files and other web resources. It does not include Solr's home directory, however.

How one deploys a WAR file to a Java servlet engine depends on that servlet engine, but it is common for there to be a directory named something like webapps, which contains WAR files optionally in an expanded form. By expanded, I mean that the WAR file may be uncompressed and thus a directory by the same name. This can be a convenient deployed form in order to make changes in-place (such as to JSP files and static web files) without requiring rebuilding a WAR file and replacing an existing one. The disadvantage is that changes are not directly tracked by source control (example: Subversion). Another thing to note about the WAR file is that by convention, its name (without the .war extension, if present) is the path portion of the URL where the web server mounts the web application. For example, if you have an apache-solr-1.4.war file, then you would access it at http://localhost:8983/apache-solr-1.4/, assuming it's on the local machine and running at that default port.

We're going to deploy this WAR file into the Jetty servlet engine included with Solr. If you are using a pre-built downloaded Solr distribution, then Solr is already deployed into Jetty as solr.war. Solr has an ant target named example that does this (and some other things we don't care about), so you can simply run ant example. This target doesn't keep the original WAR filename when copying it; it abbreviates it to simply solr.war, which means that the URL path is just solr. By the way, because ant targets generally call other necessary targets, it was technically not necessary to run ant dist earlier for this step to work. It would not have run the tests, however.

Now we're going to start up Jetty and finally see Solr running (albeit without any data to query yet). First go to the example directory, and then run Jetty's start.jar file by typing the following command:

cd example
java -jar start.jar

You'll see about a page of output including references to Solr. When it is finished, you should see this output at the very end of the command prompt:

2008-08-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983

The 0.0.0.0 means it's listening to connections from any host (not just localhost, notwithstanding potential firewalls), and 8983 is the port. Jetty reporting this doesn't necessarily mean that Solr was deployed successfully; you might see an error such as a stack trace in the output if something went wrong. Even if it did go wrong, you should be able to access the web server at http://localhost:8983, which will show you a list of links to the deployed web applications, just Solr for this setup. Solr should have this link: http://localhost:8983/solr, and if you go there, then you should see either details about an error if Solr wasn't loaded correctly, or a simple page with a link to Solr's admin page, which should be http://localhost:8983/solr/admin/. You'll be visiting that link often.

Tip

To quit Jetty (and many other command line programs for that matter), hit Ctrl-C on the keyboard.

 

A quick tour of Solr!


Start up Jetty if it isn't already up and point your browser to the admin URL: http://localhost:8983/solr/admin/, so that we can get our bearings on this interface that is not yet familiar to you. We're not going to discuss any page in any depth at this point.

Tip

This part of Solr is somewhat rough and is subject to change more than any other part of Solr.

The top gray area of the admin page is a header that appears on every page. When you start dealing with multiple Solr instances (development machine versus production, multicore, Solr clusters), it is important to know where you are. The IP and port are obvious. The (example) is a reference to the name of the schema; that's just a simple label at the top of the schema file to name the schema. If you have multiple schemas for different data sets, then this is a useful differentiator. Next are the current working directory (cwd) and Solr's home.

The block below this is a navigation menu to the different admin screens and configuration data. The navigation menu is explained as follows:

  • SCHEMA: This downloads the schema configuration file (XML) directly to the browser.

    Tip

    Firefox conveniently displays XML data with syntax highlighting. Safari, on the other hand, tries to render it and the result is unusable. Your mileage will vary depending on the browser you use. You can always use your browser's view source command if needed.

  • CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.

  • ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later.

  • SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.

  • STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.

  • INFO: This lists static version information about Solr's internal components. Frankly, it's not very useful.

  • DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations. More information on this is in Chapter 9.

  • PING: Ignore this, although it can be used for a health-check in distributed mode.

  • LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.

    Tip

Solr uses SLF4J for its logging, which in Solr is configured by default to use Java's built-in logging (that is, JUL or JDK14 logging). If you're more familiar with another framework like Log4J, then you can switch by simply removing the slf4j-jdk14 JAR file and adding slf4j-log4j12 (not included). If you're using Solr 1.3, then you're stuck with JUL.

  • JAVA PROPERTIES: It lists Java system properties.

  • THREAD DUMP: This displays a Java thread dump useful for experienced Java developers in diagnosing problems.

After the main menu is the Make a Query text box where you can type in a simple query. There's no data in Solr yet, so there's no point trying that right now.

  • FULL INTERFACE: As you might guess, it brings you to a form with more options, especially useful when diagnosing query problems or if you forget what the URL parameters are for some of the query options. The form is still very limited, however, and only allows a fraction of the query options that you can submit to Solr.

Finally, the bottom Assistance area contains useful information for Solr online. The last section of this chapter has more information on such resources.

Loading sample data

Solr happens to come with some sample data and a loader script, found in the example/exampledocs directory. We're going to use that, but just for the remainder of this chapter so that we can explore Solr more without getting into schema decision making and deeper data loading options. For the rest of the book, we'll base the examples on the supplemental files, which are provided online.

Firstly, ensure that Solr is running. You should assume that it is always in a running state throughout this book to follow any example. Now go into the example/exampledocs directory, and run the following:

exampledocs$ java -jar post.jar *.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file ipod_other.xml
SimplePostTool: POSTing file ipod_video.xml
SimplePostTool: POSTing file vidcard.xml
SimplePostTool: COMMITting Solr index changes..

Or, if you are using a Unix-like environment, you can use the post.sh shell script, which behaves similarly. The command above invokes the Java program embedded in post.jar with every file in the current directory ending in .xml as arguments. post.jar is a simple program that iterates over each argument given (a file reference) and HTTP POSTs it to the Solr server running on the current machine at the example server's default URL (http://localhost:8983/solr/update). I recommend examining the contents of the post.sh shell script for illustrative purposes. As seen above, the command mentions each file it sends. Finally, it sends a commit command, which causes documents posted since the last commit to be saved and made visible.
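If you're curious, the essence of what post.jar does can be sketched in a few lines of Python. This is an illustration only: it assumes the example server's default update URL and mirrors the tool's output shown above, rather than post.jar's actual source.

```python
# An illustrative Python equivalent of what post.jar does. Assumptions:
# the example server's default update URL, behavior inferred from the
# session output above.
import glob
import urllib.request

SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def build_update_request(body, url=SOLR_UPDATE_URL):
    """Build an HTTP POST carrying an XML payload for Solr's update handler."""
    return urllib.request.Request(
        url,
        data=body.encode("utf-8"),  # as the tool warns, documents must be UTF-8
        headers={"Content-Type": "text/xml; charset=utf-8"},
    )

def post_files(pattern="*.xml"):
    """POST each matching file, then commit so the documents become visible."""
    for path in sorted(glob.glob(pattern)):
        print("POSTing file", path)
        with open(path, encoding="utf-8") as f:
            urllib.request.urlopen(build_update_request(f.read()))
    urllib.request.urlopen(build_update_request("<commit/>"))
```

Running post_files() from within exampledocs would roughly mirror the session above, provided Solr is listening.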

Tip

The post.sh and post.jar programs could theoretically be used in a production scenario, but they are intended just for demonstration of the technology with the example data.

Let's take a look at one of these documents like monitor.xml:

<add>
  <doc>
    <field name="id">3007WFP</field>
    <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
    <field name="manu">Dell, Inc.</field>
    <field name="cat">electronics</field>
    <field name="cat">monitor</field>
    <field name="features">30" TFT active matrix LCD, 2560 x 1600,.25mm dot pitch, 700:1 contrast</field>
    <field name="includes">USB cable</field>
    <field name="weight">401.6</field>
    <field name="price">2199</field>
    <field name="popularity">6</field>
    <field name="inStock">true</field>
  </doc>
</add>

The schema for the XML files that are posted to Solr is very simple. This example doesn't demonstrate all of it, but it covers most of what matters. Multiple documents (each represented by a <doc> tag) can appear in series within the <add> tag, which is recommended for performance in bulk data loading scenarios. Remember that Solr gets a <commit/> tag sent to it in a separate POST. This syntax and command set may very well be all that you use. These options and other data loading choices will be discussed in Chapter 3.
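To get a feel for this syntax programmatically, here is a sketch that serializes plain Python dicts into the <add> structure shown above. The helper name is made up, the field names come from the example schema, and list values become repeated <field> tags (multi-valued fields):

```python
# Generate Solr's <add><doc><field .../> XML from plain dicts.
# Illustrative only; to_add_xml is not a Solr API.
import xml.etree.ElementTree as ET

def to_add_xml(docs):
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes repeated <field> tags (a multi-valued field)
            for v in (value if isinstance(value, list) else [value]):
                ET.SubElement(doc_el, "field", name=name).text = str(v)
    return ET.tostring(add, encoding="unicode")

xml_payload = to_add_xml([
    {"id": "3007WFP", "cat": ["electronics", "monitor"], "inStock": "true"},
])
```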

A simple query

On the main admin page, let's run a simple query searching for monitor.

Tip

When using Solr's search form, don't hit the return key. It would be nice if it submitted the form, but instead it adds a carriage return to the search box. If you leave that carriage return there and hit Search, then you'll get an error. Perhaps this will be fixed at some point.

Before we go over the XML output, I want to point out the URL and its parameters, which you will become very familiar with: http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on.

The form (whether the basic one or the Full Interface one) simply constructs a URL with appropriate parameters, and your browser sees the XML results. It is convenient to use the form at first, but then subsequently make direct modifications to the URL in the browser instead of returning to the form. The form only controls a basic subset of all possible parameters. The main benefit to the form is that it applies the URL escaping for special characters in the query, and for some basic options, you needn't remember what the parameter names are.
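To see what the form is doing, here's how the same URL can be assembled with Python's standard library; urlencode handles the escaping of special characters that the form otherwise does for you:

```python
# Build the query URL shown above; urlencode escapes special characters.
from urllib.parse import urlencode

params = {"q": "monitor", "version": "2.2", "start": 0, "rows": 10, "indent": "on"}
url = "http://localhost:8983/solr/select/?" + urlencode(params)

# A query containing characters that need escaping, e.g. name:"flat monitor",
# becomes q=name%3A%22flat+monitor%22
escaped = urlencode({"q": 'name:"flat monitor"'})
```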

Solr's search results from its web interface are in XML. As suggested earlier, you'll probably find that using the Firefox web browser provides the best experience due to the syntax coloring. Internet Explorer displays XML content well too. If you, at some point, want Solr to return a web page to your liking or an alternative XML structure, then that will be covered later. Here is the XML response with my comments:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
  <int name="status">0</int>
  <int name="QTime">3</int>
  <lst name="params">
    <str name="indent">on</str>
    <str name="rows">10</str>
    <str name="start">0</str>
    <str name="q">monitor</str>
    <str name="version">2.2</str>
  </lst>
</lst>

The first section of the response, which precedes the <result> tag that is about to follow, indicates how long the query took (measured in milliseconds), as well as listing the parameters that define the query. Solr has some sophisticated caching, and you will find that your queries will often complete in a millisecond or less, if you've run the query before. In the params list, q is clearly your query. rows and start have to do with paging. Clearly you wouldn't want Solr to always return all of the results at once, unless you really knew what you were doing. indent indents the XML output, which is convenient for experimentation. version isn't used much, but if you start building clients that interact with Solr, then you'll want to specify the version to reduce the possibility of things breaking, if you were to upgrade Solr. These parameters in the output are convenient for experimentation but can be configured to be omitted. Next up is the most important part, the results.

<result name="response" numFound="2" start="0">

The numFound number is self-explanatory. start is the index into the query results that are returned in the XML. Often, you'll want to see the score of the documents. However, the very basic query performed from the front Solr page doesn't include the score, despite the fact that it's sorted by it (Solr's default). The full interface form includes the score by default. Queries that include the score will include a maxScore attribute in the result tag. The maxScore for the query is independent of any paging, so no matter which part of the result set you've paged into (using the start parameter), the maxScore will be the same. The content of the result tag is a list of documents that matched the query, in score-sorted order. Later, we'll do some sorting by specified fields.
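As an aside, a client pages through results simply by varying start while keeping rows fixed. A hypothetical helper might look like this (the 1-based page numbering is this sketch's convention, not Solr's):

```python
# Derive Solr's start/rows parameters for a given page of results.
def paging_params(page, page_size=10):
    return {"start": (page - 1) * page_size, "rows": page_size}

# page 3 with ten results per page -> {"start": 20, "rows": 10}
third_page = paging_params(3)
```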

<doc>
    <arr name="cat"><str>electronics</str><str>monitor</str></arr>
    <arr name="features"><str>30" TFT active matrix LCD, 2560 x 1600,.25mm dot pitch, 700:1 contrast</str></arr>
    <str name="id">3007WFP</str>
    <bool name="inStock">true</bool>
    <str name="includes">USB cable</str>
    <str name="manu">Dell, Inc.</str>
    <str name="name">Dell Widescreen UltraSharp 3007WFP</str>
    <int name="popularity">6</int>
    <float name="price">2199.0</float>
    <str name="sku">3007WFP</str>
    <arr name="spell"><str>Dell Widescreen UltraSharp 3007WFP</str></arr>
    <date name="timestamp">2008-08-09T03:56:41.487Z</date>
    <float name="weight">401.6</float>
</doc>
<doc>
...
</doc>
</result>
</response>

The document list is pretty straightforward. By default, Solr will list all of the stored fields, plus the score if you asked for it (we didn't in this case). Remember that not all of the fields are necessarily stored (that is, you can query on them but not store them for retrieval—an optimization choice). Notice that basic data types str, bool, date, int, and float are used. Also note that certain fields are multi-valued, as indicated by an arr tag.
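As a sketch of how a client might consume these typed tags, the following uses the tag name to pick a Python type. The converter table and parse_doc function are illustrative, not a Solr API:

```python
# Convert one <doc> from Solr's XML results into a typed Python dict.
import xml.etree.ElementTree as ET

CONVERTERS = {"str": str, "int": int, "float": float,
              "bool": lambda v: v == "true", "date": str}

def parse_doc(doc_el):
    out = {}
    for el in doc_el:
        if el.tag == "arr":  # multi-valued field: list of child values
            out[el.get("name")] = [CONVERTERS[c.tag](c.text) for c in el]
        else:
            out[el.get("name")] = CONVERTERS[el.tag](el.text)
    return out

doc = ET.fromstring(
    '<doc><str name="id">3007WFP</str><bool name="inStock">true</bool>'
    '<float name="price">2199.0</float>'
    '<arr name="cat"><str>electronics</str><str>monitor</str></arr></doc>')
record = parse_doc(doc)
```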

This was a basic query. As you start adding more query options like faceting, highlighting, and so on, you will see additional XML following the result tag.

Some statistics

Let's take a look at the statistics page: http://localhost:8983/solr/admin/stats.jsp. Before we loaded data into Solr, this page reported that numDocs was 0, but now it should be 26. If you're wondering about maxDocs, it reports a number that can be higher than numDocs due to documents that have been deleted but not yet committed. That can happen either through an explicit delete posted to Solr or by adding a document that replaces another in order to enforce a unique primary key. While you're on this page, notice that the request handler named /update has some stats too:

name: /update
class: org.apache.solr.handler.XmlUpdateRequestHandler
version: $Revision: 679936 $
description: Add documents with XML
stats:
    handlerStart: 1218253728453
    requests: 19
    errors: 4
    timeouts: 0
    totalTime: 1392
    avgTimePerRequest: 73.26316
    avgRequestsPerSecond: 2.850955E-4
In my case, as seen above, there are some errors reported because I was fooling around, posting all of the files in the exampledocs directory, not just the XML ones. Another Solr handler name you'll want to examine is standard, which has been processing our queries.
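Incidentally, the averaged figures follow from the raw counters; avgTimePerRequest, for example, matches totalTime (in milliseconds) divided by requests:

```python
# Reproduce avgTimePerRequest from the raw counters shown above
requests, total_time_ms = 19, 1392
avg_time_per_request = total_time_ms / requests
print(round(avg_time_per_request, 5))  # 73.26316, matching the stats page
```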

Tip

These statistics are only kept in memory for as long as Solr runs; they are not stored to disk. As such, you cannot use them for long-term statistics.

 

The schema and configuration files


Solr's configuration files are extremely well documented. We're not going to go over the details here but this should give you a sense of what is where.

The schema (defined in schema.xml) contains field type definitions (within the <types> tag) and lists the fields that make up your schema (within the <fields> tag), each of which references a type. The schema contains other information too, such as the primary key (the field that uniquely identifies each document, a constraint that Solr enforces) and the default search field. The sample schema in Solr uses a field named text as the default search field; confusingly, there is also a field type named text. But remember that the monitor.xml document we reviewed earlier had no field named text, right? It is common for the schema to call for certain fields to be copied to other fields, particularly fields that are not in the input documents. So, even though the input documents don't have a field named text, there are <copyField> tags in the schema that call for the fields named cat, name, manu, features, and includes to be copied to text. This is a popular technique to speed up queries, since they can search over a small number of fields rather than a long list of them. Fields used this way are rarely stored, as they are needed only for querying, and so are just indexed. There is a lot more we could talk about in the schema, but we're going to move on for now.
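The copying described here is declared in schema.xml along these lines (a sketch echoing the example schema's field names, not the complete file):

```xml
<!-- Copy several source fields into the catch-all "text" field, which is
     indexed for searching but not stored -->
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>
```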

Solr's solrconfig.xml file contains lots of parameters that can be tweaked. At the moment, we're just going to take a peek at the request handlers, which are defined with <requestHandler> tags and make up about half of the file. In our first query, we didn't specify any request handler, so we got the default one. It's defined here:

<requestHandler name="standard" class="solr.SearchHandler" default="true">
<!-- default values for query parameters -->
  <lst name="defaults">
    <str name="echoParams">explicit</str>
    <!-- 
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
    -->
  </lst>
</requestHandler>

When you POST commands to Solr (such as to index a document) or query Solr (HTTP GET), it goes through a particular request handler. Handlers can be registered against certain URL paths. When we uploaded the documents earlier, it went to the handler defined like this:

<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

The request handlers oriented to querying, which use the class solr.SearchHandler, are much more interesting.

Note

The important thing to realize about using a request handler is that they are nearly completely configurable through URL parameters or POST'ed form parameters. They can also be specified in solrconfig.xml within either default, appends, or invariants named lst blocks, which serve to establish defaults. More on this is in Chapter 4. This arrangement allows you to set up a request handler for a particular application that will be querying Solr without forcing the application to specify all of its query options.
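For instance, a handler set up for a hypothetical application might look like the following sketch; the handler name and parameter values here are invented for illustration:

```xml
<requestHandler name="/myapp" class="solr.SearchHandler">
  <!-- applied when the request doesn't specify the parameter -->
  <lst name="defaults">
    <str name="echoParams">none</str>
    <int name="rows">20</int>
  </lst>
  <!-- always applied, overriding anything in the request -->
  <lst name="invariants">
    <str name="facet">false</str>
  </lst>
</requestHandler>
```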

The standard request handler defined previously doesn't really define any defaults other than the parameters that are to be echoed in the response. Remember its presence at the top of the XML output? By changing explicit to none you can have it omitted, or use all and you'll potentially see more parameters, if other defaults happened to be configured in the request handler. This parameter can alternatively be specified in the URL through echoParams=none. Remember to separate URL parameters with ampersands.

 

Solr resources outside this book


The following are some prominent Solr resources that you should be aware of:

  • Solr's Wiki: http://wiki.apache.org/solr/ has a lot of great documentation and miscellaneous information. For a Wiki, it's fairly organized too. In particular, if you are going to use a particular app-server in production, then there is probably a Wiki page on its specific details.

  • Within the Solr installation, you will also find that there are README.txt files in many directories within Solr and that the configuration files are very well documented.

  • Solr's mailing lists contain a wealth of information. If you have a few discriminating keywords then you can find nuggets of information in there with a search engine. The mailing lists of Solr and other Lucene sub-projects are best searched at: http://www.lucidimagination.com/search/ or Nabble.com.

    Tip

    It is highly recommended to subscribe to the Solr-users mailing list. You'll learn a lot and potentially help others too.

  • Solr's issue tracker, a JIRA installation at http://issues.apache.org/jira/browse/SOLR contains information on enhancements and bugs. Some of the comments for these issues can be extensive and enlightening. JIRA also uses a Lucene-powered search.

    Note

    Notation convention: Solr's JIRA issues are referenced like this: SOLR-64. You'll see such references in this book and elsewhere. You can easily look these up at Solr's JIRA. You may also see issues for Lucene that follow the same convention, for example, LUCENE-1215.

 

Summary


This completes a quick introduction to Solr. In the ensuing chapters, you're really going to get familiar with what Solr has to offer. I recommend you proceed in order from the next chapter through Chapter 6, because they build on each other and expose nearly all of the capabilities in Solr. These chapters are also useful as a reference to Solr's features. You can, of course, skip over sections that are not interesting to you. Chapter 8 is one you might peruse at any time, as it may have a section particularly applicable to your Solr usage scenario.

Accompanying the book at PACKT's web site is both source code and data to be indexed by Solr. In order to try out the same examples used in the book, you will have to download it and run the provided ant task, which prepares it for you. This first chapter is the only one that is not based on that supplemental content.

About the Authors

  • David Smiley

    Born to code, David Smiley is a software engineer who's passionate about search, Lucene, spatial, and open source. He has a great deal of expertise with Lucene and Solr, which started in 2008 at MITRE. In 2009, as the lead author, along with the coauthor Eric Pugh, he wrote Solr 1.4 Enterprise Search Server, the first book on Solr, published by Packt Publishing. It was updated in 2011, Apache Solr 3 Enterprise Search Server, Packt Publishing, and again for this third edition.

    After the first book, he developed 1- and 2-day Solr training courses, delivered half a dozen times within MITRE, and he has also delivered training on LucidWorks once. Most of his excitement and energy relating to Lucene is centered on Lucene's spatial module to include Spatial4j, which he is largely responsible for. He has presented his progress on this at Lucene Revolution and other conferences several times. He currently holds the status of committer & Project Management Committee (PMC) member with the Lucene/Solr open source project. Over the years, David has staked his career on search, working exclusively on such projects, formerly for MITRE and now as an independent consultant for various clients. You can reach him at [email protected] and view his LinkedIn profile here: http://www.linkedin.com/in/davidwsmiley.

  • Eric Pugh

    Fascinated by the "craft" of software development, Eric Pugh has been involved in the open source world as a developer, committer, and user for the past decade. He is an emeritus member of the Apache Software Foundation.

    In biotech, financial services, and defense IT, he has helped European and American companies develop coherent strategies to embrace open source software. As a speaker, he has advocated the advantages of Agile practices in search, discovery, and analytics projects.

    Eric became involved in Solr when he submitted the patch SOLR-284 to parse rich document types, such as PDF and MS Office formats, that became the single-most popular patch, as measured by votes! The patch was subsequently cleaned up and enhanced by three other individuals, demonstrating the power of the free / open source models to build great code collaboratively. SOLR-284 was eventually refactored into Solr Cell.

    He blogs at http://www.opensourceconnections.com/blog/.
