Welcome to Solr! You've made an excellent choice in picking a technology to power your searching needs. In this chapter, we're going to cover the following topics:
An overview of what Solr and Lucene are all about
What makes Solr different from other database technologies
How to get Solr, what's included, and what is where
Running Solr and importing sample data
A quick tour of the interface and key configuration files
Solr is an open source enterprise search server. It is a mature product powering search for public sites like CNet, Zappos, and Netflix, as well as intranet sites. It is written in Java, and that language is used to further extend/modify Solr. However, being a server that communicates using standards such as HTTP and XML, knowledge of Java is very useful but not strictly a requirement. In addition to the standard ability to return a list of search results for some query, it has numerous other features such as result highlighting, faceted navigation (for example, the ones found on most e-commerce sites), query spell correction, auto-suggest queries, and "more like this" for finding similar documents.
Before describing Solr, it is best to start with Apache Lucene, the core technology underlying it. Lucene is an open source, high-performance text search engine library. Lucene was developed and open sourced by Doug Cutting in 2000 and has evolved and matured since then with a strong online community. Being just a code library, Lucene is not a server and certainly isn't a web crawler either. This is an important fact. There aren't even any configuration files. In order to use Lucene directly, one writes code to store and query an index stored on a disk. The major features found in Lucene are as follows:
An inverted index with persistent storage for efficient retrieval of documents by indexed terms
A rich set of text analyzers to transform a string of text into a series of terms (words), which are the fundamental units indexed and searched
A query syntax with a parser and a variety of query types from a simple term lookup to exotic fuzzy matches
A good scoring algorithm based on sound Information Retrieval (IR) principles to produce the most likely candidates first, with flexible means to affect the scoring
A highlighter feature to show words found in context
A query spellchecker based on indexed content
With the definition of Lucene behind us, Solr can be described succinctly as the server-ization of Lucene. However, it is definitely not a thin wrapper around the Lucene libraries. Many of Solr's features, such as faceting, are distinct from Lucene, but the line between the two is often blurred not far into the implementation. Without further ado, here is the major feature-set in Solr:
HTTP request processing for indexing and querying documents.
Several caches for faster query responses.
A web-based administrative interface including:
Runtime performance statistics including cache hit/miss rates.
A query form to search the index.
A schema browser with histograms of popular terms along with some statistics.
Detailed breakdown of scoring mathematics and text analysis phases.
Configuration files for the schema and the server itself (in XML).
Solr adds to Lucene's text analysis library and makes it configurable through XML.
Introduces the notion of a field type (this is important yet surprisingly not in Lucene). Types are present for dates and special sorting concerns.
The disjunction-max query handler is more usable by end user queries and applications than Lucene's underlying raw queries.
Faceting of query results.
A spell check plugin used for making alternative query suggestions (that is, "did you mean ___")
A more like this plugin to list documents that are similar to a chosen document.
A distributed Solr server model with scripts to support larger scale deployments.
These features will be covered in more detail in later chapters.
Knowledge of relational databases (often abbreviated RDBMS or just database for short) is an increasingly common skill that developers possess. A database and a [Lucene] search index aren't dramatically different conceptually. So let's start off by assuming that you know database basics, and I'll describe how a search index is different.
Note
This comparison puts aside the possibility that your database has built-in text indexing features. The point here is only to help you understand Solr.
The biggest difference is that a Lucene index is like a single-table database without any support for relational queries (JOINs). Yes, it sounds crazy, but remember that an index is usually only there to support search and not to be the primary source of the data. So your database may be in "third normal form", but the index will be completely de-normalized and contain mostly just the data needed to be searched. One redeeming aspect of the single-table schema is that fields can be multi-valued.
Other notable differences are as follows:
Updates: Entire documents can be deleted and added again but not updated.
Substring Search versus Text Search: In a database, the poor man's search would be a substring search such as SELECT * FROM mytable WHERE name LIKE '%Books%'. That would match "CookBooks" as well as "My Books". Lucene instead fundamentally searches on terms (words). Depending on the analysis configuration, this can mean that various forms of a word (for example, the singular "book") are found too; even phonetic (sounds-like) matches are possible. Using advanced ngram analysis techniques, it can do partial words too, although this is uncommon.
Scored Results and Boosting: Much of the power of Lucene is in its ability to score each matched document according to how well the search matched it. For example, if multiple words are searched for and are optional (a boolean OR search), then Lucene scores documents that matched more terms higher than those that matched just one. There are a variety of other factors too, and it's possible to adjust the weightings of different fields. By comparison, a database has no concept of this: a record either matched or it didn't. Of course, Lucene can sort on field values if that is needed.
Slow commits: Solr is highly optimized for search speed, and that speed is largely attributable to caches. When a commit is done to finalize documents that were just added, all of the caches need to be rebuilt, which could take between seconds and a minute, depending on various factors.
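The substring-versus-term distinction above can be sketched with a toy inverted index. This is illustrative only; real Lucene analysis (stemming, phonetic matching, and so on) is far richer than lowercasing and splitting on whitespace.

```python
# Contrast a database-style substring scan with a Lucene-style inverted
# index lookup. A toy sketch, not Solr's or Lucene's implementation.
docs = {1: "My Books", 2: "CookBooks", 3: "Cooking for Two"}

# Substring search (the SQL LIKE approach) matches inside words:
like_matches = [i for i, text in docs.items() if "Books" in text]

# An inverted index maps each whole (analyzed) term to the documents
# containing it:
index = {}
for doc_id, text in docs.items():
    for term in text.lower().split():
        index.setdefault(term, set()).add(doc_id)

# A term search only matches documents containing the exact term "books",
# so "CookBooks" (one term, "cookbooks") is not found:
term_matches = sorted(index.get("books", set()))
```

Here the LIKE scan matches both "My Books" and "CookBooks", while the term lookup matches only "My Books".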
Solr is a Java based web application, but you don't need to be particularly familiar with Java in order to use it. With most topics, this book assumes little to no such knowledge on your part. However, if you wish to extend Solr, then you will definitely need to know Java. I also assume a basic familiarity with the command line, whether it is DOS or any Unix shell.
Before truly getting started with Solr, let's get the prerequisites out of the way. Note that if you are using Mac OS X, then you should have the needed pieces already (though you may need the developer tools add-on). If any of the -version test commands mentioned below fail, then you don't have that piece. URLs are provided for convenience, but it is up to you to install the software according to the instructions provided at the relevant sites.
A Java Development Kit (JDK) v1.5 or later: You can download the JDK from http://java.sun.com/javase/. Typing java -version will tell you which version of Java you are using, if any, and you should type javac -version to ensure that you have the development kit too. You only need the JRE to run Solr, but you will need the JDK to compile it from source and to extend it.
Apache Ant: Any recent version should do and is available at http://ant.apache.org/. If you never modify Solr and just stick to a recent official release, then you can skip this. Note that the software provided with this book uses Ant as well, so you'll want Ant if you wish to follow along. Typing ant -version should demonstrate that you have it installed.
Subversion or Git for source control of Solr: http://subversion.tigris.org/getting.html or http://git-scm.com/. This isn't strictly necessary, but it's recommended for working with Solr's source code. If you choose to use a command line based distribution of either, then svn --version or git --version should work. Further instructions in this book are based on the command line, because it is a universal access method.
Any Java EE servlet engine app-server: This is a Java web server. Solr includes one already, Jetty, and we'll be using this throughout the book. In a later chapter, "Solr in the real world", deploying to an alternative is discussed.
Let's finally get started and get Solr running. The official site for Solr is at http://lucene.apache.org/solr, where you can download the latest official release. Solr 1.3 was released on September 15th, 2008. Solr 1.4 is expected around the same time a year later and thus is probably available as you read this. This book was written in-between these releases and so it contains many but not all of 1.4's features. An alternative to downloading an official release is getting the latest code from source control (that is version control). In either case, the directory structure is conveniently identical and both include the source code. For many open source projects, the choice is almost always the last official release and not the latest source.
However, Solr's committers have made unit and integration testing a priority, evident by the testing infrastructure and test code-coverage of over 70 percent (http://hudson.zones.apache.org/hudson/view/Solr/job/Solr-trunk/clover/), which is very good. Many projects have none at all. As a result, the latest source release is very stable, and it also makes changes to Solr easier, given that so many tests are in place to give confidence that Solr is working properly—so far as the tests test it, of course. And unlike a database, which is almost never modified to suit the needs of a project, Solr is modified often. Also note that there are a good many feature additions provided as source code patches within Solr's JIRA (its issue tracking system). The decision is of course up to you. If you are satisfied with the feature-set in the latest release and/or you don't think you'll be modifying Solr at all, then the latest release is fine. One way to gauge what (completed) features are not yet in the latest official release is to visit Solr's JIRA at http://issues.apache.org/jira/browse/SOLR, and then click on Roadmap. Also, the Wiki at http://wiki.apache.org/solr/ should have features that are not yet in the latest release version marked as such.
Tip
Choose to get Solr through source control even if you are going to stick with the last official release. When/if you make changes to Solr, it will then be easier to see what those differences are. Switching to a different release becomes much easier too.
We're going to get the code through Subversion and check out the trunk (a source control term for the latest code). If you are using an IDE or some GUI tool for Subversion, then feel free to use that. The command line will suffice too. You should be able to successfully execute the following:
svn co http://svn.apache.org/repos/asf/lucene/solr/trunk/ solr_svn
That will result in Solr being checked out into the solr_svn directory. If you prefer one of the official releases, then use one of the URLs under http://svn.apache.org/repos/asf/lucene/solr/tags/ instead of the one above (put that into your web browser to see the choices). So-called nightlies are also available if you don't want to use Subversion but want recent code.
If you prefer a downloadable pre-built Solr instead of building from source, then you can skip this section.
Tip
Ant basics
Apache Ant is a cross-platform build scripting tool specified with XML. It is largely Java oriented. An Ant script is assumed to be named build.xml in the root of a project. It contains a set of named Ant targets that you can run. To list them along with their descriptions, type ant -p for a nice report. To run a target, simply supply it to Ant as the first argument, such as ant compile. Targets often internally invoke other targets, and you'll see this in the output. In the end, Ant should report BUILD SUCCESSFUL if successful and BUILD FAILED if not. Note that Ant's use of the term 'build' is universal, even if 'build' is not an apt description of what a target performed.
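To make the target and dependency ideas concrete, here is a minimal hypothetical build.xml (not Solr's own, which is far larger); the target names, descriptions, and depends attribute are the pieces that ant -p and target chaining operate on:

```xml
<project name="example" default="compile">
  <target name="compile" description="Compile the Java sources">
    <javac srcdir="src" destdir="build"/>
  </target>
  <target name="dist" depends="compile" description="Package a JAR">
    <jar destfile="dist/example.jar" basedir="build"/>
  </target>
</project>
```

Running ant dist against a file like this would invoke compile first, because dist depends on it.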
Testing and building Solr is easy. Before we build Solr, we're going to test it first to ensure that there are no failing tests. Simply execute the test target in Solr's installation directory, like ant test. That should execute without any errors. On my old machine, it took about ten minutes to run. If there were errors (extremely rare), then you'll have to switch to a different version or wait shortly for it to be fixed. Now to build a ready-to-install Solr, just type ant dist. This is going to fill the dist directory with some JAR files and a WAR file. If you are not familiar with Java, these files are a packaging mechanism for compiled code and related resources. They are technically ZIP files with a different file extension, so you can use any ZIP file tools to view their contents. The most important one is the WAR file, which we'll be using next.
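Because JAR and WAR files really are ZIP archives under a different extension, you can inspect one with any ZIP tooling. The sketch below builds a tiny stand-in archive in memory with Python's standard zipfile module; listing the contents of a real WAR works exactly the same way.

```python
# JAR and WAR files are ZIP archives with a different extension, so the
# standard zipfile module can read them. We build a tiny stand-in archive
# in memory; a real WAR is inspected identically.
import io
import zipfile

buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as war:
    war.writestr("WEB-INF/web.xml", "<web-app/>")

with zipfile.ZipFile(buf) as war:
    names = war.namelist()   # ['WEB-INF/web.xml']
```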
In this section, we'll orient you to Solr's installation directory structure. This is not Solr's home directory, which is a different place that we'll cover afterward.
build: Only appears after Solr is built, to house compiled code before being packaged. You won't need to look in here.
client: Contains convenient language-specific APIs for talking to Solr as an alternative to using your own code to send XML over HTTP. As of this writing, this only contains a couple of Ruby choices. The Java client, called SolrJ, is actually in src/solrj. More information on using clients to communicate with Solr is in Chapter 8.
dist: The built Solr JAR files and WAR file are here, as well as the dependencies. This directory is created and filled when Solr is built.
example: This is an installation of the Jetty servlet engine (a Java web server) including some sample data and Solr configuration. The interesting child directories are:
example/etc: Jetty's configuration. Among other things, here you can change the web port used from the pre-supplied 8983 to 80 (HTTP default).
example/multicore: Houses multiple Solr home directories in a Solr multicore setup. This will be discussed in Chapter 7.
example/solr: A Solr home directory for the default setup that we'll be using.
lib: All of Solr's API dependencies. The larger pieces are Lucene, some Apache Commons utilities, and StAX for efficient XML processing.
site: This is for managing what is published on the Solr web site. You won't need to go in here.
src: Various source code, broken down into a few notable directories:
src/java: Solr's source code, written in Java.
src/scripts: Unix bash shell scripts, particularly useful in larger production deployments employing multiple Solr servers.
src/solrj: Solr's Java client.
src/webapp: Solr's web administration interface, including Java servlets (in source code form) and JSPs. This is mostly what constitutes the WAR file. The JSPs for the admin interface are under web/admin/ in here, if you care to tweak any to your needs.
If you are a Java developer, you may have noticed that the Java source in Solr is not located in one place. It's in src/java for the majority of Solr, src/common for the parts of Solr that are common to both the server side and the SolrJ client side, src/test for the test code, and src/webapp/src for the servlet-specific code. I am merely pointing this out to help you find code, not to be critical. Solr's files are well organized.
A Solr home directory contains Solr's configuration and data (a Lucene index) for a running Solr instance. Solr includes a sample one at example/solr, which we'll be using in place throughout most of the book. Technically, example/multicore is also a valid Solr home, but for a multicore setup, which will be discussed much later. You know you're looking at a Solr home directory when it contains either a solr.xml file (formerly multicore.xml in Solr 1.3), or both a conf and a data directory, though strictly speaking these might not be the actual requirements.
Note
The data directory might not yet be present because you haven't started Solr yet; Solr will create it on startup if it's absent, assuming it's not configured to be named differently.
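The recognition heuristic just described can be sketched in a few lines; this is only an illustration of the rule of thumb above, not code from Solr itself.

```python
# A sketch of the heuristic above for recognizing a Solr home directory:
# either a solr.xml file is present, or both conf/ and data/ exist.
# Illustrative only; these are not Solr's formal requirements.
import os

def looks_like_solr_home(path):
    has_solr_xml = os.path.isfile(os.path.join(path, "solr.xml"))
    has_conf = os.path.isdir(os.path.join(path, "conf"))
    has_data = os.path.isdir(os.path.join(path, "data"))
    return has_solr_xml or (has_conf and has_data)
```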
Solr's home directory is laid out like this:
bin: Suggested directory to place Solr replication scripts, if you have a more advanced setup.
conf: Configuration files. The two I mention below are very important, but it will also contain some other .txt and .xml files, which are referenced by these two files for different things such as special text analysis steps.
conf/schema.xml: This is the schema for the index, including field type definitions with associated analyzer chains.
conf/solrconfig.xml: This is the primary Solr configuration file.
conf/xslt: This directory contains various XSLT files that can be used to transform Solr's XML query responses into formats such as Atom/RSS.
data: Contains the actual Lucene index data. It's binary data, so you won't be doing anything with it except perhaps deleting it occasionally.
lib: Optional placement of extra Java JAR files that Solr will load on startup, allowing you to externalize plugins from the Solr distribution (the WAR file) for convenience. If you extend Solr without modifying Solr itself, then those modifications can be deployed in a JAR file here.
It's really important to know how Solr finds its home directory. This is covered next.
In the next section, you'll start Solr. When Solr starts up, about the first thing it does is load its configuration from its home directory. Where that is exactly can be specified in several different ways.
Solr first checks for a Java system property named solr.solr.home. There are a few ways to set a Java system property, but a universal one, no matter which servlet engine you use, is on the command line where Java is invoked. You could explicitly set Solr's home like so when you start Jetty: java -Dsolr.solr.home=solr/ -jar start.jar. Alternatively, you could use the Java Naming and Directory Interface (JNDI) to bind the directory path to java:comp/env/solr/home. As with Java system properties, there are multiple ways to do this. Some are app-server dependent, but a universal one is to add the following to the WAR file's web.xml, located in src/webapp/web/WEB-INF (you'll find this there already but commented out).
<env-entry>
  <env-entry-name>solr/home</env-entry-name>
  <env-entry-value>solr/</env-entry-value>
  <env-entry-type>java.lang.String</env-entry-type>
</env-entry>
As this is a change to web.xml, you'll need to re-run ant dist-war to repackage it, and only then redeploy it. Doing this with the Jetty supplied with Solr is insufficient because JNDI itself isn't set up. I'm not going to get into this further, because if you know what JNDI is and want to use it, then you'll surely figure out how to do it for your particular app-server.
Finally, if Solr's home isn't configured as a Java system property or through JNDI, then it defaults to solr/. In the examples above, I used that particular path too. We're going to simply stick with this path for the rest of this book, because this is a development setting, not a production one.
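The resolution order just described (system property, then JNDI, then the solr/ default) can be sketched as follows; this is purely illustrative of the lookup order, not Solr's actual code.

```python
# Illustrative sketch of the home-directory lookup order described above.
def resolve_solr_home(system_property=None, jndi_binding=None):
    if system_property:      # set via -Dsolr.solr.home=...
        return system_property
    if jndi_binding:         # bound at java:comp/env/solr/home
        return jndi_binding
    return "solr/"           # the built-in default
```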
Tip
In a production environment, you will almost certainly configure Solr's home rather than let it fall back to the default solr/. You will also probably use an absolute path instead of a relative one, which would break if you accidentally started your app-server from a different directory.
When troubleshooting setting Solr's home, be sure to look at the very first Solr log messages when Solr starts:
Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config getInstanceDir
INFO: Solr home defaulted to 'null' (could not find system property or JNDI)
Aug 7, 2008 4:59:35 PM org.apache.solr.core.Config setInstanceDir
INFO: Solr home set to 'solr/'
This shows that Solr was left to default to solr/. You'll see this output when you start Solr, as described in the next section.
The file we're going to deploy is the one ending in .war in the dist directory (dist/apache-solr-1.4.war). The WAR file in particular is important, because this single file represents an entire Java web application. It includes Solr's JAR file, all of Solr's dependencies (which amount to other JAR files), Java Server Pages (JSPs), which are rendered to a web browser when the WAR is deployed, and various configuration files and other web resources. It does not include Solr's home directory, however.
How one deploys a WAR file to a Java servlet engine depends on that servlet engine, but it is common for there to be a directory named something like webapps, which contains WAR files, optionally in an expanded form. By expanded, I mean that the WAR file may be uncompressed into a directory by the same name. This can be a convenient deployed form for making changes in place (such as to JSP files and static web files) without rebuilding a WAR file and replacing an existing one. The disadvantage is that such changes are not directly tracked by source control (for example, Subversion). Another thing to note about the WAR file is that, by convention, its name (without the .war extension, if present) is the path portion of the URL where the web server mounts the web application. For example, if you have an apache-solr-1.4.war file, then you would access it at http://localhost:8983/apache-solr-1.4/, assuming it's on the local machine and running at that default port.
We're going to deploy this WAR file into the Jetty servlet engine included with Solr. If you are using a pre-built downloaded Solr distribution, then Solr is already deployed into Jetty as solr.war. Solr has an Ant target that does this (and some other things we don't care about) called example, so you can simply run it like ant example. This target doesn't keep the original WAR filename when copying it; it abbreviates it to simply solr.war, which means that the URL path is just solr. By the way, because Ant targets generally call other necessary targets, it was technically not necessary to run ant dist earlier for this step to work. That would not have run the tests, however.
Now we're going to start up Jetty and finally see Solr running (albeit without any data to query yet). First go to the example directory, and then run Jetty's start.jar file by typing the following command:
cd example
java -jar start.jar
You'll see about a page of output including references to Solr. When it is finished, you should see this output at the very end of the command prompt:
2008-08-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
The 0.0.0.0 means it's listening for connections from any host (not just localhost, notwithstanding potential firewalls), and 8983 is the port. Jetty reporting this doesn't necessarily mean that Solr was deployed successfully; you might see an error such as a stack trace in the output if something went wrong. Even if it did go wrong, you should be able to access the web server at http://localhost:8983. It will show you a list of links to web applications, which will be just Solr for this setup. Solr should have the link http://localhost:8983/solr, and if you go there, you should either see details about an error if Solr wasn't loaded correctly, or a simple page with a link to Solr's admin page, which should be http://localhost:8983/solr/admin/. You'll be visiting that link often.
Start up Jetty if it isn't already running, and point your browser to the admin URL http://localhost:8983/solr/admin/ so that we can get our bearings on this interface that is not yet familiar to you. We're not going to discuss any page in depth at this point.
The top gray area in the previous screenshot is a header that is on every page. When you start dealing with multiple Solr instances (development machine versus production, multicore, Solr clusters), it is important to know where you are. The IP and port are obvious. The (example) is a reference to the name of the schema. That's just a simple label at the top of the schema file to name the schema. If you have multiple schemas for different data sets, then this is a useful differentiator. Next is the current working directory cwd, and Solr's home.
The block below this is a navigation menu to the different admin screens and configuration data. The navigation menu is explained as follows:
SCHEMA: This downloads the schema configuration file (XML) directly to the browser.
CONFIG: It is similar to the SCHEMA choice, but this is the main configuration file for Solr.
ANALYSIS: It is used for diagnosing potential query/indexing problems having to do with the text analysis. This is a somewhat advanced screen and will be discussed later.
SCHEMA BROWSER: This is a neat view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.
STATISTICS: Here you will find stats such as timing and cache hit ratios. In Chapter 9, we will visit this screen to evaluate Solr's performance.
INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful.
DISTRIBUTION: It contains Distributed/Replicated status information, only applicable for such configurations. More information on this is in Chapter 9.
PING: Ignore this, although it can be used for a health-check in distributed mode.
LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.
Tip
Solr uses SLF4J for its logging, which in Solr is configured by default to use Java's built-in logging (that is, JUL or JDK14 logging). If you're more familiar with another framework such as Log4j, then you can switch by simply removing the slf4j-jdk14 JAR file and adding slf4j-log4j12 (not included). If you're using Solr 1.3, then you're stuck with JUL.
JAVA PROPERTIES: It lists Java system properties.
THREAD DUMP: This displays a Java thread dump useful for experienced Java developers in diagnosing problems.
After the main menu is the Make a Query text box where you can type in a simple query. There's no data in Solr yet, so there's no point trying that right now.
FULL INTERFACE: As you might guess, it brings you to a form with more options, especially useful when diagnosing query problems or if you forget what the URL parameters are for some of the query options. The form is still very limited, however, and only allows a fraction of the query options that you can submit to Solr.
Finally, the bottom Assistance area contains useful information for Solr online. The last section of this chapter has more information on such resources.
Solr happens to come with some sample data and a loader script, found in the example/exampledocs directory. We're going to use that, but just for the remainder of this chapter, so that we can explore Solr more without getting into schema decision making and deeper data loading options. For the rest of the book, we'll base the examples on the supplemental files, which are provided online.
Firstly, ensure that Solr is running. You should assume that it is always running throughout this book in order to follow any example. Now go into the example/exampledocs directory, and run the following:
exampledocs$ java -jar post.jar *.xml
SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file ipod_other.xml
SimplePostTool: POSTing file ipod_video.xml
SimplePostTool: POSTing file vidcard.xml
SimplePostTool: COMMITting Solr index changes..
Or, if you are using a Unix-like environment, you have the option of using the post.sh shell script, which behaves similarly. What this does is invoke the Java program embedded in post.jar with each file in the current directory ending in .xml. post.jar is a simple program that iterates over each argument given (a file reference) and HTTP POSTs it to Solr running on the current machine at the example server's default configuration (http://localhost:8983/solr/update). I recommend examining the contents of the post.sh shell script for illustrative purposes. As seen above, the command will mention the files it is sending. Finally, it sends a commit command, which causes documents that were posted prior to the commit to be saved and become visible.
Tip
The post.sh and post.jar programs could theoretically be used in a production scenario, but they are intended just for demonstration of the technology with the example data.
Let's take a look at one of these documents, such as this one from monitor.xml:
<add>
  <doc>
    <field name="id">3007WFP</field>
    <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
    <field name="manu">Dell, Inc.</field>
    <field name="cat">electronics</field>
    <field name="cat">monitor</field>
    <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
    <field name="includes">USB cable</field>
    <field name="weight">401.6</field>
    <field name="price">2199</field>
    <field name="popularity">6</field>
    <field name="inStock">true</field>
  </doc>
</add>
The schema for the XML files that are posted to Solr is very simple. This one doesn't demonstrate all of it, but it is most of what matters. Multiple documents (each represented by a <doc> tag) can be present in series within the <add> tag, which is recommended in bulk data loading scenarios for performance. Remember that Solr gets a <commit/> tag sent to it in a separate POST. This syntax and command set may very well be all that you use. More about these options and other data loading choices will be discussed in Chapter 3.
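If you're generating these add documents from your own data, the structure is simple enough to build in a few lines. Here is a sketch using Python's standard library; the function name is hypothetical, and multi-valued fields (such as cat above) are passed as lists.

```python
# Generate the <add><doc> XML shown above from a plain dict.
# An illustrative sketch; field names follow the example schema.
import xml.etree.ElementTree as ET

def to_add_xml(doc_fields):
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        values = value if isinstance(value, list) else [value]
        for v in values:
            field = ET.SubElement(doc, "field", name=name)
            field.text = str(v)
    return ET.tostring(add, encoding="unicode")

xml = to_add_xml({"id": "3007WFP", "cat": ["electronics", "monitor"]})
```

The resulting string is exactly the kind of payload that post.jar sends to Solr's update URL.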
On the main admin page, let's run a simple query searching for monitor.
Tip
When using Solr's search form, don't hit the return key. It would be nice if it submits the form, but it adds a carriage return to the search box instead. If you leave this carriage return there and hit Search, then you'll get an error. Perhaps this will be fixed at some point.
Before we go over the XML output, I want to point out the URL and its parameters, which you will become very familiar with: http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on.
The form (whether the basic one or the Full Interface one) simply constructs a URL with appropriate parameters, and your browser sees the XML results. It is convenient to use the form at first, but then subsequently make direct modifications to the URL in the browser instead of returning to the form. The form only controls a basic subset of all possible parameters. The main benefit to the form is that it applies the URL escaping for special characters in the query, and for some basic options, you needn't remember what the parameter names are.
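If you'd rather build such URLs in code than in the browser, the standard library handles the escaping that the form does for you. A sketch (the host and port match the example Jetty setup):

```python
# Construct the select URL shown above; urlencode applies the URL
# escaping that the admin form would otherwise handle.
from urllib.parse import urlencode

params = {"q": "monitor", "version": "2.2", "start": 0, "rows": 10,
          "indent": "on"}
url = "http://localhost:8983/solr/select/?" + urlencode(params)
```

A query containing spaces or special characters would be escaped automatically, which is the main thing the form otherwise does for you.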
Solr's search results from its web interface are in XML. As suggested earlier, you'll probably find that using the Firefox web browser provides the best experience due to the syntax coloring. Internet Explorer displays XML content well too. If you, at some point, want Solr to return a web page to your liking or an alternative XML structure, then that will be covered later. Here is the XML response with my comments:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="q">monitor</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
The first section of the response, which precedes the <result> tag that is about to follow, indicates how long the query took (measured in milliseconds), as well as listing the parameters that define the query. Solr has some sophisticated caching, and you will find that your queries will often complete in a millisecond or less if you've run the query before. In the params list, q is clearly your query. rows and start have to do with paging: you wouldn't want Solr to always return all of the results at once, unless you really knew what you were doing. indent indents the XML output, which is convenient for experimentation. version isn't used much, but if you start building clients that interact with Solr, then you'll want to specify the version to reduce the possibility of things breaking if you were to upgrade Solr. These parameters in the output are convenient for experimentation but can be configured to be omitted. Next up is the most important part, the results.
<result name="response" numFound="2" start="0">
The numFound number is self-explanatory. start is the index into the query results that are returned in the XML. Often, you'll want to see the score of the documents. However, the very basic query performed from the front Solr page doesn't include the score, despite the fact that the results are sorted by it (Solr's default). The full interface form includes the score by default. Queries that include the score will include a maxScore attribute in the result tag. The maxScore for the query is independent of any paging, so no matter which part of the result set you've paged into (by using the start parameter), the maxScore will be the same. The content of the result tag is a list of documents that matched the query in score-sorted order. Later, we'll do some sorting by specified fields.
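To make the paging parameters concrete, here is a small illustrative sketch (the helper name is mine, not Solr's) that computes the start offsets needed to walk a full result set page by page:

```python
# Illustrative helper: given the numFound reported by Solr and a page
# size (the rows parameter), return the start offsets for each page.
def page_offsets(num_found, rows=10):
    return list(range(0, num_found, rows))

# For example, with numFound=26 and rows=10, three requests are
# needed, at start=0, start=10, and start=20.
print(page_offsets(26, 10))  # [0, 10, 20]
```

Each offset would be sent as the start parameter of a separate query, with rows held constant.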
<doc>
 <arr name="cat"><str>electronics</str><str>monitor</str></arr>
 <arr name="features"><str>30" TFT active matrix LCD, 2560 x 1600,.25mm dot pitch, 700:1 contrast</str></arr>
 <str name="id">3007WFP</str>
 <bool name="inStock">true</bool>
 <str name="includes">USB cable</str>
 <str name="manu">Dell, Inc.</str>
 <str name="name">Dell Widescreen UltraSharp 3007WFP</str>
 <int name="popularity">6</int>
 <float name="price">2199.0</float>
 <str name="sku">3007WFP</str>
 <arr name="spell"><str>Dell Widescreen UltraSharp 3007WFP</str></arr>
 <date name="timestamp">2008-08-09T03:56:41.487Z</date>
 <float name="weight">401.6</float>
</doc>
<doc>
...
</doc>
</result>
</response>
The document list is pretty straightforward. By default, Solr will list all of the stored fields, plus the score if you asked for it (we didn't in this case). Remember that not all of the fields are necessarily stored (that is, you can query on them but not store them for retrieval, an optimization choice). Notice that the basic data types str, bool, date, int, and float are used. Also note that certain fields are multi-valued, as indicated by an arr tag.
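These typed tags and the arr wrapper make the response easy to consume programmatically. As a hedged sketch (not Solr client code), a response <doc> could be parsed into a Python dict like this:

```python
# Sketch: parse a Solr <doc> element into a dict, honoring the typed
# tags and the multi-valued <arr> wrapper. SAMPLE is an abbreviated
# copy of the document shown above.
import xml.etree.ElementTree as ET

SAMPLE = """
<doc>
  <arr name="cat"><str>electronics</str><str>monitor</str></arr>
  <str name="id">3007WFP</str>
  <bool name="inStock">true</bool>
  <int name="popularity">6</int>
  <float name="price">2199.0</float>
</doc>
"""

# Map Solr's tag names to Python conversions; anything else stays a string.
CASTS = {"int": int, "float": float, "bool": lambda s: s == "true", "str": str}

def parse_doc(elem):
    doc = {}
    for child in elem:
        if child.tag == "arr":  # multi-valued field: convert each entry
            doc[child.get("name")] = [CASTS.get(v.tag, str)(v.text) for v in child]
        else:
            doc[child.get("name")] = CASTS.get(child.tag, str)(child.text)
    return doc

doc = parse_doc(ET.fromstring(SAMPLE))
print(doc["cat"], doc["price"], doc["inStock"])
```

In practice a client library would do this for you; the point is only that the type information is right there in the tag names.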
This was a basic query. As you start adding more query options like faceting, highlighting, and so on, you will see additional XML following the result tag.
Let's take a look at the statistics page: http://localhost:8983/solr/admin/stats.jsp. Before we loaded data into Solr, this page reported that numDocs was 0, but now it should be 26. If you're wondering about maxDocs and how it differs, maxDocs reports a number that is in some situations higher, due to documents that have been deleted but not yet committed. That can happen either due to an explicit delete posted to Solr, or by adding a document that replaces another in order to enforce a unique primary key. While you're at this page, notice that the query handler named /update has some stats too:
| name | /update |
|---|---|
| class | org.apache.solr.handler.XmlUpdateRequestHandler |
| version | $Revision: 679936 $ |
| description | Add documents with XML |
| stats | handlerStart: 1218253728453, requests: 19, errors: 4, timeouts: 0, totalTime: 1392, avgTimePerRequest: 73.26316, avgRequestsPerSecond: 2.850955E-4 |
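The derived statistics follow directly from the raw counters; a quick sketch (plain arithmetic, not a Solr API) shows where the average time per request comes from:

```python
# avgTimePerRequest is just totalTime divided by requests, both taken
# from the stats row above (times are in milliseconds).
total_time_ms = 1392
requests = 19
avg_time = total_time_ms / requests
print(round(avg_time, 5))  # 73.26316, matching the reported value
```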
In my case, as seen above, there are some errors reported because I was fooling around, posting all of the files in the exampledocs directory, not just the XML ones. Another Solr handler name you'll want to examine is standard, which has been processing our queries.
Solr's configuration files are extremely well documented. We're not going to go over the details here, but this should give you a sense of what is where.
The schema (defined in schema.xml) contains field type definitions (within the <types> tag) and lists the fields that make up your schema (within the <fields> tag), each of which references a type. The schema contains other information too, such as the primary key (the field that uniquely identifies each document, a constraint that Solr enforces) and the default search field. The sample schema in Solr uses the field named text as its default search field; confusingly, there is a field type named text too. But remember that the monitor.xml document we reviewed earlier had no field named text, right? It is common for the schema to call for certain fields to be copied to other fields, particularly fields not in the input documents. So, even though the input documents don't have a field named text, there are <copyField> tags in the schema, which call for the fields named cat, name, manu, features, and includes to be copied to text. This is a popular technique to speed up queries, so that queries can search over a small number of fields rather than a long list of them. Fields used this way are rarely stored, as they are just needed for querying and so are only indexed. There is a lot more we could talk about in the schema, but we're going to move on for now.
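As a rough sketch of what those declarations look like (abbreviated and from memory; consult the example schema.xml for the real ones), the catch-all field and its copy rules take approximately this form:

```xml
<!-- Sketch based on the example schema: copy several source fields
     into the catch-all "text" field, which is indexed but not stored. -->
<field name="text" type="text" indexed="true" stored="false" multiValued="true"/>
<copyField source="cat" dest="text"/>
<copyField source="name" dest="text"/>
<copyField source="manu" dest="text"/>
<copyField source="features" dest="text"/>
<copyField source="includes" dest="text"/>
```

Because the destination is indexed but not stored, it costs index space but nothing in retrieval, which is the trade-off the technique is making.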
Solr's solrconfig.xml file contains lots of parameters that can be tweaked. At the moment, we're just going to take a peek at the request handlers, which are defined with <requestHandler> tags. They make up about half of the file. In our first query, we didn't specify any request handler, so we got the default one. It's defined here:
<requestHandler name="standard" class="solr.SearchHandler" default="true">
 <!-- default values for query parameters -->
 <lst name="defaults">
  <str name="echoParams">explicit</str>
  <!--
  <int name="rows">10</int>
  <str name="fl">*</str>
  <str name="version">2.1</str>
  -->
 </lst>
</requestHandler>
When you POST commands to Solr (such as to index a document) or query Solr (HTTP GET), it goes through a particular request handler. Handlers can be registered against certain URL paths. When we uploaded the documents earlier, it went to the handler defined like this:
<requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />
The request handlers oriented to querying, which use the class solr.SearchHandler, are much more interesting.
Note
The important thing to realize about request handlers is that they are nearly completely configurable through URL parameters or POSTed form parameters. They can also be configured in solrconfig.xml, within lst blocks named defaults, appends, or invariants, which serve to establish defaults. More on this is in Chapter 4. This arrangement allows you to set up a request handler for a particular application that will be querying Solr, without forcing the application to specify all of its query options.
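For illustration, a handler combining all three blocks might be sketched like this; the handler name and parameter values are hypothetical, not from the example solrconfig.xml:

```xml
<!-- Hypothetical handler serving one application's queries. -->
<requestHandler name="/catalog" class="solr.SearchHandler">
  <lst name="defaults">   <!-- used when the request omits the parameter -->
    <int name="rows">20</int>
  </lst>
  <lst name="appends">    <!-- always added to whatever the request sends -->
    <str name="fq">inStock:true</str>
  </lst>
  <lst name="invariants"> <!-- override the request unconditionally -->
    <str name="echoParams">none</str>
  </lst>
</requestHandler>
```

The application would then only need to send q and any paging overrides; everything else is established server-side.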
The standard request handler defined previously doesn't really define any defaults other than the parameters that are to be echoed in the response. Remember its presence at the top of the XML output? By changing explicit to none, you can have it omitted; or use all, and you'll potentially see more parameters, if other defaults happened to be configured in the request handler. This parameter can alternatively be specified in the URL through echoParams=none. Remember to separate URL parameters with ampersands.
The following are some prominent Solr resources that you should be aware of:
Solr's Wiki: http://wiki.apache.org/solr/ has a lot of great documentation and miscellaneous information. For a Wiki, it's fairly organized too. In particular, if you are going to use a particular app-server in production, then there is probably a Wiki page there on specific details.
Within the Solr installation, you will also find that there are README.txt files in many directories, and that the configuration files are very well documented.
Solr's mailing lists contain a wealth of information. If you have a few discriminating keywords, then you can find nuggets of information in there with a search engine. The mailing lists of Solr and other Lucene sub-projects are best searched at http://www.lucidimagination.com/search/ or Nabble.com.
Solr's issue tracker, a JIRA installation at http://issues.apache.org/jira/browse/SOLR, contains information on enhancements and bugs. Some of the comments for these issues can be extensive and enlightening. JIRA also uses a Lucene-powered search.
This completes a quick introduction to Solr. In the ensuing chapters, you're really going to get familiar with what Solr has to offer. I recommend you proceed in order from the next chapter through Chapter 6, because these chapters build on each other and expose nearly all of the capabilities in Solr. They are also useful as a reference to Solr's features. You can, of course, skip over sections that are not interesting to you. Chapter 8 is one you might peruse at any time, as it may have a section particularly applicable to your Solr usage scenario.
Accompanying the book at PACKT's web site are both source code and data to be indexed by Solr. In order to try out the same examples used in the book, you will have to download it and run the provided ant task, which prepares it for you. This first chapter is the only one that is not based on that supplemental content.