Apache Solr 3 Enterprise Search Server
Enhance your search with faceted navigation, result highlighting, relevancy ranked sorting, and more with this book and ebook
Apache Solr, built on Apache Lucene, is a wildly popular open source enterprise search server that easily delivers powerful search and faceted navigation features that are elusive with databases. Solr supports complex search criteria, faceting, result highlighting, query-completion, query spellcheck, relevancy tuning, and more.
In this article by David Smiley and Eric Pugh, co-authors of Apache Solr 3 Enterprise Search Server, we will cover the following points:
- How to get Solr, what's included, and what is where
- Running Solr and importing sample data
We're going to get started by downloading Solr, examine its directory structure, and then finally run it. This sets you up for the next section, which tours a running Solr server.
Get Solr: You can download Solr from its website: http://lucene.apache.org/solr/. The last Solr release this article was written for is version 3.4. Solr has had several relatively minor point-releases since 3.1, and more will follow. In general I recommend using the latest release, since Solr and Lucene's code are extensively tested. Lucid Imagination also provides a Solr distribution called "LucidWorks for Solr". As of this writing it is Solr 3.2 with some choice patches applied afterwards to ensure its stability and performance. It's completely open source; previous LucidWorks releases were not, as they included some extras with use limitations. LucidWorks for Solr is a good choice if maximum stability, rather than the newest features, is your chief concern.
Get Java: The only prerequisite software needed to run Solr is Java 5 (a.k.a. java version 1.5) or later—ideally Java 6. Typing java -version at a command line will tell you exactly which version of Java you are using, if any.
Use latest version of Java! The initial release of Java 7 included some serious bugs that were discovered shortly before its release that affect Lucene and Solr. The release of Java 7u1 on October 19th, 2011 resolves these issues. These same bugs occurred with Java 6 under certain JVM switches, and Java 6u29 resolves them. Therefore, I advise you to use the latest Java release.
Java is available on all major platforms including Windows, Solaris, Linux, and Apple. Visit http://www.java.com to download the distribution for your platform. Java always comes with the Java Runtime Environment (JRE) and that's all Solr requires. The Java Development Kit (JDK) includes the JRE plus the Java compiler and various diagnostic utility programs. One such useful program is jconsole, and so the JDK distribution is recommended.
Solr is a Java-based web application, but you don't need to be particularly familiar with Java in order to use it.
Solr's installation directory structure
When you unzip Solr after downloading it, you should find a relatively straightforward directory structure:
- client: Convenient language-specific client APIs for talking to Solr.
Ignore the client directory: most client libraries are maintained by other organizations, except for the Java client SolrJ, which lies in the dist/ directory. client/ only contains solr-ruby, which has fallen out of favor compared to rsolr; both are Ruby Solr clients.
- contrib: Solr contrib modules. These are extensions to Solr. The final JAR file for each of these contrib modules is actually in dist/; so the actual files here are mainly the dependent JAR files.
- analysis-extras: A few text analysis components that have large dependencies. There are some "ICU" Unicode classes for multilingual support, a Chinese stemmer, and a Polish stemmer.
- clustering: An engine for clustering search results.
- dataimporthandler: The DataImportHandler (DIH), a very popular contrib module that imports data into Solr from a database and some other sources.
- extraction: Integration with Apache Tika, a framework for extracting text from common file formats. This module is also called SolrCell, and Tika is also used by the DIH's TikaEntityProcessor.
- uima: Integration with Apache UIMA—a framework for extracting metadata out of text. There are modules that identify proper names in text and identify the language, for example. To learn more, see Solr's wiki: http://wiki.apache.org/solr/SolrUIMA.
- velocity: Simple Search UI framework based on the Velocity templating language.
- dist: Solr's WAR and contrib JAR files. The Solr WAR file is the main artifact that embodies Solr as a standalone file deployable to a Java web server. The WAR does not include any contrib JARs. You'll also find the core of Solr as a JAR file, which you might use if you are embedding Solr within an application, and Solr's test framework as a JAR file, which is to assist in testing Solr extensions. You'll also see SolrJ's dependent JAR files here.
- docs: Documentation—the HTML files and related assets for the public Solr website, to be precise. It includes a good quick tutorial, and of course Solr's API. Even if you don't plan on extending the API, some parts of it are useful as a reference to certain pluggable Solr configuration elements—see the listing for the Java package org.apache.solr.analysis in particular.
- example: A complete Solr server, serving as an example. It includes the Jetty servlet engine (a Java web server), Solr, some sample data and sample Solr configurations. The interesting child directories are:
- example/etc: Jetty's configuration. Among other things, here you can change the web port used from the pre-supplied 8983 to 80 (HTTP default).
- example/exampledocs: Sample documents to be indexed into the default Solr configuration, along with the post.jar program for sending the documents to Solr.
- example/solr: The default, sample Solr configuration. This should serve as a good starting point for new Solr applications. It is used in Solr's tutorial.
- example/webapps: Where Jetty expects to deploy Solr from. A copy of Solr's WAR file is here, which contains Solr's compiled code.
Solr's home directory and Solr cores
When Solr starts, the very first thing it does is determine where the Solr home directory is. There are various ways to tell Solr where it is, but by default it's the directory named simply solr relative to the current working directory where Solr is started. You will usually see a solr.xml file in the home directory, which is optional but recommended. It mainly lists Solr cores. For simpler configurations like example/solr, there is just one Solr core, which uses Solr's home directory as its core instance directory. A Solr core holds one Lucene index and the supporting Solr configuration for that index. Nearly all interactions with Solr are targeted at a specific core. If you want to index different types of data separately or shard a large index into multiple ones, then Solr can host multiple Solr cores on the same Java server.
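For illustration, a minimal multicore solr.xml might look like this; this is a sketch based on the Solr 3.x format, and the core names and instance directories here are hypothetical:

```xml
<solr persistent="false">
  <cores adminPath="/admin/cores">
    <!-- each core gets its own instance directory containing conf/ and data/ -->
    <core name="products" instanceDir="products"/>
    <core name="articles" instanceDir="articles"/>
  </cores>
</solr>
```

With a file like this in the Solr home directory, requests are addressed to a specific core, for example http://localhost:8983/solr/products/select rather than http://localhost:8983/solr/select.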
A Solr core's instance directory is laid out like this:
- conf: Configuration files. The two I mention below are very important, but it will also contain some other .txt and .xml files which are referenced by these two.
- conf/schema.xml: The schema for the index including field type definitions with associated analyzer chains.
- conf/solrconfig.xml: The primary Solr configuration file.
- conf/xslt: Various XSLT files that can be used to transform Solr's XML query responses into formats such as Atom and RSS.
- conf/velocity: HTML templates and related web assets for rapid UI prototyping using Solritas. The soon-to-be-discussed "browse" UI is implemented with these templates.
- data: Where Lucene's index data lives. It's binary data, so you won't be doing anything with it except perhaps deleting it occasionally to start anew.
- lib: Where extra Java JAR files can be placed that Solr will load on startup. This is a good place to put contrib JAR files, and their dependencies.
Now we're going to start up Jetty and finally see Solr running, albeit without any data to query yet.
We're about to run Solr directly from the unzipped installation. This is great for exploring Solr and doing local development, but it's not what you would do in a production scenario. In production you would have a script or other mechanism to start and stop the servlet engine with the operating system; Solr does not include this. And to keep your system organized, you should keep the example directory as exactly what its name implies: an example. So if you want to use the provided Jetty servlet engine in production (a fine choice), copy the example directory elsewhere and name it something else.
First go to the example directory, and then run Jetty's start.jar file by typing the following command:
>> cd example
>> java -jar start.jar
The >> notation represents the command prompt; don't type it. These commands work in both *nix and DOS shells. You'll see about a page of output, including references to Solr. When it is finished, you should see this output at the very end of the command prompt:
2008-08-07 14:10:50.516::INFO: Started SocketConnector @ 0.0.0.0:8983
The 0.0.0.0 means it's listening to connections from any host (not just localhost, notwithstanding potential firewalls) and 8983 is the port. If Jetty reports this, then it doesn't necessarily mean that Solr was deployed successfully. You might see an error such as a stack trace in the output if something went wrong. Even if it did go wrong, you should be able to access the web server: http://localhost:8983. Jetty will give you a 404 page but it will include a list of links to deployed web applications, which will just be Solr for this setup. Solr is accessible at: http://localhost:8983/solr, and if you browse to that page, then you should either see details about an error if Solr wasn't loaded correctly, or a simple page with a link to Solr's admin page, which should be http://localhost:8983/solr/admin/. You'll be visiting that link often.
To quit Jetty (and many other command line programs for that matter), press Ctrl+C on the keyboard.
A quick tour of Solr
Start up Jetty if it isn't already running and point your browser to Solr's admin site at: http://localhost:8983/solr/admin/. This tour will help you get your bearings on this interface that is not yet familiar to you. We're not going to discuss it in any depth at this point.
This part of Solr will get a dramatic face-lift for Solr 4. The current interface is functional, albeit crude.
The top gray area in the preceding screenshot is a header that is on every page of the admin site. When you start dealing with multiple Solr instances—for example, development versus production, multicore, Solr clusters—it is important to know where you are. The IP and port are obvious. The (example) is a reference to the name of the schema—a simple label at the top of the schema file. If you have multiple schemas for different data sets, then this is a useful differentiator. Next is the current working directory cwd, and Solr's home. Arguably the name of the core and the location of the data directory should be on this overview page but they are not.
The block below this is a navigation menu to the different admin screens and configuration data. The navigation menu includes the following choices:
- SCHEMA: This retrieves the schema.xml configuration file directly to the browser. This is an important file which lists the fields in the index and defines their types.
Most recent browsers show the XML color-coded and with controls to collapse sections. If you don't see readable results and won't upgrade or switch your browser, you can always use your browser's View source command.
- CONFIG: This downloads the solrconfig.xml configuration file directly to the browser. This is also an important file, which serves as the main configuration file.
- ANALYSIS: This is used for diagnosing query and indexing problems related to text analysis. This is an advanced screen and will be discussed later.
- SCHEMA BROWSER: This is an analytical view of the schema reflecting various heuristics of the actual data in the index. We'll return here later.
- REPLICATION: This contains index replication status information. It is only shown when replication is enabled.
- STATISTICS: Here you will find stats such as timing and cache hit ratios.
- INFO: This lists static versioning information about internal components to Solr. Frankly, it's not very useful.
- DISTRIBUTION: This contains rsync-based index replication status information. This replication approach predates the internal Java based mechanism, and so it is somewhat deprecated.
- PING: This returns an XML formatted status document. It is designed to fail if Solr can't perform a search query you give it. If you are using a load balancer or some other infrastructure that can check if Solr is operational, configure it to request this URL.
- LOGGING: This allows you to adjust the logging levels for different parts of Solr at runtime. For Jetty as we're running it, this output goes to the console and nowhere else.
- JAVA PROPERTIES: This lists Java system properties, which are basically Java oriented global environment variables.
- THREAD DUMP: This displays a Java thread dump useful for experienced Java developers in diagnosing problems.
After the main menu is the Make a Query text box where you can type in a simple query. There's no data in Solr yet, so there's no point trying that right now.
- FULL INTERFACE: This brings you to a search form with more options. The form is still very limited, however, and only allows a fraction of the query options that you can submit to Solr.
Finally, the bottom Assistance area contains useful information for Solr online.
Loading sample data
Solr comes with some sample data and a loader script, found in the example/exampledocs directory.
We're going to invoke the post.jar Java program, officially called SimplePostTool, with a list of Solr-formatted XML input files. Most JAR files aren't executable, but this one is. This simple program iterates over each argument given, a file reference, and HTTP-POSTs it to Solr running on the current machine at the example server's default URL—http://localhost:8983/solr/update. Finally, it sends a commit command, which causes documents posted since the last commit to be saved and made visible. Obviously, Solr must be running for this to work, so ensure that it is first. Here is the command and its output:
>> cd example/exampledocs
>> java -jar post.jar *.xml
SimplePostTool: version 1.4
SimplePostTool: POSTing files to http://localhost:8983/solr/update
SimplePostTool: POSTing file gb18030-example.xml
SimplePostTool: POSTing file hd.xml
SimplePostTool: POSTing file ipod_other.xml
... etc.
SimplePostTool: COMMITting Solr index changes..
If you are using a Unix-like environment, you have the alternate option of using the post.sh shell script, which behaves similarly by using curl. I recommend examining the contents of the post.sh bash shell script for illustrative purposes, even if you are on Windows, as it's very short.
The post.sh and post.jar programs could be used in a production scenario, but they are intended just for demonstration of the technology with the example data.
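To make the mechanics concrete, here is a minimal Python sketch of what SimplePostTool does. This is a hypothetical helper, not part of Solr; it assumes the example server's default update URL:

```python
import urllib.request

# Default update URL of the example server (an assumption; adjust per deployment).
SOLR_UPDATE_URL = "http://localhost:8983/solr/update"

def build_update_request(body: bytes) -> urllib.request.Request:
    """Prepare an HTTP POST carrying one XML payload for Solr's update handler."""
    return urllib.request.Request(
        SOLR_UPDATE_URL,
        data=body,
        headers={"Content-Type": "text/xml; charset=utf-8"})

def post_files(paths):
    """Mirror SimplePostTool: post each file's XML, then send a single commit."""
    for path in paths:
        with open(path, "rb") as f:
            urllib.request.urlopen(build_update_request(f.read())).read()
    # Documents only become searchable after the commit.
    urllib.request.urlopen(build_update_request(b"<commit/>")).read()
```

The important detail is the final `<commit/>` post: without it, the documents are received but not yet visible to searches.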
Let's take a look at one of these XML files we just posted to Solr, monitor.xml:
<add>
  <doc>
    <field name="id">3007WFP</field>
    <field name="name">Dell Widescreen UltraSharp 3007WFP</field>
    <field name="manu">Dell, Inc.</field>
    <field name="cat">electronics</field>
    <field name="cat">monitor</field>
    <field name="features">30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</field>
    <field name="includes">USB cable</field>
    <field name="weight">401.6</field>
    <field name="price">2199</field>
    <field name="popularity">6</field>
    <field name="inStock">true</field>
    <!-- Buffalo store -->
    <field name="store">43.17614,-90.57341</field>
  </doc>
</add>
The XML schema for XML files that can be posted to Solr is very simple. This file doesn't demonstrate all of the elements and attributes, but it shows most of what matters. Multiple documents, represented by the <doc> tag, can be present in series within the <add> tag, which is recommended in bulk data loading scenarios. The other essential tag, not seen here, is <commit/>, which post.jar and post.sh send in a separate post. This syntax and command set may very well be all that you use.
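Because the format is so simple, it is easy to generate this XML programmatically. Here is a sketch using Python's standard library; the field names follow the example schema, and repeated values become repeated <field> elements:

```python
import xml.etree.ElementTree as ET

def docs_to_add_xml(docs):
    """Render a list of dicts as a Solr <add> message.

    List values become repeated <field> elements (multi-valued fields);
    booleans are lowercased to match Solr's true/false convention.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v).lower() if isinstance(v, bool) else str(v)
    return ET.tostring(add, encoding="unicode")

xml_out = docs_to_add_xml([
    {"id": "3007WFP", "cat": ["electronics", "monitor"], "inStock": True}])
```

The resulting string can be POSTed to /solr/update exactly as post.jar does, followed by a separate `<commit/>` post.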
A simple query
On Solr's main admin page, run a simple query that searches for the word monitor. Simply type this word in and click on the Search button. The resulting URL will be: http://localhost:8983/solr/select/?q=monitor&version=2.2&start=0&rows=10&indent=on
Both this form and the Full Interface one are standard HTML forms; they are as simple as they come. The form inputs become URL parameters to another HTTP GET request which is a Solr search returning XML. The form only controls a basic subset of all possible parameters. The main benefit to the form is that it applies the URL escaping for special characters in the query, and for some basic options, you needn't remember what the parameter names are. It is convenient to use the form as a starting point for developing a search, and then subsequently refine the URL directly in the browser instead of returning to the form.
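The escaping the form does for you is nothing more than standard URL encoding; a sketch of building the same search URL yourself (the function name is my own):

```python
from urllib.parse import urlencode

def select_url(query, start=0, rows=10):
    """Build a Solr select URL; urlencode escapes special characters in the query."""
    params = {"q": query, "version": "2.2", "start": start,
              "rows": rows, "indent": "on"}
    return "http://localhost:8983/solr/select/?" + urlencode(params)

url = select_url("monitor")
```

A query containing spaces or other special characters, such as `select_url("hi-def TV")`, comes out properly escaped as `q=hi-def+TV`, which is exactly what the form would have produced.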
Solr's search results are by default in XML. Most modern browsers, such as Firefox, provide a good XML view with syntax coloring and hierarchical structure collapse controls. Solr can format responses in JSON and other formats but that's a topic for another time. They have the same basic structure as the XML you're about to see, by the way.
The XML response consists of a <response/> element, which wraps the entire message. The first child element contains request header metadata. Here is the beginning of the response:
<?xml version="1.0" encoding="UTF-8"?>
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
    <lst name="params">
      <str name="indent">on</str>
      <str name="rows">10</str>
      <str name="start">0</str>
      <str name="q">monitor</str>
      <str name="version">2.2</str>
    </lst>
  </lst>
...
Here we see:
- status: Always zero unless there was a serious problem.
- QTime: The duration in milliseconds that Solr took to process the search. It does not include streaming back the response. Due to multiple layers of caching, you will find that searches you have run before often complete in a millisecond or less.
- params: Lists the request parameters. By default, it only lists parameters explicitly present in the URL; there are usually more specified in a <requestHandler/> in solrconfig.xml.
Next up is the most important part, the results.
<result name="response" numFound="2" start="0">
The numFound number is the total number of documents matching the query in the entire index. start is the offset into those matching documents at which the returned page begins. Notice there is a search parameter by the same name, as seen in the response header. There is also a parameter rows, which specifies how many matching documents to return: just 10 in this example.
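These values are straightforward to pull out of a response with standard XML tooling; a sketch against a trimmed-down version of the response shown above:

```python
import xml.etree.ElementTree as ET

# A trimmed version of the response fragments shown earlier.
RESPONSE = """<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">3</int>
  </lst>
  <result name="response" numFound="2" start="0"/>
</response>"""

root = ET.fromstring(RESPONSE)
status = int(root.find('.//int[@name="status"]').text)
qtime = int(root.find('.//int[@name="QTime"]').text)
result = root.find('result[@name="response"]')
num_found = int(result.get("numFound"))  # total matches in the whole index
start = int(result.get("start"))         # offset of the first returned document
```

From numFound, start, and rows you can compute paging: the next page simply re-issues the query with start increased by rows.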
Often, you'll want to see the score of each matching document, which is a number assigned to it based on how relevant the document is to the search query. This search response doesn't refer to scores because the score must be explicitly requested in the fl parameter, a comma-separated field list. The full interface form includes the score by default. A search that requests the score will have a maxScore attribute in the <result/> element, which is the maximum score of all documents that matched the search. It's independent of the sort order or result paging parameters.
The content of the result tag is a list of documents that matched the query. The default sort is by descending score. Later, we'll do some sorting by specified fields.
<str>30" TFT active matrix LCD, 2560 x 1600, .25mm dot pitch, 700:1 contrast</str>
<str name="includes">USB cable</str>
<str name="manu">Dell, Inc.</str>
<str name="name">Dell Widescreen UltraSharp 3007WFP</str>
The document list is pretty straightforward. By default, Solr will list all of the stored fields. Not all of the fields are necessarily stored—that is, you can query on them but not retrieve their value—an optimization choice. Notice that it uses the basic data types str, bool, date, int, and float. Also note that certain fields are multi-valued, as indicated by an arr tag.
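Dispatching on those tag names is all it takes to turn such a document into a native structure; a sketch covering the types mentioned (dates are left as plain text here):

```python
import xml.etree.ElementTree as ET

# A fragment shaped like a document from the result list above.
DOC = """<doc>
  <str name="id">3007WFP</str>
  <float name="weight">401.6</float>
  <int name="popularity">6</int>
  <bool name="inStock">true</bool>
  <arr name="cat"><str>electronics</str><str>monitor</str></arr>
</doc>"""

# Map each Solr type tag to a Python conversion; unknown tags fall back to str.
CASTS = {"int": int, "float": float, "bool": lambda s: s == "true", "str": str}

def field_value(el):
    if el.tag == "arr":  # multi-valued field: convert each child value
        return [field_value(child) for child in el]
    return CASTS.get(el.tag, str)(el.text)

record = {el.get("name"): field_value(el) for el in ET.fromstring(DOC)}
```

After this, record["cat"] is a Python list and record["inStock"] a real boolean, mirroring the multi-valued and typed fields in the response.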
This was a basic keyword search. As you start adding more options like faceting and highlighting, you will see additional XML following the result element.
Let's take a look at the statistics admin page: http://localhost:8983/solr/admin/stats.jsp. Before we loaded data into Solr, this page reported that numDocs was 0, but now it should be 17. If you are using a version of Solr other than 3.4, then this number may be different. The astute reader might observe we posted fewer XML files to Solr. The discrepancy is due to some XML files containing multiple Solr documents. maxDocs reports a number that is in some situations higher due to documents that have been deleted but not yet committed. That can happen either due to an explicit delete posted to Solr or by adding a document that replaces another in order to enforce a unique primary key. While you're at this page, notice that the request handler named /update has some stats too:
$Revision: 1165749 $
Add documents with XML
Another request handler you'll want to examine is named search, which has been processing our search queries.
These statistics are calculated only for the currently running Solr; they are not stored to disk. As such, you cannot use them for long-term statistics.
The sample browse interface
The final destination of our quick Solr tour is the so-called browse interface, available at http://localhost:8983/solr/browse. It's for demonstrating various Solr features:
- Standard keyword search. You can experiment with Solr's syntax.
- Query debugging: You can toggle display of the parsed query and document score explain information.
- Query-suggest: Start typing a word like "enco" and suddenly "encoded" will be suggested to you.
- Highlighting: The highlighting is in italics, which might not be obvious.
- More-Like-This: provides related products.
- Faceting: Field value facets, query facets, numeric range facets, and date range facets.
- Clustering: You must first start Solr as the on-screen instructions describe.
- Query boosting: By price.
- Query spellcheck: Using it requires building the spellcheck index and enabling spellcheck with a parameter.
- Geospatial search: You can filter by distance. Click on the spatial link at the top-left to enable this.
This is also a demonstration of Solritas, which formats Solr requests using templates based on Apache Velocity. The templates are VM files in example/solr/conf/velocity. Solritas is primarily for search UI prototyping. It is not recommended for building anything substantial.
The browse UI as supplied assumes the default example Solr schema. It will not work out of the box against another schema without modification.
Here is a screenshot of the browse interface. Not all of it is captured in this image.
In this article we learned how to get and run Solr, what's included, and what is where. We also learned how to import sample data.