In this chapter we will introduce Apache Solr. We will start by giving an idea about what it is and when the project began.
We will prepare our local Solr installation, using the standard distribution to run our examples; we will also see how to start/stop Solr from the command line in a simple way, how to index some example data, and how to perform the first query on the web interface.
We will also introduce some convenient tools, such as cURL, that offer a simple and effective way to play with our examples; we will use them in the next chapters too.
Apache Solr is an open source enterprise Java full-text search server. It was started in 2004 at CNET (at that time, one of the most well-known sites for news and reviews on technology); it became an Apache project in 2007, and since then it has been used for many projects and websites. It was initially conceived as a web application providing a wide range of full-text search capabilities, using and extending the power of the well-known Apache Lucene library. The two projects have been merged into a single development effort since 2010, with improved modularity.
Solr is designed to be a standalone web server, exposing full-text and other functionalities via its own REST-like services, which can be consumed in many different ways from nearly any platform or language. This is the most common use case, and we will focus on it.
It can also be used as an embedded framework if needed, adding some of its functionalities to our Java application by direct calls to its internal API. This is a special case, useful, for example, for using its features inside a desktop application. We will give some suggestions on how to start programming with an embedded Solr instance at the end of the book.
Moreover, Solr is not a database: it is very different from relational ones, as it is designed to manage indexes of the actual data (let's say, metadata useful for searching over the actual data) and not the data itself or the relations between records. However, this distinction can be very blurry in some contexts, and Solr itself is becoming a good NoSQL solution for some specific use cases. You can also see Solr as an open and evolving platform, with integrations to external third-party libraries: for data acquisition, language processing, document clustering, and more. We will have the chance to cite some of these advanced topics when needed throughout the book, to give a broader idea of the possible scenarios and to point you to interesting readings.
Solr is a very powerful, flexible, mature technology, and it offers not only powerful full-text search capabilities but also autosuggestion, advanced filtering, geocoded search, highlighting in text, faceted search, and much more. The following are the most interesting ones from our perspective:
Advanced full-text search: This is the most obvious option. If we need to create some kind of internal search engine for our site or application, or if we want more flexibility than the built-in search capabilities of our database, Solr is the best choice. Solr is designed to perform fast searches and to give us flexibility over terms, which is useful for intercepting a natural user search, as we will see later. We can also combine our search with out-of-the-box functionalities for searching over value intervals (imagine a search over a certain period of time) or by using geocoding functions.
Suggestions: Solr has components for creating autosuggestion results using internal similarity algorithms. This is useful because autosuggestion is one of the most intuitive user interface patterns; for example, think about the well-known Google search box that is shown in the following screenshot:
This simple Google search box performs queries on a remote server while we are typing, and automatically shows us some alternative term sequences that can be used for a query and have a chance of being relevant to us; it uses recurring terms and similarity algorithms over the data for this purpose. In the example, the tutorial keyword is suggested before the drupal one, as it is judged more relevant by the system. With Solr, we can provide the backend service for developing our own autosuggestion component, inspired by this example.
Language analysis: Solr permits us to configure different types of language analysis, even on a per-field basis, with the possibility of configuring them specifically for a certain language. Moreover, integrations with tools such as Apache UIMA for metadata extraction already exist; and in general, more new components will become available to plug into the architecture in the future, covering advanced language processing, information extraction capabilities, and other specific tasks.
Faceted search: This is a particular type of search based on classification. With Solr, we can perform a faceted search automatically over our fields to obtain information such as how many documents share the same value for a certain city field. This is useful to construct some kind of faceted navigation, another very familiar pattern in user experience that you probably know from e-commerce sites such as Amazon. To see an example of faceted navigation, imagine a search on the Amazon site where we are typing apache s, as shown in the following screenshot:
In the previous screenshot you can clearly recognize some facets in the top-left corner, suggesting that we will find a certain number of items under a certain specific "book category". For example, we know in advance that we will find 11 items for the facet "Books: Java Programming". From this information, we can then decide whether to narrow our search or not. If we click on the facet, a new query is performed, adding a filter based on the choice we implicitly made. This is exactly the way a Solr faceted search will perform a similar query. The term category here is somewhat misleading, as it seems to suggest a predefined taxonomy. But with Solr we can also obtain facets on our fields without explicitly classifying the documents under a certain category: it's Solr itself that automatically returns the faceted results using the current search keywords and criteria, showing us how many documents have the same value for a certain field. You may note that we have used a user interface example to give an introductory explanation of the service behind it. This is true, and we can use faceted results in many different ways, as we will see later in the book; but I feel the example should help you fix a first idea. We will explore this in Chapter 6, Using Faceted Search – from Searching to Finding.
It's easy to index data using Solr: for example, we can send data using a POST over HTTP, or we can index the text and metadata of a collection of rich documents (such as PDF, Word, or HTML) without too much effort, using the Apache Tika component. We can also read data from a database or another external data source, and configure an internal workflow to directly index it if needed, using the DataImportHandler (DIH) component.
Note that other serialization formats can also be used, designed for specific languages such as Ruby or PHP, or to directly return the serialization of a Java object. There are already some third-party wrappers developed over these services to provide integration with existing applications, from content management systems (CMS) such as Drupal and WordPress to e-commerce platforms such as Magento. In a similar way, there are integrations that use the Java APIs directly, such as Alfresco and Broadleaf (if you prefer, you can see this as a type of "embedded" example).
It's possible to start Solr with a very small configuration, adopting an almost schemaless approach; the internal schema is written in XML, and it is simple to read and write. The Solr application also gives us a default web interface for administration, simple monitoring of resource usage, and direct testing of our queries.
This list is far from exhaustive; its purpose is only to introduce some of the topics that we will see in the next chapters. If you visit the official site at http://lucene.apache.org/solr/features.html, you will find the complete list of features.
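To make a couple of the features above concrete, here is a small sketch of how range queries and facets are expressed as plain HTTP parameters. Standard Python is used only to build and inspect the URL; the price and cat field names are hypothetical, borrowed from a typical product catalog, and the facet counts at the end are invented for illustration:

```python
from urllib.parse import urlencode

# A query combining a range search (Solr syntax: field:[low TO high])
# with a facet request over the values of a hypothetical cat field.
params = {
    "q": "price:[10 TO 100]",   # only items priced between 10 and 100
    "facet": "true",            # ask Solr to compute facets...
    "facet.field": "cat",       # ...over the values of the cat field
    "wt": "json",               # response serialization format
}
url = "http://localhost:8983/solr/collection1/query?" + urlencode(params)
print(url)

# Solr returns facet counts as a flat [value, count, value, count, ...]
# list; this fragment is invented, just to show how it can be read.
facet_counts = ["electronics", 14, "memory", 3]
pairs = list(zip(facet_counts[::2], facet_counts[1::2]))
print(pairs)  # each tuple is (field value, number of matching documents)
```

Nothing more than URL parameters is involved, which is why these services can be consumed from nearly any platform or language.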
The first thing we need to start exploring Solr is a working Java installation. If you already have one (for example, if you are a Java programmer you will probably have it), you can skip to the Installing and testing Solr section directly.
If you don't have Java installed, please download the most appropriate version for your operating system from http://www.java.com/it/download/manual.jsp, paying attention to the appropriate architecture (32 or 64 bit) as well. Since it's impossible to provide a detailed step-by-step description on how to install Java for every platform, I'll ask you to follow the steps described on the Oracle site: http://www.java.com/en/download/help/download_options.xml. The described procedure is not complex; it requires only a few minutes.
The Java Virtual Machine (JVM) has been open source for some years now, so it's possible to use alternative implementations of the JVM specification. For example, there is the OpenJDK implementation for *nix users (http://openjdk.java.net/) and the IBM one (http://www.ibm.com/developerworks/java/jdk/); these are largely adopted, but we will use the official Java distribution from Oracle. The official documentation warns about GNU's GCJ, which is not supported and does not work with Solr.
When Java is installed, we need to configure two environment variables: PATH and CLASSPATH. These are described at the Oracle website: http://docs.oracle.com/javase/tutorial/essential/environment/paths.html.
The PATH variable is generally used to make a command available in the terminal without having to prepend its full path; basically, we will be able to call the Java interpreter from within every folder. For example, the simplest way to verify that this variable is correctly configured is to ask for the installed Java version, which is done as follows:
>> java -version
Once Java is correctly installed, it's time to install Solr and make some initial tests. To simplify things, we will adopt the default distribution that you can download from the official page: http://lucene.apache.org/solr/ (the current version at the time of writing is Version 4.5). The zipped package can be extracted and copied to a folder of your choice.
Once extracted, the Solr standard distribution will contain the folders shown in the following screenshot:
We will start Solr from here; even if we don't need to use all the libraries and examples obtained with the distribution, you can continue exploring the folders with your own examples after reading this book. Some of the folders are as follows:
/solr: This represents a simple single core configuration
/multicore: This represents a multiple core (multicore) configuration example
/example-DIH: This provides examples for the data import handler capabilities
/exampledocs: This contains some toy data to play with
For the moment, we will ignore the folders external to /example. These folders will be useful later, when we will need to use third-party libraries.
The simplest way to run the Solr instance is by using the start.jar launcher, which we can find in the /example folder. For our convenience, it's useful to define a new environment variable SOLR_DIST that will point to the absolute path:
/the-path-of-solr-distribution/example. In order to use the examples in the simplest way, I suggest you put the unzipped Solr distribution next to the SolrStarterBook folder, which is the folder where you have the complete code examples for this book. We can easily create this new environment variable in the same way we created the PATH variable.
Ok, now it's time to start Solr for the first time.
>> cd %SOLR_DIST% (Windows)
>> cd $SOLR_DIST (Mac, Linux)
We change the directory to
/example, and then we finally start Solr using the following command:
>> java -jar start.jar
We should obtain an output similar to the one seen in the following screenshot:
You will quickly become familiar with the highlighted line in this output, as it is easily recognizable (it ends in 0.0.0.0:8983). If we have not noticed any errors before it in the output, our Solr instance is running correctly. While Solr is running, you can leave the terminal window open and minimized, in order to be able to see what happened when you need to, in particular if there were errors. This can be avoided on production systems, where we will have scripts to automate starting and stopping Solr, but it's useful for our testing.
When you wish to stop Solr, simply press the Ctrl + C combination in the terminal window.
The start.jar launcher is a small Java library that starts an embedded Jetty container to run the Solr application. By default, this application will be running on port 8983. Jetty is a lightweight container that has been adopted for distributing Solr for its simplicity and small memory footprint. While Solr is distributed as a standard Java web application (you can find a solr.war file under the /example/webapps folder), and its WAR file can therefore be deployed to any application server (such as Tomcat, JBoss, and others), the standard and preferable way to use it is with the embedded Jetty instance. So we will start with the local Jetty instance bundled with Solr, in order to let you familiarize yourself with the platform and its services, using a standard installation where you can also follow the tutorials on your own.
Note that in our example we need to change the current directory to /example, which is included in the folder unzipped from the standard distribution archive. The start.jar tool is designed to start the local Jetty instance, accepting parameters for changing the Solr configuration. If it does not receive any particular option (as in our case), it looks for the Solr configuration of the default examples, so it needs to be started from that specific directory. In a similar way, the post.jar tool can be started from any directory containing the data to be sent to Solr.
If you want to change the default port value for Jetty (for example, if the default port is occupied by other programs), you should look at the jetty.xml file in the [SOLR_DIST]/etc directory, where I wrote [SOLR_DIST] in place of the Windows, Mac, and Linux versions of the same environment variable. If you also want some control over the logging inside the terminal (sometimes it can become very annoying to find errors inside a huge quantity of lines scrolling fast), please look at the logging.properties file in the same directory.
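For reference, in the Solr 4.x distribution the port in jetty.xml is not hardcoded but read from a system property; the relevant element looks roughly like the following (paraphrased here, so check your own copy of the file):

```xml
<!-- fragment of [SOLR_DIST]/etc/jetty.xml (paraphrased): the connector
     port comes from the jetty.port system property, defaulting to 8983 -->
<Set name="port">
  <SystemProperty name="jetty.port" default="8983"/>
</Set>
```

This means that a quick alternative to editing the file is passing the property at startup, for example java -Djetty.port=8984 -jar start.jar.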
Now that the server is running, we are curious about how the Solr web application will look in our browser, so let's copy and paste this URL into the browser:
http://localhost:8983/solr/#/. We will obtain the default home screen for the Solr web application, as shown in the following screenshot:
Note that since the default installation does not provide automatic redirection from the base root to the path seen before, a very common error is pointing to
http://localhost:8983/ and obtaining the error shown in the following screenshot:
We can execute our first query from the default admin screen at http://localhost:8983/solr/#/collection1/query. (In the next chapter, we will introduce the concept of a core, so please be patient if some things are not yet explained.)
Now that we have prepared the system and installed Solr, we are ready to post some example data as suggested by the default tutorial. In order to check if our installation is working as expected, we need to perform the following steps:
We can easily post some of the example data contained in the /example/exampledocs folder of our Solr installation. First of all, we move to that directory using the following commands:
>> cd %SOLR_DIST%\exampledocs (Windows)
>> cd $SOLR_DIST/exampledocs (Linux, Mac)
Then we will index some data using the provided post.jar library, with the following command:
>> java -jar post.jar .
In the /example/exampledocs subfolder, you can find some documents written using the XML, CSV, and JSON formats that Solr recognizes for indexing data. The post.jar Java library is designed to send every file contained in a directory (in this case, the current directory) written in one of these formats to a running Solr instance, in this case the default one. The data is sent by an HTTP POST request, which should explain the name.
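If you are curious about what post.jar actually sends, it can be mimicked with a few lines of code: documents are wrapped in the <add> element of the Solr XML update format and posted to the update service. A minimal sketch follows (the document fields are invented, and the actual POST is left commented out because it needs the running local instance):

```python
from urllib.request import Request, urlopen

# A document in the Solr XML update format (fields invented for illustration).
doc = (
    "<add>"
    "<doc>"
    '<field name="id">test-001</field>'
    '<field name="name">A hypothetical electronic device</field>'
    "</doc>"
    "</add>"
)

# Roughly what post.jar does for each file: an HTTP POST to the update
# service, asking for a commit so the new document becomes searchable.
req = Request(
    "http://localhost:8983/solr/collection1/update?commit=true",
    data=doc.encode("utf-8"),
    headers={"Content-Type": "text/xml"},
)
# urlopen(req)  # uncomment only with Solr running locally
print(doc)
```

Every tool that indexes data into Solr over HTTP reduces, in the end, to a request of this shape.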
Once the example data is indexed, we can again run a query with simple parameters, as shown in the following screenshot:
Here, we are able to see some results, exposed by default using the JSON format. The example data describes items in a hypothetical catalog of electronic devices.
As you can see in the screenshot, the results are recognizable as items inside a docs collection; in the first one, we can see fields containing a single value as well as multivalued fields (the latter are easily recognizable by the [ , ] JSON syntax for lists). The header section of the results contains some general information, for example, the query sent (q=*:*, which basically means "I want to obtain all the documents") and the format chosen for the output (in our case, JSON). Moreover, you should note that the number of results is 32, which is bigger than the number of files in that directory. This suggests that we can send more than one document in a single post (we will see this in later chapters).
Lastly, you can see in the address that we are actually querying over a subpath called collection1. This is the name of the default collection where we have indexed our example data. In the next chapter, we will start using our own first collection instead of this example one.
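All the elements just described (the header, numFound, and the docs list) are easy to reach programmatically once the response is parsed as JSON. The fragment below is reduced and rewritten by hand, only shaped like the real output:

```python
import json

# A hand-reduced fragment shaped like a real Solr JSON response.
raw = """
{
  "responseHeader": {"status": 0, "params": {"q": "*:*", "wt": "json"}},
  "response": {
    "numFound": 32,
    "start": 0,
    "docs": [
      {"id": "SP2514N", "cat": ["electronics", "hard drive"]}
    ]
  }
}
"""

data = json.loads(raw)
print(data["response"]["numFound"])  # total matches, not only the page returned
first_doc = data["response"]["docs"][0]
print(first_doc["cat"])              # a multivalued field arrives as a JSON list
```

Note that numFound counts all the matching documents, while docs contains only the page of results actually returned.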
If you look at the top of the previous screenshot containing the results, you will recognize the address http://localhost:8983/solr/collection1/query?q=*:*&wt=json&indent=true. It represents a specific query with its parameters. You can copy this address and paste it directly into the browser to obtain the same results as before, without necessarily passing through the web interface. Note that the browser will encode some characters when sending the query via HTTP; for example, the character : will be encoded as %3A. This will be one of our methods for directly testing queries. While the browser can be more comfortable in many cases, a command-line approach is surprisingly clearer in many others, and I want you to be familiar with both.
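The same encoding performed by the browser can be reproduced with any standard URL library, which is handy when queries are built programmatically; for example:

```python
from urllib.parse import quote, unquote

# Percent-encode the q=*:* value: ':' is not safe inside a query string.
encoded = quote("*:*", safe="")
print(encoded)           # the ':' becomes %3A (and '*' becomes %2A)
print(unquote(encoded))  # decoding gives back the original *:*
```

Keeping encoding and decoding symmetrical like this avoids the typical copy-and-paste errors when moving queries between the browser and the command line.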
You can download the latest cURL version for your platform/architecture from here: http://curl.haxx.se/download.html.
Please remember that on Linux systems it is better to use the package manager of your distribution (yum, apt, and similar ones). For Windows users, it's important to add the cURL executable to the PATH environment variable, as we did previously for Java. This is done in order to have it usable from the command line, without having to prepend the absolute path every time.
We can execute the following query with cURL on the command line in the same way we ran it before:
>> curl -X GET "http://localhost:8983/solr/collection1/query?q=*:*&wt=json&indent=true"
From the next chapter onwards, we will use the browser and cURL interchangeably, adopting from time to time the clearest method for each specific case.
When cURL is configured, the result of the query will be the same as that seen in the browser. We generally use cURL by putting the HTTP request address, containing its parameters, in double quotes; and we will explicitly adopt the -X GET parameter to make the requests clearer. Saving the cURL requests in some .txt files permits us, for example, to fully reconstruct the exact queries sent. We can also send POST requests with cURL, which is very useful for performing indexing and administrative tasks (for example, a delete action) from the command line.
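As a first taste of such an administrative task, a delete action is just another small XML body sent by POST to the update service. A sketch in code (it removes everything matching the query, so use it only against a disposable local index; the POST itself is left commented out):

```python
from urllib.request import Request, urlopen

# XML update messages: delete all documents matching a query, then commit.
delete_body = "<delete><query>*:*</query></delete>"
commit_body = "<commit/>"

for body in (delete_body, commit_body):
    req = Request(
        "http://localhost:8983/solr/collection1/update",
        data=body.encode("utf-8"),
        headers={"Content-Type": "text/xml"},
    )
    # urlopen(req)  # uncomment only against a running, disposable instance
    print(body)
```

The equivalent cURL call sends the same body with --data-binary and a Content-Type: text/xml header.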
Solr is widely used in many different scenarios: from well-known, big news sites such as The Guardian (http://www.guardian.co.uk/) to applications as popular as Instagram (http://instagr.am/). It has also been adopted by big companies such as Apple, Disney, and Goldman Sachs; and there have been some very specific adoptions, such as the search over the http://citeseerx.ist.psu.edu/ network for scientific citations.
Two very widely adopted use cases are aggregators/metasearch engines and Online Public Access Catalogs (OPAC). The first type requires continuous indexing over sparse data and makes a business out of being able to capture users through its indexes; the second generally needs a read-only exposition of metadata with powerful search. Good examples of online catalogs that have used Solr for a long time are the Internet Archive (http://www.archive.org/), the well-known open digital library, and VuFind (http://www.vufind.org/), an open source discovery portal for libraries.
Other very common use cases include news sites, institutional sites such as US government sites, publisher sites, and others. Even though not every site will necessarily require full-text search and the other features of Solr, the use cases can fill a very long list.
A good place to play with some more queries is the official tutorial on the site http://lucene.apache.org/solr/4_5_0/tutorial.html, where you'll see some functionality that we will see in action in Chapter 3, Indexing Example Data from DBpedia – Paintings.
For a better understanding while you are reading the book and playing with our examples, you can refer to the excellent reference guide at this link: https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide. I strongly suggest that you read this once you have finished reading this book. It will help you move a step further.
We will introduce faceted search in detail in Chapter 6, Using Faceted Search – from Searching to Finding. Since this is one of the main features of Solr, you may be interested in reading about the topic from the beginning. For an introduction to faceted classification, faceted search, and faceted navigation, there are two good readings: Design Patterns: Faceted Navigation, by Jeffery Callender and Peter Morville, on the A List Apart blog at http://alistapart.com/article/design-patterns-faceted-navigation, and Usability Studies of Faceted Browsing: A Literature Review, by Jody Condit Fagan, at http://napoleon.bc.edu/ojs/index.php/ital/article/download/3144/2758.
The main focus of this book is a gradual introduction to Solr, usable by a beginner without too much code at the beginning, even if we will introduce some coding near the end. With this approach, I hope that you'll find the chance to share what you read and your ideas with your teammates, if you want, and hopefully you'll have the freedom to find your own way of adopting this technology.
I would also like to suggest the adoption of Solr at an earlier stage of development, as a prototyping tool. We will see that indexing data is easy; it doesn't matter if we do not yet have a final design for our data model. Hence, Solr provides filters and faceting capabilities that can be adopted at the beginning of the user experience design. A Solr configuration can be improved at every stage of an incremental development (not necessarily when all the actual data already exists, as you might think), without "breaking" functionalities, giving us a fast view of the data that is close to the user perspective. This can be useful to construct a working preview for our customers, flexible enough to be improved quickly later.
In order to use the scripts available in the repository of the book examples that we will use in the next chapters, we have defined a SOLR_DIST environment variable, which will be used by some useful scripts you will find in the repository. The code can be downloaded as a zipped package from https://bitbucket.org/seralf/solrstarterbook; if you are familiar with Mercurial, you can download it directly as source. Some of the scripts used to download the toy data for our indexing tests are written in the Scala language, so you can add the Scala library to the system CLASSPATH variable for your convenience, although it's not needed. We will discuss our scripts and examples later, in Chapter 3, Indexing Example Data from DBpedia – Paintings.
Q1. Which of the following are features of Solr?
Full-text and faceted search
Web crawling and site indexing
Spellchecking and autosuggestion
Q2. From which of these options can we obtain a list of all the documents in the example?
Using the query
Using the query
Using the query
Q3. Why does the standard Solr distribution include a working Jetty instance?
Because Solr can't be run without Jetty
Because we can't deploy the Solr WAR (web application) into other containers/application servers, such as Tomcat or JBoss
Because the Solr WAR needs to be run inside a web container, such as Jetty
Q4. What is cURL?
cURL is a program used for parsing data from a remote URL, using the HTTP protocol
cURL is a command line tool for transferring data with URL syntax, using the HTTP protocol
cURL is a command line tool for sending queries to Solr, using the HTTP protocol
Q5. Which of the following statements are not true?
The Solr application exposes full-text and faceted search capabilities
The Solr application can be used for adding full-text search capabilities to a database system
Solr can be used as an embedded framework in a Java application
In this first chapter, we have introduced Apache Solr; we have explained what it is and what it is not, cited its history, and explained the main role of the Apache Lucene library. We cited a list of features that are the most interesting ones from our perspective. Then we saw how to set up a simple Solr instance, how to have it running using the default distribution, and its examples. Using these examples we performed our first query by using the web interface, the direct call to Solr REST services, and the cURL command-line tool, which will be useful in the next chapters.
We ended the chapter by citing how Solr is widely adopted by players who have been using it for some years. Then we summarized the perspective we will adopt throughout the book, which will be based on prototypes and writing as little code as possible.