About this book

Implementing different search engines on web products is a mandate these days. Apache Solr is a robust search engine, but simply implementing Apache Solr and forgetting about it is not a good idea, especially when you have to fight for the search ranking of your web product. In such a scenario, you need to keep monitoring, administrating, and optimizing your Solr to retain your ranking.

"Administrating Solr" is a practical, hands-on guide. This book will provide you with a number of clear, step-by-step exercises and some advanced concepts which will help you administrate, monitor, and optimize Solr using Drupal and associated scripts. Administrating Solr will also provide you with a solid grounding on how you can use Apache Solr with Drupal.

"Administrating Solr" starts with an overview of Apache Solr and the installation process to get you familiar with Solr. It then gradually moves on to discuss the mysteries that make Solr flexible enough to render appropriate search results in different scenarios. This book will take you through clear and practical concepts that will help you monitor, administrate, and optimize your Solr appropriately using both scripts and tools. This book will also teach you ways to query your search and methods to keep your Solr healthy and well maintained. With this book, you will learn how to effectively implement and optimize Solr using Drupal.

Publication date:
October 2013
Publisher
Packt
Pages
120
ISBN
9781783283255

 

Chapter 1. Searching Data

In this chapter we will cover how to install Apache Solr on your system, using a Windows-based system as the example. We will also cover the following in this chapter:

  • Request/response handling

  • Querying

  • Faceted search

  • Geospatial search

  • Distributed search

Let's get started.

 

Installation


Before we get ready for the installation, you need to have the necessary downloads ready.

Once you have the installers ready, you may proceed to install them as follows:

  1. Install XAMPP, and follow the instructions.

  2. Install Tomcat, and follow the instructions.

  3. Install the latest Java JDK.

    By now there should be a folder called /xampp in your C drive (by default). Navigate to the xampp folder, find the xampp-control application (shown in the following screenshot), and start it.

  4. Start Apache, MySQL, and Tomcat services and click on the Services button at the right-hand side of the panel as demonstrated in the following screenshot:

  5. Locate Apache Tomcat Service, right-click on it and navigate to Properties as demonstrated in the following screenshot:

  6. After the Properties window pops up, set the Startup type to Automatic, and close the window by clicking on OK as shown in the following screenshot:

    For the next few steps, we will stop Apache Tomcat in the Services window. If this doesn't work, then click on the Stop link.

  7. Extract Apache Solr and navigate to the /dist folder. You will find a file called solr-4.3.1.war as demonstrated in the following screenshot; copy this file.

  8. Navigate to C:/xampp/tomcat/webapps/ and paste the solr-4.3.1.war file (which you have copied in the previous step) into this folder; rename solr-4.3.1.war to solr.war as shown in the following screenshot:

  9. Navigate back to <ApacheSolrFolder>/example/solr/ and copy these files as demonstrated in the next screenshot:

  10. Create a directory in C:/xampp/ called /solr/ and paste the <ApacheSolrFolder>/example/solr/ files into this directory, that is, C:/xampp/solr, as shown in the following screenshot:

  11. Now navigate to C:/xampp/tomcat/bin/tomcat6w, click on the Java Tab, and copy the command -Dsolr.solr.home=C:\xampp\solr into the Java Options section, as shown in the following screenshot:

  12. Now it is time to navigate to the services window. Start Apache Tomcat in the Services window.

  13. Now you are done with installing Apache Solr in your local environment. To confirm, type http://localhost:8080/solr/admin/ into the browser and hit Enter. You should be able to see the Apache Solr Dashboard.

 

Request/response handling


Let us understand what a request and a response stand for and get a brief idea about the components handling these requests.

  • Request: As the name suggests, when you search for a keyword, an action is triggered (in the form of a query) asking Solr to look up the search keywords and display the relevant results. This triggered action is called a request.

  • Response: Response is nothing but what is being displayed on your screen based on the search keywords and other specifications you have stated in your search query.

  • RequestHandler: It is the component responsible for answering your requests, and it is configured in the solrconfig.xml file. Each handler has a name and a class assigned to it. If the name starts with a /, you can reach the request handler by calling the corresponding path.

    For instance, let us consider an example of the updatehandler which is configured like this:

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

    In the preceding example, the handler can be reached by calling <solr_url>/update. You may visit http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/request/SolrRequestHandler.html to explore the list of request handlers further.

    Request and response handling are the primary steps you should be aware of in order to play around with various optimal methods of searching data. We will cover how to efficiently handle requests and responses in this section.

    Before we start with how to handle a request or response, let's walk through a few of the important directories which we will be using throughout the chapter along with what they are used to store. They are:

  • Conf: It is one of the mandatory directories in Solr which contains configuration related files like solrconfig.xml and schema.xml. You may also place your other configuration files here in this directory.

  • Data: This is the directory where Solr keeps your index by default and is used by replication scripts. If you are not happy with this default location, you have enough flexibility to override it at solrconfig.xml. Don't panic! If the stated custom directory doesn't exist, Solr will create it for you.

  • Lib: This directory is optional. JAR files placed here are picked up by Solr to resolve any plugins that have been defined in your solrconfig.xml or schema.xml, for example, analyzers, request handlers, and so on.

  • Bin: Replication scripts reside here in this directory and it is up to you whether to have and/or use this directory.

Requests are handled using multiple handlers and/or multiple instances of the same SolrRequestHandler class. How a handler, or an instance of a handler class, behaves is determined by its configuration, and each handler is registered with SolrCore. An alternative way to register your SolrRequestHandler with the core is through the solrconfig.xml file.

For instance:

<requestHandler name="/foo" class="solr.CustomRequestHandler">
    <!-- initialization args may optionally be defined here -->
     <lst name="defaults">
       <int name="rows">10</int>
       <str name="fl">*</str>
       <str name="version">2.1</str>
     </lst>
  </requestHandler>

The easiest way to implement SolrRequestHandler is to extend the RequestHandlerBase class.
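
To make the relationship between a handler's name, its path, and its defaults concrete, here is a minimal Python sketch that reads a handler definition like the one above and reports where it can be reached. The base URL and the helper name describe_handler are assumptions made for illustration; this only inspects the XML and is not how Solr itself loads solrconfig.xml.

```python
import xml.etree.ElementTree as ET

# A handler definition in the style of solrconfig.xml (copied from the text).
CONFIG = """
<requestHandler name="/foo" class="solr.CustomRequestHandler">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>
"""

def describe_handler(xml_text, base_url="http://localhost:8080/solr"):
    """Return (url, class name, defaults) for one <requestHandler> element.

    base_url is an assumption matching the installation in this chapter.
    """
    handler = ET.fromstring(xml_text)
    name = handler.get("name")
    # Only names that start with "/" are reachable by path.
    url = base_url + name if name.startswith("/") else None
    defaults = {}
    for lst in handler.findall("lst"):
        if lst.get("name") == "defaults":
            for child in lst:
                defaults[child.get("name")] = child.text
    return url, handler.get("class"), defaults

url, cls, defaults = describe_handler(CONFIG)
print(url)       # path at which the handler would be reachable
print(cls)       # class configured for the handler
print(defaults)  # default parameters, as strings
```

The values under defaults are the parameters that the handler applies when the request does not supply them.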

 

Querying


Writing a simple query is an easy job; writing a complex one, which plays around with phrases, boosts and prioritizes search results, nests queries, or even matches on partial input, is more challenging. In addition, you must write your queries with performance in mind. This is why something that seems simple at first sight can prove challenging: a complex query must also be efficient. This chapter will guide you through a few of the tasks you are expected to encounter during your everyday work with Solr.

Querying based on a particular field value

You might encounter situations wherein you need to ask for a particular field value, for instance, searching for an author of a book in an internet library or an e-store. Solr can do this for you and we will show you how to achieve it.

Let us assume, we have the following index structure (just add the following lines to the field definition section of your schema.xml file).

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="title" type="text" indexed="true" stored="true" /> 
<field name="author" type="string" indexed="true" stored="true"/>

Hit the following URL on your browser to ask for a value in the author field, which will send the query to Solr.

http://localhost:8080/solr/select?q=author:surendra

You are done with your search; and the documents you get from Solr will be the ones that have the given value in the author field. Remember that the query shown in the preceding example is using a standard query parser, and not dismax.

We defined three fields in the index (which are just for demonstration purpose, and can be customized based on your requirement). As you can see in the preceding query to ask for a particular field value, you need to send a q parameter in FIELD_NAME:VALUE format and that's it. You may extend your search by adding logical operators to the query, hence increasing its complexity.

Tip

If you forget to specify the field name in your query, your value will be checked against the default search field that has been defined in the schema.xml file.

While discussing queries for a particular field value, there are a couple of points that you should know and that will prove useful:

  • Single value using extended dismax query parser

    You may sometimes need to ask for a particular field value when using the dismax query parser. The dismax query parser doesn't fully support the Lucene query syntax, but there is an alternative: the extended dismax query parser. It offers the same functionality as the dismax query parser and also fully supports the Lucene query syntax. The preceding query, using extended dismax, would look like this:

    http://localhost:8080/solr/select?q=author:surendra&defType=edismax

  • Multiple values in the same field

    You may often need to ask for multiple values in a single field. For example, you want to find the solr, monitoring and optimization values in the title field. To do that, you need to run the following query (the brackets surrounding the values are the highlights of this concept):

    http://localhost:8080/solr/select?q=title:(solr monitoring optimization)
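
When these queries are sent from a script rather than typed into a browser, the q value has to be URL-encoded. The following Python sketch builds the three kinds of request discussed above. The helper names field_query and select_url are hypothetical, and the sketch assumes plain values that contain no Lucene special characters (no escaping is attempted).

```python
from urllib.parse import urlencode

# Assumed base URL, matching the installation in this chapter.
BASE = "http://localhost:8080/solr/select"

def field_query(field, *values):
    """Build FIELD:VALUE, or FIELD:(V1 V2 ...) for several values."""
    if len(values) == 1:
        return f"{field}:{values[0]}"
    return f"{field}:({' '.join(values)})"

def select_url(q, **params):
    """Return a select URL with q and any extra parameters URL-encoded."""
    return BASE + "?" + urlencode({"q": q, **params})

single = select_url(field_query("author", "surendra"))
edismax = select_url(field_query("author", "surendra"), defType="edismax")
multi = select_url(field_query("title", "solr", "monitoring", "optimization"))

print(single)
print(edismax)
print(multi)
```

The encoded form looks different from the URLs shown in the text (for example, the colon becomes %3A and spaces become +), but Solr receives the same parameter values after decoding.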

 

Searching for a phrase


There might be situations wherein you need to find a document by its title among millions of documents, for which a plain string-based search is not a good idea. So, the question is: can this be achieved using Solr? Fortunately, yes, and the next example will guide you through it.

Assume that you have the following type defined, that needs to be added to your schema.xml file.

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"> 
<analyzer> 
<tokenizer class="solr.WhitespaceTokenizerFactory"/> 
<filter class="solr.LowerCaseFilterFactory"/> 
<filter class="solr.SnowballPorterFilterFactory" language="English"/> 
</analyzer> 
</fieldType>

And then, add the following fields to your schema.xml.

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="title" type="text" indexed="true" stored="true" />

Assume that your data looks like this:

<add> 
<doc>
<field name="id">1</field> 
<field name="title">2012 report</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">2007 report</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">2012 draft report</field> 
</doc> 
</add>

Now, let us instruct Solr to find the documents that have a 2012 report phrase embedded in the title. Execute the following query to Solr:

http://localhost:8080/solr/select?q=title:"2012 report"

If you get the following result, bingo! Your query worked:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="q">title:"2012 report"</str> 
</lst> 
</lst> 
<result name="response" numFound="1" start="0"> 
<doc> 
<str name="id">1</str> 
<str name="title">2012 report</str> 
</doc> 
</result> 
</response>

The debug query (the debugQuery=on parameter) shows us what lucene query was made:

<str name="parsedquery">PhraseQuery(title:"2012 report")</str>

As you must have noticed, we got just one document as a result of our query, omitting even the document with the title: 2012 draft report (which is very appropriate and perfect output).

We have used only two fields because this demonstration is concerned only with searching for a phrase within the title field.

The query is sent to the standard Solr query parser; hence, the field name and the associated value we are looking for can be specified. The query differs from a standard word-search query in that the value is enclosed in " characters. This tells Solr to treat the search as a phrase query instead of a term query (which actually makes the difference!): Solr searches for all the words as a single unit, and not individually.

In addition to this, the debug output confirms that a phrase query (that is, the desired one) was made instead of the standard term query.
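
The difference between a term query and a phrase query can be illustrated with a few lines of Python. This is a deliberately simplified model and not what Lucene does internally: it lowercases and splits on whitespace, ignores the stemming filter of the field type, and assumes that a term query matches when any of its words is present.

```python
# Titles from the sample data, keyed by document id.
DOCS = {
    "1": "2012 report",
    "2": "2007 report",
    "3": "2012 draft report",
}

def tokens(text):
    """Lowercase and split on whitespace (a stand-in for the analyzer)."""
    return text.lower().split()

def term_match(title, query):
    """True if any query word occurs in the title (assumed OR semantics)."""
    title_tokens = set(tokens(title))
    return any(word in title_tokens for word in tokens(query))

def phrase_match(title, phrase):
    """True if the phrase words occur in the title, adjacent and in order."""
    t, p = tokens(title), tokens(phrase)
    return any(t[i:i + len(p)] == p for i in range(len(t) - len(p) + 1))

term_hits = [doc_id for doc_id, title in DOCS.items() if term_match(title, "2012 report")]
phrase_hits = [doc_id for doc_id, title in DOCS.items() if phrase_match(title, "2012 report")]

print(term_hits)    # ['1', '2', '3']
print(phrase_hits)  # ['1']
```

Document 3 contains both words, but the word draft sits between them, so it does not match the phrase.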

 

Boosting phrases over words


Since you are in a competitive market, assume that one day the search ranking of your online product suddenly drops. To recover and survive in such a competitive market, you would probably like to favor documents that contain the exact phrase typed by the end user over documents that match only the separate words. This section shows how to achieve this.

We will use the dismax query parser instead of the standard one. Moreover, we will reuse the same schema.xml that was demonstrated in the Searching for a phrase section in this chapter.

Our sample data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="title">Annual 2012 report final draft</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">2007 report</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">2012 draft report</field> 
</doc> 
</add>

As mentioned earlier, we would like to boost or give preference to those documents that have phrase matches over others matching the query. To achieve this, run the following query to your Solr instance:

http://localhost:8080/solr/select?defType=dismax&pf=title^100&q=2012+report&qf=title

And the desired result should look like:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="qf">title</str> 
<str name="pf">title^100</str> 
<str name="q">2012 report</str> 
<str name="defType">dismax</str>
</lst> 
</lst> 
<result name="response" numFound="2" start="0"> 
<doc> 
<str name="id">1</str> 
<str name="title">Annual 2012 report final draft</str> 
</doc> 
<doc> 
<str name="id">3</str> 
<str name="title">2012 draft report</str> 
</doc> 
</result> 
</response>

We have added a couple of parameters to this example that might be new to you. Don't worry! I will explain all of them. The first parameter is defType, which tells Solr which query parser to use (dismax in our case). If you are not familiar with dismax or would like to learn more about it, http://wiki.apache.org/solr/DisMax is where you should go! One of the features of this query parser is the ability to tell Solr which fields should be used to search for phrases, and this is achieved using the pf parameter. The pf parameter takes a list of fields with their corresponding boosts, for instance, pf=title^100, which means that a phrase match found in the title field will be boosted by a value of 100. The q parameter is the standard query parameter, which you might be familiar with. In our example, it carries the two words we are searching for, and the results returned are the documents in which both '2012' and 'report' occur in the title.

Tip

You must remember that you can't pass a query such as fieldname: value to the q parameter and use dismax query parser. The fields you are searching against should be specified using the qf parameter.
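
As a sketch of how a client might assemble this request, the following Python snippet builds the dismax URL from its parts and then decodes it again to show the values Solr would receive. The function name dismax_url and the base URL are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def dismax_url(words, qf, pf, base="http://localhost:8080/solr/select"):
    """Build a dismax request: words go in q, fields in qf, phrase boosts in pf."""
    params = {"defType": "dismax", "q": words, "qf": qf, "pf": pf}
    return base + "?" + urlencode(params)

url = dismax_url("2012 report", qf="title", pf="title^100")
print(url)

# Decode the query string to confirm the parameter values.
received = parse_qs(urlsplit(url).query)
print(received)
```

Note that q contains only the words; the field to search is named in qf, in line with the tip above.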

 

Prioritizing your document in search results


You might come across situations wherein you need to promote some of your products and would like them to appear above other documents in the search result list. Additionally, you might need to do this flexibly, defining exclusive queries that apply only to these products and not to the others. To achieve this, you might think of options such as boosting, index time boosting, or perhaps some special field. Don't worry! Solr helps you out with a robust component known as QueryElevationComponent.

As QueryElevationComponent is biased to specific documents, it impacts the overall search process for other documents. Thus, it is recommended to use this feature only when it is required.

First of all, let us add the component definition in the solrconfig.xml file, which should look like this:

<searchComponent name="elevator" class="solr.QueryElevationComponent" > 
<str name="queryFieldType">string</str> 
<str name="config-file">elevate.xml</str> 
</searchComponent>

Now we will add the appropriate request handler that will include the elevation component. We will name it /promotion, because this feature is mainly used to promote your documents in search results. Add this to your solrconfig.xml file:

<requestHandler name="/promotion" class="solr.SearchHandler"> 
<arr name="last-components"> 
<str>elevator</str> 
</arr> 
</requestHandler>

You must have noticed a mysterious file, elevate.xml, that has been included in the query elevation component. It is placed in the configuration directory of the Solr instance and contains the following data:

<?xml version="1.0" encoding="UTF-8" ?> 
<elevate> 
<query text="solr"> 
<doc id="3" /> 
<doc id="1" /> 
</query> 
</elevate>

Here we want our documents with identifiers 3 and 1 to be on the first and second position respectively in the search result list.

Now it is time to add the below field definition to the schema.xml file.

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" />

The following are the data which have been indexed:

<add> 
<doc> 
  <field name="id">1</field> 
  <field name="name">Solr Optimization</field> 
</doc> 
<doc>
  <field name="id">2</field> 
  <field name="name">Solr Monitoring</field> 
</doc> 
<doc> 
   <field name="id">3</field> 
   <field name="name">Solr annual report</field> 
</doc> 
</add>

Now, it's time to run the following query:

http://localhost:8080/solr/promotion?q=solr

If you get the following result, you can be assured that your query worked out successfully:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="q">solr</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0"> 
<doc> 
<str name="id">3</str> 
<str name="name">Solr annual report</str> 
</doc> 
<doc> 
<str name="id">1</str> 
<str name="name">Solr Optimization</str> 
</doc> 
<doc> 
<str name="id">2</str> 
<str name="name">Solr Monitoring</str> 
</doc> 
</result> 
</response>

In the first part of the configuration, we have defined a new search component (elevator component in our case) and a class attribute (the QueryElevationComponent class in our case). Along with these, we have two additional attributes that define the elevation component behavior which are as follows:

  • queryFieldType: This attribute tells Solr which type of field should be used to parse the query text that is given to the component (for example, if you want the component to ignore letter case, you should set this parameter to the field type that makes its contents lowercase)

  • config-file: This is the configuration file which will be used by the component. It denotes the path of the file that defines query elevation. This file will reside either at ${instanceDir}/conf/${config-file} or ${dataDir}/${config-file}. If the file exists in the /conf/ directory, it will be loaded during startup. Conversely, if the file exists in the data directory, it will be reloaded for each IndexReader.

Now, let us step into the next part of solrconfig.xml, which is search handler definition. It tells Solr to create a new search handler with the name /promotion (the name attribute) and using the solr.SearchHandler class (the class attribute). This handler definition also tells Solr to include a component named elevator, which means that the search handler is going to use our defined component. As you might know, you can use more than one search component in a single search handler.

In the actual configuration of the elevate component, you can see that there is a query defined (the query XML tag) with an attribute text="solr", which defines the behavior of the component when a user passes solr to the q parameter. You can see a list of unique identifiers of documents that will be placed on top of the results list for the defined query under this tag, where each document is defined by a doc tag and an id attribute (which have to be defined on the basis of solr.StrField) which holds the unique identifier.

The query is made to our new handler with just a simple one word q parameter (the default search field is set to name in the schema.xml file). Recall the elevate.xml file and the documents we defined for the query we just passed to Solr. Yes of course, we told Solr that we want documents with id=3 and id=1 to be placed on first and second positions respectively in the search result list. And ultimately, our query worked and you can see the documents were placed exactly as we wanted.
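
The effect of the elevation file on the ordering can be mimicked in a few lines of Python. This is only a simplified model of the behavior described above, not the component's implementation: it compares the query text as an exact string (standing in for the queryFieldType analysis), it only reorders identifiers that are already in the result list, and the natural order used below is a hypothetical ranking.

```python
import xml.etree.ElementTree as ET

# Contents of elevate.xml from the text (XML declaration omitted).
ELEVATE_XML = """<elevate>
  <query text="solr">
    <doc id="3" />
    <doc id="1" />
  </query>
</elevate>"""

def elevated_ids(xml_text, query_text):
    """Return the document ids listed for query_text, in file order."""
    root = ET.fromstring(xml_text)
    for query in root.findall("query"):
        if query.get("text") == query_text:
            return [doc.get("id") for doc in query.findall("doc")]
    return []

def apply_elevation(ranked_ids, elevated):
    """Move elevated ids to the top, keeping the rest in their original order."""
    top = [doc_id for doc_id in elevated if doc_id in ranked_ids]
    rest = [doc_id for doc_id in ranked_ids if doc_id not in elevated]
    return top + rest

natural_order = ["1", "2", "3"]  # hypothetical ranking without elevation
promoted = apply_elevation(natural_order, elevated_ids(ELEVATE_XML, "solr"))
print(promoted)  # ['3', '1', '2']
```

A query with no entry in the file leaves the order untouched, which matches the idea that elevation is defined per query.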

 

Query nesting


You might come across situations wherein you need to nest a query within another query. Let us imagine that you want to run a query using the standard request handler, but you need to embed a query that is parsed by the dismax query parser inside it. Isn't that interesting? We will show you how to do it.

Let us assume that we use the same field definition in schema.xml that was used in our previous section "Based on a partial keyword/phrase match".

Our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="title">Reviewed solrcook book</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">Some book reviewed</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">Another reviewed little book</field> 
</doc> 
</add>

Here, we are going to use the standard query parser to support lucene query syntax, but we would like to boost phrases using the dismax query parser. At first it seems to be impossible to achieve, but don't worry, we will handle it. Let us suppose that we want to find books having the words, reviewed and book, in their title field; and we would like to boost the reviewed book phrase by 10. Here we go with the query:

http://localhost:8080/solr/select?q=reviewed+AND+book+AND+_query_:"{!dismax qf=title pf=title^10 v=$qq}"&qq=reviewed+book

The results of the preceding query should look like this:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">2</int> 
<lst name="params"> 
<str name="fl">*,score</str> 
<str name="qq">book reviewed</str> 
<str name="q">book AND reviewed AND _query_:"{!dismax qf=title pf=title^10 v=$qq}"</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0" maxScore="0.77966106"> 
<doc> 
<float name="score">0.77966106</float> 
<str name="id">2</str> 
<str name="title">Some book reviewed</str> 
</doc> 
<doc> 
<float name="score">0.07087828</float> 
<str name="id">1</str> 
<str name="title">Reviewed solrcook book</str> 
</doc> 
<doc> 
<float name="score">0.07087828</float> 
<str name="id">3</str> 
<str name="title">Another reviewed little book</str> 
</doc> 
</result> 
</response>

As you can see, we have used the same simple index as before, so let us skip its description.

Let us focus on the query. The q parameter is built of two parts connected together with AND operator. The first one reviewed+AND+book is just a usual query with a logical operator AND defined. In the second part, building the query starts with a strange looking expression, _query_. This expression tells Solr that another query should be made that will affect the results list. We then see the expression stating that Solr should use the dismax query parser (the !dismax part) along with the parameters that will be passed to the parser (qf and pf).

Note

The v parameter is an abbreviation for value and it is used to pass the query string to the nested parser (in our case, v=$qq refers to the qq request parameter, so reviewed+book is being passed to the dismax query parser).

And that's it! We get the search results that we expected.
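
Because the nested query contains quotes, braces, and a dollar sign, it is easy to get the encoding wrong when building it by hand. The following Python sketch assembles the q and qq parameters and lets the standard library encode them; the function name nested_dismax_url and the base URL are assumptions for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

def nested_dismax_url(outer, inner, qf, pf,
                      base="http://localhost:8080/solr/select"):
    """Combine an outer Lucene query with a nested dismax query.

    The nested part reads its query string from the qq parameter (v=$qq).
    """
    nested = '_query_:"{!dismax qf=%s pf=%s v=$qq}"' % (qf, pf)
    params = {"q": f"{outer} AND {nested}", "qq": inner}
    return base + "?" + urlencode(params)

url = nested_dismax_url("reviewed AND book", "reviewed book",
                        qf="title", pf="title^10")
print(url)

# Decode again to see the parameter values as Solr would receive them.
sent = parse_qs(urlsplit(url).query)
print(sent["q"][0])
print(sent["qq"][0])
```

Keeping the phrase in a separate qq parameter means the nested expression itself never has to change when the user's words do.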

 

Faceted search


One of the advantages of Solr is the ability to group results on the basis of a field's contents. This ability is called faceting, and it can help us in several tasks that we need to do in our everyday work, from getting the number of documents with the same value in a field (such as the companies from the same city), through grouping by values and ranges, to autocomplete features based on faceting. In this section, I will show you how to handle some of the important and common tasks when using faceting.

Search based on the same value range

Suppose you have an application that allows users to search for companies in Europe (for instance), and your customer wants to see, for each city in which the companies found by the query are located, the number of companies in that city. Just think how frustrating it would be to run several queries to do this. Don't panic, Solr will relieve your frustration and make this task much easier by using faceting. Let me show you how to do it.

Let us assume that we have the following index structure which we have added to our field definition section of our schema.xml file; we will use the city field to do the faceting:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" /> 
<field name="city" type="string" indexed="true" stored="true" />

And our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="name">Company 1</field> 
<field name="city">New York</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="name">Company 2</field> 
<field name="city">California</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="name">Company 3</field> 
<field name="city">New York</field> 
</doc> 
</add>

Let us suppose that a user searches for the word company. The query will look like this:

http://localhost:8080/solr/select?q=name:company&facet=true&facet.field=city

The result produced by this query looks like:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="facet">true</str> 
<str name="facet.field">city</str> 
<str name="q">name:company</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0"> 
<doc> 
<str name="city">New York</str> 
<str name="id">1</str> 
<str name="name">Company 1</str> 
</doc> 
<doc> 
<str name="city">California</str> 
<str name="id">2</str> 
<str name="name">Company 2</str> 
</doc> 
<doc> 
<str name="city">New York</str> 
<str name="id">3</str> 
<str name="name">Company 3</str> 
</doc> 
</result> 
<lst name="facet_counts"> 
<lst name="facet_queries"/> 
<lst name="facet_fields"> 
<lst name="city"> 
<int name="New York">2</int> 
<int name="California">1</int> 
</lst> 
</lst> 
<lst name="facet_dates"/>
</lst> 
</response>

Note

Notice that, besides the normal results list, we got the faceting results with the numbers that we wanted.

The index structure and data are quite simple and the field we would like to focus on is the city field based on which we would like to fetch the number of companies having the same value of this city field.

We query Solr and inform the query parser that we want the documents that have the word company in the name field, and we enable faceting by using the facet=true parameter. The facet.field parameter tells Solr which field to use to calculate the faceting numbers.

Note

You are free to specify the facet.field parameter multiple times to get the faceting numbers for different fields in the same query.

As you can see in the results list, all types of faceting are grouped in the list with the name="facet_counts" attribute. The field based faceting is grouped under the list with the name="facet_fields" attribute. Every field that you specified using the facet.field parameter has its own list which has the name attribute same as the value of the parameter in the query (in our case, city). Finally, we see the results that we are interested in: the pairs of values (the name attribute) and how many documents have that value in the specified field.
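
A client usually wants these counts as a simple mapping. The Python sketch below reads the facet_fields part of an XML response like the one shown earlier; the function name field_facets is an assumption, and the response is trimmed to the facet section to keep the example short.

```python
import xml.etree.ElementTree as ET

# The facet section of the response shown above (other parts omitted).
RESPONSE = """<response>
  <lst name="facet_counts">
    <lst name="facet_queries"/>
    <lst name="facet_fields">
      <lst name="city">
        <int name="New York">2</int>
        <int name="California">1</int>
      </lst>
    </lst>
    <lst name="facet_dates"/>
  </lst>
</response>"""

def field_facets(xml_text):
    """Return {field: {value: count}} from the facet_fields list."""
    root = ET.fromstring(xml_text)
    path = "./lst[@name='facet_counts']/lst[@name='facet_fields']/lst"
    result = {}
    for field in root.findall(path):
        result[field.get("name")] = {
            entry.get("name"): int(entry.text) for entry in field.findall("int")
        }
    return result

facets = field_facets(RESPONSE)
print(facets)  # {'city': {'New York': 2, 'California': 1}}
```

If facet.field is given several times, each field appears as its own key in the returned dictionary.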

Filter your facet results

Imagine a situation where you need to search for books in your eStore or library. If that were all, the search would be very simple. Now think of the add-on of showing the number of books that lie within a specific price range! Can Solr handle such a complex situation? I would answer yes, and here we go.

Suppose that we have the following index structure which has been added to field definition section of our schema.xml; we will use the price field to do the faceting:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" /> 
<field name="price" type="float" indexed="true" stored="true" />

Here is our example data:

<add> 
<doc> 
<field name="id">1</field> 
<field name="name">Book 1</field> 
<field name="price">70</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="name">Book 2</field> 
<field name="price">100</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="name">Book 3</field> 
<field name="price">210.95</field> 
</doc> 
<doc> 
<field name="id">4</field> 
<field name="name">Book 4</field> 
<field name="price">99.90</field> 
</doc> 
</add>

Let us assume that the user searches for a book and wishes to fetch the document count within the price range of 60 to 100 or 200 to 250.

Our query will look like this:

http://localhost:8080/solr/select?q=name:book&facet=true&facet.query=price:[60 TO 100]&facet.query=price:[200 TO 250]

The result list of our query would look like this:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="facet">true</str> 
<arr name="facet.query"> 
<str>price:[60 TO 100]</str> 
<str>price:[200 TO 250]</str> 
</arr> 
<str name="q">name:book</str> 
</lst> 
</lst> 
<result name="response" numFound="4" start="0"> 
<doc> 
<str name="id">1</str> 
<str name="name">Book 1</str> 
<float name="price">70.0</float> 
</doc> 
<doc> 
<str name="id">2</str> 
<str name="name">Book 2</str> 
<float name="price">100.0</float> 
</doc> 
<doc> 
<str name="id">3</str> 
<str name="name">Book 3</str> 
<float name="price">210.95</float> 
</doc> 
<doc> 
<str name="id">4</str>
<str name="name">Book 4</str> 
<float name="price">99.9</float> 
</doc> 
</result> 
<lst name="facet_counts"> 
<lst name="facet_queries"> 
<int name="price:[60 TO 100]">3</int> 
<int name="price:[200 TO 250]">1</int> 
</lst> 
<lst name="facet_fields"/> 
<lst name="facet_dates"/> 
</lst> 
</response>

As you can see, the index structure is quite simple and we have already discussed it earlier. So, let's omit it here for now.

Next comes the query, which deserves special attention. It starts as a standard query that asks Solr for all the documents containing the word book in the name field (the q=name:book parameter). We then enable faceting by adding the facet=true parameter. With faceting enabled, we can pass queries to the faceting mechanism, and for each of them Solr returns the number of documents in the result set that match it; in our case, we want counts for two price ranges: 60 to 100 and 200 to 250.

We achieve this by adding the facet.query parameter once for each range. The first price range is defined as a standard range query (price:[60 TO 100]). The second is the same kind of query with different bounds (price:[200 TO 250]).
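When the request is assembled in application code, the repeated facet.query parameter and the spaces and brackets in the range syntax need URL encoding. The following is a minimal Python sketch of that step, assuming the select URL used in this chapter; the build_facet_query_url helper is a hypothetical name introduced only for illustration.

```python
from urllib.parse import urlencode

# Base URL taken from the example in the text; adjust it for your own setup.
SOLR_SELECT_URL = "http://localhost:8080/solr/select"


def build_facet_query_url(query, facet_queries):
    """Return a select URL that repeats facet.query once per range.

    Hypothetical helper, not part of Solr or of any Solr client library.
    """
    params = [("q", query), ("facet", "true")]
    # A list of pairs (rather than a dict) lets the same key appear several times.
    params.extend(("facet.query", fq) for fq in facet_queries)
    return SOLR_SELECT_URL + "?" + urlencode(params)


url = build_facet_query_url(
    "name:book", ["price:[60 TO 100]", "price:[200 TO 250]"]
)
print(url)
```

The printed URL is the encoded form of the query shown earlier: spaces become +, and the colon and square brackets are percent-encoded.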

Note

The value passed to the facet.query parameter should be a Lucene query written using the default query syntax.

As you can see in the result list, the query faceting results are grouped under the <lst name="facet_queries"> XML tag, with names exactly matching the queries we passed. Solr calculated the number of books in each of the price ranges correctly, which is exactly what we set out to do.
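To see where the counts 3 and 1 come from, the same ranges can be checked by hand against the example data. The sketch below is plain Python and only illustrates the inclusive range semantics of the square brackets; it does not reflect how Solr computes facets internally, and count_in_range is a hypothetical helper.

```python
# Prices from the example data above, keyed by document id.
prices = {"1": 70.0, "2": 100.0, "3": 210.95, "4": 99.90}


def count_in_range(values, low, high):
    """Count values in [low, high]; square brackets include both ends."""
    return sum(1 for value in values if low <= value <= high)


facet_counts = {
    "price:[60 TO 100]": count_in_range(prices.values(), 60, 100),
    "price:[200 TO 250]": count_in_range(prices.values(), 200, 250),
}
print(facet_counts)  # {'price:[60 TO 100]': 3, 'price:[200 TO 250]': 1}
```

Book 2, priced at exactly 100, is counted in the first range because both bounds are inclusive.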

Autosuggest feature using faceting

Imagine that a user types a keyword to search for a book title in your web-based library, and suggestions based on the typed keyword pop up to help him/her choose the appropriate search keyword. Most well-known search engines implement such a feature, known as autocomplete or autosuggest. Why shouldn't you? The next example shows how to implement it.

Let us consider the following index structure which needs to be added in the field definition section of our schema.xml file.

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="title" type="text" indexed="true" stored="true" /> 
<field name="title_autocomplete" type="lowercase" indexed="true" stored="true" />

We also wish to add some field copying to automate some of the operations. To do so, we will add the following after the field definition section in our schema.xml file:

<copyField source="title" dest="title_autocomplete" />

We will then add the lowercase field type definition in the types definition section of our schema.xml file, which will look like this:

<fieldType name="lowercase" class="solr.TextField"> 
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/> 
<filter class="solr.LowerCaseFilterFactory" /> 
</analyzer> 
</fieldType>

Our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="title">Lucene or Solr ?</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">My Solr and the rest of the world</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">Solr recipes</field> 
</doc> 
<doc> 
<field name="id">4</field> 
<field name="title">Solr cookbook</field> 
</doc> 
</add>

Now, let us assume that the user typed the letters so in the search box, and we wish to give him/her the suggestions with the highest counts. (By default, Solr returns up to 100 values per facet field; to cap the list at the first 10 suggestions, you can add facet.limit=10 to the query.) We also wish to suggest whole titles instead of single words. To do so, send the following query to Solr:

http://localhost:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=title_autocomplete&facet.prefix=so

And here we go with the result of this query:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">16</int> 
<lst name="params"> 
<str name="facet">true</str> 
<str name="q">*:*</str> 
<str name="facet.prefix">so</str> 
<str name="facet.field">title_autocomplete</str> 
<str name="rows">0</str> 
</lst> 
</lst> 
<result name="response" numFound="4" start="0"/> 
<lst name="facet_counts"> 
<lst name="facet_queries"/> 
<lst name="facet_fields"> 
<lst name="title_autocomplete"> 
<int name="solr cookbook">1</int> 
<int name="solr recipes">1</int> 
</lst> 
</lst> 
<lst name="facet_dates"/> 
</lst> 
</response>

You can see that our index structure looks much like the one we have been using, except for the additional title_autocomplete field, which is used to provide the autosuggest feature.

We have the copy field section to automatically copy the contents of the title field to the title_autocomplete field.

We used the lowercase field type to provide the autocomplete feature regardless of the case of the letter typed by the user (lower or upper).

Now it is time to analyze the query. As you can see, we are searching the whole index (the q=*:* parameter), but we are not interested in any search results (the rows=0 parameter). We instruct Solr that we want to use the faceting mechanism (the facet=true parameter) and that it will be field-based faceting on the title_autocomplete field (the facet.field=title_autocomplete parameter). The last parameter, facet.prefix, may be new to you. It tells Solr to return only those faceting results that begin with the prefix given as the value of this parameter, which in our case is so. This parameter lets us show only the suggestions that the user is interested in, and the results confirm that we achieved what we intended.
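The combined effect of the keyword tokenizer, the lowercase filter, and facet.prefix can be imitated in a few lines of Python. This is only a sketch of the behavior described above, not Solr's implementation; the suggest function and its limit argument are hypothetical, and lowercasing the typed prefix is a choice made here on the application side.

```python
from collections import Counter

# Titles from the example data above.
titles = [
    "Lucene or Solr ?",
    "My Solr and the rest of the world",
    "Solr recipes",
    "Solr cookbook",
]


def suggest(titles, prefix, limit=10):
    """Mimic whole-title, lowercased terms filtered by a prefix."""
    # Each title stays a single term (keyword tokenizer) and is lowercased.
    terms = Counter(title.lower() for title in titles)
    matches = [
        (term, count)
        for term, count in terms.items()
        if term.startswith(prefix.lower())
    ]
    # Highest count first, ties broken alphabetically.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:limit]


print(suggest(titles, "so"))  # [('solr cookbook', 1), ('solr recipes', 1)]
```

The titles that merely contain the word Solr are not suggested, because the whole title is one term and it does not start with so.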

Tip

It is recommended not to use heavily analyzed text (for example, stemmed text) for the autosuggest field, so that the suggestions shown to the user are not altered forms of the original words.

 

Geospatial search


Geomatics (also known as geospatial technology or geomatics engineering) is the discipline of gathering, storing, processing, and delivering geographic information, or spatially referenced information. This geographic information is based on longitudes (vertical lines) and latitudes (horizontal lines) and can be used in various ways. For instance, you may wish to store the locations of a company that has multiple offices, or to sort search results by their distance from a point. In short, geospatial search is about working with coordinates across the globe.

In this section, we will talk about and understand how to:

  • Store geographical points in the index

  • Sort results by a distance from a point

Storing geographical points in the index

You might come across situations in which you need to store multiple locations of a company in the index. We could, of course, add multiple dynamic fields and remember the field names in our application, but that is not convenient. Solr can handle this situation for us, and the next example shows how to store pairs of values (in our case, location coordinates, that is, geographical points).

Let us define three fields in the field definition section of our schema.xml file to store company's data:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" /> 
<field name="location" type="point" indexed="true" stored="true" multiValued="true" /> 

In addition to the preceding fields, we shall also have one dynamic field defined in our schema.xml file as shown:

<dynamicField name="*_d" type="double" indexed="true" stored="true"/>

Our point type should look like this:

<fieldType name="point" class="solr.PointType" dimension="2" subFieldSuffix="_d"/>

Now, let us look into our example data which I stored in the geodata.xml file:

<add> 
<doc> 
<field name="id">1</field> 
<field name="name">company</field> 
<field name="location">10,10</field> 
<field name="location">30,30</field> 
</doc> 
</add>

To index our data, run the following command from the exampledocs directory (where our geodata.xml file resides):

java -jar post.jar geodata.xml

Once the data is indexed, run the following query to fetch it:

http://localhost:8080/solr/select?q=location:10,10

If you get the following response, then it's bingo! You have done it.

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">3</int> 
<lst name="params">
<str name="q">location:10,10</str> 
</lst> 
</lst> 
<result name="response" numFound="1" start="0"> 
<doc> 
<str name="id">1</str> 
<arr name="location"> 
<str>10,10</str> 
<str>30,30</str> 
</arr> 
<arr name="location_0_d"> 
<double>10.0</double> 
<double>30.0</double> 
</arr> 
<arr name="location_1_d"> 
<double>10.0</double> 
<double>30.0</double> 
</arr> 
<str name="name">company</str> 
</doc> 
</result> 
</response>

We have four fields, one of them being a dynamic field which we have defined in our schema.xml file. The first field is the one responsible for holding the unique identifier. The second one is responsible for holding the name of the company. The third one, named location, is responsible for holding the geographical points and of course can have multiple values. The dynamic field will be used as a helper for the point type.

Then, we have the point type definition, which is based on the solr.PointType class and is defined by the following two attributes:

  • dimension: The number of dimensions that the field will store. In our case, as we have stored a pair of values, we set this attribute to 2.

  • subFieldSuffix: The suffix of the helper fields that store the actual values of the field. This is where our dynamic field comes into play. Using this attribute, we instruct Solr that our helper field will be the dynamic field ending with the suffix _d.

How does this type of field actually work? When defining a two-dimensional field, like we did, there are actually three fields created in the index. The first field is named like the field we added in the schema.xml file, so in our case it is location. This field is responsible for holding the stored value of the field. Additionally, this field will only be created when we set the stored attribute of the field to true.

The next two fields are based on the dynamic field. Their names would be field_0_d and field_1_d (location_0_d and location_1_d in our case). Each name is built from the field name, the _ character, the index of the value, another _ character, and finally the suffix defined by the subFieldSuffix attribute of the type.

Now, let us understand how the data is indexed. If you look at our example data file, you will see that the values in each pair are separated by the comma character. And that's how you can add the data to the index.

Querying works the same way: the pair is passed in the query with its values separated by a comma character, which is the only difference from querying a standard single-valued field.

Looking at the response, you can see that besides the location field, there are two dynamic fields (location_0_d and location_1_d) created.
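The naming scheme of the helper fields can be reproduced with a short Python sketch. It only imitates the splitting visible in the response above and says nothing about how solr.PointType is implemented; split_points is a hypothetical helper introduced for illustration.

```python
def split_points(field_name, values, suffix="_d"):
    """Split 'a,b' point strings into per-dimension helper fields."""
    sub_fields = {}
    for value in values:
        for index, part in enumerate(value.split(",")):
            # Name pattern: field name, "_", dimension index, then the suffix.
            key = f"{field_name}_{index}{suffix}"
            sub_fields.setdefault(key, []).append(float(part))
    return sub_fields


print(split_points("location", ["10,10", "30,30"]))
# {'location_0_d': [10.0, 30.0], 'location_1_d': [10.0, 30.0]}
```

Each helper field collects one dimension across all the points, which is why both arrays in the response hold the values 10.0 and 30.0.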

Sort results by a distance from a point

Building on the previous scenario (discussed in the Storing geographical points in the index section of this chapter), imagine that you have to sort your search results by their distance from a user's location. This section shows you how to do it.

Let us assume that we have the following index structure, which we have added to the field definition section of schema.xml:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="string" indexed="true" stored="true" /> 
<field name="x" type="float" indexed="true" stored="true" /> 
<field name="y" type="float" indexed="true" stored="true" />

Here in this example, we have assumed that the user location will be provided from the application making the query.

Our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="name">Company 1</field> 
<field name="x">56.4</field> 
<field name="y">40.2</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="name">Company 2</field> 
<field name="x">50.1</field> 
<field name="y">48.9</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="name">Company 3</field> 
<field name="x">23.18</field> 
<field name="y">39.1</field> 
</doc> 
</add>

Suppose that the user is using this search application while standing at the point with latitude 0 and longitude 0 (where the equator meets the prime meridian). Our query to find the companies and sort them in ascending order of their distance from that point would be:

http://localhost:8080/solr/select?q=company&sort=dist(2,x,y,0,0)+asc

Our result would look something like this:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader">
<int name="status">0</int> 
<int name="QTime">2</int> 
<lst name="params"> 
<str name="q">company</str> 
<str name="sort">dist(2,x,y,0,0) asc</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0"> 
<doc> 
<str name="id">3</str> 
<str name="name">Company 3</str> 
<float name="x">23.18</float> 
<float name="y">39.1</float> 
</doc> 
<doc> 
<str name="id">1</str> 
<str name="name">Company 1</str> 
<float name="x">56.4</float> 
<float name="y">40.2</float> 
</doc> 
<doc> 
<str name="id">2</str> 
<str name="name">Company 2</str> 
<float name="x">50.1</float> 
<float name="y">48.9</float> 
</doc> 
</result> 
</response>

As you can see in the index structure and the data, every company is described by four fields: the unique identifier (id), company name (name), the latitude of the company's location (x), and the longitude of the company's location (y).

To achieve the expected results, we run a standard query with a non-standard sort. The sort parameter consists of a function name, dist, which calculates the distance between points. In our example, the function (dist(2,x,y,0,0)) takes five parameters, which are:

  • The first parameter is the power used to calculate the distance. In our case, the value 2 tells Solr to calculate the Euclidean distance.

  • The second parameter, x, is the field that contains the latitude.

  • The third parameter, y, is the field that contains the longitude.

  • The fourth parameter is the latitude of the point from which the distance is calculated (0 in our example).

  • The fifth parameter is the longitude of the point from which the distance is calculated (0 in our example).
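The ordering in the response can be verified by computing the Euclidean distance of each company from the point (0, 0) by hand. The following Python sketch does exactly that with the example data; the euclidean function is a hypothetical stand-in for the power-2 case of dist and is not Solr code.

```python
import math

# Coordinates from the example data above.
companies = [
    {"id": "1", "name": "Company 1", "x": 56.4, "y": 40.2},
    {"id": "2", "name": "Company 2", "x": 50.1, "y": 48.9},
    {"id": "3", "name": "Company 3", "x": 23.18, "y": 39.1},
]


def euclidean(x, y, origin_x=0.0, origin_y=0.0):
    """Straight-line distance in the plane from (origin_x, origin_y)."""
    return math.hypot(x - origin_x, y - origin_y)


# Ascending sort by distance, like sort=dist(2,x,y,0,0) asc.
ranked = sorted(companies, key=lambda c: euclidean(c["x"], c["y"]))
print([c["id"] for c in ranked])  # ['3', '1', '2']
```

Company 3 is roughly 45.5 units away, Company 1 roughly 69.3, and Company 2 roughly 70.0, which matches the order of the documents in the response.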

If you would like to explore more about the functions available for you with Solr, you may navigate to Solr Wiki page at http://wiki.apache.org/solr/FunctionQuery

 

Distributed search


Distributed search in Solr means splitting an index into multiple shards, querying those shards, and merging the results across them. Imagine a situation where either the index is too large to fit on a single system, or a query takes too long to execute. How would you handle such situations? Distributed search in Solr is designed for exactly these cases.

Let us consider the above stated scenario where you need to apply distributed search concept in order to overcome the huge index and/or query execution time concerns.

To overcome this situation, you need to distribute a request across all the shards in a list using the shards parameter. Its value follows this syntax:

host:port/base_url[,host:port/base_url]

Note

You can list any number of hosts in a single request; each host you list is one shard that the request is distributed to. How many shards you need depends on how expensive your queries are and how large your index is.
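As a rough illustration of this syntax, the Python sketch below joins a list of shard addresses with commas and appends them to a query as the shards parameter. The second host (localhost:8081) and the build_sharded_url helper are hypothetical and exist only for this example; the first host is the one used throughout this chapter.

```python
from urllib.parse import urlencode


def build_sharded_url(base_url, query, shards):
    """Return a select URL whose shards value is host:port/base_url,..."""
    params = {"q": query, "shards": ",".join(shards)}
    # Keep ",", ":" and "/" unescaped so the shard list stays readable.
    return base_url + "/select?" + urlencode(params, safe=",:/")


url = build_sharded_url(
    "http://localhost:8080/solr",
    "name:book",
    ["localhost:8080/solr", "localhost:8081/solr"],  # second host is made up
)
print(url)
```

The request is sent to one Solr instance, which then queries every shard in the list and merges the results.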

A sharded request will go to the standard request handler (not necessarily the original); however, we can override it using shards.qt. The following components support distributed search:

  • Query component

  • Facet component

  • Highlighting component

  • Stats component

  • Spell check component

  • Terms component

  • Term vector component

  • Debug component

  • Grouping component

On the other hand, distributed search has the following limitations:

  • Unique key requirements

  • No distributed IDF

  • Doesn't support QueryElevationComponent

  • Doesn't support Join

  • Index variations between stages

  • Distributed Deadlock

  • Distributed Indexing

Tip

Downloading the example code

You can download the example code files for all Packt books you have purchased from your account at http://www.packtpub.com. If you purchased this book elsewhere, you can visit http://www.packtpub.com/support and register to have the files e-mailed directly to you.

 

Summary


In this chapter, we have learned how to query Solr based on different criteria such as field value, how to use the extended dismax query parser, sort search results, perform phrase searches, boost and prioritize documents in the search results, and nest queries. By now you should also know what faceted, geospatial, and distributed searches are and how to work with them in varied scenarios and conditions.

In the next chapter, we will learn different ways of monitoring Solr, performance metrics we should know, agent-based and agent-less health checks, and how to monitor Solr using monitoring tools like Opsview, New Relic, and SPM.

About the Author

  • Surendra Mohan

    Surendra Mohan, who has served a few top-notch software organizations in varied roles, is currently a freelance software consultant. He has been working on various cutting-edge technologies like Drupal, Moodle, Apache Solr, ElasticSearch, Node.js, SoapUI, and so on for the past 10 years. He also delivers technical talks at various community events like Drupal Meetups and Drupal Camps. To find out more about him, his write-ups, technical blogs, and much more, go to http://www.surendramohan.info/.

    He has also written the books Administrating Solr and Apache Solr High Performance published by Packt Publishing and has reviewed other technical books such as Drupal 7 Multi Site Configuration and Drupal Search Engine Optimization, as well as titles on Drupal commerce, ElasticSearch, Drupal related video tutorials, titles on OpsView, and many more.

    Additionally, he writes technical blogs and articles with SitePoint.com. His published blogs and articles can be found at http://www.sitepoint.com/author/smohan/.
