You're reading from  Administrating Solr

Product type: Book
Published in: Oct 2013
Publisher: Packt
ISBN-13: 9781783283255
Edition: 1st Edition
Author (1)
Surendra Mohan

Surendra Mohan, who has served a few top-notch software organizations in varied roles, is currently a freelance software consultant. He has been working on various cutting-edge technologies like Drupal, Moodle, Apache Solr, ElasticSearch, Node.js, SoapUI, and so on for the past 10 years. He also delivers technical talks at various community events like Drupal Meetups and Drupal Camps. To find out more about him, his write-ups, technical blogs, and much more, go to http://www.surendramohan.info/. He has also written the books Administrating Solr and Apache Solr High Performance published by Packt Publishing and has reviewed other technical books such as Drupal 7 Multi Site Configuration and Drupal Search Engine Optimization, as well as titles on Drupal commerce, ElasticSearch, Drupal related video tutorials, titles on OpsView, and many more. Additionally, he writes technical blogs and articles with SitePoint.com. His published blogs and articles can be found at http://www.sitepoint.com/author/smohan/.

Chapter 1. Searching Data

In this chapter, we will cover how to install Apache Solr on your system (for instance, a Windows-based system). We will also cover the following topics:

  • Request/response handling

  • Querying

  • Faceted search

  • Geospatial search

  • Distributed search

Let's get started.

Installation


Before starting the installation, make sure you have the necessary downloads ready.

Once you have these installers ready, you may proceed to install them as follows:

  1. Install XAMPP, and follow the instructions.

  2. Install Tomcat, and follow the instructions.

  3. Install the latest Java JDK.

    By now there must be a folder called /xampp in your C Drive (by default). Navigate to the xampp folder and find xampp-control application (shown in the following screenshot) and then start it.

  4. Start Apache, MySQL, and Tomcat services and click on the Services button at the right-hand side of the panel as demonstrated in the following screenshot:

  5. Locate Apache Tomcat Service, right-click on it and navigate to Properties as demonstrated in the following screenshot:

  6. After the Properties window pops up, set the Startup type to Automatic, and close the window by clicking on OK as shown in the following screenshot:

    For the next few steps, we will stop Apache Tomcat in the Services window. If this doesn't work, then click on the Stop link.

  7. Extract Apache Solr and navigate to the /dist folder. You will find a file called solr-4.3.1.war as demonstrated in the following screenshot; copy this file.

  8. Navigate to C:/xampp/tomcat/webapps/ and paste the solr-4.3.1.war file (which you have copied in the previous step) into this folder; rename solr-4.3.1.war to solr.war as shown in the following screenshot:

  9. Navigate back to <ApacheSolrFolder>/example/solr/ and copy these files as demonstrated in the next screenshot:

  10. Create a directory in C:/xampp/ called /solr/ and paste the <ApacheSolrFolder>/example/solr/ files into this directory, that is, C:/xampp/solr, as shown in the following screenshot:

  11. Now navigate to C:/xampp/tomcat/bin/tomcat6w, click on the Java Tab, and copy the command -Dsolr.solr.home=C:\xampp\solr into the Java Options section, as shown in the following screenshot:

  12. Now it is time to navigate to the services window. Start Apache Tomcat in the Services window.

  13. You are now done with installing Apache Solr in your local environment. To confirm, type http://localhost:8080/solr/admin/ into the browser and hit Enter. You should be able to see the Apache Solr Dashboard.
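The final check can also be scripted. The following is a minimal sketch, assuming the Tomcat port 8080 and the /solr/admin/ path used in the last step; the helper names solr_admin_url and is_reachable are hypothetical and not part of Solr or Tomcat.

```python
import urllib.error
import urllib.request


def solr_admin_url(host="localhost", port=8080):
    """Build the dashboard URL from the last step (host and port are assumptions)."""
    return f"http://{host}:{port}/solr/admin/"


def is_reachable(url, timeout=3.0):
    """Return True if the URL answers with HTTP 200, False on any error."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, ValueError):
        # Connection refused, timeout, HTTP error status, or malformed URL.
        return False
```

Calling is_reachable(solr_admin_url()) after starting Tomcat should return True only if the dashboard page is actually being served; a False result does not say which of the earlier steps went wrong.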

Request/response handling


Let us understand what requests and responses are, and get a brief idea of the components that handle them.

  • Request: When you search for a keyword, a query is sent to Solr asking it to find the search keywords and return the matching results. This query is called a request.

  • Response: The response is what Solr returns, and what is displayed on your screen, based on the search keywords and other parameters specified in your search query.

  • RequestHandler: This is the component responsible for answering your requests; it is registered and configured in the solrconfig.xml file. Each handler has a name and a class assigned to it. If the name starts with a /, you can reach the request handler by calling the corresponding path.

    For instance, let us consider an example of the updatehandler which is configured like this:

    <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

    In the preceding example, the handler can be reached by calling <solr_url>/update. You may visit http://lucene.apache.org/solr/4_3_1/solr-core/org/apache/solr/request/SolrRequestHandler.html to explore the list of RequestHandlers further.

    Request and response handling are the first things you should be familiar with before experimenting with the various methods of searching data. We will cover how to handle requests and responses efficiently in this section.

    Before we start with how to handle a request or response, let's walk through a few important directories that we will be using throughout the chapter, along with what they store:

  • Conf: This is one of the mandatory directories in Solr. It contains configuration-related files such as solrconfig.xml and schema.xml. You may also place your other configuration files in this directory.

  • Data: This is the directory where Solr keeps your index by default; it is also used by replication scripts. If you are not happy with this default location, you can override it in solrconfig.xml. Don't panic! If the custom directory doesn't exist, Solr will create it for you.

  • Lib: This directory is optional. JARs placed here are picked up by Solr to resolve any "plugins" defined in your solrconfig.xml or schema.xml, for example, Analyzers, RequestHandlers, and so on.

  • Bin: Replication scripts reside in this directory; it is up to you whether to have and/or use it.
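The directory layout above can be checked with a few lines of code. This is only a sketch: the lowercase directory names, the choice to treat only conf as mandatory, and the helper name inspect_solr_home are assumptions made for illustration, and the demonstration runs on a temporary directory, not a real Solr home.

```python
from pathlib import Path
import tempfile

# Directory names follow the list above; only conf is treated as mandatory here.
MANDATORY = ("conf",)
OPTIONAL = ("data", "lib", "bin")


def inspect_solr_home(home):
    """Report which of the well-known directories exist under a Solr home."""
    home = Path(home)
    report = {name: (home / name).is_dir() for name in MANDATORY + OPTIONAL}
    missing = [name for name in MANDATORY if not report[name]]
    return report, missing


# Demonstration on a throwaway directory instead of a real Solr home.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "conf").mkdir()
    (Path(tmp) / "data").mkdir()
    demo_report, demo_missing = inspect_solr_home(tmp)

print(demo_report)   # which directories were found
print(demo_missing)  # mandatory directories that are absent
```

Pointing inspect_solr_home at C:/xampp/solr (the directory created during installation) would give a quick overview of what your own instance contains.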

Requests are handled using multiple handlers and/or multiple instances of the same SolrRequestHandler class. How each handler or handler instance behaves depends on its custom configuration, and the handlers are registered with SolrCore. An alternative way to register your SolrRequestHandler with the core is through the solrconfig.xml file.

For instance:

<requestHandler name="/foo" class="solr.CustomRequestHandler">
  <!-- initialization args may optionally be defined here -->
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>
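To see how such a definition breaks down into a name, a class, and default parameters, here is a small sketch that parses it with Python's standard XML library. It assumes a well-formed version of the /foo handler (the opening tag is not self-closed); describe_handler is a hypothetical helper, not a Solr API.

```python
import xml.etree.ElementTree as ET

# A well-formed version of the /foo handler definition.
CONFIG = """
<requestHandler name="/foo" class="solr.CustomRequestHandler">
  <lst name="defaults">
    <int name="rows">10</int>
    <str name="fl">*</str>
    <str name="version">2.1</str>
  </lst>
</requestHandler>
"""


def describe_handler(xml_text):
    """Return the handler name, class, and its default parameters as a dict."""
    handler = ET.fromstring(xml_text)
    defaults = {}
    for lst in handler.findall("lst"):
        if lst.get("name") == "defaults":
            for child in lst:
                defaults[child.get("name")] = child.text
    return handler.get("name"), handler.get("class"), defaults


name, cls, defaults = describe_handler(CONFIG)
print(name, cls, defaults)
```

Because the name starts with a /, this handler would be reachable at <solr_url>/foo, and the values under defaults would apply whenever a request does not override them.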

The easiest way to implement SolrRequestHandler is to extend the RequestHandlerBase class.

Querying


Writing a simple query is an easy job. Writing a complex one is harder: you may need to search for phrases, boost and prioritize search results, nest one query inside another, or match on partial keywords. On top of this, you must write your queries with performance in mind. This is why something that seems simple at first sight can prove challenging: a complex query has to be both correct and efficient. This chapter will guide you through a few of the tasks you are likely to encounter during your everyday work with Solr.

Querying based on a particular field value

You might encounter situations wherein you need to ask for a particular field value, for instance, searching for an author of a book in an internet library or an e-store. Solr can do this for you and we will show you how to achieve it.

Let us assume that we have the following index structure (just add the following lines to the field definition section of your schema.xml file):

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="title" type="text" indexed="true" stored="true" /> 
<field name="author" type="string" indexed="true" stored="true"/>

Open the following URL in your browser to ask for a value in the author field; this sends the query to Solr:

http://localhost:8080/solr/select?q=author:surendra

You are done with your search; the documents you get back from Solr will be the ones that have the given value in the author field. Remember that the query shown in the preceding example uses the standard query parser, not dismax.

We defined three fields in the index (they are just for demonstration purposes and can be customized to your requirements). As the preceding query shows, to ask for a particular field value you need to send a q parameter in the FIELD_NAME:VALUE format, and that's it. You may extend your search by adding logical operators to the query, which increases its complexity.
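When such a query is sent from code instead of a browser, the q value has to be URL-encoded. The following sketch uses Python's standard library to build the same request; the base URL is taken from the example, and field_query_url is a hypothetical helper name.

```python
from urllib.parse import urlencode

SOLR_SELECT = "http://localhost:8080/solr/select"  # base URL from the example


def field_query_url(field, value, base=SOLR_SELECT):
    """Build a select URL whose q parameter has the FIELD_NAME:VALUE form."""
    return base + "?" + urlencode({"q": f"{field}:{value}"})


url = field_query_url("author", "surendra")
print(url)  # http://localhost:8080/solr/select?q=author%3Asurendra
```

The colon is percent-encoded as %3A; Solr decodes it back to author:surendra before parsing the query.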

Tip

If you forget to specify the field name in your query, your value will be checked against the default search field defined in the schema.xml file.

When querying for a particular field value, there are a couple of points you should know that will prove useful:

  • Single value using extended dismax query parser

    You may sometimes need to ask for a particular field value when using the dismax query parser. The dismax query parser doesn't fully support Lucene query syntax, but there is an alternative: the extended dismax query parser. It offers the same functionality as the dismax query parser and also fully supports Lucene query syntax. The query shown earlier, run with extended dismax, would look like this:

    http://localhost:8080/solr/select?q=author:surendra&defType=edismax

  • Multiple values in the same field

    You may often need to ask for multiple values in a single field. For example, suppose you want to find the solr, monitoring, and optimization values in the title field. To do that, you need to run the following query (the brackets surrounding the values are the key to this technique):

    http://localhost:8080/solr/select?q=title:(solr monitoring optimization)
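The grouping syntax for multiple values is easy to generate programmatically. Below is a short sketch for the title field, as described in the text; multi_value_query is a hypothetical helper, and adding defType=edismax is optional and shown only to combine both points from the list above.

```python
from urllib.parse import urlencode, parse_qs


def multi_value_query(field, values):
    """Group several values for one field: field:(v1 v2 v3)."""
    return f"{field}:({' '.join(values)})"


params = {
    "q": multi_value_query("title", ["solr", "monitoring", "optimization"]),
    "defType": "edismax",  # optional; only needed when using extended dismax
}
query_string = urlencode(params)
print(query_string)

# Decoding the string again recovers the original query text.
decoded = parse_qs(query_string)
print(decoded["q"][0])
```

Spaces become + and the brackets are percent-encoded, so the string is safe to append to the select URL after a ? character.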

Searching for a phrase


There might be situations wherein you need to find a document by its title among millions of documents, for which a plain string-based search is not a good idea. So the question is: can this be achieved using Solr? Fortunately, yes, and the next example will guide you through it.

Assume that you have the following type defined, which needs to be added to your schema.xml file:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"> 
<analyzer> 
<tokenizer class="solr.WhitespaceTokenizerFactory"/> 
<filter class="solr.LowerCaseFilterFactory"/> 
<filter class="solr.SnowballPorterFilterFactory" language="English"/> 
</analyzer> 
</fieldType>

And then, add the following fields to your schema.xml.

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="title" type="text" indexed="true" stored="true" />

Assume that your data looks like this:

<add> 
<doc>
<field name="id">1</field> 
<field name="title">2012 report</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">2007 report</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">2012 draft report</field> 
</doc> 
</add>

Now, let us instruct Solr to find the documents that have the phrase 2012 report in the title. Send the following query to Solr:

http://localhost:8080/solr/select?q=title:"2012 report"

If you get the following result, bingo! Your query worked:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="q">title:"2012 report"</str> 
</lst> 
</lst> 
<result name="response" numFound="1" start="0"> 
<doc> 
<str name="id">1</str> 
<str name="title">2012 report</str> 
</doc> 
</result> 
</response>

The debug query (the debugQuery=on parameter) shows us which Lucene query was made:

<str name="parsedquery">PhraseQuery(title:"2012 report")</str>

As you must have noticed, we got just one document as a result of our query; even the document titled 2012 draft report was omitted, which is exactly the output we want.
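If you consume such a response from a program, the number of hits and the returned fields can be read directly from the XML. This sketch parses a trimmed copy of the response shown above (the XML declaration and the params block are left out); summarize is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# A trimmed copy of the response body, without the XML declaration.
RESPONSE = """
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">1</int>
  </lst>
  <result name="response" numFound="1" start="0">
    <doc>
      <str name="id">1</str>
      <str name="title">2012 report</str>
    </doc>
  </result>
</response>
"""


def summarize(xml_text):
    """Return (numFound, list of {field: value} dicts) from a Solr XML response."""
    root = ET.fromstring(xml_text)
    result = root.find("result")
    docs = [{f.get("name"): f.text for f in doc} for doc in result.findall("doc")]
    return int(result.get("numFound")), docs


found, docs = summarize(RESPONSE)
print(found, docs)
```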

We have used only two fields because this demonstration is concerned solely with searching for a phrase within the title field.

Here the standard Solr query parser has been used; hence, the field name and the associated value we are looking for can be specified. The query differs from a standard word-search query in that the " character appears at both the start and the end of the value. This tells Solr to treat the search as a phrase query instead of a term query (which is what makes the difference!). A phrase query tells Solr to match all the words as a single unit, in order, and not individually.

In addition to this, the debug query confirmed that a phrase query (that is, the desired one) was made instead of a standard term query.
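The difference between matching words individually and matching them as a phrase can be illustrated with a toy model over the three sample titles. This is not how Lucene implements phrase queries; it only lowercases and splits on whitespace (stemming is omitted), and the term-style match here assumes that every word is required.

```python
TITLES = {"1": "2012 report", "2": "2007 report", "3": "2012 draft report"}


def tokenize(text):
    """Lowercase and split on whitespace (stemming is left out of this toy model)."""
    return text.lower().split()


def matches_all_terms(title, query):
    """Term-style match: every query word occurs somewhere in the title."""
    tokens = tokenize(title)
    return all(word in tokens for word in tokenize(query))


def matches_phrase(title, query):
    """Phrase-style match: the query words occur side by side, in order."""
    tokens, words = tokenize(title), tokenize(query)
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))


term_hits = [doc_id for doc_id, t in TITLES.items() if matches_all_terms(t, "2012 report")]
phrase_hits = [doc_id for doc_id, t in TITLES.items() if matches_phrase(t, "2012 report")]
print(term_hits)    # ['1', '3']
print(phrase_hits)  # ['1']
```

Document 3 contains both words but has draft between them, so it satisfies the term-style match and fails the phrase match, which mirrors the single hit returned by Solr.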

Boosting phrases over words


Imagine that you operate in a competitive market and one day the quality of your product's search results suddenly drops. To recover, you would probably like to favor documents that contain the exact phrase typed by the end user over documents that match only the separate words. This section shows how to achieve this.

We will use the dismax query parser instead of the standard one. Moreover, we will reuse the same schema.xml that was shown in the Searching for a phrase section of this chapter.

Our sample data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="title">Annual 2012 report final draft</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">2007 report</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">2012 draft report</field> 
</doc> 
</add>

As mentioned earlier, we would like to boost, or give preference to, those documents that have phrase matches over the others matching the query. To achieve this, run the following query against your Solr instance:

http://localhost:8080/solr/select?defType=dismax&pf=title^100&q=2012+report&qf=title

And the desired result should look like:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="qf">title</str> 
<str name="pf">title^100</str> 
<str name="q">2012 report</str> 
<str name="defType">dismax</str>
</lst> 
</lst> 
<result name="response" numFound="2" start="0"> 
<doc> 
<str name="id">1</str> 
<str name="title">Annual 2012 report final draft</str> 
</doc> 
<doc> 
<str name="id">3</str> 
<str name="title">2012 draft report</str> 
</doc> 
</result> 
</response>

This example adds a couple of parameters that might be new to you. Don't worry! I will explain all of them. The first parameter is defType, which tells Solr which query parser we will be using (dismax in our case). If you are not familiar with dismax or would like to learn more about it, http://wiki.apache.org/solr/DisMax is where you should go! One of the features of this query parser is the ability to tell Solr which fields should be used to search for phrases; this is achieved using the pf parameter. The pf parameter takes a list of fields with the boost that corresponds to each of them; for instance, pf=title^100 means that a phrase found in the title field will be boosted with a value of 100. The q parameter is the standard query parameter, which you might be familiar with. In our example, we passed the two words we are searching for, and the result behaves like an AND query: only documents in which both 2012 and report occur in the title are returned.
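The effect of pf can be pictured with a toy ranking over the sample data. The scoring below is invented purely for illustration (one point per matched word plus a flat bonus for a phrase match) and is not Lucene's scoring formula; PHRASE_BOOST simply mirrors the 100 in pf=title^100.

```python
DOCS = {
    "1": "Annual 2012 report final draft",
    "2": "2007 report",
    "3": "2012 draft report",
}
PHRASE_BOOST = 100  # mirrors pf=title^100; the scoring itself is a toy model


def contains_phrase(tokens, words):
    """True if words appear consecutively, in order, inside tokens."""
    return any(tokens[i:i + len(words)] == words
               for i in range(len(tokens) - len(words) + 1))


def rank(query, docs=DOCS):
    """Keep documents containing every query word; add a bonus for phrase matches."""
    words = query.lower().split()
    scored = []
    for doc_id, title in docs.items():
        tokens = title.lower().split()
        if not all(word in tokens for word in words):
            continue  # both words are required in this model
        score = len(words) + (PHRASE_BOOST if contains_phrase(tokens, words) else 0)
        scored.append((score, doc_id))
    # Highest score first; the sort is stable, so ties keep insertion order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored]


print(rank("2012 report"))  # ['1', '3']
```

Document 1 contains the words side by side and therefore receives the bonus, document 3 matches both words without the phrase, and document 2 is dropped because it lacks 2012. This is the same ordering as in the Solr response.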

Tip

Remember that you can't pass a query such as fieldname:value to the q parameter when using the dismax query parser. The fields you are searching against should be specified using the qf parameter.

Prioritizing your document in search results


You might come across situations wherein you need to promote some of your products and would like them to appear above other documents in the search result list. You might also want this to be flexible, defining specific queries that apply only to these products and not to the others. To achieve this, you might think of options such as query-time boosting, index-time boosting, or perhaps some special field. Don't worry! Solr helps you out with a robust component known as QueryElevationComponent.

As QueryElevationComponent favors specific documents, it affects how the other documents appear in the search results. Thus, it is recommended to use this feature only when it is required.

First of all, let us add the component definition in the solrconfig.xml file, which should look like this:

<searchComponent name="elevator" class="solr.QueryElevationComponent" > 
<str name="queryFieldType">string</str> 
<str name="config-file">elevate.xml</str> 
</searchComponent>

Now we will add the appropriate request handler that will include the elevation component. We will name it /promotion, because this feature is mainly used to promote your documents in search results. Add this to your solrconfig.xml file:

<requestHandler name="/promotion" class="solr.SearchHandler"> 
<arr name="last-components"> 
<str>elevator</str> 
</arr> 
</requestHandler>

You must have noticed a mysterious file, elevate.xml, referenced in the query elevation component. It is placed in the configuration directory of the Solr instance and contains the following data:

<?xml version="1.0" encoding="UTF-8" ?> 
<elevate> 
<query text="solr"> 
<doc id="3" /> 
<doc id="1" /> 
</query> 
</elevate>

Here we want our documents with identifiers 3 and 1 to be in the first and second positions respectively in the search result list.

Now it is time to add the following field definitions to the schema.xml file:

<field name="id" type="string" indexed="true" stored="true" required="true" /> 
<field name="name" type="text" indexed="true" stored="true" />

The following data has been indexed:

<add> 
<doc> 
  <field name="id">1</field> 
  <field name="name">Solr Optimization</field> 
</doc> 
<doc>
  <field name="id">2</field> 
  <field name="name">Solr Monitoring</field> 
</doc> 
<doc> 
   <field name="id">3</field> 
   <field name="name">Solr annual report</field> 
</doc> 
</add>

Now, it's time to run the following query:

http://localhost:8080/solr/promotion?q=solr

If you get the following result, you can be assured that your query worked out successfully:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">1</int> 
<lst name="params"> 
<str name="q">solr</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0"> 
<doc> 
<str name="id">3</str> 
<str name="name">Solr annual report</str> 
</doc> 
<doc> 
<str name="id">1</str> 
<str name="name">Solr Optimization</str> 
</doc> 
<doc> 
<str name="id">2</str> 
<str name="name">Solr Monitoring</str> 
</doc> 
</result> 
</response>

In the first part of the configuration, we have defined a new search component (elevator component in our case) and a class attribute (the QueryElevationComponent class in our case). Along with these, we have two additional attributes that define the elevation component behavior which are as follows:

  • queryFieldType: This attribute tells Solr which type of field should be used to parse the query text that is given to the component (for example, if you want the component to ignore letter case, you should set this parameter to the field type that makes its contents lowercase)

  • config-file: This is the configuration file which will be used by the component. It denotes the path of the file that defines query elevation. This file resides either at ${instanceDir}/conf/${config-file} or at ${dataDir}/${config-file}. If the file exists in the /conf/ directory, it is loaded during startup. If the file exists in the data directory, it is reloaded for each IndexReader.

Now, let us step into the next part of solrconfig.xml, which is search handler definition. It tells Solr to create a new search handler with the name /promotion (the name attribute) and using the solr.SearchHandler class (the class attribute). This handler definition also tells Solr to include a component named elevator, which means that the search handler is going to use our defined component. As you might know, you can use more than one search component in a single search handler.

In the actual configuration of the elevate component, you can see that there is a query defined (the query XML tag) with an attribute text="solr", which defines the behavior of the component when a user passes solr to the q parameter. You can see a list of unique identifiers of documents that will be placed on top of the results list for the defined query under this tag, where each document is defined by a doc tag and an id attribute (which have to be defined on the basis of solr.StrField) which holds the unique identifier.

The query is made to our new handler with just a simple one word q parameter (the default search field is set to name in the schema.xml file). Recall the elevate.xml file and the documents we defined for the query we just passed to Solr. Yes of course, we told Solr that we want documents with id=3 and id=1 to be placed on first and second positions respectively in the search result list. And ultimately, our query worked and you can see the documents were placed exactly as we wanted.
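The reordering that elevation produces can be sketched in a few lines. The following toy model reads the elevate.xml content shown earlier and moves the listed documents to the top; it matches the query text by exact string comparison, whereas the real component analyzes the text using queryFieldType. The function names and the assumed natural order 1, 2, 3 are illustrative only.

```python
import xml.etree.ElementTree as ET

ELEVATE_XML = """
<elevate>
  <query text="solr">
    <doc id="3" />
    <doc id="1" />
  </query>
</elevate>
"""


def load_elevations(xml_text):
    """Map each query text to the ordered list of document ids to promote."""
    root = ET.fromstring(xml_text)
    return {q.get("text"): [d.get("id") for d in q.findall("doc")]
            for q in root.findall("query")}


def elevate(query, natural_order, elevations):
    """Put promoted ids first (in configured order), then the remaining results."""
    promoted = [doc_id for doc_id in elevations.get(query, []) if doc_id in natural_order]
    rest = [doc_id for doc_id in natural_order if doc_id not in promoted]
    return promoted + rest


elevations = load_elevations(ELEVATE_XML)
print(elevate("solr", ["1", "2", "3"], elevations))   # ['3', '1', '2']
print(elevate("other", ["1", "2", "3"], elevations))  # ['1', '2', '3']
```

Only the query text solr has an entry, so any other query leaves the natural order untouched, which is why the component affects nothing beyond the queries you configure.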

Query nesting


You might come across situations wherein you need to nest a query within another query. Let us imagine that you want to run a query using the standard request handler, but you need to embed a query that is parsed by the dismax query parser inside it. Isn't that interesting? We will show you how to do it.

Let us assume that we use the same field definition in schema.xml that was used in our previous section "Based on a partial keyword/phrase match".

Our example data looks like this:

<add> 
<doc> 
<field name="id">1</field> 
<field name="title">Reviewed solrcook book</field> 
</doc> 
<doc> 
<field name="id">2</field> 
<field name="title">Some book reviewed</field> 
</doc> 
<doc> 
<field name="id">3</field> 
<field name="title">Another reviewed little book</field> 
</doc> 
</add>

Here, we are going to use the standard query parser to support Lucene query syntax, but we would like to boost phrases using the dismax query parser. At first this seems impossible to achieve, but don't worry, we will handle it. Let us suppose that we want to find books having the words book and reviewed in their title field, and we would like to boost the book reviewed phrase by 10. Here we go with the query:

http://localhost:8080/solr/select?q=book+AND+reviewed+AND+_query_:"{!dismax qf=title pf=title^10 v=$qq}"&qq=book+reviewed&fl=*,score

The results of the preceding query should look like this:

<?xml version="1.0" encoding="UTF-8"?> 
<response> 
<lst name="responseHeader"> 
<int name="status">0</int> 
<int name="QTime">2</int> 
<lst name="params"> 
<str name="fl">*,score</str> 
<str name="qq">book reviewed</str> 
<str name="q">book AND reviewed AND _query_:"{!dismax qf=title pf=title^10 v=$qq}"</str> 
</lst> 
</lst> 
<result name="response" numFound="3" start="0" maxScore="0.77966106"> 
<doc> 
<float name="score">0.77966106</float> 
<str name="id">2</str> 
<str name="title">Some book reviewed</str> 
</doc> 
<doc> 
<float name="score">0.07087828</float> 
<str name="id">1</str> 
<str name="title">Reviewed solrcook book</str> 
</doc> 
<doc> 
<float name="score">0.07087828</float> 
<str name="id">3</str> 
<str name="title">Another reviewed little book</str> 
</doc> 
</result> 
</response>

As we have used the same simple index as before, let us skip its description and move on to the query itself.

The q parameter is built of two parts connected with the AND operator. The first part is just a usual query that joins the two words with the logical operator AND. The second part starts with a strange-looking expression, _query_. This expression tells Solr that another query should be made that will affect the results list. We then see the expression stating that Solr should use the dismax query parser (the !dismax part) along with the parameters that will be passed to the parser (qf and pf).

Note

The v parameter is an abbreviation for value, and it supplies the query text for the nested parser. Here, v=$qq refers to the qq request parameter, so the value of qq is what gets passed to the dismax query parser.

And that's it! We get the search results we expected.
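Because the nested query contains quotes, braces, and a dollar sign, it is easy to get the encoding wrong when building the URL by hand. Here is a hedged sketch that assembles it with Python's standard library, using the parameter values echoed in the response above; the base URL comes from the examples, and nothing here is specific to a Solr client library.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

BASE = "http://localhost:8080/solr/select"

params = {
    # Outer query for the standard parser, with an embedded dismax sub-query.
    "q": 'book AND reviewed AND _query_:"{!dismax qf=title pf=title^10 v=$qq}"',
    # The text that v=$qq refers to.
    "qq": "book reviewed",
    # Return all stored fields plus the score.
    "fl": "*,score",
}
url = BASE + "?" + urlencode(params)
print(url)

# Round trip: decoding the URL yields the same parameters again.
decoded = {key: values[0] for key, values in parse_qs(urlsplit(url).query).items()}
print(decoded == params)
```

Keeping the phrase text in a separate qq parameter means the user's input never has to be spliced into the local-params string, which avoids quoting problems inside the nested query.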

Summary


In this chapter, we have learned how to query Solr based on different criteria such as field value, usage of the extended dismax query parser, sorting your search results, phrase search, boosting and prioritizing your documents in the search results, and nesting your queries. By now you must have also learned what faceted, geospatial, and distributed searches are and how to work with them in varied scenarios and conditions.

In the next chapter, we will learn different ways of monitoring Solr, performance metrics we should know, agent-based and agent-less health checks, and how to monitor Solr using monitoring tools like Opsview, New Relic, and SPM.
