Using Faceted Search, from Searching to Finding

by Alfredo Serafini | December 2013 | Beginner's Guides Open Source

In this article, Alfredo Serafini, the author of Apache Solr Beginner's Guide, covers how to perform a faceted search.


Looking at Solr's standard query parameters

The basic engine of Solr is Lucene, so Solr accepts a query syntax based on Lucene's. There are some minor differences, but they should not affect our experiments, as they involve more advanced behavior. You can find an explanation of the Solr query syntax on the wiki at http://wiki.apache.org/solr/SolrQuerySyntax.

Let's see some examples of queries using the basic parameters. Before starting our tests, we need to configure a new core again, in the usual way.

Sending Solr's query parameters over HTTP

It is important to keep in mind that our queries to Solr are sent over the HTTP protocol (unless we are using Solr in embedded mode, as we will see later). With cURL we can handle the HTTP encoding of parameters, for example:

>> curl -X POST 'http://localhost:8983/solr/paintings/select?start
=3&rows=2&fq=painting&wt=json&indent=true' --data-urlencode
'q=leonardo da vinci&fl=artist title'

This command can be used instead of the following one:

>> curl -X GET "http://localhost:8983/solr/paintings/select?q
=leonardo%20da%20vinci&fq=painting&start=3&rows=2&fl=artist%20title&wt
=json&indent=true"

Please note how, using the --data-urlencode parameter in the first example, we can write parameter values including characters which need to be encoded over HTTP.
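The same encoding can be reproduced in any HTTP client. As a hedged sketch, here is how the parameters of the example above could be assembled in Python with the standard library (the host, core name, and parameter values simply mirror the cURL examples):

```python
from urllib.parse import urlencode

# Build the query string for the example above; urlencode escapes
# the characters (spaces, colons, and so on) that HTTP requires.
params = {
    "q": "leonardo da vinci",
    "fq": "painting",
    "start": 3,
    "rows": 2,
    "fl": "artist title",
    "wt": "json",
    "indent": "true",
}
query_string = urlencode(params)
url = "http://localhost:8983/solr/paintings/select?" + query_string
print(url)
```

Note how the space in leonardo da vinci is encoded automatically, exactly as --data-urlencode does for us in cURL.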

Testing HTTP parameters on browsers

On modern browsers such as Firefox or Chrome you can inspect the parameters directly in the developer console. For example, using Chrome you can open the console with F12:

In the previous image you can see, under the Query String Parameters section on the right, that the parameters are shown in a list, and we can easily switch between the encoded and the more readable un-encoded version of the values.

If you don't like using Chrome or Firefox and want a similar tool, you can try Firebug Lite (http://getfirebug.com/firebuglite). This is a JavaScript library conceived to port the Firebug plugin functionality to virtually every browser: you simply add the library to your HTML page during the test process.

Choosing a format for the output

When sending a query to Solr directly (from the browser or cURL) we can ask for results in multiple formats, including, for example, JSON:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q
=*:*&wt=json&indent=true'

Time for action – searching all documents with pagination

When performing a query we need to remember that we are potentially asking for a huge number of documents. Let's observe how to manage partial results using pagination:

  1. For example, think about the q=*:* query seen in previous examples, which was used to ask for all the documents, without any specific criteria. In a case like this, in order to avoid problems with resources, Solr will actually send us only the first group of them, as defined by a parameter in the configuration. The default number of returned results is 10, so we need to be able to ask for a second group of results, then a third, and so on, as long as there are more. This is what is generally called pagination of results, similar to scenarios involving SQL.
  2. Executing the command:

    >> curl -X GET "http://localhost:8983/solr/paintings/select?q
    =*:*&start=0&rows=0&wt=json&indent=true"

  3. We should obtain a result similar to the following (the number of documents numFound and the time spent processing the query QTime may vary, depending on your data and your system):

In the previous image we see the same results in two different ways: on the right you'll recognize the output from cURL, while on the left you see the results directly in the browser window.

In the second case we had the JSONView plugin installed in the browser, which gives a very helpful visualization of JSON, with indentation and colors. If you want, you can install it for Chrome at:

https://chrome.google.com/webstore/detail/jsonview/chklaanhfefbnpoihckbnefhakgolnmc

For Firefox the plugin can be installed from:

https://addons.mozilla.org/it/firefox/addon/jsonview/

Note how even if we have found 12484 documents, we are currently seeing none of them in the results!

What just happened?

In this very simple example, we already used two very useful parameters: start and rows, which we should always think of as a couple, even if we may be using only one of them explicitly. We could change the default values for these parameters in the solrconfig.xml file, but this is generally not needed:

  • The start value defines the index of the first document returned in the response, from the ones matching our search criteria, counting from 0. The default value is 0.
  • The rows parameter defines how many documents we want in the results. The default value is 10.

So if, for example, we only want the second and third documents from the results, we can obtain them with the following query:

>> curl -X GET "http://localhost:8983/solr/paintings/select?q=*:*&start=1&rows=2&wt=json&indent=true"

In order to obtain the second document in the results we need to remember that the enumeration starts from 0 (so the second will be at 1), while to see the next group of documents (if present) we will send a new query with values such as start=10, rows=10, and so on. We are still using the wt and indent parameters only to have the results formatted in a readable way.

The start/rows parameters play roles in this context which are quite similar to the OFFSET/LIMIT clause in SQL.

This process of segmenting the output to be able to read it in groups or pages of results is usually called pagination, and it is generally handled by some programming code. You should know this mechanism, so that you can play with your tests even on a small segment of data without loss of generality. I strongly suggest you always add these two parameters explicitly in your examples.
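The arithmetic behind this pagination is simple enough to sketch in a few lines. The helper below is our own illustration, not a Solr API; it just translates a page number into the start/rows pair described above:

```python
def page_params(page, page_size=10):
    """Translate a zero-based page number into Solr start/rows values,
    analogous to OFFSET/LIMIT in SQL."""
    if page < 0 or page_size < 1:
        raise ValueError("page must be >= 0 and page_size must be >= 1")
    return {"start": page * page_size, "rows": page_size}

print(page_params(0))      # first page: start=0, rows=10
print(page_params(2, 10))  # third page: start=20, rows=10
```

A front end would merge these two values into the query string on every page change.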

Time for action – projecting fields with fl

Another important parameter to consider is fl, which can be used for field projection, obtaining only certain fields in the results:

  1. Suppose now that we are interested in obtaining the titles and artist references for all the documents:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =artist:*&wt=json&indent=true&omitHeader=true&fl=title,artist'

  2. We will obtain an output similar to the one shown in the following image:

  3. Note that the results will be indented as requested, and will not contain any header, to be more readable. Moreover, the parameter list does not need to be written in a specific order.
  4. The previous query could also be rewritten as:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =artist:*&wt=json&indent=true&omitHeader=true&fl=title&fl=artist'

Here we ask for the field projections one by one, which can be useful, for example, when using HTML and JavaScript widgets to compose the query following the user's choices.

What just happened?

The fl parameter stands for fields list. By using this parameter we can define a comma-separated list of field names that explicitly defines which fields are projected in the results. We can also use a space to separate fields, but in this case we should use the URL encoding for the space, writing fl=title+artist or fl=title%20artist.

If you are familiar with relational databases and SQL, you will find the fl parameter similar to the SELECT clause in SQL statements, used to project the selected fields in the results. In a similar way, writing fl=author:artist,title corresponds to the usage of aliases, for example SELECT artist AS author, title.
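To make the analogy concrete, here is a small sketch (our own helper, not part of Solr) that builds an fl value from a SELECT-like field list, where a pair expresses an alias just as SELECT field AS alias does:

```python
def build_fl(fields):
    """Build the value of the fl parameter from a field list.
    A (alias, field) tuple becomes alias:field, mirroring
    SELECT field AS alias; a plain string means no alias."""
    parts = []
    for f in fields:
        if isinstance(f, tuple):
            alias, field = f
            parts.append(f"{alias}:{field}")
        else:
            parts.append(f)
    return ",".join(parts)

# SELECT artist AS author, title  ->  fl=author:artist,title
print(build_fl([("author", "artist"), "title"]))  # author:artist,title
```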

Let's see the full list of parameters in detail:

  • The parameter q=artist:* is used in this case in place of the more generic q=*:*, to select only the documents which have a value for the artist field. The special character * is used again to indicate all the values.
  • The wt=json and indent=true parameters are used to ask for an indented JSON format.
  • The omitHeader=true parameter is used to omit the header from the response.
  • The fl=title,artist parameter represents the list of the fields to be projected in the results.

Note how the fields are projected in the results without following the order given in fl, as this order has no particular meaning for JSON output. The order will be used by the CSV response writer that we will see later, however, where controlling the column order can be mandatory.

In addition to the existing fields, which can all be selected using the * special character, we can also ask for the projection of the implicit score field. A composition of these two options can be seen in the following query:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q
=artist:*&wt=json&indent=true&omitHeader=true&fl=*,score'

This will return every field for every document, including the score field explicitly. score is sometimes called a pseudo-field, to distinguish it from the fields defined by the schema.

Time for action – selecting documents with filter query

Sometimes it's useful to be able to narrow the collection of documents on which we are currently performing our search. This lets us add some kind of explicit linked condition on the logical side, for navigation over the data, and it also has a good impact on performance.

This is shown in the following example, where the default search is restricted by the introduction of a fq=annunciation condition.

What just happened?

The first result in this simple example shows that we obtain results similar to what we could have obtained by a simple q=annunciation search. Filter queries can be cached (as can facets, which we will see later), improving performance by reducing the overhead of performing the same query many times, and of accessing the same group of documents in large datasets many times.

In this case the analogy with SQL seems less convincing, but q=dali and fq=abstract:painting can be seen as corresponding to WHERE conditions in SQL, with the fq parameter acting as a fixed condition.

In our scenario, we could for example define specific endpoints with a pre-defined filter query by author, to create specific channels. In this case, instead of passing the parameters every time, we could set them in solrconfig.xml.
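Since fq can appear several times in the same request, a client has to serialize repeated parameters correctly. As a hedged sketch, urlencode in Python accepts a sequence of pairs, which preserves every fq occurrence (the second filter here, on city_entity, is purely illustrative, not from the example above):

```python
from urllib.parse import urlencode

# A list of pairs (rather than a dict) lets the fq key repeat,
# which is how multiple filter queries travel in one request.
params = [
    ("q", "dali"),
    ("fq", "abstract:painting"),
    ("fq", "city_entity:paris"),  # a second, hypothetical filter
    ("wt", "json"),
]
query_string = urlencode(params)
url = "http://localhost:8983/solr/paintings/select?" + query_string
print(url)
```

Note that the colon in each filter is percent-encoded automatically.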

Time for action – searching for similar terms with Fuzzy search

Even if wildcard queries are very flexible, sometimes they simply cannot give us good results. There could be some weird typo in the term, and we still want to obtain good results wherever possible, under certain confidence conditions:

  1. Suppose I want to search for painting and I actually write plainthing, for example:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =abstract:plainthing~0.5&wt=json'

  2. Suppose we have a person using a different language, who searched for leonardo by misspelling the name:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =abstract:lionardo~0.5&wt=json'

In both cases the examples use misspelled words to be more recognizable, but the same syntax can be used to intercept existing similar words.

What just happened?

Both the preceding examples work as expected. The first gives us documents containing the term painting, while the second gives us documents containing leonardo. Note that the syntax plainthing~0.5 represents a query that matches with a certain confidence, so for example we will also obtain occurrences of documents with the term paintings, which is good, but in the more general case we could receive weird results. In order to properly set up the confidence value there are not many options, apart from doing tests.

Using fuzzy search is a simple way to obtain suggested results for alternate forms of a search query, just like when we trust a search engine's similar suggestions in the did you mean approach.
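Fuzzy matching in Lucene is based on the edit distance between terms. As a rough illustration of the idea (classic Levenshtein distance, a simplification and not Solr's actual implementation), we can see why lionardo is considered close to leonardo:

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance: the minimum number
    of insertions, deletions, and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,        # deletion
                            curr[j - 1] + 1,    # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

print(levenshtein("lionardo", "leonardo"))    # 1: one substitution away
print(levenshtein("plainthing", "painting"))  # 2: two deletions away
```

The smaller the distance relative to the term length, the higher the confidence of the fuzzy match.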


Time for action – prototyping an auto-suggester with facets

Let's imagine a free search box on a simple HTML prototype (we will play with it later). We have written Hen and paused a little, so the interface starts suggesting something to us. The screen we see is similar to the left side of the following screenshot:

On the left we see a simple prototype giving an idea of how results will be suggested to users while they are writing their terms for a search. In the previous screenshot we performed the following steps:

  1. The user is writing the term Hen, and some suggestions are prompted, showing how many document results will be available, and reporting on which fields the matches are actually found.
  2. On the right I put an example including only the raw suggestion results for the field artist_entity, just to give an idea of what is behind the scenes.
  3. We do not yet have a prototype, but we can easily simulate the output shown on the right with the request:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =*:*&rows=0&facet=true&facet.field=artist_entity&facet.prefix=hen&wt=json'

  4. As usual, remember to start the appropriate core first: for example, here I suppose we have defined a new core in /SolrStarterBook/solr-app/chp06/paintings.

Here we receive a short list of suggestions for the field artist_entity. As you can see, the response format gives us a suggested term, followed by the number of documents that currently match that request.

What just happened?

In this small example, we are not interested in having results (rows=0); instead we want to obtain a small list of items for a certain field (in the example, the field is artist_entity), with information about how many documents contain each term. In order to do this we activate the faceting capabilities of Solr with facet=true (it's also possible to use facet=on, as for most of the Boolean parameters), and restrict the item list to the ones that start with a particular text, using facet.prefix=hen. Note that here we have to write the term in lowercase, due to our previously adopted approach to analyzing the artist_entity field, but we could change this as usual, if we want.
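With the default json.nl format, the facet counts for a field come back as a flat list alternating term and count. As a hedged sketch of how a suggester front end might turn that into usable pairs, the following uses made-up sample data whose shape mirrors such a response (the artist names and counts are invented):

```python
# Hypothetical sample mimicking the shape of a Solr facet response
# with the default json.nl: [term, count, term, count, ...]
sample_response = {
    "facet_counts": {
        "facet_fields": {
            "artist_entity": ["hendrick avercamp", 3, "henri matisse", 12]
        }
    }
}

def facet_pairs(response, field):
    """Pair up the flat [term, count, ...] list for one facet field."""
    flat = response["facet_counts"]["facet_fields"][field]
    return list(zip(flat[0::2], flat[1::2]))

suggestions = facet_pairs(sample_response, "artist_entity")
print(suggestions)
```

Each pair is then ready to be rendered as a suggestion with its document count, as in the prototype above.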

Time for action – creating word clouds on facets to view and analyze data

  1. Using the faceting capabilities of Solr, it is really simple to produce word clouds, or tag clouds. These are very good visualization tools, as they can be used not only for their aesthetics, but also to visually synthesize a weighted list of terms from the domain we are moving in. We can create a really simple example of this using only HTML and some JavaScript code, with the simple and powerful jqcloud (http://www.lucaongaro.eu/demos/jqcloud/) library:

  2. All we have to do to play with the examples, and eventually customize them, is to start Solr in the usual way (for example calling start chp06 from the directory /SolrStarterBook/test/), then open the page located at /SolrStarterBook/test/chp06/html/wordclouds.
  3. In this example we see that the terms are presented in a sparse visualization, where the word size visually represents its relevance in the context. For example, the terms presented in the middle box are collected from the title field by using the faceting capabilities of Solr:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =*:*&rows=0&facet=true&facet.field=title&wt=json'

In this simple request we omit the documents from the results (rows=0) because we simply want to obtain the facet results.

What just happened?

When selecting the term saint, we will be prompted with the possibility of performing a query for that term, while we already know that there will be 22 documents matching it. If we click on the link, a basic query will be produced:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q
=title:saint&wt=json'

Here, as expected, we find 22 documents, with their fields and details. This is a really simple approach, but it can give a lot of interesting ideas about the context during the prototyping phase, and it can also be improved in several different ways.

If you have some experience with SQL, you can probably recognize some similarity between the faceting mechanism and the usage of the GROUP BY clause in SQL. For example, using facet.field=artist can be seen as more or less similar to the SQL expression SELECT P.artist, COUNT(P.artist) FROM paintings AS P GROUP BY P.artist. With Solr, however, we can obtain results from many facets at once (the same results would require different queries in SQL). Moreover, the facets can easily be combined with other criteria, and they offer very good performance, as they collect values from the saved index.
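The GROUP BY analogy can be sketched in a few lines of plain Python: counting a field over an in-memory document list (toy data, invented for this illustration) is essentially what a facet count reports:

```python
from collections import Counter

# Toy documents standing in for the paintings core
docs = [
    {"artist": "leonardo", "title": "annunciation"},
    {"artist": "leonardo", "title": "mona lisa"},
    {"artist": "dali", "title": "the persistence of memory"},
]

# facet.field=artist ~ SELECT artist, COUNT(artist) ... GROUP BY artist
artist_facet = Counter(d["artist"] for d in docs)
print(artist_facet.most_common())
```

The difference, of course, is that Solr computes these counts from the inverted index rather than by scanning documents, which is why faceting stays fast on large collections.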

Faceting for narrowing searches and exploring data

A good idea is to again use prototyping to construct, in this case, a simple navigation over facets, which can help us focus on one of the simplest ways to use them in many contexts. I have prepared a simple example using the very good ajax-solr library. Feel free to use it as a base for prototypes on your own data too:

You will find this example by navigating to /SolrStarterBook/test/chp06/paintings/html/index.html. You can play with it directly in your web browser, without any web server, after you have started your Solr paintings core in the usual way.

This simple interface gives us the chance, on the left-hand side of the page, to collect one or more terms from different facets, producing a collection of resources on the right-hand side of the page, as a visual preview that changes to reflect the choices we have made. This is very similar to the concept of filtering a list of resources, as we are accustomed to doing on e-commerce sites, but it can also be used backwards, to explore different search paths from the original one, simply by removing a selected criterion from the list.

In other words, this approach can expand the traditional "bag of words" full-text search into a wider search capability that mixes advanced full-text search functionality with a kind of detour-based exploration of the same relevant data. A user could at some point find out that the Louvre contains many paintings, and decide to explore the list by simply clicking on the interface. In the general case this leads to performing queries that we had not originally thought of. These queries become interesting for us, as we have the possibility of being informed in advance of how many relevant documents we will find using the selected criteria, and every time we add or remove a selection a new series of criteria (and their related results) is triggered.

In our example the query uses the parameters tabulated as follows:

q:*:*

We are not searching for specific terms

facet:true

The faceting support has to be enabled

facet.field:museum_entity

facet.field:artist_entity

facet.field:city_entity

The HTML interface asks for facets results on artist_entity, museum_entity, and city_entity field.

fq:city_entity:paris

fq:museum_entity:"musée du louvre"

We are already using a filter query based on the selection made. This is an improvement of the same idea behind the tagcloud example.

If you want to play with the query directly using facets, you can use the following command:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?facet=true&q
=*:*&facet.field=museum_entity&facet.field=artist_entity&
facet.field=city_entity&facet.limit
=20&facet.mincount=1&f.city_entity.facet.limit=10&json.nl=map&
fq=city_entity:paris&fq=museum_entity:"mus%C3%A9e+du+louvre"&wt=json'

Note the use of lowercase here, the URL encoding of special characters (for example for accented characters), and + for spaces.

Note that we can decide how many results we obtain for every facet (facet.limit), or even customize the number of results for a specific field (f.field_name.facet.limit; in our case f.city_entity.facet.limit).
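Per-field overrides simply follow the f.<field>.facet.<param> naming convention, which a client can generate mechanically. A small sketch (our own helper, written for this illustration) that builds such parameters:

```python
def facet_params(fields, limit=20, per_field_limits=None):
    """Build facet parameters as (key, value) pairs, with optional
    per-field limits using the f.<field>.facet.limit override form."""
    params = [("facet", "true"), ("facet.limit", str(limit))]
    params += [("facet.field", f) for f in fields]
    for field, n in (per_field_limits or {}).items():
        params.append((f"f.{field}.facet.limit", str(n)))
    return params

print(facet_params(["museum_entity", "city_entity"],
                   per_field_limits={"city_entity": 10}))
```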

Time for action – finding interesting subjects using facet query

Let's now look at other possible applications of faceting. For example, we can use a facet query to find the most recurring terms. If a recurring term is in the subject field, this could be used, for example, to obtain suggestions on interesting topics.

  1. We now want to obtain a simple facet result for subjects:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =*:*&rows=0&facet=true&facet.field=subject_entity&facet.limit
    =-1&facet.mincount=2&facet.sort=count&json.nl=map&wt=json'

    We will find that the most common subject in our data is related to the religious theme of the "Annunciation". This result is not particularly surprising, since this is one of the most widely represented themes in classic European art.

  2. If we start from the opposite direction and ask ourselves whether the "annunciating" action is present in the collection, we can easily write a facet query:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =*:*&rows=0&facet=on&
    facet.query=subject_entity:annunciating~5&facet.mincount=1&wt=json'

  3. We obtain the same result: five matches in the facet query counts, over 5069 documents.
  4. Note that we can ask for the same information while querying on other facets too, for example:

    >> curl -X GET 'http://localhost:8983/solr/paintings/select?q
    =*:*&rows=0&facet=on&facet.field=city_entity&
    facet.field=artist_entity&facet.query=subject_entity:annunciating~5&
    facet.limit=10&
    f.artist_entity.facet.limit=2&wt=json&json.nl=map&fq=abstract:angel'

  5. We restrict the documents to those that contain a reference to an angel figure (fq=abstract:angel). We ask for facets on the cities and artists related to that (facet.field=city_entity, facet.field=artist_entity), and also for the number of documents that could possibly be related to our search on the subject (facet.query=subject_entity:annunciating~5).

In this case we will obtain two facet query counts.

What just happened?

We started from the list of terms in the facet for the subject_entity field. We found that the term annunciation is the one used most in our dataset. Note that the subject field plays a role here similar to using tags from a controlled vocabulary. This could be used as an idea for playing with your own fixed, controlled tag vocabulary, if you have one. Once we found an interesting term, we played in reverse, just to understand how the facet query works. What we see here is that if we use a similar term on the same field (subject_entity:annunciating~5) we obtain the same expected results. Starting from that, the next step is to use the facet query without restricting it to a single field, using the following query:

>> curl -X GET 'http://localhost:8983/solr/paintings/select?q
=*:*&rows=0&facet=on&facet.query=annunciating~5&facet.mincount=1&wt=json'

In this case we will obtain 50 matches over all the fields.

If we introduce more than one field for faceting and also perform a facet query, as in the last example, it is simple to notice that every result is independent of the others. Even if we write a query in the facet.query parameter, it will not be used outside its context to filter the other results. The filter query and the common query, instead, will produce changes in the facet and facet query results too, as they restrict the collection on which the facets operate their counts.

Using a filter query is best when we have to fix some criteria to restrict the collection size. We can then use facets as a way to suggest navigation paths: a typical user interface will add a filter to our query when we select a specific facet suggestion, thus narrowing the search. In the opposite direction, when a filter is removed, the collection on which we search becomes broader, and the faceting results change accordingly.
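The add/remove interaction described above can be modeled as a set of active filter queries from which the request is rebuilt on every change. A minimal sketch (the class is ours, for prototyping only; field names follow the earlier examples):

```python
class FilterState:
    """Keeps the set of active filter queries; the UI toggles a
    facet value on or off and the query is rebuilt from the set."""

    def __init__(self):
        self.filters = set()

    def toggle(self, field, value):
        fq = f'{field}:"{value}"'
        if fq in self.filters:
            self.filters.remove(fq)  # broaden the search again
        else:
            self.filters.add(fq)     # narrow the search
        return sorted(self.filters)

state = FilterState()
print(state.toggle("city_entity", "paris"))
print(state.toggle("museum_entity", "musée du louvre"))
print(state.toggle("city_entity", "paris"))  # removed: collection broadens
```

On every toggle, the returned list would be serialized as repeated fq parameters in the next request.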

As a last note, it's possible to specify parameters on a per-field basis when needed; for example, using f.artist_entity.facet.limit=2 we decide to have no more than two facet results for the artist_entity field. Note that facet.mincount does not imply any particular semantics; it's only an acceptable minimum ground value for a count, but it can still be used as if it implied some simple notion of relevance.

Summary

In this article we explored the idea of how it is possible to improve the search experience in a wider perspective, from searching for some specific terms to finding relevant documents.

We started by introducing the Solr faceting capabilities to create a dynamic search experience for the user. By mixing the common usage of advanced parameters, operators, and function combinations with facets, we started to consider ways of moving around in the data collection. This changes how we understand searches: every query can be seen as a pseudo-document, and we used Solr as a match engine rather than a search engine.

Filter queries have a crucial behavior, improving performance on the technical side and restricting the domain in which we execute our searches on the more abstract side. They also led us to introduce the concept of similarity. We saw similarity in action by using the built-in MoreLikeThis component to obtain recommendations, adding value to common search results.


About the Author :


Alfredo Serafini

Alfredo Serafini is a freelance software consultant, currently living in Rome, Italy.

He has a mixed background. He has a bachelor's degree in Computer Science Engineering (2003, with a thesis on Music Information Retrieval), and he has completed a professional master's course in Sound Engineering (2007, with a thesis on gestural interface to MAX/MSP platform).

From 2003 to 2006 he was involved as a consultant and developer with the Artificial Intelligence Research at Tor Vergata (ART) group. During this experience he got his first chance to play with the Lucene library. Since then he has been working as a freelancer, alternating between teaching programming languages, mentoring small companies on topics like Information Retrieval and Linked Data, and (not surprisingly) working as a software engineer.

He is currently a Linked Open Data enthusiast. He has also had a lot of interaction with the Scala language as well as graph and network databases.

You can find more information about his activities on his website, titled designed to be unfinished, at http://www.seralf.it/.

