Apache Solr: Analyzing your Text Data

by Rafał Kuć | July 2011 | Open Source

The process of data indexing can be divided into several parts, and one of the last of them is data analysis. It is one of the crucial parts of data preparation: it defines how your data will be written into the index, what its structure will be, and so on. In Solr, this behavior is defined by field types.

In this article by Rafał Kuć, author of Apache Solr 3.1 Cookbook, we will cover:

  • Storing additional information using payloads
  • Eliminating XML and HTML tags from the text
  • Copying the contents of one field to another
  • Changing words to other words
  • Splitting text by camel case
  • Splitting text by whitespace only
  • Making plural words singular, but without stemming
  • Lowercasing the whole string
  • Storing geographical points in the index
  • Stemming your data
  • Preparing text to do efficient trailing wildcard search
  • Splitting text by numbers and non-white space characters

 


Introduction

A type's behavior can be defined in the context of the indexing process, the query process, or both. Furthermore, a type definition is composed of a tokenizer and filters (both token filters and character filters). The analyzer as a whole operates on all the data sent to the field; the tokenizer specifies how that data will be split into tokens. Types can only have one tokenizer. The result of the tokenizer's work is a stream of objects called tokens. Next in the analysis chain are the filters. They operate on the tokens in the token stream, and they can do anything with the tokens—changing them, removing them, or, for example, making them lowercase. Types can have multiple filters.

One additional type of filter is the character filter. Character filters do not operate on tokens from the token stream; they operate on the raw data sent to the field, and they are invoked before the tokenizer runs.
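To make this concrete, here is a minimal, hypothetical type definition showing where each piece sits in schema.xml (the type name text_example is made up; the factories are only illustrative and also appear in the recipes below):

<fieldType name="text_example" class="solr.TextField">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

The character filter runs first on the raw field value, the single tokenizer then produces the token stream, and any number of token filters process that stream in order.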

This article will focus on the data analysis and how to handle the common day-to-day analysis questions and problems.

Storing additional information using payloads

Imagine that you have a powerful preprocessing tool that can extract information about all the words in a text. Your boss would like you to use it with Solr, or at least store the information it returns in Solr. So what can you do? You can use something called a payload to store that data. This recipe will show you how to do it.

How to do it...

I assume that we already have an application that takes care of recognizing the parts of speech in our text data. Now we need to add that information to the Solr index. To do that we will use payloads, which are metadata that can be stored with each occurrence of a term.

First of all, you need to modify the index structure. For this, we will add the new field type to the schema.xml file:

<fieldType name="partofspeech" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.DelimitedPayloadTokenFilterFactory"
encoder="integer" delimiter="|"/>
</analyzer>
</fieldType>

Now add the field definition part to the schema.xml file:

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="text" type="text" indexed="true" stored="true" />
<field name="speech" type="partofspeech" indexed="true" stored=
"true" multivalued="true" />

Now let's look at what the example data looks like (I named it ch3_payload.xml):

<add>
<doc>
<field name="id">1</field>
<field name="text">ugly human</field>
<field name="speech">ugly|3 human|6</field>
</doc>
<doc>
<field name="id">2</field>
<field name="text">big book example</field>
<field name="speech">big|3 book|6 example|1</field>
</doc>
</add>

Let's index our data. To do that, we run the following command from the exampledocs directory (put the ch3_payload.xml file there):

java -jar post.jar ch3_payload.xml

How it works...

What information can a payload hold? It may hold any information that is compatible with the encoder type you define for the solr.DelimitedPayloadTokenFilterFactory filter. In our case, we don't need to write our own encoder—we will use the supplied one to store integers. We will use it to store the boost of the term. For example, nouns will be given a token boost value of 6, while adjectives will be given a boost value of 3.

First we have the type definition. We defined a new type in the schema.xml file, named partofspeech based on the Solr text field (attribute class="solr.TextField"). Our tokenizer splits the given text on whitespace characters. Then we have a new filter which handles our payloads. The filter defines an encoder, which in our case is an integer (attribute encoder="integer"). Furthermore, it defines a delimiter which separates the term from the payload. In our case, the separator is the pipe character |.

Next we have the field definitions. In our example, we only define three fields:

  • Identifier
  • Text
  • Recognized speech part with payload

 

Now let's take a look at the example data. We have two simple fields: id and text. The one that we are interested in is the speech field. Look how it is defined. It contains pairs which are made of a term, delimiter, and boost value. For example, book|6. In the example, I decided to boost the nouns with a boost value of 6 and adjectives with the boost value of 3. I also decided that words that cannot be identified by my application, which is used to identify parts of speech, will be given a boost of 1. Pairs are separated with a space character, which in our case will be used to split those pairs. This is the task of the tokenizer which we defined earlier.

To index the documents, we use the simple post tool provided with the example deployment of Solr. To use it, we invoke the command shown in the example. The post tool will send the data to the default update handler found under the address http://localhost:8983/solr/update. The parameter following the command is the file that is going to be sent to Solr. You can also post a list of files, not just a single one.
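For example, assuming you had a second file (here a hypothetical ch3_payload_extra.xml) in the same directory, you could post both in one go:

java -jar post.jar ch3_payload.xml ch3_payload_extra.xml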

That is how you index payloads in Solr. In the 1.4.1 version of Solr, there is no further out-of-the-box support for payloads. Hopefully this will change, but for now you need to write your own query parser and similarity class (or extend the ones present in Solr) to make use of them.

Eliminating XML and HTML tags from the text

There are many real-life situations when you have to clean your data. Let's assume that you want to index web pages that your client sends you. You don't know anything about the structure of that page—one thing you know is that you must provide a search mechanism that will enable searching through the content of the pages. Of course, you could index the whole page by splitting it by whitespaces, but then you would probably hear the clients complain about the HTML tags being searchable and so on. So, before we enable searching on the contents of the page, we need to clean the data. In this example, we need to remove the HTML tags. This recipe will show you how to do it with Solr.

How to do it...

Let's suppose our data looks like this (the ch3_html.xml file):

<add>
<doc>
<field name="id">1</field>
<field name="html"><![CDATA[<html><head><title>My page</title></
head><body><p>This is a <b>my</b><i>sample</i> page</body></html>
]]></field>
</doc>
</add>

Now let's take care of the schema.xml file. First add the type definition to the schema.xml file:

<fieldType name="html_strip" class="solr.TextField">
<analyzer>
<charFilter class="solr.HTMLStripCharFilterFactory"/>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

And now, add the following to the field definition part of the schema.xml file:

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="html" type="html_strip" indexed="true" stored="false"/>

Let's index our data. To do that, we run the following command from the exampledocs directory (put the ch3_html.xml file there):

java -jar post.jar ch3_html.xml

If there were no errors, you should see a response like this:

SimplePostTool: version 1.2
SimplePostTool: WARNING: Make sure your XML documents are encoded
in UTF-8, other encodings are not currently supported
SimplePostTool: POSTing files to http://localhost:8983/solr/update..
SimplePostTool: POSTing file ch3_html.xml
SimplePostTool: COMMITting Solr index changes..

How it works...

First of all, we have the example data. In it we see one document with two fields: the identifier and some HTML data nested in a CDATA section. You must remember to surround the HTML data with CDATA tags if it is a full page starting with an <html> tag, like in our example; otherwise Solr will have problems parsing the data. However, if you only have some tags present in the data, you shouldn't worry.

Next, we have the html_strip type definition. It is based on solr.TextField to enable full-text searching. Following that, we have a character filter which handles the HTML and the XML tags stripping. This is something new in Solr 1.4. The character filters are invoked before the data is sent to the tokenizer. This way they operate on untokenized data. In our case, the character filter strips the HTML and XML tags, attributes, and so on. Then it sends the data to the tokenizer, which splits the data by whitespace characters. The one and only filter defined in our type makes the tokens lowercase to simplify the search.

To index the documents, we use the simple post tool provided with the example deployment of Solr. To use it, we invoke the command shown in the example. The post tool will send the data to the default update handler found under the address http://localhost:8983/solr/update. The parameter of the command execution is the file that is going to be sent to Solr. You can also post a list of files, not just a single one.

As you can see, the sample response from the post tools is rather informative. It provides information about the update handler address, files that were sent, and information about commits being performed.

If you want to check how your data was indexed, don't be misled by the stored field contents (attribute stored="true"). The stored value is the original one sent to Solr, so you won't be able to see the filters in action there. If you wish to check the actual data structures, please take a look at the Luke utility (a utility that lets you see the index structure and field values, and operate on the index). Luke can be found at the following address: http://code.google.com/p/luke

Solr provides a tool that lets you see how your data is analyzed. That tool is a part of Solr administration pages.
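Assuming the default example deployment running on port 8983, the analysis tool should be reachable at an address like the following:

http://localhost:8983/solr/admin/analysis.jsp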

Copying the contents of one field to another

Imagine that you have many big XML files that hold information about the books that are stored on library shelves. There is not much data, just the unique identifier, name of the book, and the name of the author. One day your boss comes to you and says: "Hey, we want to facet and sort on the basis of the book author". You can change your XML and add two fields, but why do that when you can use Solr to do that for you? Well, Solr won't modify your data, but it can copy the data from one field to another. This recipe will show you how to do that.

How to do it...

Let's assume that our data looks like this:

<add>
<doc>
<field name="id">1</field>
<field name="name">Solr Cookbook</field>
<field name="author">John Kowalsky</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">Some other book</field>
<field name="author">Jane Kowalsky</field>
</doc>
</add>

We want the contents of the author field to be present in the fields named author, author_facet, and author_sort. So let's define the copy fields in the schema.xml file (place the following right after the fields section):

<copyField source="author"dest="author_facet"/>
<copyField source="author"dest="author_sort"/>

And that's all. Solr will take care of the rest.

The field definition part of the schema.xml file could look like this:

<field name="id" type="string" indexed="true" stored="true"
required="true"/>
<field name="author" type="text" indexed="true" stored="true"
multiValued="true"/>
<field name="name" type="text" indexed="true" stored="true"/>
<field name="author_facet" type="string" indexed="true"
stored="false"/>
<field name="author_sort" type="alphaOnlySort" indexed="true"
stored="false"/>

Let's index our data. To do that, we run the following command from the exampledocs directory (put the data.xml file there):

java -jar post.jar data.xml

How it works...

As you can see in the example, we have only three fields defined in our sample data XML file. There are two fields which we are not particularly interested in: id and name. The field that interests us the most is the author field. As I have mentioned earlier, we want to place the contents of that field in three fields:

  • author (the actual field that will be holding the data)
  • author_sort
  • author_facet

 

To do that we use copy fields. Those instructions are defined in the schema.xml file, right after the field definitions, that is, after the closing </fields> tag. To define a copy field, we need to specify a source field (attribute source) and a destination field (attribute dest).

After the definitions, like those in the example, Solr will copy the contents of the source fields to the destination fields during the indexing process. There is one thing that you have to be aware of—the content is copied before the analysis process takes place. This means that the data is copied as it is stored in the source.

There's more...

There are a few things worth noting when talking about copying the contents of one field to another.

Copying the contents of dynamic fields to one field

You can also copy multiple field content to one field. To do that, you should define a copy field like this:

<copyField source="*_author"dest="authors"/>

A definition like the one above copies all of the fields that end with _author to one field named authors. Remember that if you copy multiple fields to one field, the destination field should be defined as multiValued, as in the sketch below.
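A minimal sketch of how such a destination field could be declared (the authors name matches the copy field example above; the type and stored values here are only assumptions):

<field name="authors" type="text" indexed="true" stored="false"
multiValued="true"/>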

Limiting the number of characters copied

There may be situations where you only need to copy a defined number of characters from one field to another. To do that we add the maxChars attribute to the copy field definition. It can look like this:

<copyField source="author" dest="author_facet" maxChars="200"/>

The above definition tells Solr to copy up to 200 characters from the author field to the author_facet field. This attribute can be very useful when copying the contents of multiple fields to one field.


Changing words to other words

Let's assume we have an e-commerce client and we are providing a search system based on Solr. Our index has hundreds of thousands of documents, which mainly consist of books, and everything works fine. Then one day, someone from the marketing department comes into your office and says that he wants to be able to find all the books that contain the word "machine" when he types "electronics" into the search box. The first thing that comes to mind is: 'hey, do it in the source and I'll index that'. But that is not an option this time, because there can be many documents in the database that have those words. We don't want to change the whole database. That's when synonyms come into play, and this recipe will show you how to use them.

How to do it...

To make the example as simple as possible, I assumed that we only have two fields in our index. This is how the field definition section in the schema.xml file looks (just add it to your schema.xml file, to the field section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="description" type="text_syn" indexed="true" stored=
"true"/>

Now let's add the text_syn type definition to the schema.xml file, as shown in the code snippet:

<fieldType name="text_syn" class="solr.TextField">
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt"
ignoreCase="true" expand="false" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

As you may have noticed, a file named synonyms.txt is mentioned there. Let's take a look at its contents:

machine => electronics

The synonyms.txt file should be placed in the same directory as the other configuration files.

How it works...

First we have our field definition. There are two fields: the identifier and the description. The second one is the one of interest right now. It's based on the new type text_syn, which is shown in the second listing.

Now about the new type, text_syn—it's based on the solr.TextField class. Its definition is split into two parts: it behaves one way during indexing and a different way during querying. So the first thing we see is the query-time analyzer definition. It consists of the tokenizer that splits the data on whitespace characters, and then the lowercase filter that makes all the tokens lowercase. The interesting part is the index-time behavior. It starts with the same tokenizer, but then the synonyms filter comes into play. Its definition starts like all the other filters, with the factory definition. Next we have the synonyms attribute, which defines which file contains the synonym definitions. Following that we have the ignoreCase attribute, which tells Solr to ignore the case of the tokens and of the contents of the synonyms file.

The last attribute named expand is set to false. This means that Solr won't be expanding the synonyms and all equivalent synonyms will be reduced to the first synonym in the line. If the attribute is set to true, all synonyms will be expanded to all equivalent forms.

The example synonyms.txt file tells Solr that when the word machine appears in a field based on the text_syn type, it should be replaced by electronics, but not the other way round. Each synonym rule should be placed on a separate line of the synonyms.txt file. Also remember that the file should be written in UTF-8 encoding, because that is what Solr expects.
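Just for illustration (the terms below are made up), a synonyms.txt file with several rules could look like this; lines starting with # are comments, => defines a one-way replacement, and a comma-separated line defines equivalent synonyms:

# one-way rule: the left-hand side is replaced by the right-hand side
machine => electronics
notebook => laptop
# equivalent synonyms: handled according to the expand attribute
tv, television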

There's more...

There is one more thing I would like to add when talking about synonyms.

Equivalent synonyms setup

Let's get back to our example for a second. What if the person from the marketing department says that he not only wants books containing the word machine to be found when entering the word electronics, but also all the books containing the word electronics to be found when entering the word machine? The answer is simple. First, we would set the expand attribute of the filter to true. Then we would change our synonyms.txt file to something like this:

machine, electronics

And as I said earlier, Solr would expand synonyms to equivalent forms.

Splitting text by camel case

Let's suppose that you run an e-commerce site with an electronics assortment. The marketing department can be a source of many great ideas. Imagine that one day your colleague from this department comes to you and says that they would like your search application to be able to find documents containing the word "PowerShot" by entering the words "power" and "shot" into the search box. So can we do that? Of course, and this recipe will show you how.

How to do it...

Let's assume that we have the following index structure (add this to your schema.xml file, to the field definition section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="description" type="text_split" indexed="true"
stored="true" />

To split text in the description field, we should add the following type definition to the schema.xml file:

<fieldType name="text_split" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts=
"1" splitOnCaseChange="1"/>
<filter class="solr.LowerCaseFilterFactory" />
</analyzer>
</fieldType>

To test our type, I've indexed the following XML file:

<add>
<doc>
<field name="id">1</field>
<field name="description">TextTest</field>
</doc>
</add>

Then I run the following query in the web browser:

http://localhost:8983/solr/select?q=description:test

You should get the indexed document in response.

How it works...

Let's see how things work. First of all, we have the field definition part of the schema.xml file. This is pretty straightforward. We have two fields defined: one that is responsible for holding the information about the identifier (id field) and the second is responsible for the product description (description field).

Next, we see the interesting part. We name our type text_split and base it on a text type, solr.TextField. We also tell Solr that we want our text to be tokenized on whitespace by adding the whitespace tokenizer (tokenizer tag). To do what we want (split on case change), we need more than this. We need a filter named WordDelimiterFilter, which is created by the solr.WordDelimiterFilterFactory class and added with a filter tag. We also need to define the appropriate behavior of the filter, so we add two attributes: generateWordParts and splitOnCaseChange. The values of these two parameters are set to 1, which means that they are turned on. The first attribute tells Solr to generate word parts, which means that the filter will split the data on non-letter characters. The second attribute tells Solr to also split the tokens on case change.

What will that configuration do with our sample data? As you can see, we have one document sent to Solr. The data in the description field will be split into two words: text and test. You can check it yourself by running the example query in your web browser.
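You can also check that the other part of the word is searchable. Assuming the same data, a query like the following should return the same document:

http://localhost:8983/solr/select?q=description:text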

Splitting text by whitespace only

One of the most common problems that you have probably come across is having to split the text with whitespaces in order to segregate words from each other, to be able to process it further. This recipe will show you how to do it.

How to do it...

Let's assume that we have the following index structure (add this to your schema.xml file in the field definition section):

<field name="description_string" type="string" indexed="true"
stored="true" />
<field name="description_split" type="text_split" indexed="true"
stored="true" />

To split the text in the description field, we should add the following type definition:

<fieldType name="text_split" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

To test our type, I've indexed the following XML file:

<add>
<doc>
<field name="description_string">test text</field>
<field name="description_text">test text</field>
</doc>
</add>

Then I run the following query in the web browser:

http://localhost:8983/solr/select?q=description_split:text

You should get the indexed document in response.

On the other hand, you won't get the indexed document in response after running the following query:

http://localhost:8983/solr/select?q=description_string:text

How it works...

Let's see how things work. First of all, we have the field definition part of the schema.xml file. This is pretty straightforward. We have two fields defined: one named description_string, which is based on the string type and thus not analyzed, and the second, description_split, which is based on our text_split type and will be tokenized on whitespace characters.

Next, we see the interesting part. We named our type text_split and based it on a text type, solr.TextField. We told Solr that we want our text to be tokenized by whitespaces by adding a whitespace tokenizer (tokenizer tag). Because there are no filters defined, the text will be tokenized only by whitespace characters and nothing more.

That's why our sample data in the description_split field will be split into two words: test and text. On the other hand, the text in the description_string field won't be split. That's why the first example query returns one document in the response, while the second one doesn't find the example document.

Making plural words singular, but without stemming

Nowadays it's nice to have stemming algorithms in your application. But let's imagine that you have a search engine that searches through the contents of the books in a library. One of the requirements is changing the plural forms of words to their singular forms; nothing less, nothing more. Can Solr do that? Yes, the newest versions can, and this recipe will show you how.

How to do it...

Let's assume that our index consists of two fields (add this to your schema.xml file, to the field definition section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="description" type="text_light_stem" indexed="true"
stored="true" />

Our text_light_stem type should look like this (add this to your schema.xml file):

<fieldType name="text_light_stem" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EnglishMinimalStemFilterFactory" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

If you check this type with the analysis tool of the Solr administration pages, you should see that words like ways and keys are changed to their singular forms.

How it works...

First of all, we need to define the fields in the schema.xml file. To do that, we add the contents from the first example into that file. It tells Solr that our index will consist of two fields: the id field, which will be responsible for holding the information about the unique identifier of the document and the description field, which will be responsible for holding the document description.

The description field is actually where the magic is done. We defined a new field type for it, which we called text_light_stem. The type definition consists of a tokenizer and two filters. If you want to know how the tokenizer behaves, please refer to the Splitting text by whitespace only recipe in this article. The first filter is a new one: the light stemming filter that we will use to perform minimal stemming. The class that enables Solr to use this filter is solr.EnglishMinimalStemFilterFactory. It takes care of the light stemming process; you can see that by using the analysis tools of the Solr administration panel. The second filter is the lowercase filter; you can see how it works by referring to the Lowercasing the whole string recipe in this article.

After adding this to your schema.xml file, you should be able to use the light stemming algorithm.
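If you prefer to verify it with a query rather than the analysis tool, you could index a small, made-up document like the one below and then search for the singular form; with the type defined above, the query should find the document:

<add>
<doc>
<field name="id">1</field>
<field name="description">many ways and keys</field>
</doc>
</add>

http://localhost:8983/solr/select?q=description:way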

There's more...

Light stemming supports a number of different languages. To use the light stemmers for your respective language, add the following filters to your type:

  • Russian: solr.RussianLightStemFilterFactory
  • Portuguese: solr.PortugueseLightStemFilterFactory
  • French: solr.FrenchLightStemFilterFactory
  • German: solr.GermanLightStemFilterFactory
  • Italian: solr.ItalianLightStemFilterFactory
  • Spanish: solr.SpanishLightStemFilterFactory
  • Hungarian: solr.HungarianLightStemFilterFactory
  • Swedish: solr.SwedishLightStemFilterFactory
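As an example, a hypothetical type using the German light stemmer could look like this (the type name is arbitrary, and putting the lowercase filter before the stemmer is just one reasonable choice):

<fieldType name="text_light_stem_de" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
<filter class="solr.GermanLightStemFilterFactory"/>
</analyzer>
</fieldType>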

Lowercasing the whole string

Let's get back to our books search example. This time your boss comes to you and says that all book names should be searchable when the user types lower or uppercase characters. Of course, Solr can do that, and this recipe will describe how to do it.

How to do it...

Let's assume that we have the following index structure (add this to your schema.xml file in the field definition section):

<field name="id " type="string" indexed="true" stored="true"
required="true" />
<field name="name" type="string_lowercase" indexed="true"
stored="true" />
<field name="description" type="text" indexed="true" stored="true"/>

To make our strings lowercase, we should add the following type definition to the schema.xml file:

<fieldType name="string_lowercase" class="solr.TextField">
<analyzer>
<tokenizer class="solr.KeywordTokenizerFactory"/>
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

To test our type, I've indexed the following XML file:

<add>
<doc>
<field name="id">1</field>
<field name="name">Solr Cookbook</field>
<field name="description">Simple description</field>
</doc>
</add>

Then I run the following query in the web browser:

http://localhost:8983/solr/select?q=name:"solr cookbook"

You should get the indexed document in the response.

Similarly, you should get the indexed document in the response after running the following query:

http://localhost:8983/solr/select?q=name:"solr Cookbook"

How it works...

Let's see how things work. First of all, we have the field definition part of the schema.xml file. This is pretty straightforward. We have three fields defined: the id field, which is responsible for holding our unique identifier; the name field, which is actually our lowercased string field; and the description field, which holds the description of our documents and is based on the standard text type defined in the example Solr deployment.

Now let's get back to our name field. It's based on the string_lowercase type, so let's look at that type. Its analyzer consists of a tokenizer and one filter. solr.KeywordTokenizerFactory tells Solr that the data in the field should not be tokenized in any way; it should just be passed as a single token to the token stream. Next we have our filter, which changes all the characters to their lowercase equivalents. And that's how the analysis of this field is performed.

The example queries show how the field behaves. It doesn't matter whether you type lower or uppercase characters; the document will be found anyway. What does matter is that you must type the whole string as it is, because we used the keyword tokenizer which, as I already said, does not tokenize but just passes the whole data through the token stream as a single token.
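To see the other side of this behavior, try a query that contains only part of the name. Assuming the data indexed above, the following query should return no documents, because only the complete string was indexed as a single token:

http://localhost:8983/solr/select?q=name:cookbook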

Storing geographical points in the index

Imagine that until now your application has stored information about companies. Not much information, only the unique identifier and the company name. But now your client wants to store the locations of the companies. Not a problem—just two additional fields. But a company can have multiple addresses and thus can have multiple geographical points assigned. So how do we do that in Solr? Of course, we could add multiple dynamic fields and remember the field names in our application, but that isn't convenient. In this recipe, I'll show you how to store pairs of values; in our case, geographical points.

How to do it...

Let's assume that the companies that we store are defined by three fields (add this to your schema.xml file, to the field definition section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="location" type="point" indexed="true"
stored="true"multiValued="true" />

We should also have one dynamic field defined (add this to your schema.xml file in the field definition section):

<dynamicField name="*_d" type="double" indexed="true" stored="true"/>

Our point type should look like this:

<fieldType name="point" class="solr.PointType" dimension="2"
subFieldSuffix="_d"/>

Now let's see how our example data will look (I named the data file task9.xml):

<add>
<doc>
<field name="id">1</field>
<field name="name">Solr.pl company</field>
<field name="location">10,10</field>
<field name="location">20,20</field>
</doc>
</add>

Let's index our data. To do that, we run the following command from the exampledocs directory (put the task9.xml file there):


java -jar post.jar task9.xml

After indexing, we should be able to use a query like the following one to get our data:

http://localhost:8983/solr/select?q=location:10,10

The response should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<response>
<lst name="responseHeader">
<int name="status">0</int>
<int name="QTime">2</int>
<lst name="params">
<str name="q">location:10,10</str>
</lst>
</lst>
<result name="response" numFound="1" start="0">
<doc>
<str name="id">1</str>
<arr name="location">
<str>10,10</str>
<str>20,20</str>
</arr>
<arr name="location_0_d">
<double>10.0</double>
<double>20.0</double>
</arr>
<arr name="location_1_d">
<double>10.0</double>
<double>20.0</double>
</arr>
<str name="name">Solr.pl company</str>
</doc>
</result>
</response>

How it works...

First of all, we have three fields and one dynamic field defined in our schema.xml file. The first field is the one responsible for holding the unique identifier. The second one is responsible for holding the name of the company. The third one, named location, is responsible for holding the geographical points and can have multiple values. The dynamic field will be used as a helper for the point type.

Next we have our point type definition. It's based on the solr.PointType class and is defined by two attributes:

  • dimension: The number of dimensions that the field will be storing. In our case, we will need to store pairs of values, so we need to set this attribute to 2.
  • subFieldSuffix: The field that will be used to store the actual values of the field. This is where we need our dynamic field. We tell Solr that our helper field will be the dynamic field ending with the suffix of _d.

So how does this type of field actually work? When defining a two-dimensional field like we did, there are actually three fields created in the index. The first field is named like the field we added to the schema.xml file, so in our case it is location. This field is responsible for holding the stored value of the field, and note that it is only created when we set the field's stored attribute to true.

The next two fields are based on the defined dynamic field. Their names will be field_0_d and field_1_d, which in our case means location_0_d and location_1_d. First we have the field name, then the _ character, then the index of the value, then another _ character, and finally the suffix defined by the subFieldSuffix attribute of the type.

We can now look at the way the data is indexed. Please take a look at the example data file. You can see that the values in each pair are separated by the comma character. And that's how you can add the data to the index.

Querying works just the same as indexing in the way the pairs are represented. The example query shows how you should make your queries. It differs from queries against standard, single-valued fields in only one thing: the values in the pair are separated by a comma character and passed to the query together.

The response is shown just to illustrate how the fields are stored. You can see, beside the location field, that there were two dynamic fields location_0_d and location_1_d created.

There's more...

If you wish to store more than two dimensions in a field, you should change the dimension attribute of the type. For example, if you want to store four dimensions in a field, your definition could look like this:

<fieldType name="tetragon" class="solr.PointType" dimension="4"
subFieldSuffix="_i"/>


Stemming your data

One of the most common requirements I come across is stemming. Let's imagine a book e-commerce store, where you store the books' names and descriptions. We want to be able to find words like shown and showed when we type the word show, and vice versa. To achieve that, we can use stemming algorithms. This recipe will show you how to add stemming to your data analysis.

How to do it...

Let's assume that our index consists of three fields (add this to your schema.xml file, to the field definition section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="name" type="text" indexed="true" stored="true" />
<field name="description" type="text_stem" indexed="true"
stored="true" />

Our text_stem type should look like this:

<fieldType name="text_stem" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.SnowballPorterFilterFactory" />
</analyzer>
</fieldType>

Now you can index your data. Let's create an example data file:

<add>
<doc>
<field name="id">1</field>
<field name="name">Solr cookbook</field>
<field name="description">This is a book that I'll show</field>
</doc>
<doc>
<field name="id">2</field>
<field name="name">Solr cookbook 2</field>
<field name="description">This is a book I showed</field>
</doc>
</add>

After indexing, we can test how our data was analyzed. To do that, let's run the following query:

http://localhost:8983/solr/select?q=description:show

If everything went well, Solr found two documents matching the query, which means that our field is working as intended.

How it works...

Our index consists of three fields: one holding the unique identifier of the document, the second one holding the name of the document, and the third one holding the document description. The last field is the field that will be stemmed.

The stemmed field is based on the Solr text field and has an analyzer that is used at both query and index time. It is tokenized on whitespace characters, and then the stemming filter is applied. What does the filter do? It tries to bring words to their root form, meaning that words like shows, showing, and show will all be changed to show, or at least they should be.

Please note that in order to use stemming properly, it must be applied at both query and index time; otherwise the stemmed terms in the index won't match the terms produced from the query.

As you can see, our test data consists of two documents. Please take a look at the description. One of the documents has the word showed and the other has the word show in their description fields. After indexing and running the sample query, Solr would return two documents in the result, which means that the stemming did its job.

There's more...

There are two other things I would like to mention when talking about stemming.

Alternative English stemmer

If you find that the snowball porter stemmer is not sufficient for your needs (for example, if the first one is too invasive or too slow), you can try the other stemmer for English available in Solr. To do that, you change your stemming filter to the following one:

<filter class="solr.PorterStemFilterFactory" />

Stemming other languages

There are too many languages that have stemming support integrated into Solr to mention them all. If you are using a language other than English, please refer to the Language Analysis page of the Solr wiki to find the appropriate filter.
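For example, the Snowball-based filter used in this recipe accepts a language attribute, so a hypothetical setup for German data could use the following filter instead:

<filter class="solr.SnowballPorterFilterFactory" language="German"/>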

Preparing text to do efficient trailing wildcard search

Many users coming from traditional RDBMS systems are used to wildcard searches. The most common of them are the ones using the * character, which means zero or more characters. You have probably seen searches like:

AND name LIKE 'ABC12%'

So, how do we do that with Solr without killing our Solr server? This recipe will show you how to prepare your data and make such searches efficient.

How to do it...

Let's assume we have the following index structure (add this to your schema.xml file, to the field definition section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="name" type="string_wildcard" indexed="true" stored=
"true"/>

Now, let's define our string_wildcard type (add this to the schema.xml file):

<fieldType name="string_wildcard" class="solr.TextField">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.EdgeNGramFilterFactory" minGramSize="1"
maxGramSize="25" side="front"/>
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
</analyzer>
</fieldType>

The example data looks like this:

<add>
<doc>
<field name="id">1</field>
<field name="name">XYZ1234ABC12POI</field>
</doc>
</add>

Now, send the following query to Solr:

http://localhost:8983/solr/select?q=name:XYZ1

As you can see, the document has been found, which means our setup is working as intended.

How it works...

First of all, let's look at our index structure defined in the schema.xml file. We have two fields: one holding the unique identifier of the document (id field) and the second one holding the name of the document (name field), which is actually the field we are interested in.

The name field is based on the new type we defined, string_wildcard. This type is responsible for enabling trailing wildcards, the kind that would back SQL queries like LIKE 'WORD%'. As you can see, the field type is divided into two analyzers: one for data analysis during indexing and the other for query processing. The query-time one is straightforward—it just tokenizes the data on whitespace characters. Nothing more and nothing less.

Now, the index-time analysis (of course, we are talking about the name field). Similar to query time, the data is tokenized on whitespace characters during indexing, but there is also an additional filter defined. solr.EdgeNGramFilterFactory is responsible for generating so-called n-grams. In our setup, we tell Solr that the minimum length of an n-gram is 1 (the minGramSize attribute) and the maximum length is 25 (the maxGramSize attribute). We also define that the analysis should start from the beginning of the text (the side attribute set to front). So what will Solr do with our example data? It will create the following tokens from the example text: X, XY, XYZ, XYZ1, XYZ12, and so on. It creates tokens by adding the next character from the string to the previous token, up to the maximum n-gram length given in the filter configuration.

So by typing the example query, we can be sure that the example document will be found, because of the n-gram filter defined in the field configuration. We didn't add the n-gram filter to the query-time analyzer because we don't want our query to be analyzed in that way; that could lead to false positive hits, and we don't want that to happen.

By the way, the functionality described here can also be used to provide autocomplete features in your application (if you are not familiar with the autocomplete feature, please take a look at http://en.wikipedia.org/wiki/Autocomplete).

There's more...

If you would like your field to be able to simulate SQL LIKE '%ABC' queries, you should change the side attribute of solr.EdgeNGramFilterFactory to the back value. The configuration should look like this:

<filter class="solr.EdgeNGramFilterFactory"minGramSize="1"maxGramSi
ze="25" side="back"/>

It changes the end from which Solr starts to analyze the data. In our case, it would start from the end of the string and thus produce n-grams like: I, OI, POI, 2POI, 12POI, and so on.

Splitting text by numbers and non-white space characters

Analyzing text data is not only about stemming, removing diacritics, and choosing the right format for the data. Let's assume that our client wants to be able to search by the words and numbers that make up product identifiers. For example, he would like to be able to find the product identifier ABC1234XYZ by using ABC, 1234, or XYZ.

How to do it...

Let's assume that our index consists of three fields (add this to your schema.xml file, to the field definition section):

<field name="id" type="string" indexed="true" stored="true"
required="true" />
<field name="name" type="text" indexed="true" stored="true"/>
<field name="description" type="text_split" indexed="true"
stored="true" />

Our text_split type should look like this (add this to your schema.xml file):

<fieldType name="text_split" class="solr.TextField">
<analyzer>
<tokenizer class="solr.WhitespaceTokenizerFactory"/>
<filter class="solr.WordDelimiterFilterFactory" generateWordParts=
"1" generateNumberParts="1" splitOnNumerics="1" />
<filter class="solr.LowerCaseFilterFactory"/>
</analyzer>
</fieldType>

Now you can index your data. Let's create an example data file:

<add>
<doc>
<field name="id">1</field>
<field name="name">Test document</field>
<field name="description">ABC1234DEF BL-123_456 adding-documents</
field>
</doc>
</add>

After indexing we can test how our data was analyzed. To do that, let's run the following query:

http://localhost:8983/solr/select?q=description:1234

Solr found our document, which means that our field is working as intended.

How it works...

We have our index defined as three fields in the schema.xml file. We have a unique identifier (the id field) indexed as a string, a document name (the name field) indexed as text (a type provided with the example deployment of Solr), and a document description (the description field), which is based on the text_split type which we defined ourselves.

Our type is defined to make the same text analysis both on the query time and the index time. It consists of the whitespace tokenizer and two filters. The first filter is where the magic is done. The solr.WordDelimiterFilterFactory behavior, in our case, is defined by these parameters:

  • generateWordParts: If set to 1, it tells the filter to generate parts of the word that are connected by non-alphanumeric characters like the dash character. For example, token ABC-EFG would be split to ABC and EFG.
  • generateNumberParts: If set to 1, it tells the filter to generate words from numbers connected by non-numeric characters like the dash character. For example, token 123-456 would be split to 123 and 456.
  • splitOnNumerics: If set to 1, it tells the filter to split letters and numbers from each other. This means that token ABC123 will be split to ABC and 123.

The second filter is responsible for changing the words to their lowercase equivalents and is discussed in the Lowercasing the whole string recipe in this article.

So after sending our test data to Solr, we can run the example query to see if we defined our filter properly. And you probably know the result: yes, the response will contain one document, the one that we sent to Solr. That's because the word ABC1234DEF is split into the ABC, 1234, and DEF tokens, and can thus be found by the example query.

There's more...

In case you would like to preserve the original token that is passed to solr.WordDelimiterFilterFactory, add the following attribute to the filter definition:

preserveOriginal="1"
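Put together with the attributes used earlier in this recipe, the whole filter definition could then look like this:

<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1"
generateNumberParts="1" splitOnNumerics="1" preserveOriginal="1" />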

See also

If you would like to know more about solr.WordDelimiterFilterFactory, please refer to the recipe Splitting text by camel case in this article.

Summary

This article told you how to overcome common problems you may encounter while analyzing your text data.


About the Author


Rafał Kuć

Rafał Kuć is a born team leader and software developer. He currently works as a consultant and a software engineer at Sematext Group, Inc., where he concentrates on open source technologies such as Apache Lucene and Solr, Elasticsearch, and Hadoop stack. He has more than 12 years of experience in various branches of software, from banking software to e-commerce products. He focuses mainly on Java but is open to every tool and programming language that will make the achievement of his goal easier and faster. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people with the problems they face with Solr and Lucene. Also, he has been a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, and Lucene Revolution.

Rafał began his journey with Lucene in 2002, and it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then, Solr came along and this was it. He started working with Elasticsearch in the middle of 2010. Currently, Lucene, Solr, Elasticsearch, and information retrieval are his main points of interest.

Rafał is also the author of Apache Solr 3.1 Cookbook, and the update to it, Apache Solr 4 Cookbook. Also, he is the author of the previous edition of this book and Mastering ElasticSearch. All these books have been published by Packt Publishing.
