Chapter 3. Analyzing Your Text Data

In this chapter, we will cover the following topics:

  • Using the enumeration type

  • Removing HTML tags during indexing

  • Storing data outside of Solr index

  • Using synonyms

  • Stemming different languages

  • Using nonaggressive stemmers

  • Using the n-gram approach to do performant trailing wildcard searches

  • Using position increment to divide sentences

  • Using patterns to replace tokens

Introduction


The process of data indexing can be divided into parts, and one of the crucial parts is data analysis. It defines how your text will be divided into terms and what those terms will look like. In Solr, this behavior is defined by field types. A type's analysis can be defined separately for the indexing process, for the query process, or once for both. The type definition is composed of a tokenizer (a separate one can be used for querying and for indexing) and filters (both token and character filters). The analyzer operates on the whole data sent to the field, and its tokenizer specifies how that data will be split; an analyzer can have only one tokenizer. The result of tokenization is a stream of objects called tokens.

Next in the analysis chain are the filters. They operate on the tokens in the token stream and can do almost anything with them: changing them, removing them, or making them lowercase...

Using the enumeration type


Imagine that we use Solr to store information about our environment's state, errors, and events related to them; in short, a simple log centralization solution. For our use case, we will store the identifier of the message, the information itself, the type of event, and the severity of the event, which tells us how important the event is. However, we want to be sure that the severity field contains only values from a given list. To achieve this, we will use the Solr enumeration type.

How to do it...

To achieve our requirements, we will have to perform the following steps:

  1. We will start with the index structure. Our field list from the schema.xml file will look as follows:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="problem" type="text_general" indexed="true" stored="true" />
    <field name="severity" type="enum_type" indexed="true" stored="true" />
  2. In addition...
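The listing is cut off above. As a sketch of what the enum_type definition can look like, assuming the standard solr.EnumField type (the enumsConfig.xml file name, the enum name, and the severity values below are illustrative assumptions):

    <fieldType name="enum_type" class="solr.EnumField" enumsConfig="enumsConfig.xml" enumName="severity" />

The allowed values themselves live in the enumsConfig.xml file, placed next to schema.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <enumsConfig>
     <!-- the allowed severity values; assumed for illustration -->
     <enum name="severity">
      <value>Low</value>
      <value>Medium</value>
      <value>High</value>
      <value>Critical</value>
     </enum>
    </enumsConfig>

With this in place, a document whose severity field carries a value outside this list is rejected during indexing.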

Removing HTML tags during indexing


There are many real-life situations when you have to clean your data. Let's assume that you want to index web pages that your client sends you. You don't know anything about the structure of the pages; the one thing you do know is that you must provide a search mechanism that enables searching through their content. Of course, you could index the whole page split by whitespace, but then you would probably hear the client complain about HTML tags being searchable, and so on. So, before we enable searching on the contents of the pages, we need to clean the data. This recipe will show you how to remove HTML tags with Solr.

How to do it...

Now, let's take a look at the steps needed to remove the HTML tags from our data.

  1. We start by assuming that our data looks like this:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="html"><![CDATA[<html><head><title>My page</title></head...

Storing data outside of Solr index


Although Solr allows us to use the partial update API to update a single field of a document, what it does in the background is a complete reindexing of the document. However, there are situations where such reindexing is not practical. For example, we can have an index containing articles about published books, where we also store how many users have visited and read each article. The number of users is so high that we get thousands of updates per second, and sending such a high volume of updates can be demanding for Solr. Instead, we can store this information in external files and use it for boosting or sorting. This recipe will show you how to do this.

How to do it...

The following steps are needed to achieve our requirements:

  1. First of all, we will create the index structure by adding the following field definition to our schema.xml file:

    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="visits" type="visitsType...

Using synonyms


Let's assume we have an e-commerce client and we are providing a search system based on Solr. Our index has thousands of documents that mainly consist of books, and everything works fine. Then, one day, someone from the marketing department comes into your office and says that he wants to be able to find books that contain the word machine when he types electronics into the search box. The first thing that comes to mind is "hey, do it in the source data and I'll index that". However, this is not an option this time, because there can be many documents in the database that contain those words, and we don't want to change the whole database. This is when synonyms come into play, and this recipe will show you how to use them.

How to do it...

To keep the example as simple as possible, let's assume that we have only two fields in our index.

  1. Let's start with defining our index structure by adding the following field definition section to the schema.xml file:

    <field name="id" type="string" indexed...

Stemming different languages


Stemming is a very common requirement; it is the process of reducing words to their root form (or stems). Let's imagine a book e-commerce store where we store the books' names and descriptions. We want to be able to find words such as shown and showed when we type the word show, and vice versa. We can achieve this using stemming algorithms. The thing is, there is no universal stemmer; they are language specific. This recipe will show you how to add stemming to your data analysis chain and where to look for a list of available stemmers.

How to do it...

To achieve our requirement to stem English, we need to take certain steps:

  1. We will start with the index structure. Let's assume that our index consists of three fields that we defined in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="string" indexed="true" stored="true" />
    <field name="description" type="text_stem" indexed...

Using nonaggressive stemmers


Nowadays, it's nice to have stemming algorithms (algorithms that reduce words to their stem or root forms) in your application, so that a user can find words such as cat and cats just by typing cat. However, let's imagine that you have a search engine that searches through the contents of books in a library. One of the requirements is reducing the plural forms of words to their singular forms; nothing less, nothing more. Can Solr do this? Yes, Solr can, and this recipe will show you how.

How to do it...

  1. First, let's start with a simple, two-field index (add the following section to your schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="description" type="text_light_stem" indexed="true" stored="true" />
  2. Now, let's define the text_light_stem field type, which should look like this (add this to your schema.xml file):

    <fieldType name="text_light_stem" class="solr...

Using the n-gram approach to do performant trailing wildcard searches


Many users working with traditional RDBMS systems are used to wildcard searches. The most common among them are the ones using the * character, which means zero or more characters. If you have used SQL databases, you have probably seen searches such as this:

AND name LIKE 'ABC12%'

However, wildcard searches are not very efficient when it comes to Solr. This is because Solr needs to enumerate all the terms that match the wildcard pattern when the query is executed. So, how do we prepare our Solr deployment to handle trailing wildcard characters in an efficient way? This recipe will show you how to prepare your data and make such searches efficient.

How to do it...

We need to take a few steps to make wildcard searches efficient using the n-gram approach:

  1. The first step is to create a proper index structure. Let's assume we have the following fields defined in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name...
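The schema excerpt is truncated above. The idea of the recipe is to generate edge n-grams at index time only, so that a plain query term matches everything a trailing wildcard would; a sketch, assuming solr.EdgeNGramFilterFactory (the type name and gram sizes are assumptions):

    <fieldType name="text_wildcard" class="solr.TextField">
     <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <!-- abc12 is indexed as a, ab, abc, abc1, abc12 -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
     </analyzer>
     <analyzer type="query">
      <!-- no n-gram filter at query time; the query term is matched against the indexed grams -->
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
    </fieldType>

A query such as name:abc12 now behaves like the LIKE 'ABC12%' example, but it is a cheap single-term lookup instead of a term enumeration.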

Using position increment to divide sentences


Imagine that we want to search in the short notes created by our users. We want to have two possibilities—searching inside a single sentence and searching inside the whole content of the note. We also know that our users don't write notes longer than 100 sentences, and each sentence has a maximum of 100 words, giving us a maximum of 10,000 words per note. To achieve this, we will use position increments that allow us to control how data is divided in the same field.

How to do it...

The following steps will allow us to fulfill our requirements:

  1. We start with example data, which will look like this:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="note_line">Support meeting at Monday.</field>
      <field name="note_line">Need to prepare presentation.</field>
     </doc>
    </add>
  2. Now, we need to create an index structure. To do this, we need to add the fields that will be used. We do this by adding...
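The index structure step is cut off above. The essential piece is a multivalued field whose type declares a positionIncrementGap at least as large as the longest sentence; a sketch (the type name and analysis chain are assumptions; the gap of 100 follows from the 100-words-per-sentence limit stated earlier):

    <field name="note_line" type="text_sentence" indexed="true" stored="true" multiValued="true" />

    <fieldType name="text_sentence" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
    </fieldType>

The gap inserts 100 positions between consecutive note_line values, so a phrase query can never match across two sentences, while an ordinary term query still searches the whole note.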

Using patterns to replace tokens


Let's assume that we want to search inside user blog posts. We need to prepare a simple search returning only the identifiers of the documents that were matched. However, we want to remove some words because of explicit language. Of course, we could do this using the stop words functionality, but what if we also want to know how many documents had their contents censored, so that we can compute statistics on them? In such a case, we can't use the stop words functionality; we need something more powerful, which means regular expressions. This recipe will show you how to achieve such a requirement using Solr and one of its filters.

How to do it...

To achieve our needs, we will use the solr.PatternReplaceFilterFactory filter. Let's assume that we want to remove all the words that start with the word prefix. These are the steps needed:

  1. First, we need to create our index structure, so the fields we add to the schema.xml file are as follows:

    <field name="id" type="string" indexed...