Chapter 3. Analyzing Your Text Data

In this chapter, we will cover the following topics:

  • Using the enumeration type

  • Removing HTML tags during indexing

  • Storing data outside of Solr index

  • Using synonyms

  • Stemming different languages

  • Using nonaggressive stemmers

  • Using the n-gram approach to do performant trailing wildcard searches

  • Using position increment to divide sentences

  • Using patterns to replace tokens

Introduction


The process of data indexing can be divided into parts, and one of the crucial parts is data analysis. It defines how your text will be divided into terms and what those terms will look like. In Solr, this behavior is defined by field types. A type's analysis can be defined separately for the indexing process, for the query process, or once for both. The type definition is composed of a tokenizer (a separate one can be used for querying and for indexing) and filters (both token and character filters). The analyzer operates on the whole data sent to the field, and its tokenizer specifies how that data will be split; an analyzer can have only one tokenizer. The result of tokenization is a stream of objects called tokens.

Next in the analysis chain are the filters. They operate on the tokens in the token stream and can do almost anything with them: changing them, removing them, or making them lowercase...

Using the enumeration type


Imagine that we use Solr to store information about our environment's state, errors, and events related to them; in short, a simple log centralization solution. For our use case, we will store the identifier of the message, the information itself, the type of event, and the severity of the event, which tells us how important the event is. However, we want to be sure that the severity field contains only values from a given list. To achieve this, we will use the Solr enumeration type.

How to do it...

To achieve our requirements, we will have to perform the following steps:

  1. We will start with the index structure. Our field list from the schema.xml file will look as follows:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="problem" type="text_general" indexed="true" stored="true" />
    <field name="severity" type="enum_type" indexed="true" stored="true" />
  2. In addition...
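The listing is cut off above. As a sketch of what the enum_type definition can look like, assuming the standard solr.EnumField type (the enumsConfig.xml file name, the enum name, and the severity values below are illustrative assumptions):

    <fieldType name="enum_type" class="solr.EnumField" enumsConfig="enumsConfig.xml" enumName="severity" />

The allowed values themselves live in the enumsConfig.xml file, placed next to schema.xml:

    <?xml version="1.0" encoding="UTF-8"?>
    <enumsConfig>
     <!-- the allowed severity values; assumed for illustration -->
     <enum name="severity">
      <value>Low</value>
      <value>Medium</value>
      <value>High</value>
      <value>Critical</value>
     </enum>
    </enumsConfig>

With this in place, a document whose severity field carries a value outside this list is rejected during indexing.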

Removing HTML tags during indexing


There are many real-life situations when you have to clean your data. Let's assume that you want to index web pages that your client sends you. You don't know anything about the structure of the pages; the one thing you do know is that you must provide a search mechanism that enables searching through their content. Of course, you could index the whole page split by whitespace, but then you would probably hear the client complain about HTML tags being searchable, and so on. So, before we enable searching on the contents of the pages, we need to clean the data. This recipe will show you how to remove HTML tags with Solr.

How to do it...

Now, let's take a look at the steps needed to remove the HTML tags from our data.

  1. We start by assuming that our data looks like this:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="html"><![CDATA[<html><head><title>My page</title></head...

Storing data outside of Solr index


Although Solr allows us to use the partial update API to update a single field of a document, what it does in the background is a complete reindexing of the document. However, there are situations where such reindexing is not practical. For example, we can have an index containing articles about published books, where we also store how many users have visited and read each article. The number of users is so high that we get thousands of updates per second, and sending such a high volume of updates can be demanding for Solr. Instead, we can store this information in external files and use it for boosting or sorting. This recipe will show you how to do this.

How to do it...

The following steps are needed to achieve our requirements:

  1. First of all, we will create the index structure by adding the following field definition to our schema.xml file:

    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="visits" type="visitsType...

Using synonyms


Let's assume we have an e-commerce client and we are providing a search system based on Solr. Our index has thousands of documents that mainly consist of books, and everything works fine. Then, one day, someone from the marketing department comes into your office and says that he wants to be able to find books that contain the word machine when he types electronics into the search box. The first thing that comes to mind is "hey, do it in the source data and I'll index that". However, this is not an option this time, because there can be many documents in the database that contain those words, and we don't want to change the whole database. This is when synonyms come into play, and this recipe will show you how to use them.

How to do it...

To keep the example as simple as possible, let's assume that we have only two fields in our index.

  1. Let's start with defining our index structure by adding the following field definition section to the schema.xml file:

    <field name="id" type="string" indexed...

Stemming different languages


Stemming is a very common requirement; it is the process of reducing words to their root form (or stems). Let's imagine a book e-commerce store where we store the books' names and descriptions. We want to be able to find words such as shown and showed when we type the word show, and vice versa. We can achieve this using stemming algorithms. The thing is, there is no universal stemmer; they are language specific. This recipe will show you how to add stemming to your data analysis chain and where to look for a list of available stemmers.

How to do it...

To achieve our requirement to stem English, we need to take certain steps:

  1. We will start with the index structure. Let's assume that our index consists of three fields that we defined in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="string" indexed="true" stored="true" />
    <field name="description" type="text_stem" indexed...

Using nonaggressive stemmers


Nowadays, it's nice to have stemming algorithms (algorithms that reduce words to their stem or root forms) in your application, so that a user can find words such as cat and cats just by typing cat. However, let's imagine that you have a search engine that searches through the contents of books in a library. One of the requirements is reducing the plural forms of words to their singular forms; nothing less, nothing more. Can Solr do this? Yes, Solr can, and this recipe will show you how.

How to do it...

  1. First, let's start with a simple, two-field index (add the following section to your schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="description" type="text_light_stem" indexed="true" stored="true" />
  2. Now, let's define the text_light_stem field type, which should look like this (add this to your schema.xml file):

    <fieldType name="text_light_stem" class="solr...

Using the n-gram approach to do performant trailing wildcard searches


Many users working with traditional RDBMS systems are used to wildcard searches. The most common among them are the ones using the * character, which means zero or more characters. If you have used SQL databases, you have probably seen searches such as this:

AND name LIKE 'ABC12%'

However, wildcard searches are not very efficient when it comes to Solr. This is because Solr needs to enumerate all the terms that match the wildcard pattern when the query is executed. So, how do we prepare our Solr deployment to handle trailing wildcard characters in an efficient way? This recipe will show you how to prepare your data and make such searches efficient.

How to do it...

We need to take a few steps to make wildcard searches efficient using the n-gram approach:

  1. The first step is to create a proper index structure. Let's assume we have the following fields defined in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name...
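The schema excerpt is truncated above. The idea of the recipe is to generate edge n-grams at index time only, so that a plain query term matches everything a trailing wildcard would; a sketch, assuming solr.EdgeNGramFilterFactory (the type name and gram sizes are assumptions):

    <fieldType name="text_wildcard" class="solr.TextField">
     <analyzer type="index">
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
      <!-- abc12 is indexed as a, ab, abc, abc1, abc12 -->
      <filter class="solr.EdgeNGramFilterFactory" minGramSize="1" maxGramSize="25" />
     </analyzer>
     <analyzer type="query">
      <!-- no n-gram filter at query time; the query term is matched against the indexed grams -->
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
    </fieldType>

A query such as name:abc12 now behaves like the LIKE 'ABC12%' example, but it is a cheap single-term lookup instead of a term enumeration.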

Using position increment to divide sentences


Imagine that we want to search in the short notes created by our users. We want to have two possibilities—searching inside a single sentence and searching inside the whole content of the note. We also know that our users don't write notes longer than 100 sentences, and each sentence has a maximum of 100 words, giving us a maximum of 10,000 words per note. To achieve this, we will use position increments that allow us to control how data is divided in the same field.

How to do it...

The following steps will allow us to fulfill our requirements:

  1. We start with example data, which will look like this:

    <add>
     <doc>
      <field name="id">1</field>
      <field name="note_line">Support meeting at Monday.</field>
      <field name="note_line">Need to prepare presentation.</field>
     </doc>
    </add>
  2. Now, we need to create an index structure. To do this, we need to add the fields that will be used. We do this by adding...
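The index structure step is cut off above. The essential piece is a multivalued field whose type declares a positionIncrementGap at least as large as the longest sentence; a sketch (the type name and analysis chain are assumptions; the gap of 100 follows from the 100-words-per-sentence limit stated earlier):

    <field name="note_line" type="text_sentence" indexed="true" stored="true" multiValued="true" />

    <fieldType name="text_sentence" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.LowerCaseFilterFactory" />
     </analyzer>
    </fieldType>

The gap inserts 100 positions between consecutive note_line values, so a phrase query can never match across two sentences, while an ordinary term query still searches the whole note.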

Using patterns to replace tokens


Let's assume that we want to search inside user blog posts. We need to prepare a simple search returning only the identifiers of the documents that were matched. However, we want to remove some words because of explicit language. Of course, we could do this using the stop words functionality, but what if we also want to know how many documents had their contents censored, so that we can compute statistics on them? In such a case, we can't use the stop words functionality; we need something more powerful, which means regular expressions. This recipe will show you how to achieve such a requirement using Solr and one of its filters.

How to do it...

To achieve our needs, we will use the solr.PatternReplaceFilterFactory filter. Let's assume that we want to remove all the words that start with the word prefix. These are the steps needed:

  1. First, we need to create our index structure, so the fields we add to the schema.xml file are as follows:

    <field name="id" type="string" indexed...