Chapter 2. Indexing Your Data

In this chapter, we will cover the following topics:

  • Indexing PDF files

  • Counting the number of fields

  • Using parsing update processors to parse data

  • Using scripting update processors to modify documents

  • Indexing data from a database using Data Import Handler

  • Incremental imports with DIH

  • Transforming data when using DIH

  • Indexing multiple geographical points

  • Updating document fields

  • Detecting the document language during indexation

  • Optimizing the primary key indexation

  • Handling multiple currencies

Introduction


Indexing data is one of the most crucial things in a Lucene and Solr deployment. When your data is not indexed properly, your search results will be poor. When the search results are poor, it is almost certain that users will not be satisfied with the application that uses Solr. This is why we need our data to be prepared and indexed as promptly and accurately as possible.

On the other hand, preparing data is not an easy task. Nowadays, we have more and more data floating around, and we need to index multiple formats of data from multiple sources. Do we need to parse the data manually and prepare it in XML format? The answer is no; we can let Solr do this for us. This chapter concentrates on the indexing process and data preparation: from indexing binary PDF files, to using the Data Import Handler to fetch data from a database and index it with Apache Solr, to detecting the document language during indexation. We will also learn how...

Indexing PDF files


The library on the corner that we used to go to wants to expand its collection and make it available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that they can be shared with online users. With all the samples provided by the suppliers comes a problem: how to extract data for the search box from more than 900,000 PDF files. Solr can do it with the use of Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.

How to do it...

To index PDF files, we will need to set up Solr to use extracting request handlers. To do this, we will take the following steps:

  1. First, let's edit our Solr instance, solrconfig.xml, and add the following configuration:

    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
     <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str...

Counting the number of fields


Imagine a situation where we have simple documents, with titles and tags, to be indexed in Solr. We want to separate the premium documents, the ones that have more tag values, because they are better in terms of our business. Of course, we can count the number of tags ourselves, but why not let Solr do this? This recipe will show you how to do this with Solr.

How to do it...

Let's look at the steps we need to take to count the number of field values.

  1. We start with the index structure. What we need to do is put the following section in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="tags_count" type="int" indexed="true" stored="true"/>
  2. The next thing is our test data, which looks as follows:

    <add>
     <doc>
      <...
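
The sample data and the rest of the recipe are truncated here. The heart of the recipe is an update request processor chain; below is a sketch of how such a chain is typically assembled from Solr's CloneFieldUpdateProcessorFactory and CountFieldValuesUpdateProcessorFactory (the chain name is my choice, not necessarily the book's):

    <updateRequestProcessorChain name="count">
     <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">tags</str>
      <str name="dest">tags_count</str>
     </processor>
     <processor class="solr.CountFieldValuesUpdateProcessorFactory">
      <str name="fieldName">tags_count</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The clone processor copies every value of tags into tags_count, and the count processor then replaces those copies with the number of values. Sending documents through this chain (for example, by adding update.chain=count to the update request) fills in the tag count automatically.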

Using parsing update processors to parse data


Let's assume that we are running a bookstore, and we want to sort our books by publication date and run faceting on the number of likes each book gets. However, we get all our data in XML, and the values are not in the proper format. The good thing is that we can tell Solr to parse our data properly so that we don't have to change what we already have. This recipe will show you how to do this.

Getting ready

Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to update request processor chain configuration.

How to do it...

Let's look at the steps we need to take to make data parsing work.

  1. First, we need to prepare our index structure, so we add the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="published...

Using scripting update processors to modify documents


Sometimes, we need to modify documents during indexing, and we don't want to do this on the indexing application side. For example, we have documents describing Internet sites. What we want is to be able to filter the sites on the basis of the protocol used, for example, http or https. We don't have this information; we only have the whole URL address. Let's see how we can achieve this with Solr.

Getting ready

Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to update request processor chain configuration.

How to do it...

The following steps will take you through the process of achieving our goal:

  1. First, we start with the index structure, putting the following section in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="url" type="text_general" indexed="true" stored="true"/>
    <field...
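
The remainder of the recipe is truncated here. The mechanism it relies on is Solr's StatelessScriptUpdateProcessorFactory; below is a sketch of the chain and a matching JavaScript file (the script file name and the protocol target field are my assumptions based on the recipe description, not the locked text):

    <updateRequestProcessorChain name="script">
     <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">update-script.js</str>
     </processor>
     <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The update-script.js file, placed in the conf directory, could look like this:

    function processAdd(cmd) {
      var doc = cmd.solrDoc;
      var url = doc.getFieldValue("url");
      if (url != null && url.indexOf("://") > 0) {
        // everything before "://" is the protocol, for example http or https
        doc.addField("protocol", url.substring(0, url.indexOf("://")));
      }
    }
    // The remaining handlers must be defined, even if they do nothing
    function processDelete(cmd) { }
    function processMergeIndexes(cmd) { }
    function processCommit(cmd) { }
    function processRollback(cmd) { }
    function finish() { }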

Indexing data from a database using Data Import Handler


One of our clients has a problem. His database of users has grown to such a size that even a simple SQL select takes too much time, and he is looking for a way to improve search times. Of course, he has heard about Solr, but he doesn't want to generate XML or any other data format and push it to Solr; he would like the data to be fetched. What can we do about it? Well, there is one thing: we can use one of the contrib modules of Solr, the Data Import Handler. This recipe will show you how to configure the basic setup of the Data Import Handler and how to use it.

How to do it...

Let's assume that we have a database table. To select users from our table, we use the following SQL query:

SELECT user_id, user_name FROM users

The response might look like this:

| user_id | user_name     |
|---------|---------------|
| 1       | John Kowalski |
| 2       | Amanda Looks  |

We also have a second table called users_description, where we store the descriptions of users. The SQL query...
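
The rest of the recipe is truncated here. For orientation, a Data Import Handler setup built around these queries usually has two parts. First, the handler is registered in solrconfig.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
      <str name="config">db-data-config.xml</str>
     </lst>
    </requestHandler>

Second, the db-data-config.xml file describes the data source and the entities; the JDBC driver, connection details, and target field names below are placeholders, not the book's exact values:

    <dataConfig>
     <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/users" user="solr" password="secret"/>
     <document>
      <entity name="user" query="SELECT user_id, user_name FROM users">
       <field column="user_id" name="id"/>
       <field column="user_name" name="name"/>
       <entity name="user_desc" query="SELECT description FROM users_description WHERE user_id = ${user.user_id}">
        <field column="description" name="description"/>
       </entity>
      </entity>
     </document>
    </dataConfig>

A full import is then triggered by calling the /dataimport handler with the command=full-import parameter.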

Incremental imports with DIH


In most use cases, indexing the data from scratch during every indexation doesn't make sense. Why index your 100,000 documents when only 1,000 of them were modified or added? This is where the Solr Data Import Handler delta queries come in handy. Using them, we can index our data incrementally. This recipe will show you how to set up the Data Import Handler to use delta queries and index data in an incremental way.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter to get to know the basics of the Data Import Handler configuration. I assume that Solr is set up according to the description given in the mentioned recipe.

How to do it...

We will reuse parts of the configuration shown in the Indexing data from a database using Data Import Handler recipe in this chapter, and we will modify it. Execute the following steps:

  1. The first thing you should do is add an additional column to the tables you use, a column that will specify...
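
The remaining steps are truncated here. Assuming the added column stores the last modification time (I'll call it last_modified; the name is my assumption), the entity definition is usually extended with delta queries along these lines:

    <entity name="user" pk="user_id"
            query="SELECT user_id, user_name FROM users"
            deltaQuery="SELECT user_id FROM users WHERE last_modified &gt; '${dih.last_index_time}'"
            deltaImportQuery="SELECT user_id, user_name FROM users WHERE user_id = '${dih.delta.user_id}'">
     <field column="user_id" name="id"/>
     <field column="user_name" name="name"/>
    </entity>

The deltaQuery finds the identifiers of rows changed since the last import (Solr keeps that timestamp in the dataimport.properties file), and deltaImportQuery fetches the full rows for those identifiers. The incremental run is triggered with the command=delta-import parameter.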

Transforming data when using DIH


Data stored in our data source is not always in the form we would like it to have in our Solr index. For example, imagine that you want to split the first and last names into two fields during indexing because they reside in a single database column separated by a whitespace character. Of course, we could modify our database, but in most cases this is not possible. Can we do this in Solr? Of course we can; we just need to add some more configuration details to the Data Import Handler configuration. This recipe will show you how to do this.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter.

How to do it...

We will reuse the data from the Indexing data from a database using Data Import Handler recipe in this chapter. So, to select users from our table, we use the following SQL query:

SELECT user_id, user_name FROM users

The response in the text client looks as follows:

| user_id | user_name...
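
The rest of the recipe is truncated here. The splitting itself is typically done with the Data Import Handler's RegexTransformer; below is a sketch, where the firstname and lastname target columns are my assumptions:

    <entity name="user" transformer="RegexTransformer"
            query="SELECT user_id, user_name FROM users">
     <field column="user_id" name="id"/>
     <field column="user_name" regex="(\S+)\s+(\S+)" groupNames="firstname,lastname"/>
    </entity>

The regex captures the two whitespace-separated parts of user_name, and groupNames maps each capture group to a new column, which can then be indexed into separate fields.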

Indexing multiple geographical points


Let's assume we have a website that allows you to search for companies not only using keywords but also using a geographical location. In the real world, companies tend to have more than a single location. This is where we hit a limitation of the default spatial field used by Solr: it can only index a single location. So, we either have to create multiple documents, one for each company location, and use group collapsing, or we can use a different field type that allows multivalued location fields. This recipe will show you how to do the latter.

How to do it...

The following steps will take you through the process of enabling the indexation of multivalued spatial fields.

  1. First, we need to prepare our index structure by adding the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="loc" type...

Updating document fields


Imagine that you have a system where you store documents your users upload. In addition to this, your users can grant other users access to the files they uploaded. Before Solr 4, you had to reindex the whole document to update it. With Solr 4 and later versions, we are allowed to update a single field, provided we fulfill some basic requirements. This recipe will show you how to do this.

How to do it...

Let's look at the steps we need to take to update the document field:

  1. For the purpose of the recipe, let's assume we have the following index structure (put the following entries into your schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="file" type="text_general" indexed="true" stored="true"/>
    <field name="count" type="int" indexed="true" stored="true"/>
    <field name="user" type="string" indexed="true" stored="true" multiValued="true" />
  2. In addition to this, we need the _version_...
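
The rest of the recipe is truncated here. The basic requirements are that the schema fields are stored (so that Solr can reconstruct the document internally) and that the schema contains the _version_ field. Once they are met, an atomic update sends only the changes; below is a sketch using the XML update format (the identifier and values are placeholders):

    <add>
     <doc>
      <field name="id">1</field>
      <field name="user" update="add">jane</field>
      <field name="count" update="inc">1</field>
     </doc>
    </add>

The update attribute accepts set (replace the value), add (append to a multivalued field), and inc (increment a numeric field); Solr fetches the stored document, applies the changes, and reindexes it internally.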

Detecting the document language during indexation


Imagine a situation where you have users from different countries and you would like to give them the choice to see only the indexed content that is written in their native language. However, there is one problem: your documents don't have their language identified, so we need to do this ourselves. Let's see how we can identify the language of documents during indexing time and store this information along with the documents in the index for later use.

How to do it...

For language identification, we will use one of the Solr contribution modules, but let's start from the beginning:

  1. For the purpose of the recipe, I assume that we will use the following index structure (we just need to add the following to the schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="description" type="text_general...

Optimizing the primary key indexation


Most of the data stored in Solr has some kind of primary key. Primary keys differ from most of the fields in your data in that each document stores a unique value for them. However, a search on the primary key field is not always as fast as you would expect when you compare it to other databases. So, is there anything we can do to make it faster? Since Solr 4.0, there is, and this recipe will show you how to improve the execution time of queries run against unique fields in Solr.

Note

Keep in mind that the method shown in this recipe is very case dependent, and you might not see a great performance increase with the mentioned change. What's more, if you are using the newest version of Solr/Lucene, the pulsing codec is already a part of the default Lucene postings format.

How to do it...

  1. Let's assume we have the following field defined as the unique key for our Solr collection. So, in your schema.xml file, you will...
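
The rest of the recipe is truncated here. In Solr 4.x, the trick was to assign the pulsing postings format to the unique key field, which keeps the postings for unique terms inline in the term dictionary and saves a disk seek per lookup. A sketch, valid for Solr 4.x only since newer versions fold this optimization into the default postings format, as the note above says:

    <codecFactory class="solr.SchemaCodecFactory"/>

    <fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41"/>
    <field name="id" type="string_pulsing" indexed="true" stored="true" required="true"/>

The codecFactory entry in solrconfig.xml is what makes Solr honor the per-field postingsFormat attribute declared in schema.xml.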

Handling multiple currencies


Imagine a situation where you run an e-commerce site and sell your products all over the world. One day, you decide that you want to handle currency conversion yourself and have all the goodies that Solr gives you work across all the currencies you support. You can, of course, add multiple fields, one for each currency. Alternatively, you can use the functionality introduced in Solr 4 and create a field that uses provided currency exchange rates. This recipe will show you how to configure and use multiple currencies with a single field in the index.

How to do it...

  1. Let's start with creating a sample index structure by modifying the schema.xml file so that the field definition looks like this:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="price" type="currencyField" indexed="true" stored="true" />
  2. In addition to this, we need...
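
The rest of the recipe is truncated here. The field type behind this is solr.CurrencyField; below is a sketch of its definition and of a matching currency.xml exchange rate file (the rate values are illustrative only):

    <fieldType name="currencyField" class="solr.CurrencyField"
               defaultCurrency="USD" currencyConfig="currency.xml" precisionStep="8"/>

The currency.xml file, placed next to the schema, provides the exchange rates:

    <currencyConfig version="1.0">
     <rates>
      <rate from="USD" to="EUR" rate="0.78"/>
      <rate from="USD" to="GBP" rate="0.61"/>
     </rates>
    </currencyConfig>

Values are indexed with their currency code, for example 10.00,USD, and queries can mix currencies freely; a range query such as price:[10.00,EUR TO 20.00,EUR] is converted using the configured rates at query time.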

About the author

Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as a consultant and software engineer at Sematext Group Inc., where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains, from banking software to e-commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days. Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came, and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest. Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.