Chapter 2. Indexing Your Data

In this chapter, we will cover the following topics:

  • Indexing PDF files

  • Counting the number of fields

  • Using parsing update processors to parse data

  • Using scripting update processors to modify documents

  • Indexing data from a database using Data Import Handler

  • Incremental imports with DIH

  • Transforming data when using DIH

  • Indexing multiple geographical points

  • Updating document fields

  • Detecting the document language during indexation

  • Optimizing the primary key indexation

  • Handling multiple currencies

Introduction


Indexing data is one of the most crucial things in a Lucene and Solr deployment. When your data is not indexed properly, your search results will be poor. When the search results are poor, it is almost certain that users will not be satisfied with the application that uses Solr. This is why we need our data to be prepared and indexed as promptly and accurately as possible.

On the other hand, preparing data is not an easy task. Nowadays, we have more and more data floating around, and we need to index multiple formats of data from multiple sources. Do we need to parse the data manually and prepare it in XML format? The answer is no; we can let Solr do this for us. This chapter concentrates on the indexing process and data preparation: from indexing binary PDF files, to using the Data Import Handler to fetch data from a database and index it with Apache Solr, to detecting the document language during indexation. We will also learn how...

Indexing PDF files


The library on the corner that we used to go to wants to expand its collection and make it available to the wider public through the World Wide Web. It asked its book suppliers to provide sample chapters of all the books in PDF format so that they can be shared with online users. With all the samples provided by the suppliers comes a problem: how to extract data for the search box from more than 900,000 PDF files. Solr can do it with the use of Apache Tika (http://tika.apache.org/). This recipe will show you how to handle such a task.

How to do it...

To index PDF files, we will need to set up Solr to use extracting request handlers. To do this, we will take the following steps:

  1. First, let's edit our Solr instance, solrconfig.xml, and add the following configuration:

    <requestHandler name="/update/extract" class="solr.extraction.ExtractingRequestHandler">
     <lst name="defaults">
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str...

Counting the number of fields


Imagine a situation where we have simple documents, with titles and tags, to be indexed in Solr. We want to separate the premium documents, the ones that have more tag values, because they are better in terms of our business. Of course, we can count the number of tags ourselves, but why not let Solr do this? This recipe will show you how to do this with Solr.

How to do it...

Let's look at the steps we need to take to count the number of field values.

  1. We start with the index structure. What we need to do is put the following section in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true"/>
    <field name="tags" type="string" indexed="true" stored="true" multiValued="true"/>
    <field name="tags_count" type="int" indexed="true" stored="true"/>
  2. The next thing is our test data, which looks as follows:

    <add>
     <doc>
      <...
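
The sample data and the rest of the recipe are truncated here. The heart of the recipe is an update request processor chain; below is a sketch of how such a chain is typically assembled from Solr's CloneFieldUpdateProcessorFactory and CountFieldValuesUpdateProcessorFactory (the chain name is my choice, not necessarily the book's):

    <updateRequestProcessorChain name="count">
     <processor class="solr.CloneFieldUpdateProcessorFactory">
      <str name="source">tags</str>
      <str name="dest">tags_count</str>
     </processor>
     <processor class="solr.CountFieldValuesUpdateProcessorFactory">
      <str name="fieldName">tags_count</str>
     </processor>
     <processor class="solr.LogUpdateProcessorFactory"/>
     <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The clone processor copies every value of tags into tags_count, and the count processor then replaces those copies with the number of values. Sending documents through this chain (for example, by adding update.chain=count to the update request) fills in the tag count automatically.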

Using parsing update processors to parse data


Let's assume that we are running a bookstore, and we want to sort our books by publication date and run faceting on the number of likes each book gets. However, we get all our data in XML, and the values are not in the proper format. The good thing is that we can tell Solr to parse our data properly so that we don't have to change what we already have. This recipe will show you how to do this.

Getting ready

Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to update request processor chain configuration.

How to do it...

Let's look at the steps we need to take to make data parsing work.

  1. First, we need to prepare our index structure, so we add the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="title" type="text_general" indexed="true" stored="true" />
    <field name="published...

Using scripting update processors to modify documents


Sometimes, we need to modify documents during indexing, and we don't want to do this on the indexing application side. For example, we have documents describing Internet sites. What we want is to be able to filter the sites on the basis of the protocol used, for example, http or https. We don't have this information; we only have the whole URL address. Let's see how we can achieve this with Solr.

Getting ready

Before continuing with this recipe, I suggest reading the Counting the number of fields recipe of this chapter to get used to update request processor chain configuration.

How to do it...

The following steps will take you through the process of achieving our goal:

  1. First, we start with the index structure, putting the following section in the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="url" type="text_general" indexed="true" stored="true"/>
    <field...
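
The remainder of the recipe is truncated here. The mechanism it relies on is Solr's StatelessScriptUpdateProcessorFactory; below is a sketch of the chain and a matching JavaScript file (the script file name and the protocol target field are my assumptions based on the recipe description, not the locked text):

    <updateRequestProcessorChain name="script">
     <processor class="solr.StatelessScriptUpdateProcessorFactory">
      <str name="script">update-script.js</str>
     </processor>
     <processor class="solr.RunUpdateProcessorFactory"/>
    </updateRequestProcessorChain>

The update-script.js file, placed in the conf directory, could look like this:

    function processAdd(cmd) {
      var doc = cmd.solrDoc;
      var url = doc.getFieldValue("url");
      if (url != null && url.indexOf("://") > 0) {
        // everything before "://" is the protocol, for example http or https
        doc.addField("protocol", url.substring(0, url.indexOf("://")));
      }
    }
    // The remaining handlers must be defined, even if they do nothing
    function processDelete(cmd) { }
    function processMergeIndexes(cmd) { }
    function processCommit(cmd) { }
    function processRollback(cmd) { }
    function finish() { }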

Indexing data from a database using Data Import Handler


One of our clients has a problem. His database of users has grown to such a size that even a simple SQL select takes too much time, and he is looking for a way to improve search times. Of course, he has heard about Solr, but he doesn't want to generate XML or any other data format and push it to Solr; he would like the data to be fetched. What can we do about it? Well, there is one thing: we can use one of the contrib modules of Solr, the Data Import Handler. This recipe will show you how to configure the basic setup of the Data Import Handler and how to use it.

How to do it...

Let's assume that we have a database table. To select users from our table, we use the following SQL query:

SELECT user_id, user_name FROM users

The response might look like this:

| user_id | user_name     |
|---------|---------------|
| 1       | John Kowalski |
| 2       | Amanda Looks  |

We also have a second table called users_description, where we store the descriptions of users. The SQL query...
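
The rest of the recipe is truncated here. For orientation, a Data Import Handler setup built around these queries usually has two parts. First, the handler is registered in solrconfig.xml:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
     <lst name="defaults">
      <str name="config">db-data-config.xml</str>
     </lst>
    </requestHandler>

Second, the db-data-config.xml file describes the data source and the entities; the JDBC driver, connection details, and target field names below are placeholders, not the book's exact values:

    <dataConfig>
     <dataSource driver="org.postgresql.Driver" url="jdbc:postgresql://localhost:5432/users" user="solr" password="secret"/>
     <document>
      <entity name="user" query="SELECT user_id, user_name FROM users">
       <field column="user_id" name="id"/>
       <field column="user_name" name="name"/>
       <entity name="user_desc" query="SELECT description FROM users_description WHERE user_id = ${user.user_id}">
        <field column="description" name="description"/>
       </entity>
      </entity>
     </document>
    </dataConfig>

A full import is then triggered by calling the /dataimport handler with the command=full-import parameter.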

Incremental imports with DIH


In most use cases, indexing the data from scratch during every indexation doesn't make sense. Why index your 100,000 documents when only 1,000 of them were modified or added? This is where the Solr Data Import Handler delta queries come in handy. Using them, we can index our data incrementally. This recipe will show you how to set up the Data Import Handler to use delta queries and index data in an incremental way.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter to get to know the basics of the Data Import Handler configuration. I assume that Solr is set up according to the description given in the mentioned recipe.

How to do it...

We will reuse parts of the configuration shown in the Indexing data from a database using Data Import Handler recipe in this chapter, and we will modify it. Execute the following steps:

  1. The first thing you should do is add an additional column to the tables you use, a column that will specify...
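
The remaining steps are truncated here. Assuming the added column stores the last modification time (I'll call it last_modified; the name is my assumption), the entity definition is usually extended with delta queries along these lines:

    <entity name="user" pk="user_id"
            query="SELECT user_id, user_name FROM users"
            deltaQuery="SELECT user_id FROM users WHERE last_modified &gt; '${dih.last_index_time}'"
            deltaImportQuery="SELECT user_id, user_name FROM users WHERE user_id = '${dih.delta.user_id}'">
     <field column="user_id" name="id"/>
     <field column="user_name" name="name"/>
    </entity>

The deltaQuery finds the identifiers of rows changed since the last import (Solr keeps that timestamp in the dataimport.properties file), and deltaImportQuery fetches the full rows for those identifiers. The incremental run is triggered with the command=delta-import parameter.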

Transforming data when using DIH


Data stored in our data source is not always in the form we would like it to have in our Solr index. For example, imagine that you want to split the first and last names into two fields during indexing because they reside in a single database column separated by a whitespace character. Of course, we could modify our database, but in most cases this is not possible. Can we do this in Solr? Of course we can; we just need to add some more configuration details to the Data Import Handler configuration. This recipe will show you how to do this.

Getting ready

Refer to the Indexing data from a database using Data Import Handler recipe in this chapter.

How to do it...

We will reuse the data from the Indexing data from a database using Data Import Handler recipe in this chapter. So, to select users from our table, we use the following SQL query:

SELECT user_id, user_name FROM users

The response in the text client looks as follows:

| user_id | user_name...
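
The rest of the recipe is truncated here. The splitting itself is typically done with the Data Import Handler's RegexTransformer; below is a sketch, where the firstname and lastname target columns are my assumptions:

    <entity name="user" transformer="RegexTransformer"
            query="SELECT user_id, user_name FROM users">
     <field column="user_id" name="id"/>
     <field column="user_name" regex="(\S+)\s+(\S+)" groupNames="firstname,lastname"/>
    </entity>

The regex captures the two whitespace-separated parts of user_name, and groupNames maps each capture group to a new column, which can then be indexed into separate fields.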

Indexing multiple geographical points


Let's assume we have a website that allows you to search for companies not only using keywords but also using a geographical location. In the real world, companies tend to have more than a single location. This is where we hit a limitation of the default spatial field used by Solr: it can only index a single location. So, we either have to create multiple documents, one for each company location, and use group collapsing, or we can use a different field type that allows multivalued location fields. This recipe will show you how to do the latter.

How to do it...

The following steps will take you through the process of enabling the indexation of multivalued spatial fields.

  1. First, we need to prepare our index structure by adding the following section to the schema.xml file:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="loc" type...

Updating document fields


Imagine that you have a system where you store documents your users upload. In addition to this, your users can grant other users access to the files they uploaded. Before Solr 4, you had to reindex the whole document to update it. With Solr 4 and later versions, we are allowed to update a single field, provided we fulfill some basic requirements. This recipe will show you how to do this.

How to do it...

Let's look at the steps we need to take to update the document field:

  1. For the purpose of the recipe, let's assume we have the following index structure (put the following entries into your schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="file" type="text_general" indexed="true" stored="true"/>
    <field name="count" type="int" indexed="true" stored="true"/>
    <field name="user" type="string" indexed="true" stored="true" multiValued="true" />
  2. In addition to this, we need the _version_...
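
The rest of the recipe is truncated here. The basic requirements are that the schema fields are stored (so that Solr can reconstruct the document internally) and that the schema contains the _version_ field. Once they are met, an atomic update sends only the changes; below is a sketch using the XML update format (the identifier and values are placeholders):

    <add>
     <doc>
      <field name="id">1</field>
      <field name="user" update="add">jane</field>
      <field name="count" update="inc">1</field>
     </doc>
    </add>

The update attribute accepts set (replace the value), add (append to a multivalued field), and inc (increment a numeric field); Solr fetches the stored document, applies the changes, and reindexes it internally.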

Detecting the document language during indexation


Imagine a situation where you have users from different countries and you would like to give them the choice to see only the indexed content that is written in their native language. However, there is one problem: your documents don't have their language identified, so we need to do this ourselves. Let's see how we can identify the language of documents during indexing time and store this information along with the documents in the index for later use.

How to do it...

For language identification, we will use one of the Solr contribution modules, but let's start from the beginning:

  1. For the purpose of the recipe, I assume that we will use the following index structure (we just need to add the following to the schema.xml file):

    <field name="id" type="string" indexed="true" stored="true" required="true" multiValued="false" />
    <field name="name" type="text_general" indexed="true" stored="true"/>
    <field name="description" type="text_general...

Optimizing the primary key indexation


Most of the data stored in Solr has some kind of primary key. Primary keys differ from most of the fields in your data in that each document stores a unique value for them. However, a search on the primary key field is not always as fast as you would expect when you compare it to other databases. So, is there anything we can do to make it faster? Since Solr 4.0, there is, and this recipe will show you how to improve the execution time of queries run against unique fields in Solr.

Note

Keep in mind that the method shown in this recipe is very case dependent, and you might not see a great performance increase with the mentioned change. What's more, if you are using the newest version of Solr/Lucene, the pulsing codec is already a part of the default Lucene postings format.

How to do it...

  1. Let's assume we have the following field defined as the unique key for our Solr collection. So, in your schema.xml file, you will...
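
The rest of the recipe is truncated here. In Solr 4.x, the trick was to assign the pulsing postings format to the unique key field, which keeps the postings for unique terms inline in the term dictionary and saves a disk seek per lookup. A sketch, valid for Solr 4.x only since newer versions fold this optimization into the default postings format, as the note above says:

    <codecFactory class="solr.SchemaCodecFactory"/>

    <fieldType name="string_pulsing" class="solr.StrField" postingsFormat="Pulsing41"/>
    <field name="id" type="string_pulsing" indexed="true" stored="true" required="true"/>

The codecFactory entry in solrconfig.xml is what makes Solr honor the per-field postingsFormat attribute declared in schema.xml.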

Handling multiple currencies


Imagine a situation where you run an e-commerce site and sell your products all over the world. One day, you decide that you want to handle currency conversion yourself and have all the goodies that Solr gives you work across all the currencies you support. You can, of course, add multiple fields, one for each currency. Alternatively, you can use the functionality introduced in Solr 4 and create a field that uses provided currency exchange rates. This recipe will show you how to configure and use multiple currencies with a single field in the index.

How to do it...

  1. Let's start with creating a sample index structure by modifying the schema.xml file so that the field definition looks like this:

    <field name="id" type="string" indexed="true" stored="true" required="true" />
    <field name="name" type="text_general" indexed="true" stored="true" />
    <field name="price" type="currencyField" indexed="true" stored="true" />
  2. In addition to this, we need...
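
The rest of the recipe is truncated here. The field type behind this is solr.CurrencyField; below is a sketch of its definition and of a matching currency.xml exchange rate file (the rate values are illustrative only):

    <fieldType name="currencyField" class="solr.CurrencyField"
               defaultCurrency="USD" currencyConfig="currency.xml" precisionStep="8"/>

The currency.xml file, placed next to the schema, provides the exchange rates:

    <currencyConfig version="1.0">
     <rates>
      <rate from="USD" to="EUR" rate="0.78"/>
      <rate from="USD" to="GBP" rate="0.61"/>
     </rates>
    </currencyConfig>

Values are indexed with their currency code, for example 10.00,USD, and queries can mix currencies freely; a range query such as price:[10.00,EUR TO 20.00,EUR] is converted using the configured rates at query time.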

About the author

Rafał Kuć is a software engineer, trainer, speaker, and consultant. He works as a consultant and software engineer at Sematext Group Inc., where he concentrates on open source technologies such as Apache Lucene, Solr, and Elasticsearch. He has more than 14 years of experience in various software domains, from banking software to e-commerce products. He is mainly focused on Java; however, he is open to every tool and programming language that might help him achieve his goals easily and quickly. Rafał is also one of the founders of the solr.pl site, where he tries to share his knowledge and help people solve their Solr and Lucene problems. He is also a speaker at various conferences around the world, such as Lucene Eurocon, Berlin Buzzwords, ApacheCon, Lucene/Solr Revolution, Velocity, and DevOps Days. Rafał began his journey with Lucene in 2002; however, it wasn't love at first sight. When he came back to Lucene in late 2003, he revised his thoughts about the framework and saw the potential in search technologies. Then Solr came, and that was it. He started working with Elasticsearch in the middle of 2010. At present, Lucene, Solr, Elasticsearch, and information retrieval are his main areas of interest. Rafał is also the author of the Solr Cookbook series, ElasticSearch Server and its second edition, and the first and second editions of Mastering ElasticSearch, all published by Packt Publishing.