Chapter 3. Indexing Your Data

We will cover the following topics in this chapter:

  • Obtaining an IndexWriter

  • Creating a StringField

  • Creating a TextField

  • Creating a numeric field

  • Creating a DocValue field

  • Transactional commits and index versioning

  • Reusing field and document objects per thread

  • Delving into field norms

  • Changing similarity implementation used during indexing

Introduction


An index in a search engine can make or break an application. A well-tuned index with a well-thought-out indexing process will not only reduce future maintenance costs, but will also reduce potentially expensive application failures due to data corruption or a breakdown in the data processing pipeline. We will dive deeper into the indexing process in this chapter to equip you with the knowledge you need to build a stable search application.

So far, we have covered the basics of setting up Lucene, ingesting data, and configuring the analysis process. In this chapter, we will explore the indexing process and learn more about the advanced techniques for configuring and tuning it.

Let's review what we've already learned about Lucene's internal index structure, specifically the inverted index. Consider the following sentences passing through StandardAnalyzer before being added to our index:

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses...
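
As a rough sketch of the resulting structure (showing just the first two sentences as documents 1 and 2): StandardAnalyzer lowercases tokens and removes common stop words such as "on", "a", and "the", so the inverted index maps each surviving term to the documents that contain it, roughly like this:

    humpty -> 1, 2
    dumpty -> 1, 2
    sat    -> 1
    wall   -> 1
    had    -> 2
    great  -> 2
    fall   -> 2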

Obtaining an IndexWriter


We have seen how an IndexWriter can be obtained by simply initializing it with an Analyzer and an IndexWriterConfig. The default initialization behavior works well the majority of the time. However, there may be situations where you need finer control over the initialization sequence. For example, by default, a new index is created if one doesn't already exist. This may not be ideal in a production environment where an index should always exist: silently generating a new index will hide the fact that the index is missing. Perhaps a glitch in the backup routine accidentally removed the index, or a data corruption issue somehow wiped out the index directory. In any case, it would be beneficial if we were made aware of the indexing status and alerted when issues are detected.

Lucene does provide options to control how an index is opened. We will talk about each option in detail in this section and show...
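
As a quick sketch of one such option (a minimal example assuming the Lucene 4.x API, with analyzer and directory already created; Version.LATEST assumes Lucene 4.10, so substitute the Version constant matching your release), OpenMode.APPEND makes the writer fail fast when no index exists instead of silently creating one:

    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    // APPEND refuses to create a new index; the default is CREATE_OR_APPEND
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    // Throws IndexNotFoundException if the directory contains no index
    IndexWriter indexWriter = new IndexWriter(directory, config);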

Creating a StringField


Let's do a quick recap of field objects in Lucene: they are the parts of a document that carry its information. A field is composed of three parts: name, type, and value. Values can be text, binary, or numeric. A field can also be stored in the index so that its value is returned along with hits. Lucene provides a number of field implementations out of the box that are suitable for most applications. In this section, we will cover a field implementation that stores a literal string: StringField. A value stored in this field is indexed but not tokenized; the entire string is treated as a single token.

So why wouldn't we want to tokenize the text, given that we have already talked about tokenization quite a bit? Consider that part of a document is an address, with fields such as street address, city, state, and country contained within it. It's not a very good idea to analyze and tokenize the city, state, and country, because it's...
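
As a brief sketch (the country field name and value here are illustrative, not from the recipe):

    Document document = new Document();
    // The entire value is indexed as a single token, so only an exact
    // match on "United States" (not "united" or "states") will hit
    document.add(new StringField("country", "United States", Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();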

Creating a TextField


Don't confuse a StringField with a TextField. Although both fields contain textual data, there are major differences between the two. A StringField is not tokenized, which makes it a good tool for exact matching and sorting. A TextField is tokenized, which makes it useful for storing any unstructured text for indexing. When you pass text through an Analyzer for indexing, a TextField is what's used to store the text content.

How to do it...

Similar to the way in which a StringField is set, adding a TextField is also very straightforward. Let's review how it's done:

    Document document = new Document();
    String text = "Lucene is an Information Retrieval library written in Java.";
    document.add(new TextField("text", text, Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();

How it works...

This is a very simple example showing how a TextField is added, assuming that you have an Analyzer already created for the IndexWriter on the text field...

Creating a numeric field


We've learned how to deal with textual content using a StringField and TextField in Lucene, so now let's take a look at how numerals are handled. Lucene provides four Field classes for storing numeric values: IntField, FloatField, LongField, and DoubleField, analogous to the corresponding Java numeric types. Lucene, being a text search engine, treats numerals as terms internally and indexes them in a trie structure (also called an ordered tree data structure), as illustrated in the following:

Each term is logically assigned to larger and larger predefined lower-precision brackets. For example, let's assume that each bracket is derived by dividing the values at the level below it by ten, as in the preceding diagram. So, under the 1 bracket (at the top level), we get the DocIds associated with values in the 100s range; under the 12 bracket, we get those associated with values in the 120s range, and so on. Now, let's say you want to search by numeric range for all documents with the...
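
As a hedged sketch of how this plays out in practice (Lucene 4.x API; the price field and its values are illustrative), an IntField is indexed like any other field, and a NumericRangeQuery can then match whole trie brackets instead of enumerating every individual value:

    Document document = new Document();
    document.add(new IntField("price", 125, Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();

    // Matches every document whose price falls within [100, 200];
    // the two booleans make both endpoints inclusive
    Query query = NumericRangeQuery.newIntRange("price", 100, 200, true, true);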

Creating a DocValue field


Similar to a stored field, a DocValue is part of a document. It's also created at indexing time and contains values that are specific to a document. The major difference between the two concerns their underlying storage structure. A field's storage is row-oriented, whereas DocValue storage is column-oriented. At retrieval time, all field values for a document are returned at once, so loading the relevant information about a document is very fast. However, if you need to scan a single field across documents for any other purpose, it will be a slow process, as you will have to iterate through all the documents and load each document's fields on every iteration. A DocValue is stored by column as a DocId-to-value mapping, so loading the values of a specific DocValue for all documents at once can be done quickly, as Lucene only has to scan through one column rather than iterating through each document to load a field. In summary, the field and DocValue both contain information about a document...
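
A brief sketch of both sides of this trade-off (Lucene 4.x API; the ranking field name is illustrative, and atomicReader stands in for an AtomicReader obtained from the index):

    // Indexing side: a DocValue is added alongside regular fields
    Document document = new Document();
    document.add(new StringField("title", "Humpty Dumpty", Field.Store.YES));
    document.add(new NumericDocValuesField("ranking", 10L));
    indexWriter.addDocument(document);
    indexWriter.commit();

    // Retrieval side: one column-wise structure holds the value for
    // every document in the segment, keyed by DocId
    NumericDocValues ranking = atomicReader.getNumericDocValues("ranking");
    long value = ranking.get(docId);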

Transactional commits and index versioning


In the world of data management platforms, anything that supports transactional commits would implement ACID (Atomicity, Consistency, Isolation, Durability). ACID is a set of properties that guarantees that transactions are processed reliably. So, how does Lucene measure against ACID?

  • Atomicity: This property requires that each transaction is all or nothing. When a transaction fails, none of the partial changes performed by the transaction should persist or be visible. Changes from a transaction should only persist and be made visible when the transaction completes and is committed. Lucene's IndexWriter supports transactional commits: changes to the index are only made visible to an IndexReader after we call commit() (see the sketch after this list). If an IndexWriter crashes for whatever reason or never calls commit(), the partial changes will never be visible to the IndexReader.

  • Consistency: This property ensures that any committed changes will bring the system from one valid...
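
A minimal sketch of the atomicity guarantee described above (assuming the enclosing method declares throws IOException, and that document and indexWriter already exist):

    try {
        indexWriter.addDocument(document);
        indexWriter.commit();   // everything since the last commit becomes visible
    } catch (IOException e) {
        // Discards all uncommitted changes; note that rollback()
        // also closes the IndexWriter
        indexWriter.rollback();
    }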

Reusing field and document objects per thread


Performance has always been a main focus of Lucene's development team, and we have all benefited from their commitment to efficiency. To properly leverage Lucene's speed, there are best practices we should adhere to so that we don't introduce unnecessary inefficiency. One of these best practices is to reuse both Document and field objects. This minimizes the object creation cost during any massive data import operation and reduces the chance of triggering garbage collection.

There are a couple of things to keep in mind when reusing a Document object: we need to clear out all the fields before putting in the new values, whereas for a field, we can simply overwrite the value.

How to do it...

Here is a sample code snippet on Document and field reuse:

    Analyzer analyzer = new StandardAnalyzer();
    Directory directory = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, config);
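
What follows is a hedged sketch of the reuse pattern this recipe describes (Lucene 4.x API; contents stands in for a collection of strings to be indexed):

    // Create the Document and Field once, outside the loop
    TextField textField = new TextField("text", "", Field.Store.YES);
    Document document = new Document();
    document.add(textField);
    for (String content : contents) {
        // Overwrite the field's value in place instead of allocating
        // a new Field on every iteration
        textField.setStringValue(content);
        indexWriter.addDocument(document);
    }
    indexWriter.commit();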

Delving into field norms


A norm is part of the calculation of a score that's used to measure relevancy. When we search, a score is calculated for each matching result. This score will then be used to sort the end results. The score is what we refer to as a relevancy score.

Norms are calculated per indexed field, as the product of an index-time calculation (based on TFIDFSimilarity) and lengthNorm (a calculated factor that favors shorter documents). A higher value can help boost the relevancy of a document, which means the document will rank higher in search results.

To further influence result relevancy, Lucene allows for two types of boosting: index-time boost and query-time boost. An index-time boost is set per indexed field and can be used to promote documents based on certain field values. A query-time boost can be set per query clause so that the scores of all documents matched by it are multiplied by the boost. It's useful when a certain clause should take precedence over everything else...
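
A short sketch of both kinds of boost (Lucene 4.x API; the field and term names are illustrative):

    // Index-time boost: set on a field instance before its document is
    // added; it is folded into that field's norm
    TextField title = new TextField("title", "Humpty Dumpty", Field.Store.YES);
    title.setBoost(2.0f);

    // Query-time boost: set on an individual query clause; scores of
    // documents matched by this clause are multiplied by the boost
    Query clause = new TermQuery(new Term("title", "humpty"));
    clause.setBoost(3.0f);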

Changing similarity implementation used during indexing


Part of the norms calculation at index time is similarity. Lucene ships with a sophisticated model called TFIDFSimilarity as the default calculation for norms; you can read more about it on Lucene's website. In this section, we will talk about how we can tune similarity to suit our needs.

We will go through a scenario similar to the one used in our norms example. Instead of using a boost to influence relevancy, we will leverage a NumericDocValuesField called ranking that will act as our boost. We will show you how to pull NumericDocValues at query time within a Similarity class and how to use it to influence the score. This exercise will give you an idea of what you can do with similarity customization.

Getting ready

To start writing your own Similarity class, you can begin by extending Similarity. Then, you can register your new class simply by calling IndexWriterConfig.setSimilarity(Similarity) at indexing time and IndexSearcher...
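
As a minimal sketch of what this looks like end to end (Lucene 4.x API; FlatLengthSimilarity is a hypothetical class that extends DefaultSimilarity rather than the bare Similarity base class, simply to keep the example short):

    // A hypothetical Similarity that removes the shorter-documents-score-
    // higher bias by ignoring field length in the norm calculation
    public class FlatLengthSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(FieldInvertState state) {
            return state.getBoost();   // the default also divides by sqrt(#terms)
        }
    }

    // Register it on both sides: at indexing time (norms are baked into
    // the index) and at search time (scoring must use the same model)
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    config.setSimilarity(new FlatLengthSimilarity());
    indexSearcher.setSimilarity(new FlatLengthSimilarity());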
