Chapter 3. Indexing Your Data

We will cover the following topics in this chapter:

  • Obtaining an IndexWriter

  • Creating a StringField

  • Creating a TextField

  • Creating a numeric field

  • Creating a DocValue field

  • Transactional commits and index versioning

  • Reusing field and document objects per thread

  • Delving into field norms

  • Changing similarity implementation used during indexing

Introduction


An index in a search engine can make or break an application. A well-tuned index with a well-thought-out indexing process will not only reduce future maintenance costs, but will also reduce potentially expensive application failures due to data corruption or a breakdown in the data processing pipeline. We will dive deeper into the indexing process in this chapter to equip you with the knowledge you need to build a stable search application.

So far, we have covered the basics of setting up Lucene, ingesting data, and configuring the analysis process. In this chapter, we will explore the indexing process and learn more about the advanced techniques for configuring and tuning it.

Let's review what we've already learned about Lucene's internal index structure, specifically the inverted index. Consider the following sentences passing through StandardAnalyzer before being added to our index:

Humpty Dumpty sat on a wall,
Humpty Dumpty had a great fall.
All the king's horses...
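
As a rough sketch of the resulting structure (showing just the first two sentences as documents 1 and 2): StandardAnalyzer lowercases tokens and removes common stop words such as "on", "a", and "the", so the inverted index maps each surviving term to the documents that contain it, roughly like this:

    humpty -> 1, 2
    dumpty -> 1, 2
    sat    -> 1
    wall   -> 1
    had    -> 2
    great  -> 2
    fall   -> 2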

Obtaining an IndexWriter


We have seen how an IndexWriter can be obtained by simply initializing it with an Analyzer and an IndexWriterConfig. The default initialization behavior works well the majority of the time. However, there may be situations where you need finer control over the initialization sequence. For example, by default, a new index is created if one doesn't already exist. This may not be ideal in a production environment where an index should always exist: silently generating a new index will hide the fact that the index is missing. Perhaps a glitch in the backup routine accidentally removed the index, or a data corruption issue somehow wiped out the index directory. In any case, it would be beneficial if we were made aware of the indexing status and alerted when issues are detected.

Lucene does provide options to control how an index is opened. We will talk about each option in detail in this section and show...
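
As a quick sketch of one such option (a minimal example assuming the Lucene 4.x API, with analyzer and directory already created; Version.LATEST assumes Lucene 4.10, so substitute the Version constant matching your release), OpenMode.APPEND makes the writer fail fast when no index exists instead of silently creating one:

    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    // APPEND refuses to create a new index; the default is CREATE_OR_APPEND
    config.setOpenMode(IndexWriterConfig.OpenMode.APPEND);
    // Throws IndexNotFoundException if the directory contains no index
    IndexWriter indexWriter = new IndexWriter(directory, config);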

Creating a StringField


Let's do a quick recap of field objects in Lucene: they are the parts of a document that carry its information. A field is composed of three parts: name, type, and value. Values can be text, binary, or numeric. A field can also be stored in the index so that its value is returned along with hits. Lucene provides a number of field implementations out of the box that are suitable for most applications. In this section, we will cover a field implementation that stores a literal string: StringField. A value stored in this field is indexed but not tokenized; the entire string is treated as a single token.

So why wouldn't we want to tokenize the text, given that we have already talked about tokenization quite a bit? Consider that part of a document is an address, with fields such as street address, city, state, and country contained within it. It's not a very good idea to analyze and tokenize the city, state, and country, because it's...
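
As a brief sketch (the country field name and value here are illustrative, not from the recipe):

    Document document = new Document();
    // The entire value is indexed as a single token, so only an exact
    // match on "United States" (not "united" or "states") will hit
    document.add(new StringField("country", "United States", Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();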

Creating a TextField


Don't confuse a StringField with a TextField. Although both fields contain textual data, there are major differences between the two. A StringField is not tokenized, which makes it a good tool for exact matching and sorting. A TextField is tokenized, which makes it useful for storing any unstructured text for indexing. When you pass text through an Analyzer for indexing, a TextField is what's used to store the text content.

How to do it...

Similar to the way in which a StringField is set, adding a TextField is also very straightforward. Let's review how it's done:

    Document document = new Document();
    String text = "Lucene is an Information Retrieval library written in Java.";
    document.add(new TextField("text", text, Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();

How it works...

This is a very simple example showing how a TextField is added, assuming that you have an Analyzer already created for the IndexWriter on the text field...

Creating a numeric field


We've learned how to deal with textual content using a StringField and TextField in Lucene, so now let's take a look at how numerals are handled. Lucene provides four Field classes for storing numeric values: IntField, FloatField, LongField, and DoubleField, analogous to the corresponding Java numeric types. Lucene, being a text search engine, treats numerals as terms internally and indexes them in a trie structure (also called an ordered tree data structure), as illustrated in the following:

Each term is logically assigned to larger and larger predefined lower-precision brackets. For example, let's assume that each bracket is derived by dividing the values at the level below it by ten, as in the preceding diagram. So, under the 1 bracket (at the top level), we get the DocIds associated with values in the 100s range; under the 12 bracket, we get those associated with values in the 120s range, and so on. Now, let's say you want to search by numeric range for all documents with the...
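
As a hedged sketch of how this plays out in practice (Lucene 4.x API; the price field and its values are illustrative), an IntField is indexed like any other field, and a NumericRangeQuery can then match whole trie brackets instead of enumerating every individual value:

    Document document = new Document();
    document.add(new IntField("price", 125, Field.Store.YES));
    indexWriter.addDocument(document);
    indexWriter.commit();

    // Matches every document whose price falls within [100, 200];
    // the two booleans make both endpoints inclusive
    Query query = NumericRangeQuery.newIntRange("price", 100, 200, true, true);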

Creating a DocValue field


Similar to a stored field, a DocValue is part of a document. It's also created at indexing time and contains values that are specific to a document. The major difference between the two concerns their underlying storage structure. A field's storage is row-oriented, whereas DocValue storage is column-oriented. At retrieval time, all field values for a document are returned at once, so loading the relevant information about a document is very fast. However, if you need to scan a single field across documents for any other purpose, it will be a slow process, as you will have to iterate through all the documents and load each document's fields on every iteration. A DocValue is stored by column as a DocId-to-value mapping, so loading the values of a specific DocValue for all documents at once can be done quickly, as Lucene only has to scan through one column rather than iterating through each document to load a field. In summary, the field and DocValue both contain information about a document...
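
A brief sketch of both sides of this trade-off (Lucene 4.x API; the ranking field name is illustrative, and atomicReader stands in for an AtomicReader obtained from the index):

    // Indexing side: a DocValue is added alongside regular fields
    Document document = new Document();
    document.add(new StringField("title", "Humpty Dumpty", Field.Store.YES));
    document.add(new NumericDocValuesField("ranking", 10L));
    indexWriter.addDocument(document);
    indexWriter.commit();

    // Retrieval side: one column-wise structure holds the value for
    // every document in the segment, keyed by DocId
    NumericDocValues ranking = atomicReader.getNumericDocValues("ranking");
    long value = ranking.get(docId);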

Transactional commits and index versioning


In the world of data management platforms, anything that supports transactional commits would implement ACID (Atomicity, Consistency, Isolation, Durability). ACID is a set of properties that guarantees that transactions are processed reliably. So, how does Lucene measure against ACID?

  • Atomicity: This property requires that each transaction is all or nothing. When a transaction fails, none of the partial changes performed by the transaction should persist or be visible. Changes from a transaction should only persist and be made visible when the transaction completes and is committed. Lucene's IndexWriter supports transactional commits: changes to the index are only made visible to an IndexReader after we call commit() (see the sketch after this list). If an IndexWriter crashes for whatever reason or never calls commit(), the partial changes will never be visible to the IndexReader.

  • Consistency: This property ensures that any committed changes will bring the system from one valid...
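
A minimal sketch of the atomicity guarantee described above (assuming the enclosing method declares throws IOException, and that document and indexWriter already exist):

    try {
        indexWriter.addDocument(document);
        indexWriter.commit();   // everything since the last commit becomes visible
    } catch (IOException e) {
        // Discards all uncommitted changes; note that rollback()
        // also closes the IndexWriter
        indexWriter.rollback();
    }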

Reusing field and document objects per thread


Performance has always been a main focus of Lucene's development team, and we have all benefited from their commitment to efficiency. To properly leverage Lucene's speed, there are best practices we should adhere to so that we don't introduce unnecessary inefficiency. One of these best practices is to reuse both Document and field objects. This minimizes the object creation cost during any massive data import operation and reduces the chance of triggering garbage collection.

There are a couple of things to keep in mind when reusing a Document object: we need to clear out all the fields before putting in the new values, whereas for a field, we can simply overwrite the value.

How to do it...

Here is a sample code snippet on Document and field reuse:

    Analyzer analyzer = new StandardAnalyzer();
    Directory directory = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    IndexWriter indexWriter = new IndexWriter(directory, config);
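
What follows is a hedged sketch of the reuse pattern this recipe describes (Lucene 4.x API; contents stands in for a collection of strings to be indexed):

    // Create the Document and Field once, outside the loop
    TextField textField = new TextField("text", "", Field.Store.YES);
    Document document = new Document();
    document.add(textField);
    for (String content : contents) {
        // Overwrite the field's value in place instead of allocating
        // a new Field on every iteration
        textField.setStringValue(content);
        indexWriter.addDocument(document);
    }
    indexWriter.commit();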

Delving into field norms


A norm is part of the calculation of a score that's used to measure relevancy. When we search, a score is calculated for each matching result. This score will then be used to sort the end results. The score is what we refer to as a relevancy score.

Norms are calculated per indexed field, as the product of an index-time calculation (based on TFIDFSimilarity) and lengthNorm (a calculated factor that favors shorter documents). A higher value can help boost the relevancy of a document, which means the document will rank higher in search results.

To further influence result relevancy, Lucene allows for two types of boosting: index-time boost and query-time boost. An index-time boost is set per indexed field and can be used to promote documents based on certain field values. A query-time boost can be set per query clause so that the scores of all documents matched by it are multiplied by the boost. It's useful when a certain clause should take precedence over everything else...
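
A short sketch of both kinds of boost (Lucene 4.x API; the field and term names are illustrative):

    // Index-time boost: set on a field instance before its document is
    // added; it is folded into that field's norm
    TextField title = new TextField("title", "Humpty Dumpty", Field.Store.YES);
    title.setBoost(2.0f);

    // Query-time boost: set on an individual query clause; scores of
    // documents matched by this clause are multiplied by the boost
    Query clause = new TermQuery(new Term("title", "humpty"));
    clause.setBoost(3.0f);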

Changing similarity implementation used during indexing


Part of the norms calculation at index time is similarity. Lucene ships with a sophisticated model called TFIDFSimilarity as the default calculation for norms; you can read more about it on Lucene's website. In this section, we will talk about how we can tune similarity to suit our needs.

We will go through a scenario similar to the one used in our norms example. Instead of using a boost to influence relevancy, we will leverage a NumericDocValuesField called ranking that will act as our boost. We will show you how to pull NumericDocValues at query time within a Similarity class and how to use it to influence the score. This exercise will give you an idea of what you can do with similarity customization.

Getting ready

To start writing your own Similarity class, you can begin by extending Similarity. Then, you can register your new class simply by calling IndexWriterConfig.setSimilarity(Similarity) at indexing time and IndexSearcher...
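
As a minimal sketch of what this looks like end to end (Lucene 4.x API; FlatLengthSimilarity is a hypothetical class that extends DefaultSimilarity rather than the bare Similarity base class, simply to keep the example short):

    // A hypothetical Similarity that removes the shorter-documents-score-
    // higher bias by ignoring field length in the norm calculation
    public class FlatLengthSimilarity extends DefaultSimilarity {
        @Override
        public float lengthNorm(FieldInvertState state) {
            return state.getBoost();   // the default also divides by sqrt(#terms)
        }
    }

    // Register it on both sides: at indexing time (norms are baked into
    // the index) and at search time (scoring must use the same model)
    IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
    config.setSimilarity(new FlatLengthSimilarity());
    indexSearcher.setSimilarity(new FlatLengthSimilarity());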
