Lucene.NET: Optimizing and merging index segments

Instant Lucene.NET


July 2013

Learn how to index and search through unstructured data using Lucene.NET

How to do it…

Index optimization is accomplished by calling the Optimize method on an instance of IndexWriter. The example for this recipe uses Optimize to clean up how the index data is stored on the physical disk. The general steps to optimize and merge index segments are as follows:

  1. Create/open an index.
  2. Add or delete documents from the index.
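  3. Examine the values returned by the MaxDoc and NumDocs methods of the IndexWriter class.
  4. If the index is deemed to be too dirty (that is, MaxDoc is noticeably larger than NumDocs because of deleted documents), call the Optimize method of the IndexWriter class.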
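
Step 4 leaves "too dirty" undefined. One way to make it concrete is to compare the two counts as a ratio. The following is a minimal sketch, not part of Lucene.NET: the IndexDirtiness class, its method names, and the 10 percent default threshold are all assumptions chosen for illustration.

```csharp
using System;

// Hypothetical helper (not a Lucene.NET type): decides whether an index is
// "dirty" enough to optimize, based on the share of deleted documents.
public static class IndexDirtiness
{
    // Fraction of document slots that belong to deleted documents.
    // Returns 0.0 for an empty index.
    public static double DeletedRatio(int maxDoc, int numDocs)
    {
        if (maxDoc < 0 || numDocs < 0 || numDocs > maxDoc)
        {
            throw new ArgumentOutOfRangeException(
                "numDocs", "Expected 0 <= numDocs <= maxDoc.");
        }
        return maxDoc == 0 ? 0.0 : (double)(maxDoc - numDocs) / maxDoc;
    }

    // The 0.10 default is an arbitrary illustrative threshold, not a
    // recommendation from the Lucene.NET documentation.
    public static bool ShouldOptimize(
        int maxDoc, int numDocs, double threshold = 0.10)
    {
        return DeletedRatio(maxDoc, numDocs) >= threshold;
    }
}
```

With an IndexWriter named writer, the check could be written as IndexDirtiness.ShouldOptimize(writer.MaxDoc(), writer.NumDocs()). Optimizing rewrites index data, so a real application would probably tune the threshold against how often it can afford that cost.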

The following example for this recipe demonstrates taking these steps to create, modify, and then optimize an index.

namespace Lucene.NET.HowTo._12_MergeAndOptimize {
    // ...
    // build facade and an initial index of 5 documents
    var facade = new LuceneDotNetHowToExamplesFacade()
        .buildLexicographicalExampleIndex(maxDocs: 5)
        .createIndexWriter();

    // report MaxDoc and NumDocs
    Trace.WriteLine(
        string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(
        string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // delete one document
    facade.IndexWriter.DeleteDocuments(
        new Term("filename", "0.txt"));
    facade.IndexWriter.Commit();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After delete / commit");
    Trace.WriteLine(
        string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(
        string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // optimize the index
    facade.IndexWriter.Optimize();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After Optimize");
    Trace.WriteLine(
        string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(
        string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));
    Trace.Flush();
    // ...
}

How it works…

When this program is run, it writes the MaxDoc and NumDocs values to the trace output at three points: after the index is built, after the delete and commit, and after the call to Optimize.

This program first creates an index with five documents. It then reports the values returned by the MaxDoc and NumDocs methods of the IndexWriter instance. MaxDoc is the total number of documents held in the index segments, including documents that have been deleted but not yet purged; it is a count, not a capacity limit. NumDocs is the number of live (non-deleted) documents currently in the index. At this point these values are 5 and 5, respectively.

The next step deletes a single document named 0.txt from the index, and the changes are committed to disk. MaxDoc and NumDocs are reported again and are now 5 and 4, respectively. This makes sense: one document has been deleted, so there is now "slop" in the index where space is still taken up by it. The document has been marked as deleted, but its index data still occupies space on the disk.

The final two steps are to call Optimize and to report the MaxDoc and NumDocs values one last time. These are now 4 and 4, respectively, because Lucene.NET has merged the index segments and reclaimed the disk space formerly used by the deleted document's index information.
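
The recipe's code depends on the book's LuceneDotNetHowToExamplesFacade helper, which is not shown here. The sketch below walks through the same sequence without the facade. It assumes the Lucene.Net 3.0.3 API (where IndexWriter exposes Optimize, MaxDoc, and NumDocs as methods) and an in-memory RAMDirectory; the OptimizeSketch class, its Run method, and the way the five documents are built are assumptions, not the book's implementation.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;
using LuceneVersion = Lucene.Net.Util.Version;

// Hypothetical stand-in for the book's facade, assuming Lucene.Net 3.0.3.
public static class OptimizeSketch
{
    // Returns six readings in order: MaxDoc, NumDocs after the build,
    // after delete + commit, and after Optimize.
    public static int[] Run()
    {
        var analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_30);
        using (var directory = new RAMDirectory())
        using (var writer = new IndexWriter(
            directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // Build an initial index of five documents: 0.txt .. 4.txt.
            for (int i = 0; i < 5; i++)
            {
                var doc = new Document();
                // NOT_ANALYZED keeps the file name as a single exact term,
                // so the Term-based delete below can match it.
                doc.Add(new Field("filename", i + ".txt",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(doc);
            }
            writer.Commit();
            int maxDoc0 = writer.MaxDoc(), numDocs0 = writer.NumDocs();

            // Delete one document and commit the change.
            writer.DeleteDocuments(new Term("filename", "0.txt"));
            writer.Commit();
            int maxDoc1 = writer.MaxDoc(), numDocs1 = writer.NumDocs();

            // Merge segments and drop the deleted document's data.
            writer.Optimize();
            int maxDoc2 = writer.MaxDoc(), numDocs2 = writer.NumDocs();

            return new[] { maxDoc0, numDocs0, maxDoc1, numDocs1,
                           maxDoc2, numDocs2 };
        }
    }
}
```

If the sketch behaves like the recipe's program, the readings should match the values reported above: 5 and 5, then 5 and 4, then 4 and 4. Later Lucene.NET releases changed this part of the API, so the calls may need adjusting for other versions.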

Summary

A Lucene.NET index physically contains one or more segments, each of which is its own index and holds a subset of the overall indexed content. As documents are added to the index, new segments are created as index writers flush buffered content into the index's directory and file structure. Over time, this fragmentation causes searches to slow down, and a merge/optimization must be performed to regain performance.
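
To make the segment picture concrete, here is a deliberately simplified toy model written in plain C#. It is not how Lucene.NET is implemented and uses no Lucene.NET types; the ToyIndex class and all of its members are invented for illustration. It only mimics three ideas from this recipe: flushing creates a new segment, deleting marks a document without reclaiming its space, and merging rewrites the live documents into a single segment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy model only: illustrates segments, deletions, and merging.
public sealed class ToyIndex
{
    private sealed class Slot
    {
        public string Name;
        public bool Deleted;
    }

    private readonly List<List<Slot>> segments = new List<List<Slot>>();
    private List<Slot> buffer = new List<Slot>();

    // New documents are buffered in memory first.
    public void Add(string name)
    {
        buffer.Add(new Slot { Name = name });
    }

    // Flushing buffered documents writes them out as a new segment.
    public void Flush()
    {
        if (buffer.Count == 0) return;
        segments.Add(buffer);
        buffer = new List<Slot>();
    }

    // Deleting only marks the slot; the slot still takes up space.
    public void Delete(string name)
    {
        foreach (var slot in AllSlots())
        {
            if (slot.Name == name) slot.Deleted = true;
        }
    }

    // Merging copies the live documents into one fresh segment and
    // discards the slots of deleted documents.
    public void Merge()
    {
        Flush();
        var live = AllSlots().Where(s => !s.Deleted).ToList();
        segments.Clear();
        if (live.Count > 0) segments.Add(live);
    }

    public int SegmentCount { get { return segments.Count; } }

    // All slots, deleted or not.
    public int MaxDoc { get { return AllSlots().Count(); } }

    // Live documents only.
    public int NumDocs { get { return AllSlots().Count(s => !s.Deleted); } }

    private IEnumerable<Slot> AllSlots()
    {
        return segments.SelectMany(s => s).Concat(buffer);
    }
}
```

Adding three documents, flushing, adding two more, and flushing again gives two segments with MaxDoc and NumDocs both at 5. Deleting one document leaves MaxDoc at 5 while NumDocs drops to 4, and calling Merge brings both to 4 in a single segment, which mirrors the numbers the recipe reports.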
