Lucene.NET: Optimizing and merging index segments

by Michael Heydt | August 2013 | .NET Open Source

In this article, created by Michael Heydt, the author of Instant Lucene.NET, you will learn how to optimize an index when the index has grown too large or has become fragmented.

How to do it…

Index optimization is accomplished by calling the Optimize method on an instance of IndexWriter. The example for this recipe uses the Optimize method to clean up the storage of the index data on the physical disk. The general steps to optimize an index and merge its segments are the following:

  1. Create/open an index.
  2. Add or delete documents from the index.
  3. Examine the values returned by the MaxDoc and NumDocs methods of the IndexWriter class.
  4. If the index is deemed to be too dirty, call the Optimize method of the IndexWriter class.
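The recipe does not say how to decide that an index is "too dirty". One possible rule, sketched below, compares the share of deleted documents (derived from the MaxDoc and NumDocs values) against a threshold. The names IndexHygiene, DeletedRatio, and IsTooDirty, and the 10% threshold, are hypothetical choices for illustration and are not part of Lucene.NET.

```csharp
using System;

public static class IndexHygiene
{
    // Fraction of the index's document slots taken up by deleted documents.
    public static double DeletedRatio(int maxDoc, int numDocs)
    {
        if (maxDoc < 0 || numDocs < 0 || numDocs > maxDoc)
        {
            throw new ArgumentOutOfRangeException(
                "numDocs", "Expected 0 <= numDocs <= maxDoc.");
        }
        // An empty index has nothing to reclaim.
        return maxDoc == 0 ? 0.0 : (double)(maxDoc - numDocs) / maxDoc;
    }

    // Hypothetical policy: optimize once the deleted share reaches the threshold.
    public static bool IsTooDirty(int maxDoc, int numDocs, double threshold)
    {
        return DeletedRatio(maxDoc, numDocs) >= threshold;
    }

    public static void Main()
    {
        // One of five documents deleted: ratio 0.2, above a 10% threshold.
        Console.WriteLine(IsTooDirty(5, 4, 0.10)); // prints "True"
        // Nothing deleted: ratio 0.0.
        Console.WriteLine(IsTooDirty(5, 5, 0.10)); // prints "False"
    }
}
```

In a real application the two integers would come from the IndexWriter, and the threshold would be tuned to how costly a full optimization is for the index in question.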

The following example for this recipe demonstrates taking these steps to create, modify, and then optimize an index.

namespace Lucene.NET.HowTo._12_MergeAndOptimize {
    // ...
    // build facade and an initial index of 5 documents
    var facade = new LuceneDotNetHowToExamplesFacade()
        .buildLexicographicalExampleIndex(maxDocs: 5)
        .createIndexWriter();

    // report MaxDoc and NumDocs
    Trace.WriteLine(
        string.Format("MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(
        string.Format("NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // delete one document
    facade.IndexWriter.DeleteDocuments(
        new Term("filename", "0.txt"));
    facade.IndexWriter.Commit();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After delete / commit");
    Trace.WriteLine(string.Format(
        "MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format(
        "NumDocs=={0}", facade.IndexWriter.NumDocs()));

    // optimize the index
    facade.IndexWriter.Optimize();

    // report MaxDoc and NumDocs
    Trace.WriteLine("After Optimize");
    Trace.WriteLine(string.Format(
        "MaxDoc=={0}", facade.IndexWriter.MaxDoc()));
    Trace.WriteLine(string.Format(
        "NumDocs=={0}", facade.IndexWriter.NumDocs()));
    Trace.Flush();
    // ...
}
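
The example above relies on the book's LuceneDotNetHowToExamplesFacade helper, which is not shown in this article. A facade-free version might look like the following sketch. It assumes the Lucene.NET 3.0.3 API (where MaxDoc and NumDocs are methods on IndexWriter), an in-memory RAMDirectory, and documents that carry only a filename field. The class name OptimizeSketch and the way the five documents are built are illustrative stand-ins, not the book's code.

```csharp
using System;
using System.Collections.Generic;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

public static class OptimizeSketch
{
    // Runs the create / delete / optimize steps and returns one report per step.
    public static List<string> Run()
    {
        var reports = new List<string>();
        // In-memory index, so the sketch leaves nothing on disk.
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        using (var writer = new IndexWriter(
            directory, analyzer, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            // Assumed stand-in for buildLexicographicalExampleIndex(maxDocs: 5).
            for (var i = 0; i < 5; i++)
            {
                var doc = new Document();
                doc.Add(new Field("filename", i + ".txt",
                    Field.Store.YES, Field.Index.NOT_ANALYZED));
                writer.AddDocument(doc);
            }
            writer.Commit();
            reports.Add(Describe("Initial", writer));

            // Delete by exact term match on the un-analyzed filename field.
            writer.DeleteDocuments(new Term("filename", "0.txt"));
            writer.Commit();
            reports.Add(Describe("After delete / commit", writer));

            // Merge segments and drop the space held by deleted documents.
            writer.Optimize();
            reports.Add(Describe("After Optimize", writer));
        }
        return reports;
    }

    private static string Describe(string label, IndexWriter writer)
    {
        return string.Format("{0}: MaxDoc=={1}, NumDocs=={2}",
            label, writer.MaxDoc(), writer.NumDocs());
    }

    public static void Main()
    {
        foreach (var line in Run())
        {
            Console.WriteLine(line);
        }
    }
}
```

The filename field is indexed with Field.Index.NOT_ANALYZED so that the Term passed to DeleteDocuments matches the stored value exactly; an analyzed field could be tokenized differently and the delete might then match nothing.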

How it works…

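When this program is run, it writes the MaxDoc and NumDocs values to the trace output three times: once after the index is built, once after the delete and commit, and once after the call to Optimize.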

This program first creates an index with five documents. It then reports the values returned by the MaxDoc and NumDocs methods of the IndexWriter instance. MaxDoc is the total number of documents held in the index, counting documents that have been deleted but whose space has not yet been reclaimed by a merge. NumDocs is the number of live documents, that is, the total with deletions taken into account. At this point these values are 5 and 5, respectively.

The next step deletes the single document whose filename is 0.txt from the index and commits the change to disk. MaxDoc and NumDocs are reported again and are now 5 and 4, respectively. This makes sense: one document has been deleted, so there is now "slop" in the index where space is still taken up by the deleted document. The reference to the document's index information has been removed, but the space is still used on the disk.

The final two steps call Optimize and report the MaxDoc and NumDocs values one last time. These are now 4 and 4, respectively, because Lucene.NET has merged the index segments and reclaimed the disk space formerly used by the deleted document's index information.

Summary

A Lucene.NET index physically contains one or more segments, each of which is its own index and holds a subset of the overall indexed content. As documents are added to the index, new segments are created as index writers flush buffered content into the index's directory and file structure. Over time this fragmentation can slow searches, and a merge/optimization is then needed to regain performance.
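
To make the segment picture concrete, here is a toy model of what a merge does to the document counts. It does not use Lucene.NET at all: ToySegment and ToyIndex are hypothetical types invented for this illustration, and the real on-disk format and merge policy are far more involved.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy stand-in for an index segment: a list of documents plus deletion marks.
public sealed class ToySegment
{
    public readonly List<string> Docs = new List<string>();
    public readonly HashSet<int> Deleted = new HashSet<int>();

    // All document slots, including those marked deleted.
    public int MaxDoc { get { return Docs.Count; } }

    // Only the documents that are still live.
    public int NumDocs { get { return Docs.Count - Deleted.Count; } }
}

public static class ToyIndex
{
    // Merge every segment into one, copying only live documents.
    public static ToySegment Merge(IEnumerable<ToySegment> segments)
    {
        var merged = new ToySegment();
        foreach (var segment in segments)
        {
            for (var i = 0; i < segment.Docs.Count; i++)
            {
                if (!segment.Deleted.Contains(i))
                {
                    merged.Docs.Add(segment.Docs[i]);
                }
            }
        }
        return merged;
    }

    public static void Main()
    {
        var first = new ToySegment();
        first.Docs.AddRange(new[] { "0.txt", "1.txt", "2.txt" });
        var second = new ToySegment();
        second.Docs.AddRange(new[] { "3.txt", "4.txt" });

        // Mark 0.txt as deleted; its slot is still occupied.
        first.Deleted.Add(0);

        var segments = new List<ToySegment> { first, second };
        Console.WriteLine("Before merge: segments={0}, MaxDoc={1}, NumDocs={2}",
            segments.Count,
            segments.Sum(s => s.MaxDoc),
            segments.Sum(s => s.NumDocs));

        var merged = Merge(segments);
        Console.WriteLine("After merge: segments=1, MaxDoc={0}, NumDocs={1}",
            merged.MaxDoc, merged.NumDocs);
    }
}
```

In this model the two segments hold five slots and four live documents before the merge, and the single merged segment holds four of each afterwards, which mirrors the 5/4 and 4/4 values reported in the recipe.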

About the Author

Michael Heydt

Michael Heydt is a Principal .NET Evangelist with SunGard Global Services, where he currently leads their capital markets advanced technology (CMAT) user experience practice, focusing on building high-performance trading desktops utilizing (but not limited to) .NET technologies. He has a particular interest in parallel and concurrent systems, rich data visualization, natural user interfaces, distributed cloud systems, big-data/search applications, and development operations. Michael has nearly 30 years of software development experience and is a frequent speaker at .NET user groups and global technology conferences, as well as a writer of technology papers and books.
