A Quick Overview of Lucene
Included with OpenCms is a distribution of the Lucene search engine. Lucene is an open source, high-performance text search engine that is both easy to use and full-featured. Lucene is not a product. It is a Java library providing data indexing, and search and retrieval support. OpenCms integrates with Lucene to provide these features for its VFS content.
Though Lucene is simple to use, it is highly flexible and has many options. We will not go into the full details of all the options here, but will provide a basic overview, which will help us in developing our search code. A full understanding of Lucene is not required for completing this article, but interested readers can find more information at the Lucene website: http://jakarta.apache.org/lucene. There are also several excellent books available, which can easily be found with a web search.
For any data to be searched, it must first be indexed. Lucene supports both disk and memory based indexes, but OpenCms uses the more suitable disk based indexes. There are three basic concepts to understand regarding Lucene search indexes: Documents, Analyzers, and Fields.
- Document: A document is a collection of Lucene fields. A search index is made up of documents. Although each document is built from some actual source content, there is no need for the document to exactly resemble it. The fields in the document are indexed, optionally stored, and used to locate the document.
- Analyzer: An analyzer is responsible for breaking down source content into words (or terms) for indexing. An analyzer may take a very simple approach of only parsing content at whitespace breaks or a more complex approach by removing common words, identifying email and web addresses, and understanding abbreviations or other languages. Though Lucene provides many optional analyzers, the default one used by OpenCms is usually the best choice. For more advanced search applications, the other analyzers should be looked at in more depth.
- Field: A field consists of data that can be stored, indexed, or queried. Field values are searched when a query is made to the index. There are two characteristics of a field that determine how it gets treated when indexed:
- Field Storage: The storage characteristic of a field determines whether or not the field data value gets stored into the index. It is not necessary to store field data if the value is unimportant and is used only to help locate a document. On the other hand, field data should be stored if the value needs to be returned with the search result.
- Field Indexing: This characteristic determines whether a field will get indexed, and if so, how. Fields that will never be used as search terms do not need to be indexed. This is useful if we need to return a field value with the search result but will never search for the document using that field. Searchable fields, however, may be indexed in either a tokenized or an un-tokenized fashion. If a field is tokenized, then it will first be run through an analyzer, and each term generated by the analyzer will be indexed for the field. If it is un-tokenized, then the field's value is indexed verbatim. In this case, the term must be searched for using an exact match of its value, including the case.
Lucene also provides the ability to define a boost value for a field. This affects the relevance of the field when it is used in a search. A value other than the default value of 1.0 may be used to raise or lower the relevance.
These are the important concepts to be understood while creating a Lucene search index. After an index has been created, documents may be searched through queries.
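The difference between tokenized and un-tokenized indexing can be illustrated with a toy inverted index. The sketch below is plain Java and deliberately does not use Lucene's API. The class ToyIndex, its whitespace-and-lowercase analyzer, and the field names are assumptions made purely for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Toy model of a search index; NOT Lucene's API. All names are hypothetical.
class ToyIndex {
    // "field:term" -> ids of the documents that contain the term
    private final Map<String, Set<Integer>> postings = new HashMap<>();

    // A very simple analyzer: split at whitespace and lowercase each term.
    static String[] analyze(String text) {
        return text.trim().toLowerCase(Locale.ROOT).split("\\s+");
    }

    // Tokenized: every term produced by the analyzer is indexed.
    void addTokenized(int docId, String field, String value) {
        for (String term : analyze(value)) {
            add(docId, field, term);
        }
    }

    // Un-tokenized: the whole value is indexed verbatim as a single term.
    void addUntokenized(int docId, String field, String value) {
        add(docId, field, value);
    }

    private void add(int docId, String field, String term) {
        postings.computeIfAbsent(field + ":" + term, k -> new HashSet<>()).add(docId);
    }

    // Exact term lookup within one field.
    Set<Integer> lookup(String field, String term) {
        return postings.getOrDefault(field + ":" + term, new HashSet<>());
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        index.addTokenized(1, "title", "Hello World");
        index.addUntokenized(1, "title-key", "Hello World");

        System.out.println(index.lookup("title", "hello"));           // matches document 1
        System.out.println(index.lookup("title-key", "hello"));       // no match
        System.out.println(index.lookup("title-key", "Hello World")); // matches document 1
    }
}
```

The un-tokenized field only matches the complete, case-sensitive value, which is why such fields suit keys and sort values rather than free-text search.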
Querying Lucene search indexes is supported through a Java API and a search querying language. Search queries are made up of terms and operators. A term can be a simple word such as "hello" or a phrase such as "hello world". Operators are used to form logical expressions with terms, such as AND or NOT. With the Java API, terms can be built and aggregated together along with operators to form a query. When using the query language, a Java class is provided to parse the query and convert it into a format suitable for passing to the engine. In addition to these search features, there are more advanced operations that may be performed such as fuzzy searches, range searches, and proximity searches.
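To make the role of operators concrete, here is a minimal sketch of how AND and NOT combine the sets of documents that match individual terms. It is plain Java rather than Lucene's query API, and the postings data and the class name ToyBooleanQuery are hypothetical.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy evaluation of boolean operators over sets of document ids; NOT Lucene's API.
class ToyBooleanQuery {
    // Documents that contain both terms.
    static Set<Integer> and(Set<Integer> left, Set<Integer> right) {
        Set<Integer> result = new TreeSet<>(left);
        result.retainAll(right);
        return result;
    }

    // Documents in 'left' that are not in 'excluded'.
    static Set<Integer> not(Set<Integer> left, Set<Integer> excluded) {
        Set<Integer> result = new TreeSet<>(left);
        result.removeAll(excluded);
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical postings: term -> documents that contain it.
        Map<String, Set<Integer>> postings = Map.of(
            "hello", new TreeSet<>(Arrays.asList(1, 2, 3)),
            "world", new TreeSet<>(Arrays.asList(2, 3, 4)),
            "draft", new TreeSet<>(Arrays.asList(3)));

        // Equivalent of the expression: hello AND world NOT draft
        Set<Integer> hits = not(and(postings.get("hello"), postings.get("world")),
                                postings.get("draft"));
        System.out.println(hits); // prints [2]
    }
}
```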
All these options and flexibility allow Lucene to be used in an application in many ways. OpenCms does a good job of using these options to provide search capabilities for a wide range of content types. Next, we will look at how OpenCms interfaces with Lucene to provide this support.
Configuring OpenCms Search
OpenCms maintains search settings in the opencms-search.xml configuration file located in the WEB-INF/config directory. Prior to OpenCms 7, most of the settings in this configuration file needed to be made by hand. With OpenCms 7, the Search Management tool in the Administration View has been improved to cover most of the settings. We will first go over the settings that are controlled through the Search Management view, and will then visit the settings that must still be changed by hand. The first thing we'll do is define our own search index for the blog content. Creating a new search index is simple with the Administration tool. We access it by clicking on the Search Management icon of the Administration View, and then clicking on the New Index icon:
The Name field contains the name of the index file. This name can also be passed to a Java API. If the content differs between the online and offline areas, we can create an index for each one. For now, we will start with the offline index. We'll name it: Blogs – Offline. The other fields are:
- Rebuild mode: This determines if the index is to be built manually or automatically as content changes. We want automatic updating and will hence choose auto.
- Locale: We must select a locale for the content. OpenCms will extract the content for the given locale when it builds our index. If we were supporting more than one locale, then it would be a good idea to include the locale in the index name.
- Project: This selects content from either the Online or Offline project.
- Field configuration: This selects a field configuration to be used for the index.
We do not have our own field configuration yet; so for now press OK to save the index. Next, we will define a field configuration for the blog content.
Once the index has been created, we may define the fields we want it to contain by creating a field configuration. The View field configuration icon shows existing field configurations and allows creation of new ones. The fields in the configuration relate to the fields that will get created in the search index. Each field has the following settings:
- Name: The name of the field.
- Index: True to index the field, untokenized to index the field without running it through an analyzer, and false to not index the field.
- Store: Checked to have the value of the field stored.
- Excerpt: This is an OpenCms setting, which will truncate the field value before storing it. This allows a search to present a synopsis view of the results, without having to actually retrieve the content from the original resource. Although the UI allows this option to be selected while the Store option is left unchecked, this combination does not make much sense, because an excerpt is only useful if the field value is stored.
- Display: This is the display name used for the field.
- Boost: This is an optional value that can be provided to adjust the boost.
- Default: This is a default value that may be provided for a field value, in case the source field value does not exist.
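The effect of the Excerpt setting can be pictured as a simple truncation step applied before the value is stored. The following is only a sketch under that assumption: the helper excerpt and the choice to cut at a word boundary are hypothetical, and the actual OpenCms truncation logic may differ.

```java
// Illustrative only; not the OpenCms implementation.
class ExcerptSketch {
    // Truncates 'value' to at most maxLength characters. If the cut would fall
    // inside a word, the partial word is dropped when an earlier space exists.
    static String excerpt(String value, int maxLength) {
        if (value.length() <= maxLength) {
            return value;
        }
        String cut = value.substring(0, maxLength);
        if (value.charAt(maxLength) == ' ') {
            return cut; // the limit falls exactly on a word boundary
        }
        int lastSpace = cut.lastIndexOf(' ');
        return lastSpace > 0 ? cut.substring(0, lastSpace) : cut;
    }

    public static void main(String[] args) {
        System.out.println(excerpt("The quick brown fox jumps", 12)); // prints The quick
    }
}
```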
In addition to these settings, each field also contains one or more mapping definitions which map the data value to a content field in a VFS resource. The mapping definition includes a mapping type, an optional parameter, and an optional default value. There are four types of mappings to choose from:
- content: This maps the field data value to the value of the VFS resource content. The value that is retrieved is based upon the resource type. Each resource type has a content extractor responsible for converting the resource content into a string value. For XML resource types, the content extractor concatenates all the data fields of the resource into a single string value. For this mapping, the Parameter field is not used.
- property: This maps the field data value to a VFS resource property value. If the property value is not found, then the value will be empty. The Parameter field must contain the name of the property value to be used.
- property-search: This is the same as the property mapping except when the property value is missing on the original resource. It then searches for the value on all the parent folders of the resource. The Parameter field must contain the name of the property value to be used.
- item: This selection is useful only for XML content. It maps the value to an individual field in the XML content. The XML field is specified in the Parameter field using the XPath notation format. For example, for an XML field element with the name of Abstract, the notation used would be Abstract. For a repeating element, multiple mappings may be added per field. For example, for the blog comments, we could map the first three comments using three mapping entries with the values: Comment[1], Comment[2], and Comment[3].
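The difference between the property and property-search mappings is a fallback lookup on the parent folders. The sketch below models that behavior in plain Java with an in-memory map standing in for the VFS. It does not use the OpenCms API, and the class name, method names, and paths are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model of the 'property' and 'property-search' mappings; NOT the OpenCms API.
class PropertyLookupSketch {
    // VFS path -> (property name -> value); a hypothetical stand-in for the VFS.
    private final Map<String, Map<String, String>> properties = new HashMap<>();

    void setProperty(String path, String name, String value) {
        properties.computeIfAbsent(path, k -> new HashMap<>()).put(name, value);
    }

    // 'property': read the value from the resource itself, or return an empty string.
    String property(String path, String name) {
        Map<String, String> own = properties.getOrDefault(path, Map.of());
        return own.getOrDefault(name, "");
    }

    // 'property-search': fall back to the parent folders when the value is missing.
    String propertySearch(String path, String name) {
        String current = path;
        while (current != null) {
            String value = property(current, name);
            if (!value.isEmpty()) {
                return value;
            }
            current = parentOf(current);
        }
        return "";
    }

    // Parent folder of a path such as "/blogs/2008/post.html", or null at the root.
    static String parentOf(String path) {
        if (path.isEmpty() || path.equals("/")) {
            return null;
        }
        String trimmed = path.endsWith("/") ? path.substring(0, path.length() - 1) : path;
        int slash = trimmed.lastIndexOf('/');
        return slash < 0 ? null : trimmed.substring(0, slash + 1);
    }

    public static void main(String[] args) {
        PropertyLookupSketch vfs = new PropertyLookupSketch();
        vfs.setProperty("/blogs/", "Title", "Blog archive");

        // The resource has no Title of its own, so only the searching variant finds one.
        System.out.println("[" + vfs.property("/blogs/2008/post.html", "Title") + "]");
        System.out.println("[" + vfs.propertySearch("/blogs/2008/post.html", "Title") + "]");
    }
}
```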
OpenCms provides a general purpose default field configuration named standard. It defines fields that are generally suitable for indexing any OpenCms VFS resource. The standard configuration is defined as follows:
- Content: This field is both indexed and stored and contains the extracted text of a VFS resource. The actual extracted text contained in this field depends upon the resource type. For a Word or text document for example, the Content field will contain all the text contained within the resource. For OpenCms XML content, all the data fields in the resource are concatenated into this field as a single value. This has a nice effect of allowing all data fields of the XML content to be searched, using just this field, such as "content: 90124". However, querying XML content fields using operators, such as "address:main street" NOT "zip: 90124", is not possible.
- Description: This field is stored and indexed and contains the description of the resource obtained via the description property.
- Keywords: This field is stored and indexed, and contains the keywords of the resource obtained via the keywords property.
- Meta: This field is not stored, but is indexed. It indexes the value of the three content properties: Title, Keywords, and Description.
- Title: This field contains the title field value, and is indexed but not stored.
- Title-key: This field contains the title field value, which is indexed and stored verbatim, that is, without being tokenized by an analyzer.
We could use the standard field configuration for our search index. But we will design our own instead. This is simple to do and will provide us with more flexibility in the future, if necessary.
Creating a Field Configuration
Field configurations are created from within the Search Management view by first clicking on the View field configurations icon, and then clicking on the New field configuration icon:
After saving the entry and returning to the list of field configurations, click on the field configuration to get to the overview screen:
From here fields may be added. Our search index will contain four fields:
- blogtext: This field contains our blog text. Unlike the content field in the standard mapping, this field contains only the blog text, and none of the other XML field data. The field is indexed so that we can search for text in it. We will also store the field value, but in an excerpted format, so that we can quickly display a synopsis of the blog in the search result.
- category: This field contains the data from the category fields of our content. We create a separate search field so that we can provide the ability to search on category alone. If the category value were to be included within the content field as is the case with the standard field mapping, then we would have no assurance that a search for a category term would find matching text in the blog text.
- title: This field contains the indexed value of the title field. The field is indexed to allow it to be searched on. As indexed fields are analyzed, we can choose not to store the resulting value in the data field.
- title-key: This field is the inverse of the title field. It is not tokenized, but is instead indexed and stored verbatim to allow the search result to be sorted properly on the title.
After adding these fields the display should look like this:
Now we'll add the field mappings by clicking on the name column of the field, and using the Add new mapping action. Add the following mappings:
- blogtext: This field is mapped to the BlogText field and the type is item. The parameter field contains the XML field name for the blog content, which is BlogText.
- category: This field maps to the blog Category field and the type is item. The category field of our Blog XML content can have more than one value. We will therefore need to have multiple mappings for this field. Unfortunately, this is a problem because we don't know how many category fields there are going to be. For now, we will support searching for only the first five category field values. For this, we add five mappings of the type item. The parameter value of each mapping should reflect the corresponding XML field name: Category[1], Category[2], Category[3], Category[4], and Category[5].
- title: This field is mapped to a content property and is of the type property. The parameter field contains the name of the property, which is Title.
- title-key: This field uses the same mapping as the title field: the type is property and the parameter is Title.
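Because a repeating XML element needs one item mapping per occurrence, the parameter values follow a predictable pattern. The helper below sketches how such a list could be generated, assuming the XPath-style index notation in which positions start at 1 (for example Category[1]). The class and method names are hypothetical, and the values would still be entered through the Add new mapping action.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative helper; not part of OpenCms.
class ItemMappingParams {
    // Builds the Parameter values for the first 'count' occurrences of a
    // repeating XML element, using 1-based XPath-style indexes.
    static List<String> forRepeatingElement(String element, int count) {
        List<String> params = new ArrayList<>();
        for (int i = 1; i <= count; i++) {
            params.add(element + "[" + i + "]");
        }
        return params;
    }

    public static void main(String[] args) {
        // One mapping of type item would be added for each printed value.
        for (String param : forRepeatingElement("Category", 5)) {
            System.out.println(param);
        }
    }
}
```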
We can now go back to our index and apply this field configuration to it. Before we do that however, we will do one more thing, create an index source.
Creating an Index Source
An index source defines the locations of content within the VFS that are to be indexed. The first step in doing this is to click on the View index sources icon:
From here, we see that OpenCms has provided two index sources already:
- source1: This source includes all document types contained in the default site, which is located under /sites/default.
- source2: This source includes only the content located under /system/workplace/locales/, which is used to build the online help system.
Neither of these is suitable for our use, as each includes content outside of the Blog content repository. Let's add a new source for the blog content, by clicking on the New index source icon and naming it BlogContent:
The Indexer field contains the name of a Java class that implements the I_CmsIndexer interface. This class interfaces with Lucene and is responsible for managing the search index. For most cases, the default provided class is sufficient. However, if the content has more complex indexing requirements, the flexibility of OpenCms allows a custom indexer class to be used. We can use the default provided class for the blog content. Click OK to save the index source, and then use the Assign resources action to assign the blog repository path:
We've now defined the new source only to contain the resources within the Blog repository. Next, we need to define the types of documents that we want to include in the index source.
When the content is indexed, OpenCms iterates over the resources it finds in the given VFS path. Each resource type has a corresponding document type class responsible for converting the given resource into a Lucene document for indexing. OpenCms provides document type classes for all the default resource types. Since Blog content is XML, we will use the document type already available for XML content types. The document type class is org.opencms.search.documents.CmsDocumentXmlContent. No other document types are needed for the blog content source. To add the document type class, click on the plus icon of the Document types available in the Assign document types area:
After adding the document type, return to the Search Management screen where we will continue with the final step.
Now we can finally complete the search index. Go back to the Search Management screen and click on the Edit icon to set the field configuration. Select blogcontent for the Field configuration setting and save it. Then use the Assign index sources icon to remove the source1 source from the list and add the BlogContent source to complete the index definition.
That completes the creation of the Offline index, and we can build the index by clicking on the rebuild icon.
Additional Search Settings
There are a few additional search index settings which are not accessible via the Search Management interface. In most cases, it is not necessary to change these settings. But it is still useful to understand them. The additional settings may only be changed by editing the opencms-search.xml configuration file manually. Within this file are the following settings:
- directory: This setting controls the directory location of the index files. It is relative to the WEB-INF path of the OpenCms application. Each defined search index will have a subdirectory appearing under this location which matches the index name.
- timeout: During indexing, a thread is created for each resource to convert it into a Lucene document. This value specifies the amount of time, in milliseconds, that the indexer will wait for the thread to complete its task.
- forceunlock: This setting controls how indexing threads access the search index. The possible values are:
- always: Always attempts to unlock the index.
- never: Never unlocks the index; instead, waits for it to become unlocked.
- onlyfull (default): This setting behaves the same as always, but attempts to unlock a locked index only when a full rebuild of the index is performed.
- excerpt: This setting controls the size that the value of a field will be truncated to, if the excerpt setting in the field configuration is set to true.
- extractCacheMaxAge: When a resource is indexed, the document class responsible for converting it into a Lucene document must convert the content into plain text. As this is often time consuming, OpenCms caches the resulting extracted text on disk. This setting controls the lifetime of the extraction cache. For development purposes, if rebuilding indexes is done frequently, then it may be helpful to set this to a low value, so that the index is rebuilt with recent document content.
- highlighter: This specifies a class file that is used to provide support for highlighting the search terms found in the search result.
- documenttype: This is where the various Document type classes are configured. Each document type must specify a single resource type and can have one or more mime-types associated with it.
- analyzer: This setting defines the analyzer that is used when the text is indexed. The setting allows specification of a single analyzer per locale.
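The timeout setting describes a common pattern: hand the work to a separate thread and wait only a limited time for the result. The sketch below shows that pattern with the standard java.util.concurrent classes. It is not the OpenCms indexer code; the class name, the method extractWithTimeout, and the choice to return null for a resource that takes too long are assumptions for illustration.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative only; not the OpenCms indexer implementation.
class IndexTimeoutSketch {
    // Runs 'extraction' on a worker thread and waits at most timeoutMillis for it.
    // Returns null when the time limit is exceeded, so the caller can skip the resource.
    static String extractWithTimeout(Callable<String> extraction, long timeoutMillis) {
        ExecutorService worker = Executors.newSingleThreadExecutor();
        try {
            Future<String> result = worker.submit(extraction);
            try {
                return result.get(timeoutMillis, TimeUnit.MILLISECONDS);
            } catch (TimeoutException e) {
                result.cancel(true); // interrupt the slow extraction
                return null;
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            } catch (ExecutionException e) {
                throw new IllegalStateException(e.getCause());
            }
        } finally {
            worker.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // A fast extraction completes within the limit.
        System.out.println(extractWithTimeout(() -> "extracted text", 1000));

        // A slow extraction is abandoned once the limit is reached.
        System.out.println(extractWithTimeout(() -> {
            Thread.sleep(2000);
            return "too late";
        }, 100));
    }
}
```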
As with all OpenCms configuration file settings, making a change will require a server restart. It should also be noted that even though settings changed via the Search Management view will appear immediately, they will often require a server restart before taking effect.