You're reading from Lucene 4 Cookbook
So far, we have explored Lucene's core functionality and learned how to customize its core components. However, we haven't yet looked at any of the extension libraries that go beyond the core features. In this chapter, we will explore Lucene's add-on features, demonstrate how they work, and show how to use them effectively. We will cover the following features: spatial search, joins, faceting, grouping, autosuggest, and highlighting.
Spatial search provides the ability to search by location data; it is also called geospatial search. We usually use this type of search to look for things within a certain distance of a location. It's very useful in map applications, where most searches are for nearby places. Lucene provides this feature by incorporating another open source project called Spatial4j, which offers utilities for shape and distance calculation. Lucene uses these utilities to generate indexable fields and to perform query calculations.
When dealing with spatial search, there are several considerations. Are we indexing points or shapes? What kind of shapes are we indexing? Can a document contain more than one point/shape? What kind of query is supported? There is no one solution that fits all in spatial search.
Lucene provides four built-in spatial strategies to help handle various types of spatial search requirements:
BBoxStrategy: This strategy permits indexing and searching...
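To make the idea of "searching within a certain proximity" concrete, here is a minimal plain-Java sketch of a radius filter over points. It is only an illustration of the concept: it does not use Lucene's spatial module or Spatial4j, and the class name `ProximitySketch`, its methods, and the mean Earth radius constant are assumptions introduced for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: plain Java, not Lucene's spatial module or Spatial4j.
class ProximitySketch {
    static final double EARTH_RADIUS_KM = 6371.0; // assumed mean Earth radius

    // Great-circle distance between two lat/lon points (haversine formula).
    static double distanceKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        // Clamp to 1.0 to guard against floating-point rounding.
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.min(1.0, Math.sqrt(a)));
    }

    // Returns the names of all places within radiusKm of the query point.
    static List<String> within(Map<String, double[]> places,
                               double lat, double lon, double radiusKm) {
        List<String> hits = new ArrayList<>();
        for (Map.Entry<String, double[]> e : places.entrySet()) {
            double[] p = e.getValue(); // p[0] = latitude, p[1] = longitude
            if (distanceKm(lat, lon, p[0], p[1]) <= radiusKm) {
                hits.add(e.getKey());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        Map<String, double[]> places = new LinkedHashMap<>();
        places.put("near", new double[] {0.0, 0.1});
        places.put("far", new double[] {0.0, 5.0});
        System.out.println(within(places, 0.0, 0.0, 50.0));
    }
}
```

This sketch scans every point on each query, which is exactly the cost a spatial index is meant to avoid.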
Join is a relational database concept. Data is typically normalized and stored in separate tables for storage efficiency and data integrity, and the tables are then joined to provide a coherent view of the data. Lucene has no concept of tables because all records are supposed to be flattened and stored as documents. Even setting up a schema in advance is optional. In a document-based store such as Lucene, joins can seem like an afterthought. However, that doesn't mean you can't do joins at all in Lucene. There are several techniques to simulate joins, such as adding a document type field to identify different types (tables) of records and then manually combining data at runtime by issuing multiple search queries that retrieve data of different types.
Lucene offers two types of join methods. One is an index-time join, where documents are added in blocks that capture parent record and child record relationships. The other type is...
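The manual technique mentioned above (a document type field plus multiple queries combined at runtime) can be sketched in plain Java as follows. This is not Lucene's join module: documents are modeled as simple maps, and the class `JoinSketch`, the field names (`type`, `id`, `authorId`, `name`, `title`), and the sample data are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Conceptual sketch only: flat "documents" as maps, not Lucene's join module.
class JoinSketch {
    // A single "query": documents of one type whose field equals a value.
    static List<Map<String, String>> search(List<Map<String, String>> index,
                                            String type, String field, String value) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : index) {
            if (type.equals(doc.get("type")) && value.equals(doc.get(field))) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Simulated join: query 1 finds author ids, query 2 finds their books.
    static List<String> titlesByAuthorName(List<Map<String, String>> index, String name) {
        Set<String> authorIds = new HashSet<>();
        for (Map<String, String> author : search(index, "author", "name", name)) {
            authorIds.add(author.get("id"));
        }
        List<String> titles = new ArrayList<>();
        for (Map<String, String> doc : index) {
            if ("book".equals(doc.get("type")) && authorIds.contains(doc.get("authorId"))) {
                titles.add(doc.get("title"));
            }
        }
        return titles;
    }

    public static void main(String[] args) {
        List<Map<String, String>> index = List.of(
                Map.of("type", "author", "id", "a1", "name", "Jane Doe"),
                Map.of("type", "book", "authorId", "a1", "title", "First Book"));
        System.out.println(titlesByAuthorName(index, "Jane Doe"));
    }
}
```

The second step depends on the results of the first, which is why this style of join costs at least two round trips to the index.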
Faceting provides a way to drill down into data by category. It allows a user to refine a search by adding categories that progressively narrow the results. A useful feature of faceting is that you can preview the number of hits for each category value before selecting it. Faceted search can be used in conjunction with a text search, so users can combine a text search with a faceted search at any point. In many search applications where data can be systematically categorized, such as a product inventory with product categories, types, availability, and so on, faceting offers a convenient way for users to refine their searches. A notable benefit of faceting is that the user will never hit a zero-results page, because you can only drill down within the available category values.
Let's look at an example for a fictional online bookstore; we will categorize books by subject and author. In the following figure, note that we...
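As a rough illustration of the two operations described above (previewing hit counts per category value, then drilling down), here is a plain-Java sketch for a bookstore-like data set. It does not use Lucene's facet module; the class `FacetSketch`, the `subject` and `author` field names, and the sample books are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Conceptual sketch only: plain Java, not Lucene's facet module.
class FacetSketch {
    // Counts how many documents carry each value of the given dimension.
    static Map<String, Integer> counts(List<Map<String, String>> docs, String dimension) {
        Map<String, Integer> result = new TreeMap<>();
        for (Map<String, String> doc : docs) {
            String value = doc.get(dimension);
            if (value != null) {
                result.merge(value, 1, Integer::sum);
            }
        }
        return result;
    }

    // Drill-down: keeps only documents whose dimension has the chosen value.
    static List<Map<String, String>> drillDown(List<Map<String, String>> docs,
                                               String dimension, String value) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> doc : docs) {
            if (value.equals(doc.get(dimension))) {
                kept.add(doc);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> books = List.of(
                Map.of("subject", "Fiction", "author", "Jane Doe"),
                Map.of("subject", "Fiction", "author", "John Roe"),
                Map.of("subject", "History", "author", "Jane Doe"));
        System.out.println(counts(books, "subject"));
        System.out.println(counts(drillDown(books, "subject", "Fiction"), "author"));
    }
}
```

Because the drill-down values come from the counts of the current result set, every selectable value is guaranteed to match at least one document.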
The grouping feature in Lucene offers a way to group results together by a group field. For example, if you want to list books by category, it is convenient to have the resulting documents already grouped by that field. Another example is a web search engine where results are grouped by site, so that a site with an excessive number of matching pages does not overwhelm the top results page. The search engine can show the top matching page from each site, which gives more variety among the quality matches.
Lucene has two implementations for grouping: a single-pass search and a two-pass search. A single-pass search requires documents to be grouped at index time, whereas a two-pass search runs two searches to provide both group-level and document-level results.
To set up a single-pass search, we will add documents in document blocks by calling IndexWriter.addDocuments. We also need to add an end of the document marker, as a Field, to the last document in each document block...
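The web search example above (show only the top matching pages from each site) can be sketched in plain Java as follows. This illustrates the grouping concept only and is not Lucene's grouping module; the class `GroupingSketch`, the nested `Hit` type, and the sample sites are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Conceptual sketch only: plain Java, not Lucene's grouping module.
class GroupingSketch {
    // A scored search hit that belongs to a group (for example, a site).
    static final class Hit {
        final String group;
        final String id;
        final double score;

        Hit(String group, String id, double score) {
            this.group = group;
            this.id = id;
            this.score = score;
        }
    }

    // Keeps at most perGroup hits per group. Groups appear in order of
    // their best hit, and hits inside a group are ordered by score.
    static Map<String, List<Hit>> topPerGroup(List<Hit> hits, int perGroup) {
        List<Hit> sorted = new ArrayList<>(hits);
        sorted.sort((x, y) -> Double.compare(y.score, x.score)); // descending
        Map<String, List<Hit>> groups = new LinkedHashMap<>();
        for (Hit hit : sorted) {
            List<Hit> members = groups.computeIfAbsent(hit.group, k -> new ArrayList<>());
            if (members.size() < perGroup) {
                members.add(hit);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Hit> hits = List.of(
                new Hit("a.example", "a1", 0.9),
                new Hit("a.example", "a2", 0.8),
                new Hit("b.example", "b1", 0.5));
        topPerGroup(hits, 1).forEach((site, top) ->
                System.out.println(site + " -> " + top.get(0).id));
    }
}
```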
Lucene's suggest module offers a number of implementations that can be used to support a real-time autosuggest feature as a user types into a search box. You will find many tools in the suggest module that help facilitate ingesting data into the autosuggest index. In our demonstrations, we will be using LuceneDictionary to ingest data, as it provides the convenience of extracting tokens from a field in an existing index.
We will go over four suggester implementations in this section:
AnalyzingSuggester: This suggester analyzes input text and provides suggestions based on prefix matches.
AnalyzingInfixSuggester: This suggester builds on top of AnalyzingSuggester and provides suggestions based on prefix matches against any token in the indexed text.
FreeTextSuggester: This suggester is an n-gram implementation that produces suggestions based on matches to an n-gram index. An n-gram is a contiguous sequence of n items from a given text sequence. "n" represents...
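The difference between matching on the prefix of the whole entry and matching on the prefix of any token can be seen in a small plain-Java sketch. This does not reflect how Lucene's suggesters are implemented (they rely on analyzers and dedicated data structures); the class `SuggestSketch`, its methods, the whitespace tokenization, and the sample titles are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Conceptual sketch only: plain Java, not Lucene's suggest module.
class SuggestSketch {
    // Prefix matching: the entry as a whole must start with the typed text.
    static List<String> prefixMatches(List<String> entries, String typed) {
        String needle = typed.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String entry : entries) {
            if (entry.toLowerCase(Locale.ROOT).startsWith(needle)) {
                out.add(entry);
            }
        }
        return out;
    }

    // Infix matching: any whitespace-separated token may start with the typed text.
    static List<String> infixMatches(List<String> entries, String typed) {
        String needle = typed.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String entry : entries) {
            for (String token : entry.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(needle)) {
                    out.add(entry);
                    break; // one matching token is enough for this entry
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> titles = List.of("Lucene in Action", "Apache Lucene", "Learning Search");
        System.out.println(prefixMatches(titles, "luc"));
        System.out.println(infixMatches(titles, "luc"));
    }
}
```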
Highlighting offers the ability to highlight search terms in results, in order to show where matches occur. It helps to improve the user experience by allowing a user to see how matching documents are found. This is especially useful when displaying a text result that contains several lines of text.
The highlighting feature starts with the Highlighter class. It can be configured with a formatter and an encoder to render the results. We then retrieve a TokenStream from the matching Field and pass it to the Highlighter to render the highlighted results.
Here is the Maven dependency:
<dependency>
  <groupId>org.apache.lucene</groupId>
  <artifactId>lucene-highlighter</artifactId>
  <version>${lucene.version}</version>
</dependency>
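To illustrate the roles of the formatter and the encoder described above, here is a plain-Java sketch that wraps matched terms in tags and escapes the surrounding text. It is a conceptual stand-in rather than the lucene-highlighter API: the class `HighlightSketch`, its methods, and the regular-expression matching are assumptions made for this example.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Conceptual sketch only: plain Java, not the lucene-highlighter API.
class HighlightSketch {
    // Plays the "encoder" role: escapes characters that are special in HTML.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Plays the "formatter" role: wraps each whole-word, case-insensitive
    // occurrence of term in pre/post tags and encodes everything else.
    static String highlight(String text, String term, String pre, String post) {
        Pattern pattern = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher matcher = pattern.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (matcher.find()) {
            out.append(escapeHtml(text.substring(last, matcher.start())));
            out.append(pre).append(escapeHtml(matcher.group())).append(post);
            last = matcher.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene & search: lucene rocks", "lucene", "<b>", "</b>"));
    }
}
```

Keeping the encoding step separate from the tagging step means the highlight tags themselves are never escaped, while user-visible text always is.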