You're reading from Lucene 4 Cookbook

Product type: Book
Published in: Jun 2015
Reading level: Expert
ISBN-13: 9781782162285
Edition: 1st Edition
Authors (2):
Edwood Ng

Edwood Ng is a technologist with over a decade of experience in building scalable solutions, from proprietary implementations to client-facing web-based applications. Currently, he's the director of DevOps at Wellframe, leading infrastructure and DevOps operations. His background in search engines began at Endeca Technologies in 2004, where he was a technical consultant helping numerous clients to architect and implement faceted search solutions. After Endeca, he drew on his knowledge and began designing and building Lucene-based solutions. His first Lucene implementation that went to production was the search engine behind http://UpDown.com. From there on, he continued to create search applications using Lucene extensively to deliver robust and scalable systems for his clients. Edwood is a supporter of open source software. He has also contributed the sfI18NGettextPluralPlugin plugin to the Symphony project.

Vineeth Mohan

Vineeth Mohan is an architect and developer. He currently works as the CTO at Factweavers Technologies and is also an Elasticsearch-certified trainer. He loves to spend time studying emerging technologies and applications related to data analytics, data visualizations, machine learning, natural language processing, and developments in search analytics. He began coding during his high school days, which later ignited his interest in computer science, and he pursued engineering at Model Engineering College, Cochin. He was recruited by the search giant Yahoo! during his college days. After 2 years of work at Yahoo! on various big data projects, he joined a start-up that dealt with search and analytics. Finally, he started his own big data consulting company, Factweavers. Under his leadership and technical expertise, Factweavers is one of the early adopters of Elasticsearch and has been engaged with projects related to end-to-end big data solutions and analytics for the last few years. There, he got the opportunity to learn various big-data-based technologies, such as Hadoop, and high-performance data ingress systems and storage. Later, he moved to a start-up in his hometown, where he chose Elasticsearch as the primary search and analytic engine for the project assigned to him. Later in 2014, he founded Factweavers Technologies along with Jalaluddeen; it is a consultancy that aims at providing Elasticsearch-based solutions. He is also an Elasticsearch-certified corporate trainer who conducts training sessions in India. To date, he has worked on numerous projects that are mostly based on Elasticsearch and has trained numerous multinationals on Elasticsearch.

Chapter 9. Extending Lucene with Modules

In this chapter, we will cover the following recipes:

  • Exploring spatial search

  • Implementing joins

  • Performing faceting

  • Implementing grouping

  • Employing autosuggest

  • Implementing highlighting

Introduction


So far, we have explored Lucene's core functionalities and learned how to customize Lucene's core components. However, we haven't yet looked at any of the extension libraries that go beyond the core features. In this chapter, we will discover the add-on modules in Lucene, demonstrate how they work, and show how to leverage them effectively. We will cover the following features: spatial search, joins, faceting, grouping, autosuggest, and highlighting.

Implementing joins


Join is a relational database concept. Data is typically normalized and stored in separate tables for storage efficiency and data integrity, and is then joined across tables to provide a coherent view. In Lucene, there is no concept of tables because all records are supposed to be flattened and stored as documents. Even setting up a schema in advance is optional. In a document-based store such as Lucene, joins always seem like an afterthought. However, that doesn't mean you can't do joins at all in Lucene. There are many techniques to simulate joins, such as adding a document type field to identify different types (tables) of records and then manually combining data at runtime by issuing multiple search queries that retrieve data of different types.
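
The manual technique described above can be sketched in plain Java. This is only an illustration of the idea and does not use Lucene: the in-memory INDEX, the search helper, and the field names type, authorId, and country are all hypothetical stand-ins for an index, a search query, and document fields.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Plain-Java illustration only; none of these names are Lucene APIs.
class ManualJoinSketch {

    // Every record is a flat document; the hypothetical "type" field
    // plays the role of a table name.
    static final List<Map<String, String>> INDEX = List.of(
        Map.of("type", "author", "id", "a1", "name", "Ann", "country", "UK"),
        Map.of("type", "author", "id", "a2", "name", "Bob", "country", "US"),
        Map.of("type", "book", "id", "b1", "title", "Search Basics", "authorId", "a1"),
        Map.of("type", "book", "id", "b2", "title", "Index Tuning", "authorId", "a2"),
        Map.of("type", "book", "id", "b3", "title", "Query Parsing", "authorId", "a1"));

    // Stand-in for a search query: exact match on one field,
    // restricted to documents of a single type.
    static List<Map<String, String>> search(String type, String field, Set<String> values) {
        List<Map<String, String>> hits = new ArrayList<>();
        for (Map<String, String> doc : INDEX) {
            String value = doc.get(field);
            if (type.equals(doc.get("type")) && value != null && values.contains(value)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // The first query finds the authors; the second finds the books that
    // reference them. The combination happens in application code.
    static List<String> bookTitlesByAuthorCountry(String country) {
        Set<String> authorIds = new HashSet<>();
        for (Map<String, String> author : search("author", "country", Set.of(country))) {
            authorIds.add(author.get("id"));
        }
        List<String> titles = new ArrayList<>();
        for (Map<String, String> book : search("book", "authorId", authorIds)) {
            titles.add(book.get("title"));
        }
        return titles;
    }
}
```

Calling ManualJoinSketch.bookTitlesByAuthorCountry("UK") issues the two lookups in sequence and returns the titles of the books written by UK authors. The cost of this approach is one extra query per join step, which is part of why dedicated join support is attractive.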

Lucene offers two types of join methods. One is an index-time join, where documents are added in blocks that capture parent-child record relationships. The other type is...

Performing faceting


Faceting provides a way to drill down into data by category. It allows a user to refine a search by adding new categories that narrow it down to the desired results. A useful feature of faceting is that you can preview the total hits for each category value before selecting it. Faceted search can be used in conjunction with text search, so users can combine a text search with a faceted search at any point. In many search applications where data can be systematically categorized, such as a product inventory with product categories, types, availability, and so on, faceting offers a convenient way for users to refine their searches. A notable benefit of faceting is that the user will never hit a zero-results page, because you can only drill down within the available category values.
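
To make the count preview and drill-down concrete, here is a minimal plain-Java sketch. It does not use Lucene's facet module; the BOOKS data, the countBy and drillDown helpers, and the field names are assumptions made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration of the idea only; it does not use Lucene's facet module.
class FacetSketch {

    // Hypothetical bookstore records, each with a subject and an author category.
    static final List<Map<String, String>> BOOKS = List.of(
        Map.of("title", "Lucene Basics", "subject", "Search", "author", "Ann"),
        Map.of("title", "Faceted Navigation", "subject", "Search", "author", "Bob"),
        Map.of("title", "Java Concurrency", "subject", "Programming", "author", "Ann"),
        Map.of("title", "Index Internals", "subject", "Search", "author", "Ann"));

    // Preview: how many of the current hits fall under each value of a category.
    static Map<String, Integer> countBy(List<Map<String, String>> hits, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> doc : hits) {
            counts.merge(doc.get(field), 1, Integer::sum);
        }
        return counts;
    }

    // Drill-down: keep only the hits that carry the selected category value.
    static List<Map<String, String>> drillDown(
            List<Map<String, String>> hits, String field, String value) {
        List<Map<String, String>> narrowed = new ArrayList<>();
        for (Map<String, String> doc : hits) {
            if (value.equals(doc.get(field))) {
                narrowed.add(doc);
            }
        }
        return narrowed;
    }
}
```

Because the selectable values come from countBy over the current hits, every value offered to the user has at least one matching document, which is the reason a drill-down cannot lead to an empty result page.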

Let's look at an example for a fictional online bookstore; we will categorize books by subject and author. In the following figure, note that we...

Implementing grouping


The grouping feature in Lucene offers a way to group results together by a group field. Say you want to list books by category; it is convenient to display results when the resulting documents are already grouped by the group field. Another example is a web search engine where results may be grouped by site, so that a site with an excessive number of matching pages does not overwhelm the top result page. The search engine can show the top matching page from each site, so there is more variety among the quality matches.

Lucene has two implementations for grouping: a single-pass search and a two-pass search. A single-pass search requires document grouping at index time, and a two-pass search requires two searches to provide both group-level and document-level results.
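
The web search example can be sketched in a few lines of plain Java. This shows only what grouped results look like, not how Lucene's single-pass or two-pass implementations work internally; the SAMPLE hits and the groupTopN helper are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration only; this is not Lucene's grouping module.
class GroupingSketch {

    // Hypothetical hits as {site, page}, already sorted by descending relevance.
    static final List<String[]> SAMPLE = List.of(
        new String[] {"a.com", "/1"},
        new String[] {"a.com", "/2"},
        new String[] {"b.com", "/x"},
        new String[] {"a.com", "/3"},
        new String[] {"c.com", "/y"});

    // Groups hits by site and keeps at most docsPerGroup pages per site.
    // Groups appear in the order of their best-ranked hit.
    static Map<String, List<String>> groupTopN(List<String[]> rankedHits, int docsPerGroup) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String[] hit : rankedHits) {
            List<String> docs = groups.computeIfAbsent(hit[0], site -> new ArrayList<>());
            if (docs.size() < docsPerGroup) {
                docs.add(hit[1]);
            }
        }
        return groups;
    }
}
```

With docsPerGroup set to 1, a.com contributes only its best page even though it has three matches, so b.com and c.com remain visible on the first result page.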

To set up a single-pass search, we will add documents in document blocks by calling IndexWriter.addDocuments. We also need to add an end-of-document marker, as a Field, to the last document in each document block...

Employing autosuggest


Lucene's suggest module offers a number of implementations that can be used to support a real-time autosuggest feature as a user types into a search box. You will find many tools in the suggest module that help facilitate the data ingestion process into the autosuggest index. In our demonstrations, we will be using LuceneDictionary to ingest data, as it provides the convenience of extracting tokens from a field in an existing index.

We will go over four suggester implementations in this section:

  • AnalyzingSuggester: This suggester analyzes the input text and provides suggestions based on prefix matches.

  • AnalyzingInfixSuggester: This suggester builds on top of AnalyzingSuggester and provides suggestions based on prefix matches against any token in the indexed text.

  • FreeTextSuggester: This suggester is an n-gram implementation that produces suggestions based on matches against an n-gram index. An n-gram is a contiguous sequence of n items from a given text sequence. "n" represents...
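
As a small illustration of the n-gram definition, the following plain-Java sketch lists the n-grams of a text. It assumes that the "items" are whitespace-separated words and is not taken from Lucene's suggest module; the wordNGrams helper is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java illustration of n-grams; not part of Lucene's suggest module.
class NGramSketch {

    // Returns every contiguous run of n whitespace-separated words in the text.
    static List<String> wordNGrams(String text, int n) {
        List<String> grams = new ArrayList<>();
        String trimmed = text.trim();
        if (trimmed.isEmpty() || n < 1) {
            return grams;
        }
        String[] tokens = trimmed.split("\\s+");
        for (int i = 0; i + n <= tokens.length; i++) {
            grams.add(String.join(" ", Arrays.copyOfRange(tokens, i, i + n)));
        }
        return grams;
    }
}
```

For example, wordNGrams("the quick brown fox", 2) yields the three bigrams "the quick", "quick brown", and "brown fox".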

Implementing highlighting


Highlighting offers the ability to highlight search terms in results, in order to show where matches occur. It helps to improve the user experience by allowing a user to see how matching documents are found. This is especially useful when displaying a text result that contains several lines of text.

The highlighting feature starts with the Highlighter class. It can be configured with a formatter and an encoder to render the results. Then, we retrieve a TokenStream from the matching Field and pass it on to the Highlighter to render highlighting results.
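
The division of labor between a formatter and an encoder can be sketched in plain Java. This is a simplified stand-in and not Lucene's Highlighter: it finds tokens with a regular expression instead of a TokenStream, and the highlight and encodeHtml helpers are hypothetical names.

```java
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Plain-Java illustration only; this is not Lucene's Highlighter.
class HighlightSketch {

    private static final Pattern WORD = Pattern.compile("\\w+");

    // Stand-in for an encoder: escapes the characters that matter in HTML.
    static String encodeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Encodes all text and lets the formatter decorate tokens that match a query term.
    // queryTerms are expected in lower case.
    static String highlight(String text, Set<String> queryTerms,
                            UnaryOperator<String> formatter, UnaryOperator<String> encoder) {
        StringBuilder out = new StringBuilder();
        Matcher m = WORD.matcher(text);
        int last = 0;
        while (m.find()) {
            // Text between tokens (spaces, punctuation) is encoded but never highlighted.
            out.append(encoder.apply(text.substring(last, m.start())));
            String token = m.group();
            String encoded = encoder.apply(token);
            boolean isMatch = queryTerms.contains(token.toLowerCase(Locale.ROOT));
            out.append(isMatch ? formatter.apply(encoded) : encoded);
            last = m.end();
        }
        out.append(encoder.apply(text.substring(last)));
        return out.toString();
    }
}
```

A call such as highlight(text, Set.of("search"), s -> "<B>" + s + "</B>", HighlightSketch::encodeHtml) wraps each occurrence of the term in bold tags while escaping any markup already present in the text, so the formatter's tags are the only live HTML in the output.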

Getting ready…

Here is the Maven dependency:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>${lucene.version}</version>
</dependency>

How to do it…

Let's take a look at a sample implementation:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config...