You're reading from Lucene 4 Cookbook

Product type: Book

Published in: Jun 2015

Reading Level: Expert

Publisher

ISBN-13: 9781782162285

Edition: 1st Edition

Languages: Java

Tools: Lucene

Concepts: Enterprise Search

Authors (2):

Edwood Ng

Vineeth Mohan


Chapter 9. Extending Lucene with Modules

In this chapter, we will cover the following recipes:

Exploring spatial search
Implementing joins
Performing faceting
Implementing grouping
Employing autosuggest
Implementing highlighting

Introduction

So far, we have explored Lucene's core functionality and learned how to customize its core components. However, we haven't yet looked at any of the extension libraries that go beyond the core features. In this chapter, we will explore Lucene's add-on modules, demonstrate how they work, and show how to use them effectively. We will cover the following features: spatial search, joins, faceting, grouping, autosuggest, and highlighting.

Exploring spatial search

Spatial search provides the ability to search by location data. This is also called geo-spatial search. We usually leverage this type of search to look for things within a certain location proximity. It's very useful in map applications where most of the searches are about searching nearby places. Lucene provides this feature by incorporating another open source project called Spatial4j. This offers utilities for shape and distance calculation. Lucene uses these utilities to generate indexable fields to perform query calculations.

When dealing with spatial search, there are several considerations. Are we indexing points or shapes? What kind of shapes are we indexing? Can a document contain more than one point/shape? What kind of query is supported? There is no one solution that fits all in spatial search.
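To make the idea of proximity search concrete, here is a minimal sketch in plain Java. It does not use Lucene or Spatial4j; it only illustrates the kind of distance calculation such utilities perform, using the haversine formula for great-circle distance. The class name, method names, and sample coordinates are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class ProximitySketch {
    // Mean Earth radius in kilometres (an assumed constant for this sketch).
    static final double EARTH_RADIUS_KM = 6371.0088;

    /** Great-circle distance in kilometres between two points given in degrees. */
    static double haversineKm(double lat1, double lon1, double lat2, double lon2) {
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double a = Math.sin(dLat / 2) * Math.sin(dLat / 2)
                + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                * Math.sin(dLon / 2) * Math.sin(dLon / 2);
        return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
    }

    /** Returns the indices of the points that lie within radiusKm of the centre. */
    static List<Integer> within(double[][] points, double lat, double lon, double radiusKm) {
        List<Integer> hits = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (haversineKm(lat, lon, points[i][0], points[i][1]) <= radiusKm) {
                hits.add(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Each row is {latitude, longitude} in degrees; hypothetical sample data.
        double[][] places = { {0.0, 0.0}, {0.0, 1.0}, {10.0, 10.0} };
        System.out.println(within(places, 0.0, 0.0, 150.0));
    }
}
```

A linear scan like this checks every point, which is why a search library indexes spatial data instead of computing distances for all documents at query time.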

Lucene provides four built-in spatial strategies to help handle various types of spatial search requirements:

BBoxStrategy: This strategy permits indexing and searching...

Implementing joins

Join is a relational database concept: data are typically normalized and stored in separate tables for storage efficiency and data integrity, and are then joined across tables to provide a coherent view. Lucene has no concept of tables, because all records are supposed to be flattened and stored as documents. Even setting up a schema in advance is optional. In a document-based store such as Lucene, joins can seem like an afterthought. However, that doesn't mean you can't do joins in Lucene at all. There are many techniques to simulate joins, such as adding a document type field to identify different types (tables) of records and then manually combining the data at runtime by issuing multiple search queries that retrieve data of the different types.
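The type-field technique described above can be sketched in plain Java, without any Lucene classes. Flat "documents" are modeled as maps, one pass stands in for the query that fetches records of one type, and a second pass combines them with records of the other type. The field names (`type`, `key`, `value`) and the sample records are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class TypeFieldJoinSketch {
    /** Builds a flat document with a type field, a join key, and a display value. */
    static Map<String, String> doc(String type, String key, String value) {
        Map<String, String> d = new HashMap<>();
        d.put("type", type);
        d.put("key", key);
        d.put("value", value);
        return d;
    }

    /** Combines "book" documents with the "author" document that shares their key. */
    static List<String> join(List<Map<String, String>> docs) {
        // First pass (stands in for a query on type=author): index authors by key.
        Map<String, String> authorsByKey = new HashMap<>();
        for (Map<String, String> d : docs) {
            if ("author".equals(d.get("type"))) {
                authorsByKey.put(d.get("key"), d.get("value"));
            }
        }
        // Second pass (stands in for a query on type=book): combine at runtime.
        List<String> joined = new ArrayList<>();
        for (Map<String, String> d : docs) {
            if ("book".equals(d.get("type"))) {
                String author = authorsByKey.get(d.get("key"));
                if (author != null) {
                    joined.add(d.get("value") + " by " + author);
                }
            }
        }
        return joined;
    }

    public static void main(String[] args) {
        List<Map<String, String>> docs = Arrays.asList(
                doc("author", "a1", "Author One"),
                doc("book", "a1", "Search Basics"),
                doc("book", "a2", "Orphaned Title"));
        System.out.println(join(docs)); // prints [Search Basics by Author One]
    }
}
```

Books without a matching author are dropped here, which mirrors an inner join; keeping them would correspond to an outer join.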

Lucene offers two types of join methods. One is an index-time join where documents are added in blocks; basically, parent record and child record relationships. The other type is...

Performing faceting

Faceting provides a way to drill down into data by category. It allows a user to refine a search by adding categories, narrowing the search down to the desired results. A useful feature of faceting is that you can preview the total number of hits for each category value before selecting it. Faceted search can be used in conjunction with text search, so users can combine a text query with facet selections at any point. In many search applications where data can be systematically categorized, such as a product inventory with product categories, types, availability, and so on, faceting offers a convenient way for users to refine their searches. A notable benefit of faceting is that the user will never hit a zero-results page, because drill-down is only possible within the available category values.
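The two operations described above, previewing hit counts per category value and drilling down into one value, can be sketched in plain Java. This is not Lucene's facet API; the class, the field names (`subject`, `author`), and the sample books are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class FacetCountSketch {
    /** Builds a hit with two facet fields: subject and author. */
    static Map<String, String> book(String subject, String author) {
        Map<String, String> b = new HashMap<>();
        b.put("subject", subject);
        b.put("author", author);
        return b;
    }

    /** Counts how many hits carry each value of the given facet field. */
    static Map<String, Integer> counts(List<Map<String, String>> hits, String field) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map<String, String> hit : hits) {
            String value = hit.get(field);
            if (value != null) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Drill-down: keeps only the hits whose field equals the selected value. */
    static List<Map<String, String>> drillDown(List<Map<String, String>> hits,
                                               String field, String value) {
        List<Map<String, String>> kept = new ArrayList<>();
        for (Map<String, String> hit : hits) {
            if (value.equals(hit.get(field))) {
                kept.add(hit);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Map<String, String>> hits = Arrays.asList(
                book("Search", "Author A"),
                book("Search", "Author B"),
                book("Databases", "Author A"));
        System.out.println(counts(hits, "subject")); // prints {Databases=1, Search=2}
        System.out.println(counts(drillDown(hits, "subject", "Search"), "author"));
    }
}
```

Because the selectable values come from the counts of the current result set, every value a user can pick has at least one hit behind it.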

Let's look at an example for a fictional online bookstore; we will categorize books by subject and author. In the following figure, note that we...

Implementing grouping

The grouping feature in Lucene offers a way to group results together by a group field. For example, if you want to list books by category, it is convenient to receive results that are already grouped by that field. Another example is a web search engine where results are grouped by site, so that a site with an excessive number of matching pages does not overwhelm the top results page. The search engine can show the top matching page from each site, giving a greater variety of quality matches.

Lucene has two implementations for grouping: a single-pass search and a two-pass search. A single-pass search requires documents to be grouped at index time, whereas a two-pass search runs two searches to provide both group-level and document-level results.
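The two-pass idea can be sketched in plain Java as follows. This is an illustration of the concept only, not Lucene's grouping API: the first pass picks the top groups by their best-scoring hit, and the second pass collects the best hits inside each selected group. The `Hit` class, the site names, and the scores are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class TwoPassGroupingSketch {
    /** A scored search hit that belongs to one group (for example, a site). */
    static final class Hit {
        final String group;
        final String id;
        final double score;

        Hit(String group, String id, double score) {
            this.group = group;
            this.id = id;
            this.score = score;
        }
    }

    /** First pass: the topN group values, ordered by each group's best score. */
    static List<String> topGroups(List<Hit> hits, int topN) {
        Map<String, Double> best = new HashMap<>();
        for (Hit h : hits) {
            best.merge(h.group, h.score, Double::max);
        }
        List<String> groups = new ArrayList<>(best.keySet());
        groups.sort((a, b) -> Double.compare(best.get(b), best.get(a)));
        return new ArrayList<>(groups.subList(0, Math.min(topN, groups.size())));
    }

    /** Second pass: up to perGroup best hit ids for each selected group. */
    static Map<String, List<String>> topDocs(List<Hit> hits, List<String> groups, int perGroup) {
        Map<String, List<String>> result = new LinkedHashMap<>();
        for (String g : groups) {
            List<Hit> inGroup = new ArrayList<>();
            for (Hit h : hits) {
                if (h.group.equals(g)) {
                    inGroup.add(h);
                }
            }
            inGroup.sort((a, b) -> Double.compare(b.score, a.score));
            List<String> ids = new ArrayList<>();
            for (Hit h : inGroup.subList(0, Math.min(perGroup, inGroup.size()))) {
                ids.add(h.id);
            }
            result.put(g, ids);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
                new Hit("siteA", "a1", 0.9), new Hit("siteA", "a2", 0.8),
                new Hit("siteA", "a3", 0.7), new Hit("siteB", "b1", 0.85),
                new Hit("siteC", "c1", 0.1));
        List<String> groups = topGroups(hits, 2);
        System.out.println(topDocs(hits, groups, 2)); // prints {siteA=[a1, a2], siteB=[b1]}
    }
}
```

Even though siteA has three matching pages, only two are shown, so siteB still appears near the top.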

To set up a single-pass search, we add documents in blocks by calling IndexWriter.addDocuments. We also need to add an end-of-document marker, as a Field, to the last document in each document block...

Employing autosuggest

Lucene's suggest module offers a number of implementations that support a real-time autosuggest feature as a user types into a search box. The module also includes many tools that help with ingesting data into the autosuggest index. In our demonstrations, we will use LuceneDictionary to ingest data, as it conveniently extracts tokens from a field in an existing index.

We will go over four suggester implementations in this section:

AnalyzingSuggester: This suggester analyzes input text and provides suggestions based on prefix matches.
AnalyzingInfixSuggester: This suggester builds on top of AnalyzingSuggester and provides suggestions based on prefix matches in any token of the indexed text.
FreeTextSuggester: This suggester is an n-gram implementation that produces suggestions based on matches to an n-gram index. An n-gram is a contiguous sequence of n items from a given text sequence. "n" represents...
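The difference between matching a prefix of the whole entry and matching a prefix of any token can be sketched in plain Java. This is a conceptual illustration only and does not use the suggester classes named above; the method names and the sample entries are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

final class SuggestSketch {
    /** Suggests entries whose full text starts with what the user typed. */
    static List<String> prefixSuggest(List<String> entries, String typed) {
        String needle = typed.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String entry : entries) {
            if (entry.toLowerCase(Locale.ROOT).startsWith(needle)) {
                out.add(entry);
            }
        }
        return out;
    }

    /** Suggests entries in which any whitespace-separated token starts with the input. */
    static List<String> infixSuggest(List<String> entries, String typed) {
        String needle = typed.toLowerCase(Locale.ROOT);
        List<String> out = new ArrayList<>();
        for (String entry : entries) {
            for (String token : entry.toLowerCase(Locale.ROOT).split("\\s+")) {
                if (token.startsWith(needle)) {
                    out.add(entry);
                    break; // one matching token is enough for this entry
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> entries = Arrays.asList(
                "lucene in action", "apache lucene", "solr cookbook");
        System.out.println(prefixSuggest(entries, "luc")); // prints [lucene in action]
        System.out.println(infixSuggest(entries, "luc"));  // prints [lucene in action, apache lucene]
    }
}
```

The lowercasing step is a stand-in for analysis; a real analyzer may also remove stop words or apply other token filters before matching.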

Implementing highlighting

Highlighting offers the ability to highlight search terms in results, in order to show where matches occur. It helps to improve the user experience by allowing a user to see how matching documents are found. This is especially useful when displaying a text result that contains several lines of text.

The highlighting feature starts with the Highlighter class. It can be configured with a formatter and an encoder to render the results. We then retrieve a TokenStream from the matching Field and pass it to the Highlighter to render the highlighted results.
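The roles of the formatter and the encoder can be illustrated with a small plain-Java sketch. It does not use the Highlighter class: a regular expression stands in for the token stream, the `pre` and `post` strings play the part of a formatter, and `escapeHtml` plays the part of an encoder. All names here are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class HighlightSketch {
    /** Encoder role: escapes characters that are special in HTML. */
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    /** Formatter role: wraps each whole-word, case-insensitive match in pre/post tags. */
    static String highlight(String text, String term, String pre, String post) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(term) + "\\b",
                Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Encode the text between matches, then the match itself inside the tags.
            out.append(escapeHtml(text.substring(last, m.start())));
            out.append(pre).append(escapeHtml(m.group())).append(post);
            last = m.end();
        }
        out.append(escapeHtml(text.substring(last)));
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("Lucene is fast & Lucene is free", "lucene", "<b>", "</b>"));
    }
}
```

Encoding the unmatched text separately from the tags keeps user-supplied markup from being rendered while leaving the highlight tags intact.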

Getting ready…

Here is the Maven dependency:

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-highlighter</artifactId>
    <version>${lucene.version}</version>
</dependency>

How to do it…

Let's take a look at a sample implementation:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config...


Authors (2)

Edwood Ng

Edwood Ng is a technologist with over a decade of experience in building scalable solutions, from proprietary implementations to client-facing web-based applications. Currently, he's the director of DevOps at Wellframe, leading infrastructure and DevOps operations. His background in search engines began at Endeca Technologies in 2004, where he was a technical consultant helping numerous clients to architect and implement faceted search solutions. After Endeca, he drew on his knowledge and began designing and building Lucene-based solutions. His first Lucene implementation that went to production was the search engine behind http://UpDown.com. From there on, he continued to create search applications using Lucene extensively to deliver robust and scalable systems for his clients. Edwood is a supporter of open source software. He has also contributed the plugin sfI18NGettextPluralPlugin to the Symphony project.

Vineeth Mohan

Vineeth Mohan is an architect and developer. He currently works as the CTO at Factweavers Technologies and is also an Elasticsearch-certified trainer. He loves to spend time studying emerging technologies and applications related to data analytics, data visualizations, machine learning, natural language processing, and developments in search analytics. He began coding during his high school days, which later ignited his interest in computer science, and he pursued engineering at Model Engineering College, Cochin. He was recruited by the search giant Yahoo! during his college days. After 2 years of work at Yahoo! on various big data projects, he joined a start-up that dealt with search and analytics. Finally, he started his own big data consulting company, Factweavers. Under his leadership and technical expertise, Factweavers is one of the early adopters of Elasticsearch and has been engaged with projects related to end-to-end big data solutions and analytics for the last few years. There, he got the opportunity to learn various big-data-based technologies, such as Hadoop, and high-performance data ingress systems and storage. Later, he moved to a start-up in his hometown, where he chose Elasticsearch as the primary search and analytic engine for the project assigned to him. Later in 2014, he founded Factweavers Technologies along with Jalaluddeen; it is a consultancy that aims at providing Elasticsearch-based solutions. He is also an Elasticsearch-certified corporate trainer who conducts trainings in India. To date, he has worked on numerous projects that are mostly based on Elasticsearch and has trained numerous multinationals on Elasticsearch.