You're reading from Lucene 4 Cookbook

Product type: Book
Published in: Jun 2015
Reading level: Expert
ISBN-13: 9781782162285
Edition: 1st Edition

Authors (2):

Edwood Ng

Edwood Ng is a technologist with over a decade of experience in building scalable solutions, from proprietary implementations to client-facing web-based applications. Currently, he's the director of DevOps at Wellframe, leading infrastructure and DevOps operations. His background in search engines began at Endeca Technologies in 2004, where he was a technical consultant helping numerous clients architect and implement faceted search solutions. After Endeca, he drew on this knowledge to design and build Lucene-based solutions. His first Lucene implementation to go to production was the search engine behind http://UpDown.com. From there on, he continued to create search applications using Lucene extensively, delivering robust and scalable systems for his clients. Edwood is a supporter of open source software. He has also contributed the sfI18NGettextPluralPlugin plugin to the symfony project.

Vineeth Mohan

Vineeth Mohan is an architect and developer. He currently works as the CTO at Factweavers Technologies and is also an Elasticsearch-certified trainer. He loves to spend time studying emerging technologies and applications related to data analytics, data visualization, machine learning, natural language processing, and developments in search analytics. He began coding during his high school days, which later ignited his interest in computer science, and he pursued engineering at Model Engineering College, Cochin. He was recruited by the search giant Yahoo! during his college days. After two years of work at Yahoo! on various big data projects, he joined a start-up that dealt with search and analytics, where he got the opportunity to learn various big data technologies, such as Hadoop, as well as high-performance data ingress systems and storage. Later, he moved to a start-up in his hometown, where he chose Elasticsearch as the primary search and analytics engine for the project assigned to him. In 2014, he founded Factweavers Technologies along with Jalaluddeen, a consultancy that aims to provide Elasticsearch-based solutions. Under his leadership and technical expertise, Factweavers has been one of the early adopters of Elasticsearch and has been engaged in projects related to end-to-end big data solutions and analytics for the last few years. He is also an Elasticsearch-certified corporate trainer who conducts training in India. To date, he has worked on numerous projects that are mostly based on Elasticsearch and has trained numerous multinationals on it.


Chapter 7. Flexible Scoring

We will take a deep dive into Lucene's scoring methodology and explore the available customization options. Here is a list of topics we will cover in this chapter:

  • Overriding similarity

  • Implementing the BM25 model

  • Implementing the language model

  • Implementing the divergence from randomness model

  • Implementing the information-based model

Introduction


Scoring is fundamental to Lucene's search capability and accuracy. Normally, you don't see scores in search results, but they are there to help sort results by relevance. Knowing how scoring works, and where its boundaries lie, will help you make informed decisions in your application design.

The goal of scoring is to objectively calculate weights to rank already matched results. Content that is more relevant to the search criteria is sorted before less relevant content; this is called relevancy ranking. Lucene employs a number of techniques to perform this calculation. Lucene's extensible nature also allows you to customize scoring beyond the default configuration. This flexibility is part of the reason for Lucene's popularity. In this chapter, we will first look into Lucene's scoring methodology. Then, we will explore customization techniques to move beyond the default behavior. The intention of this chapter is to give you a primer on Lucene's scoring implementations. Hopefully...

Overriding similarity


The Similarity class is an abstract class that defines a set of components for score calculation. To steer away from the default scoring, we can create a new class extending DefaultSimilarity (a subclass of TFIDFSimilarity) or one of the other Similarity classes. We will perform some experiments in this section to see how each scoring component affects the overall score.

Let's begin by reviewing Similarity's methods:

  • computeNorm(FieldInvertState): This calculates a normalization value for a Field at indexing time.

  • computeWeight(float, CollectionStatistics, TermStatistics): This returns a SimWeight object used to calculate a score. It accepts a boost (float) value for query-time boosting.

  • coord(int, int): This returns a score factor based on term overlap in a query. This value helps to integrate coordination-level matching. The default implementation is disabled, returning a value of 1.

  • queryNorm(float): This generates a normalization value for a query. The value is also passed back to the Weight...
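To make these components concrete, here is a plain-Java sketch of how DefaultSimilarity's per-term formulas combine (square-root tf, logarithmic idf, coord as an overlap ratio, and an inverse-square-root length norm). The statistics used are made-up illustrations, and the real calculation happens inside Lucene's Weight and Scorer machinery; this is only a formula-level approximation.

```java
// A plain-Java sketch of DefaultSimilarity's scoring components and how
// they combine for a single matched term. Illustrative only.
public class TfIdfSketch {
    // tf(freq) = sqrt(freq) in DefaultSimilarity
    static double tf(double freq) { return Math.sqrt(freq); }

    // idf(docFreq, numDocs) = 1 + ln(numDocs / (docFreq + 1))
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // coord(overlap, maxOverlap) = overlap / maxOverlap
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // lengthNorm ~ 1 / sqrt(numTerms), encoded into the norm at index time
    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

    public static void main(String[] args) {
        // One query term matching a 16-term document twice, in a 1000-doc
        // index where 10 documents contain the term:
        double score = coord(1, 1)
                     * tf(2)
                     * Math.pow(idf(10, 1000), 2)
                     * lengthNorm(16);
        System.out.println(score > 0); // a positive relevance weight
    }
}
```

Overriding similarity amounts to replacing one or more of these small functions with your own, which is why even a minor change (say, a flat tf) can visibly reorder results.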

Implementing the BM25 model


Let's take a look at how to use the BM25 model in Lucene, which implements it as BM25Similarity. We can start using this model by simply instantiating it with default parameters. The constructor accepts two tuning parameters: the first (k1) controls nonlinear term frequency normalization and defaults to 1.2; the second (b) controls to what degree document length normalizes the tf values.

How to do it…

Here is our sample code demonstrating how to use BM25Similarity:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
BM25Similarity similarity = new BM25Similarity(1.2f, 0.75f);
config.setSimilarity(similarity);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
TextField textField = new TextField("content", "", Field.Store.YES);
String[] contents = {"Humpty Dumpty sat...
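For reference, here is a plain-Java sketch of the BM25 weight for a single term, showing exactly where the two constructor arguments above (k1 = 1.2 and b = 0.75) enter the formula. The collection statistics below are illustrative assumptions, not values produced by the indexing code.

```java
// A formula-level sketch of BM25: k1 saturates term frequency, while b
// scales the penalty for documents longer than average. Illustrative only.
public class Bm25Sketch {
    // BM25's idf: ln(1 + (N - n + 0.5) / (n + 0.5))
    static double idf(long docFreq, long numDocs) {
        return Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    static double score(double freq, double docLen, double avgDocLen,
                        long docFreq, long numDocs, double k1, double b) {
        // Length normalization: b = 0 ignores document length entirely,
        // b = 1 normalizes fully by docLen / avgDocLen
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return idf(docFreq, numDocs) * freq * (k1 + 1) / (freq + norm);
    }

    public static void main(String[] args) {
        // A term occurring twice in a 20-term document (average length 25),
        // present in 5 of 100 documents, with the defaults k1=1.2, b=0.75
        double s = score(2, 20, 25, 5, 100, 1.2, 0.75);
        System.out.println(s > 0);
    }
}
```

Because of the `freq + norm` denominator, repeated occurrences of a term yield diminishing returns, which is the "nonlinear term frequency normalization" that k1 controls.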

Implementing the language model


Lucene implements two language models, LMDirichletSimilarity and LMJelinekMercerSimilarity, based on different distribution smoothing methods. Smoothing is a technique that adds a constant weight so that a zero query term frequency in partially matched documents does not produce a zero score, which would be useless for ranking. We will look at these two implementations and see how their weight distributions affect scoring.

How to do it…

We will take a look at LMDirichletSimilarity first. We will reuse our test case from the previous section, but revert the extended second-sentence input:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
LMDirichletSimilarity similarity = new LMDirichletSimilarity(2000);
config.setSimilarity(similarity);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
TextField...
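The idea behind Dirichlet smoothing can be sketched in a few lines of plain Java. With mu = 2000 (the constructor argument above), a term that is absent from a document still receives a nonzero probability borrowed from the collection model. The numbers here are illustrative assumptions, not Lucene's internal computation.

```java
// A sketch of the Dirichlet-smoothed document language model that
// LMDirichletSimilarity is built on. Illustrative only.
public class DirichletSketch {
    // p(w|d) = (freq + mu * p(w|C)) / (|d| + mu)
    static double smoothed(double termFreq, double docLen,
                           double collectionProb, double mu) {
        return (termFreq + mu * collectionProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        double pC = 0.001;                        // collection-wide probability of w
        double pMiss = smoothed(0, 50, pC, 2000); // term missing from the document
        double pHit  = smoothed(3, 50, pC, 2000); // term occurring 3 times
        // Missing terms are smoothed to a nonzero value, and matches
        // still rank above non-matches
        System.out.println(pMiss > 0 && pHit > pMiss);
    }
}
```

A larger mu pulls every document's model closer to the collection model, so mu effectively trades term-match sharpness for robustness on short documents.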

Implementing the divergence from randomness model


In Lucene, the divergence from randomness model is implemented as DFRSimilarity. It is made up of three components: BasicModel, AfterEffect, and Normalization. BasicModel is a model of information content, AfterEffect is the first normalization, and Normalization is the second (length) normalization. Here is an excerpt from Lucene's Javadoc on DFRSimilarity's components:

  1. BasicModel: This is a basic model of information content:

    • BasicModelBE: This is the limiting form of Bose-Einstein

    • BasicModelG: This is the geometric approximation of Bose-Einstein

    • BasicModelP: This is the Poisson approximation of the Binomial

    • BasicModelD: This is the divergence approximation of the Binomial

    • BasicModelIn: This is the inverse document frequency

    • BasicModelIne: This is the inverse expected document frequency (mixture of Poisson and IDF)

    • BasicModelIF: This is the inverse term frequency (approximation of I(ne))

  2. AfterEffect: This is the first normalization of information...
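As a small taste of what one of these building blocks computes, here is a plain-Java sketch of the inverse-document-frequency idea behind BasicModelIn: log2((N + 1) / (n + 0.5)), where N is the number of documents and n is the term's document frequency. Treat the exact constants as an assumption drawn from the standard DFR formulation, not as Lucene's exact class.

```java
// A formula-level sketch of the BasicModelIn information content:
// rare terms carry more information than common ones. Illustrative only.
public class DfrSketch {
    static double basicModelIn(long docFreq, long numDocs) {
        return Math.log((numDocs + 1) / (docFreq + 0.5)) / Math.log(2);
    }

    public static void main(String[] args) {
        // A term in 2 of 1000 documents vs. one in 500 of 1000
        double rare   = basicModelIn(2, 1000);
        double common = basicModelIn(500, 1000);
        System.out.println(rare > common);
    }
}
```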

Implementing the information-based model


The information-based model in Lucene consists of three components: Distribution, Lambda, and Normalization. The setup is somewhat similar to DFRSimilarity, where you instantiate these components in the constructor. The Similarity class for this model is called IBSimilarity. Here is an excerpt from Lucene's Javadoc on the components:

  1. Distribution: This is the probabilistic distribution used to model term occurrence:

    • DistributionLL: This is the Log-logistic distribution

    • DistributionSPL: This is the Smoothed power-law distribution

  2. Lambda: This is the λw parameter of the probability distribution:

    • LambdaDF: This is Nw/N, or the average number of documents where w occurs

    • LambdaTTF: This is Fw/N, or the average number of occurrences of w in the collection

  3. Normalization: This is term frequency normalization:

    • NormalizationH1: In this, there is a uniform distribution of term frequency

    • NormalizationH2: In this, term frequency density is inversely...
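The two Lambda variants listed above can be sketched in plain Java: both estimate the λw parameter from collection statistics, one from document frequency and one from total term frequency. The add-one smoothing below follows the commonly documented formulas for these classes; treat it as an assumption rather than Lucene's exact implementation.

```java
// A sketch of the two lambda estimates used by the information-based
// model's Lambda component. Illustrative only.
public class LambdaSketch {
    // LambdaDF: based on the number of documents containing w
    static double lambdaDF(long docFreq, long numDocs) {
        return (docFreq + 1.0) / (numDocs + 1.0);
    }

    // LambdaTTF: based on w's total occurrence count in the collection
    static double lambdaTTF(long totalTermFreq, long numDocs) {
        return (totalTermFreq + 1.0) / (numDocs + 1.0);
    }

    public static void main(String[] args) {
        // A term in 10 of 1000 documents, occurring 25 times overall:
        // LambdaTTF weighs repeated in-document occurrences that
        // LambdaDF ignores
        System.out.println(lambdaDF(10, 1000) + " vs " + lambdaTTF(25, 1000));
    }
}
```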
