You're reading from Lucene 4 Cookbook

Product type: Book
Published in: Jun 2015
Reading level: Expert
ISBN-13: 9781782162285
Edition: 1st Edition

Authors (2):

Edwood Ng

Edwood Ng is a technologist with over a decade of experience in building scalable solutions, from proprietary implementations to client-facing web-based applications. Currently, he's the director of DevOps at Wellframe, leading infrastructure and DevOps operations. His background in search engines began at Endeca Technologies in 2004, where he was a technical consultant helping numerous clients architect and implement faceted search solutions. After Endeca, he drew on this knowledge to design and build Lucene-based solutions. His first Lucene implementation to go to production was the search engine behind http://UpDown.com. From there on, he continued to create search applications using Lucene extensively, delivering robust and scalable systems for his clients. Edwood is a supporter of open source software. He has also contributed the sfI18NGettextPluralPlugin plugin to the symfony project.

Vineeth Mohan

Vineeth Mohan is an architect and developer. He currently works as the CTO at Factweavers Technologies and is also an Elasticsearch-certified trainer. He loves to spend time studying emerging technologies and applications related to data analytics, data visualization, machine learning, natural language processing, and developments in search analytics. He began coding during his high school days, which later ignited his interest in computer science, and he pursued engineering at Model Engineering College, Cochin. He was recruited by the search giant Yahoo! during his college days. After two years of work at Yahoo! on various big data projects, he joined a start-up that dealt with search and analytics, where he got the opportunity to learn various big data technologies, such as Hadoop, as well as high-performance data ingress systems and storage. Later, he moved to a start-up in his hometown, where he chose Elasticsearch as the primary search and analytics engine for the project assigned to him. In 2014, he founded Factweavers Technologies along with Jalaluddeen, a consultancy that aims to provide Elasticsearch-based solutions. Under his leadership and technical expertise, Factweavers has been one of the early adopters of Elasticsearch and has been engaged in projects related to end-to-end big data solutions and analytics for the last few years. He is also an Elasticsearch-certified corporate trainer who conducts training in India. To date, he has worked on numerous projects that are mostly based on Elasticsearch and has trained numerous multinationals on it.


Chapter 7. Flexible Scoring

We will take a deep dive into Lucene's scoring methodology and explore the available customization options. Here is a list of topics we will cover in this chapter:

  • Overriding similarity

  • Implementing the BM25 model

  • Implementing the language model

  • Implementing the divergence from randomness model

  • Implementing the information-based model

Introduction


Scoring is fundamental to Lucene's search capability and accuracy. Normally, you don't see scores in search results, but they are there to help sort results by relevance. Knowing how scoring works, and where its boundaries lie, will help you make informed decisions in your application design.

The goal of scoring is to objectively calculate weights to rank already matched results. Content that is more relevant to the search criteria is sorted before less relevant content; this is called relevancy ranking. Lucene employs a number of techniques to perform this calculation. Lucene's extensible nature also allows you to customize scoring beyond the default configuration. This flexibility is part of the reason for Lucene's popularity. In this chapter, we will first look into Lucene's scoring methodology. Then, we will explore customization techniques to move beyond the default behavior. The intention of this chapter is to give you a primer on Lucene's scoring implementations. Hopefully...

Overriding similarity


The Similarity class is an abstract class that defines a set of components for score calculation. To steer away from the default scoring, we can create a new class extending DefaultSimilarity (a subclass of TFIDFSimilarity) or one of the other Similarity classes. We will perform some experiments in this section to see how each scoring component affects the overall score.

Let's begin by reviewing Similarity's methods:

  • computeNorm(FieldInvertState): This calculates a normalization value for a Field at indexing time.

  • computeWeight(float, CollectionStatistics, TermStatistics): This returns a SimWeight object used to calculate a score. It accepts a boost (float) value for query-time boosting.

  • coord(int, int): This returns a score factor based on term overlap in a query. This value helps to integrate coordination-level matching. The default implementation is disabled, returning a value of 1.

  • queryNorm(float): This generates a normalization value for a query. The value is also passed back to the Weight...
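To make these components concrete, here is a plain-Java sketch of how DefaultSimilarity's per-term formulas combine (square-root tf, logarithmic idf, coord as an overlap ratio, and an inverse-square-root length norm). The statistics used are made-up illustrations, and the real calculation happens inside Lucene's Weight and Scorer machinery; this is only a formula-level approximation.

```java
// A plain-Java sketch of DefaultSimilarity's scoring components and how
// they combine for a single matched term. Illustrative only.
public class TfIdfSketch {
    // tf(freq) = sqrt(freq) in DefaultSimilarity
    static double tf(double freq) { return Math.sqrt(freq); }

    // idf(docFreq, numDocs) = 1 + ln(numDocs / (docFreq + 1))
    static double idf(long docFreq, long numDocs) {
        return 1.0 + Math.log((double) numDocs / (docFreq + 1));
    }

    // coord(overlap, maxOverlap) = overlap / maxOverlap
    static double coord(int overlap, int maxOverlap) {
        return (double) overlap / maxOverlap;
    }

    // lengthNorm ~ 1 / sqrt(numTerms), encoded into the norm at index time
    static double lengthNorm(int numTerms) { return 1.0 / Math.sqrt(numTerms); }

    public static void main(String[] args) {
        // One query term matching a 16-term document twice, in a 1000-doc
        // index where 10 documents contain the term:
        double score = coord(1, 1)
                     * tf(2)
                     * Math.pow(idf(10, 1000), 2)
                     * lengthNorm(16);
        System.out.println(score > 0); // a positive relevance weight
    }
}
```

Overriding similarity amounts to replacing one or more of these small functions with your own, which is why even a minor change (say, a flat tf) can visibly reorder results.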

Implementing the BM25 model


Let's take a look at how to use the BM25 model in Lucene, which implements it as BM25Similarity. We can start using this model by simply instantiating it with default parameters. The constructor accepts two tuning parameters: the first (k1) controls nonlinear term frequency normalization and defaults to 1.2; the second (b) controls to what degree document length normalizes the tf values.

How to do it…

Here is our sample code demonstrating how to use BM25Similarity:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
BM25Similarity similarity = new BM25Similarity(1.2f, 0.75f);
config.setSimilarity(similarity);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
TextField textField = new TextField("content", "", Field.Store.YES);
String[] contents = {"Humpty Dumpty sat...
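For reference, here is a plain-Java sketch of the BM25 weight for a single term, showing exactly where the two constructor arguments above (k1 = 1.2 and b = 0.75) enter the formula. The collection statistics below are illustrative assumptions, not values produced by the indexing code.

```java
// A formula-level sketch of BM25: k1 saturates term frequency, while b
// scales the penalty for documents longer than average. Illustrative only.
public class Bm25Sketch {
    // BM25's idf: ln(1 + (N - n + 0.5) / (n + 0.5))
    static double idf(long docFreq, long numDocs) {
        return Math.log(1 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    static double score(double freq, double docLen, double avgDocLen,
                        long docFreq, long numDocs, double k1, double b) {
        // Length normalization: b = 0 ignores document length entirely,
        // b = 1 normalizes fully by docLen / avgDocLen
        double norm = k1 * (1 - b + b * docLen / avgDocLen);
        return idf(docFreq, numDocs) * freq * (k1 + 1) / (freq + norm);
    }

    public static void main(String[] args) {
        // A term occurring twice in a 20-term document (average length 25),
        // present in 5 of 100 documents, with the defaults k1=1.2, b=0.75
        double s = score(2, 20, 25, 5, 100, 1.2, 0.75);
        System.out.println(s > 0);
    }
}
```

Because of the `freq + norm` denominator, repeated occurrences of a term yield diminishing returns, which is the "nonlinear term frequency normalization" that k1 controls.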

Implementing the language model


Lucene implements two language models, LMDirichletSimilarity and LMJelinekMercerSimilarity, based on different distribution smoothing methods. Smoothing is a technique that adds a constant weight so that a zero query term frequency in partially matched documents does not produce a zero score, which would be useless for ranking. We will look at these two implementations and see how their weight distributions affect scoring.

How to do it…

We will take a look at LMDirichletSimilarity first. We will reuse our test case from the previous section, but revert the extended second-sentence input:

StandardAnalyzer analyzer = new StandardAnalyzer();
Directory directory = new RAMDirectory();
IndexWriterConfig config = new IndexWriterConfig(Version.LATEST, analyzer);
LMDirichletSimilarity similarity = new LMDirichletSimilarity(2000);
config.setSimilarity(similarity);
IndexWriter indexWriter = new IndexWriter(directory, config);
Document doc = new Document();
TextField...
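The idea behind Dirichlet smoothing can be sketched in a few lines of plain Java. With mu = 2000 (the constructor argument above), a term that is absent from a document still receives a nonzero probability borrowed from the collection model. The numbers here are illustrative assumptions, not Lucene's internal computation.

```java
// A sketch of the Dirichlet-smoothed document language model that
// LMDirichletSimilarity is built on. Illustrative only.
public class DirichletSketch {
    // p(w|d) = (freq + mu * p(w|C)) / (|d| + mu)
    static double smoothed(double termFreq, double docLen,
                           double collectionProb, double mu) {
        return (termFreq + mu * collectionProb) / (docLen + mu);
    }

    public static void main(String[] args) {
        double pC = 0.001;                        // collection-wide probability of w
        double pMiss = smoothed(0, 50, pC, 2000); // term missing from the document
        double pHit  = smoothed(3, 50, pC, 2000); // term occurring 3 times
        // Missing terms are smoothed to a nonzero value, and matches
        // still rank above non-matches
        System.out.println(pMiss > 0 && pHit > pMiss);
    }
}
```

A larger mu pulls every document's model closer to the collection model, so mu effectively trades term-match sharpness for robustness on short documents.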

Implementing the divergence from randomness model


In Lucene, the divergence from randomness model is implemented as DFRSimilarity. It is made up of three components: BasicModel, AfterEffect, and Normalization. BasicModel is a model of information content, AfterEffect is the first normalization, and Normalization is the second (length) normalization. Here is an excerpt from Lucene's Javadoc on DFRSimilarity's components:

  1. BasicModel: This is a basic model of information content:

    • BasicModelBE: This is the limiting form of Bose-Einstein

    • BasicModelG: This is the geometric approximation of Bose-Einstein

    • BasicModelP: This is the Poisson approximation of the Binomial

    • BasicModelD: This is the divergence approximation of the Binomial

    • BasicModelIn: This is the inverse document frequency

    • BasicModelIne: This is the inverse expected document frequency (mixture of Poisson and IDF)

    • BasicModelIF: This is the inverse term frequency (approximation of I(ne))

  2. AfterEffect: This is the first normalization of information...
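As a small taste of what one of these building blocks computes, here is a plain-Java sketch of the inverse-document-frequency idea behind BasicModelIn: log2((N + 1) / (n + 0.5)), where N is the number of documents and n is the term's document frequency. Treat the exact constants as an assumption drawn from the standard DFR formulation, not as Lucene's exact class.

```java
// A formula-level sketch of the BasicModelIn information content:
// rare terms carry more information than common ones. Illustrative only.
public class DfrSketch {
    static double basicModelIn(long docFreq, long numDocs) {
        return Math.log((numDocs + 1) / (docFreq + 0.5)) / Math.log(2);
    }

    public static void main(String[] args) {
        // A term in 2 of 1000 documents vs. one in 500 of 1000
        double rare   = basicModelIn(2, 1000);
        double common = basicModelIn(500, 1000);
        System.out.println(rare > common);
    }
}
```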

Implementing the information-based model


The information-based model in Lucene consists of three components: Distribution, Lambda, and Normalization. The setup is somewhat similar to DFRSimilarity, where you instantiate these components in the constructor. The Similarity class for this model is called IBSimilarity. Here is an excerpt from Lucene's Javadoc on the components:

  1. Distribution: This is the probabilistic distribution used to model term occurrence:

    • DistributionLL: This is the Log-logistic distribution

    • DistributionSPL: This is the Smoothed power-law distribution

  2. Lambda: This is the λw parameter of the probability distribution:

    • LambdaDF: This is Nw/N, or the average number of documents where w occurs

    • LambdaTTF: This is Fw/N, or the average number of occurrences of w in the collection

  3. Normalization: This is term frequency normalization:

    • NormalizationH1: In this, there is a uniform distribution of term frequency

    • NormalizationH2: In this, term frequency density is inversely...
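The two Lambda variants listed above can be sketched in plain Java: both estimate the λw parameter from collection statistics, one from document frequency and one from total term frequency. The add-one smoothing below follows the commonly documented formulas for these classes; treat it as an assumption rather than Lucene's exact implementation.

```java
// A sketch of the two lambda estimates used by the information-based
// model's Lambda component. Illustrative only.
public class LambdaSketch {
    // LambdaDF: based on the number of documents containing w
    static double lambdaDF(long docFreq, long numDocs) {
        return (docFreq + 1.0) / (numDocs + 1.0);
    }

    // LambdaTTF: based on w's total occurrence count in the collection
    static double lambdaTTF(long totalTermFreq, long numDocs) {
        return (totalTermFreq + 1.0) / (numDocs + 1.0);
    }

    public static void main(String[] args) {
        // A term in 10 of 1000 documents, occurring 25 times overall:
        // LambdaTTF weighs repeated in-document occurrences that
        // LambdaDF ignores
        System.out.println(lambdaDF(10, 1000) + " vs " + lambdaTTF(25, 1000));
    }
}
```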
