Reader small image

You're reading from  Apache Solr Search Patterns

Product typeBook
Published inApr 2015
Reading LevelIntermediate
Publisher
ISBN-139781783981847
Edition1st Edition
Languages
Tools
Right arrow
Author (1)
Jayant Kumar
Jayant Kumar
author image
Jayant Kumar

Jayant Kumar is an experienced software professional with a bachelor of engineering degree in computer science and more than 14 years of experience in architecting and developing large-scale web applications. Jayant is an expert on search technologies and PHP and has been working with Lucene and Solr for more than 11 years now. He is the key person responsible for introducing Lucene as a search engine on www.naukri.com, the most successful job portal in India. Jayant is also the author of the book Apache Solr PHP Integration, Packt Publishing, which has been very successful. Jayant has played many different important roles throughout his career, including software developer, team leader, project manager, and architect, but his primary focus has been on building scalable solutions on the Web. Currently, he is associated with the digital division of HT Media as the chief architect responsible for the job site www.shine.com. Jayant is an avid blogger and his blog can be visited at http://jayant7k.blogspot.in. His LinkedIn profile is available at http://www.linkedin.com/in/jayantkumar.
Read more about Jayant Kumar

Right arrow

Implementing the information gain model


The problem with the information gain model is that, for each term in the index, we will have to evaluate the occurrence of every other term. The complexity of the algorithm will be of the order of square of the two terms, square(xy). It is not possible to compute this using a simple machine. What is recommended is that we create a map-reduce job and use a distributed Hadoop cluster to compute the information gain for each term in the index.

Our distributed Hadoop cluster would do the following:

  • Count all occurrences of each term in the index

  • Count all occurrences of each co-occurring term in the index

  • Construct a hash table or a map of co-occurring terms

  • Calculate the information gain for each term and store it in a file in the Hadoop cluster

In order to implement this in our scoring algorithm, we will need to build a custom scorer where the IDF calculation is overwritten by the algorithm for deriving the information gain for the term from the Hadoop cluster...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Apache Solr Search Patterns
Published in: Apr 2015Publisher: ISBN-13: 9781783981847

Author (1)

author image
Jayant Kumar

Jayant Kumar is an experienced software professional with a bachelor of engineering degree in computer science and more than 14 years of experience in architecting and developing large-scale web applications. Jayant is an expert on search technologies and PHP and has been working with Lucene and Solr for more than 11 years now. He is the key person responsible for introducing Lucene as a search engine on www.naukri.com, the most successful job portal in India. Jayant is also the author of the book Apache Solr PHP Integration, Packt Publishing, which has been very successful. Jayant has played many different important roles throughout his career, including software developer, team leader, project manager, and architect, but his primary focus has been on building scalable solutions on the Web. Currently, he is associated with the digital division of HT Media as the chief architect responsible for the job site www.shine.com. Jayant is an avid blogger and his blog can be visited at http://jayant7k.blogspot.in. His LinkedIn profile is available at http://www.linkedin.com/in/jayantkumar.
Read more about Jayant Kumar