You're reading from Apache Solr Search Patterns

Product type: Book
Published in: Apr 2015
Reading level: Intermediate
Edition: 1st Edition
ISBN-13: 9781783981847

Author: Jayant Kumar

Jayant Kumar is an experienced software professional with a bachelor of engineering degree in computer science and more than 14 years of experience in architecting and developing large-scale web applications. Jayant is an expert on search technologies and PHP and has been working with Lucene and Solr for more than 11 years now. He is the key person responsible for introducing Lucene as a search engine on www.naukri.com, the most successful job portal in India. Jayant is also the author of the book Apache Solr PHP Integration, Packt Publishing, which has been very successful. Jayant has played many different important roles throughout his career, including software developer, team leader, project manager, and architect, but his primary focus has been on building scalable solutions on the Web. Currently, he is associated with the digital division of HT Media as the chief architect responsible for the job site www.shine.com. Jayant is an avid blogger and his blog can be visited at http://jayant7k.blogspot.in. His LinkedIn profile is available at http://www.linkedin.com/in/jayantkumar.

Chapter 4. Solr for Big Data

In the previous chapter, we learned about Solr internals and the creation of custom queries. We examined the algorithms behind AND and OR clauses in Solr and the internals of the eDisMax parser, implemented our own Solr plugin for running proximity searches using SWAN queries, and looked at how filters work internally.

In this chapter, we will discuss how and why Solr is an appropriate choice for churning out analytical reports. We will understand the concept of big data and how Solr can be used to solve the problems that come along with running queries on big data. We will discuss different faceting concepts and see how distributed pivot faceting works.

The topics that we will cover in this chapter are:

  • Introduction to big data

  • Getting data points using facets

  • Radius faceting for location-based data

  • Data analysis using pivot faceting

  • Introduction to graphical representation of analytical reports

Introduction to big data


Big data can simply be defined as data too large to be processed by a single machine. Suppose we have 1 TB of data and the reports that need to be generated from it cannot be produced on a single machine within an acceptable time span. Take clickstream analysis as an example. Internet companies such as Yahoo and Google track user activity by capturing every click a user makes on their websites; sometimes the complete page-by-page flow is captured as well. Google, for example, records the position of each result on a search result page for a particular keyword or phrase. The amount of data generated and captured is huge and may run into exabytes every day. This data needs to be processed on a day-to-day basis for analytical purposes, and the reports generated from it are used to improve the experience of users visiting the website.

Is it possible to process an exabyte of data? Of course...

Getting data points using facets


Let us refresh our memory about facets. Simply put, faceting refers to the method of categorizing data. A facet on a search result contains categories and the number of documents in each category. The purpose of facets is to help the user narrow down the search results on the basis of those categories. Let us take an example to understand this better.

A search for a mobile phone on the Amazon website would bring up facets such as the following:

  • Facet for Brand: We can see a facet for Brand in the following screenshot:

The brand facet is purely intended to help the user shortlist his or her preferences. The count of cell phones for each brand is not displayed, although this information is readily available and can be used for display.

  • Facet for display size: We can see the facet for display size in the following image:

The display size category shows facets based on the range of display sizes. Phones having sizes of less than 3.9 inches are grouped together...
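As a sketch of how facets like these are requested, the following Python snippet builds a Solr facet query URL. The host, core name (`phones`), and field names (`brand`, `display_size`) are hypothetical placeholders, not values from the book:

```python
from urllib.parse import urlencode

# Hypothetical core and field names for a phone catalogue.
params = [
    ("q", "mobile phone"),
    ("rows", "0"),                 # only facet counts are needed, no documents
    ("facet", "true"),
    ("facet.field", "brand"),      # one bucket per brand, with a document count
    # Range buckets for display size, grouping phones into size intervals.
    ("facet.range", "display_size"),
    ("facet.range.start", "2.9"),
    ("facet.range.end", "6.9"),
    ("facet.range.gap", "1.0"),
]
url = "http://localhost:8983/solr/phones/select?" + urlencode(params)
print(url)
```

Sending this request to a running Solr instance would return the brand buckets and display-size ranges alongside the (empty) result list.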

Radius faceting for location-based data


Location-based data can be represented in Solr using latitudes and longitudes. Applications can combine other data with location information to provide more insight into the data pertaining to a certain location. In analytics, location-based data is very important. Whether we are dealing with sales information, statistical information of any kind, or information pertaining to visits to a website, having a location in addition to the numbers that we already have provides an additional insight with a regional perspective.

We will delve into how geospatial searches happen in Solr in Chapter 6, Solr for Spatial Search. For the current chapter, let us understand the different types of location filters available with Solr.

For spatial filters, the following parameters are used in Solr:

  • d: The radial distance, in kilometers

  • pt: The center point, in the format latitude,longitude

  • sfield: The name of the spatially indexed field to filter on
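A minimal sketch of these three parameters in use with Solr's geofilt filter follows. The field name `store` and the coordinates are hypothetical; a real query would be sent to a running Solr instance:

```python
from urllib.parse import urlencode

# Filter documents to within d kilometers of the point pt,
# using the spatial field named by sfield.
params = {
    "q": "*:*",
    "fq": "{!geofilt}",        # Solr's radial distance filter
    "sfield": "store",         # hypothetical spatially indexed field
    "pt": "28.6139,77.2090",   # latitude,longitude of the center point
    "d": "10",                 # radius in kilometers
}
query_string = urlencode(params)
print(query_string)
```

Appending this query string to the core's select handler URL would restrict the results to documents within the 10 km radius.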

Tip

In order to run queries, we would need...

Data analysis using pivot faceting


As per the definition in the Solr wiki, pivoting is a summarization tool that lets you automatically sort, count, total, or average data stored in a table. Pivot faceting lets you create a summary table of the results from a query across numerous documents.

The output of pivot faceting can be thought of as a decision tree: a hierarchy of sub-facets under each facet, with counts for both the facets and their sub-facets. We can constrain a facet by one of its values and get counts for the sub-facets nested inside it. Let us see an example to understand pivot faceting.

Suppose facet A has the constraints X and Y, with count M for X and count N for Y. We can then constrain facet A by X and get a new sub-facet B with constraints W and Z, and counts O for W and P for Z.
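The A → B hierarchy above can be sketched as the kind of nested structure Solr returns under facet_pivot. The values and counts here are hypothetical placeholders standing in for M, N, O, and P:

```python
# Hypothetical facet_pivot response for the field pair "A,B": each node
# carries its field, value, count, and an optional "pivot" list of sub-facets.
pivot = [
    {"field": "A", "value": "X", "count": 10,          # M documents match A=X
     "pivot": [
         {"field": "B", "value": "W", "count": 6},     # O docs with A=X and B=W
         {"field": "B", "value": "Z", "count": 4},     # P docs with A=X and B=Z
     ]},
    {"field": "A", "value": "Y", "count": 7},          # N documents match A=Y
]

def flatten(nodes, path=()):
    """Yield (path-of-values, count) for every node in the pivot tree."""
    for node in nodes:
        here = path + (node["value"],)
        yield here, node["count"]
        yield from flatten(node.get("pivot", []), here)

for values, count in flatten(pivot):
    print(" > ".join(values), count)
```

Walking the tree this way makes it easy to feed each level of the hierarchy into a report or chart.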

To understand better how pivot faceting works and hence how it could be helpful in analytics, let us see an example. Our index contains some...

Graphs for analytics


Once we know which queries to execute for getting the facets and hierarchical information, we need a graphical representation of the same. There are a few open source graph engines, mostly JavaScript based, that can be used for this. Most of these engines take JSON data and use it to display the graphs. Let us see some of the engines:

  • Chart.js: This is an HTML5-based charting library. It can be downloaded from http://www.chartjs.org.

  • D3.js: This is another JavaScript library that brings data to life using HTML and CSS. D3 can be used to generate an HTML table from an array of numbers or the same numbers can be used to draw an interactive bar chart. It is available for download at http://d3js.org.

  • Google Charts: This is another library, provided by Google, that can be used to draw graphs based on data from Solr. It provides a large range of charts, from simple line charts to complex hierarchical treemaps. Most of the charts are ready to use. Google Charts can be downloaded...
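Most of these libraries expect labels and values as plain JSON. The snippet below reshapes Solr's facet_fields output, which is a flat alternating name,count list, into that form; the brand names and counts are hypothetical sample data:

```python
import json

# Solr returns facet_fields as a flat alternating [name, count, ...] list.
facet_fields = {"brand": ["Samsung", 120, "Apple", 95, "Nokia", 40]}

flat = facet_fields["brand"]
chart_data = {
    "labels": flat[0::2],   # every even index is a category name
    "values": flat[1::2],   # every odd index is its document count
}
print(json.dumps(chart_data))
```

The resulting JSON can be handed directly to a bar or pie chart in any of the libraries listed above.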

Summary


In this chapter, we learned how Solr can be used to churn out data for analytics purposes. We also looked at big data and learned how to use different faceting concepts, such as radius faceting and pivot faceting, for data analytics. We saw some code that can be used for generating graphs and discussed the different libraries available for this. We also discussed that, with SolrCloud, we can build our own data warehouse and get graphs of not only historical data but also real-time data.

In the next chapter, we will learn about the problems that we normally face during the implementation of Solr on an e-commerce platform. We will also discuss how to debug such problems along with tweaks to further optimize the instance(s). Additionally, we will learn about semantic search and its implementation in e-commerce scenarios.
