Scaling Big Data with Hadoop and Solr, Second Edition


Chapter 4. Big Data Search Using Hadoop and Its Ecosystem

Some time back, Gartner (http://www.gartner.com/newsroom/id/2304615) published an executive programs survey report, which revealed that big data and analytics are among the top 10 business priorities for CIOs; analytics and BI likewise sit at the top of CIOs' technical priorities. Big data presents three major concerns for any organization: the storage of big data, data access or querying, and data analytics. Apache Hadoop provides an excellent implementation framework for organizations looking to solve these problems. Similarly, other software, such as Apache Cassandra and the R statistical environment, provides efficient storage of and access to big data. In this chapter, we will explore the possibilities of Apache Solr working with big data.

We have already discussed scaling search with SolrCloud in the previous chapters. In this chapter, we will focus on the following topics:

  • Understanding NoSQL

  • Working with...

Understanding NoSQL


Traditional relational databases require users to define a strict data structure and use an SQL-based querying mechanism. NoSQL databases, rather than confining users to predefined data structures, provide an open store in which they can keep any kind of data and retrieve it by running queries that are not SQL based. In an enterprise, data is generated by all the software used in day-to-day operations. This data comes in different formats, and bringing it in for big data processing requires a storage system flexible enough to accommodate data with varying data models. NoSQL databases are, by design, best suited for such storage.
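To make the contrast concrete, here is a minimal Java sketch (the record fields are hypothetical) of the difference between a rigid, relational-style record and the schema-flexible, key-value shape that NoSQL stores accept:

import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

public class SchemaFlexibilityDemo {

    // Relational style: the structure is fixed up front, and every
    // row must carry exactly these typed columns.
    static final class CustomerRow {
        final int id;
        final String name;
        final String email;
        CustomerRow(int id, String name, String email) {
            this.id = id; this.name = name; this.email = email;
        }
    }

    public static void main(String[] args) {
        // NoSQL style: each record is a bag of key-value pairs, so two
        // records of the same kind may carry entirely different fields.
        Map<String, Object> customerA = new HashMap<String, Object>();
        customerA.put("id", 1);
        customerA.put("name", "Alice");
        customerA.put("email", "alice@example.com");

        Map<String, Object> customerB = new HashMap<String, Object>();
        customerB.put("id", 2);
        customerB.put("name", "Bob");
        // No email; an extra multi-valued field instead.
        customerB.put("phones", Arrays.asList("555-0101", "555-0102"));

        System.out.println(customerA);
        System.out.println(customerB);
    }
}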

Note

The CAP theorem, also known as Brewer's theorem, concerns consistency in distributed systems. It states that it is impossible for a distributed system to guarantee all of the following at the same time:

  • Consistency: Every client sees the most recently updated data state.

  • Availability: The distributed system functions as expected, even if there are node failures...

Working with the Solr HDFS connector


Apache Solr can utilize HDFS for indexing and storing its indices on the Hadoop system. It does not utilize a MapReduce-based framework for indexing. The following diagram shows the interaction pattern between Solr and HDFS. You can read more details about Apache Hadoop at http://hadoop.apache.org/docs/r2.4.0/.

Let's understand how this can be done.

  1. To start with, the first and most important task is to get Apache Hadoop set up on your machine (in a single-node, pseudo-distributed configuration) or to set up a Hadoop cluster. You can download the latest Hadoop tarball or zip from http://hadoop.apache.org. The newer generation of Hadoop uses the next-generation MapReduce framework (also known as YARN).

  2. Based on the requirement, you can set up a single node (Documentation: http://hadoop.apache.org/docs/r<version>/hadoop-project-dist/hadoop-common/SingleCluster.html) or a cluster (Documentation: http://hadoop.apache.org/docs/r<version>/hadoop-project-dist/hadoop-common/ClusterSetup.html).

  3. Typically...
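To give the setup steps above a concrete flavor: once a core's solrconfig.xml switches its directory factory to solr.HdfsDirectoryFactory, the index lives on HDFS, but client code is unchanged. The following SolrJ sketch, written against the Solr 4.x-era API and assuming a core named collection1 running on localhost, indexes one document into such a core:

import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.io.IOException;

public class HdfsBackedIndexing {
    public static void main(String[] args) throws IOException, SolrServerException {
        // The core's solrconfig.xml points its directoryFactory at HDFS
        // (solr.HdfsDirectoryFactory); from the client's perspective,
        // indexing is identical to indexing a local core.
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/collection1");

        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", "doc-1");   // assumed schema fields
        doc.addField("title", "Indexing into an HDFS-backed Solr core");
        solr.add(doc);
        solr.commit();   // make the document visible to searchers

        solr.shutdown(); // release the underlying HTTP client
    }
}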

Big data search using Katta


Katta provides highly scalable, fault-tolerant information storage. It is an open source project that uses the underlying Hadoop infrastructure (specifically, HDFS) to store its indices and provide access to them. Katta has been available for several years; although its development has recently stalled, many users still go with Solr-Katta-based integration for big data search. Some organizations customize Katta to their needs and utilize its capabilities for highly scalable search. Katta brings Apache Hadoop and Solr together, enabling search across a completely distributed, MapReduce-based cluster. You can read more about Katta at http://katta.sourceforge.net/.

How Katta works

Katta is primarily used for two functions: the first is generating the Solr index, and the second is running a search on the Hadoop cluster. The following diagram depicts what the Katta architecture looks like:

The...

Using Solr 1045 Patch – map-side indexing


The Apache Solr 1045 patch provides Solr users with a way to build Solr indexes using Apache Hadoop's MapReduce framework. Once created, these indexes can be pushed to Solr storage. The following diagram depicts the Mapper and Reducer in Hadoop:

Each Apache Hadoop mapper transforms the input records into a set of (key, value) pairs, which are then transformed into SolrInputDocument instances. Each map task then creates an index from these SolrInputDocument instances.

The focus of the Reducer is to de-duplicate the different indexes and merge them if needed. Once the indexes are created, you can load them on your Solr instance and use them for searching. You can read more about this patch at https://issues.apache.org/jira/browse/SOLR-1045.
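To illustrate the map-side approach, here is a simplified, hypothetical Java mapper; it is a sketch of the idea, not one of SOLR-1045's actual classes. Each map task parses a tab-delimited input line into a SolrInputDocument and feeds it to an embedded Solr instance (pointed at a per-task Solr home via an assumed solr.home configuration key), so every mapper builds its own index shard:

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

import java.io.IOException;

/**
 * Conceptual map-side indexer: each map task writes its own index shard
 * through an embedded Solr instance. Input lines are "id<TAB>title".
 */
public class MapSideIndexMapper
        extends Mapper<LongWritable, Text, NullWritable, NullWritable> {

    private EmbeddedSolrServer solr;

    @Override
    protected void setup(Context context) {
        // "solr.home" would be shipped to each task, for example via the
        // distributed cache (hypothetical configuration key).
        CoreContainer container =
                new CoreContainer(context.getConfiguration().get("solr.home"));
        container.load();
        solr = new EmbeddedSolrServer(container, "collection1");
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\t");
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", fields[0]);
        doc.addField("title", fields[1]);
        try {
            solr.add(doc); // the index is built inside the map task
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            solr.commit(); // flush this mapper's shard
            solr.shutdown();
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }
}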

The patch is applied through the standard process of patching your local checkout through svn (Subversion). To apply the patch to your Solr instance, you first need to build your Solr instance from source. The instance should be supported by Solr...

Using Solr 1301 Patch – reduce-side indexing


The Solr 1301 patch also generates an index using the Apache Hadoop MapReduce framework. This patch was merged into Solr version 4.7, so it is available in the code-line if you take Apache Solr version 4.7 or later. The patch is similar to the previously discussed one (SOLR-1045), the difference being that Solr 1301 generates the indexes in the reduce phase, not the map phase, of Apache Hadoop's MapReduce. Once the indexes are generated, they can be loaded on Solr or SolrCloud for further processing and application searching. The following diagram depicts the overall flow:

In the case of Solr 1301, a map task is responsible for converting each input record into a <key, value> pair, which is later passed to the reducer. The reducer is responsible for converting these pairs into SolrInputDocument instances and publishing them, from which the Solr indexes are built. The indexes are then persisted directly on HDFS and can later be exported...
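Again as a hedged, conceptual sketch rather than SOLR-1301's actual classes: here the mappers only emit (shardKey, rawRecord) pairs, and the reducer turns records into documents and builds the index, so the number of reducers controls the number of shards:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.embedded.EmbeddedSolrServer;
import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.core.CoreContainer;

import java.io.IOException;

/**
 * Conceptual reduce-side indexer: mappers only emit (shardKey, rawRecord);
 * the reducer turns records into documents and builds the index.
 */
public class ReduceSideIndexReducer
        extends Reducer<Text, Text, NullWritable, NullWritable> {

    private EmbeddedSolrServer solr;

    @Override
    protected void setup(Context context) {
        // Same assumed "solr.home" configuration key as the map-side sketch.
        CoreContainer container =
                new CoreContainer(context.getConfiguration().get("solr.home"));
        container.load();
        solr = new EmbeddedSolrServer(container, "collection1");
    }

    @Override
    protected void reduce(Text shardKey, Iterable<Text> records, Context context)
            throws IOException, InterruptedException {
        for (Text record : records) {
            String[] fields = record.toString().split("\t"); // "id<TAB>title"
            SolrInputDocument doc = new SolrInputDocument();
            doc.addField("id", fields[0]);
            doc.addField("title", fields[1]);
            try {
                solr.add(doc); // the index is built in the reduce phase
            } catch (SolrServerException e) {
                throw new IOException(e);
            }
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        try {
            solr.commit(); // persist this reducer's shard
            solr.shutdown();
        } catch (SolrServerException e) {
            throw new IOException(e);
        }
    }
}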

Distributed search using Apache Blur


Apache Blur is a distributed search engine that can work with Apache Hadoop. It differs from traditional big data systems in that it provides relational data model-like storage on top of HDFS. Apache Blur does not use Apache Solr; instead, it consumes the Apache Lucene APIs directly. Blur provides faster data ingestion using MapReduce, as well as advanced search features such as faceted, fuzzy, and wildcard searches, along with pagination.

Apache Blur provides a row-based data model (similar to an RDBMS) with unique row IDs. Each record has a unique record ID, a row ID, and a column family. A column family is a group of logical columns; for example, a personal information column family might have columns such as the person's name, the companies with which the person works, and contact information. The following figure shows how Apache Blur works closely with Apache Hadoop:

Apache Blur uses Hadoop to store its indexes in a distributed manner. It uses Thrift APIs for all interprocess communication...
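To make the row/record/column-family shape more concrete, here is a plain-Java sketch of the model; these are illustrative classes, not Blur's actual Thrift-generated types:

import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative model of Blur's row/record/column-family shape. */
public class BlurModelSketch {

    /** A record: a unique recordId, a column family, and its columns. */
    static final class Record {
        final String recordId;
        final String family; // e.g. the "personal" column family
        final Map<String, String> columns = new HashMap<String, String>();
        Record(String recordId, String family) {
            this.recordId = recordId;
            this.family = family;
        }
    }

    /** A row: a unique rowId grouping one or more records. */
    static final class Row {
        final String rowId;
        final List<Record> records = new ArrayList<Record>();
        Row(String rowId) { this.rowId = rowId; }
    }

    public static void main(String[] args) {
        Row row = new Row("row-1");

        Record personal = new Record("rec-1", "personal");
        personal.columns.put("name", "Jane Doe");
        personal.columns.put("company", "Acme Corp");
        personal.columns.put("contact", "jane@example.com");
        row.records.add(personal);

        System.out.println("row " + row.rowId + " has "
                + row.records.size() + " record(s)");
    }
}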

Apache Solr and Cassandra


Cassandra is one of the most widely used distributed, fault-tolerant NoSQL databases. Cassandra is designed to handle big data workloads across multiple nodes with no single point of failure. There are some interesting performance benchmarks published at Planet Cassandra (http://planetcassandra.org/NoSQL-performance-benchmarks/), which place Apache Cassandra among the fastest NoSQL databases relative to its competitors in terms of throughput, load, and so on. Apache Cassandra allows schema-less storage of user information in its store through the column family pattern. For example, look at the data model for sales lead information shown in the following screenshot:

This model, when transformed for the Cassandra store, becomes columnar storage. The following screenshot shows how this model would look using Apache Cassandra:

As one can see, the key here is the customer ID, and the value is a set of attributes/columns that vary for each row key. Further, columns...
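As a hedged illustration of this shape (the excerpt does not prescribe a client library; the DataStax Java driver, keyspace, and column names below are assumptions), the two INSERT statements deliberately populate different column subsets, so each row key carries its own set of stored columns:

import com.datastax.driver.core.Cluster;
import com.datastax.driver.core.Session;

public class SalesLeadsDemo {
    public static void main(String[] args) {
        // Assumed: a Cassandra node reachable on localhost.
        Cluster cluster = Cluster.builder().addContactPoint("127.0.0.1").build();
        Session session = cluster.connect();

        session.execute("CREATE KEYSPACE IF NOT EXISTS crm WITH replication = "
                + "{'class': 'SimpleStrategy', 'replication_factor': 1}");
        session.execute("CREATE TABLE IF NOT EXISTS crm.sales_leads ("
                + "customer_id text PRIMARY KEY, name text, stage text, region text)");

        // The row key is the customer ID; each row stores only the
        // columns it was given, so column sets vary per row key.
        session.execute("INSERT INTO crm.sales_leads (customer_id, name, stage) "
                + "VALUES ('cust-1001', 'Acme Corp', 'qualified')");
        session.execute("INSERT INTO crm.sales_leads (customer_id, name, region) "
                + "VALUES ('cust-1002', 'Globex', 'EMEA')");

        cluster.close();
    }
}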

Scaling Solr through Storm


Apache Storm is a real-time distributed computation framework that processes humongous volumes of data in real time. Storm was recently adopted by Apache as an incubating project, and its development now continues under Apache. You can read more about Apache Storm's features at http://storm.incubator.apache.org/.

Apache Storm can be used to process massive streams of data in a distributed manner; it therefore provides excellent stream-processing capabilities for time-sensitive analytics. With Apache Solr and Storm together, organizations can process big data in real time: for example, an industrial plant that would like to extract information from its plant systems, which emit raw data continuously, can process that data to facilitate real-time analytics such as identifying the most problematic systems or looking for recent errors/failures. Apache Solr and Storm can work together to execute such processing of big data in real time.
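As a hedged sketch of the combination (using the 2015-era backtype.storm API and SolrJ; the core name, tuple fields, and topology wiring are assumptions), a terminal bolt might index every incoming plant event into Solr so that it becomes searchable almost immediately:

import backtype.storm.task.TopologyContext;
import backtype.storm.topology.BasicOutputCollector;
import backtype.storm.topology.FailedException;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseBasicBolt;
import backtype.storm.tuple.Tuple;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

import java.util.Map;

/**
 * A bolt that pushes each incoming plant-sensor tuple into Solr,
 * making raw events searchable in near real time.
 */
public class SolrIndexingBolt extends BaseBasicBolt {

    // Not serializable, so created per worker in prepare(), not shipped.
    private transient HttpSolrServer solr;

    @Override
    public void prepare(Map stormConf, TopologyContext context) {
        // Assumed Solr core name and location.
        solr = new HttpSolrServer("http://localhost:8983/solr/plant_events");
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        SolrInputDocument doc = new SolrInputDocument();
        // Assumed tuple fields emitted by an upstream spout.
        doc.addField("id", tuple.getStringByField("eventId"));
        doc.addField("system_s", tuple.getStringByField("system"));
        doc.addField("message_t", tuple.getStringByField("message"));
        try {
            solr.add(doc, 1000); // commitWithin 1s keeps results near real time
        } catch (Exception e) {
            throw new FailedException(e); // let Storm fail and replay the tuple
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing is emitted downstream.
    }
}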

Apache Storm runs...

Advanced analytics with Solr


Apache Solr provides excellent search capabilities over metadata, and through integrations it is also possible to go beyond search and faceting. As the search industry grows into its next generation, the expectation that search will go beyond basic querying has led to software such as Apache Solr, which provides an excellent browsing and filtering experience along with basic analytical capabilities. For many organizations, however, this is not sufficient; they would like to bring the capabilities of business intelligence and analytics on top of their search engines. Today, it is possible to complement Apache Solr with such advanced analytical capabilities. We will look at enabling Solr integration with R.
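Before turning to R, it helps to see what Solr's built-in analytics look like in practice. The following SolrJ sketch, written against the Solr 4.x-era API (the core name sales and the region field are assumptions), uses faceting to compute per-region document counts, which is roughly the level of aggregation Solr offers out of the box:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class FacetAnalytics {
    public static void main(String[] args) throws SolrServerException {
        HttpSolrServer solr =
                new HttpSolrServer("http://localhost:8983/solr/sales");

        SolrQuery query = new SolrQuery("*:*");
        query.setRows(0);       // we only want the aggregates, not documents
        query.setFacet(true);
        query.addFacetField("region");

        QueryResponse response = solr.query(query);
        FacetField region = response.getFacetField("region");
        for (FacetField.Count bucket : region.getValues()) {
            System.out.println(bucket.getName() + ": " + bucket.getCount());
        }
        solr.shutdown();
    }
}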

R is an open source language and environment for statistical computing and graphics. More information about R can be found at http://www.r-project.org/. The development of R started in 1994 as an alternative...

Summary


In this chapter, we discussed different ways in which Apache Solr can be scaled to work with big data and large datasets. We looked at different Solr-big data implementations, such as Solr over HDFS, Katta, Solr-1045, Solr-1301, and Apache Solr with Cassandra. We also looked at advanced analytics by integrating Apache Solr with R. In the next chapter, we will focus on improving performance for big data.
