Many organizations across the globe, in different sectors, have successfully adopted Apache Hadoop and Solr-based architectures to provide a unique browsing and searching experience over their rapidly growing and diversified information. Let's look at some of the interesting use cases where Big Data Search can be used.
E-commerce websites are meant to work for different types of users, who visit them for multiple reasons:
Visitors are looking for something specific, but they find it difficult to describe
Visitors are looking for a product with a specific price or specific features
Visitors come looking for good discounts, to see what's new, and so on
Visitors wish to compare multiple products on the basis of cost/features/reviews
Most e-commerce websites used to be built as custom-developed pages running on a SQL database. Although a database provides excellent capabilities to manage your data structurally, it does not provide the high-speed search and faceting that Solr does. In addition, it becomes difficult to keep such queries performing well; as the size of the data grows, it hampers the overall speed and user experience.
Apache Solr in a distributed scenario provides excellent offerings in terms of a browsing and searching experience. Solr can easily integrate with a database and provide high-speed search with real-time indexing. Advanced inbuilt features of Solr, such as search suggestions and a spell checker, can effectively help customers gain access to the merchandise they're looking for. Such an instance can easily be integrated with existing sites. Faceting can provide interesting filters based on the highest discounts on items, price range, types of merchandise, products from different companies, and so on, which in turn helps provide a unique shopping experience for end users. Many e-commerce companies, such as Rakuten.com, DollarDays, and Macy's, have adopted distributed Solr-based solutions, preferring them to traditional approaches, so as to provide customers with a better browsing experience.
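To make this concrete, here is a minimal SolrJ sketch of such a faceted product query, assuming a recent SolrJ client; the products collection and the price, manufacturer, and category fields are hypothetical, chosen purely for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ProductSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "products" collection on a local Solr instance
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build();

        SolrQuery query = new SolrQuery("winter jacket");
        query.setFacet(true);                              // enable faceting
        query.addFacetField("manufacturer", "category");   // filters by company and type
        query.addNumericRangeFacet("price", 0, 500, 100);  // price-range buckets
        query.setFacetMinCount(1);

        QueryResponse response = client.query(query);
        System.out.println("Found: " + response.getResults().getNumFound());
        for (FacetField facet : response.getFacetFields()) {
            System.out.println(facet.getName());
            facet.getValues().forEach(count ->
                    System.out.println("  " + count.getName()
                            + " (" + count.getCount() + ")"));
        }
        client.close();
    }
}

The facet counts returned alongside the results are what drive the filter panels (price range, brand, merchandise type, and so on) that shoppers see next to their search results.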
Today, many banks in the world are moving towards computerization and using automation in business processes to save costs and improve efficiency. This move requires a bank to build various applications that can support complex banking use cases. These applications need to interact with each other over standardized communication protocols. A typical enterprise banking sector would consist of software for core banking applications, CMS, credit card management, B2B portals, treasury management, HRMS, ERP, CRM, business warehouses, accounting, BI tools, analytics, custom applications, and various other enterprise applications, all working together to ensure smooth business processes. Each of these applications works with sensitive data; hence, a good banking system landscape often provides a high-performance, highly available, and scalable architecture, along with backup and recovery features, bringing a completely diversified set of software together into a secured environment.
Most banks today offer web-based interactions; they not only automate their own business processes, but also access various third-party software from other banks and vendors. A dedicated team of administrators works 24/7 to monitor and handle issues/failures and escalations. A simple application that transfers money from your savings bank account to a loan account may touch at least twenty different applications. These systems generate terabytes of data every day, including transactional data, change logs, and so on.
The problem arises when any business workflow/transaction fails. With such a complex system, it becomes a big task for system administrators/managers to:
Find out the issue or the application that has caused the failure
Try to understand the issue and find out the root cause
Correlate the issue with other applications
Keep monitoring the workflow
When multiple applications are involved, log management across these applications becomes difficult. Some of the applications provide their own administration and monitoring capabilities. However, it makes sense to have a consolidated place where everything can be seen at a glance.
Log management is one of the standard problems where Big Data Search can effectively play a role. Apache Hadoop, along with Apache Solr, can provide a completely distributed environment to effectively manage the logs of multiple applications, while also providing search capabilities over them. Take a look at this representation of a sample log management application user interface:
This sample UI allows us to have a consolidated log management screen, which may also be transformed into a dashboard to show us the status and the log details. The following reasons explain why Apache Solr and Hadoop-based Big Data Search is the right solution for this problem:
The logs generated by any banking application are huge in size and continuous. Most log-based systems use rotational log management, which cleans up old logs. Given that Apache Hadoop can work on commodity hardware, the overall cost of storing these logs becomes cheap, and they can remain in Hadoop storage for a longer time.
Although Apache Solr is capable of handling any type of schema, common fields, such as log descriptions and levels, can be consolidated easily.
Apache Solr is fast, and its efficient searching capabilities can provide different interesting search features, such as highlighting the text or showing snippets of matched results. It also provides a faceted search to drill down and filter results, thereby providing a better browsing experience (see the query sketch after this list).
Apache Solr provides near real-time search capabilities to make the logs immediately searchable, so that administrators can see the latest high-severity logs right away.
Building the solution on Apache Hadoop with Solr provides a low-cost alternative infrastructure, which also supports high-speed batch processing of data.
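As a sketch of what an administrator's drill-down query might look like, the following SolrJ snippet combines a severity filter, faceting, highlighting, and newest-first sorting. The logs collection and its message, severity, application, and logtime fields are assumptions made for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LogSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "logs" collection on a local Solr instance
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/logs").build();

        SolrQuery query = new SolrQuery("message:\"connection timeout\"");
        query.addFilterQuery("severity:ERROR");           // drill down to errors only
        query.setFacet(true);
        query.addFacetField("application", "severity");   // counts for dashboard filters
        query.setHighlight(true);
        query.addHighlightField("message");               // snippets of matched text
        query.addSort("logtime", SolrQuery.ORDER.desc);   // latest entries first

        QueryResponse response = client.query(query);
        System.out.println("Matching log entries: "
                + response.getResults().getNumFound());
        client.close();
    }
}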
The overall design, as shown in the following diagram, can have a schema that contains common attributes across all the log files, such as date and time of the log, severity, application name, user name, type of log, and so on. Other attributes can be added as dynamic text fields:
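To complement the diagram, here is a minimal sketch of one log entry modeled as a Solr document. The field names simply mirror the common attributes listed above, and the *_s/*_i suffixes assume the dynamic field rules found in Solr's default schema:

import java.util.Date;
import java.util.UUID;
import org.apache.solr.common.SolrInputDocument;

public class LogDocument {
    public static SolrInputDocument sample() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("logtime", new Date());                 // date and time of the log
        doc.addField("severity", "ERROR");                   // severity level
        doc.addField("application", "core-banking");         // application name
        doc.addField("username", "batch-user");              // user name
        doc.addField("logtype", "transaction");              // type of log
        doc.addField("message", "Transfer failed: downstream timeout");
        // Application-specific attributes go into dynamic fields:
        doc.addField("branch_code_s", "MUM-042");
        doc.addField("retry_count_i", 3);
        return doc;
    }
}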
Since each system has a different log schema, these logs have to be parsed periodically and then uploaded to the distributed search. The Log Upload Utility, or agent, can be a custom script, or it can be based on Apache Kafka, Flume, or even RabbitMQ. Kafka is based on publish-subscribe messaging and provides high scalability; you can read more at http://blog.mmlac.com/log-transport-with-apache-kafka/ about how it can be used for log streaming. We need to write scripts/programs that understand the log schema and extract the field data from the logs. The Log Upload Utility can feed the outcome to the distributed search nodes, which are simply Solr instances running on a distributed system such as Hadoop. To achieve near real-time search, the Solr configuration has to be changed accordingly.
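A minimal sketch of such an upload utility is shown below; the log line format, the field names, and the 1,000 ms commitWithin window are illustrative assumptions (near real-time visibility can equally be configured server-side through soft commits in solrconfig.xml):

import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LogUploadUtility {
    // Matches lines such as: 2015-06-01 12:00:00 [ERROR] TxnService - timeout
    private static final Pattern LOG_LINE =
            Pattern.compile("^(\\S+ \\S+) \\[(\\w+)\\] (\\S+) - (.*)$");

    public static void indexLine(SolrClient client, String appName, String line)
            throws Exception {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return;                                   // skip unparseable lines
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("logtime", m.group(1));
        doc.addField("severity", m.group(2));
        doc.addField("application", appName);
        doc.addField("logger_s", m.group(3));         // dynamic field for the logger
        doc.addField("message", m.group(4));
        // commitWithin of 1000 ms asks Solr to make the document searchable
        // within about a second -- the near real-time behaviour described above
        client.add(doc, 1000);
    }
}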
Indexing can be done either instantly, that is, right at the time of upload, or periodically as a batch operation. The second approach is more suitable if you have a consistent flow of log streams, or if you have schedule-based log uploading. Once the logs are uploaded to a staging folder, for example /stage, a batch index operation using Hadoop's MapReduce can generate HDFS-based Solr indexes, based on the many alternatives that we saw in Chapter 4, Big Data Search Using Hadoop and Its Ecosystem, and Chapter 5, Scaling Search Performance. The generated index can be read by Solr through a Solr-Hadoop connector, which does not use MapReduce capabilities while searching.
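For illustration only, the following simplified sketch stands in for that batch job: it reads staged files from /stage through Hadoop's FileSystem API and pushes documents to Solr in bulk from a single process. A production setup would instead use one of the MapReduce-based indexing alternatives covered in Chapter 4:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class StageBatchIndexer {
    public static void indexStagedLogs(SolrClient client) throws Exception {
        Configuration conf = new Configuration();        // picks up fs.defaultFS
        FileSystem fs = FileSystem.get(conf);
        List<SolrInputDocument> batch = new ArrayList<>();
        for (FileStatus status : fs.listStatus(new Path("/stage"))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(parse(line));
                    if (batch.size() >= 1000) {          // send documents in bulk
                        client.add(batch);
                        batch.clear();
                    }
                }
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.commit();                                 // make the batch searchable
    }

    // Placeholder parser; a real one would extract fields as in the
    // upload-utility sketch earlier
    private static SolrInputDocument parse(String line) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("message", line);
        return doc;
    }
}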
Apache Blur is another alternative for indexing and searching on Hadoop using Lucene or Solr. Commercial vendors, such as Hortonworks and LucidWorks, provide a Solr-based integrated search on Hadoop (refer to http://hortonworks.com/hadoop-tutorial/searching-data-solr/).