Many organizations across the globe, in different sectors, have successfully adopted Apache Hadoop and Solr-based architectures to provide a unique browsing and searching experience over their rapidly growing and diversified information. Let's look at some of the interesting use cases where Big Data Search can be used.
E-commerce websites are meant to work for different types of users, who visit them for multiple reasons:
Visitors are looking for something specific, but they find it difficult to describe
Visitors are looking for a product with a specific price or specific features
Visitors come looking for good discounts, to see what's new, and so on
Visitors wish to compare multiple products on the basis of cost/features/reviews
Most e-commerce websites used to be built as custom-developed pages running on a SQL database. Although a database provides excellent capabilities to manage your data structurally, it does not provide the high-speed search and faceting that Solr does. In addition, it becomes difficult to keep such queries performing well; as the size of the data grows, it hampers the overall speed and user experience.
Apache Solr in a distributed scenario provides excellent offerings in terms of a browsing and searching experience. Solr can easily integrate with a database and provide high-speed search with real-time indexing. Advanced inbuilt features of Solr, such as search suggestions and a spell checker, can effectively help customers gain access to the merchandise they're looking for. Such an instance can easily be integrated with existing sites. Faceting can provide interesting filters based on the highest discounts on items, price range, types of merchandise, products from different companies, and so on, which in turn helps provide a unique shopping experience for end users. Many e-commerce companies, such as Rakuten.com, DollarDays, and Macy's, have adopted distributed Solr-based solutions, preferring them to traditional approaches, so as to provide customers with a better browsing experience.
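To make this concrete, here is a minimal SolrJ sketch of such a faceted product query, assuming a recent SolrJ client; the products collection and the price, manufacturer, and category fields are hypothetical, chosen purely for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.FacetField;
import org.apache.solr.client.solrj.response.QueryResponse;

public class ProductSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "products" collection on a local Solr instance
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/products").build();

        SolrQuery query = new SolrQuery("winter jacket");
        query.setFacet(true);                              // enable faceting
        query.addFacetField("manufacturer", "category");   // filters by company and type
        query.addNumericRangeFacet("price", 0, 500, 100);  // price-range buckets
        query.setFacetMinCount(1);

        QueryResponse response = client.query(query);
        System.out.println("Found: " + response.getResults().getNumFound());
        for (FacetField facet : response.getFacetFields()) {
            System.out.println(facet.getName());
            facet.getValues().forEach(count ->
                    System.out.println("  " + count.getName()
                            + " (" + count.getCount() + ")"));
        }
        client.close();
    }
}

The facet counts returned alongside the results are what drive the filter panels (price range, brand, merchandise type, and so on) that shoppers see next to their search results.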
Today, many banks in the world are moving towards computerization and using automation in business processes to save costs and improve efficiency. This move requires a bank to build various applications that can support complex banking use cases. These applications need to interact with each other over standardized communication protocols. A typical enterprise banking sector would consist of software for core banking applications, CMS, credit card management, B2B portals, treasury management, HRMS, ERP, CRM, business warehouses, accounting, BI tools, analytics, custom applications, and various other enterprise applications, all working together to ensure smooth business processes. Each of these applications works with sensitive data; hence, a good banking system landscape often provides a high-performance, highly available, and scalable architecture, along with backup and recovery features, bringing a completely diversified set of software together into a secured environment.
Most banks today offer web-based interactions; they not only automate their own business processes, but also access various third-party software from other banks and vendors. A dedicated team of administrators works 24/7 to monitor and handle issues/failures and escalations. A simple application that transfers money from your savings bank account to a loan account may touch at least twenty different applications. These systems generate terabytes of data every day, including transactional data, change logs, and so on.
The problem arises when any business workflow/transaction fails. With such a complex system, it becomes a big task for system administrators/managers to:
Find out the issue or the application that has caused the failure
Try to understand the issue and find out the root cause
Correlate the issue with other applications
Keep monitoring the workflow
When multiple applications are involved, log management across these applications becomes difficult. Some of the applications provide their own administration and monitoring capabilities. However, it makes sense to have a consolidated place where everything can be seen at a glance.
Log management is one of the standard problems where Big Data Search can effectively play a role. Apache Hadoop, along with Apache Solr, can provide a completely distributed environment to effectively manage the logs of multiple applications, while also providing search capabilities over them. Take a look at this representation of a sample log management application user interface:
This sample UI allows us to have a consolidated log management screen, which may also be transformed into a dashboard to show us the status and the log details. The following reasons explain why Apache Solr and Hadoop-based Big Data Search is the right solution for this problem:
The logs generated by any banking application are huge in size and continuous. Most log-based systems use rotational log management, which cleans up old logs. Given that Apache Hadoop can work on commodity hardware, the overall cost of storing these logs becomes cheap, and they can remain in Hadoop storage for a longer time.
Although Apache Solr is capable of handling any type of schema, common fields, such as log descriptions and levels, can be consolidated easily.
Apache Solr is fast, and its efficient searching capabilities can provide different interesting search features, such as highlighting the text or showing snippets of matched results. It also provides a faceted search to drill down and filter results, thereby providing a better browsing experience (see the query sketch after this list).
Apache Solr provides near real-time search capabilities to make the logs immediately searchable, so that administrators can see the latest high-severity logs right away.
Building the solution on Apache Hadoop with Solr provides a low-cost alternative infrastructure, which also supports high-speed batch processing of data.
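As a sketch of what an administrator's drill-down query might look like, the following SolrJ snippet combines a severity filter, faceting, highlighting, and newest-first sorting. The logs collection and its message, severity, application, and logtime fields are assumptions made for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrClient;
import org.apache.solr.client.solrj.response.QueryResponse;

public class LogSearch {
    public static void main(String[] args) throws Exception {
        // Hypothetical "logs" collection on a local Solr instance
        HttpSolrClient client = new HttpSolrClient.Builder(
                "http://localhost:8983/solr/logs").build();

        SolrQuery query = new SolrQuery("message:\"connection timeout\"");
        query.addFilterQuery("severity:ERROR");           // drill down to errors only
        query.setFacet(true);
        query.addFacetField("application", "severity");   // counts for dashboard filters
        query.setHighlight(true);
        query.addHighlightField("message");               // snippets of matched text
        query.addSort("logtime", SolrQuery.ORDER.desc);   // latest entries first

        QueryResponse response = client.query(query);
        System.out.println("Matching log entries: "
                + response.getResults().getNumFound());
        client.close();
    }
}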
The overall design, as shown in the following diagram, can have a schema that contains common attributes across all the log files, such as date and time of the log, severity, application name, user name, type of log, and so on. Other attributes can be added as dynamic text fields:
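To complement the diagram, here is a minimal sketch of one log entry modeled as a Solr document. The field names simply mirror the common attributes listed above, and the *_s/*_i suffixes assume the dynamic field rules found in Solr's default schema:

import java.util.Date;
import java.util.UUID;
import org.apache.solr.common.SolrInputDocument;

public class LogDocument {
    public static SolrInputDocument sample() {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("logtime", new Date());                 // date and time of the log
        doc.addField("severity", "ERROR");                   // severity level
        doc.addField("application", "core-banking");         // application name
        doc.addField("username", "batch-user");              // user name
        doc.addField("logtype", "transaction");              // type of log
        doc.addField("message", "Transfer failed: downstream timeout");
        // Application-specific attributes go into dynamic fields:
        doc.addField("branch_code_s", "MUM-042");
        doc.addField("retry_count_i", 3);
        return doc;
    }
}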
Since each system has a different log schema, these logs have to be parsed periodically and then uploaded to the distributed search. The Log Upload Utility, or agent, can be a custom script, or it can be based on Apache Kafka, Flume, or even RabbitMQ. Kafka is based on publish-subscribe messaging and provides high scalability; you can read more at http://blog.mmlac.com/log-transport-with-apache-kafka/ about how it can be used for log streaming. We need to write scripts/programs that understand the log schema and extract the field data from the logs. The Log Upload Utility can feed the outcome to the distributed search nodes, which are simply Solr instances running on a distributed system such as Hadoop. To achieve near real-time search, the Solr configuration has to be changed accordingly.
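A minimal sketch of such an upload utility is shown below; the log line format, the field names, and the 1,000 ms commitWithin window are illustrative assumptions (near real-time visibility can equally be configured server-side through soft commits in solrconfig.xml):

import java.util.UUID;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class LogUploadUtility {
    // Matches lines such as: 2015-06-01 12:00:00 [ERROR] TxnService - timeout
    private static final Pattern LOG_LINE =
            Pattern.compile("^(\\S+ \\S+) \\[(\\w+)\\] (\\S+) - (.*)$");

    public static void indexLine(SolrClient client, String appName, String line)
            throws Exception {
        Matcher m = LOG_LINE.matcher(line);
        if (!m.matches()) {
            return;                                   // skip unparseable lines
        }
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("logtime", m.group(1));
        doc.addField("severity", m.group(2));
        doc.addField("application", appName);
        doc.addField("logger_s", m.group(3));         // dynamic field for the logger
        doc.addField("message", m.group(4));
        // commitWithin of 1000 ms asks Solr to make the document searchable
        // within about a second -- the near real-time behaviour described above
        client.add(doc, 1000);
    }
}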
Indexing can be done either instantly, that is, right at the time of upload, or periodically as a batch operation. The second approach is more suitable if you have a consistent flow of log streams, or if you have schedule-based log uploading. Once the logs are uploaded to a staging folder, for example /stage, a batch index operation using Hadoop's MapReduce can generate HDFS-based Solr indexes, based on the many alternatives that we saw in Chapter 4, Big Data Search Using Hadoop and Its Ecosystem, and Chapter 5, Scaling Search Performance. The generated index can be read by Solr through a Solr-Hadoop connector, which does not use MapReduce capabilities while searching.
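For illustration only, the following simplified sketch stands in for that batch job: it reads staged files from /stage through Hadoop's FileSystem API and pushes documents to Solr in bulk from a single process. A production setup would instead use one of the MapReduce-based indexing alternatives covered in Chapter 4:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.solr.client.solrj.SolrClient;
import org.apache.solr.common.SolrInputDocument;

public class StageBatchIndexer {
    public static void indexStagedLogs(SolrClient client) throws Exception {
        Configuration conf = new Configuration();        // picks up fs.defaultFS
        FileSystem fs = FileSystem.get(conf);
        List<SolrInputDocument> batch = new ArrayList<>();
        for (FileStatus status : fs.listStatus(new Path("/stage"))) {
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(fs.open(status.getPath())))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    batch.add(parse(line));
                    if (batch.size() >= 1000) {          // send documents in bulk
                        client.add(batch);
                        batch.clear();
                    }
                }
            }
        }
        if (!batch.isEmpty()) {
            client.add(batch);
        }
        client.commit();                                 // make the batch searchable
    }

    // Placeholder parser; a real one would extract fields as in the
    // upload-utility sketch earlier
    private static SolrInputDocument parse(String line) {
        SolrInputDocument doc = new SolrInputDocument();
        doc.addField("id", UUID.randomUUID().toString());
        doc.addField("message", line);
        return doc;
    }
}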
Apache Blur is another alternative for indexing and searching on Hadoop using Lucene or Solr. Commercial vendors, such as Hortonworks and LucidWorks, provide a Solr-based integrated search on Hadoop (refer to http://hortonworks.com/hadoop-tutorial/searching-data-solr/).