Hunk architecture


Let's explore how Hunk looks. From the end user's perspective, Hunk looks like Splunk: you use the same interface, and you can write searches, visualize big data, and create reports, dashboards, and alerts. In other words, Hunk can do everything Splunk can do. In the following screenshot, you can see a schematic of Hunk's architecture:

Hunk has the same interface and command lines as well. The only difference is that Splunk works with data stored in its native indexes, whereas Hunk's SPL acts on external data; that is why these are called virtual indexes.

Connecting to Hadoop

Hunk is designed to connect to Hadoop via standard Hadoop interfaces. The following screenshot demonstrates that Hunk connects to a Hadoop cluster via the Hadoop client libraries and Java:

Moreover, Hunk can work with multiple Hadoop clusters:

In addition, you can use Splunk and Hunk together. If you have both Splunk Enterprise and Hadoop in your environment, you can connect them so that a single search head correlates searches across Hadoop (via Hunk) and Splunk Enterprise indexes.
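To make this more concrete, here is a minimal sketch of how such a connection is usually declared in Hunk's indexes.conf: a provider stanza that points at the local Java and Hadoop client libraries and at the cluster, and a virtual index that uses that provider. The stanza names, host names, and paths below are illustrative assumptions rather than values from a real environment, and the exact set of vix.* properties should be checked against the Hunk documentation for your version:

# indexes.conf (illustrative sketch; names, hosts, and paths are assumptions)
[provider:hadoop-cluster-1]
vix.family = hadoop
# Hunk reaches the cluster through the local Java and Hadoop client libraries
vix.env.JAVA_HOME = /usr/java/latest
vix.env.HADOOP_HOME = /opt/hadoop
# where the cluster lives
vix.fs.default.name = hdfs://namenode.example.com:8020
vix.mapreduce.framework.name = yarn
vix.yarn.resourcemanager.address = namenode.example.com:8032
# scratch space Hunk uses on HDFS for its packages and intermediate results
vix.splunk.home.hdfs = /user/hunk/workdir

[hadoop_weblogs]
# the virtual index simply points at a provider and at data already in HDFS
vix.provider = hadoop-cluster-1
vix.input.1.path = /data/weblogs/...

A second Hadoop cluster would simply get its own provider stanza, and each virtual index declares which provider it belongs to.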

Advanced Hunk deployment

Sometimes, organizations have really big data: Hadoop clusters with thousands of nodes. Getting business insight from such extremely large data is a real challenge. However, Hunk can handle this titanic task. Of course, it isn't as easy as it sounds, but it is possible because Hunk deployments can be scaled out. Let's look at the following example:

There are hundreds or thousands of users who put their business questions to big data. Business users send their queries; the queries go through the Load Balancer (LB), the LB sends them to Hunk, and Hunk distributes the work to Hadoop.

Native versus virtual indexes

Before we start to compare native and virtual indexes, let's use our previous Splunk experience and see how SPL actually works.

For example, we have a query:

index=main | stats count by status | rename count AS qty

As you may remember, the steps in a Splunk search are separated by pipes. You can read the expression from left to right and follow its execution sequence.

Tip

The design of SPL was inspired by Unix shell pipes.

In our example, we:

  1. Get all data from index=main.

  2. Count all rows for every status.

  3. Rename the count field as qty.

  4. Retrieve the final result.

Tip

It is interesting to know that SPL uses a MapReduce-style algorithm. In other words, it has a map phase when performing retrieve operations and a reduce step when performing count operations.

The rule is that the first search command is always responsible for retrieving events.
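To reinforce this, here is a slightly longer pipeline built on the same pattern; the sourcetype, the status filter, and the host field are made-up examples, not values from this book. The first command retrieves events from the index, and every command after a pipe only transforms what has already been retrieved:

index=main sourcetype=access_combined status>=500 | stats count by status, host | rename count AS qty | sort -qty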

Native indexes

Before Hunk was created, there were only the native indexes of Splunk Enterprise. The data was ingested by Splunk and access to it was via the Splunk interface.

A native index is basically a data store or collection of data. We can put web logs, syslogs, or other machine data in Splunk. We have access controls and the ability to give permissions to users to access data on specific indexes. In addition, Splunk gives us the opportunity to optimize popular and heavy searches. As a result, business users will get their dashboards very quickly.

Virtual index

Virtual indexes lack some features of native indexes. A virtual index is a data container with access controls, but Hunk can only read the data. Data gets into Hadoop by other means, and Hunk simply reads it through the virtual index. The inventors of Hunk decided not to build indexes on top of Hadoop data and not to optimize Hunk for needle-in-the-haystack searches. However, if the data layout in Hadoop is properly designed (for example, a hierarchical directory structure organized by timestamp: year, month, or day), this can really improve search performance.
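For instance, suppose web logs land in HDFS under date-partitioned directories such as /data/weblogs/2015/12/01. A virtual index can then tell Hunk how to derive a time range from the path itself, so a search over yesterday only touches yesterday's directories instead of the whole dataset. The following is a rough sketch; the vix.input.1.et/lt properties follow the pattern described in the Hunk documentation, but the paths, regex, and values here are assumptions that need to be adapted and verified for your deployment:

[hadoop_weblogs]
vix.provider = hadoop-cluster-1
vix.input.1.path = /data/weblogs/...
# derive the earliest event time of each directory from its name...
vix.input.1.et.regex = /data/weblogs/(\d{4})/(\d{2})/(\d{2})
vix.input.1.et.format = yyyyMMdd
# ...and the latest time as the same day plus 24 hours (86400 seconds),
# so directories outside the search time range can be pruned entirely
vix.input.1.lt.regex = /data/weblogs/(\d{4})/(\d{2})/(\d{2})
vix.input.1.lt.format = yyyyMMdd
vix.input.1.lt.offset = 86400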

Let's compare both indexes in one table:

Native Indexes                   | Virtual Indexes
Serve as data containers         | Serve as data containers
Access control                   | Access control
Reads/writes                     | Read only
Data retention policies          | N/A
Optimized for keyword search     | N/A
Optimized for time range search  | Available via regex/pruning

Tip

You can learn more about virtual indexes on the Splunk website: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Virtualindexes.

External result provider

The core technologies of Hunk are the virtual index and the External Result Provider (ERP). We have already encountered virtual indexes. An ERP is sometimes also referred to as a resource provider.

The ERP is basically a helper process. It goes out and deals with the details of the external system, whether that is Hadoop or another data store. In other words, it takes the searches that users perform in Hunk and translates or interprets them as MapReduce jobs. That is how it pushes computation down to the data.

There are a few other ERP implementations that Splunk's partners developed in order to integrate Hunk with MongoDB, Apache Accumulo, and Cassandra. These are just different implementations of the same interface, which helps Hunk interact with external systems and use any type of data via virtual indexes.

The following diagram demonstrates how ERP looks:

For each Hadoop cluster (or external system) the search process spawns an ERP process that is responsible for executing the (remote part of the) search on that system.
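In configuration terms, this means each external system is represented by its own provider, and each virtual index is attached to the provider it lives on; at search time, one ERP process is spawned per provider involved in the search. The sketch below is illustrative only: the Hadoop stanzas follow the pattern shown earlier, while the MongoDB provider is purely hypothetical and its family name is made up, since the real settings ship with the partner's ERP package:

[provider:hadoop-cluster-1]
vix.family = hadoop

[provider:hadoop-cluster-2]
vix.family = hadoop

# hypothetical third-party ERP; the family name and settings come with the partner package
[provider:mongo-prod]
vix.family = mongodb

[weblogs_cluster1]
vix.provider = hadoop-cluster-1

[weblogs_cluster2]
vix.provider = hadoop-cluster-2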

Tip

You can learn more about ERP on the Splunk web site: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Externalresultsproviders.

Computation models

Previously, we considered some challenges in big data analytics and saw how Hunk offers powerful solutions. Now we can go deeper in order to understand some of Hunk's core advantages. Let's start with an easy question: how do we provide interactivity?

There are at least two computational models.

Data streaming

In this approach, data moves from HDFS to the search head; in other words, data is processed in a streaming fashion. As a result, users can start to work with the data immediately, slicing and dicing or visualizing it as soon as the first bytes appear. But there is a problem with this approach: a huge volume of data has to be moved and processed.

The primary benefit you get is a very low response time. The drawback is low throughput, which is not very positive for us.

Data reporting

The second mode is moving computation to the data. The way to do this is to create and start a MapReduce job to do the processing, monitor the job, and finally collect, merge, and visualize the results. There is another problem here: late feedback, because the MapReduce job might take a long time. As a result, this approach has high latency and high throughput.

Mixed mode

Both modes have their pros and cons, but the most important properties are low latency, because it gives interactivity, and high throughput, because it lets us process larger datasets. Hunk's mixed mode takes the best from both computational models.

Let's visualize both modes in order to better understand how they work:

In addition, we consolidate all the modes in the following table, in order to make things clearer:

Streaming | Reporting | Mixed Mode
Pull data from HDFS to the search head for processing | Push compute down to the data and consume the results | Start both the streaming and reporting models; show streaming results until reporting starts to complete
Low latency | High latency | Low latency
Low throughput | High throughput | High throughput
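On the configuration side, mixed mode is typically controlled per provider. The sketch below shows the kind of settings involved; the property names and values here should be treated as assumptions and verified against the Hunk documentation for your version:

[provider:hadoop-cluster-1]
# run the streaming phase and the MapReduce (reporting) phase together,
# switching over to reporting results once they start to arrive
vix.splunk.search.mixedmode = 1
# cap how much data the streaming phase may pull back to the search head
vix.splunk.search.mixedmode.maxstream = 1GB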

Hunk security

With version 6.1, Hunk became more secure. By default, Hunk uses a superuser with full access. However, organizations very often want to apply a security model to their corporate data in order to keep it safe. Hunk can use pass-through authentication, which makes it possible to control which users can submit MapReduce jobs and which HDFS files they can access. In addition, it is possible to specify the queue that MapReduce jobs should use.

Pass-through authentication gives us the capability to make the Hunk superuser a proxy for any number of configured Hunk users. As a result, Hunk users can act as Hadoop users, owning the associated jobs, tasks, and files in Hadoop (and access to files in HDFS can be limited accordingly). Let's look at the following diagram:

Let's explore some common use cases that can help us understand how it works.

One Hunk user to one Hadoop user

Say we want our Hunk user to act as a Hadoop user associated with a specific queue or dataset. Then we just map the Hunk user to a specific user in Hadoop: for example, in Hunk the user name is Hemal, but in Hadoop it is HemalDesai and the queue is Books.
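The queue part of such a mapping can be sketched as follows. Hunk forwards vix.-prefixed properties into the Hadoop job configuration, so a provider dedicated to this user's searches can pin its MapReduce jobs to the Books queue; the user-name mapping itself (Hemal to HemalDesai) is configured through Hunk's pass-through authentication settings and is not shown here. Treat the property names as assumptions to verify against your Hadoop version (YARN versus MapReduce v1):

[provider:hadoop-hemal]
vix.family = hadoop
# forwarded to the Hadoop job configuration as mapreduce.job.queuename (YARN)
vix.mapreduce.job.queuename = Books
# older MapReduce v1 clusters use mapred.job.queue.name instead
vix.mapred.job.queue.name = Books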

Many Hunk users to one Hadoop user

For example, say we have many Hunk users and want them all to act as a single Hadoop user. In Hunk, we can have several users, such as Dmitry, Hemal, and Sergey, but in Hadoop they will all execute as the Executive user and will be assigned to the Books queue.

Hunk user(s) to the same Hadoop user with different queues

For example, say you have many Hunk users mapped to the same Hadoop user; it is still possible to assign them different queues.

Security will be discussed more closely in Chapter 3, Meet Hunk Features.
