Before getting started with Hunk, let's dive into the problem of modern big data analytics and highlight the main drawbacks and challenges. It is important to understand how Hunk works and what goes on under the hood. In addition, we'll compare Splunk and Enterprise in order to understand their differences and what they have in common as powerful and flexible products. Finally, we'll perform a lot of practical exercises through real-world use cases.
In this chapter you will learn:
What big data analytics is
Big data challenges and the disadvantages of modern big data analytics tools
Hunk's history
Hunk's architecture
How to set up Hadoop for Hunk
Real-world use cases
We are living in the century of information technology. There are a lot of electronic devices around us that generate lots of data. For example, you can surf the Internet, visit a couple of news portals, order new Nike Air Max shoes from a web store, write a couple of messages to your friends, and chat on Facebook. Every action produces data. And we can multiply the actions by the amount of people who have access to the Internet, or just use a mobile phone, and we get really big data. Of course, you have a question: how big is big data? It probably starts from terabytes or even petabytes now. The volume is not the only issue; we are also struggling with the variety of data. As a result, it is not enough to analyze just the data structure. We should explore unstructured data, such as machine data generated by various machines.
World-famous enterprises try to collect this extremely big data in order to monetize it and find business insights. Big data offers us new opportunities; for example, we can enrich customer data through social networks, using the APIs of Facebook or Twitter. We can build customer profiles and try to predict customer wishes in order to sell our product or improve the customer experience. It is easy to say, but difficult to do. However, organizations try to overcome these challenges and use big data stores, such as Hadoop.
Hadoop is a distributed file system and a distributed framework designed to compute large chunks of data. It is relatively easy to get data into Hadoop. There are plenty of tools for getting data into different formats, such as Apache Phoenix. However it is actually extremely difficult to get value out of the data you put into Hadoop.
Let's look at the path from data to value. First we have to start with collecting data. Then we also spend a lot of time preparing it, making sure that this data is available for analysis, and being able to question the data. This process is as follows:
Unfortunately, you may not have asked the right questions or the answers are not clear, and you have to repeat this cycle. Maybe you have transformed and formatted your data. In other words, it is a long and challenging process.
What you actually want is to collect the data and spend some time preparing it; then you can ask questions and get answers repetitively. Now, you can spend a lot of time asking multiple questions. In addition, you can iterate with data on those questions to refine the answers that you are looking for. Let's look at the following diagram, in order to find a new approach:
What if we could take Splunk and put it on top of all the data stored in Hadoop? This is what Splunk actually did. The following shows the names Hunk was derived from:
Let's discuss some goals that Hunk's inventors were thinking about when they were planning Hunk:
Splunk can take data from Hadoop via the Splunk Hadoop Connection App. However, it is a bad idea to copy massive amounts of data from Hadoop to Splunk, it is much better to process data in-place, because Hadoop provides both storage and computation; why not take advantage of both of them?
Splunk has the extremely powerful Splunk Processing Language (SPL) and it has a wide range of analytic functions. That's why it is a good idea to keep SPL in the new product.
Splunk has a true on-the-fly schema. Data that we store in Hadoop changes constantly. So, Hunk has to be able build a schema on-the-fly independently of the data format.
It is a very good idea to provide the ability to make a preview. As you know, when searching you can get incremental results. It can dramatically reduce outage. For example, we don't want to wait till a map reduce job finishes; we can look at the incremental result and, in the event of a wrong result, we can restart the search query.
Deployment of Hadoop is not easy, and Splunk tries to make the installation and configuration of Hunk easy for us.
Let's discuss more closely the reasons for supporting SPL. You are probably familiar with Splunk and SPL and know how powerful and flexible this language is. These are some of the advantages of SPL:
Naturally suitable for
MapReduce
Reduces adoption time for people who are already familiar with Splunk
There are some challenges in integrating SPL and Hadoop. Hadoop is written in Java but all SPL code is in C++. Does SPL need to convert to Java or reuse what Splunk has provided? Finally, it was decided to reuse C++ code entirely.
No one likes to look at a blank screen. A lot of people using other tools such as Pig or Hive have to wait until the query is finished and you have no idea what the query is actually retrieving for you. Maybe you made a mistake, but you didn't know about it; you will have to wait till the job is completed. It is a kind of frustration—running queries and waiting for hours.
That's why the Hunk team gave their users the ability to preview the result. You will be able to play with this in future chapters.
Before going deeper into Hunk, let's clarify what Hunk does not do:
Hunk does not replace your Hadoop distribution
Hunk does not replace or require Splunk Enterprise
Interactive but no real-time or needle in the haystack searches
No data ingest management
No Hadoop operation management
Hunk is a full-featured platform for rapidly exploring, analyzing, and visualizing data in Hadoop and NoSQL data stores. Based on years of experience building big data products deployed to thousands of Splunk customers, Hunk drives dramatic improvements in the speed and simplicity of getting insights from raw, unstructured, or polystructured big data—all without building fixed schemas or moving data to a separate in-memory store. Hunk delivers proven value for security, risk management, product analytics, a 360-degree customer view, and the Internet of Things.
While many big data initiatives take months and have high rates of failure, Hunk offers a unique approach. Hunk provides a single, fluid user experience designed to drive rapid insights from your big data. Hunk empowers self-service analytics for anyone in your department or organization to quickly and easily unlock actionable insights from raw big data, wherever it may reside.
These are the main capabilities of Hunk:
Full-featured, integrated analytics
Fast to deploy and drive value
Interactive search
Supported data formats
Report acceleration
Results preview
Drag-and-drop analytics
Rich developer environment
Custom dashboards and views
Secure access
Hunk apps
Hunk on the AWS cloud
Let's compare Splunk and Hunk:
Features |
Splunk Enterprise |
Hunk |
---|---|---|
Indexing |
Native |
Virtual |
Where data is stored and read |
Splunk Buckets on Local or SAN Disks |
Any Hadoop-compatible file system (HDFS, MapR, Amazon S3) and NoSQL, or other data stores via streaming resource libraries |
A 60-day free trial license |
500 MB/day |
Unlimited |
Pricing model |
Data invested per day |
Number of task trackers (compute nodes in YARN) |
Real-time streaming events |
+ |
+ |
Data model |
+ |
+ |
Pivot |
+ |
+ |
Rich developer environment |
+ |
+ |
Event breaking, timestamp extraction, source typing |
+ |
+ |
Rare term search |
Index time |
Search time |
Report acceleration |
Fast: Uses index and bloom filters |
Slow: Requires full data scan within partitions |
Access control and single sign-on |
+ |
+ |
Universal forwarder |
+ |
NA |
Forwarder management |
+ |
NA |
Splunk apps |
+ |
Limited |
Premium apps |
+ |
N/A |
Standard support |
+ |
+ |
Enterprise support |
+ |
+ |
In the preceding table + means product support mentioned feature and NA means feature is not supported in a product.
As you saw, there are some differences. But Hunk is designed for another purpose; it is a kind face in the complex world of big data. Throughout this book, we will introduce the various features of Hunk and you will definitely learn this amazing tool.
Let's look closely at Hunk and try to understand how it works.
Let's explore how Hunk looks. From the end user perspective, Hunk looks like Splunk. You use same interface, you can write searches, visualize big data, and create reports, dashboards, and alerts. In other words, Hunk can do everything Splunk can do. In the following screenshot, you can see a schematic of Hunk's architecture:
Hunk has the same interface and command lines as well. The only change is that Splunk works with data stored in native indexes but the Hunk SPL acts with external data; that's why they call virtual indexes.
Hunk is designed to connect to Hadoop via the Hadoop interface. The following screenshot demonstrates that Hunk can connect a Hadoop cluster via Hadoop client libraries and Java:
Moreover, Hunk can work with multiple Hadoop clusters:
In addition, you can use Splunk and Hunk together. You can connect Splunk Enterprise if you have Splunk and Hadoop in your environment. As a result, it is possible to correlate Hunk searches through Hadoop and Splunk Enterprise via the same search head.
Sometimes, organizations have really big data. They have thousands of instances of Hadoop. It is a real challenge to get business insight from this extremely large data. However, Hunk can easily handle this titanic task. Of course this isn't as easy as it sounds, but it is possible because you can scale Hunk deployments. Let's look at the following example:
There are hundreds or thousands of users who put their business questions to big data. business users send their queries, they go through the Load Balancer (LB), LB sends them to Hunk, and Hunk makes distributive work to Hadoop.
Before we start to compare native and virtual indexes, let's use our previous Splunk experience and see how SPL actually works.
For example, we have a query:
Index=main | stats count by status | rename count AS qty
As you may remember, every step in Splunk is divided by pipes. You can read expression from left to right and follow expression execution sequence.
In our example, we:
Get all data from
Index=main
.Count all rows for every status.
Rename the
count
field asqty
.Retrieve the final result.
Tip
It is interesting to know that SPL uses a MapReduce
algorithm. In other words, it has a map phase when performing retrieve operations and reducing step, and when performing count operations.
The rule is that the first search command is always responsible for retrieving events.
Before Hunk was created, there were only the native indexes of Splunk Enterprise. The data was ingested by Splunk and access to it was via the Splunk interface.
A native index is basically a data store or collection of data. We can put web logs, syslogs, or other machine data in Splunk. We have access controls and the ability to give permissions to users to access data on specific indexes. In addition, Splunk gives us the opportunity to optimize popular and heavy searches. As a result, business users will get their dashboards very quickly.
Virtual indexes lack some features of native indexes. A virtual index is a data container with access controls. Hunk can only read data. Data gets into Hadoop somehow and Hunk can use this data as a container. The inventors of Hunk decided to not build indexes on top of Hadoop data and to not optimize Hunk to perform needle-in-the-haystack searches. However, if data layout is properly designed in Hadoop (for example, there is a hierarchical structure or data is organized based on the timestamp, year, month, or date), this can really improve search performance.
Let's compare both indexes in one table:
Native Indexes |
Virtual Indexes |
---|---|
Serve as data containers |
Serve as data containers |
Access control |
Access control |
Reads/writes |
Read only |
Data retention policies |
N/A |
Optimized for keyword search |
N/A |
Optimized for time range search |
Available via regex/pruning |
Tip
You can learn more about virtual indexes on the Splunk website: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Virtualindexes.
The core technology of Hunk is a virtual index and External Result Provider (ERP). We have already encountered virtual indexes. The term ERP is sometime known as resource provider.
The ERP is basically a helper process. It goes out and deals with the details of the external systems that are going to interact with Hadoop or another data store. In other words, it takes searches that users perform in Hunk and somehow translates or interprets them in mrjob
. That's how it pushes computation.
There are a few other implementations of ERP that Splunk's partners developed in order to integrate Hunk with Mongo DB, Apache Accumulo, and Cassandra. There are just different implementations of the same interface that helps Hunk to interact with external systems and use any type of data via virtual indexes.
The following diagram demonstrates how ERP looks:
For each Hadoop cluster (or external system) the search process spawns an ERP process that is responsible for executing the (remote part of the) search on that system.
Tip
You can learn more about ERP on the Splunk web site: http://docs.splunk.com/Documentation/Hunk/latest/Hunk/Externalresultsproviders.
Previously, we considered some challenges in big data analytics and found out powerful solutions via Hunk. Now we can go deeper in order to understand some of the core advantages of Hunk. Let's start with an easy question: how do we provide interactivity?
There are at least two computational models.
In this approach, data moves from HDFS to the search head. In other words, data is processed in a streaming fashion. As a result users can immediately start to work with data, slice and dice it, or visualize when the first bytes of data will start to appear. But there is a problem with this process. It is a huge volume of data to move and process.
There is one primary benefit that you will probably get; there is a very low response time. In addition, we get low throughput that is not very positive for us.
The second mode is moving computation to data. The way to do this is to create and start a MapReduce
job to do the processing, monitor the MapReduce
job, and, finally, collect the results. Then, merge the results and visualize the data. There is another problem here—late feedback, because the MapReduce
job might take a long time. As a result, this approach has high latency and high throughput.
Both modes have their pros and cons, but the most important are low latency, because it gives interactivity, and high throughput, because it gives us the opportunity to process larger datasets. These are all benefits and Hunk takes the best from both computational modes.
Let's visualize both modes in order to better understand how they work:
In addition, we consolidate all the modes in the following table, in order to make things clearer:
Streaming |
Reporting |
Mixed Mode |
---|---|---|
Pull data from HDFS to search head for processing |
Push compute down to data and consume results |
Start both streaming and reporting models. Show streaming results until reporting starts to complete |
Low latency |
High latency |
Low latency |
Low throughput |
High throughput |
High throughput |
With version 6.1, Hunk became more secure. By default Hunk has superusers with full access. However, very often organizations want to apply a security model to their corporate data, in order to keep their data safe. Hunk can use pass-through authentication. It gives the opportunity to control how MapReduce
jobs can submit users and what HDFS files they can access. In addition, it is possible to specify the queue MapReduce
jobs should use.
Pass-through authentication gives us the capability to make the Hunk superuser a proxy for any number of configured Hunk users. As a result, Hunk users can act as Hadoop users to own the associated jobs, tasks, and files in Hadoop (and it can limit access to files in HDFS.). Let's look at the following diagram:
Let's explore some common use cases that can help us understand how it works.
For example, say we want our Hunk user to act as a Hadoop user associated with a specific queue or data set. Then we just map the Hunk user to a specific user in Hadoop. For example, in Hunk the user name is Hemal
, but in Hadoop it is HemalDesai
and the queue is Books
.
For example, say we have many Hunk users and want them to act as a Hadoop user. In Hunk, we can have several users such as Dmitry
, Hemal
, and Sergey
, but in Hadoop they will all execute as an Executive
user and will be assigned to the Books
queue.
For example, say you have many Hunk users and the same Hadoop users; it is possible to assign them different queues.
Security will be discussed more closely in Chapter 3, Meet Hunk Features.
Before starting to play with Hadoop and Hunk, we are going to download and run a VM. You'll get a short description on how to get everything up and running and put in some data for processing later.
We have decided to take the default Cloudera CDH 5.3.1 VM from the Cloudera site and fine-tune it for our needs. Please open this link to prepare a VM: http://www.bigdatapath.com/2015/08/learning-hunk-links-to-vm-with-all-stuff-you-need/.
This post may have been be updated by the time you're reading this book.
You can run the terminal application by clicking the special icon on the top bar:
Your user is cloudera
. sudo
is passwordless:
[cloudera@quickstart ~]$ whoami cloudera [cloudera@quickstart ~]$ sudo su [root@quickstart cloudera]# whoami root [root@quickstart cloudera]#
MySQL is used as an example of the data ingestion process. The user name is dwhuser
, the password is dwhuser
. You can get root access by using the root
username and the cloudera
password:
[cloudera@quickstart ~]$ mysql -u root -p mysql> show databases; +--------------------+ | Database | +--------------------+ | information_schema | | cdrdb | | cm | | firehose | | hue | | metastore | | mysql | | oozie | | retail_db | | sentry | +--------------------+ 10 rows in set (0.00 sec)
We import data from MySQL to Hadoop from the database named cdrdb
. There are some other databases. They are used by Cloudera Manager services and Hadoop features such as Hive Metastore, Oozie, and so on.
Tip
Hive Metastore is a service designed to centralize metadata management. It's a kind of Teradata DBC.Table
, DBC.Columns
, or IBM DB2 syscat.Columns
, syscat.Tables
. The idea is to create a strict schema description over the bytes stored in Hadoop and then get access to this data using SQL.
Oozie is a kind of Hadoop CRON without a Single Point of Failure (SPOF). Think it through; is it easy to create a distributed reliable CRON with failover functionality? Oozie uses RDBMS to persist metadata about planned, running, and finished tasks. This VM doesn't provide an Oozie HA configuration.
Install VirtualBox: https://www.virtualbox.org/.
Follow the link and download the VM ZIP.
Extract the ZIP to your local drive.
Open VirtualBox.
Go to menu Machine | Add and point to the extracted VBOX file.
You should see the imported VM:
We are going to use several data sources for our applications. The first will use RDBMS interaction since it's one of the most popular use cases. We will show you how to integrate classical data with a strict schema from the RDBMS small dictionary stored in HDFS. Data from RDBMS has big volume in real life (we just have a small subset of it now) and the dictionary keeps the dimensions for the RDBMS value. We will enrich the big data with the dimensions before displaying data on the map.
There are many ways to import data into Hadoop; we could easily publish a book called "1,001 methods to put your data into Hadoop." We are not going to focus on these specialties and will use very simple cases. Why so simple? Because you will meet many problems in a production environment and we can't cover these in a book.
An example from real life: you will definitely need to import data from your existing DWH into Hadoop. And you will have to use Sqoop in conjunction with special Teradata/Oracle connectors to do it quickly without DWH performance penalties. You will spend some time tuning your DB storage schema and connection properties to achieve a reasonable result. That is why we decided to keep all this tricky stuff out of the book; our goal is to use Hunk on top of Hadoop.
Here is a short explanation of the import process. We've split the diagram into three parts:
MySQL, a database that stores data
Oozie, responsible for triggering the job import process
Hadoop, responsible for getting data from MySQL DB and storing it in an HDFS catalog
Initially, the data is stored in MySQL DB. We want to copy the data to HDFS for later processing using Hunk. We will work with the telco dataset in Chapter 6, Discovering Hunk Integration Apps, related to custom application development. We write an Oozie coordinator, started by the Oozie server each night. Oozie is a kind of Hadoop CRON and has some additional features that help us work with data. Oozie can do many useful things, but right now we are using its basic functionality: just running the workflow each day. The coordinator code is here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/import-milano-cdr-coord.xml.
Next is the workflow. The coordinator is responsible for scheduling the workflow. The workflow is responsible for doing business logic. The workflow code is here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/workflows/import-milano-cdr-wflow.xml.
The workflow has one Sqoop action.
Next is the Sqoop action. This action declares the way the job should read data from RDBMS and store it to HDFS.
The third section is the MapReduce
job that reads data from RDBMS and writes it to HDFS. The Sqoop action internally runs a MapReduce
job that is responsible for getting table rows out of MySQL. The whole process sounds pretty complex, but you don't have to worry. We've created the code to implement it.
We are going to use several open datasets from https://dandelion.eu. One weekly dataset was uploaded to MySQL and contains information about the telecommunication activity in the city of Milano. Later, you will use an Oozie coordinator with the Sqoop action to create a daily partitioned dataset.
The source dataset is: https://dandelion.eu/datagems/SpazioDati/telecom-sms-call-internet-mi/resource/ and the grid map is: https://dandelion.eu/datagems/SpazioDati/milano-grid/resource/.
Milano is divided into equal squares. Each square has a unique ID and four longitude and latitude coordinates.
A mapping between the logical square mesh and spatial area would be helpful for us during geospatial analysis. We will demonstrate how Hunk can deal with geospatial visualizations out-of-the-box.
We've prepared an Oozie coordinator to import data from MySQL to HDFS. Generally, it looks like a production-ready process. Real-life processes are organized in pretty much the same way. The following describes the idea behind the import process:
We have the potentially huge table 2013_12_milan_cdr with time-series data. We are not going to import the whole table in one go; we will partition data using a time field named time_interval.
The idea is to split data by equal time periods and import it to Hadoop. It's just a projection of the RDBMS partitioning/sharing techniques to Hadoop. You'll see seven folders named from /masterdata/stream/2013/12/01
to 07
.
You can get the workflow code here: https://github.com/seregasheypak/learning-hunk/blob/master/import-milano-cdr/src/main/resources/oozie/workflows/import-milano-cdr-wflow.xml.
The general idea is to:
Run the workflow each day
Import the data for the whole day
We've applied a dummy MySQL date function; in real life, you would use the OraOop connector, Teradata connector, or some other tricks to play with the partition properties.
To run the import process, you have to open the console application inside a VM window, go to the catalog with the coordinator configuration, and submit it to Oozie:
cd /home/devops/oozie-configs sudo -u hdfs oozie job -oozie http://localhost/oozie -config import-milano-cdr-coord.properties -run
The console output will be:
job: 0000000-150302102722384-oozie-oozi-C
Where 0000000-15030102722384-oozie-oozi-C
is the unique ID of the running coordinator. We can visit Hue and watch the progress of the import process. Visit this link: http://localhost:8888/oozie/list_oozie_coordinators/
and http://vm-cluster-node3.localdomain:8888/oozie/list_oozie_coordinators/
:
Here is the running coordinator. It took two minutes to import one day. There are seven days in total. We used a powerful PC (32 GB memory and an 8-core AMD CPU) to accomplish this task.
Tip
Downloading the example code
You can download the example code from http://www.bigdatapath.com/wp-content/uploads/2015/05/learning-hunk-05-with-mongo.zip.
The following screenshot shows how the successful result should look:
We can see that the coordinator did produce seven actions for each day starting from December 1st till December 7th.
You can use the open console application to execute this command:
hadoop fs -du -h /masterdata/stream/milano_cdr/2013/12
The output should be:
74.1 M 222.4 M /masterdata/stream/milano_cdr/2013/12/01 97.0 M 291.1 M /masterdata/stream/milano_cdr/2013/12/02 100.4 M 301.3 M /masterdata/stream/milano_cdr/2013/12/03 100.4 M 301.1 M /masterdata/stream/milano_cdr/2013/12/04 100.6 M 301.8 M /masterdata/stream/milano_cdr/2013/12/05 100.6 M 301.7 M /masterdata/stream/milano_cdr/2013/12/06 89.2 M 267.6 M /masterdata/stream/milano_cdr/2013/12/07
The target format is Avro with snappy compression. We'll see later how Hunk works with popular storage formats and compression codecs. Avro is a reasonable choice: it has wide support across Hadoop tools and has a schema.
It's possible to skip the import process; you can move data to the target destination using a command. Open the console application and execute:
sudo -u hdfs hadoop fs -mkdir -p /masterdata/stream/milano_cdr/2013 sudo -u hdfs hadoop fs -mv /backup/milano_cdr/2013/12/masterdata/stream/milano_cdr/2013/
In this chapter we met big data analytics and discovered challenges that can be overcome with Hunk. In addition, we learned about Hunk's history and its internal modes. Moreover, we explored Hunk's architecture and learned about virtual indexes and ERPs. In addition, we touched on some topics related to security. Finally, we started to prepare for technical exercises by setting up Hadoop and discovering real-world use cases.
In the next chapter, we will start to explore big data in Hadoop. We will work with SPL and create amazing visualizations.