Learning Hunk

Product type: Book
Published: Dec 2015
Reading level: Intermediate
ISBN-13: 9781782174820
Edition: 1st

Authors (2):
Dmitry Anoshin

Dmitry Anoshin is a data-centric technologist and a recognized expert in building and implementing big data and analytics solutions. He has a successful track record of implementing business and digital intelligence projects in numerous industries, including retail, finance, marketing, and e-commerce. Dmitry possesses in-depth knowledge of digital/business intelligence, ETL, data warehousing, and big data technologies. He has extensive experience in the data integration process and is proficient in various data warehousing methodologies. Dmitry has consistently exceeded project expectations in the financial, machine tool, and retail industries. He has completed a number of multinational full BI/DI solution life cycle implementation projects. With expertise in data modeling, Dmitry also has a background and business experience in multiple relational databases, OLAP systems, and NoSQL databases. He is also an active speaker at data conferences and helps people adopt cloud analytics.

Sergey Sheypak

Sergey Sheypak started his big data practice in 2010 as a Teradata PS consultant. He led the Teradata Master Data Management deployment at Sberbank, Russia (which has 110 million customers). Later, Sergey switched to AsterData and Hadoop practices. Sergey joined the research and development team at MegaFon (one of the top three telecom companies in Russia, with 70 million customers) in 2012. While leading the Hadoop team at MegaFon, Sergey built ETL processes from the existing Oracle DWH to HDFS. Automated end-to-end tests and acceptance tests were introduced as a mandatory part of the Hadoop development process. Scoring and geospatial analysis systems based on telecom data were developed and launched. Sergey now works as an independent consultant in Sweden.


Chapter 2. Explore Hadoop Data with Hunk

Hadoop has become an enterprise standard for large organizations mining their data and implementing big data strategies, and its use at scale is fast becoming the norm for practical, results-driven data mining. However, extracting data from Hadoop in order to explore it and find business insights is a challenging task. Hadoop provides cheap storage for any kind of data but, unfortunately, it is inflexible for data analytics. Plenty of tools can add flexibility and interactivity to analytics tasks, but each comes with significant restrictions.

Hunk avoids the main drawbacks of these tools, offering rich functionality and interactivity for analytics.

In this chapter, you will learn how to deploy Hunk on top of Hadoop and start discovering what it can do. In addition, we will load data into Hadoop and explore it via Hunk using the Splunk Search Processing Language (SPL). Finally, we will learn...

Setting up Hunk


In order to start exploring Hadoop data, we have to install Hunk on top of our Hadoop cluster. Hunk is easy to install and configure. Let's learn how to deploy Hunk version 6.2.1 on top of an existing CDH cluster. It's assumed that your VM is up and running.
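Before installing anything, it's worth a quick sanity check that the Hadoop services on the VM are actually reachable. A minimal sketch, using standard Hadoop CLI commands:

    # Confirm the Hadoop client and HDFS respond before installing Hunk
    hadoop version    # prints the Hadoop/CDH build in use
    hdfs dfs -ls /    # lists the HDFS root; this fails if HDFS is down

If either command errors out, restart the cluster services before continuing.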

Extracting Hunk to a VM

  1. Open the console application.

  2. Run ls -la to list the files in your home directory and confirm the Hunk archive is there:

    [cloudera@quickstart ~]$ cd ~
    [cloudera@quickstart ~]$ ls -la | grep hunk
    -rw-r--r--   1 root     root     113913609 Mar 23 04:09 hunk-6.2.1-249325-Linux-x86_64.tgz
    
  3. Unpack the archive:

    cd /opt
    sudo tar xvzf /home/cloudera/hunk-6.2.1-249325-Linux-x86_64.tgz -C /opt
    
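Depending on how the archive was built, it may unpack directly into /opt/hunk or into a versioned directory. A quick check, with a hedged fix for the latter case:

    # Verify where the files landed
    ls -la /opt | grep -i hunk
    # If they ended up in a versioned directory instead of /opt/hunk,
    # a symlink keeps the SPLUNK_HOME path used below valid:
    # sudo ln -s /opt/hunk-6.2.1-249325-Linux-x86_64 /opt/hunk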

Setting up Hunk variables and configuration files

  1. It's time to set the SPLUNK_HOME environment variable. This variable has already been added to the profile:

    export SPLUNK_HOME=/opt/hunk
    
  2. Use the default splunk-launch.conf. This is the basic properties file used by the Hunk service. We don't have to change anything special, so let's use the default settings...
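With SPLUNK_HOME set and the default splunk-launch.conf in place, the service can be started with the standard Splunk binary. A minimal sketch; the --accept-license flag simply skips the interactive license prompt on first start:

    # First start of Hunk; the web UI listens on port 8000 by default
    $SPLUNK_HOME/bin/splunk start --accept-license
    # Then browse to http://localhost:8000 and log in
    # (on 6.x the default credentials are admin/changeme; you are
    # prompted to change them at first login)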

Exploring data


We are going to explore Apache web logs from an online store. These logs were produced by the Apache web server and uploaded to HDFS, and you'll see how to read them out of the box. The name of the store is unicorn fashion. Here is an example log line:

135.51.156.129 - - [02/Dec/2013:13:52:29] "POST /product.screen?productName=SHORTS&JSESSIONID=CA10MO9AZ5USANA4955 HTTP 1.1" 200 2334 "http://www.yahoo.com" "Opera/9.01 (Windows NT 5.1; U; en)" 167

It's a normal Apache combined access log. We can build reports, dashboards, and alerts on top of this data. You will:

  • Learn the basics of SPL to create queries

  • Learn visualization abilities

  • Drill down from the aggregated report to the underlying detailed data

  • Check the job details used to prepare report data

  • Create alerts and see a simple alert use-case

  • Create a dashboard presenting web analytics reports on a single page

  • Create a virtual index

You already know how to create a virtual index; we provide a screenshot with an index...
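To give a flavor of the SPL used throughout this chapter, here is a minimal query sketch that counts requests by HTTP status code over these logs. The virtual index name digital_analytics is an assumption for illustration; substitute the name of your own virtual index:

    index=digital_analytics sourcetype=access_combined
    | stats count by status
    | sort -count

stats and sort are core SPL commands; the same pattern extends to timechart when you want trends over time rather than totals.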

Controlling security with Hunk


What is security? We could work on data security forever; this part of the IT infrastructure is effectively endless. Companies are usually interested in the following aspects:

  • User/group/access control list-based access to data: Administrators should have something similar to read/write/execute permissions in Linux. We can set who owns the data and who can read or write it.

  • Audit/log access to data: We need to know who accessed the data, when, and how.

  • Isolation: We don't want our data to be publicly accessible. We would like, for example, to restrict cluster access to specific subnets.

We are not going to set up all the functionality required for production-ready security. Our aim is simple security that prevents unauthorized users from accessing the data, as sketched below.
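As a concrete example of the first point, HDFS already offers POSIX-style owner/group/other permissions on the directories Hunk reads. A sketch, assuming the logs live under the hypothetical path /data/weblogs and that an analysts group exists:

    # Give the data an owner and group, then lock out everyone else
    sudo -u hdfs hdfs dfs -chown -R cloudera:analysts /data/weblogs
    sudo -u hdfs hdfs dfs -chmod -R 750 /data/weblogs   # owner rwx, group r-x, others none
    hdfs dfs -ls /data    # verify the new owner and mode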

What is data security? It consists of three major parts:

  • Authentication

  • Authorization

  • Audit

These three properties tell us who did what, where, and with which privileges. Hadoop security setup is still...
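To make the three A's concrete on the Hunk side, here is a hedged sketch of a restricted role in $SPLUNK_HOME/etc/system/local/authorize.conf; the role and index names are illustrative assumptions:

    # authorize.conf: a role that may only search the web-log virtual index
    [role_weblog_analyst]
    importRoles = user
    srchIndexesAllowed = digital_analytics
    srchIndexesDefault = digital_analytics

Authentication itself is handled by Splunk's built-in user store (or LDAP), and user activity is recorded in $SPLUNK_HOME/var/log/splunk/audit.log.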

Summary


In this chapter, we created a Hadoop connector and a virtual index for CDR data. As a result, we got the chance to explore that data using the Hunk interface and the Splunk Search Processing Language. In addition, we learnt how to create reports, dashboards, and alerts. Finally, we considered the authentication approach in Hunk, which allows us to manage the security of Hunk users. In the next chapter, we will learn about the rich functionality of Hunk, such as data models, pivots, and various knowledge objects.
