Learning Hunk
1st Edition | Published Dec 2015 | Packt Publishing | ISBN-13: 9781782174820
Authors (2):
Dmitry Anoshin

Dmitry Anoshin is a data-centric technologist and a recognized expert in building and implementing big data and analytics solutions. He has a successful track record of implementing business and digital intelligence projects in numerous industries, including retail, finance, marketing, and e-commerce. Dmitry possesses in-depth knowledge of digital/business intelligence, ETL, data warehousing, and big data technologies. He has extensive experience in the data integration process and is proficient in various data warehousing methodologies. Dmitry has consistently exceeded project expectations while working in the financial, machine tool, and retail industries. He has completed a number of multinational full BI/DI solution life cycle implementation projects. With expertise in data modeling, Dmitry also has a background and business experience in multiple relational databases, OLAP systems, and NoSQL databases. He is also an active speaker at data conferences and helps people adopt cloud analytics.

Sergey Sheypak

Sergey Sheypak started his big data practice in 2010 as a Teradata PS consultant. He led the Teradata Master Data Management deployment at Sberbank, Russia (which has 110 million customers). Later, Sergey switched to AsterData and Hadoop practices. In 2012, Sergey joined the research and development team at MegaFon (one of the top three telecom companies in Russia, with 70 million customers). While leading the Hadoop team at MegaFon, Sergey built ETL processes from the existing Oracle DWH to HDFS. Automated end-to-end tests and acceptance tests were introduced as a mandatory part of the Hadoop development process. Scoring and geospatial analysis systems based on specific telecom data were developed and launched. Now, Sergey works as an independent consultant in Sweden.

Chapter 7. Exploring Data in the Cloud

Hadoop on the cloud is a new deployment option that allows organizations to create and customize Hadoop clusters on virtual machines, utilizing the computing resources of virtual instances and deployment scripts. Like the on-premise fully custom option, this gives businesses full control of the cluster. In addition, it provides flexibility and many other advantages, for example capacity on demand, reduced staff costs, storage services, and technical support. Finally, it offers fast time to value: we can deploy our infrastructure in the Amazon cloud and start analyzing our data very quickly, because we don't need to set up hardware and software, nor do we need many technical resources. One of the most popular Hadoop cloud offerings is Amazon Elastic MapReduce (EMR).

With Hunk we can interactively explore, analyze, and visualize data stored in Amazon EMR and Amazon S3. The integrated offering lets AWS and Splunk customers:

  • Unlock the...

An introduction to Amazon EMR and S3


In this section, we will learn about Amazon EMR and the Simple Storage Service (S3). We will also try out these services by creating an EMR cluster and an S3 bucket.

Amazon EMR

Amazon EMR is a Hadoop framework in the cloud, offered as a managed service. Thousands of customers run millions of EMR clusters in a variety of big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. EMR can easily process any type of big data without requiring you to build and maintain your own big data infrastructure.

As with any other Amazon service, EMR is easy to launch by filling in a few option forms: enter the cluster name, its size, and the types of node in the cluster, and a fully running EMR cluster is created within a couple of minutes, ready to process data. This removes all the headache of maintaining clusters and version compatibility; Amazon takes care of all the tasks involved in running and supporting Hadoop...
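
For readers who prefer the command line, the same result can be achieved with the AWS CLI. The following is a minimal sketch, assuming the CLI is installed and configured with your credentials; the bucket name, cluster name, instance type, and release label are placeholders you should adapt:

# Create an S3 bucket for input data and cluster logs (bucket name is hypothetical)
aws s3 mb s3://my-hunk-demo-bucket

# Launch a small three-node EMR cluster running Hadoop
aws emr create-cluster \
    --name "hunk-demo-cluster" \
    --release-label emr-4.2.0 \
    --applications Name=Hadoop \
    --instance-type m3.xlarge \
    --instance-count 3 \
    --use-default-roles \
    --log-uri s3://my-hunk-demo-bucket/logs/

# Poll the cluster until its state reaches WAITING
aws emr describe-cluster --cluster-id <cluster-id printed by create-cluster>

Note that --use-default-roles assumes the default EMR IAM roles already exist in your account (running aws emr create-default-roles once creates them).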

Integrating Hunk with EMR and S3


Integrating Hunk with EMR and S3 is a sensible proposition. If we connect the vast amounts of data that we store in HDFS or S3 with the rich capabilities of Hunk, we can build a full analytics solution for data of any type and any size in the cloud.

Fundamentally, we have a three-tier architecture. The first tier is data storage based on HDFS or S3. The next one is the compute or processing framework, provided by EMR. Finally, the visualization, data discovery, analytics, and app development framework is provided by Hunk.
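
To make the Hunk tier concrete, the sketch below shows roughly what a data provider and a virtual index look like in Hunk's indexes.conf. All host names, paths, and stanza names here are hypothetical placeholders, and the exact set of vix.* properties depends on your Hunk version and cluster layout:

# Data provider stanza: tells Hunk how to reach the cluster (all values hypothetical)
[provider:emr-provider]
vix.family = hadoop
vix.env.JAVA_HOME = /usr/lib/jvm/java
vix.env.HADOOP_HOME = /usr/lib/hadoop
vix.fs.default.name = hdfs://emr-master-node:8020
vix.splunk.home.hdfs = /user/hunk/workdir

# Virtual index stanza: points at the raw data, here an S3 path read via the s3n:// scheme
[s3_weblogs]
vix.provider = emr-provider
vix.input.1.path = s3n://my-hunk-demo-bucket/access_combined/...

Once the provider and virtual index are defined, the data can be searched from Hunk much like any native Splunk index.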

The traditional method for hosting Hunk in the cloud is to simply buy a standard license and then provision a virtual machine in much the same way you would do it on-site. The instance would then have to be manually configured to point to the correct Hadoop or AWS cluster. This method is also called Bring Your Own License (BYOL).

On the other hand, Splunk and Amazon offer another method, in which Hunk instances can be automatically...

Converting Hunk from an hourly rate to a license


We have the option to convert an hourly-billed Hunk instance to a normal license. If we have bought a license, we can add it under Settings | License | Add License. Then, we should clear the cache using the following command in the terminal:

rm -rf /opt/hunk/var/run/splunk/hunk/aws/emr/
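
The command above assumes Hunk is installed in /opt/hunk; adjust the path to your own $SPLUNK_HOME if needed. After clearing the cache, restarting Hunk is typically required for the new license to take effect:

/opt/hunk/bin/splunk restart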

Summary


In this chapter, we met Amazon EMR and S3, discussed their advantages for big data analytics, and saw why Hunk is so useful as an analytical tool for cloud Hadoop. In addition, we considered both methods of licensing Hunk in the cloud and learned how to set up EMR clusters and the Hunk AMI. Moreover, we created a new data provider and a virtual index based on S3 buckets with access_combined logs. As a result, you should be able to tackle big data challenges using cloud computing while avoiding the complexity of Hadoop maintenance and deployment.
