Chapter 4. Data Access Components – Hive and Pig

Hadoop clusters typically hold terabytes or petabytes of data to process; hence data access is an extremely important aspect of any project or product, especially with Hadoop. When working with Big Data, we often have to perform ad hoc processing to gain insights from the data and design processing strategies. Hadoop's basic processing layer is MapReduce which, as we discussed earlier, is a massively parallel processing framework that is scalable, fast, adaptable, and fault tolerant.

We will look at some limitations of MapReduce programming and then examine two programming abstraction layers, Hive and Pig, in detail. Both generate MapReduce jobs from a user-friendly language, which speeds up development and makes the code easier to manage. Hive and Pig are quite useful and handy for ad hoc analysis and for analyses of low to moderate complexity.

The need for a data processing tool on Hadoop


MapReduce is the key to processing Big Data, but it is complex to understand, design, code, and optimize. It has a steep learning curve and requires strong programming skills to master. Big Data users come from many different backgrounds, such as programming, database administration, scripting, analytics, data science, and data management, and not all of them can adapt to the MapReduce programming model. Hence, the Hadoop ecosystem provides different abstractions as its data access components.

The data access components are very useful for developers: without learning MapReduce programming in detail, they can still utilize the MapReduce framework through an interface they are far more comfortable with, which results in faster development and more maintainable code. These abstractions also make it possible to run ad hoc processing on data quickly and to concentrate on the business logic.

The two widely used data access components in the Hadoop ecosystem are Hive and Pig, which we will discuss in the following sections.

Pig


Pig is a component that provides the Pig Latin language as an abstraction wrapper on top of MapReduce. Pig was developed by Yahoo! around 2006 and was contributed to Apache as an open source project. Pig Latin is a data flow language that feels natural to developers used to procedural languages. Pig manages data as a flow, which makes it ideal for data flow processing, ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines, and ad hoc data analysis.
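
As an illustration of this flow style, here is a minimal Pig Latin sketch; the input path, field names, and schema are hypothetical:

    -- Load tab-delimited log records (the default PigStorage delimiter),
    -- keep only the errors, count them per day, and store the result.
    logs    = LOAD '/data/logs' AS (day:chararray, level:chararray, msg:chararray);
    errors  = FILTER logs BY level == 'ERROR';
    by_day  = GROUP errors BY day;
    counts  = FOREACH by_day GENERATE group AS day, COUNT(errors) AS n;
    STORE counts INTO '/data/error_counts';

Each statement derives a new relation from the previous one, which is exactly the step-by-step flow a procedural developer expects; Pig compiles the whole script into MapReduce jobs.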

Pig makes analyzing structured and semi-structured data much easier. It was developed based on a philosophy modeled on its namesake: pigs can eat anything, live anywhere, can be easily controlled and modified by the user, and are expected to process data quickly.

Pig data types

Pig has a collection of primitive data types, as well as complex data types. Inputs and outputs to Pig's relational operators are specified using these data types:

  • Primitive: int, long, float, double, chararray, and bytearray

  • Map: A map is a set of key-value pairs, where the keys are of type chararray and the values can be of any Pig type

  • Tuple: A tuple is an ordered, fixed-length collection of fields, each of which can be of any type

  • Bag: A bag is an unordered collection of tuples
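
A small sketch of how these types appear in a load schema; the path and field names are made up for illustration:

    -- Declare a schema that mixes primitive types with a map.
    users = LOAD '/data/users'
            AS (id:int, score:double, name:chararray, props:map[chararray]);
    -- Values in a map are looked up with the # operator.
    countries = FOREACH users GENERATE name, props#'country';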

Hive


Hive provides a data warehouse environment on Hadoop with a SQL-like wrapper, translating the SQL commands into MapReduce jobs for processing. Hive's SQL dialect is called HiveQL; it does not fully implement the SQL-92 standard and should not be assumed to support all of its keywords, as the whole idea is to hide the complexity of MapReduce programming while still enabling analysis of the data.
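
As a minimal HiveQL sketch of this idea, the table and columns below are hypothetical, but the statements are standard HiveQL:

    -- Define a table over tab-delimited files, then run an aggregate query.
    -- Hive compiles the SELECT into one or more MapReduce jobs behind the scenes.
    CREATE TABLE logs (day STRING, level STRING, msg STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    SELECT day, COUNT(*) AS errors
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY day;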

Hive can also act as an analytical interface to other systems, as most tools in the ecosystem integrate well with it. Hive cannot be used for handling transactions, however, as it provides neither row-level updates nor real-time queries.

The Hive architecture

The Hive architecture consists of several components, including:

  • Driver: The driver manages the lifecycle of a HiveQL statement as it moves through Hive and also maintains a session handle for session statistics.

  • Metastore: The metastore stores the system catalog and the metadata about tables, columns, partitions, and so on.

  • Query Compiler: The query compiler compiles HiveQL into a directed acyclic graph (DAG) of optimized MapReduce tasks; the EXPLAIN sketch below shows such a plan.
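
Hive's EXPLAIN statement makes the query compiler's work visible by printing the generated plan instead of executing the query; here is a sketch that reuses the hypothetical logs table from earlier:

    -- Print the compiled plan (stages and operator tree) without running the query.
    EXPLAIN
    SELECT day, COUNT(*)
    FROM logs
    GROUP BY day;
    -- The output lists the stage dependencies and, for each stage (typically
    -- a MapReduce job), its operator tree: TableScan, Group By Operator,
    -- Reduce Output Operator, and so on.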

Summary


In this chapter, we have explored two wrappers of MapReduce programming: Pig and Hive.

MapReduce is very powerful, but it is complex and has a steep learning curve; the difficult parts are managing MapReduce programs and the time taken for development and optimization. For easier and faster development, we have abstraction layers such as Pig, a wrapper exposing the procedural Pig Latin language on top of MapReduce, and Hive, a wrapper exposing the SQL-like HiveQL.

Pig follows the data flow model: it uses a DAG model to transform a Pig Latin script into MapReduce jobs. Pig performs the transformation through three plans, namely the logical, physical, and MapReduce plans, where each stage translates the statements and produces an optimized plan of execution. Pig also has the Grunt shell for analyzing data interactively. Pig has very useful operators to filter, group, aggregate, cogroup, and so on, and it also supports user-defined functions.

Hive is used by users who are more comfortable with SQL, as HiveQL lets them analyze data without writing MapReduce code directly.
