Chapter 4. Data Access Components – Hive and Pig

Hadoop clusters typically hold terabytes or petabytes of data to process; hence data access is an extremely important aspect of any project or product, especially with Hadoop. When working with Big Data, we often have to perform ad hoc processing to gain insights from the data and design processing strategies. Hadoop's basic processing layer is MapReduce which, as we discussed earlier, is a massively parallel processing framework that is scalable, fast, adaptable, and fault tolerant.

We will look at some limitations of MapReduce programming and then examine two programming abstraction layers, Hive and Pig, in detail. Both generate MapReduce jobs from a user-friendly language, which speeds up development and makes the code easier to manage. Hive and Pig are quite useful and handy for ad hoc analysis and for analyses of low to moderate complexity.

The need for a data processing tool on Hadoop


MapReduce is the key to processing Big Data, but it is complex to understand, design, code, and optimize. It has a steep learning curve and requires strong programming skills to master. Big Data users come from many different backgrounds, such as programming, database administration, scripting, analytics, data science, and data management, and not all of them can adapt to the MapReduce programming model. Hence, the Hadoop ecosystem provides different abstractions as its data access components.

The data access components are very useful for developers: without learning MapReduce programming in detail, they can still utilize the MapReduce framework through an interface they are far more comfortable with, which results in faster development and more maintainable code. These abstractions also make it possible to run ad hoc processing on data quickly and to concentrate on the business logic.

The two widely used data access components in the Hadoop ecosystem are Hive and Pig, which we will discuss in the following sections.

Pig


Pig is a component that provides the Pig Latin language as an abstraction wrapper on top of MapReduce. Pig was developed by Yahoo! around 2006 and was contributed to Apache as an open source project. Pig Latin is a data flow language that feels natural to developers used to procedural languages. Pig manages data as a flow, which makes it ideal for data flow processing, ETL (Extract, Transform, Load) or ELT (Extract, Load, Transform) pipelines, and ad hoc data analysis.
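
As an illustration of this flow style, here is a minimal Pig Latin sketch; the input path, field names, and schema are hypothetical:

    -- Load tab-delimited log records (the default PigStorage delimiter),
    -- keep only the errors, count them per day, and store the result.
    logs    = LOAD '/data/logs' AS (day:chararray, level:chararray, msg:chararray);
    errors  = FILTER logs BY level == 'ERROR';
    by_day  = GROUP errors BY day;
    counts  = FOREACH by_day GENERATE group AS day, COUNT(errors) AS n;
    STORE counts INTO '/data/error_counts';

Each statement derives a new relation from the previous one, which is exactly the step-by-step flow a procedural developer expects; Pig compiles the whole script into MapReduce jobs.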

Pig makes analyzing structured and semi-structured data much easier. It was developed based on a philosophy modeled on its namesake: pigs can eat anything, live anywhere, can be easily controlled and modified by the user, and are expected to process data quickly.

Pig data types

Pig has a collection of primitive data types, as well as complex data types. Inputs and outputs to Pig's relational operators are specified using these data types:

  • Primitive: int, long, float, double, chararray, and bytearray

  • Map: A map is a set of key-value pairs, where the keys are of type chararray and the values can be of any Pig type

  • Tuple: A tuple is an ordered, fixed-length collection of fields, each of which can be of any type

  • Bag: A bag is an unordered collection of tuples
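
A small sketch of how these types appear in a load schema; the path and field names are made up for illustration:

    -- Declare a schema that mixes primitive types with a map.
    users = LOAD '/data/users'
            AS (id:int, score:double, name:chararray, props:map[chararray]);
    -- Values in a map are looked up with the # operator.
    countries = FOREACH users GENERATE name, props#'country';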

Hive


Hive provides a data warehouse environment on Hadoop with a SQL-like wrapper, translating the SQL commands into MapReduce jobs for processing. Hive's SQL dialect is called HiveQL; it does not fully implement the SQL-92 standard and should not be assumed to support all of its keywords, as the whole idea is to hide the complexity of MapReduce programming while still enabling analysis of the data.
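
As a minimal HiveQL sketch of this idea, the table and columns below are hypothetical, but the statements are standard HiveQL:

    -- Define a table over tab-delimited files, then run an aggregate query.
    -- Hive compiles the SELECT into one or more MapReduce jobs behind the scenes.
    CREATE TABLE logs (day STRING, level STRING, msg STRING)
        ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

    SELECT day, COUNT(*) AS errors
    FROM logs
    WHERE level = 'ERROR'
    GROUP BY day;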

Hive can also act as an analytical interface to other systems, as most tools in the ecosystem integrate well with it. Hive cannot be used for handling transactions, however, as it provides neither row-level updates nor real-time queries.

The Hive architecture

The Hive architecture consists of several components, including:

  • Driver: The driver manages the lifecycle of a HiveQL statement as it moves through Hive and also maintains a session handle for session statistics.

  • Metastore: The metastore stores the system catalog and the metadata about tables, columns, partitions, and so on.

  • Query Compiler: The query compiler compiles HiveQL into a directed acyclic graph (DAG) of optimized MapReduce tasks; the EXPLAIN sketch below shows such a plan.
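
Hive's EXPLAIN statement makes the query compiler's work visible by printing the generated plan instead of executing the query; here is a sketch that reuses the hypothetical logs table from earlier:

    -- Print the compiled plan (stages and operator tree) without running the query.
    EXPLAIN
    SELECT day, COUNT(*)
    FROM logs
    GROUP BY day;
    -- The output lists the stage dependencies and, for each stage (typically
    -- a MapReduce job), its operator tree: TableScan, Group By Operator,
    -- Reduce Output Operator, and so on.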

Summary


In this chapter, we have explored two wrappers of MapReduce programming: Pig and Hive.

MapReduce is very powerful, but it is complex and has a steep learning curve; the difficult parts are managing MapReduce programs and the time taken for development and optimization. For easier and faster development, we have abstraction layers such as Pig, a wrapper exposing the procedural Pig Latin language on top of MapReduce, and Hive, a wrapper exposing the SQL-like HiveQL.

Pig follows the data flow model: it uses a DAG model to transform a Pig Latin script into MapReduce jobs. Pig performs the transformation through three plans, namely the logical, physical, and MapReduce plans, where each stage translates the statements and produces an optimized plan of execution. Pig also has the Grunt shell for analyzing data interactively. Pig has very useful operators to filter, group, aggregate, cogroup, and so on, and it also supports user-defined functions.

Hive is used by users who are more comfortable with SQL, as HiveQL lets them analyze data without writing MapReduce code directly.
