Pig Design Patterns


eBook: $28.04 (list price $32.99, save 15%)
Formats: PDF, PacktLib, ePub, and Mobi
Print + free eBook + free PacktLib access to the book: $54.99 (list price $87.98, save 37%)
Free shipping to the UK, US, Europe, and selected countries in Asia.
  • Quickly understand how to use Pig to design end-to-end Big Data systems
  • Implement a hands-on programming approach using design patterns to solve commonly occurring enterprise Big Data challenges
  • Enhance your ability to use Pig and create your own design patterns wherever applicable

Book Details

Language : English
Paperback : 310 pages [ 235mm x 191mm ]
Release Date : April 2014
ISBN : 1783285559
ISBN 13 : 9781783285556
Author(s) : Pradeep Pasupuleti
Topics and Technologies : All Books, Big Data and Business Intelligence, Open Source


Table of Contents

Preface
Chapter 1: Setting the Context for Design Patterns in Pig
Chapter 2: Data Ingest and Egress Patterns
Chapter 3: Data Profiling Patterns
Chapter 4: Data Validation and Cleansing Patterns
Chapter 5: Data Transformation Patterns
Chapter 6: Understanding Data Reduction Patterns
Chapter 7: Advanced Patterns and Future Work
Index
  • Chapter 1: Setting the Context for Design Patterns in Pig
    • Understanding design patterns
    • The scope of design patterns in Pig
    • Hadoop demystified – a quick reckoner
      • The enterprise context
      • Common challenges of distributed systems
      • The advent of Hadoop
      • Hadoop under the covers
      • Understanding the Hadoop Distributed File System
        • HDFS design goals
        • Working of HDFS
      • Understanding MapReduce
        • Understanding how MapReduce works
        • The MapReduce internals
    • Pig – a quick intro
      • Understanding the rationale of Pig
      • Understanding the relevance of Pig in the enterprise
      • Working of Pig – an overview
        • Firing up Pig
        • The use case
        • Code listing
        • The dataset
    • Understanding Pig through the code
      • Pig's extensibility
      • Operators used in code
      • The EXPLAIN operator
      • Understanding Pig's data model
        • Primitive types
        • Complex types
    • Summary
  • Chapter 2: Data Ingest and Egress Patterns
    • The context of data ingest and egress
    • Types of data in the enterprise
    • Ingest and egress patterns for multistructured data
      • Considerations for log ingestion
      • The Apache log ingestion pattern
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
      • The Custom log ingestion pattern
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
      • The image ingress and egress pattern
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
    • The ingress and egress patterns for the NoSQL data
      • MongoDB ingress and egress patterns
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
      • The HBase ingress and egress pattern
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
    • The ingress and egress patterns for structured data
      • The Hive ingress and egress patterns
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
    • The ingress and egress patterns for semi-structured data
      • The mainframe ingestion pattern
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
      • XML ingest and egress patterns
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
      • JSON ingress and egress patterns
        • Background
        • Motivation
        • Use cases
        • Pattern implementation
        • Code snippets
        • Results
        • Additional information
    • Summary
  • Chapter 3: Data Profiling Patterns
    • Data profiling for Big Data
      • Big Data profiling dimensions
      • Sampling considerations for profiling Big Data
        • Sampling support in Pig
    • Rationale for using Pig in data profiling
    • The data type inference pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
        • Pig script
        • Java UDF
      • Results
      • Additional information
    • The basic statistical profiling pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
        • Pig script
        • Macro
      • Results
      • Additional information
    • The pattern-matching pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
        • Pig script
        • Macro
      • Results
      • Additional information
    • The string profiling pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
        • Pig script
        • Macro
      • Results
      • Additional information
    • The unstructured text profiling pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
        • Pig script
        • Java UDF for stemming
        • Java UDF for generating TF-IDF
      • Results
      • Additional information
    • Summary
  • Chapter 4: Data Validation and Cleansing Patterns
    • Data validation and cleansing for Big Data
    • Choosing Pig for validation and cleansing
    • The constraint validation and cleansing design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The regex validation and cleansing design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The corrupt data validation and cleansing design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The unstructured text data validation and cleansing design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • Summary
  • Chapter 5: Data Transformation Patterns
    • Data transformation processes
    • The structured-to-hierarchical transformation pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The data normalization pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Results
      • Additional information
    • The data integration pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The aggregation pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The data generalization pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • Summary
  • Chapter 6: Understanding Data Reduction Patterns
    • Data reduction – a quick introduction
    • Data reduction considerations for Big Data
    • Dimensionality reduction – the Principal Component Analysis design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
        • Limitations of PCA implementation
      • Code snippets
      • Results
      • Additional information
    • Numerosity reduction – the histogram design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • Numerosity reduction – sampling design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • Numerosity reduction – clustering design pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • Summary
  • Chapter 7: Advanced Patterns and Future Work
    • The clustering pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The topic discovery pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The natural language processing pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • The classification pattern
      • Background
      • Motivation
      • Use cases
      • Pattern implementation
      • Code snippets
      • Results
      • Additional information
    • Future trends
      • Emergence of data-driven patterns
      • The emergence of solution-driven patterns
      • Patterns addressing programmability constraints
    • Summary

Pradeep Pasupuleti

Pradeep Pasupuleti has over 16 years of experience in architecting and developing distributed and real-time data-driven systems. Currently, his focus is on developing robust data platforms and data products that are fuelled by scalable machine-learning algorithms. He delivers value to customers by addressing business problems, bringing his deep technical insight into Big Data technologies to bear on future data management and analytical needs. He is extremely passionate about Big Data and believes that it will be the cradle of many innovations that will save humans their time, money, and lives.

He has built solid data product teams with experience spanning every aspect of data science, helping clients build an end-to-end strategy for evolving their current data architecture into a hybrid pattern that supports analytics in both batch and real time, all using the lambda architecture. He has created Centers of Excellence (CoEs) to provide quick wins with data products that analyze high-dimensional multistructured data using scalable natural language processing and deep learning techniques.

He has performed roles in technology consulting, advising Fortune 500 companies on their Big Data strategy, product management, systems architecture, social network analysis, negotiations, conflict resolution, chaos and nonlinear dynamics, international policy, high-performance computing, advanced statistical techniques, risk management, marketing, visualization of high-dimensional data, human-computer interaction, machine learning, information retrieval, and data mining. He has strong experience working in ambiguity to solve complex problems through innovation by bringing smart people together.

His other interests include writing and reading poetry, enjoying the expressive delights of ghazals, spending time with kids discussing impossible inventions, and searching for archeological sites.

You can reach him at http://www.linkedin.com/in/pradeeppasupuleti and pasupuleti.pradeepkumar@gmail.com.

What you will learn from this book

  • Understand Pig's relevance in an enterprise context
  • Use Pig in design patterns that enable data movement across platforms during and after analytical processing
  • See how Pig can co-exist with other components of the Hadoop ecosystem to create Big Data solutions using design patterns
  • Simplify the process of creating complex data pipelines using transformations, aggregations, enrichment, cleansing, filtering, reformatting, lookups, and data type conversions
  • Apply knowledge of Pig in design patterns that deal with integration of Hadoop with other systems to enable multi-platform analytics
  • Comprehend design patterns and use Pig in cases related to complex analysis of pure structured data

In Detail

Pig Design Patterns is a comprehensive guide that enables readers to readily use design patterns that simplify the creation of complex data pipelines in various stages of data management. This book focuses on using Pig in an enterprise context, bridging the gap between theoretical understanding and practical implementation. Each chapter contains a set of design patterns that pose and then solve technical challenges relevant to enterprise use cases.

The book covers the journey of Big Data from the time it enters the enterprise to its eventual use in analytics, in the form of a report or a predictive model. By the end of the book, readers will appreciate Pig's real power in addressing the problems encountered when creating an analytics-based data product. Each design pattern comes with a suggested solution, an analysis of the trade-offs of implementing the solution in a different way, an explanation of how the code works, and the results.

Approach

A comprehensive practical guide that walks you through the multiple stages of data management in the enterprise and gives you numerous design patterns with appropriate code examples to solve frequent problems in each of these stages. The chapters are organized to mimic the sequential data flow seen in analytics platforms, but they can also be read independently to solve a particular group of problems in the Big Data life cycle.

Who this book is for

This book is for experienced developers who are already familiar with Pig and are looking for a use-case perspective on the problems of data ingestion, profiling, cleansing, transformation, and egress encountered in the enterprise. Knowledge of Hadoop and Pig is necessary to grasp the intricacies of Pig design patterns.
