Summary
In this chapter, we introduced Apache Pig, a platform for large-scale data analysis on Hadoop. In particular, we covered the following topics:
Pig's goal of providing a dataflow-like abstraction that does not require hands-on MapReduce development
How Pig's approach to processing data compares to SQL, where Pig is procedural while SQL is declarative
How to get started with Pig, which is easy because it is a library that generates custom code and doesn't require additional services
An overview of the data types, core functions, and extension mechanisms provided by Pig
Detailed examples of applying Pig to analyze the Twitter dataset, which demonstrated its ability to express complex concepts concisely
How libraries such as Piggybank, Elephant Bird, and DataFu provide repositories for numerous useful prewritten Pig functions
In the next chapter, we will revisit the SQL comparison by exploring tools that expose a SQL-like abstraction over data stored in HDFS.