Getting Started with Greenplum for Big Data Analytics

  • Explore the software components and appliance modules available in Greenplum
  • Learn core Big Data Architecture concepts and master data loading and processing patterns
  • Understand Big Data problems and the Data Science lifecycle

Book Details

Language : English
Paperback : 172 pages [ 235mm x 191mm ]
Release Date : October 2013
ISBN : 1782177043
ISBN 13 : 9781782177043
Author(s) : Sunila Gollapudi
Topics and Technologies : All Books, Big Data and Business Intelligence, Enterprise Products and Platforms, Enterprise

Table of Contents

Chapter 1: Big Data, Analytics, and Data Science Life Cycle
Chapter 2: Greenplum Unified Analytics Platform (UAP)
Chapter 3: Advanced Analytics – Paradigms, Tools, and Techniques
Chapter 4: Implementing Analytics with Greenplum UAP
  • Chapter 1: Big Data, Analytics, and Data Science Life Cycle
    • Enterprise data
      • Classification
      • Features
    • Big Data
      • So, what is Big Data?
      • Multi-structured data
    • Data analytics
    • Data science
      • Data science life cycle
        • Phase 1 – state business problem
        • Phase 2 – set up data
        • Phase 3 – explore/transform data
        • Phase 4 – model
        • Phase 5 – publish insights
        • Phase 6 – measure effectiveness
    • References/Further reading
    • Summary
  • Chapter 2: Greenplum Unified Analytics Platform (UAP)
    • Big Data analytics – platform requirements
    • Greenplum Unified Analytics Platform (UAP)
      • Core components
        • Greenplum Database
        • Hadoop (HD)
        • Chorus
        • Command Center
      • Modules
        • Database modules
        • HD modules
        • Data Integration Accelerator (DIA) modules
      • Core architecture concepts
        • Data warehousing
        • Column-oriented databases
        • Parallel versus distributed computing/processing
        • Shared nothing, massive parallel processing (MPP) systems, and elastic scalability
        • Data loading patterns
    • Greenplum UAP components
      • Greenplum Database
        • The Greenplum Database physical architecture
        • The Greenplum high-availability architecture
        • High-speed data loading using external tables
        • External table types
        • Polymorphic data storage and historic data management
        • Data distribution
      • Hadoop (HD)
        • Hadoop Distributed File System (HDFS)
        • Hadoop MapReduce
      • Chorus
    • Greenplum Data Computing Appliance (DCA)
    • Greenplum Data Integration Accelerator (DIA)
    • References/Further reading
    • Summary
  • Chapter 3: Advanced Analytics – Paradigms, Tools, and Techniques
    • Analytic paradigms
      • Descriptive analytics
      • Predictive analytics
      • Prescriptive analytics
    • Analytics classified
      • Classification
      • Forecasting or prediction or regression
      • Clustering
      • Optimization
      • Simulations
    • Modeling methods
      • Decision trees
      • Association rules
        • The Apriori algorithm
      • Linear regression
      • Logistic regression
      • The Naive Bayesian classifier
      • K-means clustering
      • Text analysis
    • R programming
    • Weka
    • In-database analytics using MADlib
    • References/Further reading
    • Summary
  • Chapter 4: Implementing Analytics with Greenplum UAP
    • Data loading for Greenplum Database and HD
      • Greenplum data loading options
        • External tables
        • gpfdist
        • gpload
      • Hadoop (HD) data loading options
        • Sqoop 2
        • Greenplum BulkLoader for Hadoop
      • Using external ETL to load data into Greenplum
        • Extraction, Load, and Transformation (ELT) and Extraction, Transformation, Load, and Transformation (ETLT)
        • Greenplum target configuration
        • Sourcing large volumes of data from Greenplum
        • Unsupported Greenplum data types
        • Push Down Optimization (PDO)
    • Greenplum table distribution and partitioning
      • Distribution
        • Data skew and performance
        • Optimizing the broadcast or redistribution motion for data co-location
      • Partitioning
      • Querying Greenplum Database and HD
      • Querying Greenplum Database
        • Analyzing and optimizing queries
      • Dynamic Pipelining in Greenplum
      • Querying HDFS
        • Hive
        • Pig
      • Data communication between Greenplum Database and Hadoop (using external tables)
    • Data Computing Appliance (DCA)
      • Storage design, disk protection, and fault tolerance
        • Master server RAID configurations
        • Segment server RAID configurations
      • Monitoring DCA
    • Greenplum Database management
    • In-database analytics options (Greenplum-specific)
      • Window functions
        • The PARTITION BY clause
        • The ORDER BY clause
        • The OVER (ORDER BY…) clause
        • Creating, modifying, and dropping functions
      • User-defined aggregates
    • Using R with Greenplum
      • DBI Connector for R
      • PL/R
    • Using Weka with Greenplum
    • Using MADlib with Greenplum
    • Using Greenplum Chorus
    • Pivotal
    • References/Further Reading
    • Summary

          Sunila Gollapudi

          Sunila Gollapudi works as a Technology Architect for Broadridge Financial Solutions Private Limited. She has over 13 years of experience in developing, designing, and architecting data-driven solutions, including around eight years focused on the banking and financial services domain. She drives the Big Data and data science practice for Broadridge. Her key roles have been Solutions Architect, Technical Leader, Big Data Evangelist, and Mentor. Sunila has a Master's degree in Computer Applications, and her passion for mathematics drew her to data and analytics. She worked with Java and distributed architecture, and was an SOA consultant and integration specialist, before she embarked on her data journey. She is a strong follower of open source technologies and believes in the innovation that the open source revolution brings. She has been a speaker at various conferences and meetups on Java and Big Data. Her current Big Data and data science specialties include Hadoop, Greenplum, R, Weka, MADlib, advanced analytics, machine learning, and data integration tools such as Pentaho and Informatica. With a unique blend of technology and domain expertise, Sunila has been instrumental in conceptualizing architectural patterns and providing reference architecture for Big Data problems in the financial services domain.

          What you will learn from this book

          • Load data from multiple data sources using the built-in ELT / ETL
          • Learn Parallel Processing / MPP / MapReduce techniques
          • Program with R and MADlib
          • Understand backup and recovery implementation in Greenplum
          • Optimize data processing and querying using optimal distribution and partitioning strategies
          • Exchange data between the Greenplum Database and Hadoop
          • Handle high-availability requirements on Greenplum
          • Integrate ETL, reporting, and visualization tools

          In Detail

          Organizations are using data and analytics to gain a competitive advantage, and they are quickly becoming more data driven. With the advent of Big Data, existing Data Warehousing and Business Intelligence solutions are becoming obsolete, and new agile platforms that cover all aspects of Big Data have become a necessity. From loading and integrating data to presenting analytical visualizations and reports, new Big Data platforms like Greenplum do it all. What now needs tuning is the mindset of the user who puts these solutions to work.

          "Getting Started with Greenplum for Big Data Analytics" is a practical, hands-on guide to learning and implementing Big Data Analytics using the Greenplum Integrated Analytics Platform. From processing structured and unstructured data to presenting the results/insights to key business stakeholders, this book explains it all.

          "Getting Started with Greenplum for Big Data Analytics" discusses the key characteristics of Big Data and its impact on current Data Warehousing platforms. It will take you through the standard Data Science project lifecycle and will lay down the key requirements for an integrated analytics platform. It then explores the various software and appliance components of Greenplum and discusses the relevance of each component at every level in the Data Science lifecycle.

          You will also learn Big Data architectural patterns and recap some key advanced analytics techniques in detail. The book also looks at programming with R and its integration with Greenplum for implementing analytics. Additionally, you will explore MADlib and advanced SQL techniques in Greenplum for analytics. The book also elaborates on the physical architecture of Greenplum, with guidance on handling high availability, backup, and recovery.


          Approach

          Standard tutorial-based approach

          Who this book is for

          "Getting Started with Greenplum for Big Data Analytics" is great for data scientists and data analysts with a basic knowledge of Data Warehousing and Business Intelligence platforms who are new to Big Data and who want a good grounding in how to use the Greenplum Platform. It’s assumed that you have some experience with database design and programming and are familiar with analytics tools like R and Weka.
