Pentaho Data Integration Cookbook - Second Edition


eBook: $29.99, now $25.49 (save 15%). Formats: PDF, PacktLib, ePub, and Mobi.
Print + free eBook + free PacktLib access: $79.98, now $49.99 (save 37%). Print cover price: $49.99.
Free shipping to the UK, US, Europe, and selected countries in Asia.
Overview

  • Integrate Kettle with other components of the Pentaho Business Intelligence Suite to build and publish Mondrian schemas, create reports, and populate dashboards
  • This book contains an organized sequence of recipes packed with screenshots, tables, and tips so you can complete the tasks as efficiently as possible
  • Manipulate your data by exploring, transforming, validating, integrating, and performing data analysis

Book Details

Language : English
Paperback : 462 pages [ 235mm x 191mm ]
Release Date : December 2013
ISBN : 1783280670
ISBN 13 : 9781783280674
Author(s) : Alex Meadows, Adrián Sergio Pulvirenti, María Carina Roldán
Topics and Technologies : All Books, Cookbooks, Open Source

Table of Contents

Preface
Chapter 1: Working with Databases
Chapter 2: Reading and Writing Files
Chapter 3: Working with Big Data and Cloud Sources
Chapter 4: Manipulating XML Structures
Chapter 5: File Management
Chapter 6: Looking for Data
Chapter 7: Understanding and Optimizing Data Flows
Chapter 8: Executing and Re-using Jobs and Transformations
Chapter 9: Integrating Kettle and the Pentaho Suite
Chapter 10: Getting the Most Out of Kettle
Chapter 11: Utilizing Visualization Tools in Kettle
Chapter 12: Data Analytics
Appendix A: Data Structures
Appendix B: References
Index
  • Chapter 1: Working with Databases
    • Introduction
    • Connecting to a database
    • Getting data from a database
    • Getting data from a database by providing parameters
    • Getting data from a database by running a query built at runtime
    • Inserting or updating rows in a table
    • Inserting new rows where a simple primary key has to be generated
    • Inserting new rows where the primary key has to be generated based on stored values
    • Deleting data from a table
    • Creating or altering a database table from PDI (design time)
    • Creating or altering a database table from PDI (runtime)
    • Inserting, deleting, or updating a table depending on a field
    • Changing the database connection at runtime
    • Loading a parent-child table
    • Building SQL queries via database metadata
    • Performing repetitive database design tasks from PDI
  • Chapter 2: Reading and Writing Files
    • Introduction
    • Reading a simple file
    • Reading several files at the same time
    • Reading semi-structured files
    • Reading files having one field per row
    • Reading files with some fields occupying two or more rows
    • Writing a simple file
    • Writing a semi-structured file
    • Providing the name of a file (for reading or writing) dynamically
    • Using the name of a file (or part of it) as a field
    • Reading an Excel file
    • Getting the value of specific cells in an Excel file
    • Writing an Excel file with several sheets
    • Writing an Excel file with a dynamic number of sheets
    • Reading data from an AWS S3 instance
  • Chapter 3: Working with Big Data and Cloud Sources
    • Introduction
    • Loading data into Salesforce.com
    • Getting data from Salesforce.com
    • Loading data into Hadoop
    • Getting data from Hadoop
    • Loading data into HBase
    • Getting data from HBase
    • Loading data into MongoDB
    • Getting data from MongoDB
  • Chapter 4: Manipulating XML Structures
    • Introduction
    • Reading simple XML files
    • Specifying fields by using XPath notation
    • Validating well-formed XML files
    • Validating an XML file against DTD definitions
    • Validating an XML file against an XSD schema
    • Generating a simple XML document
    • Generating complex XML structures
    • Generating an HTML page using XML and XSL transformations
    • Reading an RSS feed
    • Generating an RSS feed
  • Chapter 5: File Management
    • Introduction
    • Copying or moving one or more files
    • Deleting one or more files
    • Getting files from a remote server
    • Putting files on a remote server
    • Copying or moving a custom list of files
    • Deleting a custom list of files
    • Comparing files and folders
    • Working with ZIP files
    • Encrypting and decrypting files
  • Chapter 6: Looking for Data
    • Introduction
    • Looking for values in a database table
    • Looking for values in a database with complex conditions
    • Looking for values in a database with dynamic queries
    • Looking for values in a variety of sources
    • Looking for values by proximity
    • Looking for values by using a web service
    • Looking for values over an intranet or the Internet
    • Validating data at runtime
  • Chapter 7: Understanding and Optimizing Data Flows
    • Introduction
    • Splitting a stream into two or more streams based on a condition
    • Merging rows of two streams with the same or different structures
    • Adding checksums to verify datasets
    • Comparing two streams and generating differences
    • Generating all possible pairs formed from two datasets
    • Joining two or more streams based on given conditions
    • Interspersing new rows between existing rows
    • Executing steps even when your stream is empty
    • Processing rows differently based on the row number
    • Processing data into shared transformations via filter criteria and subtransformations
    • Altering a data stream with Select values
    • Processing multiple jobs or transformations in parallel
  • Chapter 8: Executing and Re-using Jobs and Transformations
    • Introduction
    • Launching jobs and transformations
    • Executing a job or a transformation by setting static arguments and parameters
    • Executing a job or a transformation from a job by setting arguments and parameters dynamically
    • Executing a job or a transformation whose name is determined at runtime
    • Executing part of a job once for every row in a dataset
    • Executing part of a job several times until a condition is true
    • Moving part of a transformation to a subtransformation
    • Using Metadata Injection to re-use transformations
  • Chapter 9: Integrating Kettle and the Pentaho Suite
    • Introduction
    • Creating a Pentaho report with data coming from PDI
    • Creating a Pentaho report directly from PDI
    • Configuring the Pentaho BI Server for running PDI jobs and transformations
    • Executing a PDI transformation as part of a Pentaho process
    • Executing a PDI job from the Pentaho User Console
    • Populating a CDF dashboard with data coming from a PDI transformation
  • Chapter 10: Getting the Most Out of Kettle
    • Introduction
    • Sending e-mails with attached files
    • Generating a custom logfile
    • Running commands on another server
    • Programming custom functionality
    • Generating sample data for testing purposes
    • Working with JSON files
    • Getting information about transformations and jobs (file-based)
    • Getting information about transformations and jobs (repository-based)
    • Using Spoon's built-in optimization tools
  • Chapter 11: Utilizing Visualization Tools in Kettle
  • Chapter 12: Data Analytics
    • Introduction
    • Reading data from a SAS datafile
    • Studying data via stream statistics
    • Building a random data sample for Weka
  • Appendix A: Data Structures
    • Books data structure
    • Museums data structure
    • Outdoor data structure
    • Steel Wheels data structure
    • Lahman Baseball Database

Alex Meadows

Alex Meadows has worked with open source Business Intelligence solutions for nearly 10 years, in industries ranging from plastics manufacturing to social and e-mail marketing and, most recently, software at Red Hat, Inc. He has been very active in the Pentaho and other open source communities, learning, sharing, and helping newcomers with best practices in BI, analytics, and data management. He received his Bachelor's degree in Business Administration from Chowan University in Murfreesboro, North Carolina, and his Master's degree in Business Intelligence from St. Joseph's University in Philadelphia, Pennsylvania.


Adrián Sergio Pulvirenti

Adrián Sergio Pulvirenti was born in Buenos Aires, Argentina, in 1972. He earned his Bachelor's degree in Computer Science at UBA, one of the most prestigious universities in South America. He has dedicated more than 15 years to developing desktop and web-based software solutions. Over the last few years, he has been leading integration projects and the development of BI solutions.

María Carina Roldán

María Carina was born in Esquel, Argentina. She earned her Bachelor's degree in Computer Science at UNLP in La Plata and then moved to Buenos Aires, where she has lived since 1994. She has worked as a BI consultant for almost 15 years, and for the last six she has been dedicated full time to developing BI solutions using the Pentaho Suite. Currently, she works for Webdetails, a Pentaho company, as an ETL specialist. She is the author of Pentaho 3.2 Data Integration: Beginner's Guide, published by Packt Publishing in April 2009, and co-author of Pentaho Data Integration 4 Cookbook, also published by Packt Publishing, in June 2011.

Code Downloads

Download the code and support files for this book.

Submit Errata

Please let us know if you have found any errors not already listed by completing our errata submission form. Our editors will check them and add them to the errata list. Thank you.

Sample Chapters

You can view sample chapters and the preface of this title on PacktLib, or download sample chapters in PDF format.


                              What you will learn from this book

• Configure Kettle to connect to relational and NoSQL databases and web applications such as Salesforce, explore them, and perform CRUD operations
                              • Utilize plugins to get even more functionality into your Kettle jobs
                              • Embed Java code in your transformations to gain performance and flexibility
                              • Execute and reuse transformations and jobs in different ways
                              • Integrate Kettle with Pentaho Reporting, Pentaho Dashboards, Community Data Access, and the Pentaho BI Platform
                              • Interface Kettle with cloud-based applications
                              • Learn how to control and manipulate data flows
                              • Utilize Kettle to create datasets for analytics

                              In Detail

                              Pentaho Data Integration is the premier open source ETL tool, providing easy, fast, and effective ways to move and transform data. While PDI is relatively easy to pick up, it can take time to learn the best practices so you can design your transformations to process data faster and more efficiently. If you are looking for clear and practical recipes that will advance your skills in Kettle, then this is the book for you.

Pentaho Data Integration Cookbook, Second Edition explains Kettle's features in detail and provides easy-to-follow recipes on file management and databases that can throw a curveball to even the most experienced developers.

Pentaho Data Integration Cookbook, Second Edition provides updates to the material covered in the first edition as well as new recipes that show you how to use some of the key features of PDI released since the first edition was published. You will learn how to work with various data sources: relational and NoSQL databases, flat files, XML files, and more. The book also covers best practices that you can apply immediately in your own solutions, such as building reusable code, ensuring data quality, and using plugins to add even more functionality.

Pentaho Data Integration Cookbook, Second Edition provides recipes that cover the common pitfalls even seasoned developers can find themselves facing, and shows you how to use Kettle's various data sources as well as its advanced features.

                              Approach

Pentaho Data Integration Cookbook, Second Edition is written in a cookbook format, presenting examples in the style of recipes. This allows you to go directly to your topic of interest, or follow topics throughout a chapter to gain thorough, in-depth knowledge.

                              Who this book is for

Pentaho Data Integration Cookbook, Second Edition is designed for developers who are familiar with the basics of Kettle but who wish to move up to the next level. It is also aimed at advanced users who want to learn how to use the new features of PDI as well as best practices for working with Kettle.
