
Pentaho for Big Data Analytics

By Manoj R Patil , Feris Thia
About this book

Pentaho accelerates the realization of value from big data with the most complete solution for big data analytics and data integration. The real power of big data analytics is the abstraction between data and analytics. Data can be distributed across the cluster in various formats, and the analytics platform should have the capability to talk to different heterogeneous data stores and fetch the filtered data to enrich its value.

Pentaho for Big Data Analytics is a practical, hands-on guide that provides clear, step-by-step exercises for using Pentaho to take advantage of big data systems, where data beats algorithms, and gives you a good grounding in the capabilities of Pentaho Business Analytics.

This book looks at the key ingredients of the Pentaho Business Analytics platform. We will see how to prepare the Pentaho BI environment, and get to grips with the big data ecosystem through. The book provides a clear guide to the essential tools of Pentaho Business Analytics, providing familiarity with both the various design tools for setting up reports, and the visualization tools necessary for complete data analysis.

Publication date:
November 2013
Publisher
Packt
Pages
118
ISBN
9781783282159

 

Chapter 1. The Rise of Pentaho Analytics along with Big Data

Pentaho, headquartered in Orlando, has a team of BI veterans with an excellent track record. In fact, Pentaho is the first commercial open source BI platform, and it became popular quickly because of its seamless integration with many third-party software products. It can comfortably talk to data sources such as MongoDB, OLAP tools such as Palo, and Big Data frameworks such as Hadoop and Hive.

The Pentaho brand has been built up over the last 9 years to help unify and manage a suite of open source projects that provide alternatives to proprietary BI software vendors. To name a few, these projects include Kettle, Mondrian, Weka, and JFreeReport. This unification helped to grow Pentaho's community and gave it a centralized home. Pentaho claims that its community stands somewhere between 8,000 and 10,000 members strong, a fact that aids its ability to stay afloat by offering just technical support, management services, and product enhancements to its growing list of enterprise BI users. In fact, this is how Pentaho mainly generates revenue for its growth.

For research and innovation, Pentaho has its "think tank", named Pentaho Labs, which works on breakthroughs in Big Data-driven technologies in areas such as predictive and real-time analysis.

The core of the business intelligence domain is always the underlying data. In fact, the first attempt to quantify the growth rate of the volume of data was made some 70 years ago, under the name "information explosion". According to the Oxford English Dictionary, this term was first used in 1941. By 2010, this industrial revolution of data had gained full momentum, fueled by social media sites, and scientists and computer engineers coined a new term for the phenomenon: "Big Data". Big Data is a collection of data sets so large and complex that it becomes difficult to process with conventional database management tools. The challenges include capture, curation, storage, search, sharing, transfer, analysis, and visualization. As of 2012, the limit on the size of data sets that were feasible to process in a reasonable amount of time was in the order of exabytes (1 exabyte is 1 billion gigabytes) of data.

Data sets grow in size partly because they are increasingly being gathered by ubiquitous information-sensing mobile devices, aerial sensory technologies, digital cameras, software logs, microphones, RFID readers, and so on, in addition to scientific research data such as micro-array analysis. One EMC-sponsored IDC study projected that the amount of data generated annually would grow nearly 45-fold by 2020.

So, with the pressing need for software to store this variety of huge data, Hadoop was born. To analyze this data, the industry needed an easily manageable, commercially viable solution that integrates with such Big Data software. Pentaho has come up with a suite of software that addresses the challenges posed by Big Data.

 

Pentaho BI Suite – components


Pentaho is a trailblazer when it comes to business intelligence and analysis, offering a full suite of capabilities for ETL (Extract, Transform, and Load) processes, data discovery, predictive analysis, and powerful visualization. It can be deployed on premises or in the cloud, or embedded in custom applications.
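To make the ETL idea concrete, here is a minimal sketch of an extract, transform, and load flow in plain Python. It is not Pentaho code and does not use any Pentaho API; the sample data, the sales table, and the function names are all assumptions made up for this illustration.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or a database.
RAW_CSV = """region,amount
north,100.50
south,not_a_number
north,49.50
east,20
"""


def extract(text):
    """Extract: parse CSV text into a list of row dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop rows with an invalid amount and normalize the fields."""
    clean = []
    for row in rows:
        try:
            amount = float(row["amount"])
        except ValueError:
            continue  # skip rows whose amount cannot be parsed
        clean.append((row["region"].strip().upper(), amount))
    return clean


def load(rows, conn):
    """Load: write the cleaned rows into a target table (hypothetical name)."""
    conn.execute("CREATE TABLE IF NOT EXISTS sales (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    conn.commit()


def run_etl(text, conn):
    """Run the three stages, then return the total amount per region."""
    load(transform(extract(text)), conn)
    return conn.execute(
        "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
    ).fetchall()


if __name__ == "__main__":
    print(run_etl(RAW_CSV, sqlite3.connect(":memory:")))
```

A graphical ETL tool lets you assemble the same three stages visually instead of writing them by hand; the sketch only shows what each stage is responsible for.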

Pentaho is a provider of a Big Data analytics solution that spans data integration, interactive data visualization, and predictive analytics. As depicted in the following diagram, this platform contains multiple components, which are divided into three layers: data, server, and presentation:

Let us take a detailed look at each of the components in the previous diagram.

Data

One of the biggest advantages of Pentaho is that it integrates with multiple data sources seamlessly. In fact, Pentaho Data Integration 4.4 Community Edition (referred to as CE hereafter) supports 44 open source and proprietary databases, flat files, spreadsheets, and other third-party software out of the box. Pentaho introduced the Adaptive Big Data Layer as part of the Pentaho Data Integration engine to support the evolution of the Big Data stores. This layer accelerates access to, and integration with, the latest versions and capabilities of the Big Data stores. It natively supports third-party Hadoop distributions from MapR, Cloudera, and Hortonworks, as well as popular NoSQL databases such as Cassandra and MongoDB. These new Pentaho Big Data initiatives bring greater adaptability, abstraction from change, and increased competitive advantage to companies facing the never-ceasing evolution of the Big Data ecosystem. Pentaho also supports analytic databases such as Greenplum and Vertica.

Server applications

The Pentaho Administration Console (PAC) server in CE or Pentaho Enterprise Console (PEC) server in EE (Enterprise Edition) is a web interface used to create, view, schedule, and apply permissions to reports and dashboards. It also provides an easy way to manage security, scheduling, and configuration for the Business Application Server and Data Integration Server along with repository management. The server applications are as follows:

  • Business Analytics (BA) Server: This is a Java-based BI platform with a report management system and lightweight process-flow engine. This platform also provides an HTML5-based web interface for creating, scheduling, and sharing various artifacts of BI such as interactive reporting, data analysis, and a custom dashboard. In CE, we have a parallel application called Business Intelligence (BI) Server.

  • Data Integration (DI) Server: This is a commercially available enterprise class server for the ETL processes and Data Integration. It helps to execute ETL and Data Integration jobs smoothly. It also provides scheduling to automate jobs and supports content management with the help of revision history and security integration.

Thin Client Tools

The Thin Client Tools all run inside Pentaho User Console (PUC) in a web browser (such as Internet Explorer, Chrome, or Firefox). Let's have a look at each of the tools:

  • Pentaho Interactive Reporting: This is a "What You See is What You Get" (WYSIWYG) type of design interface used to build simple and ad hoc reports on the fly without having to rely on IT support. Any business user can design reports using the drag-and-drop feature by connecting to the desired data source and then do rich formatting or use the existing templates.

  • Pentaho Analyzer: This provides an advanced, web-based OLAP viewer that works across multiple browsers and supports drag-and-drop. It is an intuitive analytical visualization application with the capability to filter and drill down further into business information data, which is stored in its own Pentaho Analysis (Mondrian) data source. You can also perform other activities such as sorting, creating derived measures, and chart visualization.

  • Pentaho Dashboard Designer (EE): This is a commercial plugin that allows users to create dashboards with great usability. Dashboards can contain a centralized view of key performance indicators (KPIs) and other business data movement, along with dynamic filter controls and customizable layouts and themes.

Design tools

Let's take a quick look at each of these tools:

  • Schema Workbench: This is a Graphical User Interface (GUI) for designing ROLAP cubes for Pentaho Analysis (Mondrian). It also provides the capability of data exploration and analysis for end BI users without having to understand the MultiDimensional eXpressions (MDX) language.

  • Aggregation Designer: This works from Pentaho Analysis (Mondrian) schema files in XML, together with the database tables that the schema describes, to generate pre-calculated, pre-aggregated answers. These greatly improve the performance of analysis work and of MDX queries executed against Mondrian.

  • Metadata Editor: This is a tool used to create logical business models, and it acts as an abstraction layer over the underlying physical data layer. The resulting metadata mappings are used by Pentaho's Interactive Reporting (the community-based Saiku Reporting) to create reports within the BA Server without any other external desktop application.

  • Report Designer: This is a banded report designing tool with a rich GUI, which can also contain sub-reports, charts, and graphs. It can query and use data from a range of data sources, from text files to RDBMS to Big Data, which addresses the requirements of financial, operational, and production reporting. Standalone reports can even be executed from the user console or used within a dashboard. Pentaho Report Designer has a reporting engine at its core, which accepts a .prpt template to process reports. This file is in ZIP format and contains XML resources that define the report design.

  • Data Integration: This is also known as "Kettle", and consists of a core integration (ETL) engine and a GUI application that allows the user to design Data Integration jobs and transformations. It supports distributed deployment on a cluster or in a cloud environment as well as on single-node computers. It has an adaptive Big Data layer that supports different Big Data stores and insulates you from changes in Hadoop, so that you can focus on analysis without worrying much about modifications to the Big Data stores.

  • Design Studio: This is an Eclipse-based application and plugin that helps you create business process flows, using a special XML script to define action sequences (called xactions) and other forms of automation in the platform. Action sequences define a lightweight, result-oriented business flow within the Pentaho BA Server.
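The Report Designer entry above describes its template file as a ZIP archive containing XML resources. As a rough illustration of what that means in practice, the following Python sketch builds a small in-memory ZIP and lists the XML entries inside it. The entry names and contents are hypothetical and are not taken from a real Pentaho report file.

```python
import io
import zipfile


def build_sample_bundle():
    """Create an in-memory ZIP with made-up entries standing in for a report bundle."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as bundle:
        # Entry names below are hypothetical, chosen only for this illustration.
        bundle.writestr("layout.xml", "<layout><band name='details'/></layout>")
        bundle.writestr("datasource.xml", "<datasource type='sql'/>")
        bundle.writestr("logo.png", b"\x89PNG placeholder bytes")
    buffer.seek(0)
    return buffer


def list_xml_resources(file_obj):
    """Return the sorted names of all XML entries in a ZIP archive."""
    with zipfile.ZipFile(file_obj) as bundle:
        return sorted(
            name for name in bundle.namelist() if name.lower().endswith(".xml")
        )


if __name__ == "__main__":
    print(list_xml_resources(build_sample_bundle()))
```

Passing the path of an actual report file to list_xml_resources instead of the sample bundle should work the same way, provided the file really is a standard ZIP archive.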

 

Edge over competitors


What makes Pentaho unique among existing BI solutions is the vast data connectivity provided by the Pentaho abstraction layer. This makes it a very complete solution for data integration across many heterogeneous entry systems and storage systems.

Pentaho's OLAP solution also provides flexibility on various relational database engines, regardless of whether it is a proprietary database or open source.

The big benefit of Pentaho is its clear vision in adopting Big Data sources and NoSQL solutions, which are increasingly accepted in enterprises across the world.

Apache Hadoop has become increasingly popular, and Pentaho's growing feature set has kept pace with it. Once you have the Hadoop platform, you can use Pentaho to write data to or read data from HDFS (Hadoop Distributed File System) and also orchestrate a MapReduce process in Hadoop clusters with an easy-to-use GUI designer.
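For readers new to MapReduce, the following is a minimal single-process sketch of the programming model in Python, using the classic word count. It runs on one machine and uses neither Hadoop nor Pentaho; the function names are assumptions made for this illustration. On a real cluster, the map and reduce phases would run in parallel across nodes.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one line of input."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all emitted values by their key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine the values of one key into a single count."""
    return key, sum(values)


def word_count(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = [pair for line in lines for pair in map_phase(line)]
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "Big Data analytics"]))
```

The appeal of a GUI designer is that the wiring between these phases is drawn rather than coded, while the underlying model stays the same.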

Pentaho has also emphasized visualization, the key ingredient of any analytics platform. Its recent acquisition of Webdetails, a Portugal-based business analytics solution company, clearly shows this. Webdetails brought on board a fantastic set of UI-based community tools (known as CTools), such as Community Dashboard Framework (CDF) and Community Data Access (CDA).

 

Summary


We took a look at the Pentaho Business Analytics platform with its key ingredients. We have also discussed various client tools and design tools with their respective features.

In the next chapter, we will see how to prepare a Pentaho BI environment on your machine, which will help in executing some hands-on assignments.

About the Authors
  • Manoj R Patil

    Manoj R Patil is the Chief Architect in Big Data at Compassites Software Solutions Pvt. Ltd., where he oversees the overall platform architecture related to Big Data solutions and also contributes hands-on to some assignments. He has been working in the IT industry for the last 15 years. He started as a programmer and, on the way, acquired skills in architecting and designing solutions, managing projects while keeping each stakeholder's interest in mind, and deploying and maintaining solutions on a cloud infrastructure. He has been working on the Pentaho-related stack for the last 5 years, providing solutions both for employers and as a freelancer. Manoj has extensive experience in JavaEE, MySQL, various frameworks, and Business Intelligence, and is keen to pursue his interest in predictive analysis. He was also associated with TalentBeat, Inc. and Persistent Systems, and implemented interesting solutions in logistics, data masking, and data-intensive life sciences.

  • Feris Thia

    Feris Thia is a founder of PHI-Integration, a Jakarta-based IT consulting company that focuses on data management, data warehousing and Business Intelligence solutions. As a technical consultant, he has spent the last seven years delivering solutions with Pentaho and the Microsoft Business Intelligence platform across various industries, including retail, trading, finance/banking, and telecommunication. He is also a member and maintainer of two very active local Indonesian discussion groups related to Pentaho (pentaho-id@googlegroups.com) and Microsoft Excel (the BelajarExcel.info Facebook group). His current activities include research and building software based on Big Data and the data mining platform, that is, Apache Hadoop, R, and Mahout. He would like to work on a book with a topic on analyzing customer behavior using the Apache Mahout platform.
