
Pentaho Analytics for MongoDB

By Bo Borland
About this book

Pentaho Analytics for MongoDB will teach you MongoDB and Pentaho integration points and developer skills needed to create turnkey analytic solutions that deliver insight and drive value for your organization.

Starting with how to install, configure, and develop content in both Pentaho and MongoDB, this book will give you the complete range of skills needed to gain insight into MongoDB data using Pentaho Business Analytics.  You will learn about MongoDB data models and query techniques, which are covered in combination with the provided sample MongoDB database. You then advance to data integration, analysis, and reporting using Pentaho.

You will learn how to use Pentaho Data Integration to blend and enrich data from additional sources. From this blended data, you will develop professional-looking reports and analysis views that are visual and interactive. Lastly, we will cover the Pentaho web portal and web interfaces for deploying analytics out to a broader set of consumer users.

Publication date:
February 2014
Publisher
Packt
Pages
146
ISBN
9781782168355

 

Chapter 1. Getting Started with Pentaho and MongoDB

Many people have chosen MongoDB—a leading NoSQL database—as their data storage solution over a relational database approach, and now they need to analyze and report on their data residing in MongoDB. Pentaho, the leading commercial open source analytics platform, decided to meet this need by forming an early partnership with MongoDB, enabling analytics and reporting on the document-oriented data structures inside MongoDB.

This chapter will introduce you to the powerful combination of MongoDB and Pentaho and will provide step-by-step guidance on how to install and configure both technologies as well as restore the sample MongoDB data provided with this book. We start with an overview of the technologies and then set up a local installation of both technologies along with sample data on a single Windows machine.

The following are the topics that we will cover in this chapter:

  • Overview of Pentaho and MongoDB

  • Installing MongoDB

  • Restoring the sample MongoDB database

  • Installing Pentaho Enterprise Edition

We will focus on the Pentaho Enterprise Edition for this book, because it includes the Big Data utility, Instaview, which we use in later chapters. By the end of this chapter, you will have Pentaho and MongoDB running on a Windows computer, with the sample MongoDB database provided with this book restored to your MongoDB server.

 

MongoDB technology overview


Modern businesses capture huge volumes and varieties of data using several different data storage methods. There is no one-size-fits-all data storage method, because each technology has evolved to tackle the data challenges or opportunities of that specific time in history. We continue to see new and innovative data storage solutions as data volume, variety, and velocity grow, and as people figure out new ways to use data. The following is a small sampling of the variety of data sources you might encounter in a single organization:

  • Simple tabular data files: CSV, text files, and MS Excel

  • Commercial relational databases: Oracle, SQL Server, and DB2

  • Open source relational databases: MySQL and PostgreSQL

  • Modern, web-oriented data sources: XML, JSON, web services, and APIs

  • Hadoop distributions: Apache, Hortonworks, Cloudera, MapR, and Intel

  • Analytical databases: Vertica, Greenplum, and Infobright

  • Machine-generated data sources: Application logs, web server logs, configuration files, sensor data, message queues, and filesystem audit logs

  • NoSQL databases: MongoDB, Redis, Cassandra, HBase, and CouchDB

Organizations invest heavily in these storage technologies and the skills needed to capture, store, and process data. MongoDB has emerged as a leader in the NoSQL category of databases. Because you are reading this book, you are probably well aware of the differences between MongoDB and relational databases; however, it is important to review and remind ourselves where MongoDB came from and why it is a popular alternative to relational databases.

MongoDB is a document-oriented database designed to conquer some of the modern data storage challenges that developers and IT departments experience when using traditional relational databases. These modern data storage challenges started with the rise of the internet and high-traffic websites such as Amazon and Google. These companies' websites attracted millions of users and subsequently massive volumes of website log, clickstream, and event data. Traditional relational database methods for handling growing data mostly involved scaling up (that is, vertical scaling) by adding more CPU and RAM to a single, often proprietary database server. This method of scaling was expensive and had limits on how far you could scale a single server. As a result, Google and Amazon decided to solve these data challenges by developing their own distributed data stores that could easily scale out (that is, horizontal scaling) across hundreds or thousands of commodity servers, as shown in the following figure. Horizontal scaling made it easier to scale dynamically by adding more machines to the cluster without any downtime or limits to compute capacity.

These early pioneers of distributed databases inspired a NoSQL data storage movement that included MongoDB. The term NoSQL is a popular way to describe MongoDB, because MongoDB does not use SQL, but NoSQL has more to do with the way data is stored than with the query language. For many, the name NoSQL is inadequate, because it describes only a query language and not the true essence of MongoDB, which is a horizontally scalable, distributed document database. Surprisingly, the name NoSQL originated simply as a way to describe these emerging distributed data stores using a short and unique Twitter hashtag, #nosql, for the purpose of advertising a meetup on the topic. The hashtag name stuck and is widely in use today!

Why would a database solution not leverage SQL, the most popular database query language used by developers and organizations all over the world? One key reason is that the SQL query language is not designed to efficiently query the nested constructs involved in hierarchical JSON documents, which form the foundation of the MongoDB document data model. JSON documents are language-independent text files that represent data and are built on two primary data structures: nested collections of name/value pairs and ordered lists such as arrays. The following example shows the JSON representation of a dataset that describes the movie Forrest Gump. The JSON representation contains a parent object for movie information, a nested object representing the production company, and an array of cast member objects, shown as follows:

{
    "movie": "Forrest Gump",
    "rating": "PG-13",
    "duration_min": 142,
    "production_company": {
       "name": "Paramount Pictures",
       "streetAddress": "5555 Melrose Ave",
       "city": "Los Angeles",
       "state": "CA",
       "postalCode": 90038
    },
    "cast": [
        {
            "character": "Forrest Gump",
            "person": "Tom Hanks"
        },
        {
            "character": "Jenny Curran",
            "person": "Robin Wright"
        }
    ]
}

Each object is a comma-separated collection of key-value pairs enclosed in curly braces. MongoDB has its own query language built from the ground up as a powerful way to retrieve, process, and update JSON documents. Document-oriented data storage is an alternative to SQL-based relational databases, and it offers some unique advantages that we will discuss in more detail in the next chapter. You can also learn more about JSON at json.org or Wikipedia.org/wiki/JSON.
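To make the nested structure concrete, the following is a minimal sketch in plain JavaScript (run with Node.js, which is our assumption here, not something this chapter installs) that reads values out of the Forrest Gump document. The variable names are ours and are only for illustration; this is ordinary JavaScript object access, not MongoDB's query language.

```javascript
// The Forrest Gump document from the text, trimmed to a few fields.
var movie = {
  movie: "Forrest Gump",
  rating: "PG-13",
  duration_min: 142,
  production_company: {
    name: "Paramount Pictures",
    city: "Los Angeles",
    state: "CA"
  },
  cast: [
    { character: "Forrest Gump", person: "Tom Hanks" },
    { character: "Jenny Curran", person: "Robin Wright" }
  ]
};

// Dot notation reaches into the nested production company object.
var companyCity = movie.production_company.city;

// Arrays are ordered, so a cast member can be addressed by position.
var firstActor = movie.cast[0].person;

// The whole array can also be transformed in one pass.
var actorNames = movie.cast.map(function (member) {
  return member.person;
});

// Serializing to JSON text and parsing it back preserves the structure.
var roundTripped = JSON.parse(JSON.stringify(movie));

console.log(companyCity, firstActor, actorNames);
```

The same parent object, nested object, and array that appear in the JSON text are directly addressable in code, which is the property the document model relies on.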

MongoDB's use of JSON pairs each key with a complex data structure known as a document, and these documents can contain many different key-value pairs, key-array pairs, or even nested documents. MongoDB document-oriented data models enable the following benefits over relational databases:

  • It provides speedy and easy horizontal scaling by auto-sharding and grouping related data together in document collections instead of separate database tables that require joins to pull the data back together. The multitable joins of relational database management systems (RDBMS) reduce performance and make horizontal scaling more difficult.

  • It provides faster and easier application development by providing a data model that maps to native programming language objects. JSON's universal data structures make this possible and are supported by virtually any modern programming language. This makes MongoDB popular among developers because it permits a one-to-one mapping between object-oriented software objects and database entities. It also makes data interchange between software and databases easier.

  • It provides a dynamic schema that makes it easier than enforced RDBMS schemas to manage and evolve your data model. MongoDB allows for insertion of data without a predefined database schema. This gives software developers more flexibility to define and manipulate the database schema instead of relying on a separate database administrator to maintain schema changes.
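The dynamic schema and embedded-document points above can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions: a simple array stands in for a MongoDB collection, the helper name moviesWithPerson is hypothetical, and the third document is made up to show a record with almost no fields.

```javascript
// A plain array stands in for a MongoDB collection (illustration only).
var movies = [];

// Documents in the same collection do not need to share the same fields.
movies.push({
  movie: "Forrest Gump",
  rating: "PG-13",
  cast: [
    { character: "Forrest Gump", person: "Tom Hanks" },
    { character: "Jenny Curran", person: "Robin Wright" }
  ]
});
movies.push({
  movie: "Apollo 13",
  rating: "PG",
  release_year: 1995, // a new field, added without any schema change
  cast: [{ character: "Jim Lovell", person: "Tom Hanks" }]
});
movies.push({ movie: "Untitled Draft" }); // made-up record with no cast at all

// Because cast members are embedded in each movie document, finding the
// movies for a person needs no join against a separate cast table.
function moviesWithPerson(collection, person) {
  return collection
    .filter(function (doc) {
      return (doc.cast || []).some(function (member) {
        return member.person === person;
      });
    })
    .map(function (doc) {
      return doc.movie;
    });
}

console.log(moviesWithPerson(movies, "Tom Hanks"));
```

In a relational design, the cast would typically live in its own table and be joined back to the movie on every read; here, the related data travels with the document.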

 

Pentaho technology overview


Pentaho software consists of a suite of analytics products called Pentaho Business Analytics, providing a complete analytics software platform. This end-to-end solution includes data integration, metadata, reporting, OLAP analysis, ad-hoc query, dashboards, and data mining capabilities. The platform is available in two offerings: a community edition (CE) and an enterprise edition (EE). We will focus on the enterprise edition for this book because Instaview is included only in this edition of Pentaho.

Throughout the book, you will see that we group and refer to the platform in two major categories, Pentaho Data Integration (PDI) and Pentaho Business Analytics (BA). Even though we refer to PDI and BA as separate server categories, Pentaho tightly couples the PDI server with the BA server into a single platform offering. This unique approach helps companies solve their data integration challenges with multiple, diverse data sources including Big Data sources and instantly gain insight into business analytics for a broad set of users.

PDI gives users a graphical user interface to a parallel processing ETL engine to solve data integration challenges. The user interface reduces data integration complexity by eliminating the need to code data extractions, data transformations, and data loads. Some additional PDI benefits include:

  • Broad connectivity to any type of data source including native support for Big Data sources such as Hadoop, NoSQL, and analytic databases

  • Integrated self-service Big Data analytics with Instaview—a utility that simplifies Big Data connectivity and bundles the Pentaho OLAP interface into PDI

  • An open, pluggable Java architecture that makes it easy to develop plugins to extend the platform

  • A parallel processing engine that can be dynamically scaled across multiple servers in a cluster

BA provides web-based interfaces to create business models and interactive reports as well as analysis views and dashboards. The focus is on ease-of-use while providing a complete set of reporting and analysis capabilities that include the following web-based components:

  • Interactive Reporting: Relational ad hoc queries and basic tabular, parameterized reporting

  • Analyzer: OLAP interface for analysis and visualization

  • Dashboard Designer: Easy-to-use, interactive dashboard creation

It also includes the following client-based components:

  • Metadata Editor: Developer interface for metadata modeling

  • Schema Workbench: Developer interface to model OLAP cubes

  • Report Designer: Advanced report development to build any type of parameterized report

As mentioned earlier, Pentaho is an early mover into the Big Data space as the first major BI vendor to extend its analytics platform with Big Data capabilities in May 2010. The partnership with MongoDB is one of the first few Big Data partnerships for Pentaho, and since then, Pentaho has continued to deliver MongoDB innovations. The Pentaho-MongoDB solution covers the entire Big Data life cycle from data extraction and preparation to data discovery, which we will explore throughout this book. Now that we have reviewed both technologies, it is time to install them on your computer.

 

Installing MongoDB


Installing MongoDB on Windows or Linux is quick, easy, and well-documented in the MongoDB manual at http://docs.mongodb.org/manual/. We will install it on one Windows computer and restore the provided sample database for development purposes only. The database does not enable any authentication by default, so before you load any other personal or confidential data into MongoDB or consider it for your production environment, you will want to read about recommended security practices at http://docs.mongodb.org/manual/core/security.

Note

MongoDB has not supported Windows XP since the release of version 2.2, so you will need a more recent version of Windows available for this install. Also, while we will not be using large sample MongoDB datasets for this book, it is recommended that you download the 64-bit version if your machine is 64-bit compliant, because the 32-bit version is limited to 2 GB of data.

The first thing we need to do is download the latest production release of the software. Included in the download are the MongoDB server, MongoDB shell, backup and restoration tools, import and export tools, and GridFS to manage large files that exceed the document size limit of 16 MB. MongoDB is self-contained without other system dependencies, so you can install and run MongoDB from any folder you want. Perform the following steps to set up and run the MongoDB server via the command prompt and set up MongoDB as a Windows service:

  1. Download the Windows 64-bit production release from www.mongodb.org/downloads.

  2. Extract the downloaded archive to C:\.

    Note

    Your folder should look like the following with [version] being the version you downloaded: C:\mongodb-win32-x86_64-[version].

  3. Open a new command prompt window using administrator rights by right-clicking on the command prompt program and selecting Run as Administrator.

  4. Within the command prompt, issue the following commands:

    cd \
    move C:\mongodb-win32-* C:\mongodb
    
  5. Create the required data folder using the command prompt for MongoDB to store its files:

    md data
    md data\db
    
  6. Start the MongoDB database process from the command prompt:

    C:\mongodb\bin\mongod.exe
    

    Note

    The mongod.exe console window output will display waiting for connections if mongod.exe is running correctly. MongoDB also provides an HTTP interface for administrators, which can be accessed in a browser at http://localhost:28017.

  7. Establish a connection to MongoDB using the mongo.exe shell, which will connect to the default test database on localhost using the default port of 27017:

    C:\mongodb\bin\mongo.exe
    

    Note

    The mongo.exe console window output will display MongoDB shell version:[current version] connected to: test if mongo.exe is running correctly.

  8. Insert a document in the test collection of the test database, and then retrieve it using the find command:

    db.test.save( {abc: 1 } )
    db.test.find ()
    

    Note

    The mongo.exe console window will display the document you just created because it is the only document that exists in the collection, and you did not specify any matching criteria with your find command:

    { "_id" : ObjectId("51daa85f987a76f4380728ea"), "abc" : 1 }

    The _id field is the unique identifier for each document in a collection and serves as the primary key field. If you don't specify the _id field values, MongoDB uses system-generated ObjectIds as the default value. Your ObjectId will differ from the one shown previously.
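A system-generated ObjectId is not purely random: its first four bytes (the first eight hexadecimal characters) hold the creation time in seconds since the Unix epoch. The following is a minimal sketch in plain JavaScript (Node.js assumed) that decodes that timestamp from the ObjectId string shown in step 8. The function name objectIdToDate is ours; inside the mongo shell, the built-in getTimestamp() method on an ObjectId serves the same purpose.

```javascript
// Decode the creation time stored in the leading 4 bytes of an ObjectId.
// hexString is the 24-character hexadecimal form of the ObjectId.
function objectIdToDate(hexString) {
  var seconds = parseInt(hexString.substring(0, 8), 16); // seconds since epoch
  return new Date(seconds * 1000); // JavaScript dates use milliseconds
}

var created = objectIdToDate("51daa85f987a76f4380728ea");
console.log(created.toISOString());
```

For the ObjectId shown previously, this yields a creation time in July 2013. Your own ObjectId will decode to the moment you ran the save command.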

Installing MongoDB as a Windows service

Set up MongoDB as a Windows service to start automatically on reboots:

  1. Open a new command prompt window using administrator rights by right-clicking on the command prompt program and selecting Run as Administrator.

  2. Create a folder for the database logfiles and a configuration file for logging by entering the following commands:

    md C:\mongodb\log
    echo logpath = C:\mongodb\log\mongo.log > C:\mongodb\mongodb.cfg
    
  3. Install the MongoDB service:

    C:\mongodb\bin\mongod.exe --config C:\mongodb\mongodb.cfg --install
    
  4. Start the MongoDB service:

    net start MongoDB
    

    Tip

    Use the following command to stop the MongoDB service:

    net stop MongoDB

    Also, use the following command to remove the MongoDB service:

    C:\mongodb\bin\mongod.exe --remove

    If you prefer to use a GUI interface to interact with your MongoDB instance, MongoVUE is one of many third-party options. MongoVUE is a MongoDB desktop application for Windows OS that makes it easier to manage and query your MongoDB databases. You can download a free limited version at http://www.mongovue.com/.

Restoring the sample clickstream MongoDB database

The sample clickstream database used throughout this book contains web session information along with user clickstream events that occur within that session. The session is the header record that captures information about a web user's IP address, browser, referring URL, session date, and session length. The following is a sample JSON text output of a single session record containing the seven fields available in the collection:

{
  "_id" : ObjectId("512d200e223e7f294d13d44c"),
  "id_session" : "71E1FF1A25B84045864FCEC447F8D012",
  "ip_address" : "47.54.245.196",
  "browser" : "Safari",
  "date_session" : ISODate("2013-01-01T17:36:32Z"),
  "duration_session" : 7.43,
  "referring_url" : "www.123.com"
}

One or more clickstream events can occur within a single session. These events include visited site, signup newsletter, watched video, completed lead form, and added item to cart. The event records link to the session records through the session's unique identifier, id_session. The following is an example of the JSON text output of a single event record, including its id_session field:

{
  "_id" : ObjectId("512d1ff2223e7f294d13c0c4"),
  "id_session" : "310FA578571D4D90847E33E8B894703D",
  "event" : "Visited Site"
}
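The link between the two record types can be sketched in plain JavaScript (Node.js assumed). The records below are hypothetical and shortened, shaped like the sample session and event documents but not taken from the sample database, and the function name attachEvents is ours. The sketch groups events by id_session and attaches them to their session.

```javascript
// Hypothetical session records (shortened id_session values for readability).
var sessions = [
  { id_session: "S1", browser: "Safari", duration_session: 7.43 },
  { id_session: "S2", browser: "Chrome", duration_session: 2.1 }
];

// Hypothetical event records; each one points at a session by id_session.
var events = [
  { id_session: "S1", event: "Visited Site" },
  { id_session: "S1", event: "Watched Video" },
  { id_session: "S2", event: "Visited Site" }
];

// Group the event names by id_session, then attach them to each session.
function attachEvents(sessionList, eventList) {
  var bySession = {};
  eventList.forEach(function (e) {
    if (!bySession[e.id_session]) {
      bySession[e.id_session] = [];
    }
    bySession[e.id_session].push(e.event);
  });
  return sessionList.map(function (s) {
    return {
      id_session: s.id_session,
      browser: s.browser,
      events: bySession[s.id_session] || [] // sessions without events get []
    };
  });
}

var joined = attachEvents(sessions, events);
console.log(JSON.stringify(joined, null, 2));
```

This is the kind of lookup that is needed whenever session attributes such as the browser are analyzed alongside event counts.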

Now, let's restore the sample clickstream MongoDB database provided with this book. You will need to download the database file and perform the following steps to restore the database from the command line using the mongorestore program:

  1. Download both the Pentaho database ZIP and zip_codes.csv files from http://www.packtpub.com/support.

  2. Extract the downloaded ZIP file to C:\mongodb\bin\dump.

    Note

    Your database directory should look like the following:

    C:\mongodb\bin\dump\pentaho

  3. Open a new Windows command prompt and issue the following command:

    C:\mongodb\bin\mongorestore /mongodb/bin/dump/pentaho
    
 

Installing Pentaho


Installing Pentaho on Windows, Linux, or Mac OS is quick and easy with the graphical installer option. This option, by default, installs the entire BA platform, PostgreSQL solution database, Tomcat application server, and Sun JRE needed to provide a completely functioning analytics server. Typically, you would install the server components on a server and the client components on your workstation, but for development and training purposes, we are going to install all of the server and client components on a single Windows computer. The following list shows all of the platform components that will be installed on your computer:

  • Client tools: Aggregation Designer, Design Studio, Metadata Editor, Pentaho Data Integration, Report Designer, and Schema Workbench

  • Web tools: Analyzer, Dashboard Designer, Interactive Reporting, Enterprise Console, and Mobile

  • BA server: Enterprise Console and User Console

  • DI server: Data Integration Server

  • Others: Tomcat App Server, PostgreSQL Solution DB, and Sun JRE

Perform the following steps to download and install Pentaho on your Windows computer:

  1. Check to make sure your computer meets the minimum hardware requirements to run Pentaho server components at http://infocenter.pentaho.com/help/topic/supported_components/reference_supported_components.html.

  2. If your computer is 64-bit compliant, download Pentaho Business Analytics 64-bit Windows version to your computer from http://www.pentaho.com/download.

    Note

    You will be asked to register your contact information for a 30-day evaluation of the Pentaho Enterprise Edition. Your download will begin after registration. The downloaded filename should look like the following with [version] being the version you downloaded:

    pentaho-business-analytics-[version]-x64.exe

  3. Double-click on the downloaded executable file, pentaho-business-analytics-[version]-x64.exe, to run the installer program, which will launch a Pentaho splash screen.

  4. When prompted to choose a setup type, select Default, which will install the entire suite of server and client components to a default directory of C:\Program Files\Pentaho.

  5. You will be prompted to enter a single master password for the PostgreSQL solution repository, the Enterprise Console, and the publish-to-web feature. Enter your chosen password and click on Next.

    Note

    When the installation wizard completes, the BA and DI servers should be running on the following default ports:

    • 5432: PostgreSQL server

    • 8080: BA server Tomcat web server startup port

    • 9080: DI server port

    If these ports are not available during the installation, Pentaho will automatically increment the port number by one until an available port is found.

  6. Upon wizard completion, select the option of starting the Pentaho User Console, which will launch the PUC login screen.

  7. Click on the Login as an Evaluator link, which will use the default administrator login with admin as the username and password as the password, and then click on the Go button to access Pentaho.
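The port fallback described in the note under step 5 (incrementing by one until a free port is found) can be sketched as follows. This is plain JavaScript for Node.js and is only our illustration of the rule, not the installer's actual code; the function name firstFreePort and the list of busy ports are hypothetical.

```javascript
// Return the first port at or above `preferred` that is not in `portsInUse`.
function firstFreePort(preferred, portsInUse) {
  var port = preferred;
  while (portsInUse.indexOf(port) !== -1) {
    port += 1; // try the next port number
    if (port > 65535) {
      throw new Error("no free port available above " + preferred);
    }
  }
  return port;
}

// Hypothetical machine where another web server already occupies 8080 and 8081.
var busyPorts = [8080, 8081];

var baPort = firstFreePort(8080, busyPorts); // BA server would move up
var diPort = firstFreePort(9080, busyPorts); // DI server keeps its default

console.log("BA server port:", baPort, "DI server port:", diPort);
```

If the User Console does not open on port 8080 after installation, this rule suggests checking the next few port numbers.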

 

Summary


Congratulations! You should have a working installation of Pentaho, which installs the BA and DI servers as Windows system services that will start automatically the next time you boot your computer. Please refer to the official Pentaho Installation Guide at http://infocenter.pentaho.com for detailed installation information. In the next chapter, we will explore the MongoDB data model and establish a connection from Pentaho to the sample MongoDB database.

About the Author
  • Bo Borland

    Bo Borland is the vice president of field technical sales at Pentaho, a leading Big Data analytics software provider. He has a passion for building teams and helping companies improve performance with data analytics. Prior to joining Pentaho, Bo worked as a management consultant at Deloitte and as a solution architect at Cognos, an IBM company. He founded a successful analytics consulting company, Management Signals, which merged with Pentaho in 2012. His 14 years' experience in professional analytics includes roles in management, sales, consulting, and sales engineering.

    Pentaho Corporation is a leading Big Data analytics company headquartered in Orlando, FL, with offices in San Francisco, London, and Portugal. Pentaho tightly couples data integration with full business analytics capabilities into a single, integrated software platform. The company's subscription-based software offers SMEs and global enterprises the ability to reduce the time taken to design, develop, and deploy Big Data analytics solutions.
