MongoDB data modeling

by Bo Borland | February 2014 | Open Source

This article has been prepared by Bo Borland, the author of Pentaho Analytics for MongoDB.


MongoDB data modeling is an advanced topic, so rather than cover it exhaustively, we will focus on a few key points so that you understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models.

The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized models embed related data together in the same document, while normalized models represent the same records with references between documents. This decision affects how you extract and query data with Pentaho, because the normalized approach requires joining documents outside of MongoDB.

With normalized models, Pentaho Data Integration is used to resolve the reference joins.
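The reference join that Pentaho Data Integration resolves can be sketched in plain Python. This is a minimal illustration of the concept, not PDI itself; the collection contents are shown as ordinary lists of dictionaries using the field names from the sample collections described below:

```python
# Minimal sketch of the client-side join needed for the normalized
# model: events are matched to their parent session by the shared
# id_session reference field.
sessions = [
    {"_id": "512d200e223e7f294d13607",
     "id_session": "DF636FB593684905B335982BEA3D967B",
     "browser": "Chrome"},
]
events = [
    {"_id": "512d1ff2223e7f294d13c0c6",
     "id_session": "DF636FB593684905B335982BEA3D967B",
     "event": "Signup Free Offer"},
]

def join_sessions_events(sessions, events):
    """Return one flat row per (session, event) pair."""
    # Index events by their reference key for an efficient lookup.
    by_session = {}
    for e in events:
        by_session.setdefault(e["id_session"], []).append(e)
    rows = []
    for s in sessions:
        for e in by_session.get(s["id_session"], []):
            rows.append({**s, "event": e["event"]})
    return rows

rows = join_sessions_events(sessions, events)
print(rows[0]["event"])  # Signup Free Offer
```

In a real Pentaho transformation, this same matching is performed by PDI steps rather than hand-written code; the point is simply that the join happens outside of MongoDB.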

The following table lists the objects that form the fundamental building blocks of a MongoDB database, along with the associated sample object names in the pentaho database:

  MongoDB object    Sample MongoDB object names
  Database          pentaho
  Collection        sessions, events, and sessions_events
  Document          the sessions, events, and sessions_events documents listed below

The sessions document fields (key-value pairs) are as follows:

  • _id: 512d200e223e7f294d13607
  • ip_address: 235.149.145.57
  • id_session: DF636FB593684905B335982BEA3D967B
  • browser: Chrome
  • date_session: 2013-02-20T15:47:49Z
  • duration_session: 18.24
  • referring_url: www.xyz.com

The events document fields (key-value pairs) are as follows:

  • _id: 512d1ff2223e7f294d13c0c6
  • id_session: DF636FB593684905B335982BEA3D967B
  • event: Signup Free Offer

The sessions_events document fields (key-value pairs) are as follows:

  • _id: 512d200e223e7f294d13607
  • ip_address: 235.149.145.57
  • id_session: DF636FB593684905B335982BEA3D967B
  • browser: Chrome
  • date_session: 2013-02-20T15:47:49Z
  • duration_session: 18.24
  • referring_url: www.xyz.com
  • event_data: [array]
    • event: Visited Site
    • event: Signup Free Offer

This table shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection.

Normalized models

Because multiple clickstream events can occur within a single web session, we know that sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a lead form. These four events would always appear within the context of a single session and would share the same id_session reference.

The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure:

Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red.
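In code terms, the normalized result can be sketched as one session dictionary plus four event dictionaries that all carry the same id_session value (the event names follow the example above; all field values are illustrative):

```python
# One new session document plus four related event documents,
# linked by the shared id_session reference field.
id_session = "DF636FB593684905B335982BEA3D967B"

session = {"id_session": id_session,
           "browser": "Chrome",
           "duration_session": 18.24}

event_names = ["Visited Site", "Watched Video",
               "Signup Free Offer", "Completed Lead Form"]
events = [{"id_session": id_session, "event": name}
          for name in event_names]

# Every event shares its parent session's reference key.
assert all(e["id_session"] == session["id_session"] for e in events)
print(len(events))  # 4
```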

This normalized model would be an efficient data model if we expected the number of events per session to be very large, for two reasons. First, MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents. Second, query performance can be negatively impacted by large data arrays that contain thousands of event values. Neither is a concern for the clickstream dataset, because the number of events per session is small.

Denormalized models

The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside of a single session document. Embedding is accomplished by declaring a field to hold either an array of values or embedded documents known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within the sessions_events collection as shown in the following figure:
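Building the embedded structure can be sketched in plain Python by copying the session fields and moving the related events into an event_data array, mirroring the sessions_events document shown earlier (field values are illustrative):

```python
# Start from the normalized pieces: a session and its related events.
session = {"_id": "512d200e223e7f294d13607",
           "id_session": "DF636FB593684905B335982BEA3D967B",
           "browser": "Chrome"}
events = [{"event": "Visited Site"},
          {"event": "Signup Free Offer"}]

# Denormalize: copy the session fields and embed the event values
# as subdocuments in an event_data array on the same document.
sessions_events = dict(session)
sessions_events["event_data"] = [{"event": e["event"]} for e in events]

print(len(sessions_events["event_data"]))  # 2
```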

As you can see, we have the choice to keep the session and event data in separate collections, or alternatively, store both datasets inside a single collection. One important rule to keep in mind when you consider the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/.
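In MongoDB itself, accessing documents by array contents uses operators such as $elemMatch or dot notation (for example, a filter like {"event_data.event": "Signup Free Offer"}). The membership test those operators perform can be sketched in plain Python against the embedded model; the helper name here is our own, not a MongoDB API:

```python
# Two embedded session documents; only one contains the target event.
docs = [
    {"id_session": "DF636FB593684905B335982BEA3D967B",
     "event_data": [{"event": "Visited Site"},
                    {"event": "Signup Free Offer"}]},
    {"id_session": "A1B2C3D4",
     "event_data": [{"event": "Visited Site"}]},
]

def sessions_with_event(docs, name):
    """Return documents whose event_data array contains the named event,
    mimicking a query on the contents of an embedded array."""
    return [d for d in docs
            if any(e["event"] == name for e in d.get("event_data", []))]

matches = sessions_with_event(docs, "Signup Free Offer")
print(len(matches))  # 1
```

Because the events live inside the session document, a single such filter answers a question that would otherwise require a client-side join between two collections.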

To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB:

  • The MongoDB query language does not support joins between collections
  • The maximum document size is 16 megabytes
  • Very large data arrays can negatively impact query performance

In our sample database, the number of clickstream events per session is expected to be small—within a modest range of only 1 to 20 per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, that can handle both normalized and denormalized data models.

Summary

This article expands on the topic of data modeling and explains MongoDB database concepts essential to querying MongoDB data with Pentaho.

About the Author


Bo Borland

Bo Borland is the vice president of field technical sales at Pentaho, a leading Big Data analytics software provider. He has a passion for building teams and helping companies improve performance with data analytics. Prior to joining Pentaho, Bo worked as a management consultant at Deloitte and as a solution architect at Cognos, an IBM company. He founded a successful analytics consulting company, Management Signals, which merged with Pentaho in 2012. His 14 years' experience in professional analytics includes roles in management, sales, consulting, and sales engineering.

Pentaho Corporation is a leading Big Data analytics company headquartered in Orlando, FL, with offices in San Francisco, London, and Portugal. Pentaho tightly couples data integration with full business analytics capabilities into a single, integrated software platform. The company's subscription-based software offers SMEs and global enterprises the ability to reduce the time taken to design, develop, and deploy Big Data analytics solutions.
