(For more resources related to this topic, see here.)
MongoDB data modeling is an advanced topic. We will instead focus on a few key points regarding data modeling so that you will understand the implications for your report queries. These points touch on the strengths and weaknesses of embedded data models.
The most common data modeling decision in MongoDB is whether to normalize or denormalize related collections of data. Denormalized MongoDB models embed related data together in the same database documents while normalized models represent the same records with references between documents. This modeling decision will impact the method used to extract and query data using Pentaho, because the normalized method requires joining documents outside of MongoDB.
With normalized models, Pentaho Data Integration is used to resolve the reference joins.
The following table provides a list of objects that form the fundamental building blocks of a MongoDB database and the associated object names in the pentaho database.
Sample MongoDB object names
sessions, events, and sessions_events
The sessions document fields (key-value pairs) are as follows:
The events document fields (key-value pairs) are as follows:
The sessions_events document fields (key-value pairs) are as follows:
This table shows three collections in the pentaho database: sessions, events, and sessions_events. The first two collections, sessions and events, illustrate the concept of normalizing the clickstream data by separating it into two related collections with a reference key field in common. In addition to the two normalized collections, the sessions_events collection is included to illustrate the concept of denormalizing the clickstream data by combining the data into a single collection.
Because multiple clickstream events can occur within a single web session, we know that sessions have a one-to-many relationship with events. For example, during a single 20-minute web session, a user could invoke four events by visiting the website, watching a video, signing up for a free offer, and completing a lead form. These four events would always appear within the context of a single session and would share the same id_session reference.
The data resulting from the normalized model would include one new session document in the sessions collection and four new event documents in the events collection, as shown in the following figure:
Each event document is linked to the parent session document by the shared id_session reference field, whose values are highlighted in red.
This normalized model would be an efficient data model if we expect the number of events per session to be very large for a couple of reasons. The first reason is that MongoDB limits the maximum document size to 16 megabytes, so you will want to avoid data models that create extremely large documents. The second reason is that query performance can be negatively impacted by large data arrays that contains thousands of event values. This is not a concern for the clickstream dataset, because the number of events per session is small.
The one-to-many relationship between sessions and events also gives us the option of embedding multiple events inside of a single session document. Embedding is accomplished by declaring a field to hold either an array of values or embedded documents known as subdocuments. The sessions_events collection is an example of embedding, because it embeds the event data into an array within a session document. The data resulting from our denormalized model includes four event values in the event_data array within the sessions_events collection as shown in the following figure:
As you can see, we have the choice to keep the session and event data in separate collections, or alternatively, store both datasets inside a single collection. One important rule to keep in mind when you consider the two approaches is that the MongoDB query language does not support joins between collections. This rule makes embedded documents or data arrays better for querying, because the embedded relationship allows us to avoid expensive client-side joins between collections. In addition, the MongoDB query language supports a large number of powerful query operators for accessing documents by the contents of an array. A list of query operators can be found on the MongoDB documentation site at http://docs.mongodb.org/manual/reference/operator/.
To summarize, the following are a few key points to consider when deciding on a normalized or denormalized data model in MongoDB:
- The MongoDB query language does not support joins between collections
- The maximum document size is 16 megabytes
- Very large data arrays can negatively impact query performance
In our sample database, the number of clickstream events per session is expected to be small—within a modest range of only one to 20per session. The denormalized model works well in this scenario, because it eliminates joins by keeping events and sessions in a single collection. However, both data modeling scenarios are provided in the pentaho MongoDB database to highlight the importance of having an analytics platform, such as Pentaho, to handle both normalized and denormalized data models.
This article expands on the topic of data modeling and explains MongoDB database concepts essential to querying MongoDB data with Pentaho.
Resources for Article:
- Installing Pentaho Data Integration with MySQL [article]
- Integrating Kettle and the Pentaho Suite [article]
- Getting Started with Pentaho Data Integration [article]