Data modeling is a subject that has been discussed for a long time. Hence, various authors on the subject might have different views. Not so long ago, when the main discussions were focused on relational databases, data modeling was part of the process of data discovery and analysis in a domain. It was a holistic vision, where the final goal was to have a robust database able to support any kind of application.
Due to the flexibility of NoSQL databases, data modeling has become an inside-out process: you need to understand an application's needs and performance characteristics beforehand in order to arrive at a good data model.
In this chapter, we will provide a brief history of the data modeling process over the years, showing you important concepts. We are going to cover the following topics:
The relationship between MongoDB and NoSQL
Now, on Google Trends, a tool that shows how often a term is searched relative to all searched terms globally, we can see that the growth of interest in both subjects is quite similar:
Since its first open source release in 2009 by a company named 10gen, MongoDB has been the choice of many players on the Web and, according to DB-Engines (http://db-engines.com/en/), it became the fourth most popular database and the most popular NoSQL database system.
10gen changed its name to MongoDB Inc. on August 27, 2013, showing that all eyes were on MongoDB and its ecosystem. The shift to an open source project was crucial in this change process, especially since community adoption has been tremendous.
According to Dwight Merriman, the current chairman and co-founder of MongoDB:
"Our open source platform has resulted in MongoDB being downloaded 8 million times within the five years since the project has been available; that's an extremely fast pace for community adoption."
Furthermore, MongoDB Inc. launched products and services to support this community and enrich the MongoDB ecosystem. Among them are:
Although the concept is new, NoSQL is a highly controversial subject. If you search widely, you may find many different explanations. As we do not intend to create a new one, let's take a look at the most commonly used explanation.
The term NoSQL, as we know it today, was introduced by Eric Evans after a meetup organized by Johan Oskarsson from Last.fm.
Indeed, Oskarsson and everyone else who joined that historical meeting in San Francisco, on June 11, 2009, were already discussing many of the databases that today we call NoSQL databases, such as Cassandra, HBase, and CouchDB. As Oskarsson had described, the meeting was about open source, distributed, non-relational databases, for anyone who had "... run into limitations with traditional relational databases ...," with the aim of "... figuring out why these newfangled Dynamo clones and BigTables have become so popular lately."
Four months later, Evans wrote in his weblog that, despite the growth of the NoSQL movement and everything that was being discussed, he thought they were going nowhere. However, Emil Eifrem, the Neo4J founder and CEO, was right to read the term as "Not Only SQL."
More important than giving a definition to the term NoSQL, all these events were a starting point from which to discuss what NoSQL really is. Nowadays, there seems to be a general understanding that NoSQL was born as a response to the problems that relational databases were not designed to address.
Notably, we can now distinguish the problems that information systems have had to solve from the 1970s up until today. Back then, monolithic architectures were enough to meet demand, unlike what we observe nowadays.
Have you ever stopped to think how many websites, such as social networks, e-mail providers, streaming services, and online games, you already have an account with? And, how many devices inside your house are connected to the Internet right now?
Do not worry if you cannot answer the preceding questions precisely. You are not alone. Each new study shows that the number of users with Internet access around the globe keeps increasing, and that mobile access accounts for a growing share of it.
This means that a large volume of unstructured or semi-structured data is generated every second, everywhere. The amount of data cannot be estimated in advance, since the user is the main source of information. Thus, it is getting more and more difficult to predict when or why this volume will vary. It only takes an unpredictable event somewhere in the world (such as a goal being scored, a general strike, a mass demonstration, or a plane crash) to cause a variation in traffic and, consequently, a growth in user-generated content.
In response to this, the development of NoSQL technology brought a variety of different approaches.
As previously stated, Amazon Inc. and Google are at the forefront of NoSQL development with the help of Amazon DynamoDB and Google BigTable. Because of the diversity of styles, we have new types of NoSQL databases that are developed all the time. However, four basic types, based on data model, are known: key-value stores, wide-column stores, document databases, and graph databases, which are explained as follows:
Key-value stores: The key-value model is one of the simplest and most straightforward data models, where each record is stored as a key together with its value. Examples of key-value stores are Amazon Dynamo, Riak, and Redis.
Wide-column stores: Conceptually, these are the closest to relational databases, since their data is represented in a table. Nevertheless, the database stores columns of data instead of rows. Examples of wide-column stores are Google BigTable, Cassandra, and HBase.
Document databases: As the name suggests, the main concept of this data model is the document. Documents are complex structures that store data as key-values, and can contain many key-value pairs, key-array pairs, or even nested documents. Examples of document databases are MongoDB, Apache CouchDB, and Amazon SimpleDB.
Graph databases: Graph databases are the best way to store items of data whose relationships are best represented as graphs, such as network topologies and social networks. Nodes, edges, and properties are the structure of stored data. Examples of graph databases are Neo4J and HyperGraphDB.
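To make the four data models a little more concrete, the following sketch represents a tiny dataset in each shape using plain Python structures. It is only an illustration: the user names, field names, and the followed_by helper are invented for this example, and no real database product stores its data exactly like this.

```python
import json

# Key-value store: an opaque value looked up by a single key.
# The store itself does not interpret the value (here, a JSON string).
key_value = {
    "user:1": '{"name": "Ana", "city": "Lisbon"}',
}

# Wide-column store: row key -> column family -> column -> value.
wide_column = {
    "user:1": {
        "profile": {"name": "Ana", "city": "Lisbon"},
        "stats": {"logins": 42},
    },
}

# Document database: key-value pairs, key-array pairs, and nested documents.
document = {
    "_id": 1,
    "name": "Ana",
    "address": {"city": "Lisbon"},
    "tags": ["admin", "editor"],
}

# Graph database: nodes, edges, and properties on both.
nodes = {1: {"name": "Ana"}, 2: {"name": "Bruno"}}
edges = [(1, "FOLLOWS", 2, {"since": 2014})]


def followed_by(user_id):
    """Return the names of the users that user_id follows (hypothetical helper)."""
    return [
        nodes[dst]["name"]
        for src, label, dst, _props in edges
        if src == user_id and label == "FOLLOWS"
    ]


print(json.loads(key_value["user:1"])["name"])
print(document["address"]["city"])
print(followed_by(1))
```

Notice that only the application knows how to decode the key-value entry, while the document and graph shapes expose their structure directly.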
Although, as explained earlier, NoSQL database types are based on different data models, they have some common features.
In order to support unstructured or semi-structured data, NoSQL databases have no predefined schema. The dynamic schema makes real-time changes simpler when inserting new data, and more cost-effective when data migration is needed.
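As a rough illustration of what "no predefined schema" means in practice, the sketch below uses a plain Python list as a stand-in for a collection. The insert and fields_in_use functions are hypothetical helpers written for this example, not part of any database driver.

```python
# A plain list stands in for a schemaless collection (assumption for this sketch).
collection = []


def insert(doc):
    """Store a copy of the document without checking it against any schema."""
    collection.append(dict(doc))


# Two records with different shapes can live side by side.
insert({"name": "Ana", "email": "ana@example.com"})
insert({"name": "Bruno", "phones": ["555-0100"], "address": {"city": "Porto"}})


def fields_in_use(docs):
    """List every top-level field that appears in at least one document."""
    return sorted({key for doc in docs for key in doc})


print(fields_in_use(collection))
```

Adding the phones and address fields required no migration of the first record, which is the flexibility the paragraph above describes. The trade-off is that the application has to cope with documents that lack a given field.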
To handle an unpredictable, large volume of data, NoSQL databases use auto-sharding to scale horizontally and ensure continuous availability of data. Auto-sharding allows users to automatically spread data and traffic across a number of servers.
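The idea of spreading data across servers can be sketched with a simple hash-based router. This is a deliberately naive example under stated assumptions: the shard names are invented, and real systems typically use range-based or more elaborate hashing schemes and also rebalance data when servers are added or removed.

```python
import hashlib

# Hypothetical shard names, used only for this illustration.
SHARDS = ["shard-a", "shard-b", "shard-c"]


def shard_for(key, shards=SHARDS):
    """Pick a shard deterministically from a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return shards[int(digest, 16) % len(shards)]


def distribute(keys, shards=SHARDS):
    """Group keys by the shard that would own them."""
    placement = {shard: [] for shard in shards}
    for key in keys:
        placement[shard_for(key, shards)].append(key)
    return placement


placement = distribute([f"user:{i}" for i in range(100)])
for shard, owned in placement.items():
    print(shard, len(owned))
```

Because the shard is derived from the key alone, any client can locate a record without consulting a central lookup table. A modulo scheme like this one would move most keys if the number of shards changed, which is one reason production systems use more careful strategies.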
NoSQL databases also support replication natively, which gives you high availability and recovery in a quick and easy way. As we distribute our data more and our recovery strategies change, we may fine-tune our consistency levels.
Before I started to write this chapter (or maybe before beginning to write this book), I thought about how to deal with this subject. Firstly, because I would guess this was one of your expectations. Secondly, because this subject is present in almost all of the literature, and I do not intend to inflame the discussion.
The truth is that the discussion comes down to theory versus practice and, until now, I have favored the practical side. Therefore, I investigated and searched many different sources where I could read more about the subject, hoping to bring to this book a summary of everything that has been written on it so far.
Much of what I found at the beginning of my research showed a clear separation between database design and data modeling. However, in the end, my conclusion was that the two concepts have more similarities than differences. To reach this conclusion, I took as a starting point a fact mentioned by C.J. Date in An Introduction to Database Systems, Pearson Education.
In it, C.J. Date says that he prefers not to use the term data modeling because it could be taken to refer to the term data model, and this relation may cause some confusion. He reminds us that the term data model has two meanings in the literature. The first is that a data model is a model of data in general; the second is that a data model is a model of persistent data relating to a specific enterprise. Date has chosen the first definition in his book.
As C.J. Date stated:
"We believe that the right way to do database design in a nonrelational system is to do a clean relational design first, and then, as a separate and subsequent step, to map that relational design into whatever nonrelational structures (for example, hierarchies) the target DBMS happens to support."
Therefore, talking about database design is a good start. So, C.J. Date adopted the term semantic modeling, or conceptual modeling, and defined this activity as an aid in the process of database design.
If you want to know more, you can find it in An Introduction to Database Systems, 8th Edition, Chapter 14, page 410.
Another important source that I found, which in some way complements C.J. Date's argument, is the set of publications by Graeme Simsion in The Data Administration Newsletter, http://www.tdan.com, and in the book Data Modeling: Theory and Practice, Technics Publications LLC. Graeme Simsion is a data modeler, author of two data modeling books, and a researcher at Melbourne University.
In the vast majority of these publications, Simsion addresses the database design and data modeling subjects and concludes that data modeling is a discipline of database design and, consequently, that the data model is the single most important component of the design.
We notice that, unlike C.J. Date, Graeme Simsion uses the term data modeling.
In one of these publications, Simsion brings us an important fact regarding data modeling concepts as a part of the process of database design. He talks about the stages of database design and explains them by using some historical facts and by research conducted with people who are directly involved with data modeling.
From the historic point of view, he mentioned the importance of 3-schema architecture on the evolution of data modeling concepts.
To understand this evolution, we have to go back in time to 1975. In that year, the Standards Planning and Requirements Committee of the American National Standards Institute, also known as the ANSI/SPARC/X3 Study Group on Data Base Management Systems, led by Charles Bachman, published a report proposing a DBMS architecture.
This report introduced an abstract DBMS architecture that would be applicable to any data model, that is, a way of supporting multiple user views and perceptions of the data.
The 3-schema architecture was developed to describe a final product, a database, not the process to design one. However, as previously mentioned, the 3-schema architecture introduced concepts that directly impact on the database design process including data modeling. In the next section we will go through the 3-schema architecture concepts to better understand data modeling concepts.
Hide the physical storage implementation from the user
Ensure that the DBMS will give users access to the same data consistently, which means all users have their own view
Allow the database administrator to change something in the physical level without affecting the user's view
The external level, also known as the user view, details how each particular user sees the database. This level allows each user to view the data in a different way. For that reason, it is also the appropriate level to keep information about a user's specific requirements. The external schema describes how a database is structured for different user views. So, we can have many external schemas for a database.
The conceptual level, despite being considered by many as the most important level, was the last level to arise in the architecture. This level aims to show how a database is logically structured. We can say that it is an abstract view of the data stored within the database.
The conceptual level acts as a layer between the user view and the database implementation. Therefore, in this level, details about the physical implementation and particularities about user views are not considered.
Since the conceptual level offers a global view of the database, the database administrator has an important role at this level of the architecture. It is their responsibility to define the logical structure.
A very interesting thing about the conceptual level is that we have to keep in mind that this level is independent from hardware or software. The conceptual schema defines the logical data structure as well as the relationships between the data in the database.
The internal level represents how the data is stored. This schema defines physical storage structures such as indexes, data fields, and representations. There is only one internal schema for a database, but it is possible that there are many internal schemas for a conceptual schema.
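The separation between the three levels can be sketched in a few lines of Python. This is a toy model written for illustration, not how any DBMS is implemented: the row data, the layout tuples, and the view functions are all assumptions made for this example.

```python
# Internal level: how records are physically stored (here, positional tuples).
internal_layout = ("id", "name", "email", "salary")
internal_rows = [(1, "Ana", "ana@example.com", 5200)]


def conceptual_records(rows, layout):
    """Conceptual level: the logical structure, independent of storage order."""
    return [dict(zip(layout, row)) for row in rows]


def directory_view(records):
    """External level: what a staff directory user sees."""
    return [{"name": r["name"], "email": r["email"]} for r in records]


def payroll_view(records):
    """External level: what a payroll user sees."""
    return [{"id": r["id"], "salary": r["salary"]} for r in records]


# The administrator changes the physical layout...
reordered_layout = ("salary", "id", "email", "name")
reordered_rows = [(5200, 1, "ana@example.com", "Ana")]

# ...and the user views are unaffected.
before = directory_view(conceptual_records(internal_rows, internal_layout))
after = directory_view(conceptual_records(reordered_rows, reordered_layout))
print(before == after)
```

The two view functions read only from the conceptual records, so a change in the storage layout does not reach them. That is the kind of independence between levels that the architecture aims for.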
The introduction of these concepts by Charles Bachman and the ANSI/SPARC/X3 members was very meaningful. They brought a new way to see the database and introduced concepts that helped to develop the data modeling discipline.
As we stated before, data modeling can no longer be seen as a separate process. It is a stage in the database design process and a step that has to be done together with a business analysis. As the final result of the modeling process, we should have the logical data model.
This modeling process raises the controversial question of which approach we use. The core of this discussion deals with what is academic or what we see in practice.
To Matthew West and Julian Fowler, one way to see the modeling process is shown in the following diagram:
Graeme Simsion has an entire article about this discussion. The article shows how the academic view of the modeling process differs from the real-life view. Each gives its own names to the modeling stages, and those names are quite different.
In writing this chapter, I have tried to present not only Simsion's research but also everything I have been through since I started working with information systems, together with extensive research on modeling concepts and the countless views that I have seen in many other sources.
Moreover, as previously stated, and also observed by Simsion, the three-schema ANSI-SPARC architecture played a key role in the formation of the base concepts we have today. With the dissemination of the relational model and the DBMS based on it, the need to support old database architectures such as hierarchical and network-based has passed. Nevertheless, the way we divide the modeling process into two stages, one reflecting concepts very close to user views, followed by an automatic translation to a conceptual schema, remained.
We can say that the stages of the data modeling process we know nowadays came from the 3-schema architecture: not only the concepts, but also the names we use for each stage.
The conceptual model is a map of the entities and relationships with some attributes to illustrate. This is a high-level, abstract view, with the objective of identifying the fundamental concepts, very close to how users perceive the data, not focusing on a particular idea of the business.
If our audience is the business guys, that is the right model. It is frequently used to describe universal domain concepts, and should be DBMS-independent. For instance, we can mention entities such as person, store, product, instructor, student, and course.
Both in academic literature and in practice, a relational notation is widely used to represent the conceptual model, even when the target implementation is not a relational DBMS. Indeed, this is a good approach to follow, as C.J. Date stated.
A common graphical representation of the conceptual model is the popular "crow's foot notation".
It is common to describe business requirements in the logical model. Consequently, at this stage, the data modeler will focus more on the project's scope. Details such as the cardinality and nullability of relationships, and attributes with their data types and constraints, are mapped in this model too. As with the conceptual model, it is common to use a relational notation to represent the logical model. A data modeler has to spend more effort on the logical model, because this is where the modeler will explore all their possibilities and different ideas.
Generally, the logical model is a graphical presentation. The most widely used is the entity-relationship (ER) model, presented by Peter Chen in 1976. The ER model has a graphical notation that fits all the needs of a logical model.
The physical model is a model where we have more detailed and less generic information about the data. In this model, we should know which technology should be used. Here, we can include tables, column names, keys, indexes, security roles, validation rules, and whatever detail you as a data modeler think is necessary.
Just to make the connection to the three-schema architecture clear, the physical model is in some way linked to the internal level of the architecture, because it is at this level that we deal with how the stored data is represented. The goal of this stage is to have an implemented database.
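One way to picture how the three models refine each other is to write them down as plain data, from most abstract to most specific. The sketch below is only an illustration: the Student and Course entities come from the examples mentioned earlier, while the attribute names, collection names, index descriptions, and the missing_entities helper are assumptions made for this example.

```python
# Conceptual model: entities and relationships, independent of any DBMS.
conceptual = {
    "entities": {"Student", "Course"},
    "relationships": [("Student", "enrolls in", "Course")],
}

# Logical model: attributes, data types, nullability, and cardinality.
logical = {
    "entities": {
        "Student": {"name": ("string", "not null"), "email": ("string", "not null")},
        "Course": {"title": ("string", "not null"), "credits": ("integer", "null")},
    },
    "relationships": [("Student", "Course", "many-to-many")],
}

# Physical model: technology-specific details such as storage names and indexes.
physical = {
    "Student": {"collection": "students", "indexes": [("email", "unique")]},
    "Course": {"collection": "courses", "indexes": [("title", "non-unique")]},
}


def missing_entities(conceptual, logical, physical):
    """Report conceptual entities that a later model has not yet covered."""
    expected = set(conceptual["entities"])
    return {
        "logical": sorted(expected - set(logical["entities"])),
        "physical": sorted(expected - set(physical)),
    }


print(missing_entities(conceptual, logical, physical))
```

A check like missing_entities is a small example of keeping the stages consistent: every concept agreed on with the business should eventually appear in the logical and physical models.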
Data modeling is an important step in the database design process. There are many ways to ensure high quality in this process by involving all stakeholders. You will probably have a better knowledge of your data after modeling it.
That being said, we should always think about our data, and use a technique to model it.
In this chapter, you saw the history of NoSQL and also explored database designs and data modeling as a whole. We reviewed database architectures and you also learned about conceptual, logical, and physical models.
Now that you know more about data modeling, we will go through the MongoDB data model and the application of these concepts in the next chapter.