
Mastering MongoDB 6.x - Third Edition

By Alex Giamas
About this book
MongoDB is a leading non-relational database. This book covers all the major features of MongoDB including the latest version 6. MongoDB 6.x adds many new features and expands on existing ones such as aggregation, indexing, replication, sharding and MongoDB Atlas tools. Some of the MongoDB Atlas tools that you will master include Atlas dedicated clusters and Serverless, Atlas Search, Charts, Realm Application Services/Sync, Compass, Cloud Manager and Data Lake. By getting hands-on working with code using realistic use cases, you will master the art of modeling, shaping and querying your data and become the MongoDB oracle for the business. You will focus on broadly used and niche areas such as optimizing queries, configuring large-scale clusters, configuring your cluster for high performance and availability and many more. Later, you will become proficient in auditing, monitoring, and securing your clusters using a structured and organized approach. By the end of this book, you will have grasped all the practical understanding needed to design, develop, administer and scale MongoDB-based database applications both on premises and on the cloud.
Publication date:
August 2022
Publisher
Packt
Pages
460
ISBN
9781803243863

 

MongoDB – A Database for the Modern Web

In this chapter, we will lay the foundations for understanding MongoDB. We will explore how it is a database designed for the modern web and beyond. Learning is as important as knowing how to learn in the first place. We will go through the references that have the most up-to-date information about MongoDB, for both new and experienced users.

By the end of this chapter, you will have learned where MongoDB is best suited to be used and when it might be sub-optimal to use it. Learning about the evolution of MongoDB and the wider ecosystem will allow you to apply critical thinking when evaluating different database options early on in the software development life cycle.

In this chapter, we will cover the following topics: 

  • SQL and MongoDB’s history and evolution
  • MongoDB from the perspective of SQL and other NoSQL technology users
  • MongoDB’s common use cases and why they matter
  • MongoDB’s configuration and best practices
 

Technical requirements

To sail smoothly through the chapter, you will need MongoDB version 5 installed or a free tier account in MongoDB Atlas. The code that has been used for all of the chapters in this book can be found at https://github.com/PacktPublishing/Mastering-MongoDB-6.x.

 

The evolution of SQL and NoSQL

Structured Query Language (SQL) existed even before the World Wide Web (WWW). Dr. E. F. Codd originally published a paper, A Relational Model of Data for Large Shared Data Banks, in June 1970, in the Association for Computing Machinery (ACM) journal, Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce, in 1974. Relational Software (now known as Oracle Corporation) was the first to develop a commercially available implementation of SQL, which was targeted at United States governmental agencies.

The first American National Standards Institute (ANSI) SQL standard came out in 1986. Since then, there have been eight revisions, with the most recent being published in 2016 (SQL:2016).

SQL was not particularly popular at the start of the WWW. Static content could just be hardcoded onto the HTML page without much fuss. However, as the functionality of websites grew, webmasters wanted to generate web page content driven by offline data sources, in order to generate content that could change over time without redeploying code.

Common Gateway Interface (CGI) scripts, written in Perl or Unix shell, drove early database-driven websites in Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two-tier and three-tier architectures that separated views from the business and model logic, allowing for SQL queries to be modular and isolated from the rest of the web application.

On the other hand, Not only SQL (NoSQL) is much more modern and followed the evolution of the web, rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi, in 1998, for his open source database that did not follow the SQL standard but was still relational.

This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm, reintroduced the term in early 2009, in order to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google’s Bigtable and MapReduce papers or Amazon’s Dynamo, which is a highly available key-value-based storage system.

NoSQL’s foundations grew upon relaxing atomicity, consistency, isolation, and durability (ACID) guarantees in exchange for performance, scalability, flexibility, and reduced complexity. Most NoSQL databases have gone one way or the other in providing as many of the previously mentioned qualities as possible, even offering adjustable guarantees to the developer. The following diagram describes the evolution of SQL and NoSQL:

Figure 1.1 – Database evolution


In the next section, we will learn more about how MongoDB has evolved over time, from a basic object store to a full-fledged general-purpose database system.

The evolution of MongoDB

MongoDB Inc., formerly known as 10gen Inc., started to develop a cloud computing stack in 2007 and soon realized that the most important innovation was the document-oriented database that they had built to power it, which was MongoDB. The company shifted from a Platform as a Service (PaaS) to an open source model and released MongoDB version 1.0 on August 27, 2009.

Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees, but it made up for these shortcomings with performance and flexibility.

In the following sections, we will highlight the major features of MongoDB, along with the version numbers with which they were introduced.

The major feature set for versions 1.0 and 1.2

The major new features of versions 1.0 and 1.2 are listed as follows:

  • A document-based model
  • A global lock (process level)
  • Indexes on collections
  • CRUD operations on documents
  • No authentication (authentication was handled at the server level)
  • Primary and secondary replication: Back then, they were named master and slave, respectively, and were changed to their current names with the SERVER-20608 ticket, in version 4.9.0
  • MapReduce (introduced in v1.2)
  • Stored JavaScript functions (introduced in v1.2)

Version 2

The major new features of version 2 are listed as follows:

  • Background index creation (since v1.4)
  • Sharding (since v1.6)
  • More query operators (since v1.6)
  • Journaling (since v1.8)
  • Sparse and covered indexes (since v1.8)
  • Compact commands to reduce disk usage
  • More efficient memory usage
  • Concurrency improvements
  • Index performance enhancements
  • Replica sets are now more configurable and data center-aware
  • MapReduce improvements
  • Authentication (since 2.0, for sharding and most database commands)
  • Geospatial features introduced
  • The aggregation framework (since v2.2) and enhancements (since v2.6)
  • Time-to-Live (TTL) collections (since v2.2)
  • Concurrency improvements, among which there is DB-level locking (since v2.2)
  • Text searching (since v2.4) and integration (since v2.6)
  • Hashed indexes (since v2.4)
  • Security enhancements and role-based access (since v2.4)
  • A V8 JavaScript engine instead of SpiderMonkey (since v2.4)
  • Query engine improvements (since v2.6)
  • A pluggable storage engine API (since v3.0)
  • The WiredTiger storage engine (since v3.0), with document-level locking, while the previous storage engine (now called MMAPv1) supports collection-level locking

Version 3

The major new features of version 3 are listed as follows:

  • Replication and sharding enhancements (since v3.2)
  • Document validation (since v3.2)
  • The aggregation framework’s enhanced operations (since v3.2)
  • Multiple storage engines (since v3.2, only in Enterprise Edition)
  • Collation for the query language and indexes (since v3.4)
  • Read-only database views (since v3.4)
  • Linearizable read concerns (since v3.4)

Version 4

The major new features of version 4 are listed as follows:

  • Multi-document ACID transactions (since v4.0)
  • Change streams (since v4.0)
  • MongoDB tools (Stitch, Mobile, Sync, and Kubernetes Operator) (since v4.0)
  • Retryable writes (since v4.0)
  • Distributed transactions (since v4.2)
  • Removing the outdated MMAPv1 storage engine (since v4.2)
  • Updating the shard key (since v4.2)
  • On-demand materialized views using aggregation pipelines (since v4.2)
  • Wildcard indexes (since v4.2)
  • Streaming replication in replica sets (since v4.4)
  • Hidden indexes (since v4.4)

Version 5

The major new features of version 5 are listed as follows:

  • A quarterly MongoDB release schedule going forward
  • Window operators using aggregation pipelines (since v5.0)
  • A new MongoDB shell – mongosh (since v5.0)
  • Native time series collections (since v5.0)
  • Live resharding (since v5.0)
  • Versioned APIs (since v5.0)
  • Multi-cloud client-side field level encryption (since v5.0)
  • Cross-shard joins and graph traversals (since v5.1)

The following diagram shows MongoDB’s evolution over time:

Figure 1.2 – MongoDB’s evolution


As you can see, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version, such as sharding, sparse and covered indexes, geospatial features, and memory and concurrency improvements.

On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the aging MapReduce framework that didn’t keep up with dedicated frameworks, such as Hadoop. Then, text search was added, and slowly but surely, the performance, stability, and security of the database improved, adapting to the increasing enterprise loads of customers using MongoDB.

With WiredTiger’s introduction in version 3, locking became much less of an issue for MongoDB, as it was brought down from the process (global lock) to the document level, which is almost the most granular level possible.

Version 4 marked a major transition, bridging the SQL and NoSQL world with the introduction of multi-document ACID transactions. This allowed for a wider range of applications to use MongoDB, especially applications that require a strong real-time consistency guarantee. Further, the introduction of change streams allowed for a faster time to market for real-time applications using MongoDB. Additionally, a series of tools have been introduced to facilitate serverless, mobile, and Internet of Things (IoT) development.

With version 5, MongoDB is now a cloud-first database, with MongoDB Atlas offering full customer support for all major and minor releases going forward. In comparison, non-cloud users only get official support for major releases (for example, version 5 and then version 6). This is complemented by the newly released versioned API approach, which futureproofs applications. Live resharding addresses the major risk of choosing the wrong shard key, whereas native time series collections and cross-shard lookups using $lookup and $graphLookup greatly improve analytics capabilities and unlock new use cases. End-to-end encryption and multi-cloud support can help implement systems in industries that have unique regulatory needs and also avoid vendor lock-in. The new mongosh shell is a major improvement over the legacy mongo shell.

Version 6 brings many incremental improvements. Time series collections now support sharding, compression, an extended range of secondary indexes, and updates and deletes (with limitations), making them suitable for production use. The new slot-based query execution engine can be used in eligible queries, such as those using the $group and $lookup stages, improving execution time by optimizing query calculations. Finally, queryable encryption and cluster-to-cluster syncing improve the operational and management aspects of MongoDB.

In its current state, MongoDB is a database that can handle heterogeneous workloads ranging from startup Minimum Viable Product (MVP) and Proof of Concept (PoC) to enterprise applications with hundreds of servers.

 

MongoDB for SQL developers

MongoDB was developed in the Web 2.0 era. By then, most developers were using SQL or object-relational mapping (ORM) tools from their language of choice to access RDBMS data. As such, these developers needed an easy way to get acquainted with MongoDB from their relational background.

Thankfully, there have been several attempts at making SQL-to-MongoDB cheat sheets that explain the MongoDB terminology in SQL terms.

On a higher level, we have the following:

  • Databases and indexes (SQL databases and indexes)
  • Collections (SQL tables)
  • Documents (SQL rows)
  • Fields (SQL columns)
  • Embedded and linked documents (SQL joins)

Further examples of common operations in SQL and their equivalents in MongoDB are shown in the following table:

Table 1.1 – Common operations in SQL/MongoDB


A few more examples of common operations can be seen at https://s3.amazonaws.com/info-mongodb-com/sql_to_mongo.pdf.
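
To make the terminology mapping more concrete, the following is a minimal in-memory sketch in plain Python; it does not use MongoDB or any driver. It treats a table as a list of documents and mimics a small subset of MongoDB's query operators ($gt, $lt, $in, and plain equality). The matches and find helpers and the sample users data are hypothetical names introduced only for illustration; on a real deployment, the server evaluates the query document.

```python
# In-memory illustration only: `matches` and `find` are hypothetical helpers,
# not part of MongoDB or any driver.

# A SQL row such as (1, 'alice', 31) in a `users` table becomes a document.
users = [
    {"_id": 1, "name": "alice", "age": 31},
    {"_id": 2, "name": "bob", "age": 24, "tags": ["admin"]},  # extra field is allowed
    {"_id": 3, "name": "carol", "age": 45},
]

# Supported comparison operators (a small subset of MongoDB's query language).
OPERATORS = {
    "$gt": lambda value, arg: value is not None and value > arg,
    "$lt": lambda value, arg: value is not None and value < arg,
    "$in": lambda value, arg: value in arg,
}


def matches(doc, query):
    """Return True if `doc` satisfies every condition in `query`."""
    for field, condition in query.items():
        value = doc.get(field)
        if isinstance(condition, dict):
            # Operator form, for example {"age": {"$gt": 25}}
            for op, arg in condition.items():
                if not OPERATORS[op](value, arg):
                    return False
        elif value != condition:
            # Plain equality, for example {"name": "alice"}
            return False
    return True


def find(collection, query):
    """Rough analogue of SELECT * FROM collection WHERE <query>."""
    return [doc for doc in collection if matches(doc, query)]


# SELECT * FROM users WHERE age > 25
older = find(users, {"age": {"$gt": 25}})
# SELECT * FROM users WHERE name IN ('bob', 'carol') AND age < 30
young_named = find(users, {"name": {"$in": ["bob", "carol"]}, "age": {"$lt": 30}})
```

Note how the WHERE clause turns into a query document: each key names a field, and each value is either a literal (equality) or a nested document of operators.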

Next, we will check out the features that MongoDB has brought for NoSQL developers.

 

MongoDB for NoSQL developers

As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are also coming to it from a NoSQL background.

Putting the SQL versus NoSQL differences aside, it is the users from the columnar-type databases that face the most challenges. With Cassandra and HBase being the most popular column-oriented database management systems, we will examine the differences between them and how a developer can migrate a system to MongoDB. The different features of MongoDB for NoSQL developers are listed as follows:

  • Flexibility: MongoDB’s notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This is similar to the comparison between MongoDB and SQL, with the added benefit that MongoDB can more easily map to plain old objects from any programming language, allowing for easy deployment and maintenance.
  • Flexible query model: A user can selectively index some parts of each document; query based on attribute values, regular expressions, or ranges; and have as many properties per object as needed by the application layer. Primary and secondary indexes, along with special types of indexes (such as sparse ones), can help greatly with query efficiency. Using a JavaScript shell with MapReduce makes it really easy for most developers (and many data analysts) to quickly take a look at data and get valuable insights.
  • Native aggregation: The aggregation framework provides an extract-transform-load (ETL) pipeline for users to extract and transform data from MongoDB, and either load it in a new format or export it from MongoDB to other data sources. This can also help data analysts and scientists to get the slice of data they need in performing data wrangling along the way.
  • Schema-less model: This is a result of MongoDB’s design philosophy to give applications the power and responsibility to interpret the different properties found in a collection’s documents. In contrast to Cassandra’s or HBase’s schema-based approach, in MongoDB, a developer can store and process dynamically generated attributes.
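
The extract-transform-load idea behind native aggregation can be sketched as a chain of stages, where the output of one stage feeds the next. The following plain Python example is an in-memory approximation, not the MongoDB aggregation API: match_stage and group_sum_stage are hypothetical stand-ins for a $match stage and a $group stage with a $sum accumulator, and the orders documents are invented sample data.

```python
from collections import defaultdict

# Hypothetical sample documents; note that they do not all share the same fields.
orders = [
    {"customer": "alice", "status": "paid", "total": 30},
    {"customer": "bob", "status": "paid", "total": 20},
    {"customer": "alice", "status": "refunded", "total": 15},
    {"customer": "alice", "status": "paid", "total": 50, "coupon": "SPRING"},
]


def match_stage(predicate):
    """Stand-in for a $match stage: keep documents that satisfy `predicate`."""
    def stage(docs):
        return (doc for doc in docs if predicate(doc))
    return stage


def group_sum_stage(key, field):
    """Stand-in for $group with a $sum accumulator over `field`, keyed by `key`."""
    def stage(docs):
        totals = defaultdict(int)
        for doc in docs:
            totals[doc[key]] += doc[field]
        return ({"_id": k, "sum": v} for k, v in totals.items())
    return stage


def run_pipeline(docs, stages):
    """Feed the output of each stage into the next, like an aggregation pipeline."""
    for stage in stages:
        docs = stage(docs)
    return list(docs)


# Total paid revenue per customer.
revenue = run_pipeline(
    orders,
    [
        match_stage(lambda doc: doc["status"] == "paid"),
        group_sum_stage("customer", "total"),
    ],
)
```

Filtering before grouping keeps the amount of data flowing through later stages small, which is the same reason a $match stage is usually placed early in a real pipeline.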

After learning about the major features MongoDB offers to its users, in the next section, we will learn more about the key characteristics and the most widely deployed use cases.

 

MongoDB’s key characteristics and use cases

In this section, we will analyze MongoDB’s characteristics as a database. Understanding the features that MongoDB provides can help developers and architects to evaluate the requirements at hand and how MongoDB can help to fulfill them. Also, we will go over some common use cases from the experience of MongoDB, Inc. that have delivered the best results for its users. Finally, we will uncover some of the most common points of criticism against MongoDB and non-relational databases in general.

Key characteristics

MongoDB has grown to become a general-purpose NoSQL database, offering the best of both the RDBMS and NoSQL worlds. Some of the key characteristics are listed as follows:

  • It is a general-purpose database: In contrast to other NoSQL databases that are built for specific purposes (for example, graph databases), MongoDB can serve heterogeneous loads and multiple purposes within an application. This became even more true after version 4.0 introduced multi-document ACID transactions, further expanding the use cases in which it can be effectively used.
  • Flexible schema design: Document-oriented approaches with non-defined attributes that can be modified on the fly are a key contrast between MongoDB and relational databases.
  • It is built with high availability, from the ground up: In the era of five-nines availability, this has to be a given. Coupled with automatic failover upon detection of a server failure, this can help to achieve high uptime.
  • Feature-rich: Offering the full range of SQL-equivalent operators, along with features such as MapReduce, the aggregation framework, TTL and capped collections, and secondary indexing, MongoDB can fit many use cases, no matter how diverse the requirements are.
  • Scalability and load balancing: It is built to scale, both vertically and (mainly) horizontally. Using sharding, an architect can share a load between different instances and achieve both read and write scalability. Data balancing happens automatically (and transparently to the user) via the shard balancer.
  • Aggregation framework: Having an ETL framework built into the database means that a developer can perform most of the ETL logic before the data leaves the database, eliminating, in many cases, the need for complex data pipelines.
  • Native replication: Data will get replicated across a replica set without a complicated setup.
  • Security features: Both authentication and authorization are taken into account so that an architect can secure their MongoDB instances.
  • JSON (BSON, or Binary JSON) objects for storing and transmitting documents: JSON is widely used across the web for frontend and API communication, and, as such, it is easier when the database is using the same format.
  • MapReduce: Even though the MapReduce engine is not as advanced as it is in dedicated frameworks, nonetheless, it is a great tool for building data pipelines.
  • Querying geospatial information in 2D and 3D: This might not be critical for many applications, but if it is for your use case, then it is really convenient to be able to use the same database for geospatial calculations and data storage.
  • Multi-document ACID transactions: Starting from version 4.0, MongoDB supports ACID transactions across multiple documents.
  • Mature tooling: The tooling for MongoDB has evolved to support systems ranging from DBaaS to Sync, Mobile, and serverless (Stitch).
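
To illustrate the horizontal scaling characteristic, the following is a simplified sketch of how hashing a shard key value can spread documents across shards. It rests on a loud assumption: it uses a plain hash-modulo-N rule, whereas MongoDB assigns ranges of hashed values to shards and rebalances them with the shard balancer. The shard names and the shard_for helper are hypothetical.

```python
import hashlib
from collections import Counter

SHARDS = ["shard0", "shard1", "shard2"]  # hypothetical shard names


def shard_for(shard_key_value, shards=SHARDS):
    """Pick a shard from a hash of the shard key value (simplified: hash modulo N)."""
    digest = hashlib.md5(str(shard_key_value).encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")
    return shards[bucket % len(shards)]


# Route 3,000 hypothetical user IDs and count how many land on each shard.
distribution = Counter(shard_for(user_id) for user_id in range(3000))
```

Because the hash scatters consecutive IDs, writes for monotonically increasing keys do not all land on the same shard, which is the main motivation for hashing a shard key in the first place.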

Use cases for MongoDB

Since MongoDB is a highly popular NoSQL database, there have been several use cases where it has succeeded in supporting quality applications, with a great delivery time to the market.

Many of its most successful use cases center around the following list of areas:

  • The integration of siloed data, providing a single view of it
  • IoT
  • Mobile applications
  • Real-time analytics
  • Personalization
  • Catalog management
  • Content management

All of these success stories share some common characteristics. We will try to break them down in order of relative importance:

  • Schema flexibility is probably the most important one. Being able to store documents inside a collection that can have different properties can help during both the development phase and when ingesting data from heterogeneous sources that may or may not have the same properties. This is in contrast with an RDBMS, where columns need to be predefined, and having sparse data can be penalized. In MongoDB, this is the norm, and it is a feature that most use cases share. Having the ability to deeply nest attributes into documents and add arrays of values into attributes while also being able to search and index these fields helps application developers to exploit the schema-less nature of MongoDB.
  • Scaling and sharding are the most common patterns for MongoDB use cases. Easily scaling using built-in sharding, using replica sets for data replication, and offloading primary servers from read loads can help developers store data effectively.
  • Additionally, many use cases use MongoDB as a way of archiving data. Used as a pure data store (and without the need to define schemas), it is fairly easy to dump data into MongoDB to be analyzed at a later date by business analysts, using either the shell or some of the numerous BI tools that can integrate easily with MongoDB. Breaking data down further, based on time caps or document counts, can help serve these datasets from RAM, the use case in which MongoDB is most effective.
  • Capped collections are also a feature used in many use cases. Capped collections can restrict documents in a collection by count or by the overall size of the collection. In the latter case, we need to have an estimate of the size per document, in order to calculate how many documents will fit into our target size. Capped collections are a quick and dirty solution used to answer requests such as “Give me the last hour’s overview of the logs” without the need for maintenance and running async background jobs to clean our collection. Oftentimes, they might be used to quickly build and operate a queuing system. Instead of deploying and maintaining a dedicated queuing system, such as ActiveMQ, a developer can use a collection to store messages, and then use the native tailable cursors provided by MongoDB to iterate through the results as they pile up and feed an external system. Alternatively, you can use a TTL index within a regular collection if you require greater flexibility.
  • Low operational overhead is also a common pattern in many use cases. Developers working in agile teams can operate and maintain clusters of MongoDB servers without the need for a dedicated DBA. The free cloud monitoring service can greatly help in reducing administrative overhead for Community Edition users, whereas MongoDB Atlas, the hosted solution by MongoDB, Inc., means that developers do not need to deal with operational headaches.
  • In terms of business sectors using MongoDB, there is a huge variety coming from almost all industries. A common pattern seems to be higher usage where we are more interested in aggregated data than individual transaction-level data. Fields such as IoT can benefit the most by exploiting the availability-over-consistency design, storing lots of data from sensors in a cost-efficient way. On the other hand, financial services have absolutely stringent consistency requirements, aligned with proper ACID characteristics that make MongoDB more of a challenge to adopt. A financial transaction might be small in size but big in impact, which means that we cannot afford to leave a single message without proper processing.
  • Location-based data is also a field where MongoDB has thrived, with Foursquare being one of the most prominent early clients. MongoDB offers quite a rich set of features around two-dimensional and three-dimensional geolocation data, such as searching by distance, geofencing, and intersections between geographical areas.
  • Overall, the rich feature set is a common pattern across different use cases. By providing features that can be used in many different industries and applications, MongoDB can be a unified solution for all business needs, offering users the ability to minimize operational overhead and, at the same time, iterate quickly in product development.
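
The sizing estimate and eviction behavior described for capped collections can be sketched in plain Python. This is an in-memory analogy and not MongoDB code: a deque with a maximum length stands in for the capped collection, and the size budget and average document size are assumed figures chosen for illustration.

```python
from collections import deque

TARGET_SIZE_BYTES = 1024 * 1024   # assumed size budget for the collection (1 MiB)
AVG_DOC_SIZE_BYTES = 512          # assumed average size of one log document

# Estimate how many documents fit in the size budget.
max_docs = TARGET_SIZE_BYTES // AVG_DOC_SIZE_BYTES

# A deque with maxlen drops the oldest entry once it is full,
# similar to how a capped collection overwrites its oldest documents.
capped_logs = deque(maxlen=max_docs)

for seq in range(max_docs + 100):
    capped_logs.append({"seq": seq, "msg": f"log line {seq}"})
```

After the loop, only the most recent max_docs entries remain, in insertion order, with no background cleanup job involved.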

MongoDB criticism

Criticism of MongoDB can be broken down into the following points:

  • MongoDB has had its fair share of criticism throughout the years. The web-scale proposition has been met with skepticism by many developers. The counterargument is that scale is not needed most of the time, and the focus should be on other design considerations. While this might occasionally be true, it is a false dichotomy, and in an ideal world, we would have both. MongoDB is as close as it can get to combining scalability with features, ease of use, and time to market.
  • MongoDB’s schema-less nature is also a big point of debate and argument. Schema-less can be really beneficial in many use cases, as it allows for heterogeneous data to be dumped into the database without complex cleansing and without ending up with lots of empty columns or blocks of text stuffed into a single column. On the other hand, this is a double-edged sword, as a developer could end up with many documents in a collection that have loose semantics in their fields, and it can become really hard to extract these semantics at the code level. If our schema design is not optimal, we could end up with a data store, rather than a database.
  • A lack of proper ACID guarantees is a recurring complaint from the relational world. Indeed, if a developer needs access to more than one document at a time, it is not easy to guarantee RDBMS properties, as there are no transactions in the RDBMS sense. Having no transactions also means that complex writes will need to have application-level logic to roll back. If you need to update three documents in two collections to mark an application-level transaction complete, and the third document does not get updated for whatever reason, the application will need to undo the previous two writes, something that might not exactly be trivial.
  • With the introduction of multi-document transactions in version 4, MongoDB can cope with ACID transactions at the expense of speed. While this is not ideal, and transactions are not meant to be used for every CRUD operation in MongoDB, it does address the main source of criticism.
  • Configuration defaults that favored getting MongoDB set up easily over operating it safely in a production environment have also drawn criticism. For years, the default write behavior was write and forget; sending a write wouldn’t wait for an acknowledgment before attempting the next write, resulting in very high write speeds but poor behavior in the case of failure. Also, authentication was an afterthought, leaving thousands of MongoDB databases on the public internet prey to whoever wanted to read the stored data. Even though these were conscious design decisions, they are decisions that have affected developers’ perceptions of MongoDB.
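
The application-level rollback described in the ACID point above can be sketched as follows. This is a minimal illustration, not MongoDB code: plain Python dictionaries stand in for two collections, and the names orders, payments, and apply_with_rollback are hypothetical. A real implementation would also have to handle a failure during the rollback itself.

```python
# In-memory dicts stand in for two collections; all names are hypothetical.
orders = {"o1": {"status": "pending"}, "o2": {"status": "pending"}}
payments = {"p1": {"status": "pending"}}


def apply_with_rollback(writes):
    """Apply (collection, doc_id, field, value) writes in order.

    If a write fails, restore the fields already changed, newest first.
    Returns True when every write succeeded, False after a rollback.
    """
    undo_log = []
    try:
        for collection, doc_id, field, value in writes:
            document = collection[doc_id]  # KeyError if the document is missing
            undo_log.append((document, field, document[field]))
            document[field] = value
    except KeyError:
        # Compensating writes: put back the old values in reverse order.
        for document, field, old_value in reversed(undo_log):
            document[field] = old_value
        return False
    return True


ok = apply_with_rollback([
    (orders, "o1", "status", "complete"),
    (orders, "o2", "status", "complete"),
    (payments, "missing", "status", "complete"),  # the third write fails
])
print(ok, orders, payments)
```

Even in this toy form, the undo log has to be maintained by hand for every write, which is the burden that multi-document transactions remove.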

It’s important to note that MongoDB has addressed all of the shortcomings throughout the years, with the aim of becoming a versatile and resilient general-purpose database system. Now that we understand the characteristics and features of MongoDB, we will learn how to configure and set up MongoDB efficiently.

 

MongoDB configuration and best practices

In this section, we will present some of the best practices around operations, schema design, durability, replication, sharding, security, and AWS. Further information on when to implement these best practices will be presented in later chapters.

Operational best practices

As a database, MongoDB is built with developers in mind, and it was developed during the web era, so it does not require as much operational overhead as traditional RDBMS. That being said, there are some best practices that need to be followed to be proactive and achieve high availability goals.

In order of importance, the best practices are as follows:

  • Mind the location of your data files: Data files can be mounted anywhere by using the --dbpath command-line option. It is really important to ensure that data files are stored in partitions with sufficient disk space, preferably XFS, or at the very least Ext4.
  • Keep yourself updated with versions: Before version 5, there was a different versioning naming convention, in which releases with an even minor version number were the stable ones. So, 3.2 is stable, whereas 3.3 is not. In this example, 3.3 is the developmental version that will eventually materialize into the stable 3.4 version. It is a good practice to always update to the latest security release (which, at the time of writing this book, is 4.0.2) and to consider updating as soon as the next stable version comes out (4.2, in this example).

Version 5 has become cloud-first. The newest versions are automatically updated in MongoDB Atlas with the ability to opt out of them, whereas all versions are available to download for evaluation and development purposes. Chapter 3, MongoDB CRUD Operations, goes into more detail about the new rapid release schedule and how it affects developers and architects.

  • Use MongoDB Cloud monitoring: The free MongoDB, Inc. monitoring service is a great tool to get an overview of a MongoDB cluster, notifications, and alerts and to be proactive about potential issues.
  • Scale up if your metrics show heavy use: Do not wait until it is too late. Utilizing more than 65% in CPU or RAM, or starting to notice disk swapping, should be the threshold to start thinking about scaling, either vertically (by using bigger machines) or horizontally (by sharding).
  • Be careful when sharding: Sharding is a strong commitment to your shard key. If you make the wrong decision, it might be really difficult to go back from an operational perspective. When designing for sharding, architects need to take deep consideration of the current workloads (reads/writes) and what the current and expected data access patterns are. Live resharding, which was introduced in version 5, mitigates the risk compared to previous versions, but it’s still better to spend more time upfront instead of resharding after the fact. Always use the shard key in queries or else MongoDB will have to query all shards in the cluster, negating the major sharding advantage.
  • Use an application driver maintained by the MongoDB team: These drivers are supported and tend to get updated faster than drivers with no official support. If MongoDB does not support the language that you are using yet, please open a ticket in MongoDB’s JIRA tracking system.
  • Schedule regular backups: No matter whether you are using standalone servers, replica sets, or sharding, a regular backup policy should also be used as a second-level guard against data loss. XFS is a great choice as a filesystem, as it can perform snapshot backups.
  • Manual backups should be avoided: When possible, regular, automated backups should be used. If we need to resort to a manual backup, we can use a hidden member in a replica set to take the backup from. We have to make sure that we are using db.fsync with {lock: true} in this member, to get the maximum consistency at this node, along with having journaling turned on. If this volume is on AWS, we can get away with taking an EBS snapshot straight away.
  • Enable database access control: Never put a database into a production system without access control. Access control should be implemented at a node level, by a proper firewall that only allows specific application servers to access the database, and at a DB level, by using the built-in roles or defining custom ones. This has to be initialized at start-up time by using the --auth command-line parameter and can be configured by using the admin database.
  • Test your deployment using real data: Since MongoDB is a schema-less, document-oriented database, you might have documents with varying fields. This means that it is even more important than with an RDBMS to test using data that resembles production data as closely as possible. A document with an extra field of an unexpected value can make the difference between an application working smoothly or crashing at runtime. Try to deploy a staging server using production-level data, or at least fake your production data in staging, by using an appropriate library, such as Faker for Ruby.
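
The scaling rule of thumb from the list above can be expressed as a small check. This is only a sketch: the function name and parameters are hypothetical, and the metric values are assumed to come from whatever monitoring tool is in use.

```python
def should_consider_scaling(cpu_pct, ram_pct, swapping, threshold=65.0):
    """Return True when any of the warning signs from the list above is present.

    cpu_pct and ram_pct are utilization percentages (0 to 100);
    swapping is True if disk swapping has been observed.
    """
    return cpu_pct > threshold or ram_pct > threshold or swapping


print(should_consider_scaling(cpu_pct=70, ram_pct=40, swapping=False))
print(should_consider_scaling(cpu_pct=50, ram_pct=60, swapping=False))
```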

Schema design best practices

MongoDB is schema-less, and you need to design your collections and indexes to accommodate this fact:

  • Index early and often: Identify common query patterns, using cloud monitoring, the GUI that MongoDB Compass offers, or logs. Analyzing the results, you should create indexes that cover the most common query patterns, using as many indexes as possible at the beginning of a project.
  • Eliminate unnecessary indexes: This is a bit counter-intuitive to the preceding suggestion, but monitor your database for changing query patterns, and drop the indexes that are not being used. An index will consume RAM and I/O, as it needs to be stored and updated along with the documents in the database. Using an aggregation pipeline and $indexStats, a developer can identify the indexes that are seldom being used and eliminate them.
  • Use a compound index, rather than index intersection: Most of the time, querying with multiple predicates (A and B, C or D and E, and so on) will work better with a single compound index than with multiple simple indexes. Also, a compound index will have its data ordered by field, and we can use this to our advantage when querying. An index on fields A, B, and C will be used in queries for (A), (A,B), and (A,B,C), but not in querying for (B,C) or (C).
  • Avoid low-selectivity indexes: Indexing a field such as gender, for example, will statistically return half of our documents back, so the index does little to narrow down the search, whereas an index on the last name will only return a handful of documents with the same last name.
  • Use regular expressions carefully: Again, since indexes are ordered by value, searching using a regular expression with leading wildcards (that is, /.*BASE/) won’t be able to use the index efficiently. Searching with an anchored prefix and trailing wildcards (that is, /^DATA.*/) can be efficient, as long as there are enough case-sensitive characters in the expression.
  • Avoid negation in queries: Indexes are indexing values, not the absence of them. Queries that use negation (for example, $ne or $nin) often cannot make effective use of the index and can result in full collection scans.
  • Use partial indexes: If we need to index a subset of the documents in a collection, partial indexes can help us to minimize the index set and improve performance. A partial index will include a condition on the filter that we use in the desired query.
  • Use document validation: Use document validation to monitor for new attributes being inserted into your documents and decide what to do with them. With document validation set to warn, we can keep a log of documents that were inserted with new, never-seen-before attributes that we did not expect during the design phase and decide whether we need to update our index or not.
  • Use MongoDB Compass: MongoDB’s free visualization tool is great for getting a quick overview of our data and how it grows over time.
  • Respect the maximum document size of 16 MB: The maximum document size for MongoDB is 16 MB. This is a fairly generous limit, but it is one that should not be violated under any circumstances. Allowing for documents to grow unbounded should not be an option, and, as efficient as it might be to embed documents, we should always keep in mind that this should be kept under control. Additionally, we should keep track of the average and maximum document sizes using monitoring, the bsonSize() method, or the $bsonSize aggregation operator.
  • Use the appropriate storage engine: MongoDB has introduced several new storage engines since version 3.2. The in-memory storage engine should be used for real-time workloads, whereas the encrypted storage engine (only available in MongoDB Enterprise Edition) should be the engine of choice when there are strict requirements around data security. Otherwise, the default WiredTiger storage engine is the best option for general-purpose workloads.
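
The compound index rule from the list above can be illustrated with a simplified model. The sketch below only considers which fields a query filters on; it is an assumption-laden approximation, since the real query planner also takes sort order, range predicates, and other factors into account. The function name usable_index_prefix is hypothetical.

```python
def usable_index_prefix(index_fields, query_fields):
    """Return the leading index fields that the query filters on.

    An empty list means the query cannot use the compound index as a prefix.
    """
    prefix = []
    for field in index_fields:
        if field not in query_fields:
            break
        prefix.append(field)
    return prefix


index = ["A", "B", "C"]
for query in (["A"], ["A", "B"], ["A", "B", "C"], ["B", "C"], ["C"]):
    print(query, "->", usable_index_prefix(index, query))
```

In this model, queries on (B,C) or (C) produce an empty prefix, which matches the rule stated in the list.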

Having examined some schema design best practices, we will now move on to the best practices for write durability as of MongoDB version 6.

Best practices for write durability

Write durability can be fine-tuned in MongoDB, and, according to our application design, it should be as strict as possible, without affecting our performance goals.

Fine-tune the interval at which data is flushed to disk in the WiredTiger storage engine; the default is to flush data to the disk every 60 seconds after the last checkpoint. This can be changed by using the --wiredTigerCheckpointDelaySecs command-line option.

MongoDB version 5 has changed the default settings for read and write concerns.

The default write concern is now majority writes, which means that in a replica set of three nodes (with one primary and two secondaries), the operation returns as soon as two of the nodes acknowledge it by writing it to the disk. Writes always go to the primary and then get propagated asynchronously to the secondaries. In this way, MongoDB eliminates the possibility of data rollback in the event of a node failure.

If we use arbiters in our replica set, then writes will still be acknowledged solely by the primary if the following formula resolves to true:

#arbiters > #nodes*0.5 - 1

For example, in a replica set of three nodes of which one is the arbiter and two are storing data, this formula resolves to the following:

1 > 3*0.5 - 1 ... 1 > 0.5 ... true

Note

MongoDB 6 restricts the number of arbiters to a maximum of one.
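
The arithmetic in this section can be checked with a few lines of code. The sketch below assumes that every node is a voting member and simply evaluates the formula as stated above; the function names are hypothetical.

```python
def majority(voting_members):
    """Smallest number of members that forms a majority."""
    return voting_members // 2 + 1


def acknowledged_by_primary_only(nodes, arbiters):
    """Evaluate the formula from the text: #arbiters > #nodes * 0.5 - 1."""
    return arbiters > nodes * 0.5 - 1


# Three data-bearing nodes: a majority write waits for two of them.
print(majority(3))

# Three nodes, one of which is an arbiter: 1 > 0.5, so the formula is true.
print(acknowledged_by_primary_only(nodes=3, arbiters=1))

# Three data-bearing nodes and no arbiter: 0 > 0.5 is false.
print(acknowledged_by_primary_only(nodes=3, arbiters=0))
```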

The default read concern is now local instead of available, which mitigates the risk of returning orphaned documents for reads in sharded collections. Orphaned documents might be returned during chunk migrations, which can be triggered either by MongoDB or, since version 5, also by the user when applying live resharding to the sharded collection.

Multi-document ACID transactions and the transactional guarantees that they have provided since MongoDB 4.2, coupled with the introduction of streaming replication and replicate-before-journaling behavior, have improved replication performance. Additionally, they allow for more durable and consistent default write and read concerns without affecting performance as much. The new defaults are promoting durability and consistent reads and should be carefully evaluated before changing them.

Best practices for replication

Under the right conditions, replica sets are MongoDB’s mechanism to provide redundancy, high availability, and higher read throughput. In MongoDB, replication is easy to configure and focuses on operational terms:

  • Always use replica sets: Even if your dataset is currently small, and you don’t expect it to grow exponentially, you never know when that might happen. Also, having a replica set of at least three servers helps to design for redundancy, separating the workloads between real time and analytics (using the secondary) and having data redundancy built-in from day one. Finally, there are some corner cases that you will identify earlier by using a replica set instead of a single standalone server, even for development purposes.
  • Use a replica set to your advantage: A replica set is not just for data replication. We can (and, in most cases, should) use the primary server for writes and prefer reads from one of the secondaries to offload the primary server. This can be done by setting read preferences for reads, along with the correct write concern, to ensure that writes propagate as needed.
  • Use an odd number of replicas in a MongoDB replica set: If a server is down or loses connectivity with the rest of them (network partitioning), the rest have to vote as to which one will be elected as the primary server. If we have an odd number of replica set members, we can guarantee that each subset of servers knows whether it belongs to the majority or the minority of the replica set members. If we cannot have an odd number of replicas, we need to have one extra host set as an arbiter, with the sole purpose of voting in the election process. Even a micro-instance in EC2 could serve this purpose.
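
The reasoning behind the odd-number recommendation can be sketched with a short calculation. The example assumes that every member has a vote and that a network partition splits the set into exactly two groups; the function names are hypothetical.

```python
def has_majority(reachable_members, total_members):
    """A group can elect a primary only if it holds more than half the votes."""
    return reachable_members > total_members / 2


def partition_outcome(total_members, side_a):
    """Return whether each side of a two-way partition holds a majority."""
    side_b = total_members - side_a
    return has_majority(side_a, total_members), has_majority(side_b, total_members)


# Five members split 3/2: exactly one side can elect a primary.
print(partition_outcome(5, 3))

# Four members split 2/2: neither side holds a majority.
print(partition_outcome(4, 2))
```

With an even number of members, an even split leaves the whole set without a primary, which is the situation the extra arbiter is meant to prevent.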

Best practices for sharding

Sharding is MongoDB’s solution for horizontal scaling. In Chapter 9, Monitoring, Backup, and Security, we will go over its usage in more detail, but the following list offers some best practices, based on the underlying data architecture:

  • Think about query routing: Based on different shard keys and techniques, the mongos query router might direct the query to some (or all) of the members of a shard. It is important to take our queries into account when designing sharding. This is so that we don’t end up with our queries hitting all of our shards.
  • Use tag-aware sharding: Tags can provide more fine-grained distribution of data across our shards. Using the right set of tags for each shard, we can ensure that subsets of data get stored in a specific set of shards. This can be useful for data proximity between application servers, MongoDB shards, and the users.
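
The query routing consideration from the list above can be approximated with a simple check. This is a deliberately simplified sketch: it assumes a query can be routed to specific shards when its filter includes at least the first field of the shard key, and it ignores operator types and hashed keys. The field names and the function name is_targeted are hypothetical.

```python
def is_targeted(shard_key, query_filter):
    """Return True when the filter constrains the leading shard key field.

    shard_key is an ordered list of field names; query_filter is a dict
    in the shape of a MongoDB query document.
    """
    return len(shard_key) > 0 and shard_key[0] in query_filter


shard_key = ["customer_id", "order_date"]

# Includes the leading shard key field: can be routed to specific shards.
print(is_targeted(shard_key, {"customer_id": 42}))

# Omits the leading shard key field: would have to be sent to all shards.
print(is_targeted(shard_key, {"order_date": "2023-01-01"}))
```

A check like this could be run against the most common queries of an application while evaluating candidate shard keys.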

Best practices for security

Security is always a multi-layered approach, and the following recommendations do not form an exhaustive list; they are just the bare basics that need to be done in any MongoDB database:

  • Always turn authentication on. There have been multiple incidents over the years in which open MongoDB servers were compromised for fun or profit, for example, by having their data backed up and deleted to extort admins into paying. It is a good practice to set up authentication even in non-production environments to decrease the possibility of human error.
  • The HTTP status interface should be disabled.
  • The RESTful API should be disabled.
  • The JSON API should be disabled.
  • Connect to MongoDB using TLS/SSL.
  • Audit the system activity.
  • Use a dedicated system user to access MongoDB with appropriate system-level access.
  • Disable server-side scripting if it is not needed. This will affect MapReduce, built-in db.group() commands, and $where operations. If they are not used in your code base, it is better to disable server-side scripting at startup by using the --noscripting parameter or setting security.javascriptEnabled to false in the configuration file.

After examining the best practices for security in general, we will dive into the best practices for AWS deployments.

Best practices for AWS

When we are using MongoDB, we can use our own servers in a data center, a MongoDB-hosted solution such as MongoDB Atlas, or we can rent instances from Amazon by using EC2. EC2 instances are virtualized and share resources in a transparent way, with collocated VMs in the same physical host. So, there are some more considerations to take into account if you wish to go down that route, as follows:

  • Use EBS-optimized EC2 instances.
  • Get EBS volumes with provisioned I/O operations per second (IOPS) for consistent performance.
  • Use EBS snapshotting for backup and restore.
  • Use different availability zones for high availability and different regions for disaster recovery. Using different availability zones within each region that Amazon provides guarantees that our data will be highly available. Different regions should mostly be used for disaster recovery in case a catastrophic event ever takes out an entire region. A region might be EU-West-2 (for London), whereas an availability zone is a subdivision within a region; currently, three availability zones are available in the London region.
  • Deploy globally, access locally: For truly global applications with users from different time zones, we should have application servers in different regions access the data that is closest to them, using the right read preference configuration in each server.
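
One way to configure reads from the closest member is through the connection string that each application server uses. The sketch below builds such a string with the standard replicaSet and readPreference options; the hostnames and the replica set name are placeholders, and nearest is only one of the read preference modes that could be appropriate.

```python
from urllib.parse import urlencode


def build_uri(hosts, replica_set, read_preference="nearest"):
    """Build a MongoDB connection string with a read preference option."""
    options = urlencode({
        "replicaSet": replica_set,
        "readPreference": read_preference,
    })
    return "mongodb://" + ",".join(hosts) + "/?" + options


# Placeholder hosts in two regions; each application server would use the same
# replica set but read from whichever member responds with the lowest latency.
hosts = ["db-london.example.com:27017", "db-virginia.example.com:27017"]
print(build_uri(hosts, "rs0"))
```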
 

Reference documentation and further reading

Reading a book is great (and reading this book is even better), but continuous learning is the only way to keep up to date with MongoDB.

The online documentation available at https://docs.mongodb.com/manual/ is the perfect starting point for every developer, new or seasoned.

The JIRA tracker is a great place to take a look at fixed bugs and the features that are coming up next: https://jira.mongodb.org/browse/SERVER/.

Some other great books on MongoDB are listed as follows:

  • MongoDB Fundamentals: A hands-on guide to using MongoDB and Atlas in the real world, by Amit Phaltankar and Juned Ahsan
  • MongoDB: The Definitive Guide 3e: Powerful and Scalable Data Storage, by Shannon Bradshaw and Eoin Brazil
  • MongoDB Topology Design: Scalability, Security, and Compliance on a Global Scale, by Nicholas Cottrell
  • Any book by Kristina Chodorow

The MongoDB user group (https://groups.google.com/forum/#!forum/mongodb-user) has a great archive of user questions about features and long-standing bugs. It is a place to go when something doesn’t work as expected.

Online forums (such as Stack Overflow and Reddit, among others) are always a source of knowledge, with the caveat that something might have been posted a few years ago and might not apply anymore. Always check before trying.

Finally, MongoDB University is a great place to keep your skills up to date and to learn about the latest features and additions: https://university.mongodb.com/.

 

Summary

In this chapter, we started our journey through web, SQL, and NoSQL technologies, from their inception to their current states. We identified how MongoDB has been shaping the world of NoSQL databases over the years, and how it is positioned against other SQL and NoSQL solutions.

We explored MongoDB’s key characteristics and how MongoDB has been used in production deployments. We identified the best practices for designing, deploying, and operating MongoDB.

Finally, we went through the documentation and online resources that can be used to stay up to date with the latest features and developments.

In the next chapter, we will go deeper into schema design and data modeling. We will look at how to connect to MongoDB by using both the official drivers and an Object Document Mapper (ODM), which is a variation of object-relational mappers for NoSQL databases.

About the Author
  • Alex Giamas

    Alex Giamas is a freelance consultant and a hands-on Lead Technical and Data Architect. Over the past 15 years, he has gained expertise in designing and developing systems for the UK Government (HMRC, Cabinet Office, DIT) and private sector (Amazon ProServe, PwC, Fintech Fortune 500, Yahoo!, Verizon) clients. Alex is an alumnus of the MassChallenge London cohort as the co-founder and CTO of a digital health startup. Alex has authored Mastering MongoDB 3.x and 4.x, both by Packt Publishing. Alex has developed large-scale, robust, distributed software systems in Python, JavaScript, Ruby, and Java. He is a MongoDB Certified Developer, a Cloudera Hadoop Certified Developer with Data Science Essentials, and a Carnegie Mellon and Stanford graduate.
