Mastering MongoDB 4.x - Second Edition


About this book

MongoDB is one of the leading platforms for working with non-relational data, organizing it flexibly in line with business needs. The recently released MongoDB 4.x adds support for multi-document ACID transactions, making the technology an asset for enterprises across the IT and fintech sectors.

This book provides expertise in advanced and niche areas of managing databases (such as modeling and querying databases), along with various administration techniques in MongoDB, thereby helping you become a successful MongoDB expert. It explains how the newly added capabilities function, with the help of interesting examples and large datasets. You will dive deeper into niche areas such as high-performance configurations, query optimization, and configuring large-scale sharded clusters. You will also learn best practices for handling database failover, as well as recovery and backup procedures that keep your data secure.

By the end of the book, you will have gained a practical understanding of administering database applications both on-premises and in the cloud, and you will be able to scale database applications across multiple servers.

Publication date: March 2019
Publisher: Packt
Pages: 394
ISBN: 9781789617870

Chapter 1. MongoDB – A Database for Modern Web

In this chapter, we will lay the foundations for understanding MongoDB, and how it claims to be a database that's designed for the modern web. Knowing how to keep learning is as important as the initial learning itself, so we will also go through the references that hold the most up-to-date information about MongoDB, for both new and experienced users. We will cover the following topics:

  • SQL and MongoDB's history and evolution
  • MongoDB from the perspective of SQL and other NoSQL technology users
  • MongoDB's common use cases and why they matter
  • MongoDB's configuration and best practices

Technical requirements


You will need MongoDB version 4+, Apache Kafka, Apache Spark, and Apache Hadoop installed to sail smoothly through this chapter. The code used in all of the chapters can be found at: https://github.com/PacktPublishing/Mastering-MongoDB-4.x-Second-Edition.


The evolution of SQL and NoSQL


Structured Query Language (SQL) existed even before the WWW. Dr. E. F. Codd originally published the paper, A Relational Model of Data for Large Shared Data Banks, in June, 1970, in the Association for Computing Machinery (ACM) journal, Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce, in 1974. Relational Software (now Oracle Corporation) was the first to develop a commercially available implementation of SQL, targeted at United States governmental agencies.

The first American National Standards Institute (ANSI) SQL standard came out in 1986. Since then, there have been eight revisions, with the most recent being published in 2016 (SQL:2016).

SQL was not particularly popular at the start of the WWW. Static content could just be hardcoded into the HTML page without much fuss. However, as the functionality of websites grew, webmasters wanted to generate web page content driven by offline data sources, in order to generate content that could change over time without redeploying code.

Common Gateway Interface (CGI) scripts, written in Perl or Unix shell, drove the early database-driven websites of Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two-tier and three-tier architectures that separated views from the business and model logic, allowing for SQL queries to be modular and isolated from the rest of the web application.

Not only SQL (NoSQL), on the other hand, is a much more modern development that followed the web's evolution, rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi in 1998, for his open source database that did not follow the SQL standard, but was still relational.

This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm at the time, reintroduced the term in early 2009, in order to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google's Bigtable and MapReduce papers, or on Amazon's Dynamo, a highly available key-value storage system.

NoSQL's foundations grew upon relaxed atomicity, consistency, isolation, and durability (ACID) properties, traded away in exchange for performance, scalability, flexibility, and reduced complexity. Most NoSQL databases have gone one way or another in providing as many of these qualities as possible, even offering adjustable guarantees to the developer. The following diagram describes the evolution of SQL and NoSQL:

The evolution of MongoDB

10gen started to develop a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document-oriented database that they built to power it, which was MongoDB. MongoDB was initially released on August 27, 2009.

Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees, but it made up for these shortcomings with performance and flexibility.

In the following sections, we will highlight the major features of MongoDB, along with the version numbers with which they were introduced.

Major feature set for versions 1.0 and 1.2

The different features of versions 1.0 and 1.2 are as follows:

  • Document-based model
  • Global lock (process level)
  • Indexes on collections
  • CRUD operations on documents
  • No authentication (authentication was handled at the server level)
  • Master and slave replication
  • MapReduce (introduced in v1.2)
  • Stored JavaScript functions (introduced in v1.2)

Version 2

The different features of version 2.0 are as follows:

  • Background index creation (since v1.4)
  • Sharding (since v1.6)
  • More query operators (since v1.6)
  • Journaling (since v1.8)
  • Sparse and covered indexes (since v1.8)
  • Compact commands to reduce disk usage
  • More efficient memory usage
  • Concurrency improvements
  • Index performance enhancements
  • Replica sets are now more configurable and data center aware
  • MapReduce improvements
  • Authentication (since 2.0, for sharding and most database commands)
  • Geospatial features introduced
  • Aggregation framework (since v2.2) and enhancements (since v2.6)
  • TTL collections (since v2.2)
  • Concurrency improvements, among which is DB-level locking (since v2.2)
  • Text searching (since v2.4) and integration (since v2.6)
  • Hashed indexes (since v2.4)
  • Security enhancements and role-based access (since v2.4)
  • V8 JavaScript engine instead of SpiderMonkey (since v2.4)
  • Query engine improvements (since v2.6)

Version 3

The different features of version 3.0 are as follows:

  • Pluggable storage engine API
  • WiredTiger storage engine introduced, with document-level locking, while the previous storage engine (now called MMAPv1) supports collection-level locking
  • Replication and sharding enhancements (since v3.2)
  • Document validation (since v3.2)
  • Aggregation framework enhanced operations (since v3.2)
  • Multiple storage engines (since v3.2, only in Enterprise Edition)
  • Query language and index collation (since v3.4)
  • Read-only database views (since v3.4)
  • Linearizable read concern (since v3.4)

Version 4

The different features of version 4.0 are as follows:

  • Multi-document ACID transactions
  • Change streams
  • MongoDB tools (Stitch, Mobile, Sync, and Kubernetes Operator)


The following diagram shows MongoDB's evolution:

As we can observe, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version, such as sharding, usable and special indexes, geospatial features, and memory and concurrency improvements.

On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the ageing MapReduce framework (which was never up to par with dedicated frameworks, such as Hadoop). Then, text search was added, and slowly but surely, MongoDB improved its performance, stability, and security, to adapt to the increasing enterprise loads of its customers.

With WiredTiger's introduction in version 3, locking became much less of an issue for MongoDB, as it was brought down from the process (global lock) to the document level, almost the most granular level possible.

Version 4 marked a major transition, bridging the SQL and NoSQL world with the introduction of multi-document ACID transactions. This allowed for a wider range of applications to use MongoDB, especially applications that require a strong real-time consistency guarantee. Further, the introduction of change streams allowed for a faster time to market for real-time applications using MongoDB. A series of tools have also been introduced, to facilitate serverless, mobile, and Internet of Things (IoT) development.

In its current state, MongoDB is a database that can handle loads ranging from startup MVPs and POCs to enterprise applications with hundreds of servers.

MongoDB for SQL developers

MongoDB was developed in the Web 2.0 era. By then, most developers had been using SQL or object-relational mapping (ORM) tools from their language of choice to access RDBMS data. As such, these developers needed an easy way to get acquainted with MongoDB from their relational background.

Thankfully, there have been several attempts at making SQL to MongoDB cheat sheets that explain the MongoDB terminology in SQL terms.

On a higher level, we have the following:

  • Databases and indexes (SQL databases and indexes)
  • Collections (SQL tables)
  • Documents (SQL rows)
  • Fields (SQL columns)
  • Embedded and linked documents (SQL joins)

Some more examples of common operations are shown in the following table:

SQL                                                  | MongoDB
-----------------------------------------------------|----------------------------------------------------------------------
Database                                             | Database
Table                                                | Collection
Index                                                | Index
Row                                                  | Document
Column                                               | Field
Joins                                                | Embed in document or link via DBRef
CREATE TABLE employees (name VARCHAR(100), age INT)  | db.createCollection("employees")
INSERT INTO employees VALUES ('Alex', 36)            | db.employees.insert({name: "Alex", age: 36})
SELECT * FROM employees                              | db.employees.find()
SELECT * FROM employees LIMIT 1                      | db.employees.findOne()
SELECT DISTINCT name FROM employees                  | db.employees.distinct("name")
UPDATE employees SET age = 37 WHERE name = 'Alex'    | db.employees.update({name: "Alex"}, {$set: {age: 37}}, {multi: true})
DELETE FROM employees WHERE name = 'Alex'            | db.employees.remove({name: "Alex"})
CREATE INDEX ON employees (name ASC)                 | db.employees.createIndex({name: 1})


Further examples of common operations can be seen at http://s3.amazonaws.com/info-mongodb-com/sql_to_mongo.pdf.

MongoDB for NoSQL developers

As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are coming to it from a NoSQL background, as well.

Putting the SQL-to-NoSQL differences aside, it is users of columnar databases who face the most challenges when migrating. With Cassandra and HBase being the most popular column-oriented database management systems, we will examine the differences and how a developer can migrate a system to MongoDB. MongoDB's features, from the perspective of a NoSQL developer, are as follows:

  • Flexibility: MongoDB's notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This is similar to the comparison between MongoDB and SQL, with the added benefit that MongoDB can more easily map to plain old objects from any programming language, allowing for easy deployment and maintenance.
  • Flexible query model: A user can selectively index some parts of each document; query based on attribute values, regular expressions, or ranges; and have as many properties per object as needed by the application layer. Primary and secondary indexes, as well as special types of indexes (such as sparse ones), can help greatly with query efficiency. Using a JavaScript shell with MapReduce makes it really easy for most developers (and many data analysts) to quickly take a look at data and get valuable insights.
  • Native aggregation: The aggregation framework provides an extract, transform, load (ETL) pipeline for users to extract and transform data from MongoDB, and either load it in a new format or export it from MongoDB to other data sources. This can also help data analysts and scientists to get the slice of data they need, performing data wrangling along the way (a brief shell sketch follows this list).
  • Schema-less model: This is a result of MongoDB's design philosophy to give applications the power and responsibility to interpret the different properties found in a collection's documents. In contrast to Cassandra's or HBase's schema-based approach, in MongoDB, a developer can store and process dynamically generated attributes.
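
The following is a minimal shell sketch of the flexible query model and the aggregation pipeline described above. The orders collection and all of its field names are hypothetical, purely for illustration:

// Sparse index: only documents that actually contain the field are indexed
db.orders.createIndex({couponCode: 1}, {sparse: true})

// Query by ranges, regular expressions, and attribute values in one predicate
db.orders.find({total: {$gte: 10, $lt: 100}, customerName: /^Al/})

// An ETL-style pipeline: extract matching documents, transform them by grouping,
// and load the results into a new collection
db.orders.aggregate([
    {$match: {status: "shipped"}},
    {$group: {_id: "$country", revenue: {$sum: "$total"}}},
    {$out: "revenue_by_country"}
])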

MongoDB's key characteristics and use cases


In this section, we will analyze MongoDB's characteristics as a database. Understanding the features that MongoDB provides can help developers and architects to evaluate the requirements at hand and how MongoDB can help to fulfill them. Also, we will go over some common use cases from the experience of MongoDB, Inc. that have delivered the best results for its users.

Key characteristics

MongoDB has grown to become a general purpose NoSQL database, offering the best of both the RDBMS and NoSQL worlds. Some of the key characteristics are as follows:

  • It is a general purpose database: In contrast to other NoSQL databases that are built for specific purposes (for example, graph databases), MongoDB can serve heterogeneous loads and multiple purposes within an application. This became even more true after version 4.0 introduced multi-document ACID transactions, further expanding the use cases in which it can be effectively used.
  • Flexible schema design: Document-oriented approaches with non-defined attributes that can be modified on the fly is a key contrast between MongoDB and relational databases.
  • It is built with high availability, from the ground up: In our era of five nines of availability, this has to be a given. Coupled with automatic failover upon detection of a server failure, this can help to achieve high uptime.
  • Feature rich: Offering the full range of SQL-equivalent operators, along with features such as MapReduce, the aggregation framework, Time to Live (TTL) and capped collections, and secondary indexing, MongoDB can fit many use cases, no matter how diverse the requirements are.
  • Scalability and load balancing: It is built to scale, both vertically and (mainly) horizontally. Using sharding, an architect can split the load between different instances and achieve both read and write scalability. Data balancing happens automatically (and transparently to the user) via the shard balancer.
  • Aggregation framework: Having an ETL framework built in the database means that a developer can perform most of the ETL logic before the data leaves the database, eliminating, in many cases, the need for complex data pipelines.
  • Native replication: Data will get replicated across a replica set without complicated setup.
  • Security features: Both authentication and authorization are taken into account, so that an architect can secure their MongoDB instances.
  • Documents stored and transmitted as BSON (Binary JSON) objects: JSON is widely used across the web for frontend and API communication, and, as such, it is easier when the database uses the same format.
  • MapReduce: Even though the MapReduce engine is not as advanced as it is in dedicated frameworks, it is nonetheless a great tool for building data pipelines.
  • Querying geospatial information in 2D and 3D: This may not be critical for many applications, but if it is for your use case, then it is really convenient to be able to use the same database for geospatial calculations and data storage.
  • Multi-document ACID transactions: Starting from version 4.0, MongoDB supports ACID transactions across multiple documents (a brief shell sketch follows this list).
  • Mature tooling: The tooling for MongoDB has evolved to support everything from DBaaS (MongoDB Atlas) to Sync, Mobile, and serverless (Stitch).
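
To make the transactions point concrete, here is a minimal sketch of a multi-document ACID transaction in the mongo shell (version 4.0+, on a replica set). The bank database and accounts collection are hypothetical:

// Transactions are run within a session; in 4.0, the collection
// must already exist (collections cannot be created in a transaction)
var session = db.getMongo().startSession()
var accounts = session.getDatabase("bank").accounts

session.startTransaction()
try {
    // Either both updates commit, or neither does
    accounts.updateOne({_id: "alice"}, {$inc: {balance: -100}})
    accounts.updateOne({_id: "bob"}, {$inc: {balance: 100}})
    session.commitTransaction()
} catch (error) {
    session.abortTransaction()
    throw error
}
session.endSession()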

Use cases for MongoDB

Since MongoDB is a highly popular NoSQL database, there have been several use cases where it has succeeded in supporting quality applications, with a great delivery time to the market.

Many of its most successful use cases center around the following areas:

  • Integration of siloed data, providing a single view of them
  • IoT
  • Mobile applications
  • Real-time analytics
  • Personalization
  • Catalog management
  • Content management

 
All of these success stories share some common characteristics. We will try to break them down in order of relative importance:

  • Schema flexibility is probably the most important one. Being able to store documents inside of a collection that can have different properties can help during both the development phase and in ingesting data from heterogeneous sources that may or may not have the same properties. This is in contrast with an RDBMS, where columns need to be predefined, and where sparse data can be penalized. In MongoDB, this is the norm, and it is a feature that most use cases share. Having the ability to deeply nest attributes into documents and add arrays of values into attributes, while also being able to search and index these fields, helps application developers to exploit the schema-less nature of MongoDB.
  • Scaling and sharding are the most common patterns for MongoDB use cases. Easily scaling using built-in sharding and using replica sets for data replication and offloading primary servers from read load can help developers store data effectively.
  • Many use cases also use MongoDB as a way of archiving data. Used as a pure data store (and not having the need to define schemas), it is fairly easy to dump data into MongoDB to be analyzed at a later date by business analysts, using either the shell or some of the numerous BI tools that can easily integrate with MongoDB. Breaking data down further, based on time caps or document counts, can help serve these datasets from RAM, the use case in which MongoDB is most effective.
  • Keeping datasets in RAM helps performance, and that's why it is commonly done in practice. In versions prior to 3.2, MongoDB used memory-mapped storage (called MMAPv1) by default, which delegates data mapping to the underlying operating system. This means that on most GNU/Linux-based systems, working with collections that can be stored in RAM will dramatically increase performance. This is less of an issue with the introduction of pluggable storage engines, such as WiredTiger (there will be more on that in Chapter 8, Monitoring, Backup, and Security).
  • Capped collections are also a feature used in many use cases. Capped collections can restrict the documents in a collection by count, or by the overall size of the collection. In the latter case, we need to have an estimate of the size per document, in order to calculate how many documents will fit into our target size. Capped collections are a quick and dirty solution to answering requests such as give me the last hour's overview of the logs, without the need for maintenance and running async background jobs to clean our collection. Oftentimes, they may be used to quickly build and operate a queuing system. Instead of deploying and maintaining a dedicated queuing system, such as ActiveMQ, a developer can use a collection to store messages, and then use the native tailable cursors provided by MongoDB to iterate through the results as they pile up and feed an external system (see the sketch after this list).
  • Low operational overhead is also a common pattern in many use cases. Developers working in agile teams can operate and maintain clusters of MongoDB servers without the need for a dedicated DBA. MongoDB Management Service (MMS) can greatly help in reducing administrative overhead, whereas MongoDB Atlas, the hosted solution by MongoDB, Inc., means that developers do not need to deal with operational headaches.
  • In terms of business sectors using MongoDB, there is a huge variety coming from almost all industries. Where there seems to be a greater penetration, however, is in cases that have to deal with lots of data with a relatively low business value in each single data point. Fields such as IoT can benefit the most by exploiting the availability over consistency design, storing lots of data from sensors in a cost-efficient way. Financial services, on the other hand, have absolutely stringent consistency requirements, aligned with proper ACID characteristics that make MongoDB more of a challenge to adapt. A financial transaction may be small in size but big in impact, which means that we cannot afford to leave a single message without proper processing.
  • Location-based data is also a field where MongoDB has thrived, with Foursquare being one of the most prominent early clients. MongoDB offers quite a rich set of features around two-dimensional and three-dimensional geolocation data, offering features such as searching by distance, geofencing, and intersections between geographical areas.
  • Overall, the rich feature set is the common pattern across different use cases. By providing features that can be used in many different industries and applications, MongoDB can be a unified solution for all business needs, offering users the ability to minimize operational overhead and, at the same time, iterate quickly in product development.
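
As a sketch of the capped collection pattern mentioned above, the following shell snippet creates a size-bounded messages collection and tails it; the collection name and size limits are illustrative only:

// Fixed-size collection: the oldest documents are evicted in insertion order
db.createCollection("messages", {capped: true, size: 1024 * 1024, max: 10000})

db.messages.insert({body: "first message", ts: new Date()})

// Tailable cursor: remains open and yields new documents as they arrive
var cursor = db.messages.find()
    .addOption(DBQuery.Option.tailable)
    .addOption(DBQuery.Option.awaitData)
while (cursor.hasNext()) {
    printjson(cursor.next())
}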

MongoDB criticism

MongoDB's criticism is associated with the following points:

  • MongoDB has had its fair share of criticism throughout the years. The web-scale proposition has been met with skepticism by many developers. The counter argument is that scale is not needed most of the time, and the focus should be on other design considerations. While this may occasionally be true, it is a false dichotomy, and in an ideal world, we would have both. MongoDB is as close as it can get to combining scalability with features, ease of use, and time to market.
  • MongoDB's schema-less nature is also a big point of debate and argument. Schema-less can be really beneficial in many use cases, as it allows for heterogeneous data to be dumped into the database without complex cleansing and without ending up with lots of empty columns or blocks of text stuffed into a single column. On the other hand, this is a double-edged sword, as a developer may end up with many documents in a collection that have loose semantics in their fields, and it can become really hard to extract these semantics at the code level. If our schema design is not optimal, we may end up with a data store, rather than a database.
  • A lack of proper ACID guarantees was a recurring complaint from the relational world. Indeed, before transactions, if a developer needed to access more than one document at a time, it was not easy to guarantee RDBMS properties. Having no transactions, in the RDBMS sense, also meant that complex writes needed application-level logic to roll back. If you need to update three documents in two collections to mark an application-level transaction complete, and the third document does not get updated for whatever reason, the application will need to undo the previous two writes, which may not exactly be trivial.
  • With the introduction of multi-document transactions in version 4.0, MongoDB can cope with ACID transactions at the expense of speed. While this is not ideal, and transactions are not meant to be used for every CRUD operation in MongoDB, it does address the main source of criticism.
  • Defaults that favored setting up MongoDB over operating it safely in a production environment have also drawn disapproval. For years, the default write behavior was fire and forget; sending a write would not wait for an acknowledgement before attempting the next write, resulting in impressive write speeds but poor behavior in cases of failure. Authentication was also an afterthought, leaving thousands of MongoDB databases on the public internet as prey to whoever wanted to read the stored data. Even though these were conscious design decisions, they are decisions that have affected developers' perceptions of MongoDB.

MongoDB configuration and best practices


In this section, we will present some of the best practices around operations, schema design, durability, replication, sharding, and security. Further information on when to implement these best practices will be presented in later chapters.

Operational best practices

As a database, MongoDB is built with developers in mind, and it was developed during the web era, so it does not require as much operational overhead as traditional RDBMSs. That being said, there are some best practices that need to be followed in order to be proactive and achieve high availability goals.

In order of importance, the best practices are as follows:

  • Turn journaling on by default: Journaling uses a write-ahead log to be able to recover if a MongoDB server gets shut down abruptly. With the MMAPv1 storage engine, journaling should always be on. With the WiredTiger storage engine, journaling and checkpointing are used together, to ensure data durability. In any case, it is a good practice to use journaling and fine-tune the size of journals and the frequency of checkpoints, to avoid the risk of data loss. In MMAPv1, the journal is flushed to the disk every 100 ms, by default. If MongoDB is waiting for the journal before acknowledging the write operation, the journal is flushed to the disk every 30 ms.
  • Your working set should fit in memory: Especially when using MMAPv1, the working set should be smaller than the RAM of the underlying machine or VM. MMAPv1 uses memory-mapped files from the underlying operating system, which can be a great benefit if there isn't much swapping happening between RAM and disk. WiredTiger, on the other hand, is much more efficient at using memory, but it still benefits greatly from the same principle. The working set is, at most, the data size plus the index size, as reported by db.stats() (a brief shell sketch follows this list).
  • Mind the location of your data files: Data files can be mounted anywhere by using the --dbpath command-line option. It is really important to make sure that data files are stored in partitions with sufficient disk space, preferably XFS, or at least Ext4.
  • Keep yourself updated with versions: Versions with an even minor number are the stable ones. So, 3.2 is stable, whereas 3.3 is not. In this example, 3.3 is the development version that will eventually materialize into the stable version 3.4. It is a good practice to always update to the latest security-patched version (4.0.2, at the time of writing this book) and to consider updating as soon as the next stable version comes out (4.2, in this example).
  • Use Mongo MMS to graphically monitor your service: The free MongoDB, Inc. monitoring service is a great tool to get an overview of a MongoDB cluster, notifications, and alerts and to be proactive about potential issues.
  • Scale up if your metrics show heavy use: Do not wait until it is too late. Utilizing more than 65% in CPU or RAM, or starting to notice disk swapping, should both be the threshold to start thinking about scaling, either vertically (by using bigger machines) or horizontally (by sharding).
  • Be careful when sharding: Sharding is a strong commitment to your shard key. If you make the wrong decision, it may be really difficult to go back, from an operational perspective. When designing for sharding, architects need to take into deep consideration the current workloads (reads/writes), as well as the current and expected data access patterns.
  • Use an application driver maintained by the MongoDB team: These drivers are supported and tend to get updated faster than drivers with no official support. If MongoDB does not support the language that you are using yet, please open a ticket in MongoDB's JIRA tracking system.
  • Schedule regular backups: No matter whether you are using standalone servers, replica sets, or sharding, a regular backup policy should also be used as a second-level guard against data loss. XFS is a great choice as a filesystem, as it can perform snapshot backups.
  • Manual backups should be avoided: Regular, automated backups should be used, when possible. If we need to resort to a manual backup, then we can use a hidden member in a replica set to take the backup from. We have to make sure that we use db.fsyncLock() on that member, to get maximum consistency at that node, along with having journaling turned on. If this volume is on AWS, we can get away with taking an EBS snapshot straight away.
  • Enable database access control: Never, ever put a database into a production system without access control. Access control should be implemented both at the node level, by a proper firewall that only allows access to the database from specific application servers, and at the database level, by using the built-in roles or defining custom ones. This has to be initialized at startup time by using the --auth command-line parameter, and it can be configured by using the admin database.
  • Test your deployment using real data: Since MongoDB is a schema-less, document-oriented database, you may have documents with varying fields. This means that it is even more important than with an RDBMS to test using data that resembles production data as closely as possible. A document with an extra field of an unexpected value can make the difference between an application working smoothly or crashing at runtime. Try to deploy a staging server using production-level data, or at least fake your production data in staging, by using an appropriate library, such as Faker for Ruby.
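
As a brief sketch of the working-set estimate and the manual-backup lock mentioned in the list above (output values will, of course, differ per deployment):

// The working set is, at most, data size plus index size
var stats = db.stats()
print("Working set estimate (bytes): " + (stats.dataSize + stats.indexSize))

// On a hidden replica set member, flush and lock writes before a manual backup...
db.fsyncLock()
// ...take the filesystem or EBS snapshot here, then release the lock
db.fsyncUnlock()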

Schema design best practices

MongoDB is schema-less, and you have to design your collections and indexes to accommodate this fact:

  • Index early and often: Identify common query patterns, using MMS, Compass GUI, or logs, and index for these early, using as many indexes as possible at the beginning of a project.
  • Eliminate unnecessary indexes: A bit counter-intuitively to the preceding suggestion, monitor your database for changing query patterns, and drop the indexes that are not being used. An index consumes RAM and I/O, as it needs to be stored and updated along with the documents in the database. Using an aggregation pipeline and $indexStats, a developer can identify the indexes that are seldom used and eliminate them (see the sketch after this list).
  • Use a compound index, rather than index intersection: Querying with multiple predicates (A and B, C or D and E, and so on) will work better with a single compound index than with multiple simple indexes, most of the time. Also, a compound index will have its data ordered by field, and we can use this to our advantage when querying. An index on fields A, B, and C will be used in queries for A, (A,B), (A,B,C), but not in querying for (B,C) or (C).
  • Avoid low-selectivity indexes: Indexing a field such as gender, for example, will statistically return half of our documents, whereas an index on last name will only return a handful of documents with the same last name.
  • Use regular expressions carefully: Since indexes are ordered by value, searching with a regular expression that has a leading wildcard (that is, /.*BASE/) won't be able to use the index. Searching with a trailing wildcard (that is, /DATA.*/) can be efficient, as long as the expression is case-sensitive and has enough leading literal characters.
  • Avoid negation in queries: Indexes index values, not the absence of them. Using NOT in queries can result in full collection scans, rather than using the index.
  • Use partial indexes: If we need to index a subset of the documents in a collection, partial indexes can help us to minimize the index set and improve performance. A partial index will include a condition on the filter that we use in the desired query.
  • Use document validation: Use document validation to monitor for new attributes being inserted into your documents and decide what to do with them. With document validation set to warn, we can keep a log of documents that were inserted with arbitrary attributes that we did not expect during the design phase, and decide whether this is a bug or a feature of our design.
  • Use MongoDB Compass: MongoDB's free visualization tool is great for getting a quick overview of our data and how it grows over time.
  • Respect the maximum document size of 16 MB: The maximum document size for MongoDB is 16 MB. This is a fairly generous limit, but it is one that should not be violated under any circumstances. Allowing for documents to grow unbounded should not be an option, and, as efficient as it may be to embed documents, we should always keep in mind that this should be under control.
  • Use the appropriate storage engine: MongoDB has introduced several new storage engines since version 3.2. The in-memory storage engine should be used for real-time workloads, whereas the encrypted storage engine should be the engine of choice when there are strict requirements around data security.
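
The following shell sketch ties several of the preceding points together; the orders collection, its fields, and the index choices are hypothetical:

// Spot seldom-used indexes with the $indexStats aggregation stage
db.orders.aggregate([{$indexStats: {}}])

// A compound index on (customerId, status, createdAt) serves queries on
// customerId, (customerId, status), and all three, but not on status alone
db.orders.createIndex({customerId: 1, status: 1, createdAt: -1})

// Partial index: only documents matching the filter expression are indexed
db.orders.createIndex(
    {couponCode: 1},
    {partialFilterExpression: {couponCode: {$exists: true}}}
)

// Document validation in warn mode: offending inserts are logged, not rejected
db.runCommand({
    collMod: "orders",
    validator: {$jsonSchema: {bsonType: "object", required: ["customerId"]}},
    validationAction: "warn"
})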

Best practices for write durability

Write durability can be fine-tuned in MongoDB, and, according to our application design, it should be as strict as possible, without affecting our performance goals.

Fine-tune the interval at which data is flushed to disk. In the WiredTiger storage engine, the default is to flush data to disk every 60 seconds after the last checkpoint, or after 2 GB of data has been written. This can be changed by using the --wiredTigerCheckpointDelaySecs command-line option.

In MMAPv1, data files are flushed to the disk every 60 seconds. This can be changed by using the --syncDelay command-line option. Beyond these settings, we can also take the following measures:

  • Use the XFS filesystem with WiredTiger for multi-disk consistent snapshots
  • Turn off atime and diratime on data volumes
  • Make sure that you have enough swap space (usually double your memory size)
  • Use the NOOP scheduler if you are running in a virtualized environment
  • Raise the file descriptor limits to the tens of thousands
  • Disable transparent huge pages and use standard 4 KB virtual memory pages instead
  • Use journaled write safety, at the very least (see the sketch after this list)
  • Set the SSD read-ahead default to 16 blocks, and the HDD read-ahead to 32 blocks
  • Turn NUMA off in the BIOS
  • Use RAID 10
  • Synchronize the time between hosts by using NTP, especially in sharded environments
  • Only use 64-bit builds in production; 32-bit builds are outdated and can only support up to 2 GB of memory
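
As a minimal sketch of journaled write safety from the shell (the events collection is illustrative):

// The write is acknowledged only after it has been written to the journal
db.events.insert(
    {type: "audit", ts: new Date()},
    {writeConcern: {w: 1, j: true}}
)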

Best practices for replication

Replica sets are MongoDB's mechanism for providing redundancy, high availability, and higher read throughput, under the right conditions. In MongoDB, replication is easy to configure and operate:

  • Always use replica sets: Even if your dataset is small at the moment, and you don't expect it to grow exponentially, you never know when that might happen. Also, having a replica set of at least three servers helps to design for redundancy, separating the workloads between real time and analytics (using the secondary) and having data redundancy built from day one.
  • Use a replica set to your advantage: A replica set is not just for data replication. We can (and should, in most cases) use the primary server for writes, and prefer reads from one of the secondaries, to offload the primary server. This can be done by setting read preferences for reads, along with the correct write concern, to ensure that writes propagate as needed (see the sketch after this list).
  • Use an odd number of replicas in a MongoDB replica set: If a server is down or loses connectivity with the rest of them (network partitioning), the rest have to vote as to which one will be elected as the primary server. If we have an odd number of replica set members, we can guarantee that each subset of servers knows whether it belongs to the majority or the minority of the replica set members. If we cannot have an odd number of replicas, we need to add one extra host as an arbiter, with the sole purpose of voting in the election process. Even a micro instance in EC2 could serve this purpose.
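
A minimal sketch of the read preference and write concern settings mentioned above, from the mongo shell (the products collection is hypothetical):

// Route this connection's reads to a secondary when one is available
db.getMongo().setReadPref("secondaryPreferred")

// Write with a majority, journaled write concern so that writes propagate as needed
db.products.insert(
    {sku: "ABC-123", price: 9.99},
    {writeConcern: {w: "majority", j: true}}
)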

Best practices for sharding

Sharding is MongoDB's solution for horizontal scaling. In Chapter 8, Monitoring, Backup, and Security, we will go over its usage in more detail, but the following are some best practices, based on the underlying data architecture:

  • Think about query routing: Based on different shard keys and techniques, the mongos query router may direct the query to some (or all) of the members of a shard. It is important to take our queries into account when designing sharding, so that we don't end up with our queries hitting all of our shards.
  • Use tag-aware sharding: Tags can provide a more fine-grained distribution of data across our shards. Using the right set of tags for each shard, we can ensure that subsets of data get stored in a specific set of shards. This can be useful for data proximity between application servers, MongoDB shards, and users (a brief sketch follows this list).
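
A minimal sketch of tag-aware sharding from a mongos shell; the database, collection, shard name, shard key, and tag are all hypothetical:

// Shard on a compound key that matches the dominant query pattern
sh.enableSharding("mydb")
sh.shardCollection("mydb.users", {country: 1, _id: 1})

// Pin a range of the shard key to shards tagged "EU" for data proximity
sh.addShardTag("shard0000", "EU")
sh.addTagRange(
    "mydb.users",
    {country: "DE", _id: MinKey},
    {country: "DE", _id: MaxKey},
    "EU"
)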

Best practices for security

Security is always a multi-layered approach, and these few recommendations do not form an exhaustive list; they are just the bare basics that need to be done in any MongoDB database:

  • The HTTP status interface should be disabled.
  • The RESTful API should be disabled.
  • The JSON API should be disabled.
  • Connect to MongoDB using SSL.
  • Audit the system activity.
  • Use a dedicated system user to access MongoDB with appropriate system-level access.
  • Disable server-side scripting if it is not needed. This will affect MapReduce, the built-in db.group() command, and $where operations. If these are not used in your codebase, it is better to disable server-side scripting at startup by using the --noscripting parameter. A sketch of database-level access control follows this list.
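
As a minimal sketch of the built-in role-based access control (the user, password, and database names are placeholders, and mongod is assumed to have been started with --auth):

// Run once against the admin database
use admin
db.createUser({
    user: "appUser",
    pwd: "changeThisPassword",  // placeholder; use a strong secret in practice
    roles: [{role: "readWrite", db: "mydb"}]
})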

Best practices for AWS

When we are using MongoDB, we can use our own servers in a data center, a MongoDB-hosted solution such as MongoDB Atlas, or we can get instances from Amazon by using EC2. EC2 instances are virtualized and share resources in a transparent way, with collocated VMs in the same physical host. So, there are some more considerations to take into account if you are going down that route, as follows:

  • Use EBS-optimized EC2 instances.
  • Get EBS volumes with provisioned I/O operations per second (IOPS) for consistent performance.
  • Use EBS snapshotting for backup and restore.
  • Use different availability zones for high availability and different regions for disaster recovery. Using the different availability zones within each region that Amazon provides guarantees that our data will be highly available. Different regions should only be used for disaster recovery, in case a catastrophic event ever takes out an entire region. A region might be eu-west-2 (London), whereas an availability zone is a subdivision within a region; currently, two availability zones are available for London.
  • Deploy global; access local.
  • For truly global applications with users from different time zones, we should have application servers in different regions access the data that is closest to them, using the right read preference configuration in each server.

Reference documentation


Reading a book is great (and reading this book is even greater), but continuous learning is the only way to keep up to date with MongoDB. In the following sections, we will highlight the places that you should go to for updates and development/operational references.

MongoDB documentation

The online documentation available at https://docs.mongodb.com/manual/ is the starting point for every developer, new or seasoned.

The JIRA tracker is a great place to take a look at fixed bugs and the features that are coming up next: https://jira.mongodb.org/browse/SERVER/.

Packt references

Some other great books on MongoDB are as follows:

  • MongoDB for Java Developers, by Francesco Marchioni
  • MongoDB Data Modeling, by Wilson da Rocha França
  • Any book by Kristina Chodorow

Further reading 


The MongoDB user group (https://groups.google.com/forum/#!forum/mongodb-user) has a great archive of user questions about features and long-standing bugs. It is a place to go when something doesn't work as expected.

Online forums (Stack Overflow and Reddit, among others) are always a source of knowledge, with the caveat that something may have been posted a few years ago and may not apply anymore. Always check before trying.

Finally, MongoDB University is a great place to keep your skills up to date and to learn about the latest features and additions: https://university.mongodb.com/.

Summary


In this chapter, we started our journey through web, SQL, and NoSQL technologies, from their inception to their current states. We identified how MongoDB has been shaping the world of NoSQL databases over the years, and how it is positioned against other SQL and NoSQL solutions.

We explored MongoDB's key characteristics and how MongoDB has been used in production deployments. We identified the best practices for designing, deploying, and operating MongoDB.

Finally, we went through the documentation and online resources that can be used to keep learning and to stay up to date with the latest features and developments.

In the next chapter, we will go deeper into schema design and data modeling, looking at how to connect to MongoDB by using both the official drivers and an Object Document Mapper (ODM), a variation of object-relational mappers for NoSQL databases.

About the Author

Alex Giamas

Alex Giamas is a consultant and a hands-on technical architect at the Department for International Trade in the UK Government. His experience spans software architecture and development using NoSQL and big data technologies. For more than 15 years, he has contributed to Fortune 15 companies and has worked as a start-up CTO. He is the author of Mastering MongoDB 3.x, published by Packt Publishing, which is based on his use of MongoDB since 2009. Alex has worked with a wide array of NoSQL and big data technologies, building scalable and highly available distributed software systems in Python, Java, and Ruby. He is a MongoDB Certified Developer, as well as a Cloudera Certified Developer for Apache Hadoop and Data Science Essentials.

