In this chapter, we will lay the foundations for understanding MongoDB and how it is a database designed for the modern web. We will cover the following topics:
- The web, SQL, and MongoDB's history and evolution.
- MongoDB from theÂ perspective ofÂ SQL and other NoSQL technology users.
- MongoDB's common use cases and why they matter.
- Configuration best practices:
- Schema design
- Write durability
- Learning to learn. Nowadays, learning how to learn is as important as learning in the first place. We will go through references that have the most up to date information about MongoDB for both new and experienced users.
In March 1989, more than 28 years ago, Sir Tim Berners-Lee unveiled his vision for what would later be named the World Wide Web (WWW) in a document called Information Management: A ProposalÂ (http://info.cern.ch/Proposal.html). Since then, the WWW has grown to be a tool of information, communication, and entertainment for more than two of every five people on our planet.
The first version of the WWW relied exclusively on web pages and hyperlinks between them, a concept kept until present times. It was mostly read-only, with limited support for interaction between the user and the web page. Brick and mortar companies were using it to put up their informational pages. Finding websites could only be done using hierarchical directories like Yahoo! and DMOZ. The web was meant to be an information portal.
This, while not being Sir Tim Berners-Lee's vision, allowed media outlets such as the BBC and CNN to create a digital presence and start pushing out information to the users. It revolutionized information access as everyone in the world could get first-hand access to quality information at the same time.
Web 1.0 was totally device and software independent, allowing for every device to access all information. Resources were identified by address (the website's URL) and open protocols (
DELETE) could be used to access content resources.
Hyper Text Markup Language (HTML) was used to develop web sites that were serving static content. There was no notion of Cascading Style Sheets (CSS) as positioning of elements in a page could only be modified using tables and framesets were used extensively to embed information in pages.
This proved to be severely limiting and so browser vendors back then started adding custom HTML tags like
<marquee> which lead to the first browser wars, with rivals Microsoft (Internet Explorer) and Netscape racing to extend the HTTP protocol's functionality. Web 1.0 reached 45 million users by 1996.
Here is the LycosÂ startÂ pageÂ as it appeared in Web 1.0Â http://www.lycos.com/:
Yahoo as appeared in Web 1.0Â http://www.yahoo.com:
A term first defined and formulated by Tim O'Reilly, we use it to describe our current WWW sites and services. Its main characteristic is that the web moved from being read-only to the read-write state. Websites evolved into services and human collaboration plays an ever important part in Web 2.0.
From simple information portals, we now have many more types of services such as:
Web 2.0 reached 1+ billion users in 2006 and 3.77 billion users at the time of writing this book (late 2017). Building communities was the differentiating factor for Web 2.0, allowing internet users to connect on common interests, communicate, and share information.
Personalization plays an important part of Web 2.0 with many websites offering tailored content to its users. Recommendation algorithms and human curation decides the content to show to each user.
Moving from websites to web applications also unveiled the era of Service Oriented Architecture (SOA). Applications can interconnect with each other, exposing data through Application Programming Interfaces (API) allowing to build more complex applications on top of application layers.
One of the applications that defined Web 2.0 are social apps. Facebook with 1.86 billion monthly active users at the end of 2016 is the most well known example. We use social networks and many web applications share social aspects that allow us to communicate with peers and extend our social circle.
It's not yet here, but Web 3.0 is expected to bring Semantic Web capabilities. Advanced as Web 2.0 applications may seem, they all rely mostly on structured information. We use the same concept of searching for keywords and matching these keywords with web content without much understanding of context, content and intention of user's request. Also calledÂ Web of Data, Web 3.0 will rely on inter-machine communication and algorithms to provide rich interaction via diverse human computer interfaces.
Structured Query Language existed even before the WWW. Dr. EF Codd originally published the paperÂ A Relational Model of Data for Large Shared Data Banks, in June 1970, in the Association of Computer Machinery (ACM) journal, Communications of the ACM. SQL was initially developed at IBM by Chamberlin and Boyce in 1974. Relational Software (now Oracle Corporation) was the first to develop a commercially available implementation of SQL, targeted at United States governmental agencies.
The first American National Standards Institute (ANSI) SQL standard came out in 1986 and since then there have been eight revisions with the most recent being published in 2016 (SQL:2016).
SQL was not particularly popular at the start of the WWW. Static content could just be hard coded into the HTML page without much fuss. However, as functionality of websites grew, webmasters wanted to generate web page content driven by offline data sources to generate content that could change over time without redeploying code.
Common Gateway Interface (CGI) scripts in Perl or Unix shell were driving early database driven websites in Web 1.0. With Web 2.0, the web evolved from directly injecting SQL results into the browser to using two- and three-tier architecture that separated views from business and model logic, allowing for SQL queries to be modular and isolated from the rest of a web application.
Not only SQL (NoSQL) on the other hand is much more modern and supervenes web evolution, rising at the same time as Web 2.0 technologies. The term was first coined by Carlo Strozzi in 1998 for his open source database that was not following the SQL standard but was still relational.
This is not what we currently expect from a NoSQL database. Johan Oskarsson, a developer at Last.fm at the time, reintroduced the term in early 2009 to group a set of distributed, non-relational data stores that were being developed. Many of them were based on Google's Bigtable and MapReduce papers or Amazon's Dynamo highly available key-value based storage system.
NoSQL foundations grew upon relaxed ACID (atomicity, consistency, isolation, durability) guarantees in favor of performance, scalability, flexibility and reduced complexity. Most NoSQL databases have gone one way or another in providing as many of theÂ previously mentioned qualities as possible, even offering tunable guarantees to the developer.
Timeline of SQL and NoSQL evolution
10gen started developing a cloud computing stack in 2007 and soon realized that the most important innovation was centered around the document oriented database that they built to power it, MongoDB. MongoDB was initially released on August 27th, 2009.
Version 1 of MongoDB was pretty basic in terms of features, authorization, and ACID guarantees and made up for these shortcomings with performance and flexibility.
In the following sections, we can see the major features along with the version number with which they were introduced.
- Document-based model
- Global lock (process level)
- Indexes on collections
- CRUD operations on documents
- No authentication (authentication was handled at the server level)
- Master/slave replication
- MapReduce (introduced in v1.2)
- Background index creation (since v.1.4)
- Sharding (since v.1.6)
- More query operators (since v.1.6)
- Journaling (since v.1.8)
- Sparse and covered indexes (since v.1.8)
- Compact command to reduce disk usage
- Memory usage more efficient
- Concurrency improvements
- Index performance enhancements
- Replica sets are now more configurable and data center aware
- MapReduce improvements
- Authentication (since 2.0 for sharding and most database commands)
- Geospatial features introduced
- Aggregation framework (since v.2.2) and enhancements (since v.2.6)
- TTL collections (since v.2.2)
- Concurrency improvements among which DB level locking (since v.2.2)
- Text search (since v.2.4) and integration (since v.2.6)
- Hashed index (since v.2.4)
- Security enhancements, role based access (since v.2.4)
- Query engine improvements (since v.2.6)
- Pluggable storage engine API
- WiredTiger storage engine introduced, with document level locking while previous storage engine (now called MMAPv1) supports collection level locking
- Replication and sharding enhancements (since v.3.2)
- Document validation (since v.3.2)
- Aggregation framework enhanced operations (since v.3.2)
- Multiple storage engines (since v.3.2, only in Enterprise Edition)
MongoDB evolution diagram
As one can observe, version 1 was pretty basic, whereas version 2 introduced most of the features present in the current version such as sharding, usable and special indexes, geospatial features, and memory and concurrency improvements.
On the way from version 2 to version 3, the aggregation framework was introduced, mainly as a supplement to the ageing (and never up to par with dedicated frameworks like Hadoop) MapReduce framework. Then, adding text search and slowly but surely improving performance, stability, and security to adapt to the increasing enterprise load of customers using MongoDB.
With WiredTiger's introduction in version 3, locking became much less of an issue for MongoDB as it was brought down from process (global lock) to document level, almost the most granular level possible.
At its current state, MongoDB is a database that can handle loads ranging from startup MVPs and POCs to enterprise applications with hundreds of servers.
MongoDB was developed in the Web 2.0 era. By then, most developers had been using SQL or Object-relational mappingÂ (ORM) tools from their language of choice to access RDBMS data. As such, these developers needed an easy way to get acquainted with MongoDB from their relational background.
Thankfully, there have been several attempts at SQL to MongoDB cheat sheets that explain MongoDB terminology in SQL terms.
On a higher level there are:
- Databases, indexes just like in SQL databases
- Collections (SQL tables)
- Documents (SQL rows)
- Fields (SQL columns)
- Embedded and linked documents (SQL Joins)
Some more examples of common operations:
Embed in document or link via DBRef
As MongoDB has grown from being a niche database solution to the Swiss Army knife of NoSQL technologies, more developers are coming to it from a NoSQL background as well.
Setting the SQL to NoSQL differences aside, users from columnar type databases face the most challenges. Cassandra and HBase being the most popular column oriented database management systems, we will examine the differences and how a developer can migrate a system to MongoDB.
- Flexibility: MongoDB's notion of documents that can contain sub-documents nested in complex hierarchies is really expressive and flexible. This is similar to the comparison between MongoDB and SQL, with the added benefit that MongoDB can map easier to plain old objects from any programming language, allowing for easy deployment and maintenance.
- Native aggregation: The aggregation framework provides an ETL pipeline for users to extract and transform data from MongoDB and either load them in a new format or export it from MongoDB to other data sources. This can also help data analysts and scientists get the slice of data they need performing data wrangling along the way.
- Schemaless model: This is a result of MongoDB's design philosophy to give applications the power and responsibility to interpret different properties found in a collection's documents. In contrast to Cassandra's or HBase's schema based approach, in MongoDB a developer can store and process dynamically generated attributes.
In this section, we will analyze MongoDB's characteristics as a database. Understanding the features that MongoDB provides can help developers and architects evaluate the requirement at hand and how MongoDB can help fulfill it. Also, we will go through some common use cases from MongoDB Inc's experience that have delivered the best results for its users.
MongoDB has grown to a general purpose NoSQL database, offering the best of both RDBMS and NoSQL worlds. Some of the key characteristics are:
- It's a general purpose database. In contrast with other NoSQL databases that are built for purpose (for example, graph databases), MongoDB can serve heterogeneous loads and multiple purposes within an application.
- Flexible schema design. Document oriented approaches with non-defined attributes that can be modified on the fly is a key contrast between MongoDB and relational databases.
- It's built with high availability from the ground up. In our era of five nines in availability, this has to be a given. Coupled with automatic failover on detection of a server failure, this can help achieve high uptime.
- Feature rich. Offering the full range of SQL equivalent operators along with features such as MapReduce, aggregation framework, TTL/capped collections, and secondary indexing, MongoDB can fit many use cases, no matter how diverse the requirements are.
- Scalability and load balancing. It's built to scale, both vertically but most importantly horizontally. Using sharding, an architect can share load between different instances and achieve both read and write scalability. Data balancing happens automatically and transparently to the user by the shard balancer.
- Aggregation framework. Having an extract transform load framework built in the database means that a developer can perform most of the ETL logic before the data leaves the database, eliminating in many cases the need for complex data pipelines.
- Native replication. Data will get replicated across a replica set without complicated setup.
- Security features. Both authentication and authorization are taken into account so that an architect can secure her MongoDB instances.
- JSON (BSON, Binary JSON) objects for storing and transmitting documents. JSON is widely used across the web for frontend and API communication and as such it's easier when the database is using the same protocol.
- MapReduce. Even though the MapReduce engine isn't as advanced as it is in dedicated frameworks, it is nonetheless a great tool for building data pipelines.
- Querying and geospatial information in 2D and 3D. This may not be critical for many applications, but if it is for your use case then it's really convenient to be able to use the same database for geospatial calculations along with data storage.
MongoDB being a hugely popular NoSQL database means that there are several use cases where it has succeeded in supporting quality applications with a great time to market delivery time.
Many of its most successful use cases center around the following areas:
- Integration of siloed data providing a single view of them
- Internet of Things
- Mobile applications
- Real-time analytics
- Catalog management
- Content management
All these success stories share some common characteristics. We will try and break these down in order of relative importance.
Schema flexibility is most probably the most important one. Being able to store documents inside a collection that can have different properties can help both during development phase but also in ingesting data from heterogeneous sources that may or may not have the same properties. In contrast with an RDBMS where columns need to be predefined and having sparse data can be penalized, in MongoDB this is the norm and it's a feature that most use cases share. Having the ability to deep nest attributes into documents, add arrays of values into attributes and all the while being able to search and index these fields helps application developers exploit the schema-less nature of MongoDB.
Scaling and sharding are the most common patterns for MongoDB use cases. Easily scaling using built-in sharding and using replica sets for data replication and offloading primary servers from read load can help developers store data effectively.
Many use cases also use MongoDB as a way of archiving data. Used as a pure data store and not having the need to define schemas, it's fairly easy to dump data into MongoDB, only to be analyzed at a later date by business analysts either using the shell or some of the numerous BI tools that can integrate easily with MongoDB. Breaking data down further based on time caps or document count can help serve these datasets from RAM, the use case where MongoDB is most effective.
On this point, keeping datasets in RAM is more often another common pattern. MongoDB uses MMAP storage (called MMAPv1) in most versions up to the most recent, which delegates data mapping to the underlying operating system. This means thatÂ most GNU/Linux based systems working with collections that can be stored in RAM will dramatically increase performance. This is less of an issue with the introduction of pluggable storage engines like WiredTiger, more on that in Chapter 8, Storage Engines.
Capped collections are also a feature used in many use cases. Capped collections can restrict documents in a collection by count or by overall size of the collection. In the latter case, we need to have an estimate of size per document to calculate how many documents will fit in our target size. Capped collections are a quick and dirty solution to answer requests like "Give me the last hour's overview of the logs." without any need for maintenance and running async background jobs to clean our collection. Oftentimes, these may be used to quickly build and operate a queuing system. Instead of deploying and maintaining a dedicated queuing system like ActiveMQ, a developer can use a collection to store messages and then use native tailable cursors provided by MongoDB to iterate through results as they pile up and feed an external system.
Low operational overhead is also a common pattern in use cases. Developers working in agile teams can operate and maintain clusters of MongoDB servers without the need for a dedicated DBA. MongoDB Management Service can greatly help in reducing administrative overhead, whereas MongoDB Atlas, the hosted solution by MongoDB Inc., means that developers don't need to deal with operational headaches.
In terms of business sectors using MongoDB, there is a huge variety coming from almost all industries. Where there seems to be a greater penetration though, is in cases that have to deal with lots of data with a relatively low business value in each single data point. Fields like IoT can benefit the most by exploiting availability over consistency design, storing lots of data from sensors in a cost efficient way. Financial services on the other hand, many times have absolutely stringent consistency requirements aligned with proper ACID characteristics that make MongoDB more of a challenge to adapt. Transactions carrying financial data can be a few bytes but have an impact of millions of dollars, hence all the safety nets around transmitting this type of information correctly.
Location-based data is also a field where MongoDB has thrived. Foursquare being one of the most prominent early clients, MongoDB offers quite a rich set of features around 2D and 3D geolocation data, offering features like searching by distance, geofencing, and intersection between geographical areas.
Overall, the rich feature set is the common pattern across different use cases. By providing features that can be used in many different industries and applications, MongoDB can be a unified solution for all business needs, offering users the ability to minimize operational overhead and at the same time iterate quickly in product development.
MongoDB has had its fair share of criticism throughout the years. The web-scaleÂ proposition has been met with skepticism by many developers. The counter argument is that scale is not needed most of the time and we should focus on other design considerations. While this may be true on several occasions, it's a false dichotomy and in an ideal world we would have both. MongoDB is as close as it can get to combining scalability with features and ease of use/time to market.
MongoDB's schema-less nature is also a big point of debate and argument. Schema-less can be really beneficial in many use cases as it allows for heterogeneous data to be dumped into the database without complex cleansing or ending up with lots of empty columns or blocks of text stuffed into a single column. On the other hand, this is a double-edged sword as a developer may end up with many documents in a collection that have loose semantics in their fields and it becomes really hard to extract these semantics at the code level. What we can have in the end if schema design is not optimal, is a plain datastore rather than a database.
Lack of proper ACID guarantees is a recurring complaint from the relational world. Indeed, if a developer needs access to more than one document at a time it's not easy to guarantee RDBMS properties as there are no transactions. Having no transactions in the RDBMS sense also means that complex writes will need to have application level logic to rollback. If you need to update three documents in two collections to mark an application level transaction complete and the third document doesn't get updated for whatever reason, the application will need to undo the previous two writes, something that may not be exactly trivial.
Defaults that favored setting up MongoDB but not operating it in a production environment are also frowned upon. For years, the default write behavior was write and forget, sending a write wouldn't wait for an acknowledgement before attempting the next write, resulting in insane write speeds with poor behavior in case of failure. Authentication is also an afterthought, leaving thousands of MongoDB databases in the public internet prey to whoever wants to read the stored data. Even though these were conscious design decisions, they are decisions that have affected developers' perception of MongoDB.
There are of course good points to be made from criticism. There are use cases where a non relational, unsupporting transactions database will not be a good choice. Any application that depends on transactions and places ACID properties higher than anything else is probably a great use case for a traditional RDBMS but not for a NoSQL database.
Without diving too deep into why, in this section we present some best practices around operations, schema design, durability, replication, sharding, and security. More information as to why and how to implement these best practices will be presented in the respective chapters and as always with best practices, these have to be taken with a pinch of salt.
MongoDB as a database is built with developers in mind and developed during the web era so does not require as much operational overhead as traditional RDBMSs. That being said, there are some best practices that need to be followed to be proactive and achieve high availability goals.
In order of importance (somewhat), here they are:
- Turn journaling on by default: Journaling uses a write ahead log to be able to recover in case a mongo server gets shut down abruptly. With MMAPv1 storage engine, journaling should be always on. With WiredTiger storage engine, journaling and checkpointing are used together to ensure data durability. In any case, it's a good practice to use journaling and fine tune the size of journals and frequency of checkpoints to avoid risk of data loss. In MMAPv1, the journal is flushed to disk every 100 ms by default. If MongoDB is waiting for the journal before acknowledging the write operation, the journal is flushed to disk every 30 ms.
- Your working set should fit in memory: Again, especially when using MMAPv1 the working set is best being less than the RAM of the underlying machine or VM. MMAPv1 uses memory mapped files from the underlying operating system which can benefit greatly if there isn't much swap happening between RAM and disk. WiredTiger on the other hand is much more efficient at using memory but still benefits greatly from the same principles. The working set is at maximum the datasize plus index size as reported by
- Mind the location of your data files: Data files can be mounted anywhere using the
--dbpathcommand line option. It is really important to make sure data files are stored in partitions with sufficient disk space, preferably XFS or at least Ext4.
- Keep yourself updated with versions:Â Odd major numbered versions are the stable ones. So, 3.2 is stable whereas 3.3 is not. In this exampleÂ 3.3Â is the development version that will eventually materialize into stable version 3.4 . It's a good practice to always update to the latest security updated version (3.4.3 at the time of writing) and consider updating as soon as the next stable version comes out (3.6 at this example).
- Use Mongo MMS to graphically monitor your service: MongoDB Inc's free monitoring service is a great tool to get an overview of a MongoDB cluster, notifications, and alerts and be proactive about potential issues.
- Scale up if your metrics show heavy use: Actually not really heavy usage. Key metrics of >65% in CPU, RAM, or if you are starting to notice disk swapping should be an alert to start thinking about scaling, either vertically by using bigger machines or horizontally by sharding.
- Be careful when sharding: Sharding is like a strong commitment to your shard key. If you make the wrong decision it may be really difficult operationally to go back. When designing for sharding, architects need to take a long and deep consideration of current workloads both in reads and also writes plus what the expected data access patterns are.
- Use an application driver maintained by the MongoDB team: These drivers are supported and in general get updated faster than their equivalents. If MongoDB does not support the language you are using yet, please open a ticket in MongoDB's JIRA tracking system.
- Schedule regular backups: No matter if you are using standalone servers, replica sets, or sharding, a regular backup policy should also be used as a second level guard against data loss. XFS is a great choice as a filesystem as it can perform snapshot backups.
- Manual backups should be avoided: Regular automated backups should be used when possible. If we need to resort to a manual backup then we can use a hidden member in a replica set to take the backup from. We have to make sure that we are using
db.fsyncwithlockat this member to get the maximum consistency at this node, along with journaling turned on. If this volume is on AWS, we can get away with taking an EBS snapshot straight away.
- Enable database access control: Never, ever put a database in a production system without access control. Access control should both be implemented at a node level by a proper firewall that only allows access to specific application servers to the database and also in DB level by using the built-in roles, or defining custom defined ones. This has to be initialized at startup time by using the
--authcommand-line parameter and configured using the
- Test your deployment using real data: MongoDB being a schema-less document oriented database means that you may have documents with varying fields. This means that it's even more important than with an RDBMS to test using data that resembles production data as closely as possible. A document with an extra field of an unexpected value can make the difference between an application working smoothly or crashing at runtime. Try to deploy a staging server using production level data or at least fake your production data in staging using an appropriate library like Faker for Ruby.
MongoDB is schema-less and you have to design your collections and indexes to accommodate for this fact:
- Index early and often: Identify common query patterns using MMS, Compass GUI, or logs and index for these early and using as many indexes as possible at the beginning of a project.
- Eliminate unnecessary indexes: A bit counter-intuitive to the preceding suggestion, monitor your database for changing query patterns and drop the indexes that aren't being used. An index will consume RAM and I/O as it needs to be stored and updated alongside with documents in the database. Using an aggregation pipeline and
$indexStatsa developer can identify indexes that are seldom being used and eliminate them.
- Use a compound index rather than index intersection: Querying with multiple predicates (A and B , C or D and E and so on) will most of the time work better with a single compound index than with multiple simple indexes. Also, a compound index will have its data ordered by field and we can use this to our advantage when querying. An index on fields A,B,C will be used in queries for A, (A,B), (A,B,C) but not in querying for (B,C) or (C) .
- Low selectivity indexes: Indexing a field on gender for example will statistically still return half of our documents back, whereas an index on last name will only return a handful of documents with the same last name.
- Use of regular expressions: Again, since indexes are ordered by value, searching using a regular expression with leading wildcards (that is,
/.*BASE/) won't be able to use the index. Searching with trailing wildcards (that is,
/DATA.*/) can be efficient as long as there are enough case sensitive characters in the expression.
- Avoid negation in queries: Indexes are indexing values, not the absence of them. Using
NOTin queries can result in full table scans instead of using the index.
- Use partial indexes: If we need to index a subset of the documents in a collection, partial indexes can help us minimize the index set and improve performance. A partial index will include a condition on the filter that we use in the desired query.
- Use document validation: Use document validation to monitor for new attributes being inserted to your documents and decide what to do with them. With document validation set to warn, we can keep a log of documents that were inserted with arbitrary attributes that we didn't expect during the design phase and decide if this is a bug or a feature of our design.
- Use MongoDB Compass: MongoDB's free visualization tool is great to get a quick overview of our data and how it grows across time.
- Respect the maximum document size of 16 MB: The maximum document size for MongoDB is 16 MB. This is a fairly generous limit but it is one that should not be violated under any circumstance. Allowing documents to grow unbounded should not be an option and as efficient as it may be to embed documents, we should always keep in mind that this should be under control.
- Use the appropriate storage engine:Â MongoDB has introduced several new storage engines since version 3.2. The in-memory storage engine should be used for real-time workloads, whereas the encrypted storage engine should be the engine of choice when there are strict requirements around data security.
Writing durability can be fine tuned in MongoDB and according to our application design it should be as strict as possible without affecting our performance goals.
Fine tune data flush to disk interval: In the WiredTiger storage engine, the default is to flush data to disk every 60 seconds after the last checkpoint, or after 2 GB of data has been written. This can be changed using the
--wiredTigerCheckpointDelaySecs command-line option.
In MMAPv1, data files are flushed to disk every 60 seconds. This can be changed using the
--syncDelay command-line option:
- With WiredTiger, use the XFS filesystem for multi-disk consistent snapshots
- Turn off atime and diratime in data volumes
- Make sure you have enough swap space, usually double your memory size
- Use a NOOP scheduler if running in virtualized environments
- Raise file descriptor limits to the tens of thousands
- Disable transparent huge pages, enable standard 4K VM pages instead
- Write safety should be at least journaled
- SSD read ahead default should be set to 16 blocks, HDD should be 32 blocks
- Turn NUMA off in BIOS
- Use RAID 10
- Synchronize time between hosts using NTP especially in sharded environments
- Only use 64-bit builds for production; 32-bit builds are outdated and can only support up to 2 GB of memory
Replica sets are MongoDB's mechanism to provide redundancy, high availability, and higher read throughput under the right conditions. Replication in MongoDB is easy to configure and light in operational terms:
- Always use replica sets: Even if your dataset is at the moment small and you don't expect it to grow exponentially, you never know when that might happen. Also, having a replica set of at least three servers helps design for redundancy, separating work loads between real time and analytics (using the secondaries) and having data redundancy built from day one.
- Use a replica set to your advantage: A replica set is not just for data replication. We can and should in most cases use the primary server for writes and preference reads from one of the secondaries to offload the primary server. This can be done by setting read preference for reads, together with the correct write concern to ensure writes propagate as needed.
- Use an odd number of replicas in a MongoDB replica set: If a server does down or loses connectivity with the rest of them (network partitioning), the rest have to vote as to which one will be elected as the primary server. If we have an odd number of replica set members, we can guarantee that each subset of servers knows if they belong to the majority or the minority of the replica set members. If we can't have an odd number of replicas, we need to have one extra host set as an arbiter with the sole purpose of voting in the election process. Even a micro instance in EC2 could serve this purpose.
Sharding is MongoDB's solution for horizontal scaling. In Chapter 8,Â Storage Engines,Â we will cover how to use it in more detail, here are some best practices based on the underlying data architecture:
- Think about query routing: Based on different shard keys and techniques, the mongos query router may direct the query to some or all of the members of a shard. It's important to take our queries into account when designing sharding so that we don't end up with our queries hitting all of our shards.
- Use tag aware sharding: Tags can provide more fine-grained distribution of data across our shards. Using the right set of tags for each shard, we can ensure that subsets of data get stored in a specific set of shards. This can be useful for data proximity between application servers, MongoDB shards, and the users.
Security is always a multi-layered approach and these few recommendations do not form an exhaustive list, rather just the bare basics that need to be done in any MongoDB database:
- HTTP status interface should be disabled.
- REST API should be disabled.
- JSON API should be disabled.
- Connect to MongoDB using SSL.
- Audit system activity.
- Use a dedicated system user to access MongoDB with appropriate system level access
- Disable server-side scripting if not needed. This will affect MapReduce, built-in
$whereoperations. If these are not used in your codebase, it is better to disable server-side scripting at startup using the
When using MongoDB, we can use our own servers in a datacenter, a MongoDB hosted solution like MongoDB Atlas, or get instances from Amazon using EC2. EC2 instances are virtualized and share resources in a transparent way with collocated VMs in the same physical host. So there are some more considerations to take into account if going down that route:
- Use EBS optimized EC2 instances.
- Get EBS volumes with provisioned IOPS (I/O operations per second) for consistent performance.
- Use EBS snapshotting for backup and restore.
- Use different Availability Zones for High Availability and different regions for Disaster Recovery. Different availability zones within each region that Amazon provides guarantee that our data will be highly available. Different regions should only be used for Disaster Recovery in case a catastrophic event ever takes out an entire region. A region can be EU-West-2 for London, whereas an availability zone is a subdivision within a region; currently two availability zones are available for London.
- Deploy global, access local.
- For truly global applications with users from different time zones, we should have application servers in different regions access data that is closest to them using the right read preference configuration in each server.
Reading a book is great, reading this book is even greater, but continuous learning is the only way to keep up to date with MongoDB. These are the places you should go for updates and development/operational reference.
Online documentation available at: https://docs.mongodb.com/manual/ is the starting point for every developer, new or seasoned.
The JIRA tracker is a great place to take a look at fixed bugs and features coming up next: https://jira.mongodb.org/browse/SERVER/.
Some other great books on MongoDB are:
- MongoDB for Java developers,Â by Francesco Marchioni
- MongoDB Data Modeling,Â by Wilson da Rocha FranÃ§a
- Any book by Kristina Chodorow
The MongoDB user group (https://groups.google.com/forum/#!forum/mongodb-user) has a great archive of user questions about features, and long-standing bugs. It's a place to go when something doesn't work as expected.
Online forums (Stack Overflow, reddit, among others) are always a source of knowledge with the trap that something may have been posted a few years ago and may not apply anymore. Always check before trying.
And finally, MongoDB university is a great place to keep your skills up to date and learn about the latest features and additions: https://university.mongodb.com/.
In this chapter, we went on a journey through web, SQL, and NoSQL technologies from their inception to their current state. We identified how MongoDB has been shaping the world of NoSQL databases for the past years and how it is positioned against other SQL and NoSQL solutions.
We explored MongoDB's key characteristics and how MongoDB has been used in production deployments. We identified best practices for designing, deploying, and operating MongoDB.
Finally, we learned how to learn by going through documentation and online resources to stay up to date with latest features and developments.
In the next chapter, we will go deeper into schema design and data modeling and how to connect to MongoDB both using the official drivers and also using an Object Document Mapper (ODM), a variation of object-relational mappers for NoSQL databases.