About MongoDB

Amol Nayak

November 2014

In this article, Amol Nayak, the author of MongoDB Cookbook, describes various features of MongoDB.


MongoDB is a document-oriented database and the most popular NoSQL database. The ranking at http://db-engines.com/en/ranking shows that MongoDB sits at fifth rank overall as of August 2014 and is the first NoSQL product in the list. It is currently used in production by a huge list of companies across various domains, handling terabytes of data efficiently.

MongoDB is developed to scale horizontally and cope with increasing data volumes. It is very simple to use and get started with, is backed by good support from MongoDB, the company behind it, and has a vast array of open source and proprietary tools built around it to improve developer and administrator productivity.

In this article, we will cover the following recipes:

  • Single node installation of MongoDB with options from the config file
  • Viewing database stats
  • Creating an index and viewing plans of queries

Single node installation of MongoDB with options from the config file

Providing options from the command line works, but it starts getting awkward as the number of options increases. A nice and clean alternative is to provide the startup options from a configuration file rather than as command-line arguments.

Getting ready

The MongoDB binaries can be downloaded from http://www.mongodb.org/downloads after selecting your host operating system. We assume they have been downloaded and extracted, and that the bin directory of MongoDB is in the operating system's path variable (this is not mandatory, but it really becomes convenient after doing it).

How to do it…

The /data/mongo/db directory for the database and /logs/ for the logs should be created and present on your filesystem, with the appropriate permissions to write to it. Let's take a look at the steps in detail:

  1. Create a config file, which can have any arbitrary name. In our case, let's say we create the file at /conf/mongo.conf. We will then edit the file and add the following lines of code to it:
    port = 27000
    dbpath = /data/mongo/db
    logpath = /logs/mongo.log
    smallfiles = true
  2. Start the Mongo server using the following command:
    > mongod --config /conf/mongo.conf

How it works…

The properties are specified as <property name> = <value>. For all those properties that don't have values, for example, the smallfiles option, the value given is a Boolean value, true. If you need verbose output, you will add v = true (or multiple v's to make it more verbose) to the config file. If you already know what the command-line option is, it is pretty easy to guess the name of the property in the file: it is the same as the command-line option, with just the hyphens removed.
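As a quick sketch of this mapping, here is the config file from this recipe side by side with the equivalent command-line flags (same paths as above):

```conf
# Command-line form            # Config file form
# --port 27000                 port = 27000
# --dbpath /data/mongo/db      dbpath = /data/mongo/db
# --logpath /logs/mongo.log    logpath = /logs/mongo.log
# --smallfiles                 smallfiles = true
```

So the single command `mongod --port 27000 --dbpath /data/mongo/db --logpath /logs/mongo.log --smallfiles` starts the same server as `mongod --config /conf/mongo.conf` with the file shown earlier.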

Viewing database stats

In this recipe, we will see how to get the statistics of a database.

Getting ready

To find the stats of the database, we need to have a server up and running; a single node should be okay. The data on which we will be operating needs to be imported into the database. Once these steps are completed, we are all set to go ahead with this recipe.

How to do it…

We will be using the test database for the purpose of this recipe. It already has the postalCodes collection in it. Let's take a look at the steps in detail:

  1. Connect to the server using the Mongo shell by typing in the following command from the operating system terminal (it is assumed that the server is listening to port 27017):
    $ mongo
  2. On the shell, execute the following command and observe the output:
    > db.stats()
  3. Now, execute the following command, but this time with the scale parameter (observe the output):
    > db.stats(1024)
    {
      "db" : "test",
      "collections" : 3,
      "objects" : 39738,
      "avgObjSize" : 143.32699179626553,
      "dataSize" : 5562,
      "storageSize" : 16388,
      "numExtents" : 8,
      "indexes" : 2,
      "indexSize" : 2243,
      "fileSize" : 196608,
      "nsSizeMB" : 16,
      "dataFileVersion" : {
        "major" : 4,
        "minor" : 5
      },
      "ok" : 1
    }

How it works…

Let us start by looking at the collections field. If you look carefully at the number and also execute the show collections command on the Mongo shell, you will find one extra collection in the stats as compared to the command's output. The extra collection is a hidden one named system.namespaces. You may execute db.system.namespaces.find() to view its contents.

Getting back to the output of the stats operation on the database, the objects field in the result has an interesting value too. If we find the count of documents in the postalCodes collection, we see that it is 39732. The count shown here is 39738, which means there are six more documents. These six documents come from the system.namespaces and system.indexes collections. Executing a count query on these two collections will confirm it. Note that the test database doesn't contain any collection other than postalCodes. The figures will change if the database contains more collections with documents in them.

The scale parameter, which is a parameter to the stats function, divides the number of bytes by the given scale value. In this case, it is 1024, and hence, all the size values will be in KB. Let's analyze the output:

> db.stats(1024)
{
 "db" : "test",
 "collections" : 3,
 "objects" : 39738,
 "avgObjSize" : 143.32699179626553,
 "dataSize" : 5562,
 "storageSize" : 16388,
 "numExtents" : 8,
 "indexes" : 2,
 "indexSize" : 2243,
 "fileSize" : 196608,
 "nsSizeMB" : 16,
 "dataFileVersion" : {
   "major" : 4,
   "minor" : 5
 },
 "ok" : 1
}
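The effect of the scale parameter can be sketched in plain Python (scale_stats is a hypothetical helper for illustration, not a MongoDB API; the raw byte figures are back-calculated to be consistent with the scaled output above):

```python
def scale_stats(stats_bytes, scale=1024):
    """Divide the byte-valued fields of a stats document by `scale`,
    mimicking what db.stats(scale) does. Other fields pass through."""
    byte_fields = {"dataSize", "storageSize", "indexSize", "fileSize"}
    return {k: (v // scale if k in byte_fields else v)
            for k, v in stats_bytes.items()}

# Raw (unscaled) figures consistent with the scaled output shown above.
raw = {"db": "test", "objects": 39738, "dataSize": 5695488,
       "storageSize": 16781312, "indexSize": 2296832, "fileSize": 201326592}

scaled = scale_stats(raw, 1024)
print(scaled["dataSize"], scaled["storageSize"])  # 5562 16388
```

Note that fields such as objects are counts, not sizes, and are left untouched, just as in the real output.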

The following table shows the meaning of the important fields:

Field

Description

db

This is the name of the database whose stats are being viewed.

collections

This is the total number of collections in the database.

objects

This is the count of documents across all collections in the database. If we find the stats of a collection by executing db.<collection>.stats(), we get the count of documents in the collection. This attribute is the sum of counts of all the collections in the database.

avgObjSize

This is simply the size (in bytes) of all the objects in all the collections in the database, divided by the count of the documents across all the collections. This value is not affected by the scale provided even though this is a size field.

dataSize

This is the total size of the data held across all the collections in the database. This value is affected by the scale provided.

storageSize

This is the total amount of storage allocated to collections in this database for storing documents. This value is affected by the scale provided.

numExtents

This is the count of all the number of extents in the database across all the collections. This is basically the sum of numExtents in the collection stats for collections in this database.

indexes

This is the sum of number of indexes across all collections in the database.

indexSize

This is the size (in bytes) for all the indexes of all the collections in the database. This value is affected by the scale provided.

fileSize

This is simply the addition of the size of all the database files you should find on the filesystem for this database. The files will be named test.0, test.1, and so on for the test database. This value is affected by the scale provided.

nsSizeMB

This is the size of the file in MBs for the .ns file of the database.

Another thing to note is the value of avgObjSize, and there is something weird about it. Unlike the same field in a collection's stats, which is affected by the value of the scale provided, in the database stats this value is always in bytes. This is pretty confusing, and one cannot really be sure why it is not scaled according to the provided scale.

Creating an index and viewing plans of queries

In this recipe, we will look at querying data, analyzing its performance by explaining the query plan, and then optimizing it by creating indexes.

Getting ready

For the creation of indexes, we need to have a server up and running. A simple single node is what we will need. The data with which we will be operating needs to be imported in the database. Once we have this prerequisite, we are good to go.

How to do it…

We will try to write a query that finds all the zip codes in a given state. To do this, perform the following steps:

  1. Execute the following query to view the plan of a query:
    > db.postalCodes.find({state:'Maharashtra'}).explain()

    Take a note of the cursor, n, nscannedObjects, and millis fields in the result of the explain plan operation.

  2. Let's execute the same query again, but this time, we will limit the results to only 100 results:
    > db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()

    Again, take a note of the cursor, n, nscannedObjects, and millis fields in the result.

  3. We will now create an index on the state and pincode fields as follows:
    > db.postalCodes.ensureIndex({state:1, pincode:1})
  4. Execute the following query:
    > db.postalCodes.find({state:'Maharashtra'}).explain()

    Again, take a note of the cursor, n, nscannedObjects, millis, and indexOnly fields in the result.

  5. Since we want only the pin codes, we will modify the query as follows and view its plan:
    > db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()

    Take a note of the cursor, n, nscannedObjects, nscanned, millis, and indexOnly fields in the result.

How it works…

There is a lot to explain here. We will first discuss what we just did and how to analyze the stats. Next, we will discuss some points to be kept in mind for the index creation and some gotchas.

Analysis of the plan

Let's look at the first step and analyze the output we executed:

> db.postalCodes.find({state:'Maharashtra'}).explain()

The output on my machine is as follows (I am skipping the nonrelevant fields for now):

{
  "cursor" : "BasicCursor",
  "n" : 6446,
  "nscannedObjects" : 39732,
  "nscanned" : 39732,
  …
  "millis" : 55,
  …
}

The value of the cursor field in the result is BasicCursor, which means a full collection scan (all the documents are scanned one after another) happened to search for the matching documents in the entire collection. The value of n is 6446, which is the number of results that matched the query. The nscanned and nscannedObjects fields have a value of 39,732, which is the number of documents in the collection that were scanned to retrieve the results. This is also the total number of documents present in the collection, and all of them were scanned for the result. Finally, millis is the number of milliseconds taken to retrieve the result.

Improving the query execution time

So far, the query doesn't look too good in terms of performance, and there is great scope for improvement. To demonstrate how the limit applied to the query affects the query plan, we can find the query plan again without the index but with the limit clause:

> db.postalCodes.find({state:'Maharashtra'}).limit(100).explain()
 
{
  "cursor" : "BasicCursor",
  "n" : 100,
  "nscannedObjects" : 19951,
  "nscanned" : 19951,
  …
  "millis" : 30,
  …
}

The query plan this time around is interesting. Though we still haven't created an index, we see an improvement in the time the query took for execution and in the number of objects scanned to retrieve the results. This is due to the fact that Mongo does not scan the remaining documents once the number of documents specified in the limit function is reached. We can thus conclude that it is recommended that you use the limit function to restrict the number of results whenever the maximum number of documents needed is known upfront. This might give better query performance. The word "might" is important: in the absence of an index, the collection might still be scanned completely if the number of matches is never reached.
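The early-exit behavior described above can be sketched in plain Python (this is a simulation of the scan, not MongoDB internals; the document counts mirror the figures in this recipe):

```python
def collection_scan(docs, predicate, limit=None):
    """Simulate a BasicCursor scan: examine documents one by one and
    stop as soon as `limit` matches have been found."""
    results, nscanned = [], 0
    for doc in docs:
        nscanned += 1
        if predicate(doc):
            results.append(doc)
            if limit is not None and len(results) == limit:
                break  # remaining documents are never examined
    return results, nscanned

# 39732 documents, with roughly every sixth one matching.
docs = [{"state": "Maharashtra" if i % 6 == 0 else "Other"}
        for i in range(39732)]

full, scanned_full = collection_scan(docs, lambda d: d["state"] == "Maharashtra")
top, scanned_top = collection_scan(docs, lambda d: d["state"] == "Maharashtra",
                                   limit=100)
print(len(full), scanned_full)  # every document is scanned to find all matches
print(len(top), scanned_top)    # the scan stops early once 100 matches are found
```

If the predicate never produced 100 matches, the limited scan would still walk the whole list, which is exactly why "might" matters in the paragraph above.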

Improvement using indexes

Moving on, we will create a compound index on state and pincode. The order of the index is ascending in this case (as the value is 1) and is not significant unless we plan to execute a multikey sort. This is a deciding factor as to whether the result can be sorted using only the index or whether Mongo needs to sort it in memory later on, before we return the results. As far as the plan of the query is concerned, we can see that there is a significant improvement:

{
  "cursor" : "BtreeCursor state_1_pincode_1",
  …
  "n" : 6446,
  "nscannedObjects" : 6446,
  "nscanned" : 6446,
  …
  "indexOnly" : false,
  …
  "millis" : 16,
  …
}

The cursor field now has the value BtreeCursor state_1_pincode_1, which shows that the index is indeed used now. As expected, the number of results stays the same at 6446. The number of entries scanned in the index and documents scanned in the collection have now come down to the same number of documents as in the result. This is because we now used an index that gave us the starting document from which we could scan, and then only the required number of documents were scanned. This is similar to using a book's index to find a word rather than scanning the entire book. The time, millis, has come down too, as expected.

Improvement using covered indexes

This leaves us with one field, indexOnly, and we will see what this means. To know what this value is, we need to look briefly at how indexes operate.

Indexes store a subset of fields of the original document in the collection. The fields present in the index are the same as those on which the index is created. The fields, however, are kept sorted in the index in an order specified during the creation of the index. Apart from the fields, there is an additional value stored in the index; this acts as a pointer to the original document in the collection. Thus, whenever the user executes a query, if the query contains fields on which an index is present, the index is consulted to get a set of matches. The pointer stored with the index entries that match the query is then used to make another IO operation to fetch the complete document from the collection; this document is then returned to the user.
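A toy model of this pointer-chasing behavior can be written in a few lines of Python (ToyIndex is purely illustrative; MongoDB's B-tree implementation is far more sophisticated):

```python
import bisect

class ToyIndex:
    """A sorted (key, pointer) index over a list-backed 'collection'."""

    def __init__(self, collection, field):
        self.collection = collection
        # Each index entry holds the indexed field's value plus a
        # pointer (here, the list position) back to the full document.
        self.entries = sorted((doc[field], i)
                              for i, doc in enumerate(collection))

    def find(self, value):
        """Consult the index, then follow each pointer to fetch the
        full document - the additional IO step described above."""
        lo = bisect.bisect_left(self.entries, (value,))
        hi = bisect.bisect_right(self.entries, (value, float("inf")))
        return [self.collection[ptr] for _, ptr in self.entries[lo:hi]]

coll = [{"state": "Goa", "pincode": 403001},
        {"state": "Maharashtra", "pincode": 400001},
        {"state": "Maharashtra", "pincode": 411001}]
idx = ToyIndex(coll, "state")
print(idx.find("Maharashtra"))  # both Maharashtra documents, via pointers
```

Because the entries are kept sorted, the matching range is located without touching non-matching documents; only the final pointer dereference reads the collection itself.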

The value of indexOnly, which is false, indicates that the data requested by the user in the query is not entirely present in the index; an additional IO operation is needed to retrieve the entire document from the collection by following the pointer from the index. Had the value been present in the index itself, the additional operation to retrieve the document from the collection would not be necessary, and the data from the index would be returned. This is called a covered index, and the value of indexOnly, in this case, will be true.

In our case, we just need the pin codes, so why not use projection in our queries to retrieve just what we need? This will also make the query covered by the index: the index entry has just the state's name and the pin code, so the required data can be served completely without retrieving the original document from the collection. The plan of the query in this case is interesting too. Executing the following query results in the following plan:

> db.postalCodes.find({state:'Maharashtra'}, {pincode:1, _id:0}).explain()
{
  "cursor" : "BtreeCursor state_1_pincode_1",
  …
  "n" : 6446,
  "nscannedObjects" : 0,
  "nscanned" : 6446,
  …
  "indexOnly" : true,
  …
  "millis" : 15,
  …
}

The values of the nscannedObjects and indexOnly fields are something to be observed. As expected, since the data we requested in the projection in the find query is the pin code only, which can be served from the index alone, the value of indexOnly is true. In this case, we scanned 6,446 entries in the index, and thus, the nscanned value is 6446. We, however, didn't reach out to any document in the collection on the disk, as this query was covered by the index alone, and no additional IO was needed to retrieve the entire document. Hence, the value of nscannedObjects is 0.

As this collection in our case is small, we do not see a significant difference in the execution time of the query. This will be more evident on larger collections. Making use of indexes is great and gives good performance. Making use of covered indexes gives even better performance.

Another thing to remember is that wherever possible, try to use projection to retrieve only the fields we need. The _id field is retrieved every time by default; unless we plan to use it, set _id:0 to not retrieve it, since it is not part of the index. Executing a covered query is the most efficient way to query a collection.

Some gotchas of index creations

We will now see some pitfalls in index creation and some facts where the array field is used in the index.

Some of the operators that do not use the index efficiently are the $where, $nin, and $exists operators. Whenever these operators are used in a query, one should bear in mind a possible performance bottleneck as the data size increases. Similarly, the $in operator should be preferred over the $or operator, as both can more or less be used to achieve the same result. As an exercise, try to find the pin codes in the states of Maharashtra and Gujarat from the postalCodes collection. Write two queries: one using the $or operator and the other using the $in operator. Explain the plan for both these queries.
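As a sketch of the exercise, the two equivalent query documents would look like the following (plain Python dicts mirroring the mongo shell syntax, with a minimal evaluator for just these two operators, written only to show that they select the same documents):

```python
# Using $in: a single field matched against a list of values.
query_in = {"state": {"$in": ["Maharashtra", "Gujarat"]}}

# Using $or: a list of clauses, one per state.
query_or = {"$or": [{"state": "Maharashtra"}, {"state": "Gujarat"}]}

def matches(query, doc):
    """Minimal evaluator for $in and $or only (illustration, not MongoDB)."""
    if "$or" in query:
        return any(matches(clause, doc) for clause in query["$or"])
    for field, cond in query.items():
        if isinstance(cond, dict) and "$in" in cond:
            if doc.get(field) not in cond["$in"]:
                return False
        elif doc.get(field) != cond:
            return False
    return True

doc = {"state": "Gujarat", "pincode": 380001}
print(matches(query_in, doc), matches(query_or, doc))  # True True
```

Both queries select the same documents; the interesting part of the exercise is comparing the explain() output of the two forms against the state_1_pincode_1 index.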

What happens when an array field is used in the index? Mongo creates an index entry for each element present in the array field of a document. So, if there are 10 elements in an array in a document, there will be 10 index entries, one for each element in the array. However, there is a constraint while creating indexes that contain array fields. When creating indexes using multiple fields, not more than one field can be of the array type. This is done to prevent a possible explosion in the number of index entries on adding even a single element to an array used in the index. If we think about it carefully, an index entry is created for each element in the array; if multiple fields of the array type were allowed to be part of an index, we would have a large number of entries in the index, which would be a product of the lengths of these array fields. For example, a document with two array fields, each of length 10, would add 100 entries to the index, had it been allowed to create one index using these two array fields.
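The explosion described above is simple arithmetic. A sketch with a hypothetical document (the field names are made up for illustration):

```python
# A hypothetical document with two array fields of 10 elements each.
doc = {"tags": ["t%d" % i for i in range(10)],
       "codes": ["c%d" % i for i in range(10)]}

# A multikey index on ONE array field creates one entry per element:
entries_single = len(doc["tags"])

# A compound index on BOTH array fields, had it been allowed, would need
# one entry per (tags, codes) combination - the product of the lengths:
entries_compound = len(doc["tags"]) * len(doc["codes"])

print(entries_single, entries_compound)  # 10 100
```

Appending one more element to each array would grow the single-field index by 1 entry but the hypothetical compound index by 21 entries (11 × 11 − 100), which is why the constraint exists.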

This should be good enough for now to scratch the surface of plain vanilla indexes.

Summary

This article provides detailed recipes that describe how to use the different features of MongoDB.

MongoDB is a document-oriented, leading NoSQL database, which offers linear scalability, thus making it a good contender for high-volume, high-performance systems across all business domains. It has an edge over the majority of NoSQL solutions for its ease of use, high performance, and rich features.

In this article, we learned how to perform a single node installation of MongoDB with options from the config file. We also learned how to view database stats, create an index from the shell, and view query plans.

You've been reading an excerpt of:

MongoDB Cookbook
