Until recently, the most common answer to the question "Where do I store my application information?" was, obviously, "in a relational database." The answer to this simple yet meaningful question is no longer so straightforward.
NoSQL databases are becoming more and more popular and DocumentDB is one of them. In August 2014, Scott Guthrie officially announced the first preview version of DocumentDB. DocumentDB is a NoSQL database service offered by Microsoft. It is delivered as a managed service on Azure. This means that we no longer have to manage any infrastructure; we can just take it from the tap and pay per use. DocumentDB is a schema-free store, which means that we can store any kind of JSON document inside the store and work with the data as we used to in traditional SQL databases.
In this chapter, we will do the following:
Learn what DocumentDB is all about
Look at the data model
Make a comparison with other non-SQL technologies
Learn about the pricing model
Build a console application that connects to a database
This book is aimed at architects, developers, database administrators, and IT professionals who want to learn and understand the breadth of DocumentDB.
So what is DocumentDB? The short answer is that DocumentDB is a managed JSON document database service. But what is the impact on our programming paradigms? How can we use it? Why should we use it? Can it really make our life easier? The answers to these kinds of questions are a bit more involved and need additional clarification.
This section describes the fundamentals of DocumentDB and can help you decide whether or not it will be a good fit for your solution.
Microsoft built DocumentDB from the ground up because the feedback they got from customers was that they "…need a database that can keep pace with their rapidly evolving applications…." Schema-free databases are increasingly popular, but running them on-premises can be expensive and difficult to scale. Combining this with the continued need for rich querying and transactions, Microsoft decided to build DocumentDB.
This brings us to the longer version of our answer, which is that DocumentDB is "…a massively scalable, schema-free database with rich query and transaction processing using the most ubiquitous programming language, JavaScript, data model (JSON), and transport protocol (HTTP)…" (http://blogs.msdn.com/b/documentdb/archive/2014/08/22/introducing-azure-documentdb-microsoft-s-fully-managed-nosql-document-database-service.aspx).
As stated before, NoSQL databases are gaining popularity and, for a growing number of workloads, are replacing traditional relational databases. The main characteristics of a NoSQL database are listed next:
Schema-less, with the ability to store everything
Non-relational
Extremely scalable
Note
Besides document databases such as DocumentDB, there are other kinds of NoSQL databases available, such as graph and key-value databases. We will make a comparison later in this chapter.
Having no schema (or predefined structure like tables and columns) allows us to store everything. This also includes attachments, user-defined functions, stored procedures, triggers, and more. The only restriction is that the information has to be in valid JSON.
The SQL dialect used to query DocumentDB is rooted in JavaScript's type system and expression evaluation. Having JavaScript at the core means that we do not need to learn many new techniques or languages, and our current knowledge of JavaScript can be applied immediately. Using JavaScript is a natural way of working with JSON. JSON parsers are perfectly capable of converting query results into variables, manipulating them, and writing them back to the database. Besides working as a client with JavaScript, the internals are also based on JavaScript. The following entities are written in JavaScript as well:
Stored procedures (SPs): These are executed by issuing an HTTP POST request. Inside the SP, the elements of the designated document(s) are copied to ordinary JavaScript variables. The logic inside the SP then manipulates the data and when the SP finishes, the values are persisted in the document(s) again.
User-defined functions (UDFs): The difference between UDFs and SP is that UDFs do not manipulate databases or documents themselves. A UDF encapsulates logic or business rules that can be called from SP or queries and can help extend the query language. A good example of a UDF is a function called calculateAge() that takes the date of birth of a person and returns their age as a value. The calculateAge() function can be used from a query returning only those persons that are older than 40 years. The query is as follows: SELECT * from people p where calculateAge(p.dob) > 40
Triggers: A trigger is a piece of JavaScript code (comparable to UDFs and SPs) that is only invoked in response to some event that happens inside your database. A document being created or deleted could result in a trigger being executed. Triggers can be executed before or after the actual event happens. When a trigger fails or raises an exception, the actual operation is aborted and the transaction is not committed but rolled back. This is useful when we need to validate the incoming data to keep our documents consistent.
We will provide extensive examples of SPs, user-defined functions and triggers later in this book.
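As a small preview, the calculateAge() UDF mentioned above could look roughly like the following. This is a minimal sketch, not the implementation used later in the book: the MM-DD-YYYY date format and the optional today parameter (added only to make the function easy to try out) are assumptions. Depending on the service version, a UDF may also need to be referenced with a udf. prefix inside a query.

```javascript
// Hypothetical UDF body. The name calculateAge comes from the text;
// the date format ("MM-DD-YYYY") and the `today` parameter are assumptions.
function calculateAge(dob, today) {
  // `today` is optional; when omitted, the current date is used.
  var now = today ? new Date(today) : new Date();
  var parts = dob.split("-").map(Number); // [month, day, year]
  var month = parts[0];
  var day = parts[1];
  var year = parts[2];

  var age = now.getFullYear() - year;
  // Subtract one year if this year's birthday has not happened yet.
  var currentMonth = now.getMonth() + 1;
  var hadBirthday =
    currentMonth > month || (currentMonth === month && now.getDate() >= day);
  return hadBirthday ? age : age - 1;
}
```

Because a UDF is a plain JavaScript function without side effects, it can be exercised locally before it is registered on a collection.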
In traditional relational databases, the DBA or developer needs to choose the (clustered) indexes. Choosing the right indexing strategy is vital for the performance and consistency of the database.
In DocumentDB, we do not need to choose the index ourselves. In fact, all information inside a document is indexed by default. This means that we can query on any attribute that is available inside the document. We can choose different indexing policies, but for most applications the default indexing policy offers the best trade-off between performance and storage efficiency. We can reduce storage space by excluding certain paths within the document from indexing.
The indexing process inside DocumentDB treats the documents as trees. There needs to be a top node that is the entry point for all the fields inside the document. Imagine a document containing information about a person in the following JSON representation:
{
  "firstname": "John",
  "lastname": "Doe",
  "dob": "01-01-1960",
  "hobbies": [
    { "type": "sports", "description": "soccer" },
    {
      "type": "reading",
      "preferences": [
        { "type": "scifi" },
        { "type": "thriller" }
      ]
    }
  ]
}
This JSON snippet describes a person, John Doe, who was born on January 1, 1960, and has two hobbies, sports and reading. His reading hobby focuses on the sci-fi and thriller genres.
A JSON document can be depicted like this:

The blue squares are nodes that are implicitly added by the system and do not influence our data model. The figure shows that documents are internally represented as trees. As you can see, the nodes that describe a hobby do not necessarily have to share the same schema. Go ahead and try to build this model in a traditional relational database system!
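To make the tree view concrete, the following sketch walks the person document and lists the path from the root to every leaf value. It only illustrates the idea of treating a document as a tree; it is not how DocumentDB implements its index, and the function name listPaths is hypothetical.

```javascript
// Illustrative only: enumerate root-to-leaf paths of a JSON document.
function listPaths(node, prefix = "") {
  // Leaves (strings, numbers, booleans, null) end a path.
  if (node === null || typeof node !== "object") {
    return [prefix];
  }
  // Objects branch on property names, arrays on element positions.
  return Object.keys(node).flatMap((key) =>
    listPaths(node[key], prefix + "/" + key)
  );
}

const person = {
  firstname: "John",
  lastname: "Doe",
  dob: "01-01-1960",
  hobbies: [
    { type: "sports", description: "soccer" },
    { type: "reading", preferences: [{ type: "scifi" }, { type: "thriller" }] },
  ],
};

// Prints one path per line, for example /hobbies/1/preferences/0/type
console.log(listPaths(person).join("\n"));
```

Note that the two hobby nodes produce different sets of paths, which is exactly the schema flexibility described above.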
Microsoft offers DocumentDB as part of their online offerings on the Microsoft Azure platform. Their as-a-service approach enables developers to start using new technologies immediately.
The performance of our DocumentDB system is governed by a performance level. Performance levels are set on a collection and not on a database. This enables fine-tuning of your environment, giving the appropriate performance boost to the right resources. The performance level determines how many so-called request units a collection can consume per second. A request unit is a measure of the resources (CPU, memory, and IO) needed to perform a certain operation.
There are three performance levels:
S1: Allows up to 250 request units per second
S2: Allows up to 1,000 request units per second
S3: Allows up to 2,500 request units per second
We need to choose the performance level carefully, since it comes with a price impact. We will discuss the pricing of DocumentDB later in this chapter.
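The three levels listed above can be turned into a simple lookup that picks the smallest level covering an expected load. This is a sketch under the assumption that you already have an estimate of the request units per second you need; the names used here are hypothetical.

```javascript
// Request unit limits per performance level, as listed in the text.
const PERFORMANCE_LEVELS = [
  { name: "S1", requestUnitsPerSecond: 250 },
  { name: "S2", requestUnitsPerSecond: 1000 },
  { name: "S3", requestUnitsPerSecond: 2500 },
];

// Returns the name of the smallest level that covers the required load,
// or null when even the highest level is not sufficient.
function smallestLevelFor(requiredRuPerSecond) {
  const match = PERFORMANCE_LEVELS.find(
    (level) => level.requestUnitsPerSecond >= requiredRuPerSecond
  );
  return match ? match.name : null;
}
```

A null result signals that a single collection cannot serve the load and that the data would have to be spread over several collections.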
DocumentDB also supports transactions providing Atomicity, Consistency, Isolation, Durability (ACID) guarantees. Atomicity enables all operations to be executed as a single piece of work, all being committed at once or not at all. Consistency implies that all data is in the right state across transactions. Isolation makes sure that transactions do not interfere with each other, and durability ensures that all changes that are committed to the database will always be available.
Since JavaScript is executing under snapshot isolation belonging to the collection, SPs and triggers are executed within the same scope, enabling ACID for all operations inside SPs and triggers. If an error occurs in the JavaScript logic, the transaction is automatically rolled back.
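The rollback behavior can be sketched with a stored procedure that creates two documents as one unit of work. The function name createBoth is hypothetical; getContext(), getCollection(), getSelfLink(), createDocument(), and getResponse().setBody() are part of the server-side JavaScript API, so this function is meant to run inside DocumentDB rather than in a local script.

```javascript
// Hypothetical stored procedure: both documents are created, or neither is.
function createBoth(first, second) {
  var context = getContext();
  var collection = context.getCollection();
  var link = collection.getSelfLink();

  var accepted = collection.createDocument(link, first, function (err) {
    if (err) throw err; // throwing aborts the whole transaction

    var acceptedSecond = collection.createDocument(link, second, function (err2) {
      if (err2) throw err2; // the first document is rolled back as well
      context.getResponse().setBody("created 2 documents");
    });
    if (!acceptedSecond) throw new Error("Second create was not accepted");
  });
  if (!accepted) throw new Error("First create was not accepted");
}
```

The important design point is that errors are rethrown instead of swallowed: an uncaught exception is what causes the transaction to roll back.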
Now that we have seen a little of DocumentDB, how can we decide whether DocumentDB is applicable for our own problem scenario? In which scenarios is it a good fit and are there any trade-offs?
A good example of a problem domain in which DocumentDB fits is the domain of the Internet of Things (IoT). The IoT is all about ingesting, egressing, processing, and storing data (visit https://en.wikipedia.org/wiki/Internet_of_Things). It involves data flowing to and from devices, backend services processing that data or controlling devices, storage services persisting that data, or running statistical analysis or analytics on that data. Because DocumentDB can connect to HDInsight (http://azure.microsoft.com/en-us/services/hdinsight/) and Hadoop, the data can be analyzed easily.
Another good area in the IoT domain is device registration. Each and every device in the field is described inside a single document and stored in DocumentDB. These documents contain information for the device to be able to play the game of IoT, having keys and endpoints to communicate with and enable ingress and egress dataflows.
Throughout this book, we will use the IoT domain as our main example. Examples and code snippets will focus on this area because it showcases the possibilities of DocumentDB well.
Storing user profile information inside DocumentDB can be really helpful when it comes to personalized user interfaces or other preferences that can influence an application's behavior or user interface settings.
Note
JavaScript can easily interpret JSON data and is therefore an excellent candidate for describing the markup of a personalized user interface. Extending this thought, the schema-free approach of DocumentDB also makes it an excellent candidate for a content management system (CMS).
Every user is reflected in a single document that describes all user preferences. The list of preferences can be easily extended by adding information to the document. Consider that users authenticate at an authentication service, for example, Azure Active Directory, Facebook, or Twitter, and that these services return a claim set, including a unique identifier called nameidentifier. This field is an excellent candidate for providing the unique entry point in our DocumentDB system and retrieving the user's profile information after logging in.
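A lookup by nameidentifier can be expressed as a parameterized query. The sketch below only builds the query specification object; it assumes that each profile document stores the nameidentifier claim in its id property, and the function name is hypothetical.

```javascript
// Builds a parameterized query for a user's profile document.
// Assumption: the profile's `id` property holds the nameidentifier claim.
function profileQueryFor(nameIdentifier) {
  return {
    query: "SELECT * FROM profiles p WHERE p.id = @id",
    parameters: [{ name: "@id", value: nameIdentifier }],
  };
}
```

Passing the claim value as a parameter, rather than concatenating it into the query text, keeps values that originate from an external identity provider out of the query string itself.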
A well-designed system usually emits logging information in large quantities, and that information comes in different types. Each log entry is straightforward and describes a specific event, for example, a user logging in to the system, an exception raised by the system, or an audit trail record that needs to be persisted.
Because DocumentDB automatically indexes all documents, querying data and finding fault causes can be very quick. You can take DocumentDB information offline and store it in a datacenter for further analysis with tools like Hadoop or Power BI.
Building and releasing mobile solutions is tough because we might have millions of customers. Using a schema-free database, it is easier to release new apps with additional data while still being able to service your old versions as well. Remember the troubles we had releasing a new schema of our SQL Server or Oracle database? Adding new tables and columns because of new features, and writing conversion scripts for every new release of the system?
By using a JSON document, we can easily add or remove information, release at a faster pace, and enable development in sprints—changing the data each sprint without the pain of conversion scripts.
Of course, the powerful scaling of DocumentDB is also a great help when building global, mobile apps servicing millions of users!
The internals of DocumentDB can themselves be described in a JSON document. The following figure displays a hierarchy of DocumentDB and its entities. This is called the DocumentDB resource model and all the entities are called resources.

The main entry point is obviously a DocumentDB account. You need to have an account before you can start working with it. An account can contain databases and, as part of a preview feature, an amount of blob storage for attachments.
All resources are accessible through a logical URI. For example, the database with the name persons can be addressed through the logical URI /dbs/persons, and the document describing the person John Doe, which has an ID of 12345, can be addressed through the logical URI /dbs/persons/colls/<collectionid>/docs/12345.
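Logical URIs of this kind can be composed with a few small helpers. This is a sketch with hypothetical function names; it follows the service's ID-based addressing scheme, in which collections sit under a colls segment and documents under a docs segment.

```javascript
// Hypothetical helpers that compose ID-based logical URIs.
function databaseLink(databaseId) {
  return "/dbs/" + encodeURIComponent(databaseId);
}

function collectionLink(databaseId, collectionId) {
  return databaseLink(databaseId) + "/colls/" + encodeURIComponent(collectionId);
}

function documentLink(databaseId, collectionId, documentId) {
  return (
    collectionLink(databaseId, collectionId) +
    "/docs/" +
    encodeURIComponent(documentId)
  );
}

// The collection name "people" is an assumption made for this example.
console.log(documentLink("persons", "people", "12345"));
```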
A database is a container where documents are stored inside collections. A database also acts as a container for users. Each user has a set of permissions, and the permissions to access collections, UDFs, triggers, and SPs are set on a database level. We can grant access to users in a fine-grained manner for accessing collections and documents.
A database is scaled elastically and does not need any interference from the account owner. It can scale from megabytes to petabytes. The data is stored on an SSD disk, providing fast access to your documents.
Databases are spanned implicitly across different underlying machines in order to provide the level of scaling we need.
A user stored in DocumentDB is an abstraction of the concept user. A user is not a person logging in, but a set of permissions. Eventually, a DocumentDB user can be mapped to an Active Directory user or some third-party identity management system.
A simple, straightforward way of designing the user model is to create exactly one user per tenant or customer. That user only has permission to access collections and documents inside one database. This is the database belonging to the designated tenant and/or customer. This is a flat user model, but it is also possible to create user identities corresponding to actual users representing certain personas. This can give you more fine-grained control but will also increase the burden of user administration.
Users can be managed through the Azure portal (portal.azure.com) and also via the rich REST API or client SDKs that are provided by Microsoft.
Implicitly, DocumentDB contains two types of roles: an administrator and a user. The administrator is the one that has the permission to manage and manipulate database accounts, databases, users, and permissions. These are considered the administrative resources, analogous to the metadata describing the full DocumentDB ecosystem. The administrator is provided with a master key. The master key is part of the DocumentDB account and is provided to the one setting up the account.
A user, on the other hand, is the person who manipulates actual data inside collections and documents, or changes UDFs (application resources). A user gets a resource key that provides access to specific application resources like databases and collections.
A collection can be described as a container for all the documents, but it is also a unit of scale. Adding a collection results in some SSD storage being allocated for that particular collection. As we saw before, setting the performance level is done on a collection level. Inside your database, you can have multiple collections, each having its own performance level (S1, S2, or S3). For example, we could have a UserProfile collection containing documents with specific profile information like addresses, images, UI preferences, and so on. This collection is queried once a user logs in to your system and the profile information is retrieved from the UserProfile collection. Imagine we have another collection containing all the products that can be ordered. This collection will, hopefully, be accessed more frequently, and therefore we can set an S3 level on this collection to provide the best performance for our potential buyers.
Collections grow and shrink implicitly when documents are added or removed. There is no need to allocate space or do other preconfiguration steps.
This section compares DocumentDB with other (non-)SQL technologies. The comparison is made with MongoDB and Azure Table storage.
Table storage is a non-SQL tabular based storage mechanism enabling you to store rows and columns inside a table. A table is not fixed, meaning that different rows can have different columns. Azure Table storage is a perfect fit for storing large amounts of data, although it is non-relational. There are no mechanisms like foreign keys, triggers, or user-defined functions.
MongoDB is also a document database (NoSQL), which means that it is schema-free, enables high performance and high availability, and has the ability to scale. MongoDB is open source, and is built around documents and collections. Documents are composed of sets of key-value pairs, and collections contain documents. Unlike DocumentDB, MongoDB stores documents as BSON instead of JSON.
The following table provides a high-level comparison on some key features:
Feature | DocumentDB | MongoDB | Table storage |
---|---|---|---|
Model | Document | Document | Rows and columns |
Database schema | Schema-free | Schema-free | Schema-free |
Triggers | Yes | No | No |
Server side scripts | Yes, JavaScript | Yes, JavaScript | No |
Foreign keys | *N/A | N/A | N/A |
Indexing | Potentially on every property | Potentially on every property | Partition key and row key only |
Transactions | Yes, supports ACID | No | Limited, using batching |
Hosting | On Microsoft Azure only, offered as a service | Can be on-premise or on a virtual machine, not offered as a service | On Microsoft Azure, offered as a service |
DocumentDB does not offer referential integrity by design. There is no concept of foreign keys. Integrity can be enforced by using triggers and SPs.
The role of the database administrator (DBA) is still needed to manage DocumentDB. We still need someone to oversee our databases and collections. Some common tasks a DBA for a document database might perform are as follows:
This section provides a brief overview of how your bill is influenced by the way you use DocumentDB. There are a few factors that determine the pricing:
Having a DocumentDB account
Number of collections inside a database
Performance level
Capacity units
When you set up a DocumentDB account, you will be billed immediately. An empty account with no databases and hence no collections will be charged for a single S1 collection, at around $25 per month. The reason that you are charged while not having any collections or documents is that Microsoft needs to reserve a DNS and authorization scope for the account.
Collections are billed by the hour. Having a collection for only 10 minutes will still incur charges for a whole hour. Every tier includes 10 GB of storage.
The following table defines the standard characteristics per performance level:
Performance level | SSD storage | Request units | Price per hour |
---|---|---|---|
S1 | 10 GB | 250 per second | $0.034 |
S2 | 10 GB | 1,000 per second | $0.067 |
S3 | 10 GB | 2,500 per second | $0.134 |
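The hourly prices translate into monthly amounts with simple arithmetic. The sketch below assumes a 31-day month (744 hours) and uses hypothetical names; it shows that one S1 collection comes to roughly $25 per month, which matches the figure quoted for an empty account.

```javascript
// Hourly prices in USD per performance level, taken from the table above.
const HOURLY_PRICE_USD = { S1: 0.034, S2: 0.067, S3: 0.134 };

// Estimates the cost of keeping one collection for a number of hours.
// The default of 31 * 24 = 744 hours is an assumption (a 31-day month).
function estimateMonthlyCost(level, hours = 31 * 24) {
  return HOURLY_PRICE_USD[level] * hours;
}

console.log(estimateMonthlyCost("S1").toFixed(2)); // prints 25.30
```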
Request units are calculated based on the amount of resources that are needed for the operation performed. When more CPU, IO, and memory are needed for a certain operation, more request units are charged. The number of request units needed for each operation is returned in the response's header (x-ms-request-charge). By reading this value, you can keep track of the usage. If you exceed the number of request units, additional operations will be throttled.
To have fine-grained control over the performance of your collections, you could do the housekeeping of consumed Request Units (RUs) yourself and check if you often exceed the maximum number of RUs. If so, upgrading the performance level for your collection might be a good idea.
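Such housekeeping could be sketched as a small tracker that adds up the x-ms-request-charge values per second and counts how often the limit of the performance level was exceeded. The tracker and its method names are hypothetical, and the response headers are assumed to be available as a plain object.

```javascript
// Hypothetical RU bookkeeping based on the x-ms-request-charge header.
function createRuTracker(limitPerSecond) {
  const consumedPerSecond = new Map();

  return {
    // Call once per response, passing its headers and the time it arrived.
    record(headers, timestampMs) {
      const charge = parseFloat(headers["x-ms-request-charge"]);
      const second = Math.floor(timestampMs / 1000);
      consumedPerSecond.set(second, (consumedPerSecond.get(second) || 0) + charge);
    },
    // Number of one-second windows in which the limit was exceeded.
    secondsOverLimit() {
      let count = 0;
      for (const total of consumedPerSecond.values()) {
        if (total > limitPerSecond) count += 1;
      }
      return count;
    },
  };
}
```

A tracker created with createRuTracker(250) mirrors an S1 collection; a frequently growing secondsOverLimit() value suggests that a higher performance level is worth considering.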
Note
It is possible to upgrade and downgrade a collection between S1, S2, and S3, but each hour is charged at the highest tier used within that hour. Switching from S1 to S3 within 1 hour will therefore be billed at $0.134.
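The highest-tier rule can be sketched as follows. The input format (one entry per performance level that was active during a given hour) and the function name are assumptions made for this example; the prices are the hourly prices quoted in this chapter.

```javascript
// Hourly prices in USD per performance level, as quoted in this chapter.
const TIER_PRICE_USD = { S1: 0.034, S2: 0.067, S3: 0.134 };

// usage: array of { hour: <integer hour index>, level: "S1" | "S2" | "S3" }.
// Every started hour is billed once, at the highest tier used in that hour.
function billCollection(usage) {
  const highestPricePerHour = new Map();
  for (const { hour, level } of usage) {
    const previous = highestPricePerHour.get(hour) || 0;
    highestPricePerHour.set(hour, Math.max(previous, TIER_PRICE_USD[level]));
  }
  let total = 0;
  for (const price of highestPricePerHour.values()) {
    total += price;
  }
  return total;
}
```

Calling billCollection with an S1 entry and an S3 entry for the same hour yields the S3 price only, which is the behavior described in the note.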
The number of RUs needed for an operation depends on the following factors:
Size of the document: Larger documents increase the consumption of RUs.
Number of properties: More properties increase the consumption of RUs.
Data consistency: The consistency level you use (we will dive into this concept later on) influences the number of RUs consumed.
Indexes: When more properties are indexed, more RUs are needed. It is good practice to investigate which indexes you actually need for your scenario. Also, documents are indexed by default; turning this feature off will save RUs.
SPs and triggers: These also consume RUs based on the metrics mentioned previously.
By default, a collection is provisioned with 10 GB of storage. Documents consume storage space, but indexes also fill up the space of a collection. If you need more storage space, you need to create a different collection.
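How many collections a given amount of data requires follows directly from the 10 GB per collection figure. A minimal sketch, with a hypothetical function name, where the size is the combined size of documents and indexes in GB:

```javascript
// Storage provisioned per collection, as stated in the text.
const GB_PER_COLLECTION = 10;

// Minimum number of collections needed for documents plus indexes.
function collectionsNeeded(totalGb) {
  return Math.max(1, Math.ceil(totalGb / GB_PER_COLLECTION));
}

console.log(collectionsNeeded(35)); // prints 4
```

Keep in mind that every additional collection is billed separately at its own performance level.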
Microsoft offers the ability to contact Azure support from the Azure portal (portal.azure.com). If you need specific adjustments that you cannot manage from the portal or that are not supported by default, contact Azure support.

Click on the New Support button and follow the wizard that shows up. Make sure you choose the Quotas request type and enter your request details. The following table shows the limits that can be stretched by Azure support:
Quota | Default limit |
---|---|
Database accounts | 5 |
Number of SPs, triggers and UDFs per collection | 25 each |
Maximum collections per database account | 100 |
Maximum document storage per database (100 collections) | 1 TB |
Maximum number of UDFs per query | 1 |
Maximum number of JOINs per query | 2 |
Maximum number of AND clauses per query | 5 |
Maximum number of OR clauses per query | 5 |
Maximum number of values per IN expression | 100 |
Maximum number of collections created per minute | 5 |
This section provides a step-by-step approach to building a console application in Visual Studio 2015 that uses the basics of DocumentDB. We will perform the following steps:
Create a DocumentDB account.
Create a database.
Create a collection.
Build a console application that connects to DocumentDB and saves a document.
To create a DocumentDB account, you need to go to the Microsoft Azure portal. If you don't have a Microsoft Azure account yet, you can get a trial version at https://azure.microsoft.com/en-us/pricing/free-trial/.
After logging in to the Azure portal, go ahead and create your first DocumentDB account. For now, you only need to come up with a name.

After clicking on the Create button, your DocumentDB account will be provisioned. This process might take some time to finish.
After provisioning, your account is ready for use. Select the account you have just created and you will get an overview.

On the overview blade of your account, you will see a lot of information. For now, the most important information is located in the settings blade on the right-hand side. From this blade, you can retrieve keys and a connection string. We need this information if we want to start building the console application. Select the DocumentDB option, copy the URI, and copy the primary key.
In order to be able to create collections, we need a database first. Creating a database is straightforward as it only needs a name as input. Click the Add Database button and enter a meaningful name. After selecting OK, your database is provisioned. On the left blade you can scroll down and locate your new database.

As we have seen before, a collection is created inside a database. Selecting your database gives you the ability to add a collection. When the Add Collection option is selected, you need to pick the right performance level (or tier). For this demo, the S1 tier is more than sufficient.

Now that we have our DocumentDB account, a database, and a collection, we can start building our first application.
This sample is built using Visual Studio 2015. If you do not have Visual Studio 2015, you can download the free version Visual Studio 2015 Community from https://www.visualstudio.com/en-us/products/visual-studio-express-vs.aspx.
Here are the steps for creating a Visual Studio solution containing a console application that will demonstrate the basic usage of DocumentDB:
Start Visual Studio.
Go to File | New Project and then click on the Console Application template.
Name your project MyFirstDocDbApp. Visual Studio now creates a console application.
In order to work with DocumentDB, we need to pull in a NuGet package. Right-click on your project file and select Manage NuGet Packages. Search for the Microsoft Azure DocumentDB Client Library.
Select the right package in the search results and click Install. Your project is now ready to use the DocumentDB Client Library.
Now that we have set up a solution, created a project, and enabled the right .NET library to manage DocumentDB, we are going to write some C# code.
Note
Keep in mind that although the code samples in this book are mostly in C#, you can also use the programming language of your choice. There are SDKs available for multiple platforms (Java, Python, Node.js, and JavaScript). If yours is not supported, you could always use the REST API.
Add the following using statements to the top of the Program.cs file:
using Microsoft.Azure.Documents;
using Microsoft.Azure.Documents.Client;
using Microsoft.Azure.Documents.Linq;
using Newtonsoft.Json;
We need the URI and primary key that we retrieved in the previous paragraph.
After writing a few lines of C# code, we have the code snippet ready. It performs the following tasks:
Makes a connection to the DocumentDB account
Finds the database that is created in the portal
Finds the collection named testdevicehub
Saves a document to this collection
private static async Task CreateDocument()
{
    // Attach to DocumentDB using the URI and key from the Azure portal
    DocumentClient client = new DocumentClient(new Uri(docDBUri), key);

    // Query for the right database inside the DocumentDB account
    Database database = client.CreateDatabaseQuery()
        .Where(d => d.Id == "devicehub")
        .AsEnumerable()
        .FirstOrDefault();

    // Find the collection in which we want to store the document
    DocumentCollection collection = client.CreateDocumentCollectionQuery(database.SelfLink)
        .ToList()
        .Where(cl => cl.Id.Equals("testdevicehub"))
        .FirstOrDefault();

    // Create a simple document in the collection by providing the
    // DocumentsLink and the object to be serialized and stored
    await client.CreateDocumentAsync(
        collection.DocumentsLink,
        new PersonInformation
        {
            FirstName = "Riccardo",
            LastName = "Becker",
            DateOfBirth = new DateTime(1974, 12, 21)
        });
}
Replace the values of docDBUri and the key with your information and run the console application. You have just created your first document.
Now, go to the Azure portal again and open the query documents screen. You need to select the designated collection to enable this option. Running the base query returns the document that we have just created:
select * from c
Here's the screenshot:

As you can see, the document contains more than just the fields from the class PersonInformation. Here is a brief explanation of these fields:
id: This is the unique identifier for the document. In the application we have just created, the ID is automatically generated and is represented by a GUID.
_rid: This resource ID is an internally used property.
_ts: This is a property generated by the system, and it contains a timestamp.
_self: This is generated by the system, and it contains a unique URI pointing to this resource document.
_etag: This is a system-generated property containing an ETag that can be used for optimistic concurrency scenarios (if somebody updates the same document in the meantime, the ETag will differ and your update will fail).
_attachments: This is generated by the system, and it contains the path for the attachments resource belonging to this document.
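The optimistic concurrency check behind _etag can be illustrated with an in-memory stand-in for a collection. This is a simulation only and does not use the DocumentDB SDK; all names are hypothetical. Against the real service, the ETag is sent along with the update as a precondition, and a mismatch results in an HTTP 412 (Precondition Failed) response.

```javascript
// In-memory stand-in for a collection; NOT the DocumentDB SDK.
function createFakeCollection() {
  const documents = new Map();
  let version = 0;

  // Stores a copy of the document under a fresh ETag.
  function store(doc) {
    const stored = Object.assign({}, doc, { _etag: String(++version) });
    documents.set(stored.id, stored);
    return Object.assign({}, stored);
  }

  return {
    create: store,
    read: (id) => Object.assign({}, documents.get(id)),
    // Replace succeeds only if the caller still holds the current ETag.
    replace(doc, expectedEtag) {
      const current = documents.get(doc.id);
      if (!current || current._etag !== expectedEtag) {
        return { status: 412 }; // Precondition Failed: someone else updated it
      }
      return { status: 200, document: store(doc) };
    },
  };
}

const people = createFakeCollection();
people.create({ id: "1", FirstName: "Riccardo" });
const mine = people.read("1");
const theirs = people.read("1");

// The first writer wins; the second writer holds a stale ETag.
const first = people.replace(Object.assign({}, theirs, { LastName: "Becker" }), theirs._etag);
const second = people.replace(Object.assign({}, mine, { LastName: "B." }), mine._etag);
console.log(first.status, second.status); // prints 200 412
```

A client that receives the failure would typically read the document again, reapply its change, and retry.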
In this chapter, we covered the basics of DocumentDB. We saw that we can literally store everything and there is no predefined schema we need to comply with. The Azure portal offers some interesting blades for us to create, configure, and manage DocumentDB resources and offers some quick-starts to help you get started immediately. The internals of DocumentDB were discussed and we got a nice insight into the data model.
We also saw some common use cases that are applicable for DocumentDB. A small section was dedicated to explaining the pricing model and how your bill is affected by the actions you take.
Finally, we started to do a bit of coding and wrote a small C# console application that connects to the database and creates a document. We could see in the Azure portal that the document was actually stored, together with some other interesting metadata.
In the next chapter, we will discuss how to manage and monitor your DocumentDB resources.