Mastering RethinkDB

5 (1 reviews total)
By Shahid Shaikh
    What do you get with a Packt Subscription?

  • Instant access to this title and 7,500+ eBooks & Videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. The RethinkDB Architecture and Data Model

About this book

RethinkDB has a lot of cool things to be excited about: ReQL (its readable,highly-functional syntax), cluster management, primitives for 21st century applications, and change-feeds. This book starts with a brief overview of the RethinkDB architecture and data modeling, and coverage of the advanced ReQL queries to work with JSON documents. Then, you will quickly jump to implementing these concepts in real-world scenarios, by building real-time applications on polling, data synchronization, share market, and the geospatial domain using RethinkDB and Node.js. You will also see how to tweak RethinkDB's capabilities to ensure faster data processing by exploring the sharding and replication techniques in depth.

Then, we will take you through the more advanced administration tasks as well as show you the various deployment techniques using PaaS, Docker, and Compose. By the time you have finished reading this book, you would have taken your knowledge of RethinkDB to the next level, and will be able to use the concepts in RethinkDB to develop efficient, real-time applications with ease.

Publication date:
December 2016
Publisher
Packt
Pages
330
ISBN
9781786461070

 

Chapter 1. The RethinkDB Architecture and Data Model

RethinkDB is a real-time, open source distributed database. It stores JSON documents (basically unstructured data) in an operable format with distribution (sharding and replication). It also provides the real-time push of JSON data to the server, which redefines the entire real-time web application development.

In this chapter, we will look over its architecture and key elements in order to understand how RethinkDB supports these awesome features with high performance. We will also look over data modeling along with SQL operations in NoSQL, that is, joins.

Here is the list of topics we are going to cover in this chapter:

  • RethinkDB architectural components

  • Sharding and replication in RethinkDB

  • RethinkDB failover handling

  • The RethinkDB data model

  • Data modeling in RethinkDB

 

RethinkDB architectural components


RethinkDB's architecture consists of various components such as cluster, query execution engine, filesystem storage, push changes (real-time feed), and of course RethinkDB client drivers.

Refer to the following diagram to understand the block-level components of RethinkDB:

Client drivers

RethinkDB provides official client drivers for Node.js, Python, Ruby, and Java and various non official community drivers which are listed at the official website (https://rethinkdb.com/docs/install-drivers/). At the time of writing this book, only these languages were supported. In this book, we will refer to code examples with Node.js.

RethinkDB query engine

RethinkDB query handler, as name implies, performs query execution and returns the response to the client. It does so by performing lot of internal operations such as sorting, indexing, finding the cluster, or merging data from various clusters. All of these operations are performed by RethinkDB query handler. We will look at this in detail in the upcoming section.

RethinkDB clusters

RethinkDB is a distributed database designed for high-performance, real-time operations. RethinkDB manages distribution by clustering (sharding or replication). RethinkDB clusters are just another instance of the main process of RethinkDB and store data. We will look at sharding and replication in detail in the upcoming section.

Pushing changes to a RethinkDB client

This is a revolutionary concept introduced by RethinkDB. Consider this scenario: you are developing an application for the stock market where there are too many changes in a given amount of time. Obviously, we are storing every entry in the database and making sure that other connected nodes or clients know about these changes.

In order to do so, the conventional way is to keep looking (polling) for the data in the particular collection or table in order to find some changes. This improves the latency and turnaround time of packets, and we all know that a network call in a wide area network (WAN) is really costly. An HTTP call in a WAN is really costly.

Then came something called socket. In this, we do the polling operation but from the socket layer, not the HTTP layer. Here, the size of network requests may get reduced, but still we do the polling.

Note

Socket.io is one of the popular projects available for real-time web development.

RethinkDB proposes a reverse approach of this: what about the database itself tells you:

Hey, there are some changes happen in stock value and here are the new and old value.

This is exactly what RethinkDB push changes (change feed in technical terms) does. Once you subscribe to a particular table to look for its changes, RethinkDB just keeps pushing the old and new values of changes to the connected client. By "connected client," I meant a RethinkDB client and not a web application client. The difference between polling and push changes is shown here:

So you will get the changes in the data in one of the RethinkDB clients, say Node.js. And then you can simply broadcast it over the network, using socket probably.

But why are we using socket when RethinkDB can provide us the changes in the data? Because RethinkDB provides it to the middle layer and not the client layer, having a client layer directly talk to the client can be risky. Hence it has not been allowed yet.

But the RethinkDB team is working on another project called Horizon, which solves the issue mentioned previously, to allow clients to communicate to the database using secure layer of the middle tier. We will look at Horizon in detail in Chapter 10, Using RethinkDB and Horizon.

 

Query execution in RethinkDB


RethinkDB query engine is a very critical and important part of RethinkDB. RethinkDB performs various computations and internal logic operations to maintain high performance along with good throughput of the system.

Refer to the following diagram to understand query execution:

RethinkDB, upon arrival of a query, divides it into various stacks. Each stack contains various methods and internal logic to perform its operation. Each stack consists of various methods, but there are three core methods that play key roles:

  • The first method decides how to execute the query or subset of the query on each server in a particular cluster

  • The second method decides how to merge the data coming from various clusters in order to make sense of it

  • The third method, which is very important, deals with transmission of that data in streams rather than as a whole

To speed up the process, these stacks are transported to every related server and each server begins to evaluate it in parallel to other servers. This process runs recursively in order to merge the data to stream to the client.

The stack in the node grabs the data from the stack after it and performs its own method of execution and transformation. The data from each server is then combined into a single result set and streamed to the client.

In order to speed up the process and maintain high performance, every query is completely parallelized across various relevant clusters. Thus, every cluster then performs the query execution and the data is again merged together to make a single result set.

RethinkDB query engine maintains efficiency in the process too; for example, if a client only requests a certain result that is not in a shared or replicated server, it will not execute the parallel operation and just return the result set. This process is also referred to as lazy execution.

To maintain concurrency and high performance of query execution, RethinkDB uses block-level Multiversion Concurrency Control (MVCC). If one user is reading some data while other users are writing on it, there is a high chance of inconsistent data, and to avoid that we use a concurrency control algorithm. One of the simplest and commonly used methods method by SQL databases is to lock the transaction, that is, make the user wait if a write operation is being performed on the data. This slows down the system, and since big data promises fast reading time, this simply won't work.

Multiversion concurrency control takes a different approach. Here each user will see the snapshot of the data (that is, child copies of master data), and if there are some changes going on in the master copy, then the child copies or snapshot will not get updated until the change has been committed:

RethinkDB does use block-level MVCC and this is how it works. Whenever there is any update or write operation being performed during the read operation, RethinkDB takes a snapshot of each shard and maintains a different version of a block to make sure every read and write operation works in parallel. RethinkDB does use exclusive locks on block level in case of multiple updates happening on the same document. These locks are very short in duration because they all are cached; hence it always seems to be lock-free.

RethinkDB provides atomicity of data as per the JSON document. This is different from other NoSQL systems; most NoSQL systems provide atomicity to each small operation done on the document before the actual commit. RethinkDB does the opposite, it provides atomicity to a document no matter what combination of operations is being performed.

For example, a user may want to read some data (say, the first name from one document), change it to uppercase, append the last name coming from another JSON document, and then update the JSON document. All of these operations will be performed automatically in one update operation.

RethinkDB limits this atomicity to a few operations. For example, results coming from JavaScript code cannot be performed atomically. The result of a subquery is also not atomic. Replace cannot be performed atomically.

 

Filesystem and data storage


RethinkDB supports major used filesystems such as NTFS, EXT and so on. RethinkDB also supports direct I/O filesystems for efficiency and performance, but it is not enabled by default.

About direct I/O

File is stored on disk and when it's been requested by any program, the operating system first puts it into the main memory for faster reads. The operating system can read directly from disk too, but that would slow down the response time because of heavy-cost I/O operation. Hence, the operating system first puts it into the main memory for operation. This is called buffer cache.

Databases generally manage data caching at the application and they do not need the operating system to cache it for them. In such cases, the process of buffering at two places (main memory and application cache) becomes an overhead since data is first moved to the main memory and then the application cache.

This double buffering of data results in more CPU consumption and load on the memory too.

Direct I/O is a filesystem for those applications that want to avoid the buffering at the main memory and directly read files from disk. When direct I/O is used, data is transferred directly to the application buffer instead of the memory buffer, as shown in the following diagram:

Direct I/O can be used in two ways:

  • Mounting the filesystem using direct I/O (options vary from OS to OS)

  • Opening the file using the O_DIRECT option specified in the open() system call

Direct I/O provides great efficiency and performance by reducing CPU consumption and the overhead of managing two buffers.

Data storage

RethinkDB uses a custom-built storage engine inspired by the Binary tree file system by Oracle (BTRFS). There is not enough information available on the RethinkDB custom filesystem right now, but we have found the following promises by it:

  • Fully concurrent garbage compactor

  • Low CPU overhead

  • Efficient multi-core operation

  • SSD optimization

  • Power failure recovery

  • Data consistency in case of failure

  • MVCC supports

Due to these features, RethinkDB can handle large amounts of data in very little memory storage.

 

Sharding and replication


Sharding is partitioning where the database is split across multiple smaller databases to improve performance and reading time. In replication, we basically copy the database across multiple databases to provide a quicker look and less response time. Content delivery networks are the best examples of this.

RethinkDB, just like other NoSQL databases, also uses sharding and replication to provide fast response and greater availability. Let's look at it in detail bit by bit.

Sharding in RethinkDB

RethinkDB makes use of a range sharding algorithm to provide the sharding feature. It performs sharding on the table's primary key to partition the data. RethinkDB uses the table's primary key to perform all sharding operations and it cannot use any other keys to do so. In RethinkDB, the shard key and primary key are the same.

Upon a request to create a new shard for a particular table, RethinkDB examines the table and tries to find out the optimal breakpoint to create an even number of shards.

For example, say you have a table with 1,000 rows, the primary key ranging from 0 to 999, and you've asked RethinkDB to create two shards for you.

RethinkDB will likely find primary key 500 as the breaking point. It will store every entry ranging from 0 to 499 in shard 1, while data with primary keys 500 to 999 will be stored in shard 2. The shards will be distributed across clusters automatically.

You can specify the sharding and replication settings at the time of creation of the table or alter it later. You cannot specify the split point manually; that is RethinkDB's job to do internally. You cannot have less server than you shard.

You can always visit the RethinkDB administrative screen to increase the number of shards or replicas:

We will look at this in more detail with practical use cases in Chapter 5, Administration and Troubleshooting Tasks in RethinkDB totally focused on RethinkDB administration.

Let's see in more detail how range-based sharding works. Sharding can be basically done in two ways, using vertical partitioning or horizontal partitioning:

  • In vertical partitioning, we store data in different tables having different documents in different databases.

  • In horizontal partitioning, we store documents of the same table in separate databases. The range shard algorithm is a dynamic algorithm that determines the breaking point of the table and stores data in different shards based on the calculation.

Range-based sharding

In the range sharding algorithm, we use a service called locator to determine the entries in a particular table. The locator service finds out the data using range queries and hence it becomes faster than others. If you do not have a range or some kind of indicator to know which data belongs to which shard in which server, you will need to look over every database to find the particular document, which no doubt turns into a very slow process.

RethinkDB maintains a relevant piece of metadata, which they refer to as the directory. The directory maintains a list of node (RethinkDB instance) responsibilities for each shard. Each node is responsible for maintaining the updated version of the directory.

RethinkDB allows users to provide the location of shards.You can again go to web-based administrative screens to perform the same. However, you need to set up the RethinkDB servers manually using the command line and it cannot be done via web-based interfaces.

Replication in RethinkDB

Replication provides a copy of data in order to improve performance, availability, and failover handling. Each shard in RethinkDB can contain a configurable number of replicas. A RethinkDB instance (node) in the cluster can be used as a replication node for any shard. You can always change the replication from the RethinkDB web console.

Currently, RethinkDB does not allow more than one replica in a single RethinkDB instance due to some technical limitations. Every RethinkDB instance stores metadata of tables. In case of changes in metadata, RethinkDB sends those changes across other RethinkDB instance in the cluster in order to keep the updated metadata across every shard and replica.

Indexing in RethinkDB

RethinkDB uses the primary key by default to index a document in a table. If the user does not provide primary key information during the creation of the table, RethinkDB uses its default name ID.

The default-generated primary key contains information about the shard's location in order to directly fetch the information from the appropriate shard. The primary key of each shard is indexed using the B-Tree data structure.

One of the examples for the RethinkDB primary key is as follows:

D0041fcf-9a3a-460d-8450-4380b00ffac0.

RethinkDB also provides the secondary key and compound key (combination of keys) features. It even provides multi-index features that allow you to have arrays of values acting as keys, which again can be single compound keys.

Having system-generated keys for primary is very efficient and fast, because the query execution engine can immediately determine on which shard the data is present. Hence, there is no need for extra routing, while having a custom primary key, say an alphabet or a number, may force RethinkDB to perform more searching of data on various clusters. This slows down the performance. You can always use secondary keys of your choice to perform further indexing and searching based on your application needs.

 

Automatic failover handling in RethinkDB


RethinkDB provides automatic failover handling in a multi-server configuration where multiple replicas of a table are present. In case of node failure due to any reason, RethinkDB finds out the other node to divert the request and maintain the availability. However, there are some requirements that must be met before considering automatic failover handling:

  • The cluster must have three or more nodes (RethinkDB servers)

  • The table must be set to have three or more replicas set with the voting option

  • During failover, the majority of replicas (greater than half of all replicas) for the table must be online

Every table, by default, has a primary replica created by RethinkDB. You can always change that using the reconfigure() command. In case of failure of the primary replica of the table, as long as more than half of the replicas with voting option are available, one of them will be internally selected as the primary replica. There will be a slight offline scenario while the selection is going on in RethinkDB, but that will be very minor and no data will be lost.

As soon as the primary replica comes online, RethinkDB automatically syncs it with the latest documents and switches control of the primary replica to it automatically.

 

About voting replicas


By default, every replica in RethinkDB is created as a voting replica. That means those replicas will take part in the failover process to perform the selection of the next primary replica. You can also change this option using the reconfigure() command.

Automatic failover requires at least three server clusters with three replicas for table. Two server clusters will not be covered under the automatic failover process and the system may go down during the failure of any RethinkDB instance.

In such cases-where RethinkDB cannot perform failover-you need to do it manually using the reconfigure() command, by passing the emergency repair mode key.

Upon running emergency repair mode, each of the shards is first examined and classified into three categories:If more than half of the total shards are available, it will return a healthy status

  • Repairable: In the case of repairable, the shard is not healthy but has one replica, which can be used

  • Beyond repair: In the case of beyond repair, the shard has no available replica and cannot be used

For each and every shard that can be repaired, RethinkDB will first change all the offline replicas into non-voting replicas. If there is no voting replica available, RethinkDB will choose one non-voting replica and forcefully convert it into a voting replica.

You can specify two options along with the emergency repair option:

  • unsafe_rollback: This will leave those shards that are beyond repair during the failover process

  • unsafe_rollback_or_erase: This will delete those shards that are beyond repair and create one on the available server that holds another shard for that table

Here is the command reference:

r.table(users).reconfigure(
{emergencyRepair: "unsafe_rollback_or_erase"}
).run(conn, callback);

Please note that emergency failover handling is very critical and should be done by a skilled person, or else you may end up losing all your data.

 

The RethinkDB data model


The RethinkDB data model consists of two main components:

  • RethinkDB data types

  • RethinkDB model relationship

RethinkDB data types are further classified into basic data types, dates, binary objects, and geospatial queries.

Refer to the following diagram for more details:

Let's go over to each component in detail.

RethinkDB data types

RethinkDB provides six basic data types; they are represented in tabular format, as shown in the following table:

Data type

Description

Numbers

Numbers are stored using double-precision floating-point numbers.

Strings

Strings are stored in UTF-8 standard.

Objects

Stored as a key-value pair with standard JSON format.

Arrays

Stored as a list of elements. It supports 100,000 elements by default.

Binary

Binary objects includes files, images, and other binary data.

Date and time

RethinkDB stores date and time with millisecond precision.

Booleans

True and false values.

Null

Stores null values.

RethinkDB also provides extended data types supported by it. They are as follows:

  • Tables: These are RethinkDB tables. You can insert and delete documents from them with proper indexing.

  • Streams: Streams provide the transfer of large amounts of data in chunks so that the system won't reach the buffer overflow limit. They are loaded in a lazy fashion. Streams provide the navigation point to a chunk, called cursor, which you can use to traverse the result set. You can club all of them once the streams are collected using the cursor. This makes it easy to read large amounts of data. Streams are read-only operations.

  • Selections: Selections are subsets of data, such as 10 documents out of 1,000 from a table. There are two types of selections, one with objects and one with streams. The difference between the two is that objects can be writable while streams are read-only. You can pass an object selected by, say, the get() command to other commands to manipulate the data.

  • Date and time: Date and time are stored in RethinkDB with millisecond precision, along with time zones. Currently, minute-precision time offsets from UTC are supported. RethinkDB internally calculates the time zone difference; hence you don't need to do it at the client end.

You can use native date commands supported by RethinkDB drivers such as in Node.js; you can use the Date() object.

Binary objects

Binary objects are stored similar to BLOB in SQL databases. You can directly read and upload files, images, and so on directly into the database. Parsing and other dirty tasks will be dealt with by RethinkDB.

One of the amazing functionalities of RethinkDB is calling external APIs from RethinkDB native drivers. So, consider you want to add profile pictures of users directly into the database from Gravatar, a common place of avatars. All you need to do is call the API and convert the detail into binary. Consider the following official code snippet:

var hash = md5(email); 
gravatarUrl = 'http://gravatar.com/avatar/' + hash + '?d=retro'; 
r.table('users').get(userId).update({ 
gravatar: r.http(gravatarUrl, {resultFormat: 'binary'}) 
}).run(conn, callback) 

Assume that conn is the RethinkDB connection and we are passing the email server of the user to get the avatar. If you notice, we are calling the gravatar API using the http() function and converting resultformat into binary.

Specifying the result format is not mandatory here; if the MIME type of the calling server is set correctly, you can just call the HTTP URL and it will be converted to the default MIME type, say binary. It's better to be on the safer side though.

Geospatial queries in RethinkDB

RethinkDB provides and supports geospatial features to help you build location-based applications. By using geospatial queries, you can easily parse, convert, and perform lots of operations, such as computing the distance between two locations, finding an intersecting location, and many more. RethinkDB stores all geographical information in the GeoJSON standard format.

RethinkDB uses a geometry object to perform geographical queries. Geometry objects are derived by plotting two-dimensional objects such as lines and points on a sphere in three-dimensional space.

RethinkDB stores information in a standard geographic system, which is addressing a point on a surface in longitude and latitude. It does not support elevation yet. The range of longitudes is -180 to 180 and the range of latitudes is -90 to 90. To store the same, we use the point() function of RethinkDB. Here is some sample code, assuming r is a RethinkDB object:

r.table('demo').insert([ 
 { 
  name : "Mumbai", 
  location : r.point(19.0760,72.8777)  
 } 
]) 

Supported data types

Geospatial data types and functions are derived from three geometric object data types: points, lines, and polygons.

By using these three, RethinkDB provides various geospatial functions such as circle(), intersect(), getNearest(), and so on.

RethinkDB model relationships

RethinkDB, besides being a NoSQL database, provides one of the most requested features by SQL developers, that is, JOINS. RethinkDB allows you to model and structure your data in such a way that allows you to perform JOINS over it. There are two ways to model relationships in RethinkDB:

  • By using embedded arrays in the document

  • By linking documents stored in multiple tables

Let's see how both of them work, along with their merits and demerits.

Embedded arrays

Embedded arrays basically means using an array of objects in the table in each document. Here is a simple example:

{ 
  "id": "7644aaf2-9928-4231-aa68-4e65e31bf219", 
  "name": "Shahid", 
  "chapters": [ 
    {"title": "Chapter 01", "pages": "30+"}, 
    {"title": "Chapter 02", "pages": "30+"} 
  ] 
} 

Here, the "chapters" key contains the array of objects, and each object contains the name of the chapter and the page count. In order to perform the query, here is how we do it in RethinkDB.

To display all chapters from table:

r.db("book").table("chapters").run()

To get all the chapters written by a particular author; we can do that by passing the author ID:

r.db("book").table("chapters").get('7644aaf2-9928-4231-aa68-4e65e31bf219').run()

Here, we are passing the ID of the document, which is a system-generated primary key ID by RethinkDB. Upon passing it, we can access a single document from the table. This is a very basic way of modelling and is not recommended for production use.

Merits of embedded arrays

The following merits are as follows:

  • Easy to design and query

  • Updates on a single document will automatically update chapters and authors both. Hence, both pieces of data will be atomic

Demerits of embedded arrays

As soon as the size of an embedded array exceeds a certain limit, it costs too much computing power to load the entire array every time we have to perform an operation.

Document linking in multiple tables

This is similar to the foreign key relationships we make in traditional SQL databases. Here, we link two or more tables using the index. We can also query the table using functions provided by RethinkDB to perform various joins (inner, outer, and so on):

As shown in the preceding diagram, we have two tables with multiple documents in them. The Authors table will just contain documents related to authors, say name, location, and (obviously) a primary key.

The Chapters table will contain documents of chapters; each chapter is a separated document instead of an array. The only difference is that each document will contain an index with the same value as the primary key from the Authors table. This way, we link both of them together.

As you can see in the diagram, there is a key in every document in the Chapters table named author_id with the same value as id in the Authors table.

Here is how we query the tables to perform JOINS:

r.db("books").table("chapters").   filter({"author_id": "abc768332"}). run() 

This will return every chapter for the author having ID as abc768332.

You can also use RethinkDB JOINS functions such as eq_join(); we will look at the in the next section.

Merits of document linking

The following merits are:

  • Data distribution and very neat design

  • No need to load all of the data of chapters at once in order to perform any operation

  • No matter what the size of documents in chapters table is, the performance won't be affected as in embedded arrays

Demerits of document linking

The following demerits are:

  • There are no foreign key constraints, so linking of data will be complicated

  • In the case of an update in one table, it won't automatically update the data in another table

Here is a list of JOIN commands provided by RethinkDB:

  • eq_Join

  • innerJoin

  • outerJoin

  • Zip

Before looking at these, let's populate some data in two tables. We have authors and chapters in the books database.

We will add author details in the authors table:

r.db('books').table('authors').insert({name : "Shahid", location : "Mumbai"}) 

We will be adding chapter details in the chapters table:

r.db('books').table('chapters').insert({ 
   author_id : "521b92b1-0d83-483d-a374-b94a400cf699",  
   chapterName : "Chapter 1" 
}) 
 
r.db('books').table('chapters').insert({ 
   author_id : "521b92b1-0d83-483d-a374-b94a400cf699",  
   chapterName : "Chapter 2" 
}) 

Performing eq_Join

Here we are joining the chapters table with the authors table and mentioning which field we have mapped to the ID, that is, author_id:

r.db('books').table('chapters').eqJoin("author_id",r.db('books').table('authors')) 
 

This will return the following:

[ 
   { 
      "left":{ 
         "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
         "chapterName":"Chapter 1", 
         "id":"f0b5b2f7-1f82-41ef-a945-f5fa8259dd53" 
      }, 
      "right":{ 
         "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
         "location":"Mumbai", 
         "name":"Shahid" 
      } 
   }, 
   { 
      "left":{ 
         "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
         "chapterName":"Chapter 2", 
         "id":"f58826d4-e259-4ae4-91e4-d2e3db2d9ad3" 
      }, 
      "right":{ 
         "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
         "location":"Mumbai", 
         "name":"Shahid" 
      } 
   } 
] 

The left and right keys supposedly represent the tables on the left side and the right side of the query. If you map the keys and values with the ReQL query, you can easily understand.

However, this is not how we really want our data. We want processed, result-oriented data, and to do that, we need to use the zip() command. It basically removes all metadata information from the result and gives you only the documents of tables:

r.db('books').table('chapters').eqJoin("author_id",r.db('books').table('authors')).zip() 

This returns the following:

[ 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 1", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   }, 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 2", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   } 
] 

Performing inner joins

An inner join, as we all know, equates two tables by comparing each row. It is very similar to working with EQJOIN, except that it compares each document in the table against the target table, while in EQJOIN, we specify which key to compare for.

Here is a simple query to perform an inner join:

r.db('books').table('chapters').innerJoin(r.db('books').table('authors'),function(chapters,authors) { 
  return chapters 
}).zip() 

This function takes the target table as a parameter and the callback function, which contains data of both the tables in the callback argument. If you notice, for understanding, I've named the callback parameters the same as the table name. You can perform a lot of other operations such as comparison, or filtering inside the callback function and then returning the result. Since it's a JOIN, in both the variables, data will be similar-except with different table ID's.

Here is the result for the same:

[ 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 1", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   }, 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 2", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   } 
] 

If you observe, the result set is pretty much the same as EQJOIN (the eq_Join() function), except that it provides you the result of each document under the callback function. This makes it really slow, and the RethinkDB team does not recommend it for use in production.

Performing outer joins

Outer join union the result of left join and right join and returns it to the client. So, basically, the result test from both the tables will be combined and returned. Here is a sample query.

r.db('books').table('chapters').outerJoin(r.db('books').table('authors'),function(chapters,authors) { 
  return authors 
}).zip() 

This will combine each document of chapters with each document of authors and return the result. Again, we can access each document of the query here under the callback.

It shall return the following:

[ 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 1", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   }, 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 2", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   } 
] 

In order to check it is working, let's just create one more author in the authors table and not create any chapter document entry for it:

r.db('books').table('authors').insert({name : "RandomGuy", location : "California"}) 

Upon running the outer join query again, here is the result:

[ 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 1", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   }, 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 2", 
      "id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "location":"Mumbai", 
      "name":"Shahid" 
   }, 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 1", 
      "id":"78acabb5-a5b8-434b-acb1-52507b71831d", 
      "location":"California", 
      "name":"RandomGuy" 
   }, 
   { 
      "author_id":"521b92b1-0d83-483d-a374-b94a400cf699", 
      "chapterName":"Chapter 2", 
      "id":"78acabb5-a5b8-434b-acb1-52507b71831d", 
      "location":"California", 
      "name":"RandomGuy" 
   } 
] 

Hence we get all of the result-the union of each document of the table present in the left table (that is, authors) with each document present in the right table (that is, chapters).

Zip

This function performs the merging of left fields with right fields with a JOIN operation into a single dataset.

 

Constraints and limitation in RethinkDB


We have covered various architectural features and the data model of RethinkDB. Let's look over some of the constraints of RethinkDB that you need to take into account while architecting your data store.

RethinkDB divides the limitation into hard and soft limitations. The hard limitations are as follows:

  • The number of databases has no hard limit

  • There is a limit on shard creation, which is a maximum of 64 shards

  • Creation of tables inside the database has no hard limit

  • Storage size of a single document has no hard limit (however, it is recommended to keep it under 16 MB for performance reasons)

  • The maximum possible size of a query is 64 MB

RethinkDB also has some memory limitation, as follows:

  • An empty table will need up to 4 MB

  • Each table, after population of documents, requires at least 10 MB of disk space on each server wherever it is replicated in the cluster

  • Each table consumes 8 MB of RAM on each server

RethinkDB, in order to keep performance high, stores some data in the RAM. There are basically three sources of usage of RAM by RethinkDB:

  • Metadata

  • Page cache

  • Running queries and background processes

RethinkDB stores metadata of tables in the main memory in order to ensure fast read access. Every table consumes around 8 MB per server for the metadata. RethinkDB organizes the data into blocks, with size ranging from 512 bytes to 4 KB. Out of these blocks, approximately 10 to 26 bytes per block are kept in memory.

Page cache is a very important aspect of performance. It is basically used to store very frequently accessed data in the RAM rather than reading it from disk (except in the case of direct I/O, where the page cache is in the application buffer than RAM). RethinkDB uses this formula to calculate the size of the cache:

Cache size = available memory - 1024 MB / 2

If the cache size is less than 1224 MB, then RethinkDB set the size of page cache to 100 MB. This is why it is recommended to have at least 2 GB of RAM allocated for RethinkDB processes.

You can also change the size of the page cache when you start the server or later, using configuration files.

Every database uses some memory to store the results of ongoing running queries. Since queries differ, in general, there is no exact estimate about memory usage by running queries; however, a rough estimate is between 1 MB and 20 MB, including background processes such as transferring data between nodes, voting processes, and so on.

 

Summary


RethinkDB is indeed a next-generation database with some amazing features and high performance. Due to these architectural advantages, it has been trusted by top-level organizations such as NASA. We covered the architecture of RethinkDB in detail in order to understand how it works and why it is great for real-time databases. We also covered disk storage, clustering, and failover handling of RethinkDB.

Along with the architecture, we also looked over data modeling in RethinkDB. We also looked over one of the most used features of SQL-JOINS working in a NoSQL database, that is, RethinkDB. There is no doubt that RethinkDB is the next big thing.

In next chapter, we are going to learn about the query language of RethinkDB, called ReQL. We will go over it with examples. We will also cover changefeeds, which is personally my favorite feature of RethinkDB.

About the Author

  • Shahid Shaikh

    Shahid Shaikh is an engineer, published author, and a blogger. He has written tons of tutorials on Node.js and various databases on his blog. He is a believer in open source and almost all his source code is on GitHub. He has also developed and launched a SaaS product called Spark to handle content sharing automation. Currently, he is building a next-generation email marketing tool called Hashmailer.

    Browse publications by this author

Latest Reviews

(1 reviews total)
An essential reference! Nobody else makes this. RethinkDB is not dead, it's moving to purely open source now.
Mastering RethinkDB
Unlock this book and the full library FREE for 7 days
Start now