Learning RethinkDB

Jonathan Pollack

September 28th, 2015

RethinkDB is a relatively new, fully open-source NoSQL database, featuring ridiculously easy sharding, replication, and database management; table joins (that’s right!); geospatial and time-series support; and real-time monitoring of complicated queries. I think the feature list alone makes this a piece of tech worth looking further into, to say nothing of the fact that we’ll likely be seeing an explosion of apps that use RethinkDB as their fundamental database, so developers, get ready to learn yet another database. That said, like any tool, you should consult your doctor when deciding if RethinkDB is right for you.

When to avoid

Like most NoSQL offerings, RethinkDB makes a few conscious trade-offs in its design, most notably when it comes to ACID compliance and the CAP theorem.

  • If you need a fully ACID compliant database, or strong type checking across your schema, you would be better served by a traditional SQL database.
  • If you absolutely need write availability over data consistency, look elsewhere: RethinkDB favors consistency.

Also, because of how queries are performed and returned, “big data” use cases are probably not a great fit for this database–specifically if you want to handle results larger than 64 MB, or are performing computationally intensive work on your stored data.

When to consider

  • You want a great web-based management console for data-center configuration (sharding, replication, etc.), database monitoring, and testing queries.
  • You want the flexibility of a schema-less database, with the ability to easily express relationships via table joins.
  • You need to perform geospatial queries (e.g. find all documents with locations within 5km of a given point).
  • You deal with time series data, especially across various time zones.
  • You need to push data to your clients in real time as the results of complex queries change.

Management console

The web console is insanely easy to use, and gives you all of the control you need for administering your data-center, even if it is only a data-center of one database.

Setting up a data-center is just a matter of pointing your new database to an existing node in a cluster. Once that’s done, you can use the web console to shard (and re-shard) your data, as well as determine how many replicas you want floating around.

You can also run queries (and profile those queries) against your databases straight from the web console, giving you quick access to your data and performance.

Table joins (capturing data relations)

One of the best pieces of syntactic sugar that RethinkDB provides, in my opinion, is the ability to do table joins. Certainly, this isn’t that magical (what we’re doing is essentially a nested query, using a specified field as the primary key for the nested lookup), but it really does make queries easy to read and compose.

r.table("table1").eqJoin("doc_field_as_table2_primary_key", r.table("table2")).zip().run()

Even more awesomely, the JavaScript ORM Thinky allows for very slick, seamless query-level joins, based on the same principle.
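
To make the "nested lookup" idea concrete, here is a minimal plain-JavaScript sketch of what a join followed by zip() boils down to. It is only an illustration, not how RethinkDB executes joins internally, and the tables, field names, and helper functions (posts, authors, eqJoinSketch, zipSketch) are all made up for the example.

```javascript
// Illustration only: in-memory stand-ins for two tables (hypothetical data).
const posts = [
  { id: 1, title: "Hello", authorId: "a1" },
  { id: 2, title: "Joins", authorId: "a2" },
  { id: 3, title: "Orphan", authorId: "missing" },
];
const authors = [
  { authorKey: "a1", name: "Ada" },
  { authorKey: "a2", name: "Grace" },
];

// Index the right-hand table by its primary key so each lookup is cheap.
function indexByKey(rows, key) {
  const index = new Map();
  for (const row of rows) index.set(row[key], row);
  return index;
}

// For each left row, look up the right row whose primary key equals the
// value stored in leftField. Rows without a match are dropped (inner join).
function eqJoinSketch(leftRows, leftField, rightIndex) {
  const pairs = [];
  for (const left of leftRows) {
    const right = rightIndex.get(left[leftField]);
    if (right !== undefined) pairs.push({ left, right });
  }
  return pairs;
}

// Merge each {left, right} pair into one document. In this sketch the
// right-hand fields overwrite left-hand fields that share a name.
function zipSketch(pairs) {
  return pairs.map(({ left, right }) => ({ ...left, ...right }));
}

const joined = zipSketch(
  eqJoinSketch(posts, "authorId", indexByKey(authors, "authorKey"))
);
console.log(joined);
```

The third post has no matching author, so it simply does not show up in the joined result.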

Geospatial primitives

Given that location-aware queries are becoming more and more popular, if not downright necessary, it’s great to see that RethinkDB comes with support for the following geometric primitives: point, line, polygon (at least three sides), circle, and polygonSub (subtract one polygon from a larger, enclosing polygon).

It allows for the following types of queries: distance, intersects, includes, getIntersecting, and getNearest. For example, you can find all of the documents within 5 km of Greenwich, England.

// r.point takes (longitude, latitude); Greenwich sits at roughly 0° E, 51.48° N
r.table("table1").getNearest(r.point(0, 51.48), {index: "table1_geo_index", maxDist: 5, unit: "km"}).run()
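
If you are curious what a "nearest within 5 km" query has to compute, the following plain-JavaScript sketch does it by brute force. It is not RethinkDB's implementation: it assumes a spherical Earth and the haversine formula, skips the geospatial index entirely, and the place names, approximate coordinates, and function names (distanceKm, getNearestSketch) are my own for illustration.

```javascript
const EARTH_RADIUS_KM = 6371; // mean radius; a spherical approximation

function toRadians(degrees) {
  return (degrees * Math.PI) / 180;
}

// Haversine great-circle distance between two [longitude, latitude] pairs.
function distanceKm([lon1, lat1], [lon2, lat2]) {
  const dLat = toRadians(lat2 - lat1);
  const dLon = toRadians(lon2 - lon1);
  const a =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLon / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(a));
}

// Brute-force stand-in for an indexed nearest-neighbour query: measure every
// document, keep those within maxDistKm, and sort closest first.
function getNearestSketch(docs, center, maxDistKm) {
  return docs
    .map((doc) => ({ dist: distanceKm(center, doc.location), doc }))
    .filter((hit) => hit.dist <= maxDistKm)
    .sort((a, b) => a.dist - b.dist);
}

// Hypothetical documents with approximate [longitude, latitude] locations.
const greenwich = [0, 51.4779];
const places = [
  { name: "Tower Bridge", location: [-0.0754, 51.5055] },
  { name: "The O2", location: [0.0032, 51.503] },
  { name: "Heathrow", location: [-0.4543, 51.47] },
  { name: "Cutty Sark", location: [-0.0096, 51.4827] },
];

const nearby = getNearestSketch(places, greenwich, 5);
console.log(nearby.map((hit) => `${hit.doc.name}: ${hit.dist.toFixed(2)} km`));
```

Note the argument order: longitude first, then latitude, which is easy to get backwards.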

Time-series support (sane date & time primitives)

Official drivers do native conversions for you, which means you can make timezone-aware, context-driven queries that find documents that occurred at a given time on a given day in a given timezone.

Some other cool features:

  • Times can be used as indexes.
  • Time operations are handled on the database, allowing them to be executed across the cluster effortlessly.

Take, for example, the desire to figure out how many customer support tickets were coming in between 9 am and 5 pm, every day. We don’t want to have to figure out how to offset the timestamp on each document, given that the timezones could each be different. Thankfully, RethinkDB will do this accounting, and spread out the computation across the cluster without asking us for a thing.

r.table('customer-support-tickets').filter(function (ticket) {
  // hours() is evaluated in each timestamp's own time zone
  return ticket('time').hours().ge(9).and(
        ticket('time').hours().lt(17));
}).count().run();
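
For a feel of the accounting the database is doing for us, here is a rough plain-JavaScript sketch that counts tickets whose local hour falls between 9 am and 5 pm. It assumes each timestamp is stored as an instant plus a UTC offset; the field names (epochMs, offsetMinutes), the sample tickets, and the localHour helper are hypothetical and not RethinkDB's storage format or API.

```javascript
// Return the hour of day (0-23) as seen in the timestamp's own time zone.
function localHour({ epochMs, offsetMinutes }) {
  // Shift the instant by the offset, then read the hour in UTC.
  const shifted = new Date(epochMs + offsetMinutes * 60 * 1000);
  return shifted.getUTCHours();
}

// Hypothetical tickets; note that 1 and 2 are the same instant in UTC.
const tickets = [
  { id: 1, time: { epochMs: Date.UTC(2015, 8, 28, 16, 30), offsetMinutes: -7 * 60 } }, // 09:30 local
  { id: 2, time: { epochMs: Date.UTC(2015, 8, 28, 16, 30), offsetMinutes: 2 * 60 } }, // 18:30 local
  { id: 3, time: { epochMs: Date.UTC(2015, 8, 28, 7, 15), offsetMinutes: 0 } }, // 07:15 local
  { id: 4, time: { epochMs: Date.UTC(2015, 8, 28, 3, 0), offsetMinutes: 9 * 60 } }, // 12:00 local
];

// Keep tickets filed from 9 am up to (but not including) 5 pm local time.
const duringBusinessHours = tickets.filter((ticket) => {
  const hour = localHour(ticket.time);
  return hour >= 9 && hour < 17;
});

console.log(duringBusinessHours.map((ticket) => ticket.id));
```

Tickets 1 and 2 were filed at the very same moment, yet only one of them counts as "during business hours", which is exactly the bookkeeping you would otherwise have to do by hand.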

Realtime query result monitoring (change feeds)

The most impressive feature of RethinkDB, by far, has to be change-feeds. You can turn almost any query you would practically want to monitor into a live stream of changes just by chaining a changes() call onto the end.

For example, monitor the changes to a given table:

r.table("table1").changes().run()

or to a given query (the top ten rows of an ordered table, for instance; changes() on an orderBy needs an index and a limit):

r.table("table1").orderBy({index: "key"}).limit(10).changes().run()

And of course, the queries can be made more complicated, but these examples above should blow your mind. No more polling, no more having to come up with the data diffs yourself before pushing them to the client. RethinkDB will do the diff for you, and push the results straight to your server.
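
On the receiving end, each notification from a table change-feed carries the old and new versions of a document as old_val and new_val, with null standing in for "did not exist" on inserts and deletes. A minimal sketch of keeping a local copy in sync might look like the following; the applyChange helper, the cache, and the hand-written sample changes are my own stand-ins, since no database is involved here.

```javascript
// Apply one change notification to a local cache keyed by document id.
// Returns a label describing what happened, purely for logging.
function applyChange(cache, change) {
  const { old_val: oldVal, new_val: newVal } = change;
  if (newVal === null) {
    // The document was deleted.
    cache.delete(oldVal.id);
    return "deleted";
  }
  cache.set(newVal.id, newVal);
  return oldVal === null ? "inserted" : "updated";
}

// Hand-written sample notifications standing in for a real feed.
const sampleChanges = [
  { old_val: null, new_val: { id: 1, status: "open" } },
  { old_val: null, new_val: { id: 2, status: "open" } },
  { old_val: { id: 1, status: "open" }, new_val: { id: 1, status: "closed" } },
  { old_val: { id: 2, status: "open" }, new_val: null },
];

const cache = new Map();
const log = sampleChanges.map((change) => applyChange(cache, change));

console.log(log);
console.log([...cache.values()]);
```

With a real driver, the same applyChange function would be called once per item coming off the feed's cursor.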

There is one caveat here, however: while this works well for on the order of 10 clients, it is more efficient to couple your change-feeds to a pub-sub service when pushing to many clients.
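
The shape of that coupling is simple: open one feed on the server, and let a pub-sub layer fan each change out to however many clients are listening. Here is a tiny in-process sketch of the pattern; ChangeHub and openSingleFeed are hypothetical names, the "feed" is faked with a hard-coded array, and a real deployment would swap the hub for an actual pub-sub service.

```javascript
// A minimal in-process pub-sub hub (a stand-in for a real pub-sub service).
class ChangeHub {
  constructor() {
    this.handlers = new Set();
  }
  // Register a handler; returns a function that unsubscribes it again.
  subscribe(handler) {
    this.handlers.add(handler);
    return () => this.handlers.delete(handler);
  }
  publish(change) {
    for (const handler of this.handlers) handler(change);
  }
}

const hub = new ChangeHub();
const CLIENT_COUNT = 1000;
const received = new Array(CLIENT_COUNT).fill(0);
const unsubscribers = [];
for (let i = 0; i < CLIENT_COUNT; i++) {
  // Each "client" just counts the messages it is sent.
  unsubscribers.push(hub.subscribe(() => { received[i] += 1; }));
}

// Stand-in for opening ONE change-feed; counts how many feeds get opened.
let feedsOpened = 0;
function openSingleFeed(onChange) {
  feedsOpened += 1;
  const fakeChanges = [
    { old_val: null, new_val: { id: 1 } },
    { old_val: null, new_val: { id: 2 } },
  ];
  fakeChanges.forEach(onChange);
}

openSingleFeed((change) => hub.publish(change));

// Client 0 disconnects, then one more change arrives.
unsubscribers[0]();
hub.publish({ old_val: null, new_val: { id: 3 } });

console.log({ feedsOpened, client0: received[0], client1: received[1] });
```

The point of the design is that the database only ever sees one feed, no matter how many clients come and go.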

Conclusion

RethinkDB has a lot of cool things to be excited about: ReQL (its readable, highly functional syntax), cluster management, primitives for 21st century applications, and change-feeds. And you know what, if RethinkDB only had change-feeds, I would still be extremely excited about it. Think of all that time you no longer have to spend banging your head against the wall trying to deal with consistency and concurrency issues!

If you are thinking about starting a new project, or are tired of fighting with your current NoSQL database, and don’t have any requirements in the “avoid” camp, you should strongly consider using RethinkDB.

About the author

Jonathan Pollack is a full-stack developer living in Berlin. He previously worked as a web developer at a public shoe company, and prior to that, worked at a startup that’s trying to build the world’s best pan-cloud virtualization layer. He can be found on Twitter @murphydanger.
