Now that you have got this book in your hand, you must be both excited and anxious about NoSQL. In this chapter, we get a head-start on:
What NoSQL is
What NoSQL is not
Why NoSQL
A list of NoSQL databases
For decades, relational databases have been used to store what we know as structured data. The data is sub-divided into groups, referred to as tables. The tables store well-defined units of data in terms of type, size, and other constraints. Each unit of data is known as a column, while each unit of the group is known as a row. The columns may have relationships defined across themselves, for example parent-child, and hence the name relational databases. And because consistency is one of the critical factors, scaling horizontally is a challenging task, if not impossible.
About a decade ago, with the rise of large web applications, research poured into handling data at scale. One of the outcomes of this research is the non-relational database, generally referred to as a NoSQL database. One of the main problems that a NoSQL database solves is scale, among others.
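To make the vocabulary concrete, here is a minimal sketch using Python's built-in sqlite3 module. The author and book tables, their columns, and the sample rows are assumptions made up for illustration. The sketch shows typed columns, a parent-child relationship, and the database rejecting a row that would break that relationship.

```python
import sqlite3

# In-memory database; table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
# SQLite enforces foreign keys only when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")

# Parent table: each column has a declared type and constraint.
conn.execute("""
    CREATE TABLE author (
        id   INTEGER PRIMARY KEY,
        name TEXT NOT NULL
    )
""")
# Child table: author_id must point at an existing author row.
conn.execute("""
    CREATE TABLE book (
        id        INTEGER PRIMARY KEY,
        title     TEXT NOT NULL,
        author_id INTEGER NOT NULL REFERENCES author(id)
    )
""")

conn.execute("INSERT INTO author (id, name) VALUES (1, 'Jane Doe')")
conn.execute("INSERT INTO book (id, title, author_id) VALUES (10, 'A First Book', 1)")


def insert_orphan_book():
    """Try to insert a book whose author does not exist; return True if rejected."""
    try:
        conn.execute("INSERT INTO book (id, title, author_id) VALUES (11, 'Orphan', 99)")
    except sqlite3.IntegrityError:
        return True
    return False


# Collate data across the two tables through the relationship.
rows = conn.execute(
    "SELECT book.title, author.name FROM book JOIN author ON book.author_id = author.id"
).fetchall()
```

The consistency that the database enforces here on every write is the property that becomes hard to keep once the data is spread across many machines.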
According to Wikipedia:
In computing, NoSQL (mostly interpreted as "not only SQL") is a broad class of database management systems identified by its non-adherence to the widely used relational database management system model; that is, NoSQL databases are not primarily built on tables, and as a result, generally do not use SQL for data manipulation.
The NoSQL movement began in the early years of the 21st century when the world started its deep focus on creating web-scale databases. By web-scale, I mean the scale needed to cater to hundreds of millions of users, now growing to billions of connected devices including, but not limited to, mobiles, smartphones, internet TV, in-car devices, and many more.
Although Wikipedia treats it as "not only SQL", NoSQL originally started off as a simple combination of two words, No and SQL, clearly and completely visible in the new term. No acronym. What it literally means is, "I do not want to use SQL". To elaborate, "I want to access the database without using any SQL syntax". Why? We shall explore this in a while.
Whatever the root phrase, NoSQL today is the term used to refer to the class of databases that do not follow relational database management system (RDBMS) principles, specifically those of an ACID nature, and that are designed to handle the speed and scale of the likes of Google, Facebook, Yahoo, Twitter, and many more.
Before we take a deep dive into it, let us set our context right by exploring some key landmarks in history that led to the birth of NoSQL.
From Inktomi, probably the first true search engine, to Google, the present world leader, computer scientists have long recognized the limitations of the traditional and widely used RDBMS, specifically in scalability, parallelization, and cost. They also noted that the data set is minimally cross-referenced compared to the chunked, transactional data that is mostly fed to an RDBMS.
Specifically, take the case of Google, which gets billions of requests a month across applications that may be totally unrelated in what they do but related in how they deliver. The problem of scalability has to be solved at each layer, right from data access to final delivery. Google, therefore, had to work innovatively and gave birth to a new computing ecosystem.
The systems in this ecosystem were initially described in papers released from 2003 to 2006, listed as follows:
Google File System, 2003: http://research.google.com/archive/gfs.html
Chubby, 2006: http://research.google.com/archive/chubby.html
MapReduce, 2004: http://research.google.com/archive/mapreduce.html
Bigtable, 2006: http://research.google.com/archive/bigtable.html
These and other papers led to a spike in activity, especially in open source, around large-scale distributed computing, and some of the most amazing products were born. Some of the initial products included:
Lucene: Java-based indexing and search engine (http://lucene.apache.org)
Hadoop: For reliable, scalable, distributed computing (http://hadoop.apache.org)
Cassandra: Scalable, multi-master database with no single point of failure (http://cassandra.apache.org)
ZooKeeper: High performance coordination service for distributed applications (http://zookeeper.apache.org)
Pig: High level dataflow language and execution framework for parallel computation (http://pig.apache.org)
Now that we have a fair idea of how this side of the world evolved, let us examine what NoSQL is and what it is not.
NoSQL is a generic term used to refer to any data store that does not follow the traditional RDBMS model. Specifically, the data is non-relational and it does not use SQL as the query language. The term refers to databases that attempt to solve the problems of scalability and availability, often at the expense of atomicity or consistency.
NoSQL is not a database. It is not even a type of database. In fact, it is a term used to filter out (read reject) a set of databases out of the ecosystem. There are several distinct family trees available. In Chapter 4, Advantages and Drawbacks, we explore various types of data models (or simply, database types) available under this umbrella.
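Traditional RDBMS applications have focused on ACID transactions:
Atomicity: Either all the operations in a transaction complete, or none of them do.
Consistency: A transaction takes the database from one valid state to another.
Isolation: Concurrent transactions do not interfere with one another.
Durability: Once a transaction is committed, its changes survive failures.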
Howsoever indispensable these qualities may seem, they are quite incompatible with availability and performance in web-scale applications. For example, if a company like Amazon were to use a system like this, imagine how slow it would be. If I proceed to buy a book and a transaction is on, it will lock a part of the database, specifically the inventory, and every other person in the world will have to wait until I complete my transaction. This just doesn’t work!
Amazon may use cached data or even unlocked records resulting in inconsistency. In an extreme case, you and I may end up buying the last copy of a book in the store with one of us finally receiving an apology mail. (Well, Amazon definitely has a much better system than this).
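The last-copy scenario can be sketched in a few lines of Python. This is a toy model under stated assumptions, not a description of how Amazon or any real store works: the inventory dictionary, the checkout function, and the buyer names are all hypothetical. Every buyer reads the same unlocked snapshot, so nobody waits, and the conflict surfaces only when orders are reconciled later.

```python
# Hypothetical in-memory inventory; all names are illustrative.
inventory = {"book-123": 1}  # authoritative stock: one copy left


def read_cached_stock(cache, item):
    """Read stock from a cache snapshot without taking any lock."""
    return cache[item]


def checkout(buyers, item):
    """Accept orders against a stale snapshot, then reconcile with real stock."""
    # Every buyer sees the same snapshot, taken before anyone has paid.
    snapshot = dict(inventory)
    accepted = [b for b in buyers if read_cached_stock(snapshot, item) > 0]

    # Later, orders are reconciled against the real stock, first come first served.
    fulfilled, apologies = [], []
    for buyer in accepted:
        if inventory[item] > 0:
            inventory[item] -= 1
            fulfilled.append(buyer)
        else:
            apologies.append(buyer)
    return fulfilled, apologies


fulfilled, apologies = checkout(["you", "me"], "book-123")
```

Both orders are accepted up front, one is fulfilled, and the other ends up in the apologies list. That is the trade being made: no waiting on a lock, in exchange for an occasional inconsistency that has to be repaired afterwards.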
The point I am trying to make here is that we may have to look beyond ACID to something called BASE, coined by Eric Brewer:
Basic availability: Each request is guaranteed a response—successful or failed execution.
Soft state: The state of the system may change over time, at times without any input (for eventual consistency).
Eventual consistency: The database may be momentarily inconsistent but will be consistent eventually.
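The three properties above can be illustrated with a toy simulation. This is a minimal sketch, assuming a made-up Cluster of in-memory Replica objects with last-writer-wins versioning. Real NoSQL products use far more elaborate replication, and none of these class names come from an actual library.

```python
class Replica:
    """A toy replica holding key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def read(self, key):
        # Always answers, even if the answer is stale or missing.
        entry = self.data.get(key)
        return entry[1] if entry else None

    def apply(self, key, version, value):
        # Keep the write with the highest version (last-writer-wins).
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)


class Cluster:
    """A toy cluster that acknowledges writes before replicating them."""

    def __init__(self, n):
        self.replicas = [Replica() for _ in range(n)]
        self.pending = []  # writes not yet propagated to the other replicas
        self.version = 0

    def write(self, key, value):
        # Acknowledge after updating a single replica; queue the rest.
        self.version += 1
        self.replicas[0].apply(key, self.version, value)
        self.pending.append((key, self.version, value))

    def sync(self):
        # Background propagation: replicas change state with no new client input.
        for key, version, value in self.pending:
            for replica in self.replicas[1:]:
                replica.apply(key, version, value)
        self.pending.clear()


cluster = Cluster(3)
cluster.write("stock", 5)
before = [r.read("stock") for r in cluster.replicas]  # momentarily inconsistent
cluster.sync()
after = [r.read("stock") for r in cluster.replicas]   # consistent eventually
```

Before the sync, only the first replica knows the new value while the others still answer with what they have. After the sync, all replicas agree.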
Eric Brewer also noted that it is impossible for a distributed computer system to provide consistency, availability and partition tolerance simultaneously. This is more commonly referred to as the CAP theorem.
Note, however, that in cases like stock exchanges or banking where transactions are critical, cached or stale data will just not work. So, NoSQL is definitely not a solution to all database-related problems.
Looking at what we have explored so far, does it mean that we should look at NoSQL only when we start reaching the problems of scale? No.
NoSQL databases have a lot more to offer than just solving the problems of scale, as follows:
Schemaless data representation: Almost all NoSQL implementations offer schemaless data representation. This means that you don’t have to think too far ahead to define a structure and you can continue to evolve over time—including adding new fields or even nesting the data, for example, in case of JSON representation.
Development time: I have heard stories about reduced development time because one doesn’t have to deal with complex SQL queries. Do you remember the JOIN query that you wrote to collate the data across multiple tables to create your final view?
Speed: Even with the small amount of data that you have, if you can deliver in milliseconds rather than hundreds of milliseconds, especially over mobile and other intermittently connected devices, you have a much higher probability of winning users over.
Plan ahead for scalability: You read it right. Why fall into the ditch and then try to get out of it? Why not just plan ahead so that you never fall into one? In other words, your application can be quite elastic: it can handle sudden spikes of load. Of course, you win users over straightaway.
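The first two points in the list above, schemaless representation and life without JOIN, can be sketched with plain Python dictionaries standing in for JSON documents. The collection list, the field names, and the sample users are assumptions made up for illustration, and no particular NoSQL product's API is implied.

```python
import json

# A plain list stands in for a document collection; no schema is declared.
collection = []

# The first version of the application stores only a name and an email.
collection.append({"name": "Asha", "email": "asha@example.com"})

# A later version adds a new field and nests related data in the same document,
# so the orders come back with the user, with no join across tables.
collection.append({
    "name": "Ravi",
    "email": "ravi@example.com",
    "phone": "555-0100",
    "orders": [
        {"item": "book", "qty": 1},
        {"item": "pen", "qty": 3},
    ],
})


def total_items(doc):
    """Sum quantities; documents without orders simply contribute zero."""
    return sum(order["qty"] for order in doc.get("orders", []))


totals = {doc["name"]: total_items(doc) for doc in collection}

# Documents of both shapes serialize to JSON side by side.
serialized = json.dumps(collection)
```

The older document is never migrated. The reading code tolerates the missing field instead, which is the usual price of not declaring a structure up front.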
The buzz around NoSQL still hasn’t reached its peak, at least to date. We see more offerings in the market over time. The following table is a list of some of the more mature, popular, and powerful NoSQL databases segregated by data model used:
| Document | Key-Value | XML | Column | Graph |
|---|---|---|---|---|
| MongoDB | Redis | BaseX | BigTable | Neo4J |
| CouchDB | Membase | eXist | Hadoop / HBase | FlockDB |
| RavenDB | Voldemort | | Cassandra | InfiniteGraph |
| Terrastore | MemcacheDB | | SimpleDB | |
| | | | Cloudera | |
This list is by no means comprehensive, nor does it claim to be. One of the positive points about this list is that most of the databases in the list are open source and community driven.
Chapter 4, Advantages and Drawbacks, provides an in-depth study of the various popular data models used in NoSQL databases.
Chapter 6, Case Study, does an exhaustive comparison of some of these databases along various key parameters including, but not limited to, data model, language, performance, license, price, community, resources, extensibility, and many more.
In this chapter, we learned about the fundamentals of NoSQL: what it is all about and, more critically, what it is not. We took a dip into history to appreciate the reasons and science behind it. I recommend exploring the web for historical events around this to appreciate it more deeply.
NoSQL is not a solution for each and every application. It is worth noting that most of the products do throw away the traditional ACID nature, giving way to BASE infrastructure. Having said that, some products stand out: CouchDB and Neo4j, for example, are ACID-compliant NoSQL databases.
Adopting NoSQL is not only a technological change but also a change in mindset, behaviour, and thought process. If you plan to hire a developer to work with NoSQL, he or she must understand the new models.
In the next chapter, we will have a quick look at the taxonomy and build up our vocabulary before we dive deeply into NoSQL.