Learning Couchbase

Chapter 1. Introduction to Couchbase

This chapter introduces a type of database technology called NoSQL. You too are a contributor to the evolution of this technology. Surprised? You probably have a Facebook account, upload pictures, and use messenger services such as WeChat and WhatsApp, right? The data in these services is generated at a fast rate and in huge amounts (terabytes per day), and it varies in format and structure. We usually use the term big data for such data. Data at this scale can't be handled efficiently by a traditional relational database management system, so a new approach was needed. This is how NoSQL came into existence. This chapter will introduce you to NoSQL and its fundamentals. Next, you will be introduced to one of the fastest NoSQL databases in the world, called Couchbase. Yes, you read that correctly! It is so fast because all of the data is, by default, cached in RAM (volatile memory), and the most interesting part is that you don't need to do any configuration to cache the data. Couchbase Server takes care of everything. Following this, you will learn to install Couchbase Server in Windows and Linux environments. Finally, this chapter will introduce you to the various logs and configuration folders.

In this chapter, we will cover the following topics:

What is NoSQL and why do we need it?
Couchbase architecture
Concepts of Couchbase

What is NoSQL and why do we need it?

It's always a challenge to introduce a new technology, especially when it changes fundamentals that have been taught for so long. NoSQL is one such technology. However, it is easy to comprehend once we understand the rationale behind it. So, before we define NoSQL, let's first look at the systems it departs from and understand why it is needed.

We are all aware of and use Relational Database Management Systems (RDBMS). An RDBMS is a database management system based on the relational model invented by E. F. Codd, with features such as normalization, joins, and foreign keys. Examples include MySQL, Oracle, and DB2. An RDBMS provides transactions, table joins, locking mechanisms, ACID properties, and so on. However, there are some limitations to RDBMS, predominantly in terms of scalability and readiness for schema changes.

Note

ACID stands for Atomicity, Consistency, Isolation, and Durability. These are properties that are essential for supporting transactions in any database system. In order to guarantee a meaningful and successful transaction, the system has to support all of these properties:

Atomicity: The operation will be performed as a single unit
Consistency: All the operations will ensure a valid state and consistency of data at the end of the transaction
Isolation: No two transactions will interfere with each other
Durability: The transaction will survive system failures
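To make atomicity concrete, here is a minimal sketch using Python's built-in sqlite3 module. SQLite is chosen purely for illustration because it ships with Python; the accounts table, the account names, and the transfer function are hypothetical. A transfer that fails halfway is rolled back as a single unit, so no partial update survives.

```python
import sqlite3

def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst atomically; roll back on any failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
            (balance,) = conn.execute(
                "SELECT balance FROM accounts WHERE name = ?", (src,)
            ).fetchone()
            if balance < 0:
                raise ValueError("insufficient funds")
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
        return True
    except ValueError:
        return False

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()

ok = transfer(conn, "alice", "bob", 30)       # succeeds and is committed
failed = transfer(conn, "alice", "bob", 500)  # fails; the debit is rolled back
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer debits the source account before the check fails, yet the final balances show no trace of it, which is exactly what atomicity promises.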

In order to get more clarity, let's look at a scenario. Your organization has recently launched an e-commerce application and you are the technical architect. Everything has been going smoothly and everyone, including your boss, is happy with the outcome. However, after a couple of months, you start getting complaints from the business team that the application is not performing well. After some investigation, you realize that the consumer base has grown, and hence user traffic has increased. The application server and the infrastructure are not able to handle such an increase in traffic. So what will you do? Think about it. If you are like most other architects, your initial measures would be to scale the application servers by introducing multiple servers behind a load balancer, or to increase system resources such as RAM and CPU. After you take these steps, the application seems to show some improvement.

But after a couple of weeks comes a realization that the same improvement needs to be done at the database server too. So, what can be done? You have two options:

Vertical scaling
Horizontal scaling

The first is vertical scaling, wherein you increase the hardware resources in terms of CPU and RAM. The second is horizontal scaling, wherein you increase the number of server nodes.

However, there is a challenge here: we can't scale the database server horizontally as easily as we do application servers. To scale database servers horizontally, we need a mechanism to distribute data across the servers, balance the load, and so on. The only easy way left is to increase the hardware resources. However, after a certain stage, a physical server can't expand further due to limits on sockets, chips, and so on; for example, a server with four CPU sockets cannot scale up beyond four CPUs. Therefore, we need to find a way to scale out, horizontally, when we anticipate an increase in the number of requests or the load in the database layer. Such a situation is encountered in most content-driven, social networking, and e-commerce sites, where a large number of transactions take place within milliseconds.

Besides this, due to dynamics in business functions, the database schema needs to be changed very frequently, which is very common in agile development. It is difficult to incorporate the changes in RDBMS. Sometimes you need to bring the application down to modify the schema, such as adding one column in a table. In order to address such issues, companies such as Facebook and Google started exploring alternatives to RDBMS for data storage that can scale out and handle changes in schemas seamlessly without any impact on business operations. These are the fundamentals of NoSQL.

So what is NoSQL?

NoSQL is a nonrelational database management system that is different from traditional relational database management systems in significant ways. It is designed for distributed data stores in which there are very large-scale data storage requirements (terabytes and petabytes of data). These types of data storage mechanisms may not require fixed schemas, avoid join operations, and typically scale horizontally.

The main feature of NoSQL is that it is schemaless. There is no fixed schema to store data. Also, there is no join between one or more data records or documents. However, nowadays, most of the NoSQL systems have started providing join features. It allows distributed storage and utilizes computing resources, such as CPU and RAM, spanning across the nodes that are part of the NoSQL cluster.
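To see what schemaless means in practice, consider the following small sketch. The document keys and fields are hypothetical, and a plain Python dictionary stands in for the data store. Two records of the same kind sit side by side even though the second carries fields the first does not have, and no schema change is needed to accommodate it.

```python
import json

# Two hypothetical product documents; the second has extra, nested fields.
documents = {
    "product::1001": {"type": "product", "name": "Notebook", "price": 4.50},
    "product::1002": {
        "type": "product",
        "name": "Laptop",
        "price": 899.00,
        "specs": {"ram_gb": 16, "cpu": "quad-core"},
        "tags": ["electronics", "computers"],
    },
}

def field_names(doc):
    """Return the sorted top-level field names of a document."""
    return sorted(doc)

for key, doc in documents.items():
    # Each document carries its own structure; nothing is declared up front.
    print(key, field_names(doc), json.dumps(doc))
```

In a relational table, adding specs and tags would require an alter table operation or additional tables; here the new fields simply travel with the document.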

There are different types of NoSQL data stores. Let's try to cover the four main categories of NoSQL systems in brief:

Key-value store: A simple data storage system that uses a key to access values. Some examples are Redis, Riak, and DynamoDB.
Use Case: Multiplayer online gaming to manage each player session.
Column family store: A sparse matrix system that uses a row and a column as keys, for example, Apache HBase, Apache Cassandra.
Use Case: Stream massive write loads such as log analysis.
Graph store: This is used for relationship-intensive problems. An example is Neo4j.
Use Case: Complicated graph problems, such as moving from one point to another.
Document store: This is used to store hierarchical data structures directly in the database, for example, MongoDB (10Gen), CouchDB, and Couchbase.
Use Case: Storing structured product information.

Why do we need NoSQL?

Electronic data is generated at rapid speed from a variety of sources, such as social media (Facebook, Google+), web server logs, and e-commerce websites such as Amazon and eBay. Personal user information, social graphs, geolocation data, user-generated content, and machine logging data are just a few examples of areas in which data has been increasing exponentially. Such data is termed big data: it comes in a variety of formats, is generated at a rapid speed, and is large in volume. In order to derive information from big data, large amounts of data have to be processed, for which RDBMS was never designed! NoSQL databases evolved as a way to handle such huge data efficiently.

Most NoSQL databases provide the following benefits:

A flexible data model: You don't need to worry about the schema. You can design your schema around the needs of your application domain rather than storage demands.
Easy scalability: Since it's a distributed system, a NoSQL database can scale out horizontally without too many changes in the application. In some NoSQL systems, such as Couchbase, you can scale out with a few mouse clicks and rebalance the cluster very easily.
High availability: There are multiple servers, and data is replicated across nodes.

Since NoSQL is a distributed database system, you need to know a theorem called CAP to understand it better and to make better decisions about what happens when part of a distributed system fails. Let me explain the CAP theorem to you. It involves three important properties:

Consistency: What comes to your mind when we say consistency in a distributed system? When data is replicated to multiple nodes in a distributed system, it should return the same value or state as any of the other replicated nodes. Generally speaking, the data in all nodes must be consistent with each other.
Availability: Systems should be able to serve client requests all the time, irrespective of the situation. In any distributed system, there are multiple nodes and it is ideal that the failure of a node should not stop the availability of the system. In short, the client should be able to perform read, write, and update operations at all times.
Partition tolerance: In any distributed system, depending on an algorithm such as hashing, data or records are partitioned across the nodes or the servers in the database ecosystem. Failures in replicating or transferring data between cluster nodes should not stop the system from responding to client requests. This feature of providing tolerance when there is a disturbance between nodes is called partition tolerance.

The following is a Venn diagram depicting the CAP theorem:

So, you have understood what the CAP properties signify. The CAP theorem states that a distributed system can provide only two of these three properties at the same time. Depending on the use cases that the system is intended to address, a database system chooses two of the three.
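The trade-off can be illustrated with a toy model. This is only a sketch under simplifying assumptions (two replicas, one link between them, writes entering through replica A); the class and attribute names are hypothetical, and it does not describe how any real database is implemented. When the link between the replicas is cut, an AP-style system keeps accepting writes and lets the replicas diverge, whereas a CP-style system refuses the write to keep them identical.

```python
class Replica:
    """One copy of the data."""
    def __init__(self):
        self.data = {}

class TinyCluster:
    """Two replicas of one key-value store; `mode` is 'AP' or 'CP'."""

    def __init__(self, mode):
        self.mode = mode
        self.a, self.b = Replica(), Replica()
        self.partitioned = False  # True when A and B cannot talk to each other

    def write(self, key, value):
        """Write through replica A, replicating to B when the link is up."""
        if self.partitioned:
            if self.mode == "CP":
                return False          # give up availability to stay consistent
            self.a.data[key] = value  # stay available; replicas now diverge
            return True
        self.a.data[key] = value
        self.b.data[key] = value
        return True

    def read_from_b(self, key):
        return self.b.data.get(key)

ap, cp = TinyCluster("AP"), TinyCluster("CP")
for cluster in (ap, cp):
    cluster.write("cart", "v1")
    cluster.partitioned = True        # the network between A and B fails

ap_accepted = ap.write("cart", "v2")  # accepted, but B still holds "v1"
cp_accepted = cp.write("cart", "v2")  # rejected, so A and B still agree
print(ap_accepted, ap.read_from_b("cart"), cp_accepted, cp.read_from_b("cart"))
```

In the AP cluster a client reading from replica B sees a stale value until the partition heals; in the CP cluster nobody sees stale data, but the write was not served.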

There are a number of database systems available in the IT software market—RDBMS such as MySQL, or NoSQL such as MongoDB, Couchbase, Cassandra, and so on. How do you choose a database system that suits your business requirements? This theorem will help you to decide it. In our context, Couchbase has opted for AP—availability and partition tolerance. So, if your application demands availability and partition tolerance more than consistency, you could opt for Couchbase. However, Couchbase provides a feature called eventual consistency, which will be discussed later in Chapter 6, Retrieving Documents without Keys Using Views. This feature enables the developer to decide the consistency level per operation.

Having understood what NoSQL is all about and why it's a buzzword nowadays, let's try to understand Couchbase, which is the purpose of this book.

Couchbase Server is a persistent, distributed, document-based database that is part of the NoSQL database movement. It combines the capabilities of Apache CouchDB (document orientation and indexing) with those of Membase (an integrated RAM caching layer), enabling it to support very fast operations, such as create, store, update, and retrieve.

Couchbase Server is a leading NoSQL database project that focuses on distributed database technology and the surrounding ecosystems. It supports both key-value and document-oriented use cases. All components are available under the Apache License 2.0. It can be obtained as packaged software in both an enterprise edition, which is rigorously tested and comes with support, and a community edition, which is open source and does not come with support.

Let's cover some of the main features of Couchbase Server here:

Schemaless: You don't need to worry about the database schema when changing your application object. Records can have different structures; there is no fixed schema. It allows changes in a data model for rapid application development easily, without the need to perform expensive alter table operations in the database. In short, it provides a flexible data model with JSON support.
JSON-based document structure: The documents in Couchbase are natively stored as JSON. In a document-based NoSQL database, metadata such as types is stored along with the data, and normally all related information is stored together as a single document. When you build an application, you don't require explicit mapping between application objects and a database schema. Couchbase provides an interface for creating new documents and for viewing and editing them.
Built-in clustering with replication: Couchbase also provides built-in clustering, wherein all nodes in a cluster are equal. Furthermore, it provides data replication with auto-failover.
365-day availability: A Couchbase cluster provides almost zero-downtime maintenance. You can remove a node from the cluster for maintenance and rejoin it to the cluster afterward without suffering any application downtime. High availability of data in the cluster is provided by the replication mechanism.
Cache: By default, all documents are stored in the RAM, and hence provide a built-in managed cache. It provides easy scalability and consistent high performance by adding nodes, thus increasing the RAM in the cluster resources pool.
Web UI: There are simple and easy-to-use admin APIs and UIs provided for smooth administration of the Couchbase cluster. It can also be used to monitor the cluster with ease.
Varieties of SDK: Software development kits for a variety of languages, such as Java, PHP, and so on, are provided to connect to Couchbase Servers.

In a Couchbase cluster, there are a number of nodes and all nodes are equal; the cluster works on the concept of peer-to-peer. You can easily scale the cluster by adding a node to it. Since all the nodes are the same, there is no single point of failure. The cluster ensures that every node manages some active data and some replica data. The data is distributed across the cluster, and hence the load is also uniformly distributed using auto-sharding. The data is divided into chunks and distributed across the nodes automatically.

Note

Auto-sharding is a feature of NoSQL databases that spreads documents across the nodes in a cluster automatically. It remains transparent to the application that consumes the data from the cluster.
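The idea behind auto-sharding can be sketched in a few lines. This is a simplified illustration, not Couchbase's actual algorithm: the CRC32-modulo hash, the partition count of 1024, the round-robin assignment, and the node names are all assumptions made for this example. Each document key is hashed to a partition, and each partition is owned by exactly one node, so the application never decides where a document lives.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 1024  # assumed partition count for this sketch

def partition_for(key, num_partitions=NUM_PARTITIONS):
    """Hash a document key to a partition number (simplified scheme)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def build_partition_map(nodes, num_partitions=NUM_PARTITIONS):
    """Assign partitions to nodes round-robin so each owns an equal share."""
    return {p: nodes[p % len(nodes)] for p in range(num_partitions)}

def node_for(key, partition_map):
    """Look up the node that owns the partition a key hashes to."""
    return partition_map[partition_for(key, len(partition_map))]

nodes = ["node-a", "node-b", "node-c"]  # hypothetical cluster members
partition_map = build_partition_map(nodes)
counts = Counter(partition_map.values())

print(node_for("user::42", partition_map))  # the same key always maps to the same node
print(dict(counts))                         # partitions per node are nearly equal
```

Because the mapping depends only on the key and the partition map, any client holding the same map reaches the same node without asking a central coordinator.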

The architecture of Couchbase

Couchbase clusters consist of multiple nodes. A cluster is a collection of one or more instances of Couchbase server that are configured as a logical cluster. The following is a Couchbase server architecture diagram:

As mentioned earlier, while most of the cluster technologies work on master-slave relationships, Couchbase works on a peer-to-peer node mechanism. This means there is no difference between the nodes in the cluster. The functionality provided by each node is the same. Thus, there is no single point of failure. When there is a failure of one node, another node takes up its responsibility, thus providing high availability.

Data manager

Any operation performed on the Couchbase database system is first stored in memory, which acts as a caching layer. By default, every document is kept in memory on each read, insert, update, and so on, until the memory is full. This makes Couchbase a drop-in replacement for Memcached. However, in order to make records persistent, there is a concept called the disk queue, which flushes records to disk asynchronously, without impacting client requests. This functionality is provided automatically by the data manager, without any human intervention.
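The memory-first, persist-later behavior described above can be sketched as follows. This is a toy model under stated assumptions, not Couchbase's implementation: the WriteBehindStore class is hypothetical, a dictionary stands in for the disk, and a single background thread plays the role of the flusher that drains the disk queue.

```python
import queue
import threading

class WriteBehindStore:
    """Toy store: writes hit memory first and are persisted asynchronously."""

    def __init__(self):
        self.memory = {}                  # caching layer
        self.disk = {}                    # stands in for on-disk storage
        self._disk_queue = queue.Queue()  # pending persistence work
        worker = threading.Thread(target=self._flusher, daemon=True)
        worker.start()

    def set(self, key, value):
        self.memory[key] = value            # the client sees the write at once
        self._disk_queue.put((key, value))  # persistence happens later

    def get(self, key):
        return self.memory.get(key)         # reads are served from memory

    def _flusher(self):
        """Drain the disk queue in the background, one item at a time."""
        while True:
            key, value = self._disk_queue.get()
            self.disk[key] = value
            self._disk_queue.task_done()

    def wait_until_persisted(self):
        """Block until every queued write has reached the disk."""
        self._disk_queue.join()

store = WriteBehindStore()
store.set("user::1", {"name": "Asha"})
cached = store.get("user::1")   # available immediately from memory
store.wait_until_persisted()    # only needed here to make the demo deterministic
print(cached, store.disk)
```

The client call returns as soon as memory is updated, which is why the disk write does not add to request latency.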

Cluster management

The cluster manager is responsible for node administration and node monitoring within a cluster. Every node within a Couchbase cluster includes the cluster manager component, data storage, and the data manager. The data manager manages data storage and retrieval; it contains the memory cache layer, the disk persistence mechanism, and the query engine.

Couchbase clients use the cluster map provided by the cluster manager to find out which node holds the required data, and then communicate with the data manager on that node to perform database operations.

Cluster management is developed in the Erlang language. Erlang provides a dynamic type system and built-in support for concurrent processes, which are isolated from one another and very lightweight. A single Erlang VM can run a quarter of a million processes. It also provides many modules that help with distributed processing. Moreover, debugging and patching live systems is easier in Erlang than in most other languages.

Concepts of Couchbase

Let's take a look at some of the concepts of Couchbase next in this section.

Buckets

In RDBMS, we usually encapsulate all relevant data for a particular application in a database namespace. Say, for example, we are developing an e-commerce application. We usually create a database named e-commerce, which is used as the logical namespace to store records in tables, such as customer or shopping cart details. The equivalent namespace is called a bucket in Couchbase terminology. So, whenever you want to store any document in a Couchbase cluster, you will first create a bucket as a logical namespace. A bucket is an independent virtual container that groups documents logically in a Couchbase cluster, and it is equivalent to a database namespace in RDBMS. It can be accessed by various clients in an application. You can also configure features such as security, replication, and so on per bucket. In RDBMS development, we usually create one database and consolidate all related tables in that namespace. Likewise, in Couchbase, you will usually create one bucket per application and encapsulate all the documents in it.

Views

Views enable indexing and querying by looking inside JSON documents for a key, for ranges of keys, or to aggregate data.

Views are created using incremental MapReduce, which powers indexing. We will discuss this in detail in Chapter 6, Retrieving Documents without Keys Using Views. You can build complex views for your data using the map reduce feature. Views enable us to define materialized views on JSON documents and then query across the dataset.

Note

A materialized view is a database object that contains the result of MapReduce.

Using views, you can define primary, simple secondary (the most common use case), complex secondary, tertiary, and composite indexes, as well as aggregations (reductions). Views are built with MapReduce, and the MapReduce functions are written in JavaScript. You will learn more about MapReduce and views in Chapter 6, Retrieving Documents without Keys Using Views.
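Before we reach Chapter 6, the following sketch may help build intuition for how a map function and a reduce function turn documents into an index. It is a Python emulation for illustration only: real Couchbase views are written in JavaScript and are updated incrementally, whereas this version recomputes everything in one pass. The sample documents and the function names are hypothetical.

```python
from collections import defaultdict

# Hypothetical JSON documents of mixed types, as they might sit in one bucket.
docs = [
    {"type": "order", "customer": "alice", "total": 30},
    {"type": "order", "customer": "bob", "total": 20},
    {"type": "order", "customer": "alice", "total": 15},
    {"type": "customer", "name": "alice"},
]

def map_fn(doc):
    """Emit (key, value) pairs only for the documents the index cares about."""
    if doc.get("type") == "order":
        yield doc["customer"], doc["total"]

def reduce_fn(values):
    """Aggregate all values emitted under one key."""
    return sum(values)

def build_view(documents):
    """Run map over every document, group by key, then reduce each group."""
    grouped = defaultdict(list)
    for doc in documents:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return {key: reduce_fn(values) for key, values in sorted(grouped.items())}

view = build_view(docs)
print(view)
```

The map step looks inside each document and decides what to index, and the reduce step computes the aggregate per key; the stored result is what the earlier note calls a materialized view.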

Cross Data Center Replication

Cross Data Center Replication (XDCR) is the mechanism provided by Couchbase to replicate documents from one cluster to another. Most of the time, data is replicated across clusters that are geographically spread out. Usually, you configure XDCR when you want to replicate data to a separate data center for disaster recovery, or to improve performance by keeping data local to the application. XDCR is configured on a per-bucket (per-database) basis. When you configure replication across two clusters, say Cluster A located in Imphal (India) and Cluster B in Mumbai (India), which are 2,500 km away from each other, you can choose to replicate only from Cluster A to Cluster B (unidirectional) or in both directions (bidirectional). With bidirectional replication, clients can read from and write to both clusters. Lastly, remember that this is different from intracluster replication, which occurs within a cluster.

Note

In case of intracluster replication, documents are replicated as a replica for a failover in the other nodes of the same cluster. More details about XDCR will be discussed in Chapter 9, Data Replication and Compaction.

Installation on Windows and Linux environments

Enough of concepts! Let's install Couchbase so that we can get some hands-on experience. We will install in a Windows environment first. You can download the software from www.couchbase.com.

You can download the enterprise or community edition. We will use the Couchbase 64-bit enterprise edition.

To start the installation wizard, double-click on couchbase-server-enterprise_x86_64_3.0.0.setup.exe. Then, you will see the following window:

Click on Next. The next screen will allow you to select the folder where you want to install Couchbase. In my case, I chose the default location. You can change the directory location if you want; click on the Browse button for that, as shown here:

After selecting the installation folder, you can click on Install, as shown in the following screenshot:

After this, Couchbase will be installed on your system. You can see the progress of the installation in the following screenshot:

It will take some time to install. After a few minutes, you will see the following success screen. Congratulations!!

You need to change some settings before you can use Couchbase. You can access the admin console with http://localhost:8091/index.html. The default port is 8091. You can change this if required.

The Bucket set up

Click on Setup and you will be able to configure Couchbase for your environment:

Server configuration

You can select the location of the databases and indexes with this option. Select the options shown in the preceding screenshot. Since it's the first node we are installing and there is no existing cluster, select Start a new cluster and specify 801 MB RAM, which needs to be allocated to the Couchbase cluster by each node. Then, click on Next:

Sample bucket

Select both the samples provided along with the Couchbase software. After that, click on Next. On the subsequent page, create a default bucket with the following options:

RAM: 601 MB
Replicas: 1

Default bucket

We will explain all these options of bucket settings when we discuss buckets in Chapter 3, Storing Documents in Couchbase Using Buckets. Keep clicking on the Next button until you get the following screen:

Admin credentials

Enter the password, which is root123. It will be used to connect to the Couchbase admin console. Click on Next to finish.

Your Couchbase Server is now running and ready to use!

Couchbase installation on Red Hat, CentOS, and others

You can use the rpm command to install Couchbase on Red Hat or CentOS, as follows. Replace the version with the version number of the package downloaded:

# rpm -ivh couchbase-server-{version}.rpm

For Ubuntu and Debian, use the following:

# dpkg -i couchbase-server{version}.deb

After the installation is complete, Couchbase starts automatically. You can perform the initial setup by going to http://localhost:8091.
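If you want to confirm from a script that the server has come up, a quick reachability check against the admin console port is enough. The following is a minimal sketch: the console_is_up function is a hypothetical helper, and it only verifies that something answers HTTP requests at the given URL, not that the cluster has been configured.

```python
import http.client
import urllib.error
import urllib.request

# Bypass any system HTTP proxy: the console is expected on the local machine.
_opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))

def console_is_up(url="http://localhost:8091/", timeout=2.0):
    """Return True if an HTTP server answers at `url`, False otherwise."""
    try:
        with _opener.open(url, timeout=timeout):
            return True
    except urllib.error.HTTPError:
        return True   # the server answered, even if with an error status
    except (OSError, http.client.HTTPException):
        return False  # nothing listening, host unreachable, or not HTTP

if __name__ == "__main__":
    print("admin console reachable:", console_is_up())
```

If you changed the default port during setup, pass the adjusted URL as the first argument.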

Startup and shutdown

Couchbase gives you the ability to start up and shut down your cluster. Let's take a look at how to achieve this on Linux and Windows.

On Linux

You can start and stop a Couchbase cluster using the following commands:

# /etc/init.d/couchbase-server start
# /etc/init.d/couchbase-server stop

This assumes that you are executing the preceding commands as the root user. On some operating systems, you need to prefix them with the sudo command.

On Windows

You can start and stop a cluster using the scripts provided in the following installation folders:

C:\Program Files\Couchbase\Server\bin\service_start.bat
C:\Program Files\Couchbase\Server\bin\service_stop.bat

Understanding log and configuration files

Couchbase Server creates a number of different log files, depending on the component of the system that produced the message and on the level and severity of the problem being reported. All these logs are created in a single folder. Whenever there is an issue, you can go to the following paths and check the respective logs:

In Windows:
C:\Program Files\Couchbase\Server\var\lib\couchbase\logs
In Linux:
/opt/couchbase/var/lib/couchbase/logs
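When you script log collection across machines, it helps to resolve the folder per platform. The sketch below uses only the two default paths listed above; the helper names couchbase_log_dir and log_file are hypothetical, and a custom installation location would need a different path.

```python
import platform
from pathlib import PurePosixPath, PureWindowsPath

def couchbase_log_dir(system=None):
    """Return the default Couchbase log folder for the given OS name."""
    system = system or platform.system()
    if system == "Windows":
        return PureWindowsPath(
            r"C:\Program Files\Couchbase\Server\var\lib\couchbase\logs"
        )
    return PurePosixPath("/opt/couchbase/var/lib/couchbase/logs")

def log_file(name, system=None):
    """Build the full path of one log file inside the log folder."""
    return couchbase_log_dir(system) / name

print(log_file("reports.log", "Linux"))
print(log_file("reports.log", "Windows"))
```

Pure path classes are used so that the Windows path can be constructed and printed on a Linux machine, and the other way around.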

Some of the logs that need to be looked into when there is an issue are discussed next.

debug

You can find debug-level error messages related to the core server management subsystem. This log does not contain information included in the couchdb, xdcr, and stats logs.

info

You can observe information-level error messages related to the core server management subsystem. This log does not contain information included in the couchdb, xdcr, and stats logs.

error

Any error-level messages for all subsystems of Couchbase, excluding XDCR-related errors, are logged in this file.

mapreduce_errors

Errors pertaining to JavaScript and other view-processing errors are reported in the mapreduce_errors file.

reports.log

This file logs only the progress reports and crash reports for Erlang processes. Erlang, the language in which the cluster management is developed, uses lightweight processes and provides built-in support for concurrency, an important requirement for distributed systems.

Mobile development with Couchbase Lite

Couchbase Lite is an embedded JSON database that can work as a standalone server, in a P2P network, or as a remote endpoint for Couchbase Server. It provides native APIs for the iOS and Android platforms. It supports replication with compatible database servers. It also supports low-latency and offline access to data.

The Sync Gateway enables Couchbase Server 2.0 and later to act as a replication endpoint for Couchbase Lite. It runs an HTTP listener process that provides a passive replication endpoint and uses a Couchbase Server bucket as persistent storage for all database documents.