In this article, Alok Shrivastwa and Sunil Sarat, authors of the book OpenStack Trove Essentials, explain how OpenStack Trove is a treasure, or collection of valuable things, especially for open source lovers like us, which makes it an apt name for the Database as a Service (DBaaS) component of OpenStack. We shall see why this component shows potential and is on its way to becoming one of the crucial components in the OpenStack world.
In this article, we will cover the following:
- DBaaS and its advantages
- An introduction to OpenStack's Trove project and its components
Database as a Service
Data is a key component in today's world, and what would applications do without data? Data is very critical, especially in the case of businesses such as the financial sector, social media, e-commerce, healthcare, and streaming media. Storing and retrieving data in a manageable way is absolutely key. Databases, as we all know, have been helping us manage data for quite some time now.
Databases form an integral part of any application. The data-handling needs of different types of applications also differ, which has led to an increase in the number of database types. As the overall complexity increases, it becomes increasingly difficult for database administrators (DBAs) to manage them.
DBaaS is a cloud-based, service-oriented approach to offering databases on demand for storing and managing data. It offers a flexible and scalable platform oriented towards self-service and easy management, particularly in provisioning a business's environment with a database of choice in a few clicks and a few minutes, rather than waiting days or, in some cases, weeks.
The fundamental building block of any DBaaS is that it will be deployed over a cloud platform, be it public (AWS, Azure, and so on) or private (VMware, OpenStack, and so on). In our case, we are looking at a private cloud running OpenStack. So, to the extent necessary, you might come across references to OpenStack and its other services, on which Trove depends.
XaaS (short for Anything/Everything as a Service, of which DBaaS is one such service) is fast gaining momentum. In the cloud world, everything is offered as a service, be it infrastructure, software, or, in this case, databases. Amazon Web Services (AWS) offers various services around this: the Relational Database Service (RDS) for the RDBMS (short for relational database management system) kind of system; SimpleDB and DynamoDB for NoSQL databases; and Redshift for data warehousing needs.
The OpenStack world was not untouched by the growing demand for DBaaS, not just from users but also from DBAs. As a result, Trove made its debut with the OpenStack Icehouse release in April 2014 and has since been one of the most popular advanced services of OpenStack.
It supports several SQL and NoSQL databases and provides the full life cycle management of the databases.
Now, you must be wondering why we must even consider DBaaS over traditional database management strategies. Here are a few points you might want to consider that might make it worth your time.
Reduced database management costs
In any organization, much of the DBAs' time is spent on mundane tasks such as creating databases and instances, often manually or with a bunch of scripts that need to be fired by hand. This leaves them unable to concentrate on tasks such as fine-tuning SQL queries so that applications run faster, and it wastes both developers' and DBAs' time. A DBaaS can reduce this waste significantly.
Faster provisioning and standardization
With DBaaS, databases that are provisioned by the system will be compliant with standards as there is very little human intervention involved. This is especially helpful in the case of heavily regulated industries. As an example, let's look at members of the healthcare industry. They are bound by regulations such as HIPAA (short for Health Insurance Portability and Accountability Act of 1996), which enforces certain controls on how data is to be stored and managed. Given this scenario, DBaaS makes the database provisioning process easy and compliant: the process needs to be qualified only once, and every database coming out of the automated provisioning system is then compliant with the standards or controls set.
Since DBaaS is cloud based and therefore heavily automated, administration becomes easier as well. Some important administration tasks are backup/recovery and software upgrade/downgrade management. As an example, with most databases, we should be able to push configuration modifications within minutes to all the database instances that have been spun up by the DBaaS system. This ensures that any new standards can easily be implemented.
Scaling and efficiency
Scaling (up or down) becomes much easier, which reduces the resource hogging that developers resorted to when planning for a rainy day that, in most cases, never came. With DBaaS, since you don't commit resources upfront and only scale up or down as and when necessary, resource utilization is highly efficient.
These are some of the advantages available to organizations that use DBaaS. Some of the concerns and roadblocks for organizations in adopting DBaaS, especially in a public cloud model, are as follows:
- Companies don't want to have sensitive data leave their premises.
- Database access and speed are key to application performance. Not being able to manage the underlying infrastructure inhibits some organizations from going to a DBaaS model.
In contrast to public cloud-based DBaaS, concerns regarding data security, performance, and visibility reduce significantly in the case of private DBaaS systems such as Trove. In addition, the benefits of a cloud environment are not lost either.
OpenStack Trove, which was originally called Red Dwarf, is a project that was initiated by HP, and many others contributed to it later on, including Rackspace. The project was in incubation till the Havana release of OpenStack.
It was formally introduced in the Icehouse release in April 2014, and its mission is to provide scalable and reliable cloud DBaaS provisioning functionality for relational and non-relational database engines.
As of the Liberty release, Trove is considered a big-tent service.
Big-tent is a new approach that allows projects to enter the OpenStack code namespace. In order for a service to be a big-tent service, it only needs to follow some basic rules, which are listed here. This allows the projects to have access to the shared teams in OpenStack, such as the infrastructure teams, release management teams, and documentation teams. The project should:
- Align with the OpenStack mission
- Subject itself to the rulings of the OpenStack Technical Committee
- Support Keystone authentication
- Be completely open source and open community based
At the time of writing the article, the adoption and maturity levels are as shown here:
The previous diagram shows that the Age of the project is just 2 YRS and it has a 27% Adoption rate, meaning 27 of 100 people running OpenStack also run Trove.
The maturity index is 1 on a scale of 1 to 5. It is derived from the following five aspects:
- The presence of an installation guide
- Whether the adoption percentage is greater than or less than 75
- Stable branches of the project
- Whether it supports seven or more SDKs
- Corporate diversity in the team working on the project
Without further ado, let's take a look at the architecture that Trove implements in order to provide DBaaS.
The Trove project uses some shared components and some dedicated project-related components, as described in the following subsections.
The Trove system shares two components with the other OpenStack projects: the backend database (MySQL/MariaDB) and the message bus.
The message bus
The AMQP (short for Advanced Message Queuing Protocol) message bus brokers the interactions between the task manager, API, guest agent, and conductor. This component ensures that Trove can be installed and configured as a distributed system.
MySQL or MariaDB is used by Trove to store the state of the system.
The API component is responsible for providing the RESTful API with JSON and XML support. It can be called the face of Trove to the external world, since all the other components talk to Trove through it. It talks to the task manager for complex tasks, but it can also talk to the guest agent directly to perform simple tasks, such as retrieving users.
The task manager
The task manager is the engine responsible for doing the majority of the work. It is responsible for provisioning instances, managing the life cycle, and performing different operations. The task manager normally sends common commands, which are of an abstract nature; it is the responsibility of the guest agent to read them and issue database-specific commands in order to execute them.
The guest agent
The guest agent runs inside the Nova instances that are used to run the database engines. The agent listens to the messaging bus for the topic and is responsible for actually translating and executing the commands that are sent to it by the task manager component for the particular datastore.
Let's also look at the different types of guest agents that are required depending on the database engine that needs to be supported. The different guest agents (for example, the MySQL and PostgreSQL guest agents) may even have different capabilities depending on what is supported on the particular database. This way, different datastores with different capabilities can be supported, and the system is kept extensible.
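The split between abstract commands and datastore-specific execution can be sketched in a few lines of Python. This is only an illustration, not Trove's actual guest agent code: the class names, the `handle` function, and the message format are hypothetical, and the SQL strings are indicative.

```python
class MySQLGuestAgent:
    """Hypothetical agent: turns abstract commands into MySQL statements."""

    def list_users(self):
        return "SELECT User, Host FROM mysql.user;"

    def create_database(self, name):
        return f"CREATE DATABASE `{name}`;"


class PostgreSQLGuestAgent:
    """Hypothetical agent: same abstract commands, PostgreSQL statements."""

    def list_users(self):
        return "SELECT usename FROM pg_catalog.pg_user;"

    def create_database(self, name):
        return f'CREATE DATABASE "{name}";'


def handle(agent, message):
    """Dispatch an abstract command, as the task manager might send it."""
    method = getattr(agent, message["method"], None)
    if method is None:
        # The agent for this datastore does not offer the capability.
        raise NotImplementedError(message["method"])
    return method(**message.get("args", {}))


# The same abstract message yields a different statement per datastore.
message = {"method": "create_database", "args": {"name": "orders"}}
mysql_statement = handle(MySQLGuestAgent(), message)
postgres_statement = handle(PostgreSQLGuestAgent(), message)
print(mysql_statement)
print(postgres_statement)
```

The task manager side never needs to know which SQL dialect is in use; it only names the operation and its arguments.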
The conductor component is responsible for updating the Trove backend database with the information that the guest agent sends regarding the instances. It eliminates the need for direct database access by all the guest agents for updating information. Like the guest agent, the conductor listens to a topic on the message bus and performs its functions based on the messages it receives.
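The conductor's role as the single writer to the backend can be illustrated with a small in-process simulation. In this sketch, a `queue.Queue` stands in for the AMQP topic and a dictionary stands in for the backend database; the class names, message fields, and status strings are assumptions made for the example, not Trove's real interfaces.

```python
import queue


class Conductor:
    """Hypothetical conductor: the only component that writes to the backend."""

    def __init__(self):
        self.topic = queue.Queue()  # stand-in for the conductor's message bus topic
        self.backend = {}           # stand-in for the Trove backend database

    def process_pending(self):
        """Drain the topic and record the latest status of each instance."""
        while True:
            try:
                message = self.topic.get_nowait()
            except queue.Empty:
                break
            self.backend[message["instance_id"]] = message["status"]


class GuestAgent:
    """Hypothetical agent: reports over the bus and never touches the backend."""

    def __init__(self, instance_id, topic):
        self.instance_id = instance_id
        self.topic = topic

    def report(self, status):
        self.topic.put({"instance_id": self.instance_id, "status": status})


conductor = Conductor()
agent_1 = GuestAgent("inst-1", conductor.topic)
agent_2 = GuestAgent("inst-2", conductor.topic)
agent_1.report("BUILD")
agent_2.report("BUILD")
agent_1.report("ACTIVE")
conductor.process_pending()
print(conductor.backend)
```

Because the agents only hold a reference to the topic, no database credentials need to be present inside the guest instances.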
The following diagram can be used to illustrate the different components of Trove and also their interaction with the dependent services:
Let's take a look at some of the terminology that Trove uses.
Datastore is the term used for the RDBMS or NoSQL database that Trove can manage; it is nothing more than an abstraction of the underlying database engine, for example, MySQL, MongoDB, Percona, Couchbase, and so on.
A datastore version is linked to the datastore and defines a set of packages to be installed or already installed on an image. As an example, let's take MySQL 5.5. The datastore version will also link to a base image (operating system) that is stored in Glance.
The configuration parameters that can be modified are also dependent on the datastore and the datastore version.
An instance is an instantiation of a datastore version. It runs on OpenStack Nova and uses Cinder for persistent storage. It has a full OS and additionally has the guest agent of Trove.
A configuration group is a bunch of options that you can set. As an example, we can create a group and associate a number of instances to one configuration group, thereby maintaining the configurations in sync.
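The idea of keeping several instances in sync through one configuration group can be sketched as follows. The `ConfigurationGroup` and `Instance` classes are hypothetical stand-ins written for this example, and the option name and values are purely illustrative.

```python
class Instance:
    """Hypothetical database instance holding its effective configuration."""

    def __init__(self, name):
        self.name = name
        self.config = {}


class ConfigurationGroup:
    """Hypothetical group: one set of options shared by many instances."""

    def __init__(self, name, values):
        self.name = name
        self.values = dict(values)
        self.instances = []

    def attach(self, instance):
        """Associate an instance and apply the group's current options."""
        self.instances.append(instance)
        instance.config.update(self.values)

    def update(self, **changes):
        """Change options once and push them to every attached instance."""
        self.values.update(changes)
        for instance in self.instances:
            instance.config.update(changes)


group = ConfigurationGroup("company-standard", {"max_connections": 200})
db_1, db_2 = Instance("db-1"), Instance("db-2")
group.attach(db_1)
group.attach(db_2)
group.update(max_connections=500)
print(db_1.config, db_2.config)
```

A single call to `update` changes every attached instance, which is the property that makes a new standard easy to roll out.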
The flavor is similar to the Nova machine flavor, but it is just a definition of memory and CPU requirements for the instance that will run and host the databases.
Normally, it's a good idea to have a high memory-to-CPU ratio as a flavor for running database instances.
A database is the actual database that the users consume. Several databases can run in a single Trove instance. This is where the actual users or applications connect with their database clients.
The following diagram shows these different terminologies, as a quick summary. Users or applications connect to databases, which reside in instances. The instances run in Nova but are instantiations of the Datastore version belonging to a Datastore. Just to explain this a little further, say we have two versions of MySQL that are being serviced. We will have one datastore but two datastore versions, and any instantiation of that will be called an instance, and the actual MySQL database that will be used by the application will be called the database (shown as DB in the diagram).
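The relationship between these terms can also be written down as a small data model. The dataclasses below are a sketch for illustration only; the field names, image names, and the second MySQL version are assumptions and do not mirror Trove's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Datastore:
    name: str  # the database engine, for example "mysql"


@dataclass
class DatastoreVersion:
    datastore: Datastore
    version: str
    image: str  # hypothetical name of the Glance image for this version


@dataclass
class Instance:
    name: str
    datastore_version: DatastoreVersion
    databases: List[str] = field(default_factory=list)  # what applications connect to


# One datastore, two datastore versions, one instance hosting two databases.
mysql = Datastore("mysql")
mysql_55 = DatastoreVersion(mysql, "5.5", "mysql-5.5-image")
mysql_56 = DatastoreVersion(mysql, "5.6", "mysql-5.6-image")
app_instance = Instance("app-db-1", mysql_56, ["orders", "customers"])
print(app_instance.datastore_version.datastore.name, app_instance.databases)
```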
A multi-datastore scenario
One of the important features of the Trove system is that it supports multiple databases to various degrees. In this subsection, we will see how Trove works with multiple Trove datastores.
In the following diagram, we have represented all the components of Trove (the API, task manager, and conductor) except the guest agents as the Trove Controller. The guest agent code is different for every datastore that needs to be supported, and the guest agent for a particular datastore is installed on the corresponding image of the datastore version.
The guest agents by default have to implement some of the basic actions for the datastore, namely, create, resize, and delete, and individual guest agents have extensions that enable them to support additional features just for that datastore.
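One way to picture mandatory actions plus optional extensions is an abstract base class. This is a sketch under assumptions: `BaseGuestAgent`, `supports`, and the two subclasses are hypothetical names invented for the example and are not taken from the Trove code base.

```python
from abc import ABC, abstractmethod


class BaseGuestAgent(ABC):
    """Hypothetical contract: every agent must provide the basic actions."""

    @abstractmethod
    def create(self):
        ...

    @abstractmethod
    def resize(self, flavor):
        ...

    @abstractmethod
    def delete(self):
        ...

    def supports(self, action):
        """Report whether this agent offers a given (possibly optional) action."""
        return callable(getattr(self, action, None))


class MinimalAgent(BaseGuestAgent):
    """Implements only the basic actions."""

    def create(self):
        return "created"

    def resize(self, flavor):
        return f"resized to {flavor}"

    def delete(self):
        return "deleted"


class ReplicatingAgent(MinimalAgent):
    """Adds an extension that only some datastores would offer."""

    def enable_replication(self):
        return "replication enabled"


print(MinimalAgent().supports("enable_replication"))
print(ReplicatingAgent().supports("enable_replication"))
```

A subclass that omits any of the three basic actions cannot be instantiated, while extensions remain optional and discoverable at run time.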
The following diagram should help us understand the command proxy function of the guest agent. Please note that the commands shown are only indicative, and the actual commands will vary.
At the time of writing this article, Trove's guest agents are installable only on Linux; hence, only databases on Linux systems are supported. Feature requests (https://blueprints.launchpad.net/trove/+spec/mssql-server-db-support) were created for the ability to create a guest agent for Windows and support Microsoft SQL databases, but they have not yet been approved at the time of writing this and might be a remote possibility.
Database software distribution support
Trove supports various databases; the following table shows the databases supported by this service at the time of writing this. Automated installation is available for all the different databases, but there is some level of difference when it comes to the configuration capabilities of Trove with respect to different databases.
This has a lot to do with the lack of a common configuration base among the different databases. At the time of writing this article, MySQL and MariaDB have the most configuration options available, as shown in this list:
So, as you can see, almost all the major database applications that can run on Linux are already supported on Trove.
Putting it all together
Now that we have understood the architecture and terminologies, we will take a look at the general steps that are followed:
- Horizon/Trove CLI requests a new database instance and passes the datastore name and version, along with the flavor ID and volume size as mandatory parameters. Optional parameters such as the configuration group, AZ, replica-of, and so on can also be passed.
- The Trove API requests Nova for an instance with the particular image and a Cinder volume of a specific size to be added to the instance.
- The Nova instance boots and follows these steps:
  - The cloud-init scripts are run (like all other Nova instances)
  - The configuration files (for example, trove-guestagent.conf) are copied down to the instance
  - The guest agent is installed
- The Trove API will also have sent the request to the task manager, which will then send the prepare call to the message bus topic.
- After booting, the guest agent listens to the message bus for any activities for it to do, and once it finds a message for itself, it processes the prepare command and performs the following functions:
  - Installing the database distribution (if not already installed on the image)
  - Creating the configuration file with the default configuration for the database engine (and any configuration from the associated configuration groups overriding the defaults)
  - Starting the database engine and enabling auto-start
  - Polling the database engine for availability (until the database engine is available or the timeout is reached)
  - Reporting the status back to the Trove backend using the Trove conductor
- The task manager reports back to the API, and the status of the instance is changed.
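The work done by the guest agent on receiving the prepare command can be sketched as a short Python simulation. Everything here is hypothetical: `FakeEngine` stands in for a real database engine, the `prepare` and `poll_until_available` functions are written for this example, and the status strings and option values are illustrative.

```python
import time


def poll_until_available(is_available, timeout, interval):
    """Poll until is_available() returns True or the timeout is reached."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if is_available():
            return True
        time.sleep(interval)
    return False


def prepare(engine, defaults, overrides, report, timeout=1.0, interval=0.01):
    """Hypothetical prepare flow: install, configure, start, poll, report."""
    if not engine.installed:
        engine.install()
    # Configuration group values override the engine defaults.
    config = {**defaults, **overrides}
    engine.write_config(config)
    engine.start()
    available = poll_until_available(engine.is_available, timeout, interval)
    # In Trove this report would travel to the backend via the conductor.
    report("ACTIVE" if available else "FAILED")
    return config


class FakeEngine:
    """Stand-in engine that becomes available on the third availability check."""

    def __init__(self):
        self.installed = False
        self.started = False
        self.config = None
        self._checks = 0

    def install(self):
        self.installed = True

    def write_config(self, config):
        self.config = config

    def start(self):
        self.started = True

    def is_available(self):
        self._checks += 1
        return self.started and self._checks >= 3


statuses = []
engine = FakeEngine()
final_config = prepare(
    engine,
    defaults={"max_connections": 100, "port": 3306},
    overrides={"max_connections": 250},
    report=statuses.append,
)
print(final_config, statuses)
```

Bounding the poll with a timeout means a database that never comes up is reported as failed instead of leaving the instance in a building state indefinitely.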
If you are wondering where Trove can be used, it fits rather nicely with the following use cases.
Dev/test databases are an absolute killer feature, and almost all companies that start using Trove will use it for their dev/test environments. Trove gives developers the ability to freely create and dispose of database instances at will. This helps them be more productive and removes the lag between requesting a database and getting it.
The capability of being able to take a backup, run a database, and restore the backup to another server is especially key when it comes to these kinds of workloads.
Web application databases
Trove is used in production for any database that supports low-risk applications, such as some web applications. With the introduction of different redundancy mechanisms, such as master-slave in MySQL, this is becoming more suited to many production environments.
Trove is moving fast in terms of the features being added in the various releases. In this section, we will take a look at the features of three releases: the current release and the past two.
The Juno release
The Juno release saw a lot of features being added to the Trove system. Here is a non-exhaustive list:
- Support for Neutron: Now we can use both nova-network and Neutron for networking purposes
- Replication: MySQL master/slave replication was added. The API also allowed us to detach a slave so that it can be promoted
- Clustering: MongoDB cluster support was added
- Configuration group improvements:
  - The functionality of using a default configuration group for a datastore version was added. This allows us to build the datastore version with a base configuration that follows our company standards
  - Basic error checking was added to configuration groups
The Kilo release
The Kilo release focused mainly on introducing new datastores. The following is the list of major features that were introduced:
- Support for the GTID (short for global transaction identifier) replication strategy
- New datastores, namely Vertica, DB2, and CouchDB, are supported
The Liberty release
The Liberty release introduced the following features to Trove. This is a non-exhaustive list.
- Configuration groups for Redis and MongoDB
- Cluster support for Redis and MongoDB
- Percona XtraDB cluster support
- Backup and restore for a single instance of MongoDB
- User and database management for MongoDB
- Horizon support for database clusters
- A management API for datastores and versions
- The ability to deploy Trove instances in a single admin tenant so that the Nova instances are hidden from the user
In order to see all the features introduced in the releases, please look at the release notes of the system, which can be found at these URLs:
In this article, we were introduced to the basic concepts of DBaaS and how Trove can help with this. With several changes being introduced and a maturity score of one out of five, it might seem as if it is too early to adopt Trove. However, a lot of companies are giving Trove a go in their dev/test environments as well as for some web databases in production, which is why the adoption percentage is steadily on the rise.
Among the companies using Trove today are giants such as eBay, which runs its dev/test databases on Trove. HP Helion Cloud, Rackspace Cloud, and Tesora (which is also one of the biggest contributors to the project) have DBaaS offerings based on the Trove component.
Trove is increasingly being used in various companies, and it is helping in reducing DBAs' mundane work and improving standardization.