In this chapter, we will take an in-depth look at the topics we will be covering throughout the book. This chapter will also give you some background as to why Redshift is different from other databases you have used in the past, as well as the general types of things you will need to consider when starting up your first Redshift cluster.
This book, Getting Started with Amazon Redshift, is intended to provide a practical as well as technical overview of the product for anyone who may be intrigued as to why this technology is interesting, as well as for those who actually wish to take it for a test drive. Ideally, there is something here for everyone interested in this technology. The Chief Information Officer (CIO) will gain an understanding of what their technical staff are talking about, while the technical and implementation personnel will gain the insight into the technology that they need in order to understand the strengths and limitations of the Redshift product. Throughout this book, I will try to relate the examples to things that are understandable and easy to replicate in your own environment. Just to be clear, this book is not a cookbook on schema design and data warehouse implementation. I will explain some data warehouse specifics along the way where they are important to the process; however, this is not a crash course in dimensional modeling or data warehouse design principles.
Redshift is a brand-new entry into the market, with the initial preview beta released in November 2012 and the full version made available for purchase on February 15, 2013. As I will explain in the relevant parts of this book, I experienced a few early-adoption issues along the way. That is not to say it is not a good product. So far I am impressed, very impressed actually, with what I have seen. Performance during my testing has been quite good, and when there was an occasional issue, the Redshift technical team's response was stellar. The performance on a small cluster has been impressive; later, we will take a look at some runtimes and performance metrics. We will also look at the how and why of the performance Redshift achieves. Much of it has to do with the data being stored in a columnar data store and the work that has been done to reduce I/O. I know you are on the first chapter of this book and we are already talking about things such as columnar stores and I/O reduction, but don't worry; the book will progress logically, and by the time you get to the best practices at the end, you will understand Redshift in a much more complete way. Most importantly, you will have the confidence to go and give it a try.
In the broadest terms, Amazon Redshift could be considered a traditional data warehouse platform, and in reality, although a gross oversimplification, that would not be far from the truth. In fact, Amazon Redshift is intended to be exactly that, only at a price and with a scalability that are difficult to beat. Amazon has published videos and documentation on the Internet listing the cost at one-tenth that of traditional data warehousing. In my mind, there are clearly going to be some savings on the hardware side, and on some of the human resources necessary to run both the hardware and large-scale databases locally. Don't be under the illusion that all management and maintenance tasks are taken away simply by moving data to a hosted platform; it is still your data to manage. The hardware, software patching, and disk management (none of which are small tasks) have, however, been taken on by Amazon. Disk management, particularly the automated recovery from disk failure, and even the ability to begin querying a cluster while it is still being restored, are all powerful and compelling things Amazon has done to reduce your workload and increase uptime.
I am sure that by now you are wondering: why Redshift? If you guessed that it refers to the astronomical term, and to the work Edwin Hubble did to relate the phenomenon known as redshift to the expansion of our universe, you guessed correctly. The ability to perform online resizes of your cluster as your data continually expands makes Redshift a very appropriate name for this technology.
As you think about your own ever-expanding universe of data, there are two basic options to choose from: the High Storage Extra Large (XL) DW node and the High Storage Eight Extra Large (8XL) DW node. As with most Amazon products, there is a menu approach to pricing. On-Demand, as with most of their products, is the most expensive; it currently costs 85 cents per hour per node for the large nodes and $6.80 per hour per node for the extra-large nodes. Reserved pricing, with some upfront costs, can get you pricing as low as 11 cents per hour for the large nodes. I will get into further specifics on cluster choices in a later section when we discuss the actual creation of the cluster. As you look at pricing, recognize that it is a bit of a moving target. One can assume, based on the track record of just about every product Amazon has rolled out, that Redshift will follow the same model of price reductions as economies of scale are realized within Amazon. For example, the DynamoDB product recently had another price drop that makes that service available at 85 percent of its original cost. Given the track record of the other AWS offerings, I would suggest that these prices are really "worst case". With the general understanding you will gain from this book, the selection of node type and quantity should become clear to you as you become ready to embark on your own journey with this technology. An important point, however, is that you can see how companies that thought an enterprise warehouse was out of their reach can now afford a tremendous amount of storage and processing power at what is already a reasonable cost. The current On-Demand pricing from Amazon for Redshift is as follows:
So, with an upfront commitment, you will have a significant reduction in your hourly per-node pricing, as you can see in the following screenshot:
The three-year pricing affords you the best overall value, in that the upfront costs are not significantly more than those of the one-year reserved node, and the per-hour cost per node is almost half the one-year price. For two XL nodes, you can recoup the upfront costs in 75 days compared to On-Demand pricing, and then pay significantly less in the long run. Unless you truly are just testing, I suggest you purchase the three-year reserved instance.
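If you want to work through that payback calculation for your own situation, the arithmetic is simple. Note that the upfront figure below is purely an illustrative placeholder; the actual reserved pricing appears in the screenshot and changes over time.

```python
# Rough payback calculator: how many days of continuous running it takes
# for a reserved node's lower hourly rate to recoup its upfront cost.
# All dollar figures in the example are hypothetical placeholders.

def payback_days(upfront: float, on_demand_hourly: float,
                 reserved_hourly: float, nodes: int) -> float:
    """Days before the reserved upfront cost is paid back by hourly savings."""
    savings_per_hour = (on_demand_hourly - reserved_hourly) * nodes
    return upfront / (savings_per_hour * 24)

# Example: two XL nodes, a made-up $3,000 total upfront commitment,
# $0.85/hr On-Demand versus $0.11/hr reserved per node.
days = payback_days(upfront=3000, on_demand_hourly=0.85,
                    reserved_hourly=0.11, nodes=2)
print(f"Payback in about {days:.0f} days of continuous running")
```

Plug in the current numbers from the pricing screenshots to reproduce the 75-day figure for your own node counts.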
The basic configuration of the large Redshift (dw.hs1.xlarge) node is as follows:
CPU: 2 Virtual Cores (Intel Xeon E5)
Memory: 15 GB
Storage: 3 HDD with 2 TB of locally attached storage
Disk I/O: Moderate
The basic configuration of the extra-large Redshift (dw.hs1.8xlarge) node is as follows:
CPU: 16 Virtual Cores (Intel Xeon E5)
Memory: 120 GB
Storage: 24 HDD with 16 TB of locally attached storage
Network: 10 Gigabit Ethernet
Disk I/O: Very high
hs in the naming convention is the designation Amazon has used for high-density storage.
An important point to note: if you are interested in a single-node configuration, the only option you have is the smaller of the two node types. The 8XL extra-large nodes are only available in a multi-node configuration. We will look at how data is managed on the nodes and why multiple nodes are important in a later chapter. For production use, you should have at least two nodes; there are performance as well as data protection reasons for this that we will look at later. The large node cluster supports up to 64 nodes, for a total capacity of anything between 2 and 128 terabytes of storage. The extra-large node cluster supports from 2 to 100 nodes, for a total capacity of anything between 32 terabytes and 1.6 petabytes. For the purpose of discussion, a multi-node configuration with two large instances would have 4 terabytes of storage available, and therefore would also have 4 terabytes of associated backup space. Before we get too far ahead of ourselves: a node is a single host with one of the preceding configurations, and a cluster is a collection of one or more nodes running together, as seen in the following figure. Each cluster runs an Amazon Redshift database engine.
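To make those capacity ranges concrete, the arithmetic is just node count times the per-node storage figures quoted above:

```python
# Cluster capacity arithmetic for the two node sizes described above:
# large (XL) nodes hold 2 TB each; extra-large (8XL) nodes hold 16 TB each.

PER_NODE_TB = {"xl": 2, "8xl": 16}

def cluster_capacity_tb(node_count: int, node_type: str) -> int:
    """Total raw storage, in terabytes, for a cluster of identical nodes."""
    return node_count * PER_NODE_TB[node_type]

assert cluster_capacity_tb(1, "xl") == 2        # single large node
assert cluster_capacity_tb(64, "xl") == 128     # maximum large cluster
assert cluster_capacity_tb(2, "8xl") == 32      # minimum 8XL cluster
assert cluster_capacity_tb(100, "8xl") == 1600  # 1.6 PB maximum
```

Remember that a matching amount of backup space accompanies the storage, as in the two-node example above.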
As you begin thinking about the kinds of I/O rates you will need to support your installation, you will be surprised (or at least I was) at the kind of throughput you will be able to achieve on a three-drive, 2 TB node. So, before you apply too many of your predefined beliefs, I suggest estimating your total storage needs and picking the node configuration that best fits those needs on a reasonably small number of nodes. As I mentioned previously, the extra-large configuration only starts as multi-node, so the base configuration for an extra-large cluster is really 32 TB of space; not a small warehouse by most people's standards. If your overall storage needs will ultimately be in the 8 to 10 terabyte range, start with one or two large nodes (the 2 terabytes per node variety). Having more than one node will become important for parallel loading operations as well as for disk mirroring, which I will discuss in later chapters. As you get started, don't feel you need to allocate your total architecture and space requirements right away. Resizing, which we will also cover in detail, is not a difficult operation, and it even allows for resizing between the large and extra-large node configurations. Do note, however, that you cannot mix node sizes in a cluster: all the nodes in a single cluster must be of the same type. You may start with a single node if you wish; I do, however, recommend a minimum of two nodes for performance and data protection reasons. You may consider the extra-large nodes if you have very large data volumes and are adding data at a very fast pace. Otherwise, from a performance perspective, the large nodes have performed very well in all of my testing scenarios.
If you have been working on data warehouse projects for any length of time, this product will cause you to question some of your preconceived ideas about hardware configuration in general. As most data warehouse professionals know, greater speed in a data warehouse is often achieved with improved I/O. For years, I have discussed and built presentations specifically on SAN layout, spindle configuration, and other disk optimizations as ways of improving overall query performance. The methodology Amazon has implemented in Redshift eliminates a large percentage of that work by using a relatively small number of directly attached disks. These directly attached disks perform impressively, in large part because Redshift eliminates unnecessary I/O operations. With the concept of "zone mapping," entire blocks of data can be skipped in read operations, because the database knows that the zone is not needed to answer the query. The blocks are also considerably larger than in most databases, at 1 MB per block. As I have already mentioned, the data is stored in a column store. Think of the column store as a physical layout that allows the reading of a single column from a table without having to read any other part of the row. Traditionally, a row would be placed on disk within a block (or multiple blocks). If you wanted to read all of the first_name fields in a given table, you would read block by block, picking up the first_name column from each of the records as you encountered them.
Think of a vinyl record, in this example, Data Greatest Hits Vol-1 (refer to the following figure). The needle starts reading the record, and you start listening for first_name; so, you will hear first_name (remember that), then you will hear last_name and age (you choose to forget those two, as you are only interested in first_name), and then we'll get to the next record and you'll hear first_name (remember that), last_name, age (forget those), and so on.
In a column store, you would query the database in the same way, but then you would start reading block by block only those blocks containing the first_name data. The album Data Greatest Hits Vol-2 (refer to the following figure) is now configured differently, and you'll put the needle down on the section of the record for first_name and start reading first_name, first_name, first_name, and so on. There was no wasted effort in reading last_name and age at all.
Likewise, if you were reading the age column, you would start with the age data, ignoring all of the data in the first_name and last_name columns. Now apply compression (which we will cover later as well) to the blocks. A single targeted read operation of a large 1 MB block will retrieve an incredible amount of usable data. All of this is being done while going massively parallel across all available nodes. I am sure that without even having started your cluster yet, you can get a sense of why this is going to be a different experience from what you are used to.
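The record-player analogy can be sketched in a few lines of code. This toy example (the names and ages are made up) simply counts how many fields each layout forces you to touch when reading every first_name:

```python
# A toy illustration of why a column store touches less data: the same
# three-column table stored row by row versus column by column.

rows = [
    ("alice", "smith", 34),
    ("bob", "jones", 41),
    ("carol", "lee", 29),
]

# Row store: fields interleaved on disk, like Data Greatest Hits Vol-1.
row_store = [field for row in rows for field in row]

# Column store: each column stored contiguously, like Vol-2.
column_store = {
    "first_name": [r[0] for r in rows],
    "last_name": [r[1] for r in rows],
    "age": [r[2] for r in rows],
}

# Reading first_name from the row store scans every field of every row...
fields_scanned_row_store = len(row_store)
# ...while the column store reads only the one column it needs.
fields_scanned_column_store = len(column_store["first_name"])

print(fields_scanned_row_store, fields_scanned_column_store)  # 9 versus 3
```

Scale the three rows up to billions, add 1 MB blocks and compression, and the I/O savings become dramatic.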
I will cover only some of the specifics here, as we'll discuss these topics in other sections; however, as you begin to think about the migration of your data and processes to Redshift, there are a few things to keep in the back of your mind now. As you read this book, you will need to take into consideration the things that are unique to your environment: for example, your current schema design and the tools you use to access the data (both the input via ETL and the output to user and BI tools). You will need to determine which of them are reusable and which will have to be migrated to new and different processes. This book will give you the understanding to help you make informed decisions about these unique aspects of your environment. On the plus side, if you are already using SQL-based tools for query access or business intelligence, the technical migration for your end users will be easy. As far as your data warehouse itself is concerned, if your environment is like most well-controlled (or even well thought out) data warehouse implementations, there are always some things that fall into the category of "if I could do that again". Don't leave them on the table now; this is your chance not only to migrate, but to make things better in the process.
In the most general terms, no changes are necessary between the schema you are migrating out of and the one you will build in Redshift to receive the data. As with all generalizations, there are a few caveats to that statement, and most of them depend on which database architecture you are migrating from. Some databases define a bit as a Boolean; others define it as a bit itself. In Redshift, such columns need to be defined as Boolean. You get the idea; as we delve further into the migration of the data, I will talk about some of the specifics. For now, let's just leave it at the general statement that the database structure you have today can, without large effort, be converted into database structures in Redshift. All the kinds of things you are used to using (private schemas, views, users, objects owned by users, and so on) still apply in the Redshift environment. There are some things, mainly for performance reasons, that have not been implemented in Redshift. As we get further into the implementation and query chapters, I will go into greater detail about these things.
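As an illustration of the kind of translation involved, here is a sketch of a type-mapping table you might keep while porting DDL. Aside from the bit-to-Boolean case mentioned above, the source types shown are simply examples of common types that need a substitute; check the Redshift documentation for your own source database's types.

```python
# Illustrative mapping from source-database column types to Redshift types.
# bit -> boolean is the case called out in the text; the other entries are
# examples of types a source system might use that Redshift does not.

TYPE_MAP = {
    "bit": "boolean",        # Redshift stores true/false as BOOLEAN
    "datetime": "timestamp", # a common substitute for date-time columns
    "tinyint": "smallint",   # Redshift's smallest integer type
}

def map_column_type(source_type: str) -> str:
    """Return the Redshift type for a source column type, passing
    through anything that needs no translation."""
    return TYPE_MAP.get(source_type.lower(), source_type.lower())

print(map_column_type("BIT"))  # boolean
```

Running your DDL through a small translation table like this keeps the migration mechanical and repeatable.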
Also, before you can make use of Redshift, there are things you will need to think about for security as well. Redshift runs in a hosted environment, so there are a few extra steps to be taken to access the environment as well as the data. I will go through the specifics in the next chapter to get you connected. In general, there are a number of things Amazon is doing here, from access control and firewalls to (optional) encryption of your data and VPC support. Encryption is one of those options you need to pick for your cluster when you create it. If you are familiar with Microsoft's Transparent Data Encryption (TDE), this is essentially the same thing: encryption of your data while it is at rest on the disk. Encryption is also supported in the copy process from the S3 bucket by way of the API interface. So, if you have reason to encrypt your data at rest, Redshift will support it. As you are likely aware, encryption has a CPU cost for encrypting/decrypting data as it is moved to and from disk. With Redshift, I have not seen a major penalty for using encryption, and I have personally, due to the types of data I need to store, chosen to run with encryption enabled. Amazon has done a thorough job of handling data security in general; however, I still have one bone of contention with the encryption implementation: I am not able to set and manage my own encryption key. Encryption is an option that you select, which then encrypts the data at rest with a key unknown to me. I am sure this has to do with the migration of data between nodes and the online resizing operations, but I would still rather manage my own keys. The final part of the security setup is the management of users. In addition to managing database permissions, as you normally would for users accessing your data, there are also cluster-level controls available through Amazon's Identity and Access Management (IAM) services.
These controls will allow you to specify which Amazon accounts have permissions to manage the cluster itself.
When the cluster is created, there is a single database in the cluster. Don't worry; if your environment has some other databases (staging databases, data marts, or others), these databases can be built on the same Redshift cluster if you choose to do so. Within each database, you have the ability to assign permissions to users as you would in the primary database that has been created. Additionally, there are parameter groups that you can define as global settings for all the databases you create in a cluster. So, if you have a particular date format standard, you can set it in the parameter group and it will automatically be applied to all the databases in the cluster.
So, taking a huge leap forward, you have loaded data, you are happy with the number of nodes, and you have tuned things for distribution among the nodes (another topic I will cover later); the most obvious question now to anyone should be: how do I get my data back out? This is where this solution shines over some of the other possible big-data analytical solutions. It really is simple. As the Redshift engine is built on a Postgres foundation, Postgres-compliant ODBC or JDBC drivers will get you there. Beyond the obvious simplicity in connecting with ODBC, there are also a variety of vendors, such as Tableau, Jaspersoft, MicroStrategy, and others, that are partnering with Amazon to optimize their platforms to work with Redshift specifically. There will be no shortage of quality reporting and business intelligence tools that will be available, some of which you likely already have in-house. You can continue to host these internally or on an Amazon EC2 instance. Others will be available as add-on services from Amazon. The main point here is that you will have the flexibility in this area to serve your business needs in the way you think is best. There is no single option that you are required to use with the Redshift platform.
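As a concrete illustration, a standard Postgres-style connection URL is all a JDBC client needs. The cluster endpoint below is a made-up placeholder in the format the AWS console displays (Redshift listens on port 5439 by default); substitute your own cluster's endpoint and database name.

```python
# Building a Postgres-compatible JDBC URL for a Redshift cluster.
# The endpoint shown is a hypothetical placeholder, not a real cluster.

def jdbc_url(endpoint: str, port: int, database: str) -> str:
    """Compose a JDBC URL usable with a Postgres-compliant driver."""
    return f"jdbc:postgresql://{endpoint}:{port}/{database}"

url = jdbc_url(
    endpoint="examplecluster.abc123xyz.us-east-1.redshift.amazonaws.com",
    port=5439,  # Redshift's default port
    database="mydb",
)
print(url)
```

ODBC connections work the same way: point any Postgres-compliant driver at the cluster endpoint and port, and your existing SQL tools connect as they would to any Postgres database.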
I will also take a closer look at the management of the cluster. As with other AWS service offerings, Amazon provides a web-based management console. Through this console, you can manage everything from snapshots to cluster resizing and performance monitoring. When I get to the discussion around the management of your cluster, we will take a closer look at each of the functions available from this console, as well as the underlying tables that you can directly query for your customized reporting and monitoring needs. For those of you interested in managing the cluster through your own applications, there are API calls covering a very wide variety of cluster-management functions, including resizing, rebooting, and others that are also available through the web-based console. If you are the scripting type, there is also a command-line interface with these management options. As part of managing the cluster, consideration also needs to be given to Workload Management (WLM). Amazon has provided a process by which you can create queues for queries to run in, and processes to manage those queues. The default configuration allows five concurrent queries, which should be fine for your initial setup. We will take a more in-depth look at the WLM configuration later in the book.
As a parting thought in this overview, I would like to share my thoughts on the direction the industry in general is taking. I think the attention that cloud computing, big data, and distributed computing are getting is far more than just hype. Some of these are not truly new and innovative ideas in the computing world; however, the reality of all our data-driven environments is one that will require more data to make better, faster decisions at a lower cost. As each year goes by, the data in your organization undergoes its own astronomical "redshift" and rapidly expands (as it does in every other organization). The competitive advantage of better understanding your data through business intelligence will require larger, faster computing; that is a reality we will all need to accept. Big data, regardless of your definition of big, is clearly here to stay, and it will only get bigger, as will the variety of platforms, databases, and storage types. As with any decision related to how you serve your internal and external data clients, you will need to decide which platform and which storage methodology will suit their needs best. I can say with absolute confidence that there is no single answer to this problem. Redshift, although powerful, is just another tool in your toolbox, not the only answer to your data storage needs. I am certain that if you have spent any amount of time reading about cloud-based storage solutions, you will have come across the term polyglot. The term is almost overused at this point; however, the reality is that there are many languages (and, by extension, databases and storage methodologies), and you will likely not find a single database technology that fulfills all of your storage and query needs. Understanding this will bring you much closer to embracing your own polyglot environment and using each technology for what it does best.
In this chapter, we have already covered quite a bit of ground together. From the history of the Redshift implementation to its astronomical roots, you should have a good idea of what your plan is for your initial node configuration and what your choices for encryption are when you allocate a cluster. You should be able to explain to someone why a column store is different from a standard RDBMS storage model. There are many more details for a number of topics that we have touched upon in this overview; however, at this point you should feel like you are not only ready to create your cluster, but to also have an intelligent conversation about Redshift and its capabilities. In the next chapter, we will begin to look at some of the specific things you will need to understand and configure to run your first Redshift cluster.