Getting Started with Amazon Redshift — Save 50%
Enter the exciting world of Amazon Redshift for big data, cloud computing, and scalable data warehousing with this book and ebook
This article created by Stefan Bauer author of Getting Started with Amazon Redshift, gets into some of the specifics and details you will need to get up and running with . As with most of the Amazon products you have used in the past, there are just a few preliminary things to take care of. You need to have signed up for the Redshift service on the Amazon account you will be using. Although these keys are not specific to Redshift, be sure to hang on to both your public and secret key strings from your user account. Those keys will be labeled Access Key and Secret Key. You can view the Access Key public portion from the user security credentials on the Security Credentials tab. However, if you do not capture the secret key when you create the keys, it cannot be recovered and you will need to generate a new key pair. You will need these when we start talking about loading data and configuring the command-line tools. Once you have the permissions for your account, the process to create the cluster is a wizard-driven process that you can launch from your Amazon Redshift management console.
(For more resources related to this topic, see here.)
You will find that for most things that you deal with on Redshift, the default mode is one of no access(default security group, VPC access, database access, objects in the database, and so on). Due to the fact that you need to deal with that on a consistent basis, you will find that it will not be an issue for you; it will simply be part of the process. Creating objects will require granting permissions as well as granting permissions to access cluster management. Depending on the environment that you are coming from, this may be frustrating sometimes; however, considering the fact that you are remotely hosting your data, I for one am happy with the extra steps necessary to access things. The importance of data security, as a general statement, cannot be overstated. You are responsible for your company's data as well as its image and reputation. Hardly a week goes by without news of companies that have had to make public announcements of data being improperly accessed. The fact that data has been improperly accessed has little to do with the location of the data (remote or local) if you use Amazon or some other provider, but rather it depends on the rules that have been set up to allow access to the data. Do not take your security group's configuration lightly. Only open access to the things you really need and continue to maintain strict database rules on access. Honestly, this should be something you are already doing (regardless of where your data is physically located); however, if you are not, take this as the opportunity to enforce the necessary security to safeguard your data. You will need to add your IP ranges to allow access from the machine(s) that you will be using to access your cluster. In addition, you should add your EC2 security group that contains the EC2 instances (if there are any) that you will be connecting from, as shown in the next screenshot. Later in this article, we will cover installation and configuration of the command-line interface using a connection from an EC2 instance. If you don't have an EC2 instance, don't worry, you can still add it later if you find it necessary. Don't get hung up on that, but if you already have the security group, add it now.
You will also need to have a parameter group. A parameter group applies to every database within the cluster, so whatever options you choose, think of them as global settings. If there are things that you would like to adjust in these settings, you need to create your own parameter group (you may not edit the default). The creation of the new group may be done before you create your cluster. You will see where you associate the parameter group to the cluster in the next section. If you don't need to change anything about the default values, feel free to simply use the parameter group that is already created, as shown in the following screenshot:
In this section, we will go through the steps necessary to actually create your cluster. You have already made the "hard" decisions about the kinds of nodes, your initial number of nodes, and whether you are going to use encryption or not. Really, you only have a couple of other things to decide, such as what you want to name your cluster. In addition to the cluster name, you will need to pick your master username and password. Once you have those things decided, you are (quite literally) four simple pages away from having provisioned your first cluster.
Don't forget, you can resize to a different number of nodes and even a different cluster type later.
Launch the cluster creation wizard by selecting the Launch Cluster option from the Amazon Redshift Management console:
This will bring you to the first screen, CLUSTER DETAILS, as shown in the following screenshot. Here you will name your cluster, the primary database, your username, and password. As you can see, there are clear onscreen instructions for what is required in each field.
The NODE CONFIGURATION screen, shown as follows, will allow you to pick the size of the nodes. You can also select the type of cluster (Single Node or Multi Node). For this example, I chose Single Node.
The additional configuration screen, as shown in the next screenshot, is where you will select your parameter group, encryption option, VPC if you choose, as well as the availability zone. A Virtual Private Cloud (VPC) is a networking configuration that will enable isolation of your network within the public portion of the cloud. Amazon allows you to manage your own IP ranges. A Virtual Private Network (VPN) connection to your VPC is used to essentially extend your own internal network to the resources you have allocated in the cloud. How to set up your VPC goes beyond Redshift as a topic; however, do understand that Redshift will run inside your VPC if you so choose.
Believe it or not, that really is everything. On the REVIEW screen, as shown in the next screenshot, you can now confirm your selections and actually start the cluster. Once you select the Launch Cluster button here, it will take a few minutes for your cluster to initialize. Once initialization is complete, your cluster is ready for you to use.
We will take a look at some of the options you have to manage the cluster you have just created in ways other than using the Redshift Management console; however, since we just used the console to create the cluster, we will continue on with that tool for now.
Before we go much further into the details, take a quick look around at the Redshift Management console. You will be quickly comfortable with the options you have available to manage and run your cluster. Once you have your cluster running, the following initial screen giving you the "at a glance" health status is displayed:
Along the left-hand side of the screen, as shown in the following screenshot, you can see some of the high-level management functions related to backups, security groups, and so on.
Once you have selected your cluster, there are some tabs across the top. For now, you can familiarize yourself with these, particularly the Configuration screen that you can access from the tab shown in the next screenshot. There is a wealth of information there. Most important (for now), because surely you want to get connected, is the endpoint information.
From the main AWS console, you can drag any of the AWS services you wish up into your own menu bar (see the EC2 and Redshift icons in the preceding screenshot), making it easy to get to the different console views.
Before we go too far and you jump the gun and start connecting tools and loading data, there are a few things to be aware of. I will go into greater detail on the configuration, layout, table creation, and so on as we go further along; so, let's just start with a few high-level things to keep in mind. Although you will be using PostgreSQL drivers, the core of the database is Postgres. There are certain things that have, for performance reasons, been removed. We will shortly take a closer look at the kinds of things that have not been implemented. So, as you mentally prepare the choices for the first tables you will be loading to test with, depending on what environment you are coming from, partitioning, subpartitioning, and range partitioning are the things you will leave on the table. I will explain the concept of distribution keys, which is similar to partitioning but not altogether the same. As a database professional, there are some other core features that you are used to maintaining, thinking about, and optimizing, such as indexing, clustering of data, primary keys, as well as unique constraints on columns. In the traditional sense, none of the clustering options are supported, nor are indexes. I will discuss sort keys and the considerations around what it means to select sort keys later. As far as primary key assignment is concerned, you can, and (depending on the table) maybe should, assign the primary key; however, it does nothing to enforce uniqueness on the table. It will simply be used by the optimizer to make informed decisions as to how to access the data. It tells the optimizer what you, as the user, expect to be unique. If you are not familiar with data warehouse design, you might be thinking "Oh my gosh, what were they thinking?". Those of you familiar with warehouse implementations of large tables are probably already running without primary keys on your largest tables. Load processes are designed to look up keys in dimensions, manage those keys based on the business values, and so on. I am not going to go too far off the topic on dimensional modeling here; that is not really what we are trying to learn. It should be suffcient to say that when you are loading the fact table, by the time you hit the insert statement into the fact table, you should have fully-populated dimension keys. Null values would be handled and all of the heavy lifting would be done by the load process. Logically, the overhead incurred by the database's revalidation of all of the things that you just assigned in the load is a very expensive operation when you are dealing with a 100-million row table (Redshift is about eliminating I/O). The same logic applies to the constraints at the column level. You can set the not null constraints but do nothing to actually ensure the data matches that expectation. There are a couple of maintenance commands (similar to a statistics update you are likely to be familiar with) after you manipulate large quantities of data that are more important to the optimization process than the application of constraints on the columns. I will get into the details about those commands after we get some data loaded
SQL Workbench and other query tools
Since you are able to connect to the database with native or ODBC PostgreSQL drivers,your choice of query tools is really and exactly that, your choice. It is recommended that you use the PostgreSQL 8.x JDBC and ODBC drivers. Amazon makes a recommendation for a SQL Workbench tool, which for the (free) price will certainly work, having come from environments that have more fully-featured query tools. I was a little frustrated by that product. It left me wanting for more functionalities than is provided in that product. I know it sounds counterintuitive to the discussion we just had about all the features that are not needed or are not supported; so, there are clearly going to be some things in the query tool that you simply will never use. You are after all not managing a traditional PostgreSQL database. However, the ability to have multiple connections, script objects, doc windows, to run explain plans, and to manage the results with the "pivot" type functionality is a great beneft. So, now that I have talked you out of the SQL Workbench tool and into the EMS tool, go and download that. Just to limit the confusion and to translate between tools, the screenshots, descriptions, and query examples from this point forward in this book will be using the EMS tool. Once you have the SQL tool of your choice installed, you will need some connection information from your configuration screen, as shown in the next screenshot. There is a unique endpoint name and a port number. You will also need the master user ID and password. This is your sysadmin account that we will be using to create other users, schemas, and so on.
Now that you have everything you need, select the option to create a database connection, plug in the info, and you are now connected to your cluster. If this is really your first remote database, it may be a bit early to declare total victory; however, you have now joined the ranks of the companies that talk about their cloud computing capabilities! I will go into greater detail about schemas and permissions when we discuss the management of your data.
Before we get into loading your data, we will talk about accessing your data with third-party products. There are a variety of ETL tools, and depending on which product you are currently using, you may simply be able to continue with the product you are using. Additionally, there have been some recent partnership announcements from Informatica about providing an Amazon service-based option to use their PowerCenter product. If your native expertise is in SSIS, it is possible to connect using ODBC database connections; however, you will have performance issues with large quantities of data without using some data copy options from files. There are other options, such as Pentaho, that also have a tremendous amount of promise as well.
As you start to think about your transition to this new environment, you will have a variety of decisions to make that will be unique to the current location of your source data and the in-house expertise that you have for ETL products. The good news is that most of the types of processes you currently support will translate well to Amazon Redshift. There are certain functions in SQL that are not supported; however, for the most part, the insert, update, and delete functionality, right down to creating temp tables in your queries are supported and will translate without major changes.
There are a few things to keep in mind that you might be accustomed to using that are different than you might expect or that are simply not available to you. I am not really going back on the statement that there is little change to your SQL. You will find that the majority of your queries will work with little or no modifcation. This section will highlight what I think will be most important to you as you review the kinds of reporting and analytical processes you have running in your current environment. In addition to the changes in Postgres' functionality, there are a series of Redshift-specifc system tables that augment, and in some cases, replace the functionality found in the Postgres system tables. We will look at the system tables specifcally as we discuss management and querying later in the book. Just understand that if you are familiar with Postgres' system tables, there are going to be some things that you will need to be aware of.
- Create table:This is a standard SQL statement that will allow you to build objects in the database.
- Tablespaces: These are not supported.
- Table partitioning: This is not supported; however, there is the distribution key, which is different from traditional partitioning.
- Inheritance: This is not supported.
- Unique constraint: This constraint can be created; however, it is used only to inform the optimizer when creating the query plan to access the data. We will review these constraints later as we discuss loading data.
- Exclusion constraint: This is not supported.
- Foreign Key: This is informational to the optimizer only.
- Primary Key: This is informational to the optimizer only.
- Alter table: This is a standard SQL statement that will allow you to change the structure of tables.
- ALTER COLUMN: This is not supported.
- Copy: This feature is highly optimized for load purposes and will connect to an Amazon S3 bucket or an Amazon DynamoDB database.
- Order by: This standard SQL keyword will affect the order in which the data is output by a query.
- Nulls: (order by nulls) is not supported.
- First/Last: (order by first/last) is not supported.
- VACUUM: If you are familiar with the Postgres VACUUM function to reorganize tables, this is similar, but with new options to support the Redshift functionality.
- Insert/Update/Delete: No worries, these are supported! However, the WITH syntax is not supported.
- Indexes: You will find this as one of the management (particularly space management) items that you will not miss. My only concern early on, which as of yet has not been a problem, is the ability to affect a particularly poorly performing query with the help of a well-placed index. This will remain on my "watch list" of things to look out for, but as I said, so far it has not posed any problems.
- Collations: Locale-specifc or user-defned collation sequences are not supported.
- Array and Row constructor value expressions:These are not supported.
- User-defned functions: This seems to be a limitation that at some point will need to be addressed. I don't see a technical reason why this was eliminated (but then again, I am not one of the engineers that built Redshift).
- Stored procedures: Why this was eliminated is also not clear to me. Building stored procedures, thus incorporating complicated logic into a centralized stored process, seems like it would be something that an analytical database would be able to do
- Triggers: These are not supported
- Table functions: These are not supported and are one of those items based on the column store functionality that may not prove to be necessary. Keep this item in mind as you review your queries for what will be impacted as you move to Redshift.
- Sequences:These are not supported.
- Full text search: This is not supported.
Without getting too bogged down with the specifcs, there are certain SQL related datatypes that are supported on the leader node, where that part of the query will not be passed to the data nodes. There are also datatypes that fall into the category of "unsupported features" that we are discussing here. Some of the unsupported datatypes are slightly more obscure datatypes, such as object identifer types and network address types. Those and a few others, for the sake of clarity, I am leaving off this list. For our purposes here, we will review those datatypes that are simply not available. No caveats here, so you do need to review your SQL. For these, the create table statements are not available to you in Redshift.
- Bit/Bit Varying(ok, so there is a caveat, Boolean works fne)
- Bytea(Postgres' binary datatype)
- Composite types
- Date/Time types:
- Timestamp with timezone
- Enumerated types
- Geometric types
- Numeric types:
- Serial, Bigserial, Smallserial
- Money (careful here! You will likely have something somewhere in your database defined as money)
Now that we have looked at some of the things that are different as well as the unsupported datatypes that you need to look at in your tables and SQL, there is just one remaining section of unsupported features and those are functions. Similar to the other parts of this section, this is not a complete listing. There are quite a number of these unsupported functions. Please don't be discouraged at this point. Most of these are not likely to impact much of your SQL, particularly standard end user queries. I am simply trying to paint an accurate picture of the things you need to consider.
- Access privilege inquiry functions
- Aggregate functions:Don't worry; since this could be a hot-button issue for some, I have listed all of them here. As you will see in the following list, I am sure you will find that most of the queries that you have already written do not use these functions:
- Database management functions:
- Backup/Restore (these are handled by Redshift snapshots)
- Database object location functions
- Database object size functions
- Date/Time functions:These are mostly related to the lack of timestamp support that we have already discussed:
- JSON functions(as the JSON datatype is not supported)
- XML functions(as the XML datatype is not supported)
- Mathematical functions:
- Range functions and operators
- Sequence manipulation functions
- String functions: There are really only a couple of string functions that you will likely come across with any kind of regularity. Please note that convert() and substr() are on the list of unsupported functions:
- Trigger functions (as triggers themselves are not supported)
- Window functions(depending on the types of queries you currently have, the following may be found in your SQL):
- Text search functions(as text search is not supported)
- System Catalog Functions:As I have already mentioned, we will cover the use of system tables shortly.
I have tried to give you a sense of the kinds of things that are different. Review the complete listings in the Amazon documentation before you formulate a migration plan for your environment.
There are a variety of cluster management options that are available to you in addition to the online Redshift Management console. Something many of you will be very quickly comfortable with is the command line. The command-line interface (CLI) is currently a developer preview product as a GitHub project. Just as with the other options, I am not going to try to replace the available Amazon documentation here. This is just to serve as a highlight of the steps needed to get you going and to show you some of the kinds of things you can do with the help of some basic examples. The Amazon command line utilizes Python (2.6 or greater) and will run on any operating system that supports Python. If you need assistance with Python, there are many great resources at www.python.org. To install the command-line interface, detailed instructions can be found at http://aws.amazon.com/cli. I will describe the basic steps if you are installing on an existing Amazon EC2 instance. First of all, if you are running on an Amazon EC2 instance, you already have Python installed. To get the command-line packages, run the installation with the following command from an account that has permissions to install software on the server:
Next, you will need to create a file with your Amazon credentials on the EC2 server. Make this file read-only to the user that is executing the commands, as it will contain your private and public Amazon keys. For this example, I called the file cliconfig. txt; however, you may call it anything you wish. [default] is for the profile. If you use [default], you do not need to specify the profile on the command line. This will allow you different configurations within the same file and you can then specify which profile you wish to use. Keep it simple for now and just use [default].
aws_access_key_id = <Your Access Key>
aws_secret_access_key = <Your Secret Key>
region = us-east-12
As we noted earlier when we looked at security, you will need your own credentials to be able to fill in the necessary parts here, and you will also need to pick the region that you have your cluster running in. Once you have that file, export the environmental variable necessary for the command line to understand where the configuration file is (add this to your profile as well, so the next time you connect to the host, this will be set for you already).
Once you have the command-line interface installed, your configuration file created,and the environmental variable set, the following command will confirm whether the command line has been properly installed:
To verify that you have everything working for your cluster's connectivity, run the following command:
aws redshift help
Now that you have the technical parts of the command-line interface working, the basic syntax of the command line for Redshift is as follows:
aws redshift operation
The following are optional arguments:
Note that the default output type is JSON. You can also specify text or CSV.
aws redshift describe-cluster
Again, this is not intended as a replacement for the available documentation. There are currently over 30 command line operations, each with extensive documentation. Clearly, each of these commands will have a unique set of options that are both required and optional. For our purposes here, I just want you to get a feel of the kinds of things that are possible.
As you can see in the preceding list, the create-cluster option will allow you to execute the creation of the cluster from the command line. This would produce the exact same result as having gone through the Launch Cluster button from the Redshift Management console that we looked at in the beginning of this article. The output from describe-clusters of a single-node xlarge cluster from the previous command is shown in the following screenshot:
The same output can be produced as text by adding the –output text to the command line. I am confident that if you have a Unix-scripting background,you will be up and running very quickly with the functionality you find in the command-line interface.
The PSQL command line
If you are interested in running commands other than those available in the CLI interface, you can install the standard Postgres PSQL command-line tools. The Amazon CLI tool is clearly focused on management functionality and not on the execution of queries. To connect using the psql command line, you need three values: -h (hostname), -p (port), and -U (user). You will then be prompted for the password as shown in the following command:
# psql -h <Endpoint> -p 5439 -U <user>
There are many other options to pass in files, how you wish to have the output formatted, or setting variables as described here.
The following are the connection options:
- -h, --host=HOSTNAME: This is the database server host or socket directory (default is local socket)
- -p, --port=PORT: This is the database server port (default is 5432)
- -U, --username=USERNAME: This is the database username (default is root)
- -w, --no-password: This never prompts for a password
- -W, --password: This forces a password prompt (this should happen automatically)
Output format options
The following are the output format options:
- -A, --no-align: Denotes the unaligned table output mode
- -F, --field-separator=STRING: Sets the field separator (default is |)
- -H, --html: Denotes the HTML table output mode
- -P, --pset=VAR[=ARG]: Sets the printing option VAR to ARG (see the \pset command)
- -R, --record-separator=STRING: Sets the record separator (default is newline)
- -t, --tuples-only: Prints only rows
- -T, --table-attr=TEXT: Sets the HTML table tag attributes (for example width and border)
- -x, --expanded: Turns on the expanded table's output
The following are general options:
- -c, --command=COMMAND: Runs only a single command (SQL or internal) and exits
- -d, --dbname=DBNAME: Denotes the database name to connect to (default is root)
- -f, --file=FILENAME: Executes the commands from a file and then exits
- -l, --list: Lists the available databases and then exits
- -v, --set=, --variable=NAME=VALUE: Sets the psql variable NAME to VALUE
- -X, --no-psqlrc: Prevents the startup file from being read
- -1, --single-transaction: Executes the command file as a single transaction
Along the same lines as the command-line interface, there is a rich list of over 70 API calls. Just like the command line options, the API functions have a well-defined section in the Amazon documentation. As I noted with the command line, you can see that the same ability to create a cluster exists within the API functions as well as the other cluster management tools you would expect to find.
The thing to understand at this point is one of fexibility. You have choices on how to connect to the cluster as well as what kinds of tools you wish to use to manage that cluster.
This article has started to bring together the things you will need to consider as you bring your data and processes to the Redshift environment. We have looked at quite a few things to get the cluster running, getting your query tools installed and connected, and even started to understand some of the management functions that you will be using on a daily basis. By this point, you should feel comfortable navigating through the Redshift Management console, have your own cluster running, and have a general understanding of the overall Redshift capabilities. You should also understand some of the limitations that you will need to consider as you begin thinking more closely about your own environment. You are by no means ready to run your production reporting yet; however, you really are closer than you might think.
Resources for Article:
- Introduction to Cloud Computing with Microsoft Azure [Article]
- Configuring Clusters in GlassFish [Article]
- Microsoft SQL Server 2008 High Availability: Understanding Domains, Users, and Security [Article]
eBook Price: $14.99
Book Price: $29.99
About the Author :
Stefan Bauer has worked in business intelligence and data warehousing since the late 1990s on a variety of platforms in a variety of industries. Stefan has worked with most major databases, including Oracle, Informix, SQL Server, and Amazon Redshift as well as other data storage models, such as Hadoop. Stefan provides insight into hardware architecture, database modeling, as well as developing in a variety of ETL and BI tools, including Integration Services, Informatica, Analysis Services, Reporting Services, Pentaho, and others. In addition to traditional development, Stefan enjoys teaching topics on architecture, database administration, and performance tuning. Redshift is a natural extension fit for Stefan's broad understanding of database technologies and how they relate to building enterprise-class data warehouses.