Massively Parallel Processing (MPP) databases partition (and optionally replicate) data across multiple nodes. All meta-information regarding data distribution is stored in master nodes. When a query is issued, it is parsed, a suitable query plan is developed from the meta-information, and the plan is executed on the relevant nodes (the nodes that store the related user data). HP offers one such MPP database, called Vertica, to address the problems of Big Data analytics.
Vertica differentiates itself from other MPP databases in many ways. The following are some of the key points:
- Column-oriented architecture: Unlike traditional databases that store data in a row-oriented format, Vertica stores its data in columnar fashion. This allows a great level of compression on data, thus freeing up a lot of disk space.
- Design tools: Vertica offers automated design tools that help in arranging your data more effectively and efficiently. The changes recommended by the tool not only ease pressure on the designer, but also help in achieving seamless performance.
- Low hardware costs: Vertica allows you to easily scale up your cluster using just commodity servers, thus reducing hardware-related costs to a certain extent.
This article will guide you through the installation and creation of a Vertica cluster. It will also cover the installation of Vertica Management Console, which is shipped with the Vertica Enterprise edition only. Note that Vertica can be upgraded to a higher version, but downgrading to a lower version is not possible.
Before installing Vertica, you should bear in mind the following points:
- Only one database instance can be run per cluster of Vertica. So, if you have a three-node cluster, then all three nodes will be dedicated to one single database.
- Only one instance of Vertica is allowed to run per node/host.
- Each node requires at least 1 GB of RAM.
- Vertica can be deployed on Linux only and has the following requirements:
- Only the root user or a user with full (sudo) privileges can run the install_vertica script. This script is crucial for installation and is used in several places.
- Only ext3/ext4 filesystems are supported by Vertica.
- Verify whether rsync is installed.
- The time should be synchronized across all nodes/servers of a Vertica cluster; hence, it is good to check whether the NTP daemon is running.
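The requirements above can be checked before running the installer. The following is a minimal sketch, not part of Vertica itself: the function names are hypothetical, and the /data path in the usage comments is only a placeholder for wherever the Vertica data will live.

```shell
#!/usr/bin/env bash
# Preflight sketch; the thresholds mirror the requirements listed above.
# All function names here are illustrative, not Vertica tooling.

# Succeeds when the given RAM figure (in kB) is at least 1 GB.
has_min_ram() {
    local mem_kb=$1
    [ "$mem_kb" -ge $((1024 * 1024)) ]
}

# Succeeds when the filesystem type is one the article lists as supported.
is_supported_fs() {
    case "$1" in
        ext3|ext4) return 0 ;;
        *) return 1 ;;
    esac
}

# Succeeds when the named command is available on PATH.
has_command() {
    command -v "$1" >/dev/null 2>&1
}

# Example usage on a node (commented out so the file can be sourced safely;
# /data is a placeholder path):
#   mem_kb=$(awk '/^MemTotal:/ {print $2}' /proc/meminfo)
#   has_min_ram "$mem_kb" || echo "less than 1 GB of RAM"
#   is_supported_fs "$(df -PT /data | awk 'NR==2 {print $2}')" || echo "unsupported filesystem"
#   has_command rsync || echo "rsync is not installed"
#   pgrep -x ntpd >/dev/null || echo "ntpd is not running"
```

Keeping each check as a small function that takes its input as an argument makes it easy to run the same checks on every node of the cluster.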
Understanding the preinstallation steps
Several preinstallation steps need to be performed for Vertica to run smoothly. Some of the important ones are covered here.
Swap space is the space on the physical disk that is used when primary memory (RAM) is full. Although swap space is used in sync with RAM, it is not a replacement for RAM. It is suggested to have 2 GB of swap space available for Vertica. Additionally, Vertica performs well when swap-space-related files and Vertica data files are configured to store on different physical disks.
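As a rough illustration of the 2 GB suggestion, the configured swap can be read from /proc/meminfo on Linux. This is a sketch under that assumption; the function names are hypothetical.

```shell
# Prints SwapTotal in kB from a meminfo-style file (defaults to /proc/meminfo).
swap_total_kb() {
    awk '/^SwapTotal:/ {print $2}' "${1:-/proc/meminfo}"
}

# Succeeds when the given swap size (in kB) is at least 2 GB.
has_min_swap() {
    [ "$1" -ge $((2 * 1024 * 1024)) ]
}

# Example usage (commented out):
#   has_min_swap "$(swap_total_kb)" || echo "less than 2 GB of swap configured"
```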
Dynamic CPU frequency scaling
Dynamic CPU frequency scaling, or CPU throttling, is where the system automatically adjusts the frequency of the microprocessor. The clear advantage of this technique is that it conserves energy and reduces the heat generated. The drawback is that a CPU running at a reduced frequency issues fewer instructions per second, and with scaling enabled, the CPU may not ramp up to full speed promptly when load arrives. Hence, it is best to disable dynamic CPU frequency scaling. It can be disabled from the Basic Input/Output System (BIOS). Note that different hardware might have different settings for disabling CPU frequency scaling.
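On Linux systems that expose the cpufreq interface through sysfs, the active governor of each CPU can be inspected before installing. The sketch below assumes the usual /sys/devices/system/cpu layout and treats any governor other than performance as scaling being active; the function name is hypothetical.

```shell
# Prints each CPU whose cpufreq governor is not "performance".
# The sysfs root is a parameter so the function can be tried on a copy.
non_performance_cpus() {
    local root=${1:-/sys/devices/system/cpu}
    local f
    for f in "$root"/cpu[0-9]*/cpufreq/scaling_governor; do
        # Skip CPUs with no readable cpufreq entry (scaling not exposed).
        [ -r "$f" ] || continue
        if [ "$(cat "$f")" != "performance" ]; then
            # Report only the cpuN directory name.
            basename "$(dirname "$(dirname "$f")")"
        fi
    done
}

# Example usage (commented out):
#   non_performance_cpus
```

An empty result means either that every CPU already uses the performance governor or that the cpufreq interface is not exposed at all.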
Understanding disk space requirements
It is suggested to keep 20-30 percent of the disk space on each node free as a buffer. Vertica uses this buffer space to store temporary data, such as data produced by mergeout operations, hash joins, and sorts, and data arising from managing nodes in the cluster.
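The buffer rule is simple arithmetic: the free percentage (100 minus the used percentage) should be at least 20, or 30 to be conservative. A minimal sketch with hypothetical function names follows; used_percent relies on the POSIX output format of df -P.

```shell
# Succeeds when a usage percentage leaves at least the requested free buffer.
# $1 = used percent (for example, 65), $2 = minimum free percent (default 20).
has_disk_buffer() {
    local used_pct=$1 min_free=${2:-20}
    [ $((100 - used_pct)) -ge "$min_free" ]
}

# Prints the used percentage of the filesystem holding the given path.
used_percent() {
    df -P "$1" | awk 'NR==2 {gsub("%", "", $5); print $5}'
}

# Example usage (commented out; /data is a placeholder path):
#   has_disk_buffer "$(used_percent /data)" 30 || echo "less than 30 percent free"
```

For example, a disk that is 75 percent used passes the 20 percent check but fails the 30 percent one.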
Steps to install Vertica
Installing Vertica is fairly simple. In the following steps, we will set up a two-node cluster:
- Download the Vertica installation package from http://my.vertica.com/ according to the Linux OS that you are going to use.
- Now log in as root or use the sudo command.
- After downloading the installation package, install the package using the standard command:
- For .rpm (CentOS/RedHat) packages, the command will be:
rpm -Uvh vertica-x.x.x-x.x.rpm
- For .deb (Ubuntu) packages, the command will be:
dpkg -i vertica-x.x.x-x.x.deb
Refer to the following screenshot for more details:
Running the Vertica package
- In the previous step, we installed the package on only one machine. Note that Vertica is installed under /opt/vertica. Now, we will set up Vertica on the other nodes as well. To do that, run the following command on the same node:
/opt/vertica/sbin/install_vertica -s host_list -r rpm_package -u dba_username
Here, -s is the list of hostnames/IPs of all the nodes of the cluster, including the one on which Vertica is already installed; -r is the path of the Vertica package; and -u is the username that we wish to create for working on Vertica. This user has sudo privileges. If prompted, provide a password for the new user. If we do not specify a username, Vertica creates dbadmin as the user, as shown in the following example:
[impetus@centos64a setups]$ sudo /opt/vertica/sbin/install_vertica -s
"/ilabs/setups/vertica-6.1.3-0.x86_64.RHEL5.rpm" -u dbadmin
Vertica Analytic Database 6.1.3-0 Installation Tool
Upgrading admintools meta data format..
scanning /opt/vertica/config/users
Starting installation tasks...
Getting system information for cluster (this may take a while)....
Enter password for firstname.lastname@example.org (2 attempts left):
backing up admintools.conf on 192.168.56.101
Default shell on nodes:
192.168.56.101 /bin/bash
192.168.56.102 /bin/bash
Installing rpm on 1 hosts....
installing node.... 192.168.56.102
NTP service not synchronized on the hosts: ['192.168.56.101', '192.168.56.102']
Check your NTP configuration for valid NTP servers.
Vertica recommends that you keep the system clock synchronized using NTP or
some other time synchronization mechanism to keep all hosts synchronized.
Time variances can cause (inconsistent) query results when
using Date/Time Functions. For instructions, see:
* http://kbase.redhat.com/faq/FAQ_43_755.shtm
* http://kbase.redhat.com/faq/FAQ_43_2790.shtm
Info: the package 'pstack' is useful during troubleshooting.
Vertica recommends this package is installed.
Checking/fixing OS parameters.....
Setting vm.min_free_kbytes to 37872 ...
Info! The maximum number of open file descriptors is less than 65536
Setting open filehandle limit to 65536 ...
Info! The session setting of pam_limits.so is not set in /etc/pam.d/su
Setting session of pam_limits.so in /etc/pam.d/su ...
Detected cpufreq module loaded on 192.168.56.101
Detected cpufreq module loaded on 192.168.56.102
CPU frequency scaling is enabled. This may adversely affect the performance
of your database. Vertica recommends that cpu frequency scaling be turned off or set
to 'performance'
Creating/Checking Vertica DBA group
Creating/Checking Vertica DBA user
Password for dbadmin:
Installing/Repairing SSH keys for dbadmin
Creating Vertica Data Directory...
Testing N-way network test. (this may take a while)
All hosts are available ...
Verifying system requirements on cluster.
IP configuration ...
IP configuration ...
Testing hosts (1 of 2)....
Running Consistency Tests
LANG and TZ environment variables ...
Running Network Connectivity and Throughput Tests...
Waiting for 1 of 2 sites... ...
Test of host 192.168.56.101 (ok)
====================================
Enough RAM per CPUs (ok)
--------------------------------
Test of host 192.168.56.102 (ok)
====================================
Enough RAM per CPUs (FAILED)
--------------------------------
Vertica requires at least 1 GB per CPU (you have 0.71 GB/CPU)
See the Vertica Installation Guide for more information.
Consistency Test (ok)
=========================
Info: The $TZ environment variable is not set on 192.168.56.101
Info: The $TZ environment variable is not set on 192.168.56.102
Updating spread configuration...
Verifying spread configuration on whole cluster.
Creating node node0001 definition for host 192.168.56.101 ... Done
Creating node node0002 definition for host 192.168.56.102 ... Done
Error Monitor 0 errors 4 warnings
Installation completed with warnings.
Installation complete.
To create a database:
1. Logout and login as dbadmin.**
2. Run /opt/vertica/bin/adminTools as dbadmin
3. Select Create Database from the Configuration Menu
** The installation modified the group privileges for dbadmin.
If you used sudo to install vertica as dbadmin, you will
need to logout and login again before the privileges are applied.
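To make the shape of the install_vertica invocation concrete, the following sketch assembles the command line for a set of hosts and prints it instead of running it. It assumes that the host list is passed to -s as a single comma-separated value, which the article does not state explicitly; the function names and the package path in the usage comment are hypothetical.

```shell
# Joins its arguments with commas (assumed format of the -s host list).
join_hosts() {
    local IFS=,
    echo "$*"
}

# Prints (does not run) an install_vertica command line.
# $1 = package path, $2 = DBA username, remaining arguments = hosts.
print_install_command() {
    local package=$1 dba_user=$2
    shift 2
    echo "/opt/vertica/sbin/install_vertica -s $(join_hosts "$@") -r $package -u $dba_user"
}

# Example usage (commented out; the package path is a placeholder):
#   print_install_command /tmp/vertica.rpm dbadmin 192.168.56.101 192.168.56.102
```

Printing the command first gives a chance to review the host list before running it as root or with sudo.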
- After we have installed Vertica on all the desired nodes, it is time to create a database. Log in as the new user (dbadmin in default scenarios) and launch the administration tools by running /opt/vertica/bin/adminTools, as the installer output suggests.
- If you are connecting to admin tools for the first time, you will be prompted for a license key. If you have the license file, then enter its path; if you want to use the community edition, then just click on OK.
License key prompt
- After the previous step, you will be asked to review and accept the End-user License Agreement (EULA).
Prompt for EULA
After reviewing and accepting the EULA, you will be presented with the main menu of the admin tools of Vertica.
Admin Tools Main Menu
- Now, to create a database, navigate to Administration Tools | Configuration Menu | Create Database.
Create database option in the Configuration menu
- Now, you will be asked to enter a database name and a comment that you would like to associate with the database.
Name and Comment of the Database
- After entering the name and comment, you will be prompted to enter a password for this database.
Password for the New database
- After entering and re-entering (for confirmation) the password, you need to provide pathnames where the files related to user data and catalog data will be stored.
Catalog and Data Pathname
After providing all the necessary information related to the database, you will be asked to select hosts on which the database needs to be deployed. Once all the desired hosts are selected, Vertica will ask for one final check.
Final confirmation for a database creation
- Vertica will now create and deploy the database.
- Once the database is created, we can connect to it using the VSQL tool or perform admin tasks.
This article has briefly explained how to install Vertica. As a next step, you can create sample tables and perform basic CRUD operations.
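As one possible starting point for such experiments, the sketch below prints a small CRUD script that can be piped into vsql. The table, its columns, the database name mydb, and the password in the usage comment are all hypothetical; the COMMIT statements are included on the assumption that the session does not autocommit.

```shell
# Prints a small CRUD script; table and column names are illustrative only.
sample_crud_sql() {
    cat <<'SQL'
CREATE TABLE customers (id INT, name VARCHAR(50));
INSERT INTO customers VALUES (1, 'Alice');
COMMIT;
SELECT * FROM customers;
UPDATE customers SET name = 'Bob' WHERE id = 1;
DELETE FROM customers WHERE id = 1;
COMMIT;
DROP TABLE customers;
SQL
}

# Example usage as dbadmin (commented out; needs a running database, and
# mydb and the password are placeholders):
#   sample_crud_sql | /opt/vertica/bin/vsql -U dbadmin -d mydb -w 'secret'
```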
For a clean installation, it is recommended to meet all the minimum requirements of Vertica. Note that the client APIs and Vertica Management Console are not included in the basic package and need to be installed separately.