You're reading from  Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Author: Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

Chapter 3. The Analytics Toolkit

Several platforms are in use today for large-scale data analytics. Broadly, they fall into two groups: platforms used primarily for data mining, such as the analysis of large datasets with NoSQL platforms, and platforms used for data science, that is, machine learning and predictive analytics. Often a solution has both characteristics: a robust underlying platform for storing and managing data, with tools built on top of it that provide additional data science capabilities.

In this chapter, we will show you how to install and configure your Analytics Toolkit, the collection of software that we'll use in the remaining chapters. We will cover the following topics:

  • Components of the Analytics Toolkit
  • System recommendations
    • Installing on a laptop or workstation
    • Installing on the cloud
  • Installing Hadoop
    • Hadoop distributions
    • Cloudera Distribution of Hadoop (CDH)
  • Installing Spark
  • Installing R and Python

Components of the Analytics Toolkit


This book uses several key technologies for big data mining and, more generally, data science. The Analytics Toolkit includes Hadoop and Spark as well as R and Python, all of which can be installed either locally on the user's machine or on a cloud platform. Your Analytics Toolkit will consist of:

Software/platform     Used for data mining    Used for machine learning
Hadoop                X                       -
Spark                 X                       X
Redis                 X                       -
MongoDB               X                       -
Open Source R         X                       X
Python (Anaconda)     X                       X
Vowpal Wabbit         -                       X
LIBSVM, LIBLINEAR     -                       X
H2O                   -                       X

System recommendations


If you're installing Hadoop on a local machine, we recommend at least 4 GB of RAM, and ideally 8 GB or more, along with at least 50 GB of free disk space. With less than 8 GB of memory, performance will be lower, but you will still be able to carry out the exercises. Please note that these numbers are estimates that apply to the exercises outlined in this book. A production environment will naturally have much higher requirements, which will be discussed at a later stage.
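As a rough illustration, the recommendations above can be turned into a small pre-installation check. This is only a sketch: the function names `check_system` and `free_disk_gb` and the threshold constants are our own, and the RAM figure is passed in by hand because the Python standard library has no portable way to read it.

```python
import shutil

# Thresholds taken from the recommendations above (assumed split: 4 GB
# as the lower bound, 8 GB as the comfortable value, 50 GB of free disk).
MIN_RAM_GB = 4
IDEAL_RAM_GB = 8
MIN_FREE_DISK_GB = 50


def check_system(ram_gb, free_disk_gb):
    """Return a list of warnings; an empty list means the machine looks fine."""
    warnings = []
    if ram_gb < MIN_RAM_GB:
        warnings.append(f"RAM of {ram_gb:.1f} GB is below the recommended {MIN_RAM_GB} GB")
    elif ram_gb < IDEAL_RAM_GB:
        warnings.append(f"RAM of {ram_gb:.1f} GB will work, but {IDEAL_RAM_GB} GB or more is better")
    if free_disk_gb < MIN_FREE_DISK_GB:
        warnings.append(f"Free disk of {free_disk_gb:.1f} GB is below the recommended {MIN_FREE_DISK_GB} GB")
    return warnings


def free_disk_gb(path="."):
    """Free space, in GB, on the volume that holds `path`."""
    return shutil.disk_usage(path).free / 1024 ** 3


# Example run: replace 8 with the amount of RAM in your own machine.
results = check_system(ram_gb=8, free_disk_gb=free_disk_gb())
for line in results or ["System meets the recommendations"]:
    print(line)
```

A machine with 6 GB of RAM and enough disk space would produce a single warning rather than a failure, which matches the note that lower memory slows the exercises down without preventing them.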

Installing analytics software, especially platforms such as Hadoop, can be technically challenging, and users commonly encounter errors that must be painstakingly resolved. As a result, they spend more time fixing installation issues than they ideally should. This overhead can easily be alleviated by using virtual machines ...

Installing Hadoop


There are several ways to install Hadoop. The most common ones are:

  1. Installing Hadoop from the source files from https://hadoop.apache.org
  2. Installing using open source distributions from commercial vendors such as Cloudera and Hortonworks

In this exercise, we will install the Cloudera Distribution of Apache Hadoop (CDH), an integrated platform consisting of several Hadoop and Apache-related products. Cloudera is a popular commercial Hadoop vendor that, in addition to its own release of Hadoop, provides managed services for enterprise-scale Hadoop deployments. In our case, we'll be installing the CDH Quickstart VM in a VM environment.

Installing Oracle VirtualBox

A VM environment is essentially a copy of an existing operating system that may have preinstalled software. The VM can be delivered in a single file, which allows users to replicate an entire machine by just launching a file instead of reinstalling the OS and configuring it to mimic another system. The VM operates in a self...

Installing Packt Data Science Box


We have also created a separate virtual machine for some of the exercises in the book.

Download the Packt Data Science Virtual Machine Vagrant files from https://gitlab.com/packt_public/vm.

To load the VM, first download Vagrant from https://www.vagrantup.com/downloads.html.

Download page for Vagrant

Once the download completes, install Vagrant by running the downloaded installation file. When the installation finishes, you'll be prompted to restart the machine. Restart your system and then proceed to the next step of loading the Vagrant file:

Completing the Vagrant Installation

Click confirm on the final step to restart:

Restarting System

In the terminal or command prompt, go to the directory where you have downloaded the Packt Data Science Vagrant files and run the following commands (shown in Windows):

$ vagrant box add packtdatascience packtdatascience.box 

==> box: Box file was not detected as metadata. Adding it directly... 

==> box...
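Before running the command above, it can help to confirm that Vagrant is on the PATH and that the box file is in the current directory. The following is a minimal sketch of such a check; the helper names `preflight` and `add_box` are hypothetical, and the only Vagrant invocation used is the same `vagrant box add` command shown above.

```python
import shutil
import subprocess
from pathlib import Path


def preflight(box_file, tools=("vagrant",)):
    """Return a list of problems that would make `vagrant box add` fail early."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH.
        if shutil.which(tool) is None:
            problems.append(f"{tool} was not found on PATH")
    if not Path(box_file).is_file():
        problems.append(f"box file not found: {box_file}")
    return problems


def add_box(box_name, box_file):
    """Run `vagrant box add NAME FILE`, but only if the preflight check passes."""
    problems = preflight(box_file)
    if problems:
        raise RuntimeError("; ".join(problems))
    subprocess.run(["vagrant", "box", "add", box_name, str(box_file)], check=True)
```

Calling `add_box("packtdatascience", "packtdatascience.box")` from the directory that holds the downloaded files mirrors the command shown earlier, with a clearer error message if a prerequisite is missing.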

Installing Spark


The CDH Quickstart VM includes Spark as one of its components, so it is not necessary to install Spark separately. We'll discuss Spark in more detail in the chapter dedicated to the subject.

Further, our tutorial on Spark will use the Databricks Community Edition, which can be accessed from https://community.cloud.databricks.com/. Instructions on creating an account and executing the necessary steps are provided in Chapter 6, Spark for Big Data Analytics.

Installing R


R is a statistical language that has become extremely popular over the last 3-5 years, especially as a platform that can be used for a wide variety of use cases, ranging from simple data mining to complex machine learning algorithms. According to an article posted in IEEE Spectrum in mid-2016, R takes the No. 5 spot among the Top 10 languages in the world.

Open source R can be downloaded from https://www.r-project.org via the CRAN site located at https://cran.r-project.org/mirrors.html.

Alternatively, you can download R from the Microsoft R Open page at https://mran.microsoft.com/rro/. This was earlier known as Revolution R Open, an enhanced version of open source R released by Revolution Analytics. After Microsoft acquired Revolution Analytics in 2015, it was rebranded under the new ownership.

Microsoft R Open includes all the functionality of R, along with the following:

  • Numerous R packages installed by default as well as a set of specialized packages released by Microsoft...

Installing RStudio


RStudio is an application released by rstudio.org that provides a powerful, feature-rich graphical IDE (integrated development environment).

The following are the steps to install RStudio:

R Studio Versions

  1. Click on the link relevant for your operating system, download and install the respective file:

Downloading RStudio

  2. On macOS, you may simply move the downloaded file to the Applications folder. On Windows and Linux operating systems, double-click the downloaded file to complete the installation:

RStudio on the Mac (copy to Applications folder)

Installing Python


Similar to R, Python has gained popularity due to its versatile and diverse range of packages. Python is generally available as part of most modern Linux-based operating systems. For our exercises, we will use Anaconda from Continuum Analytics®, which enhances the base open source Python offering with many data-mining- and machine-learning-related packages that are installed natively as part of the platform. This spares the practitioner from manually downloading and installing packages. In that sense, it is similar in spirit to Microsoft R Open: just as Microsoft R enhances the base open source R offering with additional functionality, Anaconda builds on base open source Python to provide new capabilities.

The steps for installing Anaconda Python are as follows:

  1. Go to https://www.continuum.io/downloads:

Python Anaconda Homepage

  2. Download the distribution that is appropriate for your system. Note that we'll be...
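Once Anaconda is installed, a quick way to confirm that the bundled packages are visible to the interpreter is to ask Python's import system for each one. The snippet below is a sketch only; the helper `missing_packages` and the checklist `EXPECTED_PACKAGES` are our own, and the list is an assumed sample of packages that ship with Anaconda rather than a complete inventory.

```python
import importlib.util
import sys

# Assumed sample of packages bundled with Anaconda (import names, so
# scikit-learn appears as "sklearn"); adjust to the ones you need.
EXPECTED_PACKAGES = ["numpy", "pandas", "scipy", "sklearn", "matplotlib"]


def missing_packages(names):
    """Return the names in `names` that the import system cannot locate."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Show which interpreter is running, then report anything that is missing.
print("Python", sys.version.split()[0], "at", sys.executable)
missing = missing_packages(EXPECTED_PACKAGES)
print("Missing packages:", ", ".join(missing) if missing else "none")
```

If the interpreter path printed on the first line does not point into the Anaconda installation directory, the system Python is probably being picked up instead.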

Summary


This chapter introduced some of the key tools used for data science. In particular, it demonstrated how to download and install the virtual machine for the Cloudera Distribution of Hadoop (CDH), Spark, R, RStudio, and Python. Although the user can download the source code of Hadoop and install it on, say, a Unix system, doing so is usually fraught with issues and requires a fair amount of debugging. Using a VM instead allows the user to begin using and learning Hadoop with minimal effort, as it is a complete preconfigured environment.

Additionally, R and Python are the two most commonly used languages for machine learning and, more generally, analytics. They are available for all popular operating systems. Although they can be installed in the VM, the user is encouraged to install them on a local machine (laptop or workstation) if feasible, as they will generally perform better there.

In the next chapter, we will dive deeper into the details of Hadoop and its core components and concepts...
