You're reading from Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Languages: Java
Tools: Hadoop, Apache Spark
Concepts: Big Data
Author: Nataraj Dasgupta

Chapter 3. The Analytics Toolkit

Several platforms are used today for large-scale data analytics. Broadly, they fall into two groups: platforms used primarily for data mining, such as the analysis of large datasets on NoSQL platforms, and platforms used for data science, that is, machine learning and predictive analytics. A solution often has both characteristics: a robust underlying platform for storing and managing data, with additional data science capabilities built on top of it.

In this chapter, we will show you how to install and configure your Analytics Toolkit, a collection of software that we'll use in the remaining chapters. We will cover the following topics:

- Components of the Analytics Toolkit
- System recommendations
  - Installing on a laptop or workstation
  - Installing on the cloud
- Installing Hadoop
  - Hadoop distributions
  - Cloudera Distribution of Hadoop (CDH)
- Installing Spark
- Installing R and Python

Components of the Analytics Toolkit

This book uses several key technologies for big data mining and, more generally, data science. The toolkit is built around Hadoop, Spark, R, and Python, each of which can be installed either locally on the user's machine or on a cloud platform. In full, your Analytics Toolkit will consist of:

Software/platform	Used for data mining	Used for machine learning
Hadoop	X
Spark	X	X
Redis	X
MongoDB	X
Open Source R	X	X
Python (Anaconda)	X	X
Vowpal Wabbit		X
LIBSVM, LIBLINEAR		X
H2O		X

System recommendations

If you're installing Hadoop on a local machine, your system should have at least 4 GB of RAM (memory), ideally 8 GB or more, and at least 50 GB of free disk space. With 8 GB or more, most applications will run comfortably; with less, performance will be lower, but you will still be able to carry out the exercises. Please note that these numbers are estimates that apply to the exercises outlined in this book. A production environment will naturally have much higher requirements, which will be discussed at a later stage.
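As a rough way to compare a machine against these recommendations, the following is a minimal Python sketch. The function names and category labels are hypothetical and chosen only for illustration; the thresholds are the ones stated above. Memory detection relies on `os.sysconf`, which is not available on every platform (for example, Windows), so the helper returns `None` when the value cannot be determined.

```python
import os
import shutil

# Thresholds taken from the recommendations above (in GB).
MIN_RAM_GB = 4
RECOMMENDED_RAM_GB = 8
MIN_FREE_DISK_GB = 50


def assess_system(ram_gb, free_disk_gb):
    """Classify a machine against the recommendations (labels are illustrative)."""
    if free_disk_gb < MIN_FREE_DISK_GB:
        return "insufficient disk"
    if ram_gb >= RECOMMENDED_RAM_GB:
        return "recommended"
    if ram_gb >= MIN_RAM_GB:
        return "minimum"
    # The exercises can still be run, but performance will be lower.
    return "below minimum"


def free_disk_gb(path="."):
    """Return the free disk space, in GiB, of the volume containing path."""
    return shutil.disk_usage(path).free / 1024 ** 3


def total_ram_gb():
    """Return total physical memory in GiB, or None if it cannot be determined."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        # os.sysconf or these names are missing on some platforms.
        return None
    return page_size * page_count / 1024 ** 3
```

For example, `assess_system(8, 100)` returns `"recommended"`, while `assess_system(2, 60)` returns `"below minimum"`.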

Installing analytics software, especially platforms such as Hadoop, can be technically challenging, and it is very common for users to encounter errors that must be painstakingly resolved. Users often spend more time resolving errors and fixing installation issues than they ideally should. This additional overhead can easily be alleviated by using virtual machines ...

Installing Hadoop

There are several ways to install Hadoop. The most common ones are:

Installing Hadoop from the source files from https://hadoop.apache.org
Installing using open source distributions from commercial vendors such as Cloudera and Hortonworks

In this exercise, we will install the Cloudera Distribution of Apache Hadoop (CDH), an integrated platform consisting of several Hadoop and Apache-related products. Cloudera is a popular commercial Hadoop vendor that provides managed services for enterprise-scale Hadoop deployments in addition to its own release of Hadoop. In our case, we'll be installing the CDH Quickstart VM in a VM environment.

Installing Oracle VirtualBox

A VM environment is essentially a copy of an existing operating system that may have preinstalled software. The VM can be delivered in a single file, which allows users to replicate an entire machine by just launching a file instead of reinstalling the OS and configuring it to mimic another system. The VM operates in a self...

Installing Packt Data Science Box

We have also created a separate virtual machine for some of the exercises in the book.

Download the Packt Data Science Virtual Machine Vagrant files from https://gitlab.com/packt_public/vm.

To load the VM, first download Vagrant from https://www.vagrantup.com/downloads.html.

Download page for Vagrant

Once you have completed the download, install Vagrant by running the downloaded installation file. When the installation completes, you'll be prompted to restart the machine. Restart your system and then proceed to the next step of loading the Vagrant file:

Completing the Vagrant Installation

Click confirm on the final step to restart:

Restarting System

In the terminal or command prompt, go to the directory where you have downloaded the Packt Data Science Vagrant files and run the following commands (shown in Windows):

$ vagrant box add packtdatascience packtdatascience.box 

==> box: Box file was not detected as metadata. Adding it directly... 

==> box...
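The `vagrant box add` command shown above can also be scripted. The following is a minimal Python sketch, assuming Vagrant is installed and the script is run from the directory containing the box file. The helper names (`box_add_command`, `vagrant_available`, `add_box`) are hypothetical and are not part of the book's materials; the box name and file name are the ones used in the command above.

```python
import os
import shutil
import subprocess

# Box name and file name as used in the command shown above.
BOX_NAME = "packtdatascience"
BOX_FILE = "packtdatascience.box"


def box_add_command(name=BOX_NAME, box_file=BOX_FILE):
    """Build the argument list for 'vagrant box add <name> <file>'."""
    return ["vagrant", "box", "add", name, box_file]


def vagrant_available():
    """Return True if a 'vagrant' executable is found on the PATH."""
    return shutil.which("vagrant") is not None


def add_box(name=BOX_NAME, box_file=BOX_FILE):
    """Run 'vagrant box add' and return its exit code.

    Returns None without running anything if Vagrant is not installed
    or the box file does not exist.
    """
    if not vagrant_available() or not os.path.isfile(box_file):
        return None
    return subprocess.run(box_add_command(name, box_file)).returncode
```

Calling `add_box()` returns `None` when either precondition fails, which gives a quick way to confirm that the restart has put Vagrant on the PATH before loading the box.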

Installing Spark

The CDH Quickstart VM includes Spark as one of its components, so it is not necessary to install Spark separately. We'll discuss Spark in more detail in the chapter dedicated to the subject.

Further, our tutorial on Spark will use the Databricks Community Edition, which can be accessed from https://community.cloud.databricks.com/. Instructions on creating an account and executing the necessary steps are provided in Chapter 6, Spark for Big Data Analytics.

Installing R

R is a statistical language that has become extremely popular over the last 3-5 years, especially as a platform that can be used for a wide variety of use cases, ranging from simple data mining to complex machine learning algorithms. According to an article posted in IEEE Spectrum in mid-2016, R takes the No. 5 spot among the Top 10 languages in the world.

Open source R can be downloaded from https://www.r-project.org via the CRAN site located at https://cran.r-project.org/mirrors.html.

Alternatively, you can download R from the Microsoft R Open page at https://mran.microsoft.com/rro/. This was earlier known as Revolution R Open, an enhanced version of open source R released by Revolution Analytics. After Microsoft acquired Revolution Analytics in 2015, it was rebranded under the new ownership.

Microsoft R Open includes all the functionality of R and adds the following:

Numerous R packages installed by default as well as a set of specialized packages released by Microsoft...

Installing RStudio

RStudio is an application released by RStudio (https://www.rstudio.com) that provides a powerful, feature-rich graphical IDE (integrated development environment) for R.

The following are the steps to install RStudio:

Go to https://www.rstudio.com/products/rstudio/download:

RStudio Versions

Click on the link relevant for your operating system, download and install the respective file:

Downloading RStudio

Note that on macOS, you may simply move the downloaded file to the Applications folder. On Windows and Linux operating systems, double-click the downloaded file and follow the steps to complete the installation:

RStudio on the Mac (copy to Applications folder)

Installing Python

Like R, Python has gained popularity because of its versatile and diverse range of packages. Python is generally available as part of most modern Linux-based operating systems. For our exercises, we will use Anaconda from Continuum Analytics®, which enhances the base open source Python offering with many data mining and machine learning packages that are installed as part of the platform. This saves the practitioner from having to download and install packages manually. In that sense, it is similar in spirit to Microsoft R Open: just as Microsoft R Open enhances base open source R with additional functionality, Anaconda builds on base open source Python to provide new capabilities.

We proceed with the installation as follows:

Steps for installing Anaconda Python
Go to https://www.continuum.io/downloads:

Python Anaconda Homepage

Download the distribution that is appropriate for your system. Note that we'll be...
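Once Anaconda is installed, one way to confirm that its bundled packages are visible to the interpreter is a small check such as the following. This is a minimal sketch: the function name `missing_packages` is hypothetical, and the package list is only an illustrative selection of libraries commonly shipped with Anaconda, not a list taken from this book.

```python
import importlib.util

# Illustrative selection of import names; adjust to the packages you need.
PACKAGES = ("numpy", "pandas", "scipy", "sklearn", "matplotlib")


def missing_packages(names=PACKAGES):
    """Return the names in `names` that cannot be found by the import system."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Not found:", ", ".join(missing))
    else:
        print("All listed packages were found.")
```

The check uses `importlib.util.find_spec` rather than importing each package, so it only tests whether a package can be located and does not load it.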

Summary

This chapter introduced some of the key tools used for data science. In particular, it demonstrated how to download and install the virtual machine for the Cloudera Distribution of Hadoop (CDH), which includes Spark, as well as R, RStudio, and Python. Although the user can download the source code of Hadoop and install it on, say, a Unix system, the process is usually fraught with issues and requires a fair amount of debugging. Using a VM instead allows the user to begin using and learning Hadoop with minimal effort, as it is a complete preconfigured environment.

Additionally, R and Python are the two most commonly used languages for machine learning and, more generally, analytics. They are available for all popular operating systems. Although they can be installed in the VM, the user is encouraged to install them on a local machine (laptop or workstation) if feasible, as they will generally perform better there.

In the next chapter, we will dive deeper into the details of Hadoop and its core components and concepts...


Author (1)

Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.