
You're reading from  Practical Big Data Analytics

Product type: Book
Published in: Jan 2018
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781783554393
Edition: 1st Edition
Author: Nataraj Dasgupta

Nataraj Dasgupta is the vice president of advanced analytics at RxDataScience Inc. Nataraj has been in the IT industry for more than 19 years, and has worked in the technical and analytics divisions of Philip Morris, IBM, UBS Investment Bank, and Purdue Pharma. At Purdue Pharma, Nataraj led the data science division, where he developed the company's award-winning big data and machine learning platform. Prior to Purdue, at UBS, he held the role of Associate Director, working with high-frequency and algorithmic trading technologies in the foreign exchange trading division of the bank.

Chapter 6. Spark for Big Data Analytics

As Hadoop and the related technologies in its ecosystem gained prominence, a few salient deficiencies of the Hadoop operational model became apparent. In particular, the ingrained reliance on the MapReduce paradigm, along with other MapReduce-related constraints, meant that only major firms deeply invested in these technologies could make truly effective use of the Hadoop ecosystem.

At the UC Berkeley Electrical Engineering and Computer Sciences (EECS) Annual Research Symposium of 2011, a vision for a new research group at the university was announced during a presentation by Prof. Ion Stoica (https://amplab.cs.berkeley.edu/about/). It laid out the foundation of what was to become a pivotal unit that would profoundly change the landscape of Big Data. The AMPLab, launched in February 2011, aimed to deliver a scalable and unified solution by integrating Algorithms, Machines, and People that could cater to future...

The advent of Spark


When Spark 1.0, its first major release, became available in 2014, Hadoop had already enjoyed several years of growth in the commercial space, dating back to around 2009. Although Hadoop solved a major hurdle by making it possible to analyze large, terabyte-scale datasets efficiently with broadly accessible distributed computing methods, it still had shortfalls that hindered its wider acceptance.

Limitations of Hadoop

A few of the common limitations with Hadoop were as follows:

  • I/O-bound operations: Because Hadoop relied on local disk storage for saving and retrieving data, any operation performed in Hadoop incurred an I/O overhead. The problem became more acute with larger datasets that involved thousands of blocks of data across hundreds of servers. To be fair, the ability to coordinate concurrent I/O operations (via HDFS) formed the foundation of distributed computing in the Hadoop world. However, leveraging the capability and tuning the Hadoop cluster in an efficient manner across different...

Spark practicals


In this section, we will create an account on Databricks' Community Edition and complete a hands-on exercise that will walk the reader through the basics of actions, transformations, and RDD concepts in general.
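
Before moving to the hosted environment, the distinction between transformations and actions can be sketched in plain Python. The LazyDataset class below is a hypothetical stand-in written for this illustration; it is not Spark's API, and it assumes only the general behavior that transformations are recorded lazily while actions trigger the computation. The method names map, filter, collect, count, and reduce are chosen to mirror common RDD operations.

```python
import functools


class LazyDataset:
    """Toy stand-in for an RDD (hypothetical, not part of Spark)."""

    def __init__(self, source, steps=()):
        self._source = source   # any re-iterable collection, e.g. a range or list
        self._steps = steps     # pending (kind, function) pairs, in order

    # Transformations: return a new dataset and do no work yet.
    def map(self, fn):
        return LazyDataset(self._source, self._steps + (("map", fn),))

    def filter(self, fn):
        return LazyDataset(self._source, self._steps + (("filter", fn),))

    def _evaluate(self):
        # Replay every recorded step over the source data.
        items = iter(self._source)
        for kind, fn in self._steps:
            items = map(fn, items) if kind == "map" else filter(fn, items)
        return items

    # Actions: run the recorded steps and return a concrete result.
    def collect(self):
        return list(self._evaluate())

    def count(self):
        return sum(1 for _ in self._evaluate())

    def reduce(self, fn):
        return functools.reduce(fn, self._evaluate())


calls = []  # records every element that square() actually processes


def square(x):
    calls.append(x)
    return x * x


numbers = LazyDataset(range(1, 6))
even_squares = numbers.map(square).filter(lambda x: x % 2 == 0)
pending_calls = len(calls)                        # nothing has been computed yet

result = even_squares.collect()                   # first action runs the steps
total = even_squares.reduce(lambda a, b: a + b)   # second action reruns them
```

In this sketch, each action replays the full chain of transformations from the source, so square runs once per element per action. Spark behaves similarly for an RDD that has not been cached, which is one reason to persist a dataset that several actions will reuse.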

Signing up for Databricks Community Edition

The following steps outline the process of signing up for the Databricks Community Edition:

  1. Click on the START TODAY button and enter your information:
  2. Confirm that you have read and agree to the terms in the popup menu (scroll down to the bottom for the Agree button):
  3. Check your email for a confirmation email from Databricks and click on the link to confirm your account:
  4. Once you click on the link to confirm your account, you'll be taken to a login screen where you can log on using the email address and password you used to sign up for the account:
  5. After logging in, click on Cluster to set up a Spark cluster, as shown in the following figure:
  6. Enter Packt_Exercise as the Cluster Name and...

Spark exercise - hands-on with Spark (Databricks)


This notebook is based on tutorials conducted by Databricks (https://databricks.com/). The tutorial uses the Databricks Community Edition of Spark, which you can sign up for at https://databricks.com/try-databricks. Databricks is a leading provider of the commercial, enterprise-supported version of Spark.

In this tutorial, we will introduce a few basic commands used in Spark. Users are encouraged to try out more extensive Spark tutorials and notebooks that are available on the web for more detailed examples.

Documentation for Spark's Python API can be found at https://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.sql.

The data for this book was imported into the Databricks Spark platform. For more information on importing data, see Importing Data - Databricks (https://docs.databricks.com/user-guide/importing-data.html).

# COMMAND ----------

# The SparkContext/SparkSession is the entry point for all Spark...

Summary


In this chapter, we read about some of the core features of Spark, one of the most prominent technologies in the Big Data landscape today. Spark has matured rapidly since its 1.0 release in 2014, when it was positioned as a Big Data solution that alleviated many of Hadoop's shortcomings, such as I/O contention.

Today, Spark has several components, including dedicated ones for streaming analytics and machine learning, and is being actively developed. Databricks is the leading provider of the commercially supported version of Spark and also hosts a very convenient cloud-based Spark environment with limited resources that any user can access at no charge. This has dramatically lowered the barrier to entry as users do not need to install a complete Spark environment to learn and use the platform.

In the next chapter, we will begin our discussion on machine learning. Most of the text up to this point has focused on the management of large-scale data. Making use of the data...
