You're reading from  Scala and Spark for Big Data Analytics

Product type: Book
Published in: Jul 2017
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781785280849
Edition: 1st Edition
Authors (2):
Md. Rezaul Karim

Md. Rezaul Karim is a researcher, author, and data science enthusiast with a strong computer science background, coupled with 10 years of research and development experience in machine learning, deep learning, and data mining algorithms to solve emerging bioinformatics research problems by making them explainable. He is passionate about applied machine learning, knowledge graphs, and explainable artificial intelligence (XAI). Currently, he is working as a research scientist at Fraunhofer FIT, Germany. He is also a PhD candidate at RWTH Aachen University, Germany. Before joining FIT, he worked as a researcher at the Insight Centre for Data Analytics, Ireland. Previously, he worked as a lead software engineer at Samsung Electronics, Korea.

Sridhar Alla

Sridhar Alla is the co-founder and CTO of Blue Whale Consulting and is an expert at helping companies (big and small) define their vision for systems and capabilities that will allow them to establish a strategic execution plan to deal with the ever-growing data collected to support analytics and product teams. He is very experienced at dealing with all aspects of data collection, security, governance, and processing as part of end-to-end big data analytics and machine learning initiatives (including predictive modeling, deep learning, and ML automation). Sridhar is a published book author and an avid presenter at numerous conferences, including Strata, Hadoop World, and Spark Summit. He also has several patents filed with the US PTO on large-scale computing and distributed systems. He has over 18 years' experience writing code in Scala, Java, C, C++, Python, R, and Go, and has extensive hands-on knowledge of Spark, Flink, TensorFlow, Keras, Hadoop, Cassandra, HBase, MongoDB, Riak, Redis, Zeppelin, Mesos, Docker, Kafka, ElasticSearch, Solr, H2O, machine learning, text analytics, distributed computing, and high-performance computing. Sridhar lives with his wife and daughter in New Jersey and in his spare time loves blogging and coaching organizations on next-generation advancements in technology and their alignment with business goals.

PySpark and SparkR

In this chapter, we discuss two other popular APIs, PySpark and SparkR, for writing Spark code in the Python and R programming languages, respectively. The first part of the chapter covers some technical aspects of working with Spark through PySpark. We then move on to SparkR and see how to use it with ease. The following topics are covered in this chapter:

  • Introduction to PySpark
  • Installation and getting started with PySpark
  • Interacting with DataFrame APIs
  • UDFs with PySpark
  • Data analytics using PySpark
  • Introduction to SparkR
  • Why SparkR?
  • Installation and getting started with SparkR
  • Data processing and manipulation
  • Working with RDD and DataFrame using SparkR
  • Data visualization using SparkR

Introduction to PySpark

Python is one of the most popular general-purpose programming languages, with a number of useful features for data processing and machine learning tasks. To make Spark usable from Python, PySpark was initially developed as a lightweight Python frontend to Apache Spark that uses Spark's distributed computation engine. In this chapter, we will discuss a few technical aspects of using Spark from a Python IDE such as PyCharm.

Many data scientists use Python because it has a rich variety of numerical libraries with a statistical, machine learning, or optimization focus. However, processing large-scale datasets in Python is usually tedious because the runtime is single-threaded. As a result, only data that fits in main memory can be processed. Considering this limitation, and to get the full flavor of Spark in Python, PySpark was initially developed as...

Installation and configuration

There are several ways of installing and configuring PySpark on Python IDEs such as PyCharm, Spyder, and so on. Alternatively, you can use PySpark if you have already installed Spark and configured SPARK_HOME. Thirdly, you can use PySpark from the Python shell. Below, we will see how to configure PySpark for running standalone jobs.

By setting SPARK_HOME

First, download the Spark distribution and place it at your preferred location, say /home/asif/Spark. Now let's set SPARK_HOME as follows:

echo "export SPARK_HOME=/home/asif/Spark" >> ~/.bashrc

Now let's set PYTHONPATH as follows:

# Single quotes keep $SPARK_HOME unexpanded until ~/.bashrc is sourced
echo 'export PYTHONPATH=$SPARK_HOME/python/' >> ~/.bashrc
echo "...
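
The same kind of path setup can also be done from inside a Python script, which may help when an IDE does not read ~/.bashrc. The following is a minimal sketch, not a listing from the book: the helper names spark_python_paths and add_spark_to_sys_path are hypothetical, and the code assumes the conventional layout of a Spark distribution, in which the pyspark package lives under $SPARK_HOME/python and a bundled Py4J source archive sits under $SPARK_HOME/python/lib.

```python
import glob
import os
import sys


def spark_python_paths(spark_home):
    """Return the entries PySpark needs on sys.path.

    Assumption: <spark_home>/python holds the pyspark package and
    <spark_home>/python/lib holds a py4j-*-src.zip archive.
    """
    python_dir = os.path.join(spark_home, "python")
    # The Py4J version differs between Spark releases, so match it by pattern.
    py4j_zips = sorted(glob.glob(os.path.join(python_dir, "lib", "py4j-*-src.zip")))
    return [python_dir] + py4j_zips


def add_spark_to_sys_path(spark_home=None):
    """Prepend the PySpark entries to sys.path and return the ones added."""
    spark_home = spark_home or os.environ.get("SPARK_HOME")
    if not spark_home:
        raise EnvironmentError("SPARK_HOME is not set")
    # Skip entries that are already present so repeated calls are harmless.
    added = [p for p in spark_python_paths(spark_home) if p not in sys.path]
    sys.path[:0] = added
    return added
```

With SPARK_HOME exported as above, calling add_spark_to_sys_path() before importing pyspark should make the package importable in the current interpreter. It affects only the running process and does not modify ~/.bashrc.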

Introduction to SparkR

R is one of the most popular statistical programming languages, with a number of useful features that support statistical computing, data processing, and machine learning tasks. However, processing large-scale datasets in R is usually tedious because the runtime is single-threaded. As a result, only datasets that fit in a single machine's memory can be processed. Considering this limitation, and to get the full flavor of Spark in R, SparkR was initially developed at the AMPLab as a lightweight R frontend to Apache Spark that uses Spark's distributed computation engine.

This enables R programmers to use Spark for large-scale data analysis from the R shell or from RStudio. In Spark 2.1.0, SparkR provides a distributed data frame implementation that supports operations such as selection, filtering, and aggregation. This is somewhat similar...

Summary

In this chapter, we showed some examples of how to write Spark code in Python and R, two of the most popular programming languages in the data science community.

We covered the motivation for using PySpark and SparkR for big data analytics with almost the same ease as with Java and Scala. We discussed how to install these APIs on popular IDEs: PyCharm for PySpark and RStudio for SparkR. We also showed how to work with DataFrames and RDDs from these IDEs. Furthermore, we discussed how to execute Spark SQL queries from PySpark and SparkR, and how to perform some analytics with visualization of the dataset. Finally, we saw how to use UDFs with PySpark, with examples.

Thus, we have discussed several aspects of two of Spark's APIs: PySpark and SparkR. There is much more to explore. Interested readers should refer to their websites for...
