You're reading from  Hands-On Big Data Analytics with PySpark

Product type: Book
Published in: Mar 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781838644130
Edition: 1st
Authors (2):

Rudy Lai

Colibri Digital is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company helps its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to make better sense of its data and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.

Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails for prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance: key analytics that all feed back into how the AI generates content. Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first-hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO's Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye.

In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform from which to learn about reinforcement learning and supervised learning topics in depth and in a commercial setting.
Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean's List, and received awards such as the Deutsche Bank Artificial Intelligence prize.
Read more about Rudy Lai

Bartłomiej Potaczek

Bartłomiej Potaczek is a software engineer working for Schibsted Tech Polska, programming mostly in JavaScript. He is a big fan of everything related to the React world, functional programming, and data visualization. He founded and created InitLearn, a portal that allows users to learn to program in a pair-programming fashion. He was also involved in the InitLearn frontend, which is built on React-Redux technologies. Besides programming, he enjoys football and CrossFit. Currently, he is working on rewriting the frontend for tv.nu—Sweden's most complete TV guide, with over 200 channels. He has also recently worked on technologies including React, React Router, and Redux.
Read more about Bartłomiej Potaczek


Leveraging the Spark GraphX API

In this chapter, we will learn how to create a graph from a data source. We will then carry out experiments with the Edge API and the Vertex API. By the end of this chapter, you will know how to calculate the degree of a vertex and PageRank.

In this chapter, we will cover the following topics:

  • Creating a graph from a data source
  • Using the Vertex API
  • Using the Edge API
  • Calculating the degree of a vertex
  • Calculating PageRank

Creating a graph from a data source

In this section, we will create a loader component that will be used to load the data, revisit the graph format, and load a Spark graph from a file.

Creating the loader component

The graph.g file consists of a vertex-to-vertex structure. In the following graph.g file, the first line, 1 2, means that there is an edge between vertex ID 1 and vertex ID 2. The second line means that there is an edge from vertex ID 1 to 3, then from 2 to 3, and finally from 3 to 5:

1 2
1 3
2 3
3 5

We will take the graph.g file, load it, and see how it provides results in Spark. First, we need to get a resource path to our graph.g file. We will do this using the getClass.getResource() method to get the path to it, as...
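Although the chapter loads this file through Spark, the parsing that an edge-list loader performs on this format can be sketched in plain Scala, with no Spark required. EdgeListParser is a hypothetical name used only for illustration; it roughly mirrors what GraphLoader.edgeListFile does with each line:

```scala
// Hypothetical sketch (not the book's code): parse the whitespace-separated
// graph.g format into (source, destination) vertex ID pairs, skipping blank
// and comment lines.
object EdgeListParser {
  def parse(lines: Seq[String]): Seq[(Long, Long)] =
    lines
      .map(_.trim)
      .filter(line => line.nonEmpty && !line.startsWith("#"))
      .map { line =>
        val parts = line.split("\\s+")
        (parts(0).toLong, parts(1).toLong)
      }
}

val edges = EdgeListParser.parse(Seq("1 2", "1 3", "2 3", "3 5"))
// edges == List((1,2), (1,3), (2,3), (3,5))
```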

Using the Vertex API

In this section, we will construct the graph from its vertices. We will learn to use the Vertex API and also leverage vertex transformations.

Constructing a graph using vertices

Constructing a graph is not a trivial task; we need to supply the vertices and the edges between them. Let's focus on the first part: our users. users is an RDD of (VertexId, String) pairs, as follows:

package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite

class VertexAPI extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local...

Using the Edge API

In this section, we will construct the graph using the Edge API. We will also use vertices, but this time we will focus on edge transformations.

Constructing the graph using edges

As we saw in the previous sections, we have edges and vertices, and the edges are exposed as an RDD. Since this is an RDD, we can use all the methods that are available on a normal RDD: the max method, the min method, the sum method, and all the other actions. We will apply the reduce method; reduce takes two edges, e1 and e2, and lets us perform some logic on them.

The e1 edge has an attribute, a destination, and a source, as shown in the following screenshot:

Since the edge is...
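Since reduce over edges is a standard collection operation, its semantics can be shown with plain Scala, without Spark. The Edge case class below only mirrors the shape of GraphX's Edge(srcId, dstId, attr); it is not the Spark class itself, and the data is made up for illustration:

```scala
// Plain-Scala sketch of reducing over edges (no Spark). This Edge mirrors
// org.apache.spark.graphx.Edge: a source ID, a destination ID, and an
// attribute.
case class Edge[T](srcId: Long, dstId: Long, attr: T)

val edges = List(
  Edge(1L, 2L, "friend"),
  Edge(1L, 3L, "colleague"),
  Edge(3L, 5L, "friend")
)

// reduce takes two edges, e1 and e2, and must return one of them;
// here we keep the edge with the larger destination ID.
val maxDstEdge = edges.reduce((e1, e2) => if (e1.dstId > e2.dstId) e1 else e2)
// maxDstEdge == Edge(3, 5, "friend")
```

The same lambda would work unchanged on the edges RDD of a real GraphX graph, because RDDs expose the same reduce action.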

Calculating the degree of a vertex

In this section, we will cover the total degree, then we'll split it into two parts—an in-degree and an out-degree—and we will understand how this works in the code.

For our first test, let's construct the graph that we already know about:

package com.tomekl007.chapter_7

import org.apache.spark.SparkContext
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import org.scalatest.Matchers._

class CalculateDegreeTest extends FunSuite {
  val spark: SparkContext = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("should calculate degree of vertices") {
    //given
    val users: RDD[(VertexId, (String))] =
      spark.parallelize(Array(
        (1L, "a"),
        (2L, "...
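The quantities this test checks (inDegrees, outDegrees, and degrees in GraphX) can be computed by hand from the edge list. The following plain-Scala sketch, using the chapter's example edges, shows what the three methods return; it is an illustration, not Spark's implementation:

```scala
// Plain-Scala sketch (no Spark) of in-degree, out-degree, and total degree
// for the edge list from graph.g.
val edges = List((1L, 2L), (1L, 3L), (2L, 3L), (3L, 5L))

// Out-degree: how many edges leave each vertex (what graph.outDegrees returns).
val outDegrees: Map[Long, Int] =
  edges.groupBy(_._1).map { case (v, es) => v -> es.size }

// In-degree: how many edges arrive at each vertex (what graph.inDegrees returns).
val inDegrees: Map[Long, Int] =
  edges.groupBy(_._2).map { case (v, es) => v -> es.size }

// Total degree: in-degree plus out-degree (what graph.degrees returns).
val degrees: Map[Long, Int] =
  (outDegrees.keySet ++ inDegrees.keySet).map { v =>
    v -> (outDegrees.getOrElse(v, 0) + inDegrees.getOrElse(v, 0))
  }.toMap
// degrees == Map(1 -> 2, 2 -> 2, 3 -> 3, 5 -> 1)
```

Vertex 3 has the highest total degree: two edges in (from 1 and 2) plus one edge out (to 5).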

Calculating PageRank

In this section, we will load data about users and their followers. Using the graph API and the structure of our data, we will calculate PageRank to rank the users.

First, we need to load the file using edgeListFile, as follows:

package com.tomekl007.chapter_7

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.sql.SparkSession
import org.scalatest.FunSuite
import org.scalatest.Matchers._

class PageRankTest extends FunSuite {
  private val sc = SparkSession.builder().master("local[2]").getOrCreate().sparkContext

  test("should calculate page rank using GraphX API") {
    //given
    val graph = GraphLoader.edgeListFile(sc, getClass.getResource("/pagerank/followers.txt").getPath)

We have a followers.txt file; the following screenshot shows the format of the file, which is similar to the file we...
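Under the hood, PageRank repeatedly propagates rank contributions along the edges until they stabilize. A plain-Scala sketch of that iteration, using GraphX's default reset probability of 0.15 (that is, a damping factor of 0.85), looks like the following; it illustrates the algorithm and is not Spark's distributed implementation:

```scala
// Hypothetical plain-Scala PageRank sketch (no Spark). Each edge (src, dst)
// means src links to dst; ranks start at 1.0, and on each iteration a vertex
// receives the ranks of its in-neighbours divided by their out-degrees.
def pageRank(edges: Seq[(Long, Long)],
             iterations: Int = 20,
             damping: Double = 0.85): Map[Long, Double] = {
  val vertices = (edges.map(_._1) ++ edges.map(_._2)).distinct
  val outDeg = edges.groupBy(_._1).map { case (v, es) => v -> es.size }
  var ranks = vertices.map(_ -> 1.0).toMap
  for (_ <- 1 to iterations) {
    val contribs = edges.groupBy(_._2).map { case (dst, es) =>
      dst -> es.map { case (src, _) => ranks(src) / outDeg(src) }.sum
    }
    ranks = vertices.map { v =>
      v -> ((1 - damping) + damping * contribs.getOrElse(v, 0.0))
    }.toMap
  }
  ranks
}

// Made-up follower data: users 2 and 3 both follow user 1, and user 1
// follows user 2, so user 1 should end up with the highest rank.
val ranks = pageRank(Seq((2L, 1L), (3L, 1L), (1L, 2L)))
```

A user with more followers accumulates more contributions and therefore a higher rank, which is exactly how the GraphX result ranks the users in followers.txt.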

Summary

In this chapter, we delved into transformations and actions, and then we learned about Spark's immutable design. We studied how to avoid shuffle and how to reduce operational expenses. Then, we looked at how to save the data in the correct format. We also learned how to work with the Spark key/value API, and how to test Apache Spark jobs. After that, we learned how to create a graph from a data source, and then we investigated and experimented with the edge and vertex APIs. We learned how to calculate the degree of a vertex. Finally, we looked at PageRank and how we are able to calculate it using the Spark GraphX API.
