Reader small image

You're reading from  Big Data Analytics with Java

Product typeBook
Published inJul 2017
Reading LevelIntermediate
PublisherPackt
ISBN-139781787288980
Edition1st Edition
Languages
Concepts
Right arrow
Author (1)
RAJAT MEHTA
RAJAT MEHTA
author image
RAJAT MEHTA

The author is a VP (Technical Architect) in technology in JP Morgan Chase in New York. The author is a sun certified java developer and has worked on java related technologies for more than 16 years. Current role for the past few years heavily involves the usage of bid data stack and running analytics on it. Author is also a contributor in various open source projects that are available on his GitHub repository and is also a frequent write on dev magazines.
Read more about RAJAT MEHTA

Right arrow

Data exploration


In this section, we will explore this dataset and try to perform some simple and useful analytics on top of this dataset.

First, we will create the boilerplate code for Spark configuration and the Spark session:

SparkConf conf = ...
SparkSession session = ...

Next, we will load the dataset and find the number of rows in it:

Dataset<Row> rawData = session.read().csv("data/retail/Online_Retail.csv");

This will print the number of rows in the dataset as:

Number of rows --> 541909

As you can see, this is not a very small dataset but it is not big data either. Big data can run into terabytes. We have seen the number of rows, so let's look at the first few rows now.

rawData.show();

This will print the result as:

As you can see, this dataset is a list of transactions including the country name from where the transaction was made. But if you look at the columns of the tables, Spark has given a default name to the dataset columns. In order to provide a schema and better structure...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Big Data Analytics with Java
Published in: Jul 2017Publisher: PacktISBN-13: 9781787288980

Author (1)

author image
RAJAT MEHTA

The author is a VP (Technical Architect) in technology in JP Morgan Chase in New York. The author is a sun certified java developer and has worked on java related technologies for more than 16 years. Current role for the past few years heavily involves the usage of bid data stack and running analytics on it. Author is also a contributor in various open source projects that are available on his GitHub repository and is also a frequent write on dev magazines.
Read more about RAJAT MEHTA