You're reading from Hands-On Big Data Analytics with PySpark

Product type: Book
Published in: Mar 2019
Reading level: Expert
Publisher: Packt
ISBN-13: 9781838644130
Edition: 1st

Authors (2):

Rudy Lai

Colibri Digital is a technology consultancy company founded in 2015 by James Cross and Ingrid Funie. The company works to help its clients navigate the rapidly changing and complex world of emerging technologies, with deep expertise in areas such as big data, data science, machine learning, and cloud computing. Over the past few years, they have worked with some of the world's largest and most prestigious companies, including a tier 1 investment bank, a leading management consultancy group, and one of the world's most popular soft drinks companies, helping each of them to better make sense of its data and process it in more intelligent ways. The company lives by its motto: Data -> Intelligence -> Action.

Rudy Lai is the founder of QuantCopy, a sales acceleration startup using AI to write sales emails for prospects. By taking in leads from your pipelines, QuantCopy researches them online and generates sales emails from that data. It also has a suite of email automation tools to schedule, send, and track email performance, key analytics that all feed back into how the AI generates content. Prior to founding QuantCopy, Rudy ran HighDimension.IO, a machine learning consultancy, where he experienced first-hand the frustrations of outbound sales and prospecting. As a founding partner, he helped startups and enterprises with HighDimension.IO's Machine-Learning-as-a-Service, allowing them to scale up data expertise in the blink of an eye. In the first part of his career, Rudy spent 5+ years in quantitative trading at leading investment banks such as Morgan Stanley. This valuable experience allowed him to witness the power of data, but also the pitfalls of automation using data science and machine learning. Quantitative trading was also a great platform from which to learn about reinforcement learning and supervised learning in depth and in a commercial setting.
Rudy holds a Computer Science degree from Imperial College London, where he was part of the Dean's List, and received awards such as the Deutsche Bank Artificial Intelligence prize.

Bartłomiej Potaczek

Bartłomiej Potaczek is a software engineer working for Schibsted Tech Polska, programming mostly in JavaScript. He is a big fan of everything related to the React world, functional programming, and data visualization. He founded and created InitLearn, a portal that allows users to learn to program in a pair-programming fashion. He was also involved in the InitLearn frontend, which is built on React-Redux technologies. Besides programming, he enjoys football and CrossFit. Currently, he is working on rewriting the frontend for tv.nu, Sweden's most complete TV guide, with over 200 channels. He has also recently worked with technologies including React, React Router, and Redux.

Saving Data in the Correct Format

In the previous chapters, we were focusing on processing and loading data. We learned about transformations, actions, joining, shuffling, and other aspects of Spark.

In this chapter, we will learn how to save data in the correct format, starting with plain text using Spark's standard API. We will also leverage JSON as a data format, and learn how to use the standard APIs to save JSON. Spark has a CSV format, and we will leverage that format as well. We will then move on to more advanced schema-based formats, which require importing third-party dependencies. Following that, we will use Avro with Spark and learn how to save data in a columnar format known as Parquet. By the end of this chapter, we will also have learned how to load data back, to validate that it was stored properly.

In this chapter, we will...

Saving data in plain text format

In this section, we will learn how to save data in plain text format. The following topics will be covered:

  • Saving data in plain text format
  • Loading plain text data
  • Testing

We will save our data in plain text format and investigate how to save it into a Spark output directory. We will then load the plain text data back, and test it to check whether we get the same results. This is our SavePlainText.scala file:

package com.tomekl007.chapter_4

import java.io.File

import com.tomekl007.UserTransaction
import org.apache.spark.sql.SparkSession
import org.apache.spark.{Partitioner, SparkContext}
import org.scalatest.{BeforeAndAfterEach, FunSuite}
import org.scalatest.Matchers._

import scala.reflect.io.Path

class SavePlainText extends FunSuite with BeforeAndAfterEach {
  val spark: SparkContext = SparkSession.builder().master("local[2]"...
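The Scala listing is truncated here. To make the on-disk layout concrete without a Spark cluster, the following is a minimal Python sketch, stdlib only, where the helper names and the transactions directory are made up for illustration: a plain text save produces one part-NNNNN file per partition, plus an empty _SUCCESS marker.

```python
import os

def save_as_text(records, num_partitions, out_dir):
    """Write records Spark-style: one part file per partition, plus _SUCCESS."""
    os.makedirs(out_dir, exist_ok=True)
    # Round-robin the records into partitions, like a simple partitioner.
    partitions = [records[i::num_partitions] for i in range(num_partitions)]
    for i, part in enumerate(partitions):
        with open(os.path.join(out_dir, f"part-{i:05d}"), "w") as f:
            for rec in part:
                f.write(rec + "\n")
    # Empty marker file signalling the job completed successfully.
    open(os.path.join(out_dir, "_SUCCESS"), "w").close()

def load_text(out_dir):
    """Read every part file back, ignoring the marker, like sc.textFile."""
    lines = []
    for name in sorted(os.listdir(out_dir)):
        if name.startswith("part-"):
            with open(os.path.join(out_dir, name)) as f:
                lines.extend(line.rstrip("\n") for line in f)
    return lines

save_as_text(["a,100", "b,200"], num_partitions=2, out_dir="transactions")
print(load_text("transactions"))
```

Note that the plain text lines carry no schema: once loaded, "a,100" is just a string, which is the weakness the later formats address.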

Leveraging JSON as a data format

In this section, we will leverage JSON as a data format and save our data in JSON. The following topics will be covered:

  • Saving data in JSON format
  • Loading JSON data
  • Testing

This data is human-readable and gives us more meaning than simple plain text because it carries some schema information, such as field names. We will then learn how to save data in JSON format and load our JSON data.

We will first create an RDD of UserTransaction("a", 100) and UserTransaction("b", 200), and use .toDF() to convert it to a DataFrame:

val rdd = spark.sparkContext
  .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
  .toDF()

We will then issue coalesce() and, this time, pass 2 as the value, so that we will have two resulting files. We will then issue the write.format method and, for...
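The saved output uses the JSON Lines layout: one JSON object per line, with every record repeating its field names. A stdlib-only Python sketch of that layout follows; the userId and amount field names are illustrative assumptions, not taken from the book's UserTransaction class.

```python
import json

records = [{"userId": "a", "amount": 100}, {"userId": "b", "amount": 200}]

# One JSON object per line (JSON Lines): human-readable, and each record
# carries its own field names, which is the per-record schema overhead.
json_lines = "\n".join(json.dumps(r) for r in records)
print(json_lines)

# Loading reverses it: parse each line independently.
loaded = [json.loads(line) for line in json_lines.splitlines()]
assert loaded == records
```

Because every line is self-describing, partitions can be read independently, but the repeated field names cost space compared with a format that stores the schema once.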

Tabular formats – CSV

In this section, we will be covering text data, but in a tabular format—CSV. The following topics will be covered:

  • Saving data in CSV format
  • Loading CSV data
  • Testing

Saving CSV files is even more involved than JSON and plain text, because we need to specify whether we want to retain the headers of our data in the CSV file.

First, we will create a DataFrame:

test("should save and load CSV with header") {
  //given
  import spark.sqlContext.implicits._
  val rdd = spark.sparkContext
    .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
    .toDF()

Then, we will use the CSV write format. We also need to specify that we don't want to include a header, by setting the header option to false:

  //when
  rdd.coalesce(1)
    .write
    .format("csv")
    .option("header", "false")
    .save(FileName)

We will then perform...
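The effect of the header option can be shown with Python's stdlib csv module. This is a sketch under the assumption that each record has a string ID and an integer amount; the userId and amount column names are made up for illustration.

```python
import csv
import io

rows = [("a", 100), ("b", 200)]

def to_csv(rows, header):
    """Render rows as CSV, optionally prepending a header line --
    the same choice Spark exposes via .option("header", ...)."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    if header:
        writer.writerow(["userId", "amount"])  # illustrative column names
    writer.writerows(rows)
    return buf.getvalue()

print(to_csv(rows, header=False))  # data only, as in the listing above
print(to_csv(rows, header=True))   # first line names the columns
```

Without the header, a reader has no column names at all; with it, names survive, but the values are still untyped text that must be parsed or inferred on load.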

Using Avro with Spark

So far, we have looked at text-based files. We worked with plain text, JSON, and CSV. JSON and CSV are better than plain text because they carry some schema information.

In this section, we'll be looking at an advanced schema-based format known as Avro. The following topics will be covered:

  • Saving data in Avro format
  • Loading Avro data
  • Testing

Avro has a schema and data embedded within it. This is a binary format and is not human-readable. We will learn how to save data in Avro format, load it, and then test it.

First, we will create our user transaction:

test("should save and load avro") {
  //given
  import spark.sqlContext.implicits._
  val rdd = spark.sparkContext
    .makeRDD(List(UserTransaction("a", 100), UserTransaction("b", 200)))
    .toDF()

We will then do a coalesce and write in Avro format:

  //when
  rdd.coalesce(2)
    .write
...
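Real Avro requires a library, which is why an external dependency is involved. The following toy Python sketch is not Avro itself; it only illustrates Avro's core idea of a schema embedded once up front, followed by compact binary-encoded records, with every name and byte layout here invented for illustration.

```python
import json
import struct

# Toy "record" schema, embedded once at the front of the file -- unlike
# JSON Lines, the field names are not repeated for every record.
schema = {"type": "record", "name": "UserTransaction",
          "fields": [{"name": "userId", "type": "string"},
                     {"name": "amount", "type": "int"}]}

def encode(records):
    """Schema header (length-prefixed JSON) followed by binary records."""
    header = json.dumps(schema).encode()
    body = b""
    for user_id, amount in records:
        uid = user_id.encode()
        # Length-prefixed string, then a 4-byte int: binary, not human-readable.
        body += struct.pack(">I", len(uid)) + uid + struct.pack(">i", amount)
    return struct.pack(">I", len(header)) + header + body

def decode(blob):
    """Recover the embedded schema, then walk the binary records."""
    (hlen,) = struct.unpack_from(">I", blob, 0)
    embedded_schema = json.loads(blob[4:4 + hlen])
    pos, records = 4 + hlen, []
    while pos < len(blob):
        (slen,) = struct.unpack_from(">I", blob, pos); pos += 4
        uid = blob[pos:pos + slen].decode(); pos += slen
        (amount,) = struct.unpack_from(">i", blob, pos); pos += 4
        records.append((uid, amount))
    return embedded_schema, records

blob = encode([("a", 100), ("b", 200)])
print(decode(blob))
```

The payoff is that a reader recovers both the data and its schema from the file itself, without the per-record name overhead of JSON.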

Columnar formats – Parquet

In this section, we'll be looking at the second schema-based format, Parquet. The following topics will be covered:

  • Saving data in Parquet format
  • Loading Parquet data
  • Testing

This is a columnar format: the data is stored column-wise, rather than row-wise as in the JSON, CSV, plain text, and Avro files we saw earlier.

This is a very interesting and important format for big data processing, because it can make processing faster. In this section, we will focus on adding Parquet support to Spark, saving the data into the filesystem, reloading it, and then testing. Parquet is used similarly to Avro, in that it gives you a parquet method, but the implementation behind it is slightly different.

In the build.sbt file, for the Avro format, we need to add an external dependency, but for Parquet, we already have that dependency within Spark. So, Parquet is the way to...
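The row-wise versus columnar distinction can be shown in a few lines of Python. This is a conceptual sketch only; real Parquet additionally applies encodings, compression, and rich metadata.

```python
# Toy illustration of row-wise vs. columnar layout.
rows = [("a", 100), ("b", 200), ("c", 300)]

# Row-wise (plain text/CSV/JSON/Avro style): whole records stored together.
row_wise = list(rows)

# Columnar (Parquet style): each column stored contiguously, so a query
# that only needs `amount` can skip over the `userId` data entirely.
columnar = {
    "userId": [r[0] for r in rows],
    "amount": [r[1] for r in rows],
}

print(columnar["amount"])        # read one column without touching the other
print(sum(columnar["amount"]))   # 600
```

Contiguous columns are also why columnar files compress well: values of the same type and similar range sit next to each other.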

Summary

In this chapter, we learned how to save data in plain text format. We noticed that schema information is lost when we do not load the data properly. We then learned how to leverage JSON as a data format and saw that JSON retains the schema, but at the cost of significant overhead, because the schema is repeated for every record. We then learned about CSV and saw that Spark has embedded support for it. The disadvantage of that approach, however, is that the schema does not record the specific types of the fields, which have to be inferred implicitly. Toward the end of this chapter, we covered the schema-based Avro and Parquet formats; Parquet additionally stores the data column-wise and its support is embedded in Spark, whereas Avro requires an external dependency.

In the next chapter, we'll be working with Spark's key/value API.
