Using Spark in data science

At the beginning of the twenty-first century, the big data problem became a reality. Data stored in data centers was growing in both volume and velocity. In 2021, we refer to datasets as big data when they reach at least a couple of terabytes in size, and it is not uncommon to see petabytes of data in large organizations. These datasets grow at a rapid rate, anywhere from a couple of gigabytes per day to a couple of gigabytes per minute, for example, when you are storing user interactions with an online store's website to perform clickstream analysis.

In 2009, a research project started at the University of California, Berkeley, aiming to provide the parallel computing tools needed to handle big data. In 2014, the first version of Apache Spark was released out of this research project. Members of that research team went on to found Databricks, one of the most significant contributors to the open source Apache Spark project.

Apache Spark provides an easy-to-use, scalable solution for processing data in parallel across a distributed cluster. The main idea behind the Spark architecture is that a driver node is responsible for executing your code. The code is split into smaller tasks that operate on smaller portions of the data, and these tasks are scheduled to run on the worker nodes, as seen in Figure 1.6:

Figure 1.6 – Parallel processing of big data in a Spark cluster
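To make the driver/worker split concrete, the following is a minimal PySpark sketch (not from the book); the application name and partition count are illustrative choices:

```python
# Minimal PySpark sketch of driver/worker parallelism.
# The application name and partition count are illustrative choices.
from pyspark.sql import SparkSession

# The SparkSession is created on the driver node.
spark = SparkSession.builder.appName("parallel-demo").getOrCreate()

# Spark splits this dataset into 8 partitions that the worker nodes
# (or local cores) process in parallel.
numbers = spark.range(0, 1_000_000, numPartitions=8)

# Each partition is summed by a worker; the driver combines the
# partial results into the final total.
total = numbers.groupBy().sum("id").collect()[0][0]
print(total)  # 499999500000

spark.stop()
```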

For example, suppose you wanted to calculate how many products your company sold during the last year. In that case, Spark could spin up 12 jobs to produce the monthly aggregates, and a final job would then sum up the totals for all the months. If you were instead tempted to load the entire dataset into memory and perform those aggregates there, let's examine how much memory a single computer would need. Assume the sales data for a single month is stored in a 1 GB CSV file. Loading this file requires approximately 10 GB of memory. Compressed Parquet files require even more memory when loaded; a similar 1 GB Parquet file may end up needing 40 GB of memory to load as a pandas.DataFrame object. As you can see, loading all 12 files into memory simultaneously is not feasible. You need to parallelize the processing, which is something Spark can do for you automatically.
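To make this scenario concrete, here is a hedged PySpark sketch of the monthly aggregation; the file pattern (sales_2021_*.csv) and the column names (sale_date, quantity) are hypothetical placeholders, not from the book:

```python
# Hypothetical sketch of the yearly sales aggregation described above.
# The file pattern and column names are placeholders for your own data.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("yearly-sales").getOrCreate()

# The 12 monthly files are read as one distributed DataFrame; no single
# machine has to hold the whole year in memory.
sales = spark.read.csv("sales_2021_*.csv", header=True, inferSchema=True)

# Workers compute the monthly aggregates in parallel.
monthly = (sales
           .groupBy(F.month(F.to_date("sale_date")).alias("month"))
           .agg(F.sum("quantity").alias("products_sold")))

# A final, tiny aggregation sums up the 12 monthly totals.
yearly_total = monthly.agg(F.sum("products_sold")).collect()[0][0]
print(yearly_total)
```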

Important note

Parquet files are stored in a columnar format, which allows you to load only the columns you need. In the 1 GB Parquet example, if you load only half the columns of the dataset, you will probably need only 20 GB of memory. This is one of the reasons why the Parquet format is widely used in analytical workloads.
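As a quick illustration of this column pruning, the following pandas sketch reads only two columns from a Parquet file; the file name and column names are hypothetical:

```python
# Hypothetical sketch of Parquet column pruning with pandas.
# The file name and column names are placeholders.
import pandas as pd

# Only the requested columns are read from disk, so the resulting
# DataFrame needs a fraction of the memory of the full file.
df = pd.read_parquet("sales_2021_01.parquet", columns=["sale_date", "quantity"])
df.info(memory_usage="deep")  # reports the actual in-memory footprint
```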

Spark is written in the Scala programming language and offers APIs for Scala, Python, Java, R, and even C#. Still, the data science community works either in Scala, to achieve maximum computational performance and make use of the Java library ecosystem, or in Python, which is widely adopted by the modern data science community. When you write Python code that utilizes the Spark engine, you use the PySpark tool to perform operations on top of Resilient Distributed Datasets (RDDs) or the Spark.DataFrame objects introduced in later versions of the Spark framework. To benefit from the distributed nature of Spark, you need to be handling big datasets. This means that Spark may be overkill if you are dealing with only a few hundred thousand or even a couple of million records.
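To show what those two abstractions look like in PySpark code, here is a small hedged sketch; the values and column names are purely illustrative:

```python
# Small sketch contrasting the two PySpark abstractions mentioned above.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("rdd-vs-dataframe").getOrCreate()

# RDD: a low-level distributed collection manipulated with functional
# transformations such as map and reduce.
rdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
print(rdd.map(lambda x: x * x).collect())  # [1, 4, 9, 16, 25]

# DataFrame: a higher-level, column-oriented structure with a SQL-like API.
df = spark.createDataFrame([(1, "a"), (2, "b"), (3, "c")], ["id", "label"])
df.filter(df.id > 1).show()
```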

Spark offers two machine learning libraries: the older MLlib and the newer Spark ML. Spark ML uses the Spark.DataFrame structure, a distributed collection of data that offers functionality similar to the DataFrame objects used in Python pandas or R. Moreover, the Koalas project provides an implementation that allows data scientists with existing knowledge of pandas.DataFrame manipulations to apply their coding skills on top of Spark.
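As a taste of the Koalas API, here is a minimal hedged sketch; it assumes the databricks-koalas package is installed (in Spark 3.2 and later, the same API ships as pyspark.pandas):

```python
# Minimal sketch of pandas-style syntax running on Spark via Koalas.
# Assumes the databricks-koalas package is installed; on Spark 3.2+ you
# can use `import pyspark.pandas as ps` instead.
import databricks.koalas as ks

# Familiar pandas-like operations, executed by the Spark engine.
kdf = ks.DataFrame({"month": [1, 1, 2], "quantity": [10, 5, 7]})
print(kdf.groupby("month")["quantity"].sum())
```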

AzureML allows you to execute Spark jobs written in PySpark, either on its native compute clusters or by attaching to Azure Databricks or Synapse Spark pools. Although you will not write any PySpark code in this book, in Chapter 12, Operationalizing Models with Code, you will learn how to achieve similar parallelization benefits without the need for Spark or a driver node.

No matter whether you are coding in regular Python, PySpark, R, or Scala, you are producing code artifacts that are probably part of a larger system. In the next section, you will explore the DevOps mindset, which emphasizes communication and collaboration between software engineers, data scientists, and system administrators in order to release valuable product features faster.
