Essential PySpark for Scalable Data Analytics

Product type: Book
Published: October 2021
Publisher: Packt
ISBN-13: 9781800568877
Pages: 322
Edition: 1st
Author: Sreeram Nudurupati

Table of Contents (19 chapters)

Preface
Section 1: Data Engineering
Chapter 1: Distributed Computing Primer
Chapter 2: Data Ingestion
Chapter 3: Data Cleansing and Integration
Chapter 4: Real-Time Data Analytics
Section 2: Data Science
Chapter 5: Scalable Machine Learning with PySpark
Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
Chapter 7: Supervised Machine Learning
Chapter 8: Unsupervised Machine Learning
Chapter 9: Machine Learning Life Cycle Management
Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
Section 3: Data Analysis
Chapter 11: Data Visualization with PySpark
Chapter 12: Spark SQL Primer
Chapter 13: Integrating External Tools with Spark SQL
Chapter 14: The Data Lakehouse
Other Books You May Enjoy

What this book covers

Chapter 1, Distributed Computing Primer, introduces the distributed computing paradigm. It also discusses how distributed computing became a necessity with the ever-increasing data sizes of the last decade, presents the concept of in-memory, data-parallel processing with the MapReduce paradigm, and finally introduces the latest features of the Apache Spark 3.0 engine.
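
To make the data-parallel idea concrete, here is a minimal sketch (not taken from the book) that uses PySpark's low-level RDD API: the data is split across partitions, mapped in parallel, and the partial results are reduced to a single value.

```python
# Illustrative map/reduce over a partitioned dataset using PySpark RDDs.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("map-reduce-primer").getOrCreate()

# Distribute the numbers across 8 partitions (executors work on them in parallel).
numbers = spark.sparkContext.parallelize(range(1_000_000), numSlices=8)

# Map each element, then reduce the partial sums into one result on the driver.
total = numbers.map(lambda x: x * x).reduce(lambda a, b: a + b)
print(total)
```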

Chapter 2, Data Ingestion, covers various data sources, such as databases, data lakes, and message queues, and how to ingest data from them. You will also learn about the uses, differences, and relative efficiency of various data storage formats for storing and processing data.
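
As an illustrative sketch of such ingestion (the paths and the partition column are hypothetical, not from the book), raw CSV files can be read from a data lake location and persisted in a columnar format:

```python
# Minimal batch ingestion sketch: CSV in, partitioned Parquet out.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("ingest").getOrCreate()

raw = (spark.read
       .option("header", True)
       .option("inferSchema", True)
       .csv("/data/raw/sales/*.csv"))        # hypothetical source path

(raw.write
    .mode("overwrite")
    .partitionBy("country")                  # assumes a 'country' column exists
    .parquet("/data/bronze/sales"))          # hypothetical target path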

Chapter 3, Data Cleansing and Integration, discusses various data cleansing techniques, how to handle bad incoming data, data reliability challenges and how to cope with them, and data integration techniques to build a single integrated view of the data.
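
A minimal cleansing-and-integration sketch with toy data (illustrative only, not the book's example) might look like this:

```python
# Drop incomplete rows, de-duplicate, fix types, and join two sources into one view.
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("cleansing").getOrCreate()

orders = spark.createDataFrame(
    [(1, "c1", "9.99"), (1, "c1", "9.99"), (2, None, "5.00")],
    ["order_id", "customer_id", "amount"])
customers = spark.createDataFrame([("c1", "Alice")], ["customer_id", "name"])

clean = (orders
         .dropna(subset=["customer_id"])       # discard records missing keys
         .dropDuplicates(["order_id"])         # remove duplicate ingests
         .withColumn("amount", F.col("amount").cast("double")))

integrated = clean.join(customers, on="customer_id", how="left")
integrated.show()
```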

Chapter 4, Real-Time Data Analytics, explains how to perform real-time data ingestion and processing, discusses the unique challenges that real-time data integration presents and how to overcome them, and describes the benefits it provides.
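
For a flavor of what real-time processing looks like in PySpark, here is a minimal Structured Streaming sketch; the Kafka broker and topic names are placeholders, and the Kafka source additionally requires the spark-sql-kafka connector package:

```python
# Continuously ingest events from a (placeholder) Kafka topic and print running counts.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("realtime").getOrCreate()

events = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker:9092")  # placeholder broker
          .option("subscribe", "events")                      # placeholder topic
          .load())

counts = events.groupBy("topic").count()

query = (counts.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination()
```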

Chapter 5, Scalable Machine Learning with PySpark, briefly discusses the need to scale out machine learning and the various techniques available to achieve this, from natively distributed machine learning algorithms to embarrassingly parallel processing to distributed hyperparameter search. It also provides an introduction to the PySpark MLlib library and an overview of its various distributed machine learning algorithms.
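
As an illustrative sketch of distributed hyperparameter search with MLlib (toy data and an illustrative parameter grid, not the book's example):

```python
# Cross-validated grid search where candidate models are evaluated in parallel.
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

spark = SparkSession.builder.appName("tuning").getOrCreate()

# Toy training data with the 'features' and 'label' columns MLlib expects.
train = spark.createDataFrame(
    [(Vectors.dense([float(i), float(i % 2)]), float(i % 2)) for i in range(20)],
    ["features", "label"])

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = ParamGridBuilder().addGrid(lr.regParam, [0.01, 0.1, 1.0]).build()

cv = CrossValidator(estimator=lr,
                    estimatorParamMaps=grid,
                    evaluator=BinaryClassificationEvaluator(),
                    numFolds=3,
                    parallelism=4)   # fit candidate models in parallel
best_model = cv.fit(train).bestModel
```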

Chapter 6, Feature Engineering – Extraction, Transformation, and Selection, explores various techniques for converting raw data into features suitable for consumption by machine learning models, including techniques for scaling and transforming features.
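
A minimal sketch of such a transformation step, using illustrative column names and toy data, could be:

```python
# Assemble raw columns into a feature vector, then standardize it.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler, StandardScaler

spark = SparkSession.builder.appName("features").getOrCreate()
df = spark.createDataFrame([(25, 50000.0), (40, 90000.0), (31, 62000.0)],
                           ["age", "income"])

assembler = VectorAssembler(inputCols=["age", "income"], outputCol="raw_features")
scaler = StandardScaler(inputCol="raw_features", outputCol="features",
                        withMean=True, withStd=True)

assembled = assembler.transform(df)
features = scaler.fit(assembled).transform(assembled)
features.show(truncate=False)
```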

Chapter 7, Supervised Machine Learning, explores supervised learning techniques for machine learning classification and regression problems, including linear regression, logistic regression, and gradient boosted trees.
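
For example, a tiny (illustrative) MLlib linear regression fit looks like this:

```python
# Fit a linear regression model on toy data and inspect its parameters.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.regression import LinearRegression

spark = SparkSession.builder.appName("supervised").getOrCreate()
df = spark.createDataFrame([(1.0, 2.1), (2.0, 3.9), (3.0, 6.2), (4.0, 8.1)],
                           ["x", "y"])

train = VectorAssembler(inputCols=["x"], outputCol="features").transform(df)
model = LinearRegression(featuresCol="features", labelCol="y").fit(train)
print(model.coefficients, model.intercept)
```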

Chapter 8, Unsupervised Machine Learning, covers unsupervised learning techniques such as clustering, collaborative filtering, and dimensionality reduction to reduce the number of features prior to applying supervised learning.
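
A minimal clustering sketch with toy points (illustrative only) could be:

```python
# Cluster a handful of 2-D points into two groups with MLlib KMeans.
from pyspark.sql import SparkSession
from pyspark.ml.clustering import KMeans
from pyspark.ml.linalg import Vectors

spark = SparkSession.builder.appName("unsupervised").getOrCreate()
points = spark.createDataFrame(
    [(Vectors.dense([0.0, 0.0]),), (Vectors.dense([0.1, 0.2]),),
     (Vectors.dense([9.0, 9.0]),), (Vectors.dense([9.2, 8.8]),)],
    ["features"])

model = KMeans(k=2, seed=42).fit(points)
model.transform(points).show()
```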

Chapter 9, Machine Learning Life Cycle Management, explains that it is not sufficient to just build and train models; in the real world, multiple versions of the same model are built, and different versions are suited to different applications. It is therefore necessary to track the various experiments, along with their hyperparameters, metrics, and the version of the data they were trained on. It is also necessary to track and store the various models in a centrally accessible repository so that models can be easily productionized and shared; finally, mechanisms are needed to automate this recurring process. This chapter introduces these techniques using MLflow, an end-to-end open source machine learning life cycle management library.
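
A minimal MLflow tracking sketch (illustrative; it uses a small scikit-learn model rather than the book's exact workflow) looks like this:

```python
# Log parameters, a metric, and a fitted model for a single tracked run.
import mlflow
import mlflow.sklearn
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)

with mlflow.start_run(run_name="iris-logreg"):
    model = LogisticRegression(max_iter=200).fit(X, y)
    mlflow.log_param("max_iter", 200)
    mlflow.log_metric("train_accuracy", model.score(X, y))
    mlflow.sklearn.log_model(model, "model")   # store the model as a run artifact
```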

Chapter 10, Scaling Out Single-Node Machine Learning Using PySpark, explains that in Chapter 5, Scalable Machine Learning with PySpark, you learned how to use the power of Apache Spark's distributed computing framework to train and score machine learning models at scale. Spark's native machine learning library provides good coverage of the standard tasks that data scientists typically perform; however, there is a wide variety of functionality provided by standard single-node Python libraries that was not designed to work in a distributed manner. This chapter deals with techniques for horizontally scaling out standard Python data processing and machine learning libraries such as pandas, scikit-learn, and XGBoost. It covers scaling out typical data science tasks such as exploratory data analysis, model training, and model inference, and finally covers a scalable Python library named Koalas that lets you effortlessly write PySpark code using familiar, easy-to-use pandas-like syntax.
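
As a small taste of that pandas-like syntax (illustrative; in Spark 3.2 and later the same API also ships as pyspark.pandas):

```python
# Koalas: pandas-style code that executes as distributed Spark jobs under the hood.
import databricks.koalas as ks

kdf = ks.DataFrame({"city": ["NYC", "SF", "NYC"], "sales": [10.0, 7.5, 3.0]})
print(kdf.groupby("city")["sales"].sum())
```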

Chapter 11, Data Visualization with PySpark, covers data visualizations, which are an important aspect of conveying meaning from data and gleaning insights from it. This chapter covers how the most popular Python visualization libraries can be used along with PySpark.
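
A common pattern, sketched here with toy data, is to aggregate with PySpark and hand only the small summary to pandas and matplotlib on the driver:

```python
# Aggregate in Spark, convert the small result to pandas, and plot it locally.
import matplotlib.pyplot as plt
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("viz").getOrCreate()
df = spark.createDataFrame([("2021-01", 10.0), ("2021-02", 14.5), ("2021-03", 9.0)],
                           ["month", "revenue"])

summary = df.groupBy("month").agg(F.sum("revenue").alias("revenue")).toPandas()
summary.plot(x="month", y="revenue", kind="bar")
plt.show()
```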

Chapter 12, Spark SQL Primer, covers SQL, which is an expressive language for ad hoc querying and data analysis. This chapter introduces Spark SQL for data analysis and also shows how to use SQL and PySpark interchangeably.
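
A minimal sketch of that interchange (toy data) is:

```python
# Register a DataFrame as a temporary view, query it with SQL, keep working in PySpark.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sparksql").getOrCreate()
df = spark.createDataFrame([("Alice", 34), ("Bob", 29)], ["name", "age"])
df.createOrReplaceTempView("people")

adults = spark.sql("SELECT name FROM people WHERE age > 30")
adults.show()
```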

Chapter 13, Integrating External Tools with Spark SQL, explains that once we have clean, curated, and reliable data in our performant data lake, it would be a missed opportunity to not democratize this data across the organization to citizen analysts. The most popular way of doing this is via various existing Business Intelligence (BI) tools. This chapter deals with requirements for BI tool integration.

Chapter 14, The Data Lakehouse, explains that traditional descriptive analytics tools, such as BI tools, are designed around data warehouses and expect data to be presented in a certain way, whereas modern advanced analytics and data science tools are geared toward working with large amounts of data that is easily accessible in data lakes. It is also not practical or cost-effective to store redundant copies of data in separate storage locations to cater to these individual use cases. This chapter presents a new paradigm, called the Data Lakehouse, that tries to overcome the limitations of data warehouses and data lakes and bridge the gap by combining the best elements of both.
