You're reading from Python Data Analysis: Master Python Analytics with Machine Learning, Deep Learning, GenAI, LLMs, and Data Engineering

Product type Paperback

Published in Jun 2026

Publisher Packt

ISBN-13 9781806022878

Length 766 pages

Edition 4th Edition

Languages

Python

Tools

Plotly

Concepts

Data Analysis

Authors (2):

Avinash Navlani

Cornellius Yudha Wijaya

Table of Contents (25) Chapters

Preface

1. Part 1: Foundations for Data Analysis

2. Getting Started with Python Libraries

3. NumPy and Pandas

4. Statistics for Data Insights

5. Linear Algebra

6. Part 2: Exploratory Data Analysis and Data Cleaning

7. Data Visualization

8. Retrieving, Processing, and Storing Data

9. Cleaning Messy Data

10. Time-Series Analysis

11. Part 3: Deep Dive into Machine Learning

12. Supervised Learning: Regression and Classification

13. Unsupervised Learning: Dimensionality Reduction, Clustering, Anomaly Detection

14. Ensemble Methods: Bagging and Boosting Methods

15. Artificial Neural Networks and Deep Learning

16. Part 4: NLP, Image Analytics, and Parallel Computing

17. Analyzing Text Data

18. Analyzing Image Data

19. LLMs and Gen AI

20. Parallel Computing Using Dask, Modin, and Ray

21. Big Data Analytics Using PySpark

22. Unlock Access to the Code Bundle and the PDF Version

Unlock this Book’s Free Benefits in 3 Easy Steps

23. Other Books You May Enjoy

Share Your Thoughts

24. Index

Working with PySpark DataFrames

PySpark DataFrames are designed for large datasets that may need distributed processing across the machines of a cluster. They are optimized for scalability and can be processed in parallel across the cluster's nodes. PySpark also provides built-in functions for filtering, aggregating, joining, and transforming data, along with libraries for analysis and machine learning, so that advanced operations run efficiently on massive datasets. Because of this distributed, scalable design, PySpark is often better suited than pandas to very large volumes of data. Let's create a simple DataFrame in PySpark:

from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
data = [
    (1, "Avinash", 28, 50000),
    (2, "Ryan", 25, 45000),
    (3, "Alice", 30, 60000)
]
columns = ["id"...
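
The preview cuts off partway through the column list. As a rough sketch of how the example might continue, the block below reuses the rows from the excerpt and builds a DataFrame from them. Only "id" appears in the source; the column names "name", "age", and "salary" are assumptions inferred from the shape of the rows, and the helper names `build_dataframe` and `demo` are hypothetical. The filter and aggregation at the end illustrate the built-in operations mentioned above and are not taken from the book.

```python
# Rows copied from the excerpt above.
data = [
    (1, "Avinash", 28, 50000),
    (2, "Ryan", 25, 45000),
    (3, "Alice", 30, 60000),
]

# Only "id" is visible in the source; the other names are assumed.
columns = ["id", "name", "age", "salary"]


def build_dataframe(spark):
    """Create a DataFrame from the rows above using an existing SparkSession."""
    return spark.createDataFrame(data, schema=columns)


def demo():
    """Run the example end to end (requires PySpark and a Java runtime)."""
    # Imported here so this file can still be loaded where PySpark is absent.
    from pyspark.sql import SparkSession
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("DataFrameExample").getOrCreate()
    df = build_dataframe(spark)

    df.printSchema()  # column types are inferred from the Python values
    df.show()         # print the rows as a text table

    # Filtering and projection: names and salaries of people older than 26.
    df.filter(df.age > 26).select("name", "salary").show()

    # Aggregation: average salary over all rows.
    df.agg(F.avg("salary").alias("avg_salary")).show()

    spark.stop()
```

Calling `demo()` in an environment with PySpark installed runs the whole sequence. Passing the session into `build_dataframe` rather than creating it inside keeps the data definition separate from session management, which tends to make such code easier to reuse in notebooks and tests.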

Authors (2)

Avinash Navlani

Avinash Navlani, PhD in Data Science, is a senior data scientist, researcher, and educator with 14 years of experience in data science, including 9 years in industry, 4 years in academia, and 1 year in research. He has developed machine learning models, optimization solutions, NLP systems, scalable data pipelines, and cloud-based MLOps platforms across healthcare, retail, finance, oil & gas, and manufacturing. His expertise includes Python, PySpark, Airflow, Databricks, Azure ML, MLflow, and Data Engineering. A former lecturer and speaker, he is passionate about applying analytics to solve real-world problems.

Cornellius Yudha Wijaya

Cornellius Yudha Wijaya has over eight years of experience in data science, machine learning, and artificial intelligence. He currently works as a data scientist manager, where he leads AI initiatives, manages team members, and helps drive the development of practical data and AI solutions. Over the course of his career, he has worked across data science, AI product development, and technical education, with experience in building machine learning systems, supporting business decision-making, and making advanced analytics more usable in real-world settings. He has also written extensively on data science, Python, machine learning, and generative AI, with a strong focus on practical learning and applied problem-solving.
