Essential PySpark for Scalable Data Analytics

Product type: Book
Published: Oct 2021
Publisher: Packt
ISBN-13: 9781800568877
Pages: 322
Edition: 1st Edition
Author: Sreeram Nudurupati

Table of Contents (19 chapters)

Preface
Section 1: Data Engineering
Chapter 1: Distributed Computing Primer
Chapter 2: Data Ingestion
Chapter 3: Data Cleansing and Integration
Chapter 4: Real-Time Data Analytics
Section 2: Data Science
Chapter 5: Scalable Machine Learning with PySpark
Chapter 6: Feature Engineering – Extraction, Transformation, and Selection
Chapter 7: Supervised Machine Learning
Chapter 8: Unsupervised Machine Learning
Chapter 9: Machine Learning Life Cycle Management
Chapter 10: Scaling Out Single-Node Machine Learning Using PySpark
Section 3: Data Analysis
Chapter 11: Data Visualization with PySpark
Chapter 12: Spark SQL Primer
Chapter 13: Integrating External Tools with Spark SQL
Chapter 14: The Data Lakehouse
Other Books You May Enjoy

Feature store as a central feature repository

A large share of the time spent on any machine learning problem goes to data cleansing and data wrangling, so that models are built on clean and meaningful data. Feature engineering is another critical step of the machine learning process: data scientists spend a substantial part of their time curating machine learning features, which is complex and time-consuming work. It seems counterproductive to have to create the same features again and again for each new machine learning problem.

Typically, feature engineering takes place on existing historical data, and the resulting features are readily reusable across different machine learning problems. In fact, data scientists spend a good amount of time searching for the right features for the problem at hand. It would therefore be highly beneficial to have a centralized repository of features that is searchable and carries metadata to identify each feature. This central repository...
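
To make the idea of a searchable, metadata-backed feature repository concrete, here is a minimal in-memory sketch in plain Python. It is not the API of any real feature store product: the names `FeatureMetadata` and `FeatureRegistry`, the metadata fields, and the table name `retail.transactions` are all hypothetical, chosen only for illustration. A production feature store would also persist the feature values themselves, which this sketch leaves out.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FeatureMetadata:
    """Descriptive metadata for one registered feature (hypothetical schema)."""
    name: str
    description: str
    source_table: str
    tags: List[str] = field(default_factory=list)


class FeatureRegistry:
    """Toy in-memory stand-in for a central, searchable feature repository."""

    def __init__(self) -> None:
        self._features: Dict[str, FeatureMetadata] = {}

    def register(self, meta: FeatureMetadata) -> None:
        # Refuse duplicates so a feature is defined once and then reused.
        if meta.name in self._features:
            raise ValueError(f"feature already registered: {meta.name}")
        self._features[meta.name] = meta

    def search(self, keyword: str) -> List[str]:
        # Case-insensitive match against name, description, and tags.
        needle = keyword.lower()
        hits = []
        for meta in self._features.values():
            haystack = " ".join([meta.name, meta.description, *meta.tags]).lower()
            if needle in haystack:
                hits.append(meta.name)
        return sorted(hits)


registry = FeatureRegistry()
registry.register(FeatureMetadata(
    name="customer_total_spend",
    description="Sum of purchase amounts per customer",
    source_table="retail.transactions",  # hypothetical table name
    tags=["customer", "spend"],
))
registry.register(FeatureMetadata(
    name="customer_days_since_last_order",
    description="Days elapsed since the most recent order",
    source_table="retail.transactions",  # hypothetical table name
    tags=["customer", "recency"],
))

print(registry.search("spend"))     # features related to spending
print(registry.search("customer"))  # all customer-level features
```

Searching by keyword over names, descriptions, and tags is the simplest way to show how metadata lets a data scientist find an existing feature instead of rebuilding it; rejecting duplicate names reflects the "define once, reuse many times" goal described above.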
