Reader small image

You're reading from  Simplifying Data Engineering and Analytics with Delta

Product typeBook
Published inJul 2022
PublisherPackt
ISBN-139781801814867
Edition1st Edition
Concepts
Right arrow
Author (1)
Anindita Mahapatra
Anindita Mahapatra
author image
Anindita Mahapatra

Anindita Mahapatra is a Solutions Architect at Databricks in the data and AI space helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of their extension school program. She has extensive big data and Hadoop consulting experience from Thinkbig/Teradata prior to which she was managing development of algorithmic app discovery and promotion for both Nokia and Microsoft AppStores. She holds a Masters degree in Liberal Arts and Management from Harvard Extension School, a Masters in Computer Science from Boston University and a Bachelors in Computer Science from BITS Pilani, India.
Read more about Anindita Mahapatra

Right arrow

Curating data in stages for analytics

The raw data has to be wrangled and transformed to be consumable and ready for analytics. Each data persona may look at different data aspects or features, and there is no reason for all of them to run repeatable cleansing functions because if they did, they could all have multiple copies of data and unnecessary processing cycles, which is both time-consuming and expensive. This is where a good data catalog and design blueprints help to maintain discipline, offer data discovery opportunities for reusable components, and prevent redundant work. We have already looked at the medallion architecture, and the bronze, silver, and gold zones are where data is forged and made usable.

RDD, DataFrames, and datasets

This is a good time to refresh concepts around RDD, DataFrames, and datasets. RDD stands for Resilient Distributed Data and is the original low-level construct of Spark. RDDs have to be optimized at each stage and cannot infer schema. DataFrames...

lock icon
The rest of the page is locked
Previous PageNext Page
You have been reading a chapter from
Simplifying Data Engineering and Analytics with Delta
Published in: Jul 2022Publisher: PacktISBN-13: 9781801814867

Author (1)

author image
Anindita Mahapatra

Anindita Mahapatra is a Solutions Architect at Databricks in the data and AI space helping clients across all industry verticals reap value from their data infrastructure investments. She teaches a data engineering and analytics course at Harvard University as part of their extension school program. She has extensive big data and Hadoop consulting experience from Thinkbig/Teradata prior to which she was managing development of algorithmic app discovery and promotion for both Nokia and Microsoft AppStores. She holds a Masters degree in Liberal Arts and Management from Harvard Extension School, a Masters in Computer Science from Boston University and a Bachelors in Computer Science from BITS Pilani, India.
Read more about Anindita Mahapatra