The Data Wrangling Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd Edition
Authors (3):
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.

9. Applications in Business Use Cases and Conclusion of the Course

Overview

This chapter will allow you to apply the skills you have learned in the previous chapters to data wrangling tasks for business use cases. Throughout the chapter, you will test the data wrangling skills you've acquired so far by applying them to interesting business problems. These tests will help you shore up your skills and give you the confidence to tackle similar problems in the real world.

Introduction

In the previous chapter, we learned about databases. It is time to combine our knowledge of data wrangling and Python with a realistic scenario. Data from a single source is often inadequate for analysis. Generally, a data wrangler has to distinguish between relevant and non-relevant data and combine data from different sources.

The primary job of a data wrangling expert is to pull data from multiple sources, format and clean it (impute the data if it is missing), and finally combine it in a coherent manner to prepare a dataset for further analysis by data scientists or machine learning engineers.
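The imputation step mentioned above can be sketched in a few lines. This is only an illustration under stated assumptions: the column names `year` and `value` are hypothetical, the numbers are placeholders rather than real data, and linear interpolation is just one of several possible imputation strategies.

```python
import pandas as pd

# Hypothetical yearly series with gaps; the values are placeholders, not real data.
raw = pd.DataFrame({
    "year": [2010, 2011, 2012, 2013, 2014],
    "value": [10.0, None, 14.0, None, 18.0],
})

# Work on a copy so the raw data stays available for comparison.
cleaned = raw.copy()

# Linear interpolation fills each interior gap from its neighbouring values.
cleaned["value"] = cleaned["value"].interpolate(method="linear")

print(cleaned)
```

Interpolation suits a series that changes smoothly from year to year. For other kinds of data, a mean, median, or forward fill may be a more defensible choice.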

In this chapter, we will try to mimic a typical task flow by downloading and using two different datasets from reputed web portals. Each dataset contains partial data pertaining to the key question that is being asked. Let's examine this more closely.

Applying Your Knowledge to a Data Wrangling Task

Suppose you are asked the following question:

In India, did the enrollment in primary/secondary/tertiary education increase with the improvement of per capita GDP in the past 15 years?

The actual modeling and analysis will be done by a senior data scientist, who will use machine learning and data visualization techniques. As a data wrangling expert, your job will be to acquire and provide a clean dataset that contains educational enrollment and GDP data side by side.

Suppose you have a link to a United Nations dataset on education (covering all the nations of the world) that you can download. However, this dataset has some missing values and, moreover, it does not contain any Gross Domestic Product (GDP) information. Someone has also given you another separate CSV file (downloaded...
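A minimal sketch of the "side by side" goal, assuming pandas is used: filter the education data to one country, check how many values are missing, and join it to the GDP data on the year. The frames below are tiny in-memory stand-ins. Their column names (`region`, `year`, `enrollment`, `gdp_per_capita`) and all numbers are invented for illustration and do not reflect the layout or contents of the actual files.

```python
import pandas as pd

# Stand-in for the education file: invented columns and placeholder numbers.
education = pd.DataFrame({
    "region": ["India", "India", "India", "Nepal"],
    "year": [2003, 2004, 2005, 2003],
    "enrollment": [100.0, None, 120.0, 50.0],
})

# Stand-in for the separate GDP file: also invented.
gdp = pd.DataFrame({
    "year": [2003, 2004, 2005],
    "gdp_per_capita": [500.0, 550.0, 600.0],
})

# Keep only the rows relevant to the question being asked.
india = education[education["region"] == "India"].sort_values("year")

# Count the gaps that still need to be handled before analysis.
missing = int(india["enrollment"].isna().sum())

# Join on year; validate raises an error if either side repeats a year.
combined = india.merge(gdp, on="year", how="inner", validate="one_to_one")

print(f"missing enrollment values: {missing}")
print(combined)
```

An inner join keeps only the years present in both sources. A left join would instead keep every education row and leave the GDP column empty where no match exists.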

An Extension to Data Wrangling

This is the concluding chapter of this book, so we want to give you a broad overview of some of the exciting technologies and frameworks that you may need to learn about beyond data wrangling to work as a full-stack data scientist. Data wrangling is an essential part of the whole data science and analytics pipeline, but it is not the whole enterprise. You have learned invaluable skills and techniques in this book, but it is always good to broaden your horizons and see what other tools are out there that can give you an edge in this competitive and ever-changing world.

Additional Skills Required to Become a Data Scientist

To practice as a fully qualified data scientist/analyst, you should have some basic skills in your repertoire, irrespective of the particular programming language you choose to focus on. These skills and know-how are language-agnostic and can be utilized with any framework that you have to embrace, depending on...

Summary

Data is all around us. In these nine chapters, we have learned how data of different types and from different sources can be cleaned, corrected, and combined. Hopefully, this chapter has tested your skills enough to shore up the concepts you've learned so far. If you want, you can revisit some of the prior chapters to practice your data wrangling skills a bit more. With the power of Python, the knowledge of data wrangling, and the tips and tricks that you have studied in this book, you are ready to be a data wrangler.
