
You're reading from  The Data Wrangling Workshop - Second Edition

Product type: Book
Published in: Jul 2020
Reading level: Intermediate
Publisher: Packt
ISBN-13: 9781839215001
Edition: 2nd Edition
Authors (3):
Brian Lipp

Brian Lipp is a Technology Polyglot, Engineer, and Solution Architect with a wide skillset in many technology domains. His programming background has ranged from R, Python, and Scala, to Go and Rust development. He has worked on Big Data systems, Data Lakes, data warehouses, and backend software engineering. Brian earned a Master of Science, CSIS from Pace University in 2009. He is currently a Sr. Data Engineer working with large Tech firms to build Data Ecosystems.

Shubhadeep Roychowdhury

Shubhadeep Roychowdhury holds a master's degree in computer science from West Bengal University of Technology and certifications in machine learning from Stanford. He works as a senior software engineer at a Paris-based cybersecurity startup, where he is applying state-of-the-art computer vision and data engineering algorithms and tools to develop cutting-edge products. He often writes about algorithm implementation in Python and similar topics.

Dr. Tirthajyoti Sarkar

Dr. Tirthajyoti Sarkar works as a senior principal engineer in the semiconductor technology domain, where he applies cutting-edge data science/machine learning techniques for design automation and predictive analytics. He writes regularly about Python programming and data science topics. He holds a Ph.D. from the University of Illinois and certifications in artificial intelligence and machine learning from Stanford and MIT.

6. Learning the Hidden Secrets of Data Wrangling

Activity 6.01: Handling Outliers and Missing Data

Solution:

The steps to complete this activity are as follows:

Note

The dataset to be used for this activity can be found at https://packt.live/2YajrLJ.

  1. Import the required libraries:
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    %matplotlib inline
  2. Read the .csv file:
    df = pd.read_csv("../datasets/visit_data.csv")

    Note

    Don't forget to change the path to match where the CSV file is saved on your system.

  3. Print the data from the DataFrame:
    df.head()

    The output is as follows:

    Figure 6.11: The contents of the CSV file

    As we can see, some values are missing, and a closer look at the data also reveals some outliers.

  4. Check for duplicates by using the following command:
    print("First name is duplicated - {}"\
          .format(any(df.first_name.duplicated())))
    print("Last name is...
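
The checks described in steps 3 and 4 can be tried out on a small, made-up DataFrame before running them on the real file. The sketch below is only an illustration: apart from `first_name`, which appears in the snippet above, every column name and value is an assumption, and the IQR rule used to flag outliers is a common heuristic chosen for this example, not necessarily the method the book's solution uses.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for visit_data.csv. Only `first_name` is taken from
# the original snippet; the other column names and all values are assumptions.
sample = pd.DataFrame(
    {
        "first_name": ["Ana", "Ben", "Ana", "Cy", "Dee", "Eli"],
        "last_name": ["Ray", "Lee", "Kim", "Orr", "Poe", "Lee"],
        "email": ["a@x.org", "b@x.org", "c@x.org",
                  "d@x.org", "e@x.org", "f@x.org"],
        "visit": [10.0, 12.0, np.nan, 11.0, 500.0, 9.0],
    }
)

# Count missing values in each column.
missing_per_column = sample.isnull().sum()

# Duplicate check in the spirit of step 4: True if any value in the column
# repeats an earlier one.
has_duplicates = {
    col: bool(sample[col].duplicated().any())
    for col in ["first_name", "last_name", "email"]
}

# Flag outliers with the 1.5 * IQR rule (an assumed heuristic).
# Comparisons against NaN evaluate to False, so missing values are not flagged.
q1, q3 = sample["visit"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (sample["visit"] < q1 - 1.5 * iqr) | (
    sample["visit"] > q3 + 1.5 * iqr
)

print(missing_per_column)
print(has_duplicates)
print(sample[outlier_mask])
```

With the real dataset, `sample` would be replaced by the `df` loaded in step 2, and the column list adjusted to whatever columns the file actually contains.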