FINDING DUPLICATE ROWS IN PANDAS
Listing 3.26 displays the contents of duplicates.csv, and Listing 3.27 displays the contents of duplicates.py, which illustrates how to find duplicate rows in a Pandas DataFrame.
LISTING 3.26: duplicates.csv
fname,lname,level,dept,state
Jane,Smith,Senior,Sales,California
Dave,Smith,Senior,Devel,California
Jane,Jones,Year1,Mrktg,Illinois
Jane,Jones,Year1,Mrktg,Illinois
Jane,Stone,Senior,Mrktg,Arizona
Dave,Stone,Year2,Devel,Arizona
Mark,Aster,Year3,BizDev,Florida
Jane,Jones,Year1,Mrktg,Illinois
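As a side note on the data above, the following minimal sketch shows how the keep parameter of duplicated() changes which rows are flagged. It is an illustration rather than part of the book's listings: the CSV text is embedded in a string (an assumption made so that the example runs without duplicates.csv on disk), and the variable names mask_first, mask_all, dup_rows, and deduped are hypothetical.

```python
import io
import pandas as pd

# Same rows as duplicates.csv, embedded here so the sketch is self-contained
csv_text = """fname,lname,level,dept,state
Jane,Smith,Senior,Sales,California
Dave,Smith,Senior,Devel,California
Jane,Jones,Year1,Mrktg,Illinois
Jane,Jones,Year1,Mrktg,Illinois
Jane,Stone,Senior,Mrktg,Arizona
Dave,Stone,Year2,Devel,Arizona
Mark,Aster,Year3,BizDev,Florida
Jane,Jones,Year1,Mrktg,Illinois
"""
df = pd.read_csv(io.StringIO(csv_text))

# keep='first' (the default): the first occurrence is NOT flagged,
# only the later repeats of a row are marked True
mask_first = df.duplicated(keep="first")

# keep=False: every member of a duplicate group is marked True
mask_all = df.duplicated(keep=False)

# A Boolean mask can be used to select the flagged rows themselves
dup_rows = df[mask_first]

# drop_duplicates() removes the repeats and keeps one copy of each row
deduped = df.drop_duplicates()

print("Flagged with keep='first':", list(df.index[mask_first]))
print("Flagged with keep=False:  ", list(df.index[mask_all]))
print("Rows after drop_duplicates():", len(deduped))
```

Because duplicated() returns a Boolean Series rather than rows, indexing the DataFrame with that Series (as in dup_rows) is what yields the duplicate rows themselves.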
LISTING 3.27: duplicates.py
import pandas as pd
df = pd.read_csv("duplicates.csv")
print("Contents of data frame:")
print(df)
print()
print("Duplicate rows:")
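# duplicated() returns a Boolean Series; with keep='first', each
# repeat of a row after its first occurrence is marked True
#df2 = df.duplicated(subset=None)
df2 = df.duplicated(subset=None, keep='first')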
print(df2)
print()
print("Duplicate first names:")
# select the rows whose fname value already appeared in an earlier row
df3 = df[df.duplicated(['fname'])]
print(df3)
print()
print("Duplicate first name and level:")
df3 = df...