You're reading from Mastering Data Mining with Python - Find patterns hidden in your data

Product type: Book
Published in: Aug 2016
Reading level: Intermediate
ISBN-13: 9781785889950
Edition: 1st Edition
Author: Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.

Chapter 3. Entity Matching

In my set of outdoor tools, I have a large hand axe that I have always called a mattock. But my friend from the western United States calls it a Pulaski. When he asks me to hand him the Pulaski, it always gives me pause. Sometimes we know a thing by more than one name, or two things share the same name, and either case can lead to confusion. This happens with people all the time. Have you ever been mistaken for someone else who shares your first and last name? Have you ever used a nickname or an alias? On a children's playground, ten women might turn around when they hear a child call out "Mom!" A man who always goes by the name Bob would be immediately suspicious when an unfamiliar telephone caller asks to speak with Robert. A pharmacy technician gives John T. Smith the medicine intended for John M. Smith, leading to disastrous results.

In this chapter, we are concerned with the accurate identification of entities, or things, and the correct assignment...

What is entity matching?


Finding matching items is one of the oldest tasks in database processing, and as databases get larger and more distributed, this task becomes more and more important. Each time two datasets are merged, questions arise about how to identify duplicates and how to connect items from the first dataset to similar items in the second dataset. When we find ourselves asking "Are these two things different, even though they have the same name?" or "Are these other two things the same, even though they have different names?", we can apply entity matching techniques to find the answer.

In light of all this concern with the names for an item, it is perhaps appropriate that this task itself has many names: entity matching, entity disambiguation, object consolidation, duplicate identification, merge/purge, and record linkage, to name a few. We will use the term entity matching in this chapter to generically describe this class of activities.

Consider the following examples where...

Entity matching project


As with the application example in Chapter 2, Association Rule Mining, where we found frequently occurring sets of tags from Freecode projects, this project will also use data from the free, libre, and open source software (FLOSS) realm. Our task here is to find software projects that are hosted on different code repositories but actually represent the same entity. Specifically, we are interested in finding projects that were formerly hosted on the now-defunct RubyForge.org site but have subsequently migrated to its successor, the https://rubygems.org/ site. RubyForge and RubyGems are both code repositories for software written in the Ruby language, but they differ slightly in what they offer. RubyForge was a hosting site for software projects, and it included file downloads, source code control, mailing lists, discussion forums, and so on. On RubyForge, each project could comprise many files, including libraries, documentation, and the like...
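To make the task concrete, here is a minimal sketch of matching projects across two repositories by name alone. It is not the procedure developed in this chapter. The function names `normalize` and `match_by_name`, the normalization rule (lowercase, letters and digits only), and the sample project names are all assumptions invented for illustration.

```python
def normalize(name):
    """Reduce a project name to lowercase letters and digits.

    This rule is an assumption for illustration; real project names
    may need a more careful treatment.
    """
    return "".join(ch for ch in name.lower() if ch.isalnum())


def match_by_name(rubyforge_names, rubygems_names):
    """Return a dict mapping each RubyForge name to the RubyGems names
    that share its normalized form. Names with no candidate are omitted.
    """
    # Index the second collection by normalized name so that each
    # lookup is a dictionary access instead of a scan over every pair.
    index = {}
    for gem in rubygems_names:
        index.setdefault(normalize(gem), []).append(gem)

    matches = {}
    for project in rubyforge_names:
        key = normalize(project)
        if key in index:
            matches[project] = index[key]
    return matches


# Hypothetical names, not taken from either site.
forge = ["Ruby-Foo", "barlib", "Quux_Tools"]
gems = ["rubyfoo", "quuxtools", "baz"]
print(match_by_name(forge, gems))
```

A name-only match like this will produce both false matches and missed matches, which is why the other attributes of a project and some way of evaluating the results both matter.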

Summary


In this chapter, we learned how to connect entities even when there is no common identifier for them. This task, called entity matching, is broadly applicable to many domains, and is one of the oldest tasks in data processing. Once we have matched entities, we are able to perform data mining on sets that were previously unconnected.

To do so, we tackled three common strategies for entity matching: attribute-based, disjoint sets, and context-based. We learned several techniques for estimating whether strings are similar, including edit distances such as Hamming and Levenshtein and phonetic encodings such as Soundex, and we learned how to use blocking techniques to reduce or eliminate pairwise testing. Since it is important to evaluate the effectiveness of our entity matching methods, we learned how to calculate false positive and false negative rates. Finally, we tested our knowledge by designing an entity matching procedure for a real-world problem using data from two separate collections...
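As a quick refresher, the two edit distances and the two error rates mentioned above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code from the chapter. The function names are our own, and the error rates use the standard definitions (false positives over all true non-matches, false negatives over all true matches), which we assume here and which may differ from the exact formulas used in the chapter.

```python
def hamming(a, b):
    """Count positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn a into b (dynamic programming,
    keeping only the previous row of the table)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def error_rates(predicted, actual):
    """Return (false positive rate, false negative rate) for two
    equal-length lists of booleans, one entry per candidate pair.
    True means 'these two records are the same entity'."""
    pairs = list(zip(predicted, actual))
    fp = sum(p and not a for p, a in pairs)
    fn = sum(a and not p for p, a in pairs)
    tp = sum(p and a for p, a in pairs)
    tn = sum(not p and not a for p, a in pairs)
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
print(error_rates([True, True, False, False],
                  [True, False, True, False]))  # (0.5, 0.5)
```

Hamming distance is only defined for strings of equal length, so the sketch raises an error otherwise; Levenshtein distance has no such restriction.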
