Learning Data Mining with Python - Second Edition


Product type Book
Published in Apr 2017
Publisher Packt
ISBN-13 9781787126787
Pages 358 pages
Edition 2nd Edition

Table of Contents (20) Chapters

Title Page
Credits
About the Author
About the Reviewer
www.PacktPub.com
Customer Feedback
Preface
Getting Started with Data Mining
Classifying with scikit-learn Estimators
Predicting Sports Winners with Decision Trees
Recommending Movies Using Affinity Analysis
Features and scikit-learn Transformers
Social Media Insight using Naive Bayes
Follow Recommendations Using Graph Mining
Beating CAPTCHAs with Neural Networks
Authorship Attribution
Clustering News Articles
Object Detection in Images using Deep Neural Networks
Working with Big Data
Next Steps...

Chapter 7. Follow Recommendations Using Graph Mining

Graphs can be used to represent a wide range of phenomena. This is particularly true for online social networks and the Internet of Things (IoT). Graph mining is big business, with websites such as Facebook running on data analysis experiments performed on graphs.

Social media websites are built upon engagement. Users without active news feeds, or without interesting friends to follow, do not engage with sites. In contrast, users with more interesting friends and followees engage more and see more ads. This leads to larger revenue streams for the website.

In this chapter, we look at how to define similarity on graphs, and how to use them within a data mining context. Again, this is based on a model of the phenomena. We look at some basic graph concepts, like subgraphs and connected components. This leads to an investigation of cluster analysis, which we delve more deeply into in Chapter 10, Clustering News Articles.

The topics covered in this chapter...

Loading the dataset


In this chapter, our task is to recommend users on online social networks based on shared connections. Our logic is that if two users have the same friends, they are highly similar and worth recommending to each other. We want our recommendations to be of high value. We can only recommend so many people before it becomes tedious, so we need to find recommendations that engage users.

To do this, we use the previous chapter's disambiguation model to find only users talking about Python as a programming language. In this chapter, we use the results from one data mining experiment as input to another data mining experiment. Once we have our Python programmers selected, we then use their friendships to find clusters of users that are highly similar to each other. The similarity between two users will be defined by how many friends they have in common. Our intuition will be that the more friends two people have in common, the more likely those two people are to be friends...
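As a minimal sketch of this definition, the number of friends two users share can be computed with a set intersection. The function name and the friend lists below are made up for illustration and do not come from the chapter's dataset.

```python
def shared_friend_count(friends_a, friends_b):
    """Return how many friends two users have in common."""
    return len(set(friends_a) & set(friends_b))


# Hypothetical friend ID sets, keyed by made-up user names
friends = {
    "alice": {1, 2, 3, 4},
    "bob": {3, 4, 5},
    "carol": {9},
}

alice_bob = shared_friend_count(friends["alice"], friends["bob"])
alice_carol = shared_friend_count(friends["alice"], friends["carol"])
```

Under this raw count, alice and bob (two shared friends) are more similar than alice and carol (none).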

Getting follower information from Twitter


With our initial set of users, we now need to get the friends of each of these users. A friend is a person whom the user is following. The API for this is called friends/ids, and it has both good and bad points. The good news is that it returns up to 5,000 friend IDs in a single API call. The bad news is that you can only make 15 calls every 15 minutes, which means it will take you at least 1 minute per user to get all of their friends, and longer if they have more than 5,000 friends (which happens more often than you may think).
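The arithmetic behind that estimate can be sketched as follows. The constants come from the limits quoted above; the function names and the sample friend counts are hypothetical, and the result is a sustained-rate estimate rather than an exact wall-clock figure.

```python
import math

FRIENDS_PER_CALL = 5000   # maximum friend IDs returned by one call
CALLS_PER_WINDOW = 15     # calls allowed per rate-limit window
WINDOW_MINUTES = 15       # length of the rate-limit window


def calls_needed(friend_count):
    """Calls required for one user; even a user with no friends costs one call."""
    return max(1, math.ceil(friend_count / FRIENDS_PER_CALL))


def estimated_minutes(friend_counts):
    """Estimate collection time at the sustained rate of one call per minute."""
    total_calls = sum(calls_needed(n) for n in friend_counts)
    return total_calls * WINDOW_MINUTES / CALLS_PER_WINDOW


# Hypothetical friend counts for three users: 1 + 1 + 3 calls in total
minutes = estimated_minutes([300, 5000, 12000])
```

A user with 12,000 friends needs three calls, so a handful of very well-connected users can noticeably slow down the collection.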

The code is similar to the code from our previous API usage (obtaining tweets). We will package it as a function, as we will use this code in the next two sections. Our function takes a Twitter user's ID value and returns their friends. While it may be surprising to some, many Twitter users have more than 5,000 friends. Because of this, we will need to use Twitter's pagination function, which lets Twitter return multiple pages of data through...
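The general shape of a paginated download can be sketched without touching the real API. This sketch assumes cursor-style pagination, in which each response carries a token for the next page, a first-page cursor of -1, and a next cursor of 0 once the pages run out. The fetch_page callable and the fake pages are hypothetical stand-ins, not the chapter's actual function.

```python
def get_all_friends(fetch_page, user_id):
    """Collect friend IDs across pages using an assumed cursor convention."""
    friends = []
    cursor = -1  # assumption: -1 asks for the first page
    while cursor != 0:  # assumption: 0 signals that no pages remain
        page = fetch_page(user_id, cursor)
        friends.extend(page["ids"])
        cursor = page["next_cursor"]
    return friends


def fake_fetch_page(user_id, cursor):
    """Hypothetical stand-in for an API call, returning two canned pages."""
    pages = {
        -1: {"ids": [1, 2, 3], "next_cursor": 10},
        10: {"ids": [4, 5], "next_cursor": 0},
    }
    return pages[cursor]


all_friends = get_all_friends(fake_fetch_page, user_id=42)
```

Passing the page-fetching function in as a parameter keeps the loop testable offline; a real implementation would also need to wait when the rate limit is reached.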

Creating a graph


At this point in our experiment, we have a list of users and their friends. This gives us a graph where some users are friends of other users (although not necessarily the other way around).

A graph is a set of nodes and edges. Nodes are usually objects of interest - in this case, they are our users. The edges in this initial graph indicate that user A is a friend of user B. We call this a directed graph, as the order of the nodes matters. The fact that user A is a friend of user B does not imply that user B is a friend of user A. The example network below shows this, along with a user C who is a friend of user B and is friended in turn by user B:
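The same three-user example can also be written down as a plain adjacency mapping, which makes the role of edge direction explicit. This is only an illustrative sketch using the standard library; the helper names are hypothetical, and the choice to store "A is a friend of B" as the pair ("A", "B") is an assumption made here.

```python
from collections import defaultdict

# Directed edges, stored as (source, target) pairs
edges = [("A", "B"), ("C", "B"), ("B", "C")]

# Map each node to the set of nodes its outgoing edges point to
successors = defaultdict(set)
for source, target in edges:
    successors[source].add(target)


def has_edge(source, target):
    """True if there is a directed edge from source to target."""
    return target in successors.get(source, set())


def is_mutual(u, v):
    """True if edges exist in both directions between u and v."""
    return has_edge(u, v) and has_edge(v, u)
```

Here has_edge("A", "B") holds while has_edge("B", "A") does not, and only B and C are connected in both directions.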

In Python, one of the best libraries for working with graphs, including creating, visualizing, and computing, is called NetworkX.

Note

Once again, you can use Anaconda to install NetworkX: conda install networkx

First, we create a directed graph using NetworkX. By convention, when importing NetworkX, we use the abbreviation...

Finding subgraphs


From our similarity function, we could simply rank the results for each user, returning the most similar user as a recommendation - as we did with our product recommendations. This works, and is indeed one way to perform this type of analysis.
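A minimal sketch of this ranking approach is shown below. It assumes a similarity function that takes two sets of friends; the shared-friend count used here, the function names, and the friend data are all illustrative rather than the chapter's own code.

```python
def shared_friend_count(friends_a, friends_b):
    """Illustrative similarity: number of friends in common."""
    return len(friends_a & friends_b)


def most_similar_user(user, friends, similarity):
    """Return the other user whose friends are most similar, or None."""
    candidates = [other for other in friends if other != user]
    if not candidates:
        return None
    return max(candidates,
               key=lambda other: similarity(friends[user], friends[other]))


# Hypothetical friend ID sets for three users
friends = {
    "u1": {10, 11, 12},
    "u2": {11, 12, 13},
    "u3": {12},
}

recommendation = most_similar_user("u1", friends, shared_friend_count)
```

Because the similarity function is passed in as an argument, a normalized measure can be swapped in without changing the ranking code.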

Instead, we might want to find clusters of users that are all similar to each other. We could advise these users to start a group, create advertising targeting this segment, or even just use those clusters to do the recommendations themselves. Finding these clusters of similar users is a task called cluster analysis.
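One simple way to obtain such clusters, sketched here under stated assumptions, is to link every pair of users whose similarity reaches a threshold and then take the connected components of the resulting graph. The threshold value, the scores, and the function name are hypothetical, and this standard-library version is not necessarily how the chapter implements it.

```python
from collections import deque


def similarity_clusters(similarities, threshold):
    """Group users into connected components of the thresholded graph.

    similarities maps (user_a, user_b) pairs to a similarity score.
    """
    neighbours = {}
    for (a, b), score in similarities.items():
        neighbours.setdefault(a, set())
        neighbours.setdefault(b, set())
        if score >= threshold:  # keep only sufficiently strong links
            neighbours[a].add(b)
            neighbours[b].add(a)

    clusters, seen = [], set()
    for start in neighbours:
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from start
        seen.add(start)
        queue, component = deque([start]), set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        clusters.append(component)
    return clusters


# Hypothetical pairwise similarity scores
scores = {("a", "b"): 0.8, ("b", "c"): 0.6, ("c", "d"): 0.1,
          ("d", "e"): 0.9, ("a", "e"): 0.0}
clusters = similarity_clusters(scores, threshold=0.5)
```

Raising the threshold tends to split the users into smaller, tighter clusters, while lowering it merges them into fewer, larger ones.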

Note

Cluster analysis is a difficult task, with complications that classification tasks do not typically have. For example, evaluating classification results is relatively easy - we compare our results to the ground truth (from our training set) and see what percentage we got right. With cluster analysis, though, there isn't typically a ground truth. Evaluation usually comes down to seeing if the clusters make sense, based...

Summary


In this chapter, we looked at graphs from social networks and how to do cluster analysis on them. We also looked at saving and loading models from scikit-learn by using the classification model we created in Chapter 6, Social Media Insight Using Naive Bayes.

We created a graph of friends from the social network Twitter. We then examined how similar two users were, based on their friends. Users with more friends in common were considered more similar, although we normalized this by considering the overall number of friends they have. This is a commonly used way to infer knowledge (such as age or general topic of discussion) based on similar users. We can use this logic for recommending users to others: if they follow user X and user Y is similar to user X, they will probably like user Y. This is, in many ways, similar to our transaction-led similarity of previous chapters.
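One common way to normalize a shared-friend count by the overall number of friends is the Jaccard index: the size of the intersection divided by the size of the union. Whether this matches the exact formula used in the chapter is an assumption; the sketch below only illustrates the idea, and the function name is hypothetical.

```python
def jaccard_similarity(friends_a, friends_b):
    """Shared friends divided by all distinct friends of both users."""
    a, b = set(friends_a), set(friends_b)
    union = a | b
    if not union:
        return 0.0  # two users with no friends at all: treat as dissimilar
    return len(a & b) / len(union)


# Two shared friends out of six distinct friends overall
score = jaccard_similarity({1, 2, 3, 4}, {3, 4, 5, 6})
```

With this normalization, two users who share two friends out of six score higher than two users who share two friends out of thousands.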

The aim of this analysis was to recommend users, and our use of cluster analysis allowed us to find clusters of similar...
