Reader small image

You're reading from  Mastering Data Mining with Python - Find patterns hidden in your data

Product typeBook
Published inAug 2016
Reading LevelIntermediate
Publisher
ISBN-139781785889950
Edition1st Edition
Languages
Concepts
Right arrow
Author (1)
Megan Squire
Megan Squire
author image
Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.
Read more about Megan Squire

Right arrow

Chapter 4. Network Analysis

Humans are very social creatures, and our ability to find connections – with each other and with other things in our lives – is one of our strongest impulses. We naturally love to connect with others, we distinguish our connections with different names or levels (friend, spouse, acquaintance, lover, enemy, BFF, frenemy, boss, employee, stranger, co-worker, neighbor), and we sometimes keep these connections for years or decades. We are fascinated by seeing people from our network appearing in other, seemingly unconnected networks. We love the notion that there might be only six (or four, or three) degrees of separation between any two people on Earth. The small world phenomenon reminds us that we are more closely connected to each other than it may appear.

So far in this book, we have experimented a lot with finding connections between things, first by finding items that are commonly associated, and then by finding entities that appear different but are really the...

What is a network?


Networks are all around us. We use the word network to refer to many different types of connected things: multiple computers hooked together, a system of cities connected by small roads and large highways, a group of people who all work in the same industry, a series of television stations that broadcast common programming. In common usage, the word network can refer to almost any set of interconnected entities.

From a data mining perspective, however, our use of the word network is more precise. We use the word network to refer to a system that can be represented by a graph made up of nodes and links. In our specialized vocabulary, nodes are the things being connected, and the links are the relationships between the nodes. The collection of all the nodes and links is called a graph. Note that we are using the word graph in its mathematical sense, not in the sense of a visualization, like a bar graph or a line graph. In graph theory vocabulary, nodes are also called vertices...

Measuring a network


Much of the analysis of a network is actually just measuring its various parts and pieces. How many nodes does it have? How are those nodes connected to each other? How many links does it have and how many ways can we traverse those edges? In this section, we will learn many of the common ways to measure a network.

Degree of a network

One way to describe a network is through its degree distribution. The degree of a node is the number of its connected edges. In an undirected graph, the degree of a node is the count of all the edges coming out of it. The degree distribution tells us how many nodes had a degree of 0, how many had a degree of 1, then 2, and so on. Figure 4 shows a histogram of the degree distribution for a simple undirected graph. Two of the nodes have a degree of three, two of the nodes have a degree of two, and one node has a degree of one:

Figure 4. Simple undirected graph and its degree distribution

Figure 5 shows some alternative shapes for degree distributions...

Representing graph data


The theoretical aspects of networks are important, but in order to be able to apply these ideas to a real-world problem, we have to first transform our data into a format that a network analysis program can understand. In this section, we will discover the common formats for representing data in a network-friendly way.

Adjacency matrix

An adjacency matrix is a convenient way to represent graph data. To construct an adjacency matrix for an undirected, unweighted graph, we can create a grid that has all the nodes listed across the top as columns, and also down the side of the grid as rows. Then we use a 1 or 0 to indicate whether there is a link between those two nodes. Consider the unweighted, undirected graph shown in Figure 12:

Figure 12. A simple unweighted, undirected graph

The adjacency matrix for the Figure 12 graph can be written like this:

  A B C D E F
A 0 1 1 1 0 0
B 1 0 1 1 0 0
C 1 1 0 0 1 1
D 1 1 0 0 0 0
E 0 0 1 0 0 0
F 0 0 1 0 0 0

A few things jump out right...

A real project


To put our new knowledge about graphs and social networks to use, we will use data about the software developers working on free, libre, and open source (FLOSS) projects developed using the Ruby programming language. As we learned in Chapter 3, Entity Matching when we tackled entity matching between projects, many Ruby programmers used the website RubyForge.org between 2003 and 2013 to create projects and collaborate. In this chapter, we are going to use this same data to learn how the social structure of this community changed over those ten years.

RubyForge developers can be placed into a social network where the developers themselves are the nodes or vertices, and the fact that they worked on a project together represents the edge or link between the nodes. We could also count how many projects they worked on together to create a weight for the link. If two developers only worked together once, the link is weaker than if the two developers worked together on dozens of projects...

Summary


In this chapter, we learned the basics of network analysis and graph theory, including how to measure a network and describe its properties. We learned why the degree, distance, and centrality of a network are important. We also investigated the various graph data formats that are used in network analysis, and considered which ones are most effective for which types of graphs. Finally, we implemented a real-world project where we build networks of software developers that had worked together in the RubyForge ecosystem. We learned various techniques for exploring the networks, including how to build smaller and more detailed networks, and how to explore component subgraphs. We discovered a few techniques for focusing on a single node and building out ego networks, and eventually we implemented those ego networks into a view of a component and its change over time.

In the next few chapters, we will turn our attention to text mining. Specifically, in Chapter 5, Sentiment Analysis in...

lock icon
The rest of the chapter is locked
You have been reading a chapter from
Mastering Data Mining with Python - Find patterns hidden in your data
Published in: Aug 2016Publisher: ISBN-13: 9781785889950
Register for a free Packt account to unlock a world of extra content!
A free Packt account unlocks extra newsletters, articles, discounted offers, and much more. Start advancing your knowledge today.
undefined
Unlock this book and the full library FREE for 7 days
Get unlimited access to 7000+ expert-authored eBooks and videos courses covering every tech area you can think of
Renews at $15.99/month. Cancel anytime

Author (1)

author image
Megan Squire

Megan Squire is a professor of computing sciences at Elon University. Her primary research interest is in collecting, cleaning, and analyzing data about how free and open source software is made. She is one of the leaders of the FLOSSmole.org, FLOSSdata.org, and FLOSSpapers.org projects.
Read more about Megan Squire