Mastering Clojure Data Analysis

By Eric Rochester
    Advance your knowledge in tech with a Packt subscription

  • Instant online access to over 7,500+ books and videos
  • Constantly updated with 100+ new titles each month
  • Breadth and depth in over 1,000+ technologies
  1. Network Analysis – The Six Degrees of Kevin Bacon

About this book

Clojure is a Lisp dialect built on top of the Java Virtual Machine. As data increasingly invades more and more parts of our lives, we continually need more tools to deal with it effectively. Data can be organized effectively using Clojure data tools.

Mastering Clojure Data Analysis teaches you how to analyze and visualize complex datasets. With this book, you'll learn how to perform data analysis using established scientific methods with the modern, powerful Clojure programming language with the help of exciting examples drawn from real-world data. This will help you get to grips with advanced topics such as network analysis, the characteristics of social networks, applying topic modeling to get a handle on unstructured textual data, and GIS analysis to apply geospatial techniques to your data analysis problems.

With this guide, you'll learn how to leverage the power and flexibility of Clojure to dig into your data and access the insights it hides.

Publication date:
May 2014
Publisher
Packt
Pages
340
ISBN
9781783284139

 

Chapter 1. Network Analysis – The Six Degrees of Kevin Bacon

With the popularity of Facebook, Twitter, LinkedIn, and other social networks, we're increasingly defined by who we know and who's in our network. These websites help us manage who we know—whether personally, professionally, or in some other way—and our interactions with those groups and individuals. In exchange, we tell these sites who we are in the network.

These companies, and many others, spend a lot of time on and pay attention to our social networks. What do they say about us, and how can we sell things to these groups?

In this chapter, we'll walk through learning about and analyzing social networks:

  • Analyzing social networks

  • Getting the data

  • Understanding graphs

  • Implementing the graphs

  • Measuring social network graphs

  • Visualizing social network graphs

 

Analyzing social networks


Although the Internet and popular games such as Six Degrees of Kevin Bacon have popularized the concept, social network analysis has been around for a long time. It has deep roots in sociology. Although the sociologist John A. Barnes may have been the first person to use the term in 1954 in the article Class and communities in a Norwegian island parish (http://garfield.library.upenn.edu/classics1987/A1987H444300001.pdf), he was building on a tradition from the 1930s, and before that, he was looking at social groups and interactions relationally. Researchers contended that the phenomenon arose from social interactions and not individuals.

Slightly more recently, starting in the 1960s, Stanley Milgram has been working on a small world experiment. He would mail a letter to a volunteer somewhere in the mid-western United States and ask him or her to get it to a target individual in Boston. If the volunteer knew the target on a first-name basis, he or she could mail it to him. Otherwise, they would need to pass it to someone they knew who might know the target. At each step, the participants were to mail a postcard to Milgram so that he could track the progress of the letter.

This experiment (and other experiments based on it) has been criticized. For one thing, the participants may decide to just throw the letter away and miss huge swathes of the network. However, the results are evocative. Milgram found that the few letters that made it to the target, did so with an average of six steps. Similar results have been born out by later, similar experiments.

Milgram himself did not use the popular phrase six degrees of separation. This was probably taken from John Guare's play and film Six Degrees of Separation (1990 and 1993). He said he got the concept from Guglielmo Marconi, who discussed it in his 1909 Nobel Prize address.

The phrase "six degrees" is synonymous with social networks in the popular imagination, and a large part of this is due to the pop culture game Six Degrees of Kevin Bacon. In this game, people would try to find a link between Kevin Bacon and some other actor by tracing the films in which they've worked together.

In this chapter, we'll take a look at this game more critically. We'll use it to explore a network of Facebook (https://www.facebook.com/) users. We'll visualize this network and look at some of its characteristics.

Specifically, we're going to look at a network that has been gathered from Facebook. We'll find data for Facebook users and their friends, and we'll use that data to construct a social network graph. We'll analyze that information to see whether the observation about the six degrees of separation applies to this network. More broadly, we'll see what we can learn about the relationships represented in the network and consider some possible directions for future research.

 

Getting the data


A couple of small datasets of the Facebook network data are available on the Internet. None of them are particularly large or complete, but they do give us a reasonable snapshot of part of Facebook's network. As the Facebook graph is a private data source, this partial view is probably the best that we can hope for.

We'll get the data from the Stanford Large Network Dataset Collection (http://snap.stanford.edu/data/). This contains a number of network datasets, from Facebook and Twitter, to road networks and citation networks. To do this, we'll download the facebook.tar.gz file from http://snap.stanford.edu/data/egonets-Facebook.html. Once it's on your computer, you can extract it. When I put it into the folder with my source code, it created a directory named facebook.

The directory contains 10 sets of files. Each group is based on one primary vertex (user), and each contains five files. For vertex 0, these files would be as follows:

  • 0.edges: This contains the vertices that the primary one links to.

  • 0.circles: This contains the groupings that the user has created for his or her friends.

  • 0.feat: This contains the features of the vertices that the user is adjacent to and ones that are listed in 0.edges.

  • 0.egofeat: This contains the primary user's features.

  • 0.featnames: This contains the names of the features described in 0.feat and 0.egofeat. For Facebook, these values have been anonymized.

For these purposes, we'll just use the *.edges files.

Now let's turn our attention to the data in the files and what they represent.

 

Understanding graphs


Graphs are the Swiss army knife of computer science data structures. Theoretically, any other data structure can be represented as a graph, although usually, it won't perform as well.

For example, binary trees can be seen as a graph in which each node has two outgoing edges at most. These edges link it to the node's children. Or, an array can be seen as a graph in which each item in the array has edges that link it to the items adjacent to it.

However, in this case, the data that we're working with is naturally represented by a graph. The people in the network are the nodes, and their relationships are the edges.

Graphs come in several flavors, but they all have some things in common. First, they are a series of nodes that are connected by edges. Edges can be unidirectional, in which case, the relationship they represent goes only one way (for example, followers on Twitter), or it goes bidirectional, in which the relationship is two-way (for example, friends on Facebook).

Graphs generally don't have any hierarchy or structure like trees or lists do. However, the data they represent may have a structure. For example, Twitter has a number of users (vertices) who have a lot of followers (inbound edges). However, most users only have a few followers. This dichotomy creates a structure to the graph, where a lot of data flows through a few vertices.

Graphs' data structures typically support a number of operations, including adding edges, removing edges, and traversing the graph. We'll implement a graph data structure later. At that point, we'll also look at these operations. This may not be the best performing graph, especially for very large datasets, but it should help make clear what graphs are all about.

 

Implementing the graphs


As the graph data structure is so central to this chapter, we'll take a look at it in more detail before we move on.

There are a number of ways to implement graphs. In this case, we'll use a variation of an adjacency list, which maps each node to a list of its neighbors. We'll store the nodes in a hash map and keep separate hash maps for each node's data. This representation is especially good for sparse graphs, because we only need to store existing links. If the graph is very dense, then representing the set of neighboring nodes as a matrix instead of a hash table will take less memory.

However, before we start looking at the code, let's check out the Leiningen 2 project.clj file. Apart from the Clojure library, this makes use of the Clojure JSON library, the me.raynes file utility library (https://github.com/Raynes/fs), and the Simple Logging Facade for Java library (http://www.slf4j.org/):

(defproject network-six "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
    :url "http://www.eclipse.org/legal/epl-v10.html"}
  :plugins [[lein-cljsbuild "0.3.2"]]
  :dependencies [[org.slf4j/slf4j-simple "1.7.5"]
    [org.clojure/clojure "1.5.1"]
    [org.clojure/data.json "0.2.2"]
    [me.raynes/fs "1.4.4"]
    [org.clojure/clojurescript "0.0-2202"]]
  :cljsbuild {:builds [{:source-paths ["src-cljs"],
    :compiler {:pretty-printer true,
      :output-to "www/js/main.js",
      :optimizations :whitespace}}]})

If you're keeping track, there are several sections related to ClojureScript (https://github.com/clojure/clojurescript) as well. We'll talk about them later in the chapter.

For the first file that we'll work in, open up src/network_six/graph.clj. Use this for the namespace declaration:

(ns network-six.graph
  (:require [clojure.set :as set]
            [clojure.core.reducers :as r]
            [clojure.data.json :as json]
            [clojure.java.io :as io]
            [clojure.set :as set]
            [network-six.util :as u]))

In this namespace, we'll create a Graph record that contains two slots. One is for the map between vertex numbers and sets of neighbors. The second is for the data maps. We'll define an empty graph that we can use anywhere, as follows:

(defrecord Graph
  [neighbors data])
(def empty-graph (Graph. {} {}))

The primary operations that we'll use for this chapter are functions that modify the graph by adding or removing edges or by merging two graphs. The add and delete operations both take an optional flag to treat the edge as bidirectional. In that case, both functions just call themselves with the ends of the edges swapped so that they operate on the edge that goes in the other direction:

(defn update-conj [s x]
  (conj (if (nil? s) #{} s) x))
(defn add
  ([g x y] (add g x y false))
  ([g x y bidirectional?]
   ((if bidirectional? #(add % y x false) identity)
      (update-in g [:neighbors x] #(update-conj % y)))))
(defn delete
  ([g x y] (delete g x y false))
  ([g x y bidirectional?]
   ((if bidirectional? #(delete % y x false) identity)
      (update-in g [:neighbors x] #(disj % y)))))
(defn merge-graphs [a b]
  (Graph. (merge-with set/union (:neighbors a) (:neighbors b))
          (merge (:data a) (:data b))))

The final low-level functions to work with graphs are two functions that are used to set or retrieve data associated with the vertices. Sometimes, it's also useful to be able to store data of the edges, but we won't use that for this implementation. However, we will associate some information with the vertices themselves later on, and when we do that, we'll use these functions.

All of these functions are overloaded. Passed in a graph, a vertex number, and a key, they set or retrieve a value on a hash map that is that vertex's value. Passed in just a graph and a vertex number, they set or retrieve the vertex's value—either the hash map or another value that is there in its place:

(defn get-value
  ([g x] ((:data g) x))
  ([g x k] ((get-value g x) k)))
(defn set-value
  ([g x v] (assoc-in g [:data x] v))
  ([g x k v] (set-value g x (assoc (get-value g x) k v))))
(defn update-value
  ([g x f] (set-value g x (f (get-value g x))))
  ([g x k f] (set-value g x k (f (get-value g x k)))))

We will also want to get the vertices and the edges for the graph. The vertices are the union of the set of all the nodes with outbound edges and the set of nodes with inbound edges. There should be some, or even a lot, of overlap between these two groups. If the graph is bidirectional, then get-edges will return each edge twice—one going from a to b and the other going from b to a:

(defn get-vertices [graph]
  (reduce set/union (set (keys (:neighbors graph)))
          (vals (:neighbors graph))))
(defn get-edges [graph]
  (let [pair-edges (fn [[v neighbors]]
                     (map #(vector v %) neighbors))]
    (mapcat pair-edges (:neighbors graph))))

We'll write some more basic utilities later, but right now, let's take a look at a function that is a slightly higher-level function, but still a fundamental operation on graphs: a breadth-first walk over the graph and a search based on that.

A breadth-first walk traverses the graph by first looking at all the neighbors of the current node. It then looks at the neighbors of those nodes. It continues broadening the search one layer at a time.

This is in opposition to a depth-first walk, which goes deep down one path until there are no outgoing edges to be tried. Then, it backs out to look down other paths.

Which walk is more efficient really depends on the nature of the individual graph and what is being searched for. However, in our case, we're using a breadth-first walk because it ensures that the shortest path between the two nodes will be found first. A depth-first search can't guarantee that.

The backbone of the breadth-first function is a First In, First Out (FIFO) queue. To keep track of the vertices in the paths that we're trying, we use a vector with the index of those vertices. The queue holds all of the active paths. We also keep a set of vertices that we've reached before. This prevents us from getting caught in loops.

We wrap everything in a lazy sequence so that the caller can control how much work is done and what happens to it.

At each step in the loop, the algorithm is pretty standard:

  1. If the queue is empty, then we've exhausted the part of the graph that's accessible from the start node. We're done, and we return null to indicate that we didn't find the node.

  2. Otherwise, we pop a path vector off the queue. The current vertex is the last one.

  3. We get the current vertex's neighbors.

  4. We remove any vertices that we've already considered.

  5. For each neighbor, we append it to the current path vector, creating that many new path vectors. For example, if the current path vector is [0, 171, 4] and the new neighbors are 7, 42 and 532, then we'll create three new vectors: [0, 171, 4, 7], [0, 171, 4, 42], and [0, 171, 4, 532].

  6. We push each of the new path vectors onto the queue.

  7. We add each of the neighbors onto the list of vertices that we've seen.

  8. We output the current path to the lazy sequence.

  9. Finally, we loop back to step one for the rest of the output sequence.

The following code is the implementation of this. Most of it takes place in bf-seq, which sets up the processing in the first clause (two parameters) and constructs the sequence in the second clause (three parameters). The other function, breadth-first, is the public interface to the function:

(defn bf-seq
  ([get-neighbors a]
   (bf-seq
     get-neighbors
     (conj clojure.lang.PersistentQueue/EMPTY [a])
     #{a}))
  ([get-neighbors q seen]
   (lazy-seq
     (when-not (empty? q)
       (let [current (first q)
             nbors (remove seen (get-neighbors (last current)))]
         (cons current
               (bf-seq get-neighbors
                       (into (pop q)
                             (map #(conj current %) nbors))
                       (into seen nbors))))))))
(defn breadth-first [graph a]
  (bf-seq (:neighbors graph) a))

Notice that what makes this a breadth-first search is that we use a FIFO queue. If we used a LIFO (Last In, First Out) queue (a Clojure list works well for this), then this would be a depth-first search. Instead of going broadly and simultaneously trying a number of paths, it would dive deep into the graph along one path and not backtrack to try a new one until it had exhausted the first path.

This is a flexible base on which one can build a number of functionalities. For example, a breadth-first search is now a two-line function:

(defn bfs [graph a b]
  (first (filter #(= (last %) b) (breadth-first graph a))))

These are just filters that find all paths that start from a and end at b and then return the first of those.

Loading the data

Now that we have the fundamental data structure that we're going to use, we can read the data files that we downloaded into a graph.

For the purposes of analyzing the network itself, we're only interested in the *.edges files. This lists the edges in the graph, one edge per line. Each edge is defined by the node numbers that it connects. As Facebook relationships are two-way, the edges represented here are bidirectional. For example, the first few lines of 0.edges are shown as follows:

236 186
122 285
24 346
271 304
176 9

We'll first define a function that reads one edge file into a Graph, and then we'll define another function that walks a directory, reads each edge file, and merges the graphs into one. I'm keeping these in a new namespace, network-six.ego. This is defined in the src/network_six/ego.clj file. It uses the following namespace declaration:

(ns network-six.ego
  (:require [clojure.java.io :as io]
            [clojure.set :as set]
            [clojure.string :as string]
            [clojure.data.json :as json]
            [clojure.core.reducers :as r]
            [network-six.graph :as g]
            [network-six.util :as u]
            [me.raynes.fs :as fs])
  (:import [java.io File]))

Now we'll define the function that reads the *.edges files from a data directory:

(defn read-edge-file [filename]
  (with-open [f (io/reader filename)]
    (->>
      f
      line-seq
      (r/map #(string/split % #"\s+"))
      (r/map #(mapv (fn [x] (Long/parseLong x)) %))
      (r/reduce #(g/add %1 (first %2) (second %2))
                g/empty-graph))))
(defn read-edge-files [ego-dir]
  (r/reduce g/merge-graphs {}
            (r/map read-edge-file
                   (fs/find-files ego-dir #".*\.edges$"))))

We can use these from read-eval-print loop (REPL) to load the data into a graph that we can work with. We can get some basic information about the data at this point, and the following how we'll go about doing that:

User=> (require '[network-six.graph :as g]
                '[network-six.ego :as ego])
user=> (def graph (ego/read-edge-files "facebook/"))
#'user/graph
user=> (count (g/get-vertices graph))
3959
user=> (count (g/get-edges graph))
168486

Now let's dive deeper into the graph and get some other metrics.

 

Measuring social network graphs


There are a variety of metrics that we can use to describe graph data structures in particular and social network graphs in general. We'll look at a few of them and think about both, what they can teach us, and how we can implement them.

Density

Recall that a network's density is the number of actual edges versus the number of possible edges. A completely dense network is one that has an edge between each vertex and every other vertex. For example, in the following figure, the graph on the upper-right section is completely dense. The graph in the lower-left section has a density factor of 0.5333.

The number of possible edges is given as N(N-1). We'll define the density formula as follows:

(defn density [graph]
  (let [n (count (get-vertices graph))
        e (count (get-edges graph))]
    (/ (* 2.0 e) (* n (dec n)))))

We can use this to get some information about the number of edges in the graph:

user=> (g/density graph)
0.021504657198130255

Looking at this, it appears that this graph is not very dense. Maybe some other metrics will help explain why.

Degrees

A vertex's degree is the number of other vertexes connected to it, and another summary statistic for social networks is the average degree. This is computed by the formula 2E/N. The Clojure to implement this is straightforward:

(defn avg-degree [graph]
  (/ (* 2.0 (count (get-edges graph)))
     (count (get-vertices graph))))

Similarly, it is easy to use it:

user=> (g/avg-degree graph)
85.11543319019954

So, the typical number of edges is around 85. Given that there are almost 4,000 vertices, it is understandable why the density is so low (0.022).

Paths

We can get a number of interesting metrics based on all of the paths between two elements. For example, we'll need those paths to get the centrality of nodes later in this chapter. The average path length is also an important metric. To calculate any of these, we'll need to compute all of the paths between any two vertices.

For weighted graphs that have a weight or cost assigned to each edge, there are a number of algorithms to find the shortest path. Dijkstra's algorithm and Johnson's algorithm are two common ones that perform well in a range of circumstances.

However, for non-weighted graphs, any of these search algorithms evolve into a breadth-first search. We just implemented this.

We can find the paths that use the breadth-first function that we walked through earlier. We simply take each vertex as a starting point and get all the paths from there. To make access easier later, we convert each path returned into a hash map as follows:

(defn find-all-paths [graph]
  (->> graph
    get-vertices
    (mapcat #(breadth-first graph %))
    (map #(hash-map :start (first %) :dest (last %) :path %))))

Unfortunately, there's an added complication; the output will probably take more memory than available. Because of this, we'll also define a couple of functions to write the paths out to a file and iterate over them again. We'll name them network-six.graph/write-paths and network-six.graph/iter-paths, and you can find them in the code download provided for this chapter on the Packt Publishing website. I saved it to the file path.json, as each line of the file is a separate JSON document.

Average path length

The first metric that we can get from the paths is the average path length. We can find this easily by walking over the paths. We'll use a slightly different definition of mean that doesn't require all the data to be kept in the memory. You can find this in the network-six.util namespace:

user=> (double
         (u/mean
           (map count (map :path (g/iter-paths "path.json")))))
6.525055748717483

This is interesting! Strictly speaking, the concept of six degrees of separation says that all paths in the network should be six or smaller However, experiments often look at the paths in terms of the average path length. In this case, the average distance between any two connected nodes in this graph is just over six. So, the six degrees of separation do appear to hold in this graph.

We can see the distribution of path lengths more clearly by looking at a histogram of them:

So, the distribution of path lengths appears to be more or less normal, centered on 6.

Network diameter

The network diameter is the longest of the shortest paths between any two nodes in the graph. This is simple to get:

user=> (reduce
         max Integer/MIN_VALUE
         (map count (map :path (g/iter-paths "path.json"))))
18

So the network diameter is approximately three times larger than the average.

Clustering coefficient

Clustering coefficient is a measure of how many densely linked clusters there are in the graph. This is one measure of the small world effect, and it's sometimes referred to as the "all my friends know each other" property. To find the clustering coefficient for one vertex, this basically cuts all of its neighbors out of the network and tries to find the density of the subgraph. In looking at the whole graph, a high clustering coefficient indicates a small world effect in the graph.

The following is how to find the clustering coefficient for a single vertex:

(defn clustering-coeff [graph n]
  (let [cluster ((:neighbors graph) n)
        edges (filter cluster (mapcat (:neighbors graph) cluster))
        e (count edges)
        k (count cluster)]
    (if (= k 1)
      0
      (/ (* 2.0 e) (* k (dec k))))))

The function to find the average clustering coefficient for the graph is straightforward, and you can find it in the code download. The following is how it looks when applied to this graph:

user=> (g/avg-cluster-coeff graph)
1.0874536731229358

So it's not overly large. Chances are, there are a few nodes that are highly connected throughout the graph and most others are less connected.

Centrality

There are several ways to measure how central a vertex is to the graph. One is closeness centrality. This is the distance of any particular vertex from all other vertices. We can easily get this information with the breadth-first function that we created earlier. Unfortunately, this only applies to complete networks, that is, to networks in which every vertex is reachable from every other vertex. This is not the case in the graph we're working with right now. There are some small pockets that are completely isolated from the rest of the network.

However, there are other measures of centrality that we can use instead. Betweenness centrality counts the number of shortest paths that a vertex is found in. Betweenness finds the vertices that act as a bridge. The original intent of this metric was to identify people who control the communication in the network.

To get this done efficiently, we can rely on the paths returned by the breadth-first function again. We'll get the paths from each vertex and call reduce over each. At every step, we'll calculate the total number of paths plus the number of times each vertex appears in a path:

(defn accum-betweenness
  [{:keys [paths betweenness reachable]} [v v-paths]]
  (let [v-paths (filter #(> (count %) 1) v-paths)]
    {:paths (+ paths (count v-paths)),
     :betweenness (merge-with +
                              betweenness
                              (frequencies (flatten v-paths))),
     :reachable (assoc reachable v (count v-paths))}))

Next, once we reach the end, we'll take the total number of paths and convert the betweenness and reachable totals for each vertex to a ratio, as follows:

(defn ->ratio [total [k c]]
  [k (double (/ c total))])
(defn finish-betweenness
  [{:keys [paths betweenness reachable] :as metrics}]
  (assoc metrics
         :betweenness (->> betweenness
                        (map #(->ratio paths %))
                        (into {}))
         :reachable (->> reachable
                      (map #(->ratio paths %))
                      (into {}))))

While these two functions do all the work, they aren't the public interface. The function metrics tie these two together in something we'd want to actually call:

(defn metrics [graph]
  (let [mzero {:paths 0, :betweenness {}, :reachable {}}]
    (->> graph
      get-vertices
      (pmap #(vector % (breadth-first graph %)))
      (reduce accum-betweenness mzero)
      finish-betweenness)))

We can now use this to find the betweenness centrality of any vertex as follows:

user=> (def m (g/metrics graph))
user=> ((:betweenness m) 0)
5.092923145895773E-4

Or, we can sort the vertices on the centrality measure to get those vertices that have the highest values. The first number in each pair of values that are returned is the node, and the second number is the betweenness centrality of that node. So, the first result says that the betweenness centrality for node 1085 is 0.254:

user=> (take 5 (reverse (sort-by second (seq (:betweenness m)))))
([1085 0.2541568423150047] [1718 0.1508391907570839] [1577 0.1228894724115601] [698 0.09236806137867479] [1505 0.08172539570689669])

This has all been interesting, but what about Kevin Bacon?

Degrees of separation

We started this chapter talking about the Six Degrees of Kevin Bacon, a pop culture phenomenon and how this captures a fundamental nature of many social networks. Let's analyze our Facebook network for this.

First, we'll create a function called degrees-between. This will take an origin vertex and a degree of separation to go out, and it will return a list of each level of separation and the vertices at that distance from the origin vertex. The degrees-between function will do this by accumulating a list of vertices at each level and a set of vertices that we've seen. At each step, it will take the last level and find all of those vertices' neighbors, without the ones we've already visited. The following is what this will look like:

(defn degrees-between [graph n from]
  (let [neighbors (:neighbors graph)]
    (loop [d [{:degree 0, :neighbors #{from}}],
           seen #{from}]
      (let [{:keys [degree neighbors]} (last d)]
        (if (= degree n)
          d
          (let [next-neighbors (->> neighbors
                             (mapcat (:neighbors graph))
                             (remove seen)
                             set)]
          (recur (conj d {:degree (inc degree)
                            :neighbors next-neighbors})
                   (into seen next-neighbors))))))))

Earlier, we included a way to associate data with a vertex, but we haven't used this yet. Let's exercise that feature to store the degrees of separation from the origin vertex in the graph. We can either call this function with the output of degrees-between or with the parameters to degrees-between:

(defn store-degrees-between
  ([graph degrees]
   (let [store (fn [g {:keys [degree neighbors]}]
                 (reduce #(set-value %1 %2 degree) g neighbors))]
     (reduce store graph degrees)))
  ([graph n from]
   (store-degrees-between graph (degrees-between graph n from))))

Finally, the full graph is a little large, especially for many visualizations. So, let's include a function that will let us zoom in on the graph identified by the degrees-between function. It will return both the original graph, with the vertex data fields populated and the subgraph of vertices within the n levels of separation from the origin vertex:

(defn degrees-between-subgraph [graph n from]
  (let [marked (store-degrees-between graph n from)
        v-set (set (map first (filter second (:data marked))))
        sub (subgraph marked v-set)]
    {:graph marked, :subgraph sub}))

With these defined, we can learn some more interesting things about the network that we're studying. Let's see how much of the network with different vertices can reach within six hops. Let's look at how we'd do this with vertex 0, and then we can see a table that presents these values for several vertices:

user=> (def v-count (count (g/get-vertices g)))
#'user/v-count
user=> (double
         (/ (count
              (g/get-vertices
                (:subgraph (g/degrees-between-subgraph g 6 0))))
            v-count))
0.8949229603435211

Now, it's interesting to see how the betweenness values for these track the amount of the graph that they can access quickly:

Vertex

Betweenness

Percent accessible

0

0.0005093

89.5000

256

0.0000001

0.0005

1354

0.0005182

75.9500

1085

0.2541568

96.1859

These are some interesting data points. What does this look like for the network as a whole?

This makes it clear that there's probably little correlation between these two variables. Most vertices have a very low betweenness, although they range between 0 and 100 in the percent of the network that they can access.

At this point, we have some interesting facts about the network, but it would be helpful to get a more intuitive overview of it, like we just did for the betweenness centrality. Visualizations can help here.

 

Visualizing the graph


At this point, it would be really useful to visualize this graph. There are a number of different ways to visualize graphs. We'll use the JavaScript library D3 (data-driven documents, http://d3js.org/) to generate several graph visualizations on subgraphs of the Facebook network data, and we'll look at the pros and cons of each. Finally, we'll use a simple pie chart to visualize how much of the graph is affected as we move outward from a node through its degrees of separation.

Setting up ClojureScript

As I just mentioned, D3 is a JavaScript library. JavaScripts are not bad, but this is a book about Clojure. There's an implementation of the Clojure compiler that takes Clojure and generates JavaScript. So, we'll use that to keep our focus on Clojure while we call JavaScript libraries and deploy them on a browser.

Before we can do that, however, we need to set up our system to use ClojureScript. The first thing we'll need to do is to add the configuration to our project.clj file for this project. This is fairly simple. We just need to declare lein-cljsbuild as a plugin for this project and then configure the ClojureScript compiler. Our project.clj file from earlier is shown as follows, with the relevant lines highlighted as follows:

(defproject network-six "0.1.0-SNAPSHOT"
  :description "FIXME: write description"
  :url "http://example.com/FIXME"
  :license {:name "Eclipse Public License"
            :url "http://www.eclipse.org/legal/epl-v10.html"}
  :plugins [[lein-cljsbuild "0.3.2"]]
  :dependencies [[org.slf4j/slf4j-simple "1.7.5"]
                 [org.clojure/clojure "1.5.1"]
                 [org.clojure/data.json "0.2.2"]
                 [me.raynes/fs "1.4.4"]
                 [org.clojure/clojurescript "0.0-2202"]]
  :cljsbuild {:builds
              [{:source-paths ["src-cljs"],
                :compiler {:pretty-printer true,
                :output-to "www/js/main.js",
                :optimizations :whitespace}}]})

The first line adds the lein-cljsbuild plugin to the project. The second block of lines tell Leiningen to watch the src-cljs directory for ClojureScript files. All of these files are then compiled into the www/js/main.js file.

We'll need an HTML file to frame the compiled JavaScript. In the code download, I've included a basic page that's modified from an HTML5 Boilerplate template (http://html5boilerplate.com/). The biggest change is that I've taken out everything that's in the div content.

Also, I added some script tags to load D3 and a D3 plugin for one of the types of graphs that we'll use later. After the tag that loads bootstrap.min.js, I added these:

<script src="http://d3js.org/d3.v3.min.js"></script>
<script src="http://d3js.org/d3.hive.v0.min.js"></script>

Finally, to load the data files asynchronously with AJAX, the www directory will need to be accessible from a web server. There are a number of different options, but if you have Python installed, the easiest option is to probably navigate to the www directory and execute the following command:

$ cd www
$ python -m SimpleHTTPServer
Serving HTTP on 0.0.0.0 port 8000 ...

Now we're ready to proceed. Let's make some charts!

A force-directed layout

One of the standard chart types to visualize graphs is a force-directed layout. These charts use a dynamic-layout algorithm to generate charts that are more clear and look nice. They're modeled on springs. Each vertex repels all the other vertices, but the edges draw the vertices closer.

To have this graph compiled to JavaScript, we start by creating a file named src-cljs/network-six/force.cljs. We'll have a standard namespace declaration at the top of the file:

(ns network-six.force)

Generally, when we use D3, we first set up part of the graph. Then, we get the data. When the data is returned, we continue setting up the graph. In D3, this generally means selecting one or more elements currently in the tree and then selecting some of their children using selectAll. The elements in this new selection may or may not exist at this point. We join the selectAll elements with the data. From this point, we use the enter method most of the time to enter the data items and the nonexistent elements that we selected earlier. If we're updating the data, assuming that the elements already exist, then the process is slightly different. However, the process that uses the enter method, which I described, is the normal workflow that uses D3.

So, we'll start with a little setup for the graph by creating the color palette. In the graph that we're creating, colors will represent the node's distance from a central node. We'll take some time to understand this, because it illustrates some of the differences between Clojure and ClojureScript, and it shows us how to call JavaScript:

(defn make-color []
  (.. js/d3
    -scale
    category10
    (domain (array 0 1 2 3 4 5 6))))

Let's take this bit by bit so that we can understand it all. I'll list a line and then point out what's interesting about it:

  (.. js/d3

There are a couple of things that we need to notice about this line. First,.. is the standard member access macro that we use for Java's interoperability with the main Clojure implementation. In this case, we're using it to construct a series of access calls against a JavaScript object. In this case, the ClojureScript that the macro expands to would be (.domain (.category10 (.-scale js/d3)) (array 0 1 2 3 4 5 6)).

In this case, that object is the main D3 object. The js/ namespace is available by default. It's just an escape hatch to the main JavaScript scope. In this case, it would be the same as accessing a property on the JavaScript window object. You can use this to access anything from JavaScript without having to declare it. I regularly use it with js/console for debugging, for example:

    -scale

This resolves into the JavaScript d3.scale call. The minus sign before scale just means that the call is a property and not a function that takes no arguments. As Clojure doesn't have properties and everything here would look like a function call, ClojureScript needs some way to know that this should not generate a function call. The dash does that as follows:

    category10

This line, combined with the preceding lines, generates JavaScript that looks like d3.scale.category10(). In this case, the call doesn't have a minus sign before it, so the ClojureScript compiler knows that it should generate a function call in this case:

    (domain (array 0 1 2 3 4 5 6))))

Finally, this makes a call to the scale's domain method with an array that sets the domain to the integers between 0 and 6, inclusive of both. These are the values for the distances that we'll look at. The JavaScript for this would be d3.scale.category10().domain([0, 1, 2, 3, 4, 5, 6]).

This function creates and returns a color object. This object is callable, and when it acts as a function that takes a value and returns a color, this will consistently return the same color whenever it's called with a given value from the domain. For example, this way, the distance 1 will also be associated with the same color in the visualization.

This gives us an introduction to the rules for interoperability in ClojureScript. Before we make the call to get the data file, we'll also create the object that takes care of managing the force-directed layout and the D3 object for the svg element. However, you can check the code download provided on the Packt Publishing website for the functions that create these objects.

Next, we need to access the data. We'll see that in a minute, though. First, we need to define some more functions to work with the data once we have it.For the first function, we need to take the force-layout object and associate the data with it.

The data for all of the visualizations has the same format. Each visualization is a JSON object with three keys. The first one, nodes, is an array of JSON objects, each representing one vertex in the graph. The main property of these objects that we're interested in is the data property. This contains the distance of the current vertex from the origin vertex. Next, the links property is a list of JSON objects that represent the edges of the graph. Each link contains the index of a source vertex and a target vertex. Third, the graph property contains the entire graph using the same data structures as we did in Clojure.

The force-directed layout object expects to work with the data from the nodes and the links properties. We set this up and start the animation with the setup-force-layout function:

(defn setup-force-layout [force-layout graph]
  (.. force-layout
    (nodes (.-nodes graph))
    (links (.-links graph))
    start))

As the animation continues, the force-layout object will assign each node and link the object with one or more coordinates. We'll need to update the circles and paths with those values.

We'll do this with a handler for a tick event that the layout object will emit:

(defn on-tick [link node]
  (fn []
    (.. link
      (attr "x1" #(.. % -source -x))
      (attr "y1" #(.. % -source -y))
      (attr "x2" #(.. % -target -x))
      (attr "y2" #(.. % -target -y)))
    (.. node
      (attr "cx" #(.-x %))
      (attr "cy" #(.-y %)))))

Also, at this stage, we create the circle and path elements that represent the vertices and edges. We won't list these functions here.

Finally, we tie everything together. First, we set up the initial objects, then we ask the server for the data, and finally, we create the HTML/SVG elements that represent the data. This is all tied together with the main function:

(defn ^:export main [json-file]
  (let [width 960, height 600
        color (make-color)
        force-layout (make-force-layout width height)
        svg (make-svg width height)]
    (.json js/d3 json-file
           (fn [err graph]
             (.. graph
               -links
               (forEach #(do (aset %1 "weight" 1.0)
                           (aset %1 "index" %2))))
             (setup-force-layout force-layout graph)
             (let [link (make-links svg graph color)
                   node (make-nodes svg graph color force-layout)]
               (.on force-layout "tick"
                    (on-tick link node)))))))

There are a couple of things that we need to notice about this function, and they're both highlighted in the preceding snippet. The first is that the function name has an :export metadata flag attached to it. This just signals that the ClojureScript compiler should make this function accessible from JavaScript outside this namespace. The second is the call to d3.json. This function takes a URL for a JSON data file and a function to handle the results. We'll see more of this function later.

Before we can use this, we need to call it from the HTML page. After the script tag that loads js/main.js, I added this script tag:

<script>
network_six.force.main('facebook-49.json');
</script>

This loads the data file for vertex number 49. This vertex had a betweenness factor of 0.0015, and it could reach four percent of the larger network within six hops. This is small enough to create a meaningful, comprehensible graphic, as seen in the following figure:

The origin vertex (49) is the blue vertex on the lower-right section, almost the farthest-right node of the graph. All the nodes at each hop away from that node will be of a different color. The origin vertex branches to three orange vertices, which link to some green ones. One of the green vertices is in the middle of the larger cluster on the right.

Some aspects of this graph are very helpful. It makes it relatively easy to trace the nodes as they get farther from the origin. This is even easier when interacting with the node in the browser, because it's easy to grab a node and pull it away from its neighbors.

However, it distorts some other information. The graph that we're working with today is not weighted. Theoretically, the links in the graph should be the same length because all the edges have the same weight. In practice, however, it's impossible to display a graph in two dimensions. Force-directed layouts help you display the graph, but the cost is that it's hard to tell exactly what the line lengths and the several clear clusters of various sizes mean on this graph.

Also, the graphs themselves cannot be compared. If we then pulled out a subgraph around a different vertex and charted it, we wouldn't be able to tell much by comparing the two.

So what other options do we have?

A hive plot

The first option is a hive plot. This is a chart type developed by Martin Krzywinski (http://egweb.bcgsc.ca/). These charts are a little different, and reading them can take some time to get used to, but they pack in more meaningful information than force-directed layout or other similar chart types do.

In hive plots, the nodes are positioned along a number of radial axes, often three. Their positions on the axis and which axis they fall on are often meaningful, although the meanings may change between different charts in different domains.

For this, we'll have vertices with a higher degree (with more edges attached to them) be positioned farther out from the center. Vertices closer in will have fewer edges and fewer neighbors. Again, the color of the lines represent the distance of that node from the central node. In this case, we won't make the selection of the axis meaningful.

To create this plot, we'll open a new file, src-cljs/network-six/hive.cljs. At the top, we'll use this namespace declaration:

(ns network-six.hive)

The axis on which a node falls on is an example of a D3 scale; its color from the force layout plot is another scale. Scales are functions that also have properties attached and are accessible via getter or setter functions. However, primarily, when they are passed a data object and a key function, they know how to assign that data object a position on the scale.

In this case, the make-angle function will be used to assign nodes to an axis:

(defn make-angle []
  (.. js/d3
    -scale
    ordinal
    (domain (.range js/d3 4))
    (rangePoints (array 0 (* 2.0 pi)))))

We'll position the nodes along each axis with the get-radius function. This is another scale that takes a vertex and positions it in a range between 40 and 400 according to the number of edges that are connected to it:

(defn get-radius [nodes]
  (.. js/d3
    -scale
    linear
    (range (array 40 400))
    (domain (array (.min js/d3 nodes #(.-count %))
                   (.max js/d3 nodes #(.-count %))))))

We use these scales, along with a scale for color, to position and style the nodes:

(defn make-circles [svg nodes color angle radius]
  (.. svg
    (selectAll ".node")
    (data nodes)
    (enter)
    (append "circle")
    (attr "stroke" #(color (.-data %)))
    (attr "transform"
          #(str "rotate(" (degrees (angle (mod (.-n %) 3))) \)))
    (attr "cx" #(radius (.-count %)))
    (attr "r" 5)
    (attr "class" #(get-classes %))))

I've highlighted the scales that we use in the preceding code snippet. The circle's stroke property comes from the color, which represents the distance of the vertex from the origin for this graph.

The angle is used to assign the circle to an axis using the circle's transform attribute. This is done more or less at random, based on the vertex's index in the data collection.

Finally, the radius scale positions the circle along the axis. This sets the circle's position on the x axis, which is then rotated using the transform attribute and the angle scale.

Again, everything is brought together in the main function. This sets up the scales, requests the data, and then creates and positions the nodes and edges:

(defn ^:export main [json-file]
  (let [width 750, height 750
        angle (make-angle), color (make-color)
        svg (make-svg width height)]
    (.json js/d3 json-file
           (fn [err data]
             (let [nodes (.-nodes data)
                   radius (get-radius nodes)]
               (make-axes svg angle radius)
               (let [df (get-degreed nodes data)]
                 (make-arcs svg nodes df color angle radius)
                 (make-circles svg nodes color angle radius)))))))

Let's see what this graph looks like:

Again, the color represents the distance of the node from the central node. The distance from the center on each axis is the degree of the node.

It's clear from the predominance of the purple-pink color and the bands that the majority of the vertices are six hops from the origin vertex. From the vertices' position on the axes, we can also see that most nodes have a moderate number of edges attached to them. One has quite a few, but most are much closer to the center.

This graph is denser. Although the force-layout graph may have been problematic, it seemed more intuitive and easier to understand, whether it was meaningful or not. Hive plots are more meaningful, but they also take a bit more work to learn to read and to decipher.

A pie chart

Our needs today are simpler than the complex graph we just created; however, we're primarily interested in how much of the network is covered within six hops from a vertex. Neither of the two graphs that we've looked at so far conveyed that well, although they have presented other information and they're commonly used with graphs. We want to know proportions, and the go-to chart for proportions is the pie chart. Maybe it's a little boring, and it's does not strictly speak of a graph visualization per se, but it's clear, and we know what we're dealing with in it.

Generating a pie chart will look very similar to creating a force-directed layout graph or a hive plot. We'll go through the same steps, overall, even though some of the details will be different.

One of the first differences is the function to create an arc. This is similar to a scale, but its output is used to create the d (path description) attribute of the pie chart's wedges:

(defn make-arc [radius]
  (.. js/d3 -svg arc
    (outerRadius (- radius 10))
    (innerRadius 0)))

The pie layout controls the overall process and design of the chart. In this case, we say that we want no sorting, and we need to use the amount property of the data objects:

(defn make-pie []
  (.. js/d3 -layout pie
    (sort nil)
    (value #(.-amount %))))

The other difference in this chart is that we'll need to preprocess the data before it's ready to be fed to the pie layout. Instead of a list of nodes and links, we'll need to give it categories and counts. To make this easier, we'll create a record type for these frequencies:

(defrecord Freq [degree amount])

Also, we'll need a function that takes the same data as the other charts, counts it by distance from the origin vertex, and creates Freq instances to contain that data:

(defn get-freqs [data]
  (->> data
    .-nodes
    (map #(.-data %))
    frequencies
    (map #(Freq. (first %) (second %)))
    into-array))

Again, we pull all these together in the main function, and we do things in the usual way. First, we set up the graph, then we retrieve the data, and finally, we put the two together to create the graph.

In this case, this should give us an idea of how much of the graph this vertex can easily touch. The graph for vertex 49 is shown as follows. We can see that it really doesn't touch much of the network at all. 3799 vertices, more than 95 percent of the network, aren't within six hops of vertex 49.

However, if we compare this with the pie chart for vertex 1085, which was the vertex with the highest betweenness factor, we see a very different picture. For that vertex, more than 95 percent of the network is reachable within 6 hops.

It's also interesting that most of the vertices are four edges away from the origin. For smaller networks, most vertices are further away. However, in this case, it's almost as if it had started running out of vertices in the network.

 

Summary


So, we discovered that this dataset does conform to a loose definition of the small world or a six-degree hypothesis. The average distance between any two nodes is about six. Also, as we're working with a sample, it's possible that working with a complete graph may fill in some links and bring the nodes closer together.

We also had an interesting time looking at some visualizations. One of the important lessons that we learned was that more complicated isn't always better. Simple, perhaps even a little boring, graphs can sometimes answer the questions we have in a better manner.

However, we've barely scratched the surface of what we can do with social graphs. We've primarily been looking at the network as a very basic, featureless graph, looking at the existence of people and their relationships without digging into the details. However, there are several directions we could go in to make our analysis more social. For one, we could look at the different types of relationships. Facebook and other social platforms allow you to specify spouses, for example, it might be interesting to look at an overlap between spouses' networks. Facebook also tracks interests and affiliations using their well-known Like feature. We could also look at how well people with similar interests find each other and form cliques.

In the end, we've managed to learn a lot about networks and how they work. Many real-world social networks share very similar characteristics, and there's a lot to be learned from sociology as well. These structures have always defined us but never more so than now. Being able to effectively analyze social networks, and the insights we can get from them, can be a useful and effective part of our toolkit.

In the next chapter, we'll look at using geographical analysis and applying that to weather data.

About the Author

  • Eric Rochester

    Eric Rochester enjoys reading, writing, and spending time with his wife and kids. When he’s not doing these things, he programs in a variety of languages and platforms, including websites and systems in Python, and libraries for linguistics and statistics in C#. Currently, he is exploring functional programming languages, including Clojure and Haskell. He works at Scholars’ Lab in the library at the University of Virginia, helping humanities professors and graduate students realize their digitally informed research agendas. He is also the author of Mastering Clojure Data Analysis, Packt Publishing.

    Browse publications by this author
Book Title
Access this book and the full library for FREE
Access now