You're reading from Mastering Clojure Data Analysis

Product type: Book
Published in: May 2014
Reading level: Beginner
ISBN-13: 9781783284139
Edition: 1st Edition
Author: Eric Richard Rochester

Eric Richard Rochester studied medieval English literature and linguistics at UGA and wrote his dissertation on lexicography. He now programs in Haskell and writes. He's also a husband and parent.

Chapter 4. Classifying UFO Sightings

In this chapter, we're going to look at a dataset of UFO sightings. Sometimes, data analysis begins with a specific question or problem. Sometimes, however, it's more nebulous and vague. We'll engage with this UFO sighting dataset, and along the way, we'll learn more about data exploration, data visualization, and topic modeling before we dive into Naïve Bayesian classification.

This dataset was collected by the National UFO Reporting Center (NUFORC), and is available at http://www.nuforc.org/. They have included dates, rough locations, shapes, and descriptions of the sightings. We'll download and pull in this dataset. We'll see how to extract more structured data from messy, free-form text. And from there, we'll see how to visualize, analyze, and gain insights into our data.

In the process, we'll discover the best time to look for UFOs. We'll also learn what their important characteristics are. And we'll learn how to tell a description of a possible...

Getting the data


For this chapter, acquiring the data will be relatively easy. In other chapters, this step involves screen scraping, SPARQL, or other data extraction, munging, and cleaning techniques. For this dataset, we'll just download it from Infochimps (http://www.infochimps.com/). Infochimps is a company, and a website, devoted to Big Data and to doing more with data analysis. They provide a collection of datasets that are online and freely available. To download this specific dataset, browse to http://www.infochimps.com/datasets/60000-documented-ufo-sightings-with-text-descriptions-and-metada and download the data from the link there, as shown in the following screenshot:

The data is in a ZIP-compressed file. Expanding it creates the chimps_16154-2010-10-20_14-33-35 directory, which contains a file that lists metadata for the dataset as well as the data itself in several different formats. For the purposes of this chapter, we'll use the tab-separated values (TSV) file...
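To make the shape of a TSV record concrete, here is a minimal sketch that turns tab-separated lines into Clojure maps using only clojure.string. The column order in fields, the function names, and the two sample rows are assumptions made up for illustration; they are not taken from the dataset's metadata file, which should be checked before relying on them.

```clojure
(require '[clojure.string :as str])

;; Assumed column order (hypothetical); verify against the real file.
(def fields
  [:sighted-at :reported-at :location :shape :duration :description])

(defn parse-line
  "Splits one tab-separated line into a map keyed by `fields`."
  [line]
  ;; The -1 limit keeps trailing empty columns instead of dropping them.
  (zipmap fields (str/split line #"\t" -1)))

(defn parse-tsv
  "Parses a whole TSV string into a sequence of maps, skipping blank lines."
  [text]
  (->> (str/split-lines text)
       (remove str/blank?)
       (map parse-line)))

;; Two invented rows, only to exercise the parser.
(def sample
  (str "19951009\t19951009\tIowa City, IA\t\t\tMan repts. witnessing flash.\n"
       "19950101\t19950103\tShelton, WA\tlight\t\tBright light over the bay.\n"))

(def rows (parse-tsv sample))
```

Splitting on tabs by hand is enough for a quick look at the file. For real work, the org.clojure/data.csv dependency is the more robust choice, since it also handles quoting.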

Extracting the data


Before we go further, let's look at the following Leiningen 2 (http://leiningen.org/) project.clj file that we'll use for this chapter:

(defproject ufo-data "0.1.0-SNAPSHOT"
  :plugins [[lein-cljsbuild "0.3.2"]]
  :profiles {:dev {:plugins [[com.cemerick/austin "0.1.0"]]}}
  :dependencies [[org.clojure/clojure "1.5.1"]
                 [org.clojure/data.json "0.2.2"]
                 [org.clojure/data.csv "0.1.2"]
                 [clj-time "0.5.1"]
                 [incanter "1.5.2"]
                 [cc.mallet/mallet "2.0.7"]
                 [me.raynes/fs "1.4.4"]]
  :cljsbuild
    {:builds [{:source-paths ["src-cljs"],
               :compiler {:pretty-printer true,
                          :output-to "www/js/main.js",
                          :optimizations :whitespace}}]})

The preceding code shows that over the course of this chapter, we'll parse time with the clj-time library (https://github.com/clj-time/clj-time). This provides a rich, robust date and time library...
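As a rough illustration of the kind of tolerant date parsing that messy data calls for, the following sketch uses java.time (Java 8 or later) rather than clj-time, so that it runs without extra dependencies. The yyyyMMdd input format and the parse-date name are assumptions, not something taken from the chapter's own code.

```clojure
(import '[java.time LocalDate]
        '[java.time.format DateTimeFormatter DateTimeParseException])

(defn parse-date
  "Returns a LocalDate for a yyyyMMdd string (an assumed format),
  or nil when the input is missing or cannot be parsed."
  [^String s]
  (when s
    (try
      (LocalDate/parse s DateTimeFormatter/BASIC_ISO_DATE)
      (catch DateTimeParseException _
        nil))))

;; A well-formed value parses; a placeholder such as "00000000" yields nil.
(parse-date "19951009")
(parse-date "00000000")
```

Returning nil instead of throwing lets later steps filter out bad rows with remove or keep rather than wrapping every call in its own error handling.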

Dealing with messy data


The first thing that we need to deal with is qualitative data from the shape and description fields.

The shape field seems like a likely place to start. Let's see how many items have good data for it:

user=> (def data (m/read-data "data/ufo_awesome.tsv"))
user=> (count (remove (comp str/blank? :shape) data))
58870
user=> (count (filter (comp str/blank? :shape) data))
2523
user=> (count data)
61393
user=> (float 2523/61393)
0.04109589

So about 4 percent of the records lack a meaningful value in the shape field. Let's see what the most popular values for that field are:

user=> (def shape-freqs
           (frequencies
             (map str/trim
                  (map :shape
                       (remove (comp str/blank? :shape) data)))))
#'user/shape-freqs
user=> (pprint (take 10 (reverse (sort-by second shape-freqs))))
(["light" 12202]
 ["triangle" 6082]
 ["circle" 5271]
 ["disk" 4825]
 ["other" 4593]
 ["unknown" 4490]
 ["sphere" 3637]
 ["fireball...
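The same counting can be tried on a handful of records without loading the whole file. The following is a small sketch that assumes each record is a map with a :shape key, as in the session above; the shape-counts name and the sample records are made up for illustration.

```clojure
(require '[clojure.string :as str])

(defn shape-counts
  "Returns [shape count] pairs, most frequent first, ignoring blank shapes."
  [records]
  (->> records
       (map :shape)
       (remove str/blank?)   ; str/blank? treats nil, "" and "  " alike
       (map str/trim)
       frequencies
       (sort-by second >)))

;; Invented records, including the kinds of gaps the real data has.
(def sample
  [{:shape "light"} {:shape " light "} {:shape ""} {:shape nil}
   {:shape "disk"} {:shape "light"}])

(shape-counts sample)
;; => (["light" 3] ["disk" 1])
```

Passing > to sort-by sorts in descending order directly, which avoids the separate reverse step.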

Visualizing UFO data


We'll spend a good bit of time visualizing the data, and we'll use the same system that we used in the previous chapters: a bit of HTML, a splash of CSS, and a lot of JavaScript, which we'll generate from ClojureScript.

We've already taken care of the configuration for using ClojureScript in the project.clj file that I mentioned earlier. The rest of it involves a few more parts:

  • The code to generate the JSON data for the graph. This will be in the src/ufo_data/analysis.clj file. We'll write this code first.

  • An HTML page that loads the JavaScript libraries that we'll use—jQuery (https://jquery.org/) and D3 (http://d3js.org/)—and creates a div container in which to put the graph itself.

  • The source code for the graph. This will include a namespace for utilities in src-cljs/ufo-data/utils.cljs and the main namespace at src-cljs/ufo-data/viz.cljs.

With these prerequisites in place, we can start creating the graph of the frequencies of the different shapes.

First, we need...

Description


While the shape field is important, the description has more information. Let's see what we can do with it.

First, let's examine a few descriptions to see what they look like. The following example is one that I selected randomly:

Large boomerang shaped invisible object blocked starlight while flying across sky. I have a sketch and noted the year was 1999, but did not write down the day. The sighting took place in the late evening when it was completely dark and the sky was clear and full of stars. Out of the corner of my eye, I noticed movement in the sky from the north moving to the south. When I looked closer, however, it wasn’t an object that I was seeing move, rather it was the disappearance and reappearance of stars behind an object. The object itself was black or invisible with no lights. Given the area of stars that were blocked out, I would say the object was five times larger than a jet. It was completely silent. It was shaped like a boomerang only a little more...

Topic modeling descriptions


Another way to gain a better understanding of the descriptions is to use topic modeling. We learned about this text mining and machine learning algorithm in Chapter 3, Topic Modeling – Changing Concerns in the State of the Union Addresses. In this case, we'll see if we can use it to create topics over these descriptions and to pull out the differences, trends, and patterns from this set of texts.

First, we'll create a new namespace to handle our topic modeling. We'll use the src/ufo_data/tm.clj file. The following is the namespace declaration for it:

(ns ufo-data.tm
  (:require [clojure.java.io :as io]
            [clojure.string :as str]
            [clojure.pprint :as pp])
  (:import [cc.mallet.types InstanceList]
           [cc.mallet.pipe
            Input2CharSequence TokenSequenceLowercase
            CharSequence2TokenSequence SerialPipes
            TokenSequenceRemoveStopwords
            TokenSequence2FeatureSequence]
   ...
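The pipe classes imported above correspond roughly to a short chain of text-processing steps: lowercase the text, split it into tokens, drop stop words, and map each token to an integer feature. The following sketch imitates those steps in plain Clojure so that the idea can be tried at the REPL. It is not MALLET's implementation, and the tiny stop list, the function names, and the sample sentences are all assumptions made for illustration.

```clojure
(require '[clojure.string :as str])

;; A tiny illustrative stop list; MALLET ships a much longer one.
(def stopwords #{"a" "an" "and" "in" "it" "of" "the" "was"})

(defn tokenize
  "Lowercases text, splits it into runs of letters, and drops stop words."
  [text]
  (->> (re-seq #"\p{L}+" (str/lower-case text))
       (remove stopwords)))

(defn feature-index
  "Assigns each distinct token an integer id in order of first appearance."
  [token-seqs]
  (zipmap (distinct (apply concat token-seqs)) (range)))

(defn ->feature-seq
  "Replaces each token with its integer id from the index."
  [index tokens]
  (mapv index tokens))

;; Invented sentences, only to exercise the steps.
(def docs ["The object was black and silent." "A silent light in the sky."])
(def token-seqs (map tokenize docs))
(def index (feature-index token-seqs))

(map #(->feature-seq index %) token-seqs)
;; => ([0 1 2] [2 3 4])
```

Topic modeling works on these integer sequences rather than on raw strings, which is why the conversion to features comes last in the chain.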

Hoaxes


One of the most interesting finds in the topic model was topic seven. This topic focused on the annotations added to descriptions whose witnesses wished to remain anonymous. But its most likely document was the following:

Round, lighted object over Shelby, NC, hovered then zoomed away. It was my birthday party and me and my friends were walking around the block about 21:30. I just happened to look up and I saw a circular object with white and bright blue lights all over the bottom of it. It hovered in place for about 8 seconds then shot off faster than anything I have ever seen.((NUFORC Note: Witness elects to remain totally anonymous; provides no contact information. Possible hoax?? PD))((NUFORC Note: Source of report indicates that the date of the sighting is approximate. PD))

What caught my attention was the note "Possible hoax??" Several other descriptions in this topic had similar notes, often including the word hoax.
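Notes like these follow a regular enough pattern that we can pull them out with a regular expression. The sketch below is one way to do that, assuming every annotation has the ((NUFORC Note: ...)) form seen in the description above; the function names are hypothetical and the sample text is abridged from that description.

```clojure
(defn nuforc-notes
  "Returns the text of each ((NUFORC Note: ...)) annotation in a description."
  [description]
  ;; (?s) lets a note span line breaks; .*? stops at the first closing )).
  (map second
       (re-seq #"(?s)\(\(NUFORC Note:\s*(.*?)\)\)" description)))

(defn possible-hoax?
  "True when any NUFORC note mentions the word hoax, ignoring case."
  [description]
  (boolean (some #(re-find #"(?i)\bhoax" %)
                 (nuforc-notes description))))

(def sample
  (str "It hovered in place for about 8 seconds then shot off."
       "((NUFORC Note: Witness elects to remain totally anonymous; "
       "provides no contact information. Possible hoax?? PD))"
       "((NUFORC Note: Source of report indicates that the date of "
       "the sighting is approximate. PD))"))

(count (nuforc-notes sample))
;; => 2
(possible-hoax? sample)
;; => true
```

Searching only inside the notes, rather than the whole description, avoids flagging a witness who happens to write "this is not a hoax" in their own account.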

Finding this raised an interesting possibility: could...
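To give a feel for how a Naïve Bayesian classifier decides between labels, here is a small hand-rolled multinomial version with add-one smoothing. It is only a sketch of the technique, not the classifier this chapter goes on to use, and the function names and the four training sentences are invented for illustration.

```clojure
(require '[clojure.string :as str])

(defn tokens [text]
  (re-seq #"[a-z]+" (str/lower-case text)))

(defn train
  "Builds a model from [label text] pairs: document counts per label,
  token counts per label, and the overall vocabulary."
  [examples]
  (reduce (fn [model [label text]]
            (let [ts (tokens text)]
              (-> model
                  (update-in [:docs label] (fnil inc 0))
                  (update-in [:words label]
                             #(merge-with + % (frequencies ts)))
                  (update-in [:vocab] into ts))))
          {:docs {} :words {} :vocab #{}}
          examples))

(defn log-score
  "Log of P(label) times the product of P(token | label), using add-one
  smoothing. Tokens never seen in training are ignored."
  [{:keys [docs words vocab]} label text]
  (let [total-docs (reduce + (vals docs))
        counts     (get words label {})
        denom      (+ (reduce + (vals counts)) (count vocab))]
    (reduce +
            (Math/log (/ (docs label) (double total-docs)))
            (map #(Math/log (/ (inc (get counts % 0)) (double denom)))
                 (filter vocab (tokens text))))))

(defn classify
  "Returns the label with the highest log score for the text."
  [model text]
  (apply max-key #(log-score model % text) (keys (:docs model))))

;; Invented training data, far too small for real use.
(def model
  (train [[:hoax  "possible hoax student report"]
          [:hoax  "hoax suspected, no contact information"]
          [:other "bright light hovered over the lake"]
          [:other "silent triangle with three lights"]]))

(classify model "student hoax report")
;; => :hoax
(classify model "a bright light over the lake")
;; => :other
```

Working with sums of logarithms rather than products of probabilities keeps the scores from underflowing to zero on long descriptions.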

Summary


This has been a wandering and hopefully fun trip through the UFO sightings dataset. We've learned something about the language used in describing close encounters, and we've learned about how to use visualizations, exploratory data analysis, and Naïve Bayesian classification to learn more about the data.

But the primary impression I have of this is the feedback between analysis, visualization, and exploration. The visualization led us to topic modeling, and something we discovered there led us to Bayesian classification. This is typical of data analysis, where one thing we learn informs and motivates the next stage in the analysis. Each answer can raise further questions and drive us back into the data.
