You're reading from Mastering Clojure Data Analysis

Product type: Book
Published in: May 2014
Reading level: Beginner
ISBN-13: 9781783284139
Edition: 1st Edition
Author: Eric Richard Rochester

Eric Richard Rochester studied medieval English literature and linguistics at UGA and wrote his dissertation on lexicography. He now programs in Haskell and writes. He's also a husband and parent.

Dealing with messy data


The first thing we need to deal with is the qualitative data in the shape and description fields.

The shape field seems like a likely place to start. Let's see how many items have good data for it:

user=> (def data (m/read-data "data/ufo_awesome.tsv"))
user=> (count (remove (comp str/blank? :shape) data))
58870
user=> (count (filter (comp str/blank? :shape) data))
2523
user=> (count data)
61393
user=> (float 2506/61137)
0.04098991
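
Rather than typing the counts in by hand, the same kind of ratio can be computed from the records directly. The following is only a minimal sketch: it assumes that each record is a map with a :shape key, as the calls above suggest, and both blank-ratio and sample-data are hypothetical names made up for this illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample records standing in for the parsed UFO data.
(def sample-data
  [{:shape "light"} {:shape " "} {:shape nil} {:shape "disk"}])

(defn blank-ratio
  "Fraction of records in coll whose value under k is nil, empty,
  or only whitespace. Returns 0.0 for an empty collection."
  [k coll]
  (let [total (count coll)]
    (if (zero? total)
      0.0
      (double (/ (count (filter (comp str/blank? k) coll)) total)))))

(blank-ratio :shape sample-data)
;; => 0.5
```

Deriving both counts from the same collection keeps the numerator and denominator in step with whatever data is currently loaded.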

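So about 4 percent of the records have no meaningful value in the shape field. (The hand-entered ratio in the session, 2506 of 61137, uses slightly different counts from the ones printed just before it, 2523 of 61393; both come to about 4 percent.) Let's see what the most popular values for that field are: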

user=> (def shape-freqs
           (frequencies
             (map str/trim
                  (map :shape
                       (remove (comp str/blank? :shape) data)))))
#'user/shape-freqs
user=> (pprint (take 10 (reverse (sort-by second shape-freqs))))
(["light" 12202]
 ["triangle" 6082]
 ["circle" 5271]
 ["disk" 4825]
 ["other" 4593]
 ["unknown" 4490]
 ["sphere" 3637]
 ["fireball...
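
The nested expression used to build shape-freqs can also be written as a pipeline with the ->> threading macro, which lists the steps in the order they run. This is a sketch under stated assumptions rather than the book's own code: top-values and sample-data are hypothetical names, and the sample records are invented for illustration.

```clojure
(require '[clojure.string :as str])

;; Hypothetical sample records; the real data comes from the TSV file.
(def sample-data
  [{:shape " light"} {:shape "disk"} {:shape "light "}
   {:shape ""} {:shape nil} {:shape "light"} {:shape "disk"}
   {:shape "sphere"}])

(defn top-values
  "Returns the n most frequent non-blank, trimmed values of k in coll
  as [value count] pairs, in descending order of count."
  [k n coll]
  (->> coll
       (map k)              ; pull out the field
       (remove str/blank?)  ; drop nil, empty, and whitespace-only values
       (map str/trim)       ; normalize surrounding whitespace
       frequencies          ; map of value -> count
       (sort-by val >)      ; highest count first
       (take n)))

(top-values :shape 2 sample-data)
;; => (["light" 3] ["disk" 2])
```

Sorting with (sort-by val >) gives descending order directly, so the separate reverse step is not needed.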