
You're reading from Clojure for Data Science

Product type: Book
Published in: Sep 2015
Reading level: Intermediate
ISBN-13: 9781784397180
Edition: 1st Edition
Author: Henry Garner

Henry Garner is a graduate from the University of Oxford and an experienced developer, CTO, and coach. He started his technical career at Britain's largest telecoms provider, BT, working with a traditional data warehouse infrastructure. As a part of a small team for 3 years, he built sophisticated data models to derive insight from raw data and use web applications to present the results. These applications were used internally by senior executives and operatives to track both business and systems performance. He then went on to co-found Likely, a social media analytics start-up. As the CTO, he set the technical direction, leading to the introduction of an event-based append-only data pipeline modeled after the Lambda architecture. He adopted Clojure in 2011 and led a hybrid team of programmers and data scientists, building content recommendation engines based on collaborative filtering and clustering techniques. He developed a syllabus and copresented a series of evening classes from Likely's offices for professional developers who wanted to learn Clojure. Henry now works with growing businesses, consulting in both a development and technical leadership capacity. He presents regularly at seminars and Clojure meetups in and around London.

Binning data


To develop an intuition for what these various calculations of variance are measuring, we can employ a technique called binning. Where data is continuous, using frequencies (as we did with the election data to count the nils) is not practical since no two values may be the same. However, it's possible to get a broad sense of the structure of the data by grouping the data into discrete intervals.
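As a rough illustration of why raw frequencies say little about continuous values, consider the sketch below. The measurements in xs are made up for this example and are not taken from the election data; grouping by unit-wide intervals is just one simple way of forming discrete intervals.

```clojure
;; Hypothetical continuous measurements; no two values are equal.
(def xs [1.02 1.48 1.51 2.75 2.96])

;; Every distinct value gets a count of 1, so the result reveals no structure.
(frequencies xs)
;; => {1.02 1, 1.48 1, 1.51 1, 2.75 1, 2.96 1}

;; Rounding each value down to a whole number groups the values into
;; unit-wide intervals, and the counts now show where the values cluster.
(frequencies (map #(long (Math/floor %)) xs))
;; => {1 3, 2 2}
```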

The process of binning is to divide the range of values into a number of consecutive, equally sized, smaller bins. Each value in the original series falls into exactly one bin. By counting the number of points falling into each bin, we can get a sense of the spread of the data.

The preceding illustration shows fifteen values of x split into five equally-sized bins. By counting the number of points falling into each bin we can clearly see that most points fall in the middle bin, with fewer points falling into the bins towards the edges. We can achieve the same in Clojure with the following bin function:

(defn bin [n-bins xs]
  (let [min-x    (apply min xs)
        max-x    (apply max xs)
        range-x  (- max-x min-x)
        bin-fn   (fn [x]
                   (-> x
                       (- min-x)
                       (/ range-x)
                       (* n-bins)
                       (int)
                       (min (dec n-bins))))]
    (map bin-fn xs)))

For example, we can bin range 0-14 into 5 bins like so:

(bin 5 (range 15))

;; (0 0 0 1 1 1 2 2 2 3 3 3 4 4 4)
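To see which interval each bin index stands for, it can help to compute the bin boundaries explicitly. The bin-edges function below is a hypothetical helper, not part of the original text, and is only a sketch: it assumes the same equal-width scheme as bin and converts the width to a double so that the boundaries print as decimals.

```clojure
;; Hypothetical helper (an assumption for illustration): returns the
;; [lower upper] boundaries of n-bins equal-width intervals spanning xs.
(defn bin-edges [n-bins xs]
  (let [min-x (apply min xs)
        max-x (apply max xs)
        ;; Width of a single bin, as a double.
        width (/ (double (- max-x min-x)) n-bins)]
    (for [i (range n-bins)]
      [(+ min-x (* i width))
       (+ min-x (* (inc i) width))])))

(bin-edges 4 [0 2 8])
;; => ([0.0 2.0] [2.0 4.0] [4.0 6.0] [6.0 8.0])
```

For the range 0-14 split into 5 bins, each bin is 14/5 = 2.8 wide, which is why the values 0, 1, and 2 share bin 0. The maximum value sits exactly on the upper boundary of the last interval, and bin places it in the last bin by capping the index at one less than the number of bins.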

Once we've binned the values, we can use the frequencies function once again to count the number of points in each bin. In the following code, we use the bin function to split the UK electorate data into five bins:

(defn ex-1-11 []
  (->> (load-data :uk-scrubbed)
       (i/$ "Electorate")
       (bin 5)
       (frequencies)))

;; {1 26, 2 450, 3 171, 4 1, 0 2}
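The map returned by frequencies is not ordered by bin index, and a bin that receives no points does not appear in it at all. One possible way to get an ordered result is sketched below; bin-counts is a hypothetical helper introduced for illustration, not a function from the original text.

```clojure
;; Hypothetical helper (an assumption for illustration): turns a sequence of
;; bin indices into a vector of counts, one per bin, with 0 for empty bins.
(defn bin-counts [n-bins bin-indices]
  (let [counts (frequencies bin-indices)]
    ;; Look up each bin index in order, defaulting to 0 when it is absent.
    (mapv #(get counts % 0) (range n-bins))))

(bin-counts 5 [0 0 0 1 1 1 2 2 2 3 3 3 4 4 4])
;; => [3 3 3 3 3]

(bin-counts 5 [0 0 4])
;; => [2 0 0 0 1]
```

Read in bin order, the electorate counts shown above are 2, 26, 450, 171, and 1.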

The counts in the extremal bins (0 and 4) are much lower than the counts in the middle bins: they seem to rise towards the median and then fall again. In the next section, we'll visualize the shape of these counts.
