
You're reading from Clojure for Data Science

Product type: Book

Published in: Sep 2015

Reading Level: Intermediate

Publisher

ISBN-13: 9781784397180

Edition: 1st Edition

Languages

Clojure

Concepts

Data Analysis

Author (1)

Henry Garner

Comparative visualizations

Q-Q plots provide a great way to compare a measured, empirical distribution to a theoretical normal distribution. If we'd like to compare two or more empirical distributions with each other, we can't use Incanter's Q-Q plot charts. We have a variety of other options, though, as shown in the next two sections.

Box plots

Box plots, or box and whisker plots, are a way to visualize the median and spread of a distribution. We can generate them using the following code:

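(defn ex-1-22 []
  ;; Start with a box plot of 10,000 loaves from the honest baker
  (-> (c/box-plot (->> (honest-baker 1000 30)
                       (take 10000))
                  :legend true
                  :y-label "Loaf weight (g)"
                  :series-label "Honest baker")
      ;; Add the dishonest baker as a second box on the same chart
      (c/add-box-plot (->> (dishonest-baker 950 30)
                           (take 10000))
                      :series-label "Dishonest baker")
      (i/view)))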

This creates the following plot:

The boxes in the center of the plot represent the interquartile range. The median is the line across the middle of the box, and the mean is the large black dot. For the honest baker, the median passes through the center of the dot, indicating that the mean and median are about the same. For the dishonest baker, the mean is offset from the median, indicating a skew.

The whiskers indicate the range of the data, and outliers are represented by hollow circles. In just one chart, we can see the difference between the two distributions more clearly than we could with either the histograms or the Q-Q plots on their own.
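To make the quantities drawn on a box plot concrete, the following is a minimal sketch in plain Clojure that computes them for a small sequence. The names quantile-nearest-rank and box-plot-summary are hypothetical helpers, not part of Incanter, and the nearest-rank quartile rule used here is an assumption: charting libraries may interpolate between values and so report slightly different quartiles.

```clojure
;; Hypothetical helper (not an Incanter function): nearest-rank
;; quantile of a collection of numbers, for 0 <= q <= 1.
(defn quantile-nearest-rank [xs q]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        ;; rank = ceil(q * n), converted to a zero-based index
        ;; and clamped to the valid range
        idx    (-> (Math/ceil (double (* q n)))
                   long
                   dec
                   (max 0)
                   (min (dec n)))]
    (nth sorted idx)))

;; Hypothetical helper: the statistics a box plot summarizes.
(defn box-plot-summary [xs]
  (let [q1 (quantile-nearest-rank xs 0.25)
        q3 (quantile-nearest-rank xs 0.75)]
    {:min    (apply min xs)
     :q1     q1
     :median (quantile-nearest-rank xs 0.5)
     :q3     q3
     :max    (apply max xs)
     :iqr    (- q3 q1)                       ; height of the box
     :mean   (double (/ (reduce + xs) (count xs)))}))

(box-plot-summary [2 4 4 5 7 9 10 12])
;; => {:min 2, :q1 4, :median 5, :q3 9, :max 12, :iqr 5, :mean 6.625}
```

In this small example the mean (6.625) sits above the median (5), which is the kind of offset that shows up on the chart as a dot displaced from the median line.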

Cumulative distribution functions

Cumulative distribution functions, also known as CDFs, describe the probability that a value drawn from a distribution will be less than or equal to x. Like all probabilities, their values lie between 0 and 1, with 0 representing impossibility and 1 representing certainty. For example, imagine that I'm about to throw a six-sided die. What's the probability that I'll roll less than a six?

For a fair die, the probability I'll roll a five or lower is 5/6. Conversely, the probability I'll roll a one is only 1/6. Three or lower corresponds to even odds: a probability of 50 percent.

The CDF of die rolls follows the same pattern as all CDFs—for numbers at the lower end of the range, the CDF is close to zero, corresponding to a low probability of selecting numbers in this range or below. At the high end of the range, the CDF is close to one, since most values drawn from the sequence will be lower.
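The die example can be worked through in a few lines of plain Clojure. This is only an illustrative sketch: die-cdf is a hypothetical name, and it counts a roll of x or lower, matching the "five or lower" reading used above. Clojure's ratio type keeps the fractions exact.

```clojure
;; CDF of a fair six-sided die: the probability of rolling x or lower.
(defn die-cdf [x]
  (let [faces (range 1 7)]
    (/ (count (filter #(<= % x) faces))
       (count faces))))

(map die-cdf (range 1 7))
;; => (1/6 1/3 1/2 2/3 5/6 1)
```

The values climb from 1/6 at the low end to 1 at the high end, which is the pattern described for CDFs in general.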

Note

The CDF and quantiles are closely related to each other—the CDF is the inverse of the quantile function. If the 0.5 quantile corresponds to a value of 1,000, then the CDF for 1,000 is 0.5.

Just as Incanter's s/quantile function allows us to sample values from a distribution at specific points, the s/cdf-empirical function allows us to input a value from the sequence and return a value between zero and one. It is a higher-order function—one that will accept the value (in this case, a sequence of values) and return a function. The returned function can then be called as often as necessary with different input values, returning the CDF for each of them.

Note

Higher-order functions are functions that accept or return functions.
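As a rough illustration of this pattern, here is a plain-Clojure stand-in for an empirical CDF written as a higher-order function. The name empirical-cdf and the sample data are hypothetical, and the sketch assumes the empirical CDF is the fraction of sample values less than or equal to the input; it is not Incanter's implementation.

```clojure
;; Hypothetical stand-in for an empirical CDF. It accepts a sample
;; and returns a function of x: the fraction of sample values that
;; are less than or equal to x.
(defn empirical-cdf [sample]
  (let [n (count sample)]
    (fn [x]
      (/ (count (filter #(<= % x) sample)) n))))

;; Build the function once from a small, made-up sample of weights...
(def loaf-ecdf (empirical-cdf [990 1000 1010 1020]))

;; ...then call it as often as needed with different inputs.
(loaf-ecdf 1000)
;; => 1/2

(map loaf-ecdf [980 990 1005 1020])
;; => (0 1/4 1/2 1)
```

Here half of the sample is at or below 1,000, so the returned function gives 1/2 for that input, in line with the relationship between the CDF and the 0.5 quantile noted earlier.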

Let's plot the CDF of both the honest and dishonest bakers side by side. We can use Incanter's c/xy-plot to visualize the CDF by plotting the source data (the samples from our honest and dishonest bakers) against the probabilities calculated by the empirical CDF. The c/xy-plot function expects the x values and the y values to be supplied as two separate sequences of values.

To plot both distributions on the same chart, we need to be able to provide multiple series to our xy-plot. Incanter offers functions for many of its charts to add additional series. In the case of an xy-plot, we can use the function c/add-lines, which accepts the chart as the first argument, and the x series and the y series of data as the next two arguments respectively. You can also pass an optional series label. We do this in the following code so we can tell the two series apart on the finished chart:

(defn ex-1-23 []
  (let [sample-honest    (->> (honest-baker 1000 30)
                              (take 1000))
        sample-dishonest (->> (dishonest-baker 950 30)
                              (take 1000))
        ;; Each call returns a function that maps a weight to a probability
        ecdf-honest    (s/cdf-empirical sample-honest)
        ecdf-dishonest (s/cdf-empirical sample-dishonest)]
    ;; x values are the sampled weights, y values their empirical CDF
    (-> (c/xy-plot sample-honest (map ecdf-honest sample-honest)
                   :x-label "Loaf Weight"
                   :y-label "Probability"
                   :legend true
                   :series-label "Honest baker")
        ;; Add the dishonest baker as a second series on the same chart
        (c/add-lines sample-dishonest
                     (map ecdf-dishonest sample-dishonest)
                     :series-label "Dishonest baker")
        (i/view))))

The preceding code generates the following chart:

Although it looks very different, this chart shows essentially the same information as the box and whisker plot. We can see that the two lines cross at a probability of approximately 0.5, the median, corresponding to a weight of 1,000g. The dishonest line is truncated at the lower tail and longer on the upper tail, corresponding to a skewed distribution.


Author (1)

Henry Garner

Henry Garner is a graduate from the University of Oxford and an experienced developer, CTO, and coach. He started his technical career at Britain's largest telecoms provider, BT, working with a traditional data warehouse infrastructure. As a part of a small team for 3 years, he built sophisticated data models to derive insight from raw data and use web applications to present the results. These applications were used internally by senior executives and operatives to track both business and systems performance. He then went on to co-found Likely, a social media analytics start-up. As the CTO, he set the technical direction, leading to the introduction of an event-based append-only data pipeline modeled after the Lambda architecture. He adopted Clojure in 2011 and led a hybrid team of programmers and data scientists, building content recommendation engines based on collaborative filtering and clustering techniques. He developed a syllabus and copresented a series of evening classes from Likely's offices for professional developers who wanted to learn Clojure. Henry now works with growing businesses, consulting in both a development and technical leadership capacity. He presents regularly at seminars and Clojure meetups in and around London.
