Mastering Clojure Data Analysis

Product type: Book
Published in: May 2014
Reading level: Beginner
ISBN-13: 9781783284139
Edition: 1st Edition
Author (1)
Eric Richard Rochester

Eric Richard Rochester studied medieval English literature and linguistics at UGA and wrote his dissertation on lexicography. Now he programs in Haskell and writes. He's also a husband and parent.

Chapter 7. Null Hypothesis Tests – Analyzing Crime Data

Getting started with data analysis can be deceptively easy: we plug numbers into a function or library and retrieve the results. But it's just as easy to forget that we have to pay attention to how the data and experiments are constructed and how the questions are framed. Much of the reliability of statistics comes from following good practices and established processes for framing and executing tests and experiments.

Of course, there's a lot to setting up statistical experiments and following best practices in gathering data and applying statistical tests. We won't be able to do more than cursorily glance at this topic. Hopefully, either it will serve as a reminder of things you already know or it will outline what you need to know and point you in the right direction to learn more.

Over the course of this chapter, we'll move back and forth between looking at the problem we're tackling and seeing what null hypothesis testing is, how...

Introducing confirmatory data analysis


Oftentimes, data analysis seems like a menu of analyses applied to problems, but lacking an overall structure. Of course, this isn't the case, but it seems that way to programmers without a strong background in statistics.

Frameworks such as confirmatory data analysis and null hypothesis testing provide the structure that may be missing. Generally, when you begin working with data, you start by generating some summary statistics that highlight some of the basic characteristics of the data. Afterwards, you probably generate some graphs that further elucidate the essential qualities of the data. This all falls into the realm of exploratory data analysis.

However, as the exploration wraps up, you'll probably start to think of some theories about the data that you'd like to test. You'll generate some hypotheses, and you'll need to test whether they're true or not. And based on those tests, you'll further refine your knowledge of the data, what's in it, and...

Understanding null hypothesis testing


One common way of structuring and processing these tests is to use null hypothesis testing. This is a frequentist approach to statistical inference: it draws inferences from the frequencies or proportions in the data, paying attention to confidence intervals and error rates. Another approach is Bayesian inference, which focuses on degrees of belief, but we won't go into that in this chapter.

Frequentist inference has been very successful. Its use is assumed in many fields, such as the social sciences and biology. Its techniques are widely implemented in many libraries and software packages, and it's relatively easy to start using it. It's the approach we'll use in this chapter.

Understanding the process

To use the null hypothesis process, we should understand what we'll be doing at each step of the way. The following is the basic process that we'll work through in this chapter:

  1. Formulate an initial hypothesis.

  2. State the null (H0) and alternative...

Understanding burglary rates


Understanding crime seems like a universal problem. Earlier societies grappled with the problem of evil in the universe from a theological perspective; today, sociologists and criminologists construct theories and study society using a variety of tools and techniques. However the problem is cast, the aim is to better understand why some people violate social norms in ways that are often violent and harmful to those around them and even to themselves. By better understanding this problem, we'd ultimately like to be able to create social programs and government policies that minimize the damage and create a safer and hopefully more just society for all involved.

Of course, as data scientists and programmers engaging in data analysis, we're inclined to approach this problem as a data problem. That's what we'll do in the rest of this chapter. We'll gather some crime and economic data and look for a tie between the two. In the course of our analysis, we'll explore the...

Exploring the data


Let's explore a little and try to get a feel for the data. First, let's try to get some summary statistics for the various datasets. Afterward, we'll generate some graphs to get a more intuitive sense for what's in the data and how they're related.

Generating summary statistics

Incanter makes generating summary statistics easy. You can pass a dataset to the incanter.stats/summary function, which returns a sequence of maps. Each map holds the summary data for one column in the original dataset, including whether the column is numeric or not. For nominal data, it returns some sample items and their counts. For numeric data, it returns the mean, median, minimum, and maximum.
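To make the shape of that output concrete, here is a minimal sketch in plain Clojure of what a per-column summary might compute. It is not Incanter's implementation: the function names (median, summarize-column, summarize), the map keys, and the sample data are all assumptions chosen for illustration, and every column is assumed to be non-empty.

```clojure
;; Hypothetical helpers for illustration only; this is NOT how
;; incanter.stats/summary is implemented. Columns are assumed non-empty.

(defn median
  "Median of a non-empty collection of numbers."
  [xs]
  (let [sorted (vec (sort xs))
        n      (count sorted)
        mid    (quot n 2)]
    (if (odd? n)
      (sorted mid)
      (/ (+ (sorted (dec mid)) (sorted mid)) 2))))

(defn summarize-column
  "Returns a summary map for the values of one column."
  [col-name values]
  (if (every? number? values)
    ;; Numeric column: mean, median, minimum, and maximum.
    {:col        col-name
     :is-numeric true
     :min        (apply min values)
     :max        (apply max values)
     :mean       (/ (reduce + values) (count values))
     :median     (median values)}
    ;; Nominal column: each distinct item with its count.
    {:col        col-name
     :is-numeric false
     :count      (frequencies values)}))

(defn summarize
  "Takes a map of column name -> vector of values and returns
  a sequence of summary maps, one per column."
  [columns]
  (map (fn [[col-name values]] (summarize-column col-name values))
       columns))

;; Made-up sample data, not taken from the UNODC dataset.
(def sample-columns
  {:country ["A" "B" "C" "A"]
   :rate    [10 20 40 30]})

(summarize sample-columns)
```

The key point is the one-map-per-column structure, with a flag that tells the caller which set of statistics to expect.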

Summarizing UNODC crime data

If we load the data and filter it for the crime of "burglary", we can get the summary statistics for those fields as follows:

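(require '[incanter.core :as i] '[incanter.stats :as s])
;; Summarize only the burglary rows; by-ag-lnd is the dataset loaded earlier.
(s/summary
  (i/$where {:crime {:$eq "CTS 2012 Burglary"}} by-ag-lnd))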

And if we pick apart the data structures that it outputs, the following are the...

Conducting the experiment


Now we're ready to frame and perform the experiment. Let's walk through the steps to do that one more time.

Formulating an initial hypothesis

In this case, our hypothesis is that there is a relationship between the per capita gross national income (GNI) and the rate of burglaries. We could go further and strengthen the hypothesis by specifying that, somewhat counter-intuitively, a higher GNI correlates with a higher burglary rate.

Stating the null and alternative hypotheses

Given that statement of our working hypothesis, we can now formulate the null and alternative hypotheses.

  • H0: There is no relationship between the per capita gross national income and a country's burglary rate.

  • H1: There is a relationship between the per capita gross national income and a country's burglary rate.

These statements will now guide us through the rest of the process.
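One way a relationship like the one in H1 could be measured is with a rank correlation coefficient. The sketch below computes Spearman's rho in plain Clojure. This excerpt doesn't show which test the chapter actually applies, so the choice of Spearman's rho is an assumption, as are the function names and the made-up numbers, which are not UNODC or World Bank data.

```clojure
;; Illustrative sketch only: Spearman's rank correlation in plain Clojure.
;; The choice of test, the names, and the sample data are assumptions.

(defn mean [xs]
  (/ (reduce + xs) (double (count xs))))

(defn average-ranks
  "Ranks values from 1 upward; tied values share the average of their ranks."
  [xs]
  (let [indexed (sort-by second (map-indexed vector xs)) ; [index value] pairs, sorted by value
        groups  (partition-by second indexed)            ; runs of equal values
        ranked  (loop [groups groups, start 1, acc {}]
                  (if (empty? groups)
                    acc
                    (let [g (first groups)
                          n (count g)
                          ;; average of the ranks start .. start+n-1
                          r (/ (+ start (+ start n -1)) 2.0)]
                      (recur (rest groups)
                             (+ start n)
                             (into acc (map (fn [[i _]] [i r]) g))))))]
    ;; Return ranks in the original order of xs.
    (mapv ranked (range (count xs)))))

(defn pearson
  "Pearson correlation coefficient of two equal-length numeric sequences."
  [xs ys]
  (let [mx  (mean xs)
        my  (mean ys)
        dxs (map #(- % mx) xs)
        dys (map #(- % my) ys)
        cov (reduce + (map * dxs dys))
        sx  (Math/sqrt (reduce + (map #(* % %) dxs)))
        sy  (Math/sqrt (reduce + (map #(* % %) dys)))]
    (/ cov (* sx sy))))

(defn spearman-rho
  "Spearman's rho: the Pearson correlation of the two rank vectors."
  [xs ys]
  (pearson (average-ranks xs) (average-ranks ys)))

;; Made-up values for five hypothetical countries.
(def gni-per-capita [1000 5000 12000 30000 45000])
(def burglary-rate  [50 40 300 250 900])

(spearman-rho gni-per-capita burglary-rate) ; approximately 0.8
```

A coefficient alone doesn't decide between H0 and H1. It still has to be turned into a p-value and compared against a significance level chosen in advance.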

Identifying the statistical assumptions in the sample

There are a number of assumptions in this data that we need to be aware of...

Interpreting the results


Of course, the results don't tell us a whole lot. For one, we have to remember that just because there's a relationship, that doesn't imply causality. Moreover, because the result is so significant, we should probably be skeptical about the results and whether they're caused by some artifact in the data or the procedures.

We've already talked about the problems in the data, and some of them may be at fault. Particularly, some of the data is missing because of normalization problems, which may change the results. Another possibility is that industrialized nations keep better records, so they would appear to have more burglaries.

Summary


In this chapter, we learned how null hypothesis testing can help us structure our analyses. Having a well-thought-out, standard procedure also ensures that we are thorough in our analysis. For example, in this chapter, we were forced to confront the ugly truths about the data we were working with, and that gave us insights into the results that we achieved later.

In the next chapter, we'll actually get a chance to use these techniques again, when we look at conducting A/B testing on websites.
