
Associations and Correlations: Unearth the powerful insights buried in your data

By Lee Baker
Book Jun 2019 134 pages 1st Edition
eBook
$20.99 $13.99
Print
$29.99
Subscription
$15.99 Monthly

What do you get with eBook?

  • Instant access to your Digital eBook purchase
  • Download this book in EPUB and PDF formats
  • Access this title in our online reader with advanced features
  • DRM FREE - Read whenever, wherever and however you want

Product Details


Publication date : Jun 28, 2019
Length : 134 pages
Edition : 1st Edition
Language : English
ISBN-13 : 9781838980412

Data Collection and Cleaning

The first step in any data analysis project is to collect and clean your data. If you're fortunate enough to have been given a perfectly clean dataset, then congratulations – you're well on your way. For the rest of us, though, there's quite a bit of grunt work to be done before you can get to the joy of analysis (yeah, I know, I really must get a life…).

In this chapter, you'll learn what a good dataset looks like and how it should be formatted to make it amenable to analysis by association and correlation tests.

Most importantly, you'll learn why it's not necessarily a good idea to collect sales data on ice cream and haemorrhoid cream in the same dataset.

If you're happy with your dataset and quite sure that it doesn't need cleaning, then you can safely skip this chapter. I won't take it personally – honest!

Data Collection

The first question you should be asking before starting any project is "What is my question?" If you don't know your question, then you won't know how to get an answer. In science and statistics, this is called having a hypothesis. Typical hypotheses might be:

  • Is smoking related to lung cancer?
  • Is there an association between sales of ice cream and haemorrhoid cream?
  • Is there a correlation between coffee consumption and insomnia?

It's important to start with a question, because this will help you decide what data you should collect (and what data you shouldn't).

It's unusual to be able to answer these types of questions by collecting data on just those variables. It's much more likely that other factors will influence the answer, and all of these factors must be taken into account. If you want to answer the question "Is smoking related to lung cancer?", then you'll typically also collect data on age, height, weight, family history, genetic factors, and environmental factors, and your dataset will start to become quite large in comparison with your hypothesis.

So, what data should you collect? Well, that depends on your hypothesis, the received wisdom of current thinking, and any previous research carried out. Ultimately, though, if you collect data sensibly, you will likely get sensible results (and vice versa), so it's a good idea to take some time to think it through carefully before you start.

I'm not going to go into the finer points of data collection and cleaning here, but it's important that your dataset conforms to a few simple standards before you can start analyzing it.

By the way, if you want a copy of my book Practical Data Cleaning, you can get a free copy of it by following the instructions in the tiny little advert for it at the end of this section…

Dataset Checklist

OK, so here we go. Here are the essential features of a ready-to-go dataset for association and correlation analysis.

Your dataset is a rectangular matrix of data. If your data is spread across different spreadsheets or tables, then it's not a dataset, it's a database, and it's not ready for analysis:

  • Each column of data is a single variable corresponding to a single piece of information (such as age, height, or weight, in this case).
  • Column 1 is a list of unique consecutive numbers starting from one. This allows you to uniquely identify any given row and recover the original order of your dataset with a single sort command.
  • Row 1 contains the names of the variables. If you use rows 2, 3, 4, and so on as the variable names, you won't be able to enter your dataset into a statistics program.
  • Each row contains the details for a single sample (patient, case, test tube, and so on).
  • Each cell should contain a single piece of information. If you have entered more than one piece of information in a cell (such as a patient's date of birth and age), then you should separate the column into two or more columns (one for date of birth, another for age).
  • Don't enter the number zero into a cell unless what has been measured, counted, or calculated results in the answer zero. Don't use the number zero as a code to signify "No Data".

By now, you should have a well-formed dataset that is stored in a single Excel worksheet. Each column should be a single variable, with row 1 containing the names of the variables, and below this, each row should be a distinct sample or patient. It should look something like Figure 1.1.

Figure 1.1: A typical dataset used in association and correlation analysis

For the rest of this book, this is how I assume your dataset is laid out, so I might use the terms variable and column interchangeably, the same going for the terms row, sample, and patient.
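If you'd rather not check the layout by eye, parts of the checklist can be tested automatically. The following is a minimal sketch in Python (not a tool from this book), assuming your worksheet has been exported to CSV. The function name `check_dataset` and the sample values are made up for illustration. It only covers the rules that a program can verify: named variables in row 1, a rectangular shape, and consecutive IDs in column 1. It can't tell whether a zero is a genuine measurement or a stand-in for "No Data".

```python
import csv
import io


def check_dataset(rows):
    """Return a list of layout problems; rows[0] holds the variable names."""
    problems = []
    header, body = rows[0], rows[1:]

    # Row 1 must name every variable, with no blanks and no repeats.
    if any(name.strip() == "" for name in header):
        problems.append("blank variable name in row 1")
    if len(set(header)) != len(header):
        problems.append("duplicate variable name in row 1")

    # The dataset must be rectangular: every row as wide as row 1.
    for row_number, row in enumerate(body, start=2):
        if len(row) != len(header):
            problems.append(
                f"row {row_number} has {len(row)} cells, expected {len(header)}"
            )

    # Column 1 must hold unique consecutive numbers starting from one.
    ids = [row[0].strip() for row in body if row]
    expected = [str(n) for n in range(1, len(ids) + 1)]
    if set(ids) != set(expected):
        problems.append(
            f"column 1 is not the consecutive numbers 1 to {len(ids)}"
        )

    return problems


# Made-up sample: one row is short, and the IDs skip from 2 to 4.
SAMPLE = """ID,Age,Height,Weight
1,34,172,70
2,51,165
4,29,180,82
"""

rows = list(csv.reader(io.StringIO(SAMPLE)))
problems = check_dataset(rows)
for problem in problems:
    print(problem)
```

An empty list means the layout passes these particular checks; it says nothing about whether the values themselves are correct.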

Data Cleaning

Your next step is cleaning the data. You may well have made some entry errors, and some of your data may not be usable. You need to find such instances and correct them. Otherwise, your data may not be fit for purpose and may mislead you in your pursuit of the answers to your questions.

Even after you've corrected the obvious entry errors, there may be other types of errors in your data that are harder to find.

Check That Your Data Is Sensible

Just because your dataset is clean, it doesn't mean that it is correct – real life follows rules, and your data must follow them, too. There are limits on the heights of participants in your study, so check that all data fits within reasonable limits. Calculate the minimum, maximum, and mean values of variables to see whether all values are sensible.
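As a rough illustration of this kind of range check, here is a short Python sketch. The variable names, the sample values, and the limits in `LIMITS` are all assumptions made up for the example; you would replace them with limits that make sense for your own study.

```python
from statistics import mean

# Hypothetical plausible limits for each variable (assumed, not from the book).
LIMITS = {"Age": (0, 120), "Height_cm": (40, 250)}

# Made-up sample values, one list per column.
data = {
    "Age": [34, 51, 29, 430],
    "Height_cm": [172, 165, 18, 180],
}


def summarise(values):
    """Return the minimum, maximum, and mean of a column."""
    return {"min": min(values), "max": max(values), "mean": mean(values)}


def out_of_range(values, low, high):
    """Return (row_number, value) pairs outside [low, high].

    Row numbers start at 2 because row 1 holds the variable names.
    """
    return [
        (row_number, value)
        for row_number, value in enumerate(values, start=2)
        if not low <= value <= high
    ]


for name, values in data.items():
    low, high = LIMITS[name]
    print(name, summarise(values), out_of_range(values, low, high))
```

The summary alone is often enough to spot trouble (a maximum age of 430 stands out straight away), while the row numbers tell you where to look in the worksheet.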

Sometimes, putting together two or more pieces of data can reveal errors that can otherwise be difficult to detect. Does the difference between date of birth and date of diagnosis give you a negative number? Is your patient over 300 years old?
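A cross-check like this can be sketched in a few lines of Python. The field names, the sample records, and the `MAX_AGE_YEARS` limit below are assumptions for illustration only, and the age calculation is an approximation based on 365.25 days per year.

```python
from datetime import date

# Made-up records; the field names are hypothetical.
patients = [
    {"id": 1, "date_of_birth": date(1960, 5, 14), "date_of_diagnosis": date(2015, 3, 2)},
    {"id": 2, "date_of_birth": date(1985, 11, 30), "date_of_diagnosis": date(1958, 1, 9)},
    {"id": 3, "date_of_birth": date(1690, 7, 1), "date_of_diagnosis": date(2012, 8, 21)},
]

MAX_AGE_YEARS = 120  # assumed upper limit for a plausible age


def age_at_diagnosis(record):
    """Approximate age in years at diagnosis (365.25 days per year)."""
    days = (record["date_of_diagnosis"] - record["date_of_birth"]).days
    return days / 365.25


def suspicious(records):
    """Return (id, reason) pairs for records that break real-life rules."""
    flagged = []
    for record in records:
        age = age_at_diagnosis(record)
        if age < 0:
            flagged.append((record["id"], "diagnosed before birth"))
        elif age > MAX_AGE_YEARS:
            flagged.append((record["id"], "implausibly old at diagnosis"))
    return flagged


for patient_id, reason in suspicious(patients):
    print(patient_id, reason)
```

Neither date looks wrong on its own in records 2 and 3; it is only the combination that gives the error away.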

Figure 1.2 gives you a list of the most useful measures that will help you discover errors in your data and find out whether real-life rules have been followed.

Figure 1.2: Essential descriptive statistics

Check That Your Variables Are Sensible

Once you have a perfectly clean dataset, it is relatively easy to compare variables with each other to find out whether there is a relationship between them (the subject of this book). But just because you can, it doesn't mean that you should. If there is no good reason why there should be a relationship between sales of ice cream and haemorrhoid cream, then you should consider removing one or both of those variables from the dataset. If you've collected your own data from original sources, then you'll have considered beforehand what data is sensible to collect (you have, haven't you?), but if your dataset is a pastiche of two or more datasets, then you might find strange combinations of variables.

You should check your variables before doing any analyses and consider whether it is sensible to make these comparisons.

So, now you have collected your data, cleaned your data, and checked that your data is sensible and fit for purpose. In the next chapter, we'll go through the basics of data classification and introduce the four types of data.

Key benefits

  • Get a comprehensive introduction to associations and correlations
  • Explore multivariate analysis, understand its limitations, and discover the assumptions on which it’s based
  • Gain insights into the various ways of preparing your data for analysis and visualization

Description

Associations and correlations are ways of describing how a pair of variables change together as a result of their connection. By knowing the various available techniques, you can easily and accurately discover and visualize the relationships in your data. This book begins by showing you how to classify your data into the four distinct types that you are likely to have in your dataset. Then, with easy-to-understand examples, you’ll learn when to use the various univariate and multivariate statistical tests. You’ll also discover what to do when your univariate and multivariate results do not match. As the book progresses, it describes why univariate and multivariate techniques should be used as a tag team, and also introduces you to the techniques of visualizing the story of your data. By the end of the book, you’ll know exactly how to select the most appropriate univariate and multivariate tests, and be able to use a single strategic framework to discover the true story of your data.

What you will learn

  • Identify a dataset that's fit for analysis using its basic features
  • Understand the importance of associations and correlations
  • Use multivariate and univariate statistical tests to confirm relationships
  • Classify data as qualitative or quantitative and then into the four subtypes
  • Build a visual representation of all the relationships in the dataset
  • Automate associations and correlations with CorrelViz

Table of Contents

9 Chapters
About the Book
Data Collection and Cleaning
Data Classification
Introduction to Associations and Correlations
Univariate Statistics
Multivariate Statistics
Visualizing Your Relationships
Bonus: Automating Associations and Correlations
Appendix

FAQs

How do I buy and download an eBook?

Where there is an eBook version of a title available, you can buy it from the book details for that title. Add either the standalone eBook or the eBook and print book bundle to your shopping cart. Your eBook will show in your cart as a product on its own. After completing checkout and payment in the normal way, you will receive your receipt on the screen containing a link to a personalised PDF download file. This link will remain active for 30 days. You can download backup copies of the file by logging in to your account at any time.

If you already have Adobe reader installed, then clicking on the link will download and open the PDF file directly. If you don't, then save the PDF file on your machine and download the Reader to view it.

Please Note: Packt eBooks are non-returnable and non-refundable.

Packt eBook and Licensing

When you buy an eBook from Packt Publishing, completing your purchase means you accept the terms of our licence agreement. Please read the full text of the agreement. In it, we have tried to balance the need for the eBook to be usable for you, the reader, with our need to protect our rights as publishers and the rights of our authors. In summary, the agreement says:

  • You may make copies of your eBook for your own use onto any machine
  • You may not pass copies of the eBook on to anyone else

How can I make a purchase on your website?

If you want to purchase a video course, eBook, or Bundle (Print+eBook), please follow the steps below:

  1. Register on our website using your email address and a password.
  2. Search for the title by name or ISBN using the search option.
  3. Select the title you want to purchase.
  4. Choose the format you wish to purchase the title in; if you order the Print Book, you get a free eBook copy of the same title.
  5. Proceed with the checkout process (payment to be made using Credit Card, Debit Card, or PayPal).

Where can I access support around an eBook?

  • If you experience a problem with using or installing Adobe Reader, then contact Adobe directly.
  • To view the errata for the book, see www.packtpub.com/support and view the pages for the title you have.
  • To view your account details or to download a new copy of the book go to www.packtpub.com/account
  • To contact us directly if a problem is not resolved, use www.packtpub.com/contact-us
What eBook formats do Packt support?

Our eBooks are currently available in a variety of formats such as PDF and ePub. In the future, this may well change with trends and developments in technology, but please note that our PDFs are not Adobe eBook Reader format, which has greater restrictions on security.

You will need to use Adobe Reader v9 or later in order to read Packt's PDF eBooks.

What are the benefits of eBooks?
  • You can get the information you need immediately
  • You can easily take them with you on a laptop
  • You can download them an unlimited number of times
  • You can print them out
  • They are copy-paste enabled
  • They are searchable
  • There is no password protection
  • They are lower priced than print
  • They save resources and space
What is an eBook?

Packt eBooks are a complete electronic version of the print edition, available in PDF and ePub formats. Every piece of content down to the page numbering is the same. Because we save the costs of printing and shipping the book to you, we are able to offer eBooks at a lower cost than print editions.

When you have purchased an eBook, simply log in to your account and click on the link in Your Download Area. We recommend saving the file to your hard drive before opening it.

For optimal viewing of our eBooks, we recommend you download and install the free Adobe Reader version 9.