You're reading from Data Literacy With Python A Comprehensive Guide to Understanding and Analyzing Data with Python

Product type Paperback

Published in Jul 2024

Publisher Mercury_Learning

ISBN-13 9781836640097

Length 271 pages

Edition 1st Edition

Languages

Python

Tools

Matplotlib

Concepts

Data Analysis

Authors (2):

Mercury Learning and Information

Oswald Campesato

Table of Contents (9) Chapters

Preface

1. Chapter 1: Working With Data

2. Chapter 2: Outlier and Anomaly Detection

3. Chapter 3: Cleaning Datasets

4. Chapter 4: Introduction to Statistics

5. Chapter 5: Matplotlib and Seaborn

Index

Appendix A: Introduction to Python

Appendix B: Introduction to Pandas

CORRELATION

Correlation measures the extent to which a pair of variables are linearly related, expressed as a number between -1 and 1 inclusive. The landmark correlation values are -1, 0, and 1.

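A correlation of 1 means that the two variables move together in a perfect linear relationship: when one increases, so does the other. A correlation of -1 means that they move in opposite directions in a perfect linear relationship: when one increases, the other decreases. A correlation of 0 means that there is no linear relationship between the variables. It does not by itself mean that the variables are independent, because a nonlinear dependence can still produce a correlation of 0.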
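The three landmark values can be reproduced with a few made-up numbers. The following is a minimal sketch that assumes NumPy is available; the array x and the three derived series are illustrative values only, not data from the book.

```python
import numpy as np

# Illustrative values only: five points centered on zero.
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])

# np.corrcoef returns a 2x2 matrix; entry [0, 1] is the Pearson
# correlation between the two inputs.
r_pos = np.corrcoef(x, 2 * x + 1)[0, 1]    # increasing straight line
r_neg = np.corrcoef(x, -3 * x + 4)[0, 1]   # decreasing straight line
r_zero = np.corrcoef(x, x ** 2)[0, 1]      # symmetric parabola

print(r_pos, r_neg, r_zero)
```

The third series is fully determined by x, yet its correlation with x is zero (up to floating-point rounding), because the relationship is not linear.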

Pandas provides the corr() method, which generates a matrix containing the correlation between any pair of features in a data frame. Note that each diagonal value of this matrix is 1, because every feature is perfectly correlated with itself.
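As a minimal sketch of the corr() method, the following example builds a small data frame and prints its correlation matrix. The column names and values are invented for illustration. The method computes the Pearson correlation by default.

```python
import pandas as pd

# Hypothetical data: column names and values are made up.
df = pd.DataFrame({
    "height_cm": [150, 160, 170, 180, 190],
    "weight_kg": [52, 58, 69, 77, 91],
    "shoe_size": [36, 38, 41, 43, 46],
})

# Pairwise correlation between all columns (Pearson by default).
corr_matrix = df.corr()
print(corr_matrix.round(3))
```

The result is a symmetric 3x3 data frame whose rows and columns are labeled with the feature names, and whose diagonal entries are all 1.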

A correlation matrix can be derived from a covariance matrix: each entry in the former matrix is a covariance value divided by the product of the standard deviations of the two features in the row and column of that entry.
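The derivation from a covariance matrix can be checked numerically. This sketch assumes NumPy and Pandas and uses invented data. It divides each covariance by the product of the two corresponding standard deviations and compares the result with the output of corr().

```python
import numpy as np
import pandas as pd

# Hypothetical data: column names and values are made up.
df = pd.DataFrame({
    "a": [1.0, 2.0, 4.0, 7.0, 11.0],
    "b": [3.0, 1.0, 6.0, 5.0, 9.0],
    "c": [10.0, 8.0, 7.0, 3.0, 1.0],
})

cov = df.cov()                       # sample covariance matrix
std = np.sqrt(np.diag(cov))          # standard deviations from the diagonal
corr_from_cov = cov / np.outer(std, std)  # divide entry (i, j) by std_i * std_j

print(np.allclose(corr_from_cov, df.corr()))
```

The diagonal of the covariance matrix holds the variances, so each diagonal entry of the derived matrix is a variance divided by itself.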

This concludes the portion of the chapter pertaining to dependencies among features in a dataset. The next section discusses different types of currencies that can appear in a dataset, along with a Python code sample for currency conversion.

What Is a Good Correlation Value?

Although there is no exact value that determines whether a correlation is weak, moderate, or strong, there are some guidelines, as shown here:

• between 0.0 and 0.2: weak

• between 0.2 and 0.5: moderate

• between 0.5 and 0.7: moderately strong

• between 0.7 and 1.0: strong

The preceding ranges are for positive correlations, and the corresponding values for negative correlations are shown here:

• between -0.2 and 0: weak

• between -0.5 and -0.2: moderate

• between -0.7 and -0.5: moderately strong

• between -1.0 and -0.7: strong

However, treat the values in the preceding lists as guidelines: some people classify values between 0.0 and 0.4 as weak correlations, and values between 0.8 and 1.0 as strong correlations. In addition, a correlation of 0.0 means that there is no linear correlation at all.
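The guideline ranges can be written as a small helper. This is only a sketch: the function name correlation_strength is hypothetical, and because the listed ranges share their endpoints, the code assumes that each boundary value (0.2, 0.5, 0.7) belongs to the stronger category.

```python
def correlation_strength(r):
    """Label a correlation value using the guideline ranges.

    Assumption: a boundary value (0.2, 0.5, 0.7) is assigned to the
    stronger of its two neighboring categories.
    """
    if not -1.0 <= r <= 1.0:
        raise ValueError("a correlation must lie between -1 and 1")
    size = abs(r)  # the sign gives direction, not strength
    if size < 0.2:
        return "weak"
    if size < 0.5:
        return "moderate"
    if size < 0.7:
        return "moderately strong"
    return "strong"


for value in (0.1, -0.3, 0.6, -0.9):
    print(value, correlation_strength(value))
```

Using the absolute value handles the positive and negative lists with one set of comparisons.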

Discrimination Threshold

Logistic regression (discussed in Chapter 6) is based on the sigmoid function (which in turn involves Euler’s number e), whereby any real number is mapped to a number in the interval (0,1). Consequently, logistic regression is well-suited for binary classification: i.e., for data points that belong to one of two classes. For datasets that contain two class values, let’s call them 0 and 1, logistic regression provides a probability that a data point belongs to class 1, where the probability values lie in the interval (0,1).
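The sigmoid function itself is short enough to sketch directly. The version below uses only the standard library. The branch on the sign of z is a common way to avoid overflow in math.exp for large negative inputs and is an implementation choice made here, not something taken from the text.

```python
import math


def sigmoid(z):
    """Map a real number z to 1 / (1 + e**(-z))."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    # For negative z, use the equivalent form e**z / (1 + e**z)
    # so that math.exp never receives a large positive argument.
    ez = math.exp(z)
    return ez / (1.0 + ez)


for z in (-5.0, 0.0, 5.0):
    print(z, sigmoid(z))
```

An input of 0 maps to exactly 0.5, and the outputs for z and -z always sum to 1.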

The discrimination threshold is the probability value that separates the two classes: probabilities above the threshold are associated with class 1, and probabilities below it are associated with class 0. Some datasets have a discrimination threshold of 0.5, but in general, this value can be much closer to 0 or 1. Relevant examples include health-related datasets (healthy versus cancer), sports events (win versus lose), and even the DMV (Department of Motor Vehicles), which requires 85% accuracy in order to pass the test in some US states.
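Applying a discrimination threshold amounts to one comparison per probability. In this sketch the function name classify and the probability values are hypothetical, and a probability exactly equal to the threshold is assumed to count as class 1.

```python
def classify(probabilities, threshold=0.5):
    """Return 1 for each probability at or above the threshold, else 0.

    Assumption: a probability equal to the threshold counts as class 1.
    """
    return [1 if p >= threshold else 0 for p in probabilities]


# Hypothetical class-1 probabilities from some fitted model.
probs = [0.10, 0.35, 0.50, 0.72, 0.91]

default_labels = classify(probs)                  # threshold of 0.5
strict_labels = classify(probs, threshold=0.85)   # threshold close to 1

print(default_labels)
print(strict_labels)
```

Raising the threshold from 0.5 to 0.85 moves the points with probabilities 0.50 and 0.72 from class 1 to class 0, which shows how the same model output can yield different labels.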
