
You're reading from Causal Inference and Discovery in Python

Product type: Book
Published in: May 2023
Publisher: Packt
ISBN-13: 9781804612989
Edition: 1st
Author (1)

Aleksander Molak
Aleksander Molak is a Machine Learning Researcher and Consultant who gained experience working with Fortune 100, Fortune 500, and Inc. 5000 companies across Europe, the USA, and Israel, designing and building large-scale machine learning systems. On a mission to democratize causality for businesses and machine learning practitioners, Aleksander is a prolific writer, creator, and international speaker. As a co-founder of Lespire, an innovative provider of AI and machine learning training for corporate teams, Aleksander is committed to empowering businesses to harness the full potential of cutting-edge technologies that allow them to stay ahead of the curve.

Starting simple – observational data and linear regression

In previous chapters, we discussed the concept of association. In this section, we’ll quantify associations between variables using a regression model. We’ll see the geometrical interpretation of this model and demonstrate that regression can be performed in an arbitrary direction. For the sake of simplicity, we’ll focus our attention on linear cases. Let’s start!

Linear regression

Linear regression is a basic data-fitting algorithm that can be used to predict the expected value of a dependent (target) variable, $Y$, given the values of some predictor(s), $X$. Formally, this is written as $\hat{Y}_{X=x} = E[Y \mid X = x]$.

In the preceding formula, $\hat{Y}_{X=x}$ is the predicted value of $Y$ given that $X$ takes the value(s) $x$, and $E[\cdot]$ is the expected value operator. Note that $X$ can be multidimensional. In such cases, $X$ is usually represented as a matrix, $\mathbf{X}$, with shape $N \times D$, where $N$ is the number of observations and $D$ is the dimensionality of $X$ (the number of predictors).
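As a minimal sketch of these ideas (the data and coefficients here are simulated for illustration, not taken from the book), we can fit $E[Y \mid X = x]$ by ordinary least squares and confirm that regression can just as well be run in the opposite direction, from $Y$ to $X$:

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate observational data: y = 2*x + 1 + noise
x = rng.normal(size=1_000)
y = 2.0 * x + 1.0 + rng.normal(scale=0.5, size=1_000)

# Fit E[Y | X = x] by ordinary least squares.
# The column of ones adds the intercept term.
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta)  # close to [1.0, 2.0] (intercept, slope)

# Regression is directionless as a predictive tool: we can
# equally well regress x on y and obtain a valid fit.
Y = np.column_stack([np.ones_like(y), y])
gamma, *_ = np.linalg.lstsq(Y, x, rcond=None)
print(gamma)
```

Both fits minimize squared prediction error; neither direction, by itself, tells us anything about which variable causes which.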

Should we always control for all available covariates?

Multiple regression provides scientists and analysts with a tool to perform statistical control – a procedure to remove unwanted influence from certain variables in the model. In this section, we’ll discuss different perspectives on statistical control and build an intuition as to why statistical control can easily lead us astray.

Let’s start with an example. When studying predictors of dyslexia, you might be interested in understanding whether parents smoking influences the risk of dyslexia in their children. In your model, you might want to control for parental education. Parental education might affect how much attention parents devote to their children’s reading and writing, and this in turn can impact children’s skills and other characteristics. At the same time, education level might decrease the probability of smoking, potentially leading to confounding. But how do we actually know whether...
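The mechanics of statistical control can be seen in a small simulation. The structure below is a toy version of the example above (all coefficients and variable names are illustrative assumptions, not estimates from any real study): a confounder influences both the exposure and the outcome, so the naive regression coefficient is biased, while adding the confounder as a covariate recovers the true effect.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000

# Hypothetical confounder: parental education (standardized).
education = rng.normal(size=n)

# Education lowers smoking; smoking's true effect on the
# risk score is 0.5, and education also lowers the risk directly.
smoking = -0.8 * education + rng.normal(size=n)
risk = 0.5 * smoking - 0.7 * education + rng.normal(size=n)

def ols(predictors, y):
    """OLS with an intercept; returns [intercept, slopes...]."""
    X = np.column_stack([np.ones(len(y))] + list(predictors))
    return np.linalg.lstsq(X, y, rcond=None)[0]

# Naive regression: the smoking coefficient absorbs the
# confounder's influence and is biased upward (~0.84 here).
naive = ols([smoking], risk)

# Controlling for education recovers the true effect (~0.5).
adjusted = ols([smoking, education], risk)
print(naive[1], adjusted[1])
```

Note that adjustment helps here only because education is a genuine confounder in the data-generating process; as the section goes on to argue, controlling for the wrong variable can introduce bias rather than remove it.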

Regression and structural models

Before we conclude this chapter, let’s take a look at the connection between regression and SCMs. You might already have an intuitive understanding that they are somehow related. In this section, we’ll discuss the nature of this relationship.

SCMs

In the previous chapter, we learned that SCMs are a useful tool for encoding causal models. They consist of a set of variables (exogenous and endogenous) and a set of functions defining the relationships between these variables. We saw that SCMs can be represented as graphs, with nodes representing variables and directed edges representing functions. Finally, we learned that SCMs can produce interventional and counterfactual distributions.
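These properties are easy to express in code. The sketch below defines a toy SCM (the graph $Z \rightarrow X \rightarrow Y$, $Z \rightarrow Y$, with coefficients chosen for illustration, not from the book) as a set of structural equations, and shows how an intervention $do(X = x)$ replaces one equation with a constant, producing an interventional distribution:

```python
import numpy as np

rng = np.random.default_rng(7)
n = 5_000

def sample(do_x=None):
    """Sample from a toy SCM: Z -> X -> Y and Z -> Y.

    Each line is a structural equation. Passing do_x simulates
    the intervention do(X = do_x): the equation for X is
    replaced by a constant, while the other equations stay intact.
    """
    z = rng.normal(size=n)
    x = 0.9 * z + rng.normal(size=n) if do_x is None else np.full(n, do_x)
    y = 2.0 * x + 1.5 * z + rng.normal(size=n)
    return x, y

# Observational vs interventional distribution of Y.
_, y_obs = sample()
_, y_do = sample(do_x=1.0)
print(y_obs.mean(), y_do.mean())  # do(X=1) shifts E[Y] to ~2.0
```

Under the intervention, $E[Y \mid do(X=1)] = 2.0 \cdot 1 + 1.5 \cdot E[Z] = 2.0$, which differs from simply conditioning on $X = 1$ in the observational data because conditioning also carries information about $Z$.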

SCM and structural equations

In causal literature, the names structural equation model (SEM) and structural causal model (SCM) are sometimes used interchangeably (e.g., Peters et al., 2017). Others refer to SEMs as a family of specific multivariate...

Wrapping it up

That was a lot of material! Congrats on reaching the end of Chapter 3!

In this chapter, we learned about the links between regression, observational data, and causal models. We started with a review of linear regression. After that, we discussed the concept of statistical control and demonstrated how it can lead us astray. We analyzed selected recommendations regarding statistical control and reviewed them from a causal perspective. Finally, we examined the links between linear regression and SCMs.

A solid understanding of the links between observational data, regression, and statistical control will help us move freely in the world of much more complex models, which we’ll start introducing in Part 2, Causal Inference.

We’re now ready to take a more detailed look at the graphical aspect of causal models. See you in the next chapter!

References

Becker, T. E., Atinc, G., Breaugh, J. A., Carlson, K. D., Edwards, J. R., & Spector, P. E. (2016). Statistical control in correlational studies: 10 essential recommendations for organizational researchers. Journal of Organizational Behavior, 37(2), 157–167.

Bollen, K. A., & Noble, M. D. (2011). Structural equation models and the quantification of behavior. Proceedings of the National Academy of Sciences, 108(Suppl 3), 15639–15646.

Cinelli, C., Forney, A., & Pearl, J. (2022). A Crash Course in Good and Bad Controls. Sociological Methods & Research, 0(0), 1–34.

Kline, R. B. (2015). Principles and Practice of Structural Equation Modeling. Guilford Press.

Murphy, K. P. (2022). Probabilistic Machine Learning: An Introduction. MIT Press.

Pearl, J. (2012). The causal foundations of structural equation modeling. In Hoyle, R. H. (Ed.), Handbook of structural equation modeling (pp. 68–91). Guilford Press.


