Step 1 – modeling the problem

In this section, we’ll discuss and practice step 1 of the four-step causal inference process: modeling the problem.

We’ll split this step into two substeps:

  1. Creating a graph representing our problem
  2. Instantiating DoWhy’s CausalModel object using this graph

Creating the graph

In Chapter 3, we introduced a graph language called GML. We’ll use GML to define our data-generating process in this section.

Figure 7.1 presents the GPS example from the previous chapter, which we’ll model next. Note that we have omitted variable-specific noise for clarity:

Figure 7.1 – The graphical model from Chapter 6

Note that the graph in Figure 7.1 contains an unobserved variable, U. We did not include this variable in our dataset (it’s unobserved!), but we’ll include it in our graph. This will allow DoWhy to recognize that there’s an unobserved confounder...
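To make this concrete, here is a minimal sketch of both substeps. The variable names are hypothetical stand-ins for the ones in Figure 7.1: X is the treatment, Z is the mediator, Y is the outcome, and U is the unobserved confounder. We assume the observed variables live in a data frame called df:

from dowhy import CausalModel

# Define the graph in GML, including the unobserved variable U
gml_graph = """
graph [
    directed 1
    node [id "X" label "X"]
    node [id "Z" label "Z"]
    node [id "Y" label "Y"]
    node [id "U" label "U"]
    edge [source "X" target "Z"]
    edge [source "Z" target "Y"]
    edge [source "U" target "X"]
    edge [source "U" target "Y"]
]
"""

# Instantiate the causal model; U appears in the graph
# but does not need to be a column in df
model = CausalModel(
    data=df,
    treatment='X',
    outcome='Y',
    graph=gml_graph
)

Because U appears in the graph but not in the data, DoWhy can recognize it as an unobserved confounder during identification.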

Step 2 – identifying the estimand(s)

This short section is all about finding estimands with DoWhy. We’ll start with a brief overview of estimands supported by the library and then jump straight into practice!

DoWhy offers three ways to find estimands:

  • Back-door
  • Front-door
  • Instrumental variable

We know all of them from the previous chapter. To see a quick practical introduction to all three methods, check out my blog post Causal Python — 3 Simple Techniques to Jump-Start Your Causal Inference Journey Today (Molak, 2022; https://bit.ly/DoWhySimpleBlog).

Let’s see how to use DoWhy to find a correct estimand for our model.

It turns out to be very easy! Just see for yourself:

estimand = model.identify_effect()

Yes, that’s all!

We just call the .identify_effect() method of our CausalModel object and we’re done!

Let’s print out our estimand to see what we can learn:

print(estimand)
...

Step 3 – obtaining estimates

In this section, we’ll compute causal effect estimates for our model.

Computing estimates using DoWhy is as simple as it can be. To do it, we need to call the .estimate_effect() method of our CausalModel object:

estimate = model.estimate_effect(
    identified_estimand=estimand,
    method_name='frontdoor.two_stage_regression')

We pass two arguments to the method:

  • Our identified estimand
  • The name of the method that will be used to compute the estimate

You might recall from Chapter 6 that we needed to fit two linear regression models, take their coefficients, and multiply them to obtain the final causal effect estimate. DoWhy makes this process much easier for us.
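For intuition, here is a minimal sketch of that manual computation, assuming a linear front-door setting with hypothetical columns X (treatment), Z (mediator), and Y (outcome) in a data frame df:

from sklearn.linear_model import LinearRegression

# Stage 1: effect of the treatment X on the mediator Z
stage_one = LinearRegression().fit(df[['X']], df['Z'])

# Stage 2: effect of the mediator Z on the outcome Y,
# controlling for the treatment X
stage_two = LinearRegression().fit(df[['Z', 'X']], df['Y'])

# The front-door estimate is the product of the two coefficients
manual_estimate = stage_one.coef_[0] * stage_two.coef_[0]

With DoWhy, the single .estimate_effect() call above replaces all of this.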

Let’s print out the result:

print(f'Estimate of causal effect (linear regression): {estimate.value}')

This gives us the following output:

Estimate...

Step 4 – where’s my validation set? Refutation tests

In this section, we’ll discuss how to validate causal models. We’ll introduce the idea behind refutation tests and, finally, implement a couple of them in practice.

How to validate causal models

One of the most popular ways to validate machine learning models is through cross-validation (CV). The basic idea behind CV is relatively simple:

  1. We split the data into k folds (subsets).
  2. We train the model on k-1 folds and validate it on the remaining fold.
  3. We repeat this process k times.
  4. At every step, we train on a different set of k-1 folds and evaluate on the remaining fold (which is also different at each step).
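
As an illustration, here is a minimal sketch of this procedure using scikit-learn’s KFold, with synthetic placeholder data and a plain linear regression model:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Placeholder data and model, for illustration only
X_data = np.random.random((100, 3))
y_data = np.random.random(100)
reg = LinearRegression()

scores = []
for train_idx, val_idx in KFold(n_splits=5).split(X_data):
    # Train on k-1 folds, evaluate on the remaining fold
    reg.fit(X_data[train_idx], y_data[train_idx])
    scores.append(reg.score(X_data[val_idx], y_data[val_idx]))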

Figure 7.3 presents a schematic visualization of a five-fold CV scheme:

Figure 7.3 – Schematic of five-fold CV

In Figure 7.3, the blue folds denote validation sets, while the white ones denote training sets...
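To preview what a refutation test looks like in code, here is a minimal sketch using DoWhy’s .refute_estimate() method with the random common cause refuter, assuming the model, estimand, and estimate objects from the previous steps:

# Add a randomly generated common cause to the data;
# a robust estimate should remain essentially unchanged
refutation = model.refute_estimate(
    estimand,
    estimate,
    method_name='random_common_cause'
)

print(refutation)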

Full example

This section is here to help us solidify our newly acquired knowledge. We’ll run the full causal inference process once again, step by step. We’ll introduce some exciting new elements along the way and, finally, we’ll translate the whole process to the new GCM API. By the end of this section, you will have the confidence and skills to apply the four-step causal inference process to your own problems.

Figure 7.4 presents a graphical model that we’ll use in this section:

Figure 7.4 – A graphical model that we’ll use in this section

We’ll generate 1,000 observations from an SCM following the structure from Figure 7.4 and store them in a data frame:

import numpy as np
import pandas as pd

SAMPLE_SIZE = 1000

S = np.random.random(SAMPLE_SIZE)
Q = 0.2*S + 0.67*np.random.random(SAMPLE_SIZE)
X = 0.14*Q + 0.4*np.random.random(SAMPLE_SIZE)
Y = (0.7*X + 0.11*Q + 0.32*S
     + 0.24*np.random.random(SAMPLE_SIZE))

# Store the generated variables in a data frame
df = pd.DataFrame({'S': S, 'Q': Q, 'X': X, 'Y': Y})
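
Before we walk through the four steps, here is a sketch of how the same data could later be handled with the GCM API, which we’ll cover at the end of this section. The graph edges below follow the data-generating process above; drawing interventional samples is just one of the queries the API supports:

import networkx as nx
from dowhy import gcm

# Graph structure matching the data-generating process above
causal_graph = nx.DiGraph(
    [('S', 'Q'), ('S', 'Y'), ('Q', 'X'), ('Q', 'Y'), ('X', 'Y')]
)
causal_model = gcm.StructuralCausalModel(causal_graph)

# Automatically assign a causal mechanism to each node and fit
gcm.auto.assign_causal_mechanisms(causal_model, df)
gcm.fit(causal_model, df)

# Example query: samples from the interventional distribution do(X=1)
samples = gcm.interventional_samples(
    causal_model,
    {'X': lambda x: 1.0},
    num_samples_to_draw=1000
)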

Wrapping it up

In this chapter, we discussed the Python causal ecosystem. We introduced the DoWhy and EconML libraries and practiced the four-step causal inference process using DoWhy’s CausalModel API. We learned how to automatically obtain estimands and how to use different types of estimators to compute causal effect estimates. We discussed what refutation tests are and how to use them in practice. Finally, we introduced DoWhy’s experimental GCM API and showed its great capabilities when it comes to answering various causal queries. After working through this chapter, you have the basic skills to apply causal inference to your own problems. Congratulations!

In the next chapter, we’ll summarize common assumptions for causal inference and discuss some limitations of the causal inference framework.

References

Bates, S., Hastie, T., & Tibshirani, R. (2021). Cross-validation: what does it estimate and how well does it do it? arXiv preprint. https://doi.org/10.48550/ARXIV.2104.00673

Battocchi, K., Dillon, E., Hei, M., Lewis, G., Oka, P., Oprescu, M., & Syrgkanis, V. (2019). EconML: A Python Package for ML-Based Heterogeneous Treatment Effects Estimation. https://github.com/microsoft/EconML

Blöbaum, P., Götz, P., Budhathoki, K., Mastakouri, A., & Janzing, D. (2022). DoWhy-GCM: An extension of DoWhy for causal inference in graphical causal models. arXiv preprint.

Chernozhukov, V., Chetverikov, D., Demirer, M., Duflo, E., Hansen, C., Newey, W., & Robins, J. (2016). Double/Debiased Machine Learning for Treatment and Causal Parameters. arXiv preprint. https://doi.org/10.48550/ARXIV.1608.00060

Molak, A. (2022, September 27). Causal Python: 3 Simple Techniques to Jump-Start Your Causal Inference Journey Today. Towards Data Science. https://towardsdatascience.com...


