10 Things to Know About Heterogeneous Treatment Effects

Summary

This guide discusses the theoretical and policy relevance of heterogeneous treatment effects, that is, treatment effects that vary by individual or group. It also discusses and demonstrates methods for estimating how effects vary and for interpreting the results: from testing for heterogeneity, to estimating subgroup treatment effects and their differences, to addressing the pitfalls of multiple comparisons and ad hoc searches for heterogeneity.1

1 What is treatment effect heterogeneity?

A treatment may affect individuals or groups in different ways: this is treatment effect heterogeneity. The study of treatment effect heterogeneity involves estimating how treatment effects vary across individuals or groups within an experiment. For whom are there big effects? For whom are there small effects? For whom does treatment generate beneficial or adverse effects?

2 Why is treatment effect heterogeneity important?

By investigating treatment effect heterogeneity, we may be able to learn about the conditions under which treatments are especially effective or ineffective. The results can contribute to program design decisions and enable deploying resources more effectively, for example to individuals for whom the treatment is likely to be effective.

3 Treatment effect heterogeneity: The general case

First, we might want to know whether the treatment effect is the same for all individuals and groups within a study. We can state this in terms of the variance of individual-level treatment effects: if the treatment effect is the same for all individuals, then the variance of these effects would be zero. To probe this, we can test the null hypothesis that the variance of the individual-level treatment effects is zero. But we never get to see the treatment effect for any particular individual, only the outcome for each person either in the treatment or in the control condition.2
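One way to probe this null hypothesis is sketched below, assuming a two-arm experiment with complete random assignment. If the treatment effect were constant, the treated and control outcomes would have the same variance, so the sketch uses the absolute difference in variances as the test statistic and builds its reference distribution by randomization inference. The data are simulated and the function names (`variance_gap`, `ri_variance_test`) are hypothetical, chosen only for illustration. Note that equal variances are implied by a constant effect but do not prove it, so a large p-value is not evidence of homogeneity.

```python
import random
from statistics import mean, variance

def variance_gap(y, z):
    """Absolute difference between treated and control outcome variances."""
    treated = [yi for yi, zi in zip(y, z) if zi == 1]
    control = [yi for yi, zi in zip(y, z) if zi == 0]
    return abs(variance(treated) - variance(control))

def ri_variance_test(y, z, n_sims=1000, seed=1):
    """Randomization test of the null of a constant treatment effect."""
    rng = random.Random(seed)
    ate_hat = (mean(yi for yi, zi in zip(y, z) if zi == 1)
               - mean(yi for yi, zi in zip(y, z) if zi == 0))
    # Untreated potential outcomes implied by the null: Y_i(1) = Y_i(0) + ate_hat
    y0 = [yi - ate_hat * zi for yi, zi in zip(y, z)]
    observed = variance_gap(y, z)
    z_sim = list(z)
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(z_sim)  # re-draw a complete random assignment
        y_sim = [v + ate_hat * zi for v, zi in zip(y0, z_sim)]
        if variance_gap(y_sim, z_sim) >= observed:
            extreme += 1
    return observed, extreme / n_sims

# Simulated (hypothetical) data: the effect is 0 for half the units and 4 for the rest.
rng = random.Random(7)
n = 400
z = [1] * (n // 2) + [0] * (n // 2)
rng.shuffle(z)
effect = [0.0 if i % 2 == 0 else 4.0 for i in range(n)]
y = [rng.gauss(0, 1) + e * zi for e, zi in zip(effect, z)]

observed, p_value = ri_variance_test(y, z)
print("difference in variances:", round(observed, 2), "p-value:", p_value)
```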

What might treatment effect heterogeneity look like in practice? In one example, Eric Kramon, Sarah Brierley, and George Ofosu find in The Moderating Effect of Debates in Ghana that the impact of viewing parliamentary debates on candidate evaluation and vote choice varies by partisanship. In another example, Blair et al. (2021) failed to find evidence of heterogeneous treatment effects of community policing interventions across six geographic sites in the Global South.

4 Treatment effect heterogeneity: Conditional average treatment effects (CATEs)

Second, we might want to know whether specific groups of individuals in the study are affected differently by our treatment. A structured, theory-driven inquiry into treatment effect heterogeneity involves pre-specifying and investigating conditional average treatment effects (CATEs). A CATE is an average treatment effect (ATE) specific to a subgroup of individuals, where the subgroup is defined by attributes of the individuals (e.g., the ATE among women). These attributes may also be attributes of the context in which the experiment occurs (e.g., the ATE among individuals at a specific site in a multi-site field experiment).3

An important aside on estimation: It is generally a good idea to avoid conditioning on variables whose values could have been affected by the treatment itself (called post-treatment variables). This is to ensure the unbiased estimation of conditional average treatment effects. Some researchers may be interested in post-treatment effect modification, or (in the regression context) the interaction between a treatment and a post-treatment covariate. For example, how do the effects of a job search assistance program vary with participants’ levels of depression during the follow-up period? However, conditioning on a post-treatment covariate may lead to bias. See Angrist and Pischke (2009) for more on so-called “bad” controls. There is a burgeoning body of methodological research on the conditions under which CATEs involving post-treatment covariates are identified. These methods rely on model-based identification.4

5 Treatment effect heterogeneity: Differences between CATEs

Third, we might be interested in the difference between two CATEs. For example, does our treatment work differently on average for men versus women? Stated differently, we might want to know whether the difference in ATEs between women and men in the experiment (sometimes called the interaction effect between the treatment and gender) is equal to or different from zero.

An aside on causal interpretation: The variable(s) that define the subgroups that you compare may or may not have been experimentally manipulated. If they were not, then we have a treatment-by-covariate interaction that can be interpreted as a descriptive measure of association between the covariate and the treatment effect, but should not be interpreted as the causal effect of a change in the covariate value on the ATE.5 Treatment-by-treatment interactions are differences in CATEs where the personal or contextual attribute defining subgroups is experimentally manipulated. Because that other treatment is randomly assigned, treatment-by-treatment interactions may be interpreted causally. Factorial and partial factorial designs allow researchers to randomly assign individuals to different combinations of “cross-cutting” treatment conditions and to estimate treatment-by-treatment interactions.

6 Estimation

Estimating CATEs and differences between CATEs (interaction effects) is straightforward. For the CATE, estimate the ATE among individuals in the specific subgroup of interest. For differences in CATEs, take the difference between relevant estimated CATEs. CATEs and differences in CATEs (interaction effects) may also be estimated in a regression framework.

Imagine a hypothetical experiment evaluating the effect of a job training program on future earnings. Let \(Y\) be the outcome (future earnings), \(Z\) be the treatment variable (1=job training program, 0=control), and \(X\) be a pre-treatment covariate (1=scholarship receipt, 0=no scholarship).

We can write down a regression model that can aid us in estimating the CATEs (the ATE of the job training program among those who receive a scholarship and among those who do not) and the interaction effect (the difference between these two CATEs). \[\begin{aligned} Y_i &= \alpha + \beta Z_i + \gamma X_i + \delta Z_iX_i + \varepsilon_i \end{aligned}\]

The ATE of the job training program among individuals who do not receive a scholarship is \(\beta\). The ATE of the job training program among individuals who do receive a scholarship is \(\beta + \delta\).
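These expressions can be checked by evaluating the model's conditional expectation in each of the four cells. The following is a short sketch that assumes \(Z\) and \(X\) are both binary, so that the model is saturated and \(E[\varepsilon_i \mid Z_i, X_i] = 0\):

\[\begin{aligned}
E[Y_i \mid Z_i = 0, X_i = 0] &= \alpha \\
E[Y_i \mid Z_i = 1, X_i = 0] &= \alpha + \beta \\
E[Y_i \mid Z_i = 0, X_i = 1] &= \alpha + \gamma \\
E[Y_i \mid Z_i = 1, X_i = 1] &= \alpha + \beta + \gamma + \delta
\end{aligned}\]

Because \(Z\) is randomly assigned, the difference in expected outcomes between treatment and control within a level of \(X\) is the CATE for that subgroup:

\[\begin{aligned}
E[Y_i \mid Z_i = 1, X_i = 0] - E[Y_i \mid Z_i = 0, X_i = 0] &= \beta \\
E[Y_i \mid Z_i = 1, X_i = 1] - E[Y_i \mid Z_i = 0, X_i = 1] &= \beta + \delta
\end{aligned}\]

The difference between these two CATEs is \((\beta + \delta) - \beta = \delta\).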

The coefficient \(\delta\) is the interaction effect and is interpreted as the ATE of the job training program among individuals receiving a scholarship minus the ATE of the job training program among individuals not receiving a scholarship. This has a causal interpretation (i.e., \(\delta\) is a treatment-by-treatment interaction) when scholarships are randomly assigned and a descriptive interpretation (i.e., \(\delta\) is a treatment-by-covariate interaction) when scholarships are not randomly assigned.
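The estimation steps in this section can be sketched in a few lines of code. The example below uses simulated data (hypothetical, not from any real study) in which the CATE is assumed to be 2 among non-recipients and 5 among scholarship recipients. Because \(Z\) and \(X\) are both binary, the least-squares coefficients of the regression model are exactly the contrasts of the four cell means, so the sketch computes them that way rather than calling a regression routine. The function names are illustrative only.

```python
import random
from statistics import mean

random.seed(42)

# Simulated data: X = 1 if scholarship, Z = 1 if assigned to job training.
n = 2000
data = []
for _ in range(n):
    x = random.randint(0, 1)
    z = random.randint(0, 1)  # simple random assignment
    # Assumed data-generating process: alpha=10, beta=2, gamma=1, delta=3
    y = 10 + 2 * z + 1 * x + 3 * z * x + random.gauss(0, 1)
    data.append((y, z, x))

def cell_mean(data, z, x):
    """Mean outcome among units with the given treatment and covariate values."""
    return mean(y for (y, zi, xi) in data if zi == z and xi == x)

def estimate(data):
    """Estimate CATEs and the interaction from the four cell means."""
    m00, m10 = cell_mean(data, 0, 0), cell_mean(data, 1, 0)
    m01, m11 = cell_mean(data, 0, 1), cell_mean(data, 1, 1)
    cate_x0 = m10 - m00  # estimate of beta
    cate_x1 = m11 - m01  # estimate of beta + delta
    return {
        "alpha": m00,
        "beta": cate_x0,
        "gamma": m01 - m00,
        "delta": cate_x1 - cate_x0,  # difference in CATEs
        "cate_x0": cate_x0,
        "cate_x1": cate_x1,
    }

est = estimate(data)
for name, value in est.items():
    print(name, round(value, 2))
```

With a continuous or multi-valued covariate the model is no longer saturated, and the coefficients would need to be estimated with a regression routine instead.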

7 Hypothesis testing: Differences in CATEs

We might want to test whether an estimated interaction effect is just the result of noise in the data rather than reflecting a true difference in CATEs. We can take one of two approaches. A common practice in the regression approach is to use the standard error for the interaction coefficient in the regression output. For more on how to read regression table output, see 10 Things You Need to Know About Reading a Regression Table. Alternatively, one can take a randomization inference approach. This entails generating a full schedule of potential outcomes under the null hypothesis that the true treatment effect is constant and equal to the estimated ATE. Then we would simulate random assignment a large number of times and calculate how often the simulation produces an estimate of the interaction effect that is at least as large in absolute value as the actual estimate. For more on randomization inference, see 10 Things You Need to Know About Randomization Inference.

One can also combine these two methods, conducting randomization inference in a regression framework. For two-sided tests, we can use the \(F\)-statistic as the test statistic, where the null model is \[ \begin{aligned} Y_i &= \alpha + \beta Z_i + \gamma X_i + \varepsilon_i \end{aligned} \] and the alternative model is \[ \begin{aligned} Y_i &= \alpha + \beta Z_i + \gamma X_i + \delta Z_iX_i + \varepsilon_i . \end{aligned} \]

The coefficient on the interaction term may be used as the test statistic for one-sided tests, given the appropriate model.
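The randomization inference procedure described above can be sketched as follows. This is a minimal illustration that assumes complete random assignment, a binary covariate, and simulated data. It uses the absolute difference in estimated CATEs as the test statistic rather than the \(F\)-statistic, which keeps the example free of regression code. The function names are hypothetical.

```python
import random
from statistics import mean

def diff_in_cates(y, z, x):
    """Estimated CATE among X=1 minus estimated CATE among X=0."""
    def cate(g):
        treated = [yi for yi, zi, xi in zip(y, z, x) if zi == 1 and xi == g]
        control = [yi for yi, zi, xi in zip(y, z, x) if zi == 0 and xi == g]
        return mean(treated) - mean(control)
    return cate(1) - cate(0)

def ri_interaction_test(y, z, x, n_sims=1000, seed=1):
    """Two-sided randomization test for the difference in CATEs."""
    rng = random.Random(seed)
    ate_hat = (mean(yi for yi, zi in zip(y, z) if zi == 1)
               - mean(yi for yi, zi in zip(y, z) if zi == 0))
    # Full schedule of potential outcomes under the null of a constant
    # treatment effect equal to the estimated ATE
    y0 = [yi - ate_hat * zi for yi, zi in zip(y, z)]
    y1 = [v + ate_hat for v in y0]
    observed = diff_in_cates(y, z, x)
    z_sim = list(z)
    extreme = 0
    for _ in range(n_sims):
        rng.shuffle(z_sim)  # simulate a new complete random assignment
        y_sim = [a if zi == 1 else b for a, b, zi in zip(y1, y0, z_sim)]
        if abs(diff_in_cates(y_sim, z_sim, x)) >= abs(observed):
            extreme += 1
    return observed, extreme / n_sims

# Simulated (hypothetical) data: effect is 0 when X=0 and 3 when X=1.
rng = random.Random(3)
n = 400
x = [i % 2 for i in range(n)]
z = [1] * (n // 2) + [0] * (n // 2)
rng.shuffle(z)
y = [rng.gauss(0, 1) + 3.0 * zi * xi for zi, xi in zip(z, x)]

observed, p_value = ri_interaction_test(y, z, x)
print("estimated interaction:", round(observed, 2), "p-value:", p_value)
```

If assignment were blocked or clustered, the shuffling step would need to mirror that design instead of permuting the whole assignment vector.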

An aside on interpretation: It’s important to distinguish between difference in significance and difference in effects. If, for example, our evidence suggests that the ATE among female study participants is significantly different from zero, but that the ATE among male participants is not, this doesn’t necessarily mean that the two CATEs are different from each other. Instead, we should use a test of the difference between CATEs as described above.

8 Multiple comparisons

Researchers interested in heterogeneous treatment effects are likely to encounter the problem of multiple comparisons: for example, when numerous subgroup analyses are conducted, the probability that at least one result looks statistically significant may be considerably greater than the specified alpha level (typically 5 percent) even when the treatment has no effect on anyone.6
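To get a sense of the magnitude, the short calculation below computes the probability of at least one false positive across \(k\) tests. It is a simplified sketch that assumes every null hypothesis is true and that the tests are independent, which subgroup analyses on the same sample typically are not; the function name is hypothetical.

```python
def familywise_error_rate(k, alpha=0.05):
    """P(at least one false positive) across k independent tests of true nulls."""
    return 1 - (1 - alpha) ** k

for k in (1, 5, 10, 20):
    print(k, "tests:", round(familywise_error_rate(k), 3))
```

Under these assumptions, 10 subgroup tests at the 5 percent level yield at least one nominally significant result about 40 percent of the time.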

One way to mitigate the multiple comparisons problem is to reduce the number of tests conducted, for example, by analyzing a small number of pre-specified subgroups. Another approach is to adjust the \(p\)-values to account for the fact that multiple hypotheses are being tested simultaneously. For more on how to implement various methods of adjustment for multiple comparisons, see 10 Things You Need to Know About Multiple Comparisons.
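As an illustration of \(p\)-value adjustment, the sketch below implements two common corrections, Bonferroni and Holm, on a made-up list of \(p\)-values. These two methods are chosen here as examples and are not necessarily the ones a given analysis should use; the input values are hypothetical.

```python
def bonferroni(pvals):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(pvals)
    return [min(1.0, p * m) for p in pvals]

def holm(pvals):
    """Holm step-down adjustment, returned in the original order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        # The smallest p-value is multiplied by m, the next by m - 1, and so on
        adj = min(1.0, (m - rank) * pvals[i])
        # Enforce monotonicity so adjusted p-values never decrease with rank
        running_max = max(running_max, adj)
        adjusted[i] = running_max
    return adjusted

# Hypothetical unadjusted p-values from four subgroup interaction tests
pvals = [0.01, 0.04, 0.03, 0.005]
print("Bonferroni:", [round(p, 3) for p in bonferroni(pvals)])
print("Holm:      ", [round(p, 3) for p in holm(pvals)])
```

Holm's adjusted values are never larger than Bonferroni's, so it rejects at least as many hypotheses while still controlling the family-wise error rate.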

9 Pre-analysis plans

Pre-specifying the investigation of heterogeneous treatment effects has many benefits. First, we can reduce the number of CATEs and interactions under consideration for hypothesis testing by clearly indicating which tests are of primary interest in a registered pre-analysis plan (PAP). Additional subgroup analyses can be conceptualized and specified as exploratory or descriptive analyses in the PAP. Another bonus is that if we prefer a one-sided test, we can commit to that choice in the PAP before seeing the outcome data, so that we “cannot be justly accused of cherry-picking the test after the fact” (Olken 2015). See our guide 10 Things to Know About Pre-Analysis Plans for more on pre-registration. Further, if we want to demonstrate that heterogeneous effects do not exist (and are not, for example, what is driving a null result), pre-specifying theoretically derived expectations around lack of treatment effect heterogeneity can be useful.

10 Automate exploratory search for heterogeneous effects

Machine learning methods can be used to automate the search for systematic variation in treatment effects. These automated approaches are attractive because they minimize researchers’ discretion in selecting and testing interactions. They are also useful for conducting exploratory analyses, since these types of analyses are rarely pre-specified.

Popular machine learning methods for heterogeneous treatment effects include support vector machines (R package FindIt),7 Bayesian additive regression trees (R package BayesTree),8 classification and regression trees (R package causalTree) (Athey and Imbens 2016), random forests or “causal forests” (Wager and Athey 2016), and kernel regularized least squares (R package KRLS).9

In addition to single machine learning methods, ensemble methods may be used. Ensemble methods estimate a weighted average of multiple machine learning estimates of heterogeneous effects where the weights are a function of out-of-sample prediction performance.10

References

Anderson, Michael L. 2008. “Multiple Inference and Gender Differences in the Effects of Early Intervention: A Reevaluation of the Abecedarian, Perry Preschool, and Early Training Projects.” Journal of the American Statistical Association 103: 1481–95.
Angrist, J., and J. Pischke. 2009. Mostly Harmless Econometrics. Princeton University Press.
Athey, Susan, and Guido W. Imbens. 2016. “Recursive Partitioning for Heterogeneous Causal Effects.” Proceedings of the National Academy of Sciences 113: 7353–60.
Bansak, K. 2021. “Estimating Causal Moderation Effects with Randomized Treatments and Non-Randomized Moderators.” Journal of the Royal Statistical Society Series A: Statistics in Society 184 (1): 65–86.
Blair, Graeme, Jeremy M. Weinstein, et al. 2021. “Community Policing Does Not Build Citizen Trust in Police or Reduce Crime in the Global South.” Science 374 (6571): 1098.
Chipman, H. A., E. I. George, and R. E. McCulloch. 2010. “BART: Bayesian Additive Regression Trees.” Annals of Applied Statistics 4 (1): 266–98.
Cook, Richard J., and Vern T. Farewell. 1996. “Multiplicity Considerations in the Design and Analysis of Clinical Trials.” Journal of the Royal Statistical Society, Series A 159: 93–110.
Gelman, Andrew, Jennifer Hill, and Masanao Yajima. 2012. “Why We (Usually) Don’t Have to Worry about Multiple Comparisons.” Journal of Research on Educational Effectiveness 5: 189–211.
Gerber, Alan S., and Donald P. Green. 2012. Field Experiments: Design, Analysis, and Interpretation. W.W. Norton.
Green, Donald P., and Holger L. Kern. 2012. “Modeling Heterogeneous Treatment Effects in Survey Experiments with Bayesian Additive Regression Trees.” Public Opinion Quarterly 76 (3): 491–511.
Grimmer, Justin, Solomon Messing, and Sean J. Westwood. 2014. “Estimating Heterogeneous Treatment Effects and the Effects of Heterogeneous Treatments with Ensemble Methods.”
Hainmueller, Jens, and Chad Hazlett. 2013. “Kernel Regularized Least Squares: Reducing Misspecification Bias with a Flexible and Interpretable Machine Learning Approach.” Political Analysis.
Hill, Jennifer L. 2011. “Bayesian Nonparametric Modeling for Causal Inference.” Journal of Computational and Graphical Statistics 20 (1): 217–40.
Imai, Kosuke, and Marc Ratkovic. 2013. “Estimating Treatment Effect Heterogeneity in Randomized Program Evaluation.” Annals of Applied Statistics 7 (1): 443–70.
Laan, Mark J. van der, Eric Polley, and Alan Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1).
Olken, Benjamin A. 2015. “Promises and Perils of Pre-Analysis Plans.” Journal of Economic Perspectives 29 (3): 61–80.
Schulz, Kenneth F., and David A. Grimes. 2005a. “Multiplicity in Randomised Trials I: Endpoints and Treatments.” Lancet 365: 1591–95.
———. 2005b. “Multiplicity in Randomised Trials II: Subgroups and Interim Analyses.” Lancet 365: 1657–61.
Stephens, Alisa, Luke Keele, and Marshall Joffe. 2016. “Generalized Structural Mean Models for Evaluating Depression as a Post-Treatment Effect Modifier of a Jobs Training Intervention.”
Vansteelandt, S. 2010. “Estimation of Controlled Direct Effects on a Dichotomous Outcome Using Logistic Structural Direct Effect Models.” Biometrika 97: 921–34.
Vansteelandt, S., and E. Goetghebeur. 2003. “Causal Inference with Generalized Structural Mean Models.” Journal of the Royal Statistical Society, Series B 65: 817–35.
———. 2004. “Using Potential Outcomes as Predictors of Treatment Activity via Strong Structural Mean Models.” Statistica Sinica 14: 907–25.
Wager, Stefan, and Susan Athey. 2016. “Estimation and Inference of Heterogeneous Treatment Effects Using Random Forests.”
Westfall, Peter H., Randall D. Tobias, and Russell D. Wolfinger. 2011. Multiple Comparisons and Multiple Tests Using SAS. 2nd Ed. SAS Institute.

Footnotes

  1. This guide draws heavily from Gerber and Green (2012) and from Don Green’s course notes on experimental methods at Columbia University and builds on an older version of this guide by Albert Fang.

  2. This is known as the Fundamental Problem of Causal Inference. For more background, see 10 Things You Need to Know About Causal Inference.

  3. We can also define ATEs for subgroups defined by the individuals’ treatment status (e.g., the ATE among those who were assigned to treatment, also called the ATT or average treatment effect on the treated) or individuals’ post-treatment outcomes. We do not focus on these types of CATEs because of estimation challenges described below.

  4. For further reading (at an advanced technical level), see Vansteelandt and Goetghebeur (2003); Vansteelandt and Goetghebeur (2004); Vansteelandt (2010); Stephens, Keele, and Joffe (2016).

  5. See Bansak (2021) for more on causal moderation.

  6. For more background and a range of views on the multiple comparisons problem, see, e.g.: 10 Things You Need to Know About Multiple Comparisons; Cook and Farewell (1996); Schulz and Grimes (2005a); Schulz and Grimes (2005b); Anderson (2008); Westfall, Tobias, and Wolfinger (2011); Gelman, Hill, and Yajima (2012).

  7. See, for example, Imai and Ratkovic (2013).

  8. See Chipman, George, and McCulloch (2010); Hill (2011); Green and Kern (2012).

  9. See Hainmueller and Hazlett (2013).

  10. See Laan, Polley, and Hubbard (2007); Grimmer, Messing, and Westwood (2014).