
Lord's paradox is explained by regression to the mean, not causal inference

The paradox

Lord’s paradox originally appeared in a 1967 article published in Psychological Bulletin entitled “A Paradox in the Interpretation of Group Comparisons” (Lord 1967). Here’s the original formulation.

A large university is interested in investigating the effects on the students of the diet provided in the university dining halls and any sex difference in these effects. Various types of data are gathered. In particular, the weight of each student at the time of his arrival in September and his weight the following June are recorded.

At the end of the school year, the data are independently examined by two statisticians. Both statisticians divide the students according to sex. The first statistician examines the mean weight of the girls at the beginning of the year and at the end of the year and finds these to be identical. On further investigation, he finds that the frequency distribution of weight for the girls at the end of the year is actually the same as it was at the beginning.

He finds the same to be true for the boys. Although the weight of individual boys and girls has usually changed during the course of the year, perhaps by a considerable amount, the group of girls considered as a whole has not changed in weight, nor has the group of boys. A sort of dynamic equilibrium has been maintained during the year.

The whole situation is shown by the solid lines in the diagram (Figure 1). Here the two ellipses represent separate scatter-plots for the boys and the girls. The frequency distributions of initial weight are indicated at the top of the diagram and the identical distributions of final weight are indicated on the left side. People falling on the solid 45° line through the origin are people whose initial and final weight are identical. The fact that the center of each ellipse lies on this 45° line represents the fact that there is no mean gain for either sex.

Figure 1: Weight at end of year (Y) versus weight at beginning of year (X)

The first statistician concludes that as far as these data are concerned, there is no evidence of any interesting effect of the school diet (or of anything else) on student weight. In particular, there is no evidence of any differential effect on the two sexes, since neither group shows any systematic change.

The second statistician, working independently, decides to do an analysis of covariance. After some necessary preliminaries, he determines that the slope of the regression line of final weight on initial weight is essentially the same for the two sexes. This is fortunate since it makes possible a fruitful comparison of the intercepts of the regression lines. (The two regression lines are shown in the diagram as dotted lines. The figure is accurately drawn, so that these regression lines have the appropriate mathematical relationships to the ellipses and to the 45° line through the origin.) He finds that the difference between the intercepts is statistically highly significant.

The second statistician concludes, as is customary in such cases, that the boys showed significantly more gain in weight than the girls when proper allowance is made for differences in initial weight between the two sexes. When pressed to explain the meaning of this conclusion in more precise terms, he points out the following: If one selects on the basis of initial weight a subgroup of boys and a subgroup of girls having identical frequency distributions of initial weight, the relative position of the regression lines shows that the subgroup of boys is going to gain substantially more during the year than the subgroup of girls.

The college dietician is having some difficulty reconciling the conclusions of the two statisticians. The first statistician asserts that there is no evidence of any trend or change during the year for either boys or girls, and consequently, a fortiori, no evidence of a differential change between the sexes. The data clearly support the first statistician since the distribution of weight has not changed for either sex.

The second statistician insists that wherever boys and girls start with the same initial weight, it is visually (as well as statistically) obvious from the scatter-plot that the subgroup of boys gains more than the subgroup of girls.

It seems to the present writer that if the dietician had only one statistician, she would reach very different conclusions depending on whether this were the first statistician or the second. On the other hand, granted the usual linearity assumptions of the analysis of covariance, the conclusions of each statistician are visibly correct.

This paradox seems to impose a difficult interpretative task on those who wish to make similar studies of preformed groups. It seems likely that confused interpretations may arise from such studies.

What is the ‘explanation’ of the paradox? There are as many different explanations as there are explainers.

Figure 1, which is taken directly from Lord’s article, captures the entire situation. It’s a fantastically simple plot, and yet Lord derives much perplexity from it. We have two groups (boys and girls), and everyone has some variable measured before and after some treatment. That’s all that’s going on here! That so much perplexity emerges from considering such a simple study is what makes this so hilariously frustrating.

What’s going on here? There are at least two perspectives from which we can analyze the paradox and hopefully eliminate some of that perplexity. The first is the perspective of regression to the mean and the second is the perspective of causal inference. In the remainder of this blog post, we’ll consider each perspective individually and then attempt to arrive at some conclusions.

Regression to the mean

If you spend a few seconds looking at Figure 1, you’ll notice a curious feature: the two regression lines, in addition to differing from each other, differ from the 45° line going through the origin. The latter, aka the line \(Y=X\), goes straight through the data clouds for both the boys and the girls1. Why do the regression lines, which estimate the conditional mean of the weight at the end of the year given the weight at the beginning, differ from the line straight through the data? The answer to this question is “regression to the mean,” and it lies at the heart of Lord’s paradox.2

Let’s look at a different example before returning to Lord’s case.

Regression to the mean in the NBA

The Denver Nuggets had a great season last year, earning the number 1 seed in the Western Conference and eventually winning the NBA championship. They brought back most of their guys this year (and didn’t make any big acquisitions). Should they expect to have a better or worse record this year than they did last year?

We can attempt to answer this question with data. In particular, we can look at all the previous times3 the Nuggets had a good season (defined as an above .500 winning percentage) and see how their record compared the following season.

library(readxl)
library(tidyverse)

# read and wrangle data
# seasons like "2010-11" were read by Excel as dates; map the resulting
# Excel date serials back to the intended season labels
correct_year = str_c(2010:2000, "-", str_pad(11:1, 2, pad = 0))
names(correct_year) = as.character(
  c(40483, 40087, 39692, 39295, 38899, 38504, 38108, 37712, 37316, 36923, 36526)
  )
nba_records <- read_excel("data/Historical_NBA_Performance.xlsx") %>%
  mutate(Year = coalesce(correct_year[Year], Year)) %>% # Because Excel is stupid and interpreted things as dates 
  rename(win_perc = `Winning Percentage`) %>%
  mutate(Year_n = as.integer(str_sub(Year, 1, 4)))


# show denver data
denver <- nba_records %>%
  filter(Team == "Nuggets") 

denver %>%
  arrange(Year_n) %>%
  slice_head(n = 10)
## # A tibble: 10 × 5
##    Year    Team    Record win_perc Year_n
##    <chr>   <chr>   <chr>     <dbl>  <int>
##  1 1976-77 Nuggets 50-32     0.61    1976
##  2 1977-78 Nuggets 48-34     0.585   1977
##  3 1978-79 Nuggets 47-35     0.573   1978
##  4 1979-80 Nuggets 30-52     0.366   1979
##  5 1980-81 Nuggets 37-45     0.451   1980
##  6 1981-82 Nuggets 46-36     0.561   1981
##  7 1982-83 Nuggets 45-37     0.549   1982
##  8 1983-84 Nuggets 38-44     0.463   1983
##  9 1984-85 Nuggets 52-30     0.634   1984
## 10 1985-86 Nuggets 47-35     0.573   1985
# identify seasons in which Denver had a winning record and compare their
# record in those seasons to the seasons that followed
denver = denver %>%
  arrange(Year_n) 

idx_good <- which(denver$win_perc > .5)
idx_good <- idx_good[idx_good < nrow(denver)]  # a winning final season would have no "next season" (a no-op here)
idx_after <- idx_good + 1

length(idx_good)
## [1] 21
sum(denver$win_perc[idx_good] > denver$win_perc[idx_after])
## [1] 16

In 16 out of the 21 seasons in which Denver had a winning record, their record the season after was worse. That’s not to say the following season was bad: in fact, in 15 of the 21 cases, they achieved an above-.500 record the following season too. Good seasons tend to be followed by seasons closer to average, and that’s what we mean by regression towards the mean.

We can get a better estimate of the probability of doing at least as well next season as this season, given a winning record this season, by incorporating the data from every NBA team, not just Denver.

nba_records %>%
  arrange(Year_n) %>%
  group_by(Team) %>%
  mutate(pre_win_perc = lag(win_perc)) %>%
  ungroup() %>%
  filter(pre_win_perc > .5) %>%
  summarise(n_seasons = n(), estimated_prob = mean(win_perc >= pre_win_perc))
## # A tibble: 1 × 2
##   n_seasons estimated_prob
##       <int>          <dbl>
## 1       707          0.376

Our estimate, based on \(n=707\) seasons’ worth of data, says that there is a less than \(40\%\) chance of having as good a record next season as you did this season if you were above \(.500\) this season.
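For what it’s worth, if we naively treat the 707 team-seasons as independent draws (they aren’t quite, since the same franchises recur year after year), the back-of-the-envelope binomial standard error of this estimate is small. The names p_hat and n_seasons below are just mine for this quick calculation:

p_hat <- 0.376
n_seasons <- 707
sqrt(p_hat * (1 - p_hat) / n_seasons)  # roughly 0.018, so the gap below 50% isn't just noise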

A final thing we can do is look at a particular two-season slice and see which teams are regressing towards the mean of .500 and which are diverging away from it. For the 2014-15 and 2015-16 seasons, shown below, we can see that there are more teams regressing towards the mean than diverging away from it.

library(ggrepel)
nba_records %>%
  filter(2015 >= Year_n & Year_n >= 2014) %>%
  arrange(Year_n) %>%
  group_by(Team) %>%
  mutate(pre_win_perc = lag(win_perc)) %>%
  filter(Year_n == 2015) %>%
  # cat1 = moved towards .500 relative to last season; cat2 = moved away from .500
  mutate(category = case_when(
    (pre_win_perc < win_perc & pre_win_perc < .5) | (pre_win_perc > win_perc & pre_win_perc > .5) ~ "cat1",
    (pre_win_perc > win_perc & pre_win_perc < .5) | (pre_win_perc < win_perc & pre_win_perc > .5) ~ "cat2"
  )) %>%
  ggplot(aes(pre_win_perc, win_perc, color = category)) +
  geom_point() +
  geom_text_repel(aes(label = Team), max.overlaps = 20, size = 3, key_glyph = "blank") +
  geom_abline(aes(col = "line", slope=1, intercept = 0)) +
  scale_color_manual(
    breaks = c("cat1", "cat2", "line"), 
    values = c("red", "green", "black"),
    labels = c("Regressing towards mean", "Diverging away from mean", "2015-16 wp = 2014-15 wp")) +
  labs(x = "2014-15 win percentage", y = "2015-16 win percentage") +
  theme_minimal() +
  theme(aspect.ratio=1, legend.position = c(.2, .85), legend.title = element_blank()) +
  guides(color = guide_legend(override.aes = list(linetype = c(0, 0, 1), cex = c(1.5,1.5,0) ) ) )

This is generally the picture we get if we look at any two consecutive seasons.

Regression to the mean is baked into the bivariate normal

If random variables \((X,Y)\) are drawn from a bivariate normal distribution \[ (X,Y) \sim N((\mu_X, \mu_Y), \begin{bmatrix} \sigma^2_X & \sigma_{XY} \\ \sigma_{XY} & \sigma^2_Y \end{bmatrix}), \] then the conditional expectation of one given the other4 is

\[\begin{align} E[Y \mid X=x] &= \mu_Y + \frac{\sigma_{XY}}{\sigma_X^2}(x-\mu_X) \\ &= \mu_Y + \rho \frac{\sigma_Y}{\sigma_X}(x-\mu_X) \tag{1} \end{align}\]
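For the curious, here’s a quick sketch of why (1) holds (the same standard argument spelled out at the link in footnote 4). Define

\[ \epsilon := Y - \mu_Y - \frac{\sigma_{XY}}{\sigma_X^2}(X-\mu_X), \qquad \operatorname{Cov}(\epsilon, X) = \sigma_{XY} - \frac{\sigma_{XY}}{\sigma_X^2}\sigma_X^2 = 0. \]

Since \((\epsilon, X)\) is jointly normal, zero covariance implies independence, so \(E[\epsilon \mid X=x] = E[\epsilon] = 0\), and taking conditional expectations in the definition of \(\epsilon\) gives (1).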

Suppose now that \(X \overset{D}{=} Y\) so in particular \(\mu_X = \mu_Y\) and \(\sigma_X = \sigma_Y,\) and that \(X\) and \(Y\) are not anti-correlated, i.e. \(\rho\geq0\). Substituting into (1), we see that conditional expectations are weighted averages:

\[\begin{equation} E[Y \mid X=x] = \rho x + (1-\rho)\mu_Y \tag{2} \end{equation}\]

If \(x > \mu_X = \mu_Y\), then \(E[Y \mid X=x] \leq x\), as it is pulled down towards the mean. On the other hand, if \(x< \mu_X\), then \(E[Y \mid X=x] \geq x\), as it is pulled up towards the mean. This is the phenomenon of regression to the mean. The value of \(\rho\) determines how much regressing occurs. For large values of \(\rho\) close to 1, there is not much regressing (for \(\rho=1,\) no regressing at all). For low values of \(\rho\) close to 0, there is much regressing. I’m curious whether there are any other bivariate distributions with regression to the mean.
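Formula (2) is easy to check by simulation. Here’s a minimal sketch (the sample size, the window width, and the evaluation point \(x_0 = 1\) are arbitrary choices of mine): estimate \(E[Y \mid X \approx x_0]\) by averaging \(Y\) over draws whose \(X\) lands in a narrow window around \(x_0\), and compare against \(\rho x_0 + (1-\rho)\mu_Y\).

library(MASS)
set.seed(1)
rho <- 0.6
xy_sim <- mvrnorm(n = 1e6, mu = c(0, 0), Sigma = matrix(c(1, rho, rho, 1), 2, 2))

x0 <- 1
near_x0 <- abs(xy_sim[, 1] - x0) < 0.05  # draws with X close to x0
mean(xy_sim[near_x0, 2])                 # empirical conditional mean, approximately 0.6
rho * x0 + (1 - rho) * 0                 # the weighted average in (2): exactly 0.6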

Simulating Lord’s data with the bivariate normal

We can simulate students’ pre- and post-year weights as a bivariate normal in R.

library(MASS)
# simulate 1000 girls and 1000 boys; within each group, pre- and post-year
# weights have the same mean and correlation 0.6
Sigma = matrix(c(1, .6, .6, 1), 2, 2)
set.seed(100)
xy_girls = mvrnorm(n = 1000, c(4, 4), Sigma)
xy_boys = mvrnorm(n = 1000, c(6, 6), Sigma)
par(pty = "s")
plot(xy_girls, xlim = c(0, 8), ylim = c(0, 8), col = "blue", 
     xlab = "Pre-year weight", ylab = "Post-year weight")
points(xy_boys, col = "green")
lm_girls = lm(xy_girls[,2] ~ xy_girls[,1])
lm_boys = lm(xy_boys[,2] ~ xy_boys[,1])
abline(lm_girls, col = "blue", lwd = 2)
abline(lm_boys, col = "green", lwd = 2)
abline(0,1, col = "yellow", lwd = 2)
legend(.5, 7.5, legend = c("girls", "boys", "x=y"), 
       col = c("blue", "green", "yellow"), pch = 1)

Figure 2: Simulating Lord’s data as two bivariate normal distributions

There is on average no increase or decrease in weight during the year among the boys, nor among the girls. And yet, for each fixed starting weight, the average boy who begins the year at that weight ends the year 0.8 units heavier than the average girl who begins at that same weight.

For example, a boy and a girl who both weigh 5 units at the beginning of the school year will be expected to weigh 5.4 and 4.6 units, respectively, at the end of the school year. This is because both the boy and the girl are regressing towards their respective means. The boy starts the year below the average weight for boys, so regression to the mean says we should expect him to gain weight. The girl, who starts the year at the same weight as the boy, is above the average weight for girls, so for her, regressing to the mean means losing weight.

Using the formula for the conditional expectation of post-year weight given pre-year weight (2), it’s not too difficult to see that no matter what the starting weight is, if a boy and a girl both start at that weight, the boy’s expected final weight will exceed the girl’s by exactly \((1-\rho)(\mu_{\text{boys}} - \mu_{\text{girls}}) = 0.4 \times 2 = 0.8\).
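We can check this number against the regression lines fitted in the Figure 2 code (a quick sketch reusing the lm_girls and lm_boys objects from above; the starting weight of 5 units matches the example):

x0 <- 5
pred_girl <- unname(coef(lm_girls)[1] + coef(lm_girls)[2] * x0)
pred_boy  <- unname(coef(lm_boys)[1]  + coef(lm_boys)[2]  * x0)
pred_girl             # approximately 4.6
pred_boy              # approximately 5.4
pred_boy - pred_girl  # approximately 0.8 = (1 - 0.6) * (6 - 4)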

Causal inference

So far, we’ve seen that when you have bivariate normal data and the components have identical marginal distributions, you get regression to the mean. When you have two different bivariate distributions, and within each group, there’s regression to the mean, you’re at risk of encountering Lord’s paradox.

Are we done? Have we solved the paradox and understood it completely? From the perspective of causal inference, our understanding so far is not enough. This perspective is not satisfied with mere joint distributions, but rather seeks to identify and estimate the causal relationships between the variables in the problem.

In Figure 2, there are three explicit variables. Let’s establish some new notation for these variables, which we’ll use for the rest of the post: \(W_I\) represents the weight of a student before the school year, \(W_F\), the weight of the student after the school year, and \(S\), the student’s sex. The data are simulated such that \[ (W_I,W_F) \mid S \sim N((\mu_S, \mu_S), \begin{bmatrix} 1 & 0.6 \\ 0.6 & 1 \end{bmatrix}), \] where \(\mu_1 = 6\) (\(\mu_0 = 4\)) is the mean weight of male (female) students both before and after the school year. While this model more or less specifies the joint distribution of the variables \(W_I\), \(W_F\), and \(S\), it doesn’t specify the causal relationships between them. We’ll later discuss whether it even makes sense to talk about causality in Lord’s paradox. However, if we do insist on bringing causality into the picture, the causal relationships between the three variables are not up for too much debate: because there is a temporal ordering in the determination of the three variables, namely \(S\) precedes \(W_I\), which in turn precedes \(W_F\), we must have something like

library(dagitty)
library(ggdag)
dag1 <- dagitty("dag{
               W_I -> W_F ;
               S -> W_I ;
               S -> W_F
               S [exposure]
               }") 
dag1_t = dag1 %>%
  tidy_dagitty(layout="circle", seed = 100) 

node_labels = c("S", expression(W[F]), expression(W[I]))

dag1_t %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_edges() +
  geom_dag_text(label = node_labels) +
  theme_dag()

Of course, in Lord’s paradox, there’s another variable in the mix, which is the weight gain during the year, aka the change score. This variable can be expressed as the difference between \(W_F\) and \(W_I\).

If we call this variable \(Y\) and add it to a DAG, then we might get something like this:

dag2 <- dagitty("dag{
               W_I -> W_F ;
               S -> W_I ;
               S -> W_F ;
               W_I -> Y ;
               W_F -> Y
               S [exposure]
               }")

dag2_t = dag2 %>%
  tidy_dagitty(layout="auto", seed = 100) %>%
  mutate(edge_labels = c("", "", "+1", "", "-1", NA))

node_labels = c("S", expression(W[F]), expression(W[I]), "Y")


dag2_t %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_edges(aes(label = edge_labels), vjust = -1) +
  geom_dag_text(label = node_labels) +
  theme_dag()

The +1 and -1 on the edges into \(Y\) indicate that \(Y = W_F-W_I\).

So far, we haven’t made any parametric assumptions, just causal ones. If we further assume that each node is a linear combination of its parents plus independent noise, then we can put path coefficients on all the edges of our DAG:

dag2_t <- dag2_t %>%
  mutate(edge_labels = c("b", "a", "+1", "c", "-1", NA)) 

dag2_t %>%
  ggplot(aes(x = x, y = y, xend = xend, yend = yend)) +
  geom_dag_point() +
  geom_dag_edges(aes(label = edge_labels), vjust=-1) +
  geom_dag_text(label = node_labels) +
  theme_dag()

Figure 3: The DAG that appears in Judea Pearl’s article about Lord’s paradox (his Figure 2b)

Given this DAG, we can read off what each statistician is estimating by tracing paths. Statistician 1’s change-score analysis targets the total effect of \(S\) on \(Y\), which sums the contributions of the three directed paths from \(S\) to \(Y\): \(S \to W_F \to Y\) contributes \(b\), \(S \to W_I \to W_F \to Y\) contributes \(ac\), and \(S \to W_I \to Y\) contributes \(-a\), for a total of \(b + ac - a = b + a(c-1)\). Statistician 2’s ANCOVA, which adjusts for \(W_I\), targets the direct effect \(b\). From the fact that statistician 1 concludes that there’s no difference in the diet’s effect between the genders, we conclude that \(b = a(1-c)\). Let’s see if we can fit the structural model to the data we previously generated.

library(lavaan)
xy <- rbind(xy_boys, xy_girls)
colnames(xy) = c("WI", "WF")
# first 1000 rows are boys (S = 1), the rest are girls (S = 0)
df_xy <- as_tibble(xy) %>%
  mutate(S = if_else(row_number() > 1000, 0, 1))

pearl <-   '
    WI ~ 1 + S
    WF ~ 1 + S + WI
'
set.seed(100)
fit1b <- sem(pearl, data=df_xy)
summary(fit1b)
## lavaan 0.6-12 ended normally after 24 iterations
## 
##   Estimator                                         ML
##   Optimization method                           NLMINB
##   Number of model parameters                         7
## 
##   Number of observations                          2000
## 
## Model Test User Model:
##                                                       
##   Test statistic                                 0.000
##   Degrees of freedom                                 0
## 
## Parameter Estimates:
## 
##   Standard errors                             Standard
##   Information                                 Expected
##   Information saturated (h1) model          Structured
## 
## Regressions:
##                    Estimate  Std.Err  z-value  P(>|z|)
##   WI ~                                                
##     S                 1.972    0.046   43.007    0.000
##   WF ~                                                
##     S                 0.701    0.049   14.342    0.000
##     WI                0.644    0.017   37.505    0.000
## 
## Intercepts:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .WI                4.013    0.032  123.782    0.000
##    .WF                1.431    0.073   19.526    0.000
## 
## Variances:
##                    Estimate  Std.Err  z-value  P(>|z|)
##    .WI                1.051    0.033   31.623    0.000
##    .WF                0.620    0.020   31.623    0.000

Our parameter estimates are \(\hat{a}=1.972\), \(\hat{b}=0.701\), and \(\hat{c} = 0.644\). According to the fitted model, the girls’ weights before and after the school year are given by

\[\begin{align} W_I &= 4.01 + \epsilon_I \\ W_F &= 1.43 + 0.64W_I + \epsilon_F, \end{align}\] where \(\epsilon_I\) and \(\epsilon_F\) are independently distributed

\[\begin{align} \epsilon_I &\sim N(0, 1.05) \\ \epsilon_F &\sim N(0, 0.62) \end{align}\]

4.01 and 1.05 are close to \(4\) and \(1\), so this model gets the marginal distribution of \(W_I\) correct. Less trivially, we have

\[\begin{align} W_F &= 1.43 + 0.64W_I + \epsilon_F \\ &= 4.00 + 0.64\epsilon_I + \epsilon_F \\ &\sim N(4.00, 1.05), \end{align}\] so the mean and variance of \(W_F\) are also close to 4 and 1. Also, just as Pearl (Pearl 2016) suspected, we have \(\hat{b}=0.701 \approx 1.97 \times (1-0.64) = \hat{a}(1-\hat{c}).\)5 This structural causal model fits the data well.
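As a sanity check on this arithmetic, we can run each statistician’s analysis directly on the simulated df_xy with plain lm() calls; under the DAG, the change-score coefficient targets \(b + a(c-1)\) and the ANCOVA coefficient targets \(b\):

# statistician 1: change-score analysis (total effect of S on Y = WF - WI)
coef(lm(I(WF - WI) ~ S, data = df_xy))["S"]  # approximately 0: b + a(c - 1)
# statistician 2: ANCOVA (direct effect of S on WF, holding WI fixed)
coef(lm(WF ~ S + WI, data = df_xy))["S"]     # approximately 0.70, matching lavaan's b-hat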

Conclusion: Causal inference is dumb in this setting

Let me provide Lord’s explanation of his paradox, the paragraph that immediately follows the excerpt I quoted at the top from the original article.

In the writer’s opinion, the explanation is that with the data usually available for such studies, there simply is no logical or statistical procedure that can be counted on to make proper allowances for uncontrolled preexisting differences between groups. The researcher wants to know how the groups would have compared if there had been no preexisting uncontrolled differences. The usual research study of this type is attempting to answer a question that simply cannot be answered in any rigorous way on the basis of available data.

In this writer’s opinion, the stated goal of these researchers makes very little sense. In particular, let us be clear that the clause “[had there] been no preexisting uncontrolled differences” refers to a counterfactual in which boys and girls weigh the same. Whether this counterfactual even makes sense, I’ll leave up to debate. What is certainly the case is that this counterfactual cannot possibly be relevant to any dietician or university administrator seeking to make a decision about the new dining hall.

Thus, I think the main reason Lord’s paradox is a paradox is that we’re being asked to view sex as a treatment or exposure, which doesn’t make a whole lot of sense. Much of the confusion is downstream of this basic one. A corollary is that it makes just as little sense to talk about the effect of sex, and so, in particular, the stated goal of targeting the effect of sex on the students’ weight under the new dining hall diet is a confused one at best.

Thus, I’m ultimately unsatisfied with the causal perspective that we walked through above. It’s great that we’ve managed to find a set of causal parameters such that each statistician can be said to be estimating one or another of them. But what does \(a\) or \(b\) or \(c\) even mean?? For example, how am I supposed to understand the direct effect of sex on follow-up weight, aka \(b\)? Does it correspond to a difference in the eating patterns of males and females during their year in college? Or to something else?

A recent paper titled “Lord’s paradox explained” (Tennant et al. 2022) doubles down on a causal explanation of Lord’s paradox and attempts to explain the meaning of the parameters \(a\), \(b\), and \(c\), using terms like “endogenous change” and “exogenous change.” I have to be honest, I didn’t understand what they were getting at at all.

On the other hand, I do think regression to the mean shows us what’s really going on.

So why did I go through the exercise of adopting a causal perspective, displaying the DAG and fitting the SEM? Well, two reasons.

1. I wanted to understand what Pearl and others were talking about.

2. In an observational study in which you have a genuine exposure variable, as well as a genuine baseline measurement, you are very much encouraged to approach the problem causally and draw a DAG (see the sketch below). In this DAG, you can safely assume that the arrow of causality between the baseline measurement and the exposure goes from the former to the latter and not vice versa: you wouldn’t be calling it a baseline measurement otherwise. In such a case, it makes sense to include the baseline measurement in your regression, because it’s a potential confounder. That is, ANCOVA (rather than change-score analysis) is the way to go in the setting where you can rigorously define a causal effect that you are trying to estimate (as opposed to trying to work your way through Lord’s paradox).
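Here’s a minimal dagitty sketch of that setting (the node names B for baseline, E for exposure, and Y for outcome are mine): the baseline affects both the exposure and the outcome, so it’s a confounder, and adjusting for it, i.e. ANCOVA, is exactly what the backdoor criterion prescribes.

library(dagitty)
dag3 <- dagitty("dag{
               B -> E ;
               B -> Y ;
               E -> Y
               E [exposure]
               Y [outcome]
               }")
adjustmentSets(dag3)  # returns { B }: adjust for the baseline measurement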

References

Lord, Frederic M. 1967. “A Paradox in the Interpretation of Group Comparisons.” Psychological Bulletin 68 (5): 304.
Pearl, Judea. 2016. “Lord’s Paradox Revisited (Oh Lord! Kumbaya!).” Journal of Causal Inference 4 (2). https://doi.org/10.1515/jci-2016-0021.
Tennant, Peter WG, Georgia D Tomova, Kellyn F Arnold, and Mark S Gilthorpe. 2022. “Lord’s Paradox Explained: The 50-Year Warning on the Analyses of Change Scores.” SSM Annual Scientific Meeting, August. https://doi.org/10.1136/jech-2022-ssmabstracts.9.

  1. Maybe more precisely, the line Y=X is the major axis of the ellipses that represent the contour lines of the boys’ and girls’ bivariate distributions. But I think it’s better to say, “it’s the line straight through the data.”↩︎

  2. And indeed, it is the very reason we call them “regression lines.” The term goes back to Francis Galton, who noticed regression to the mean while studying the heights of fathers and sons and fitting linear models to them. The name stuck, even though most of the time when you fit a linear regression model, the X and Y variables aren’t measurements of the same quantity, so regression to the mean isn’t happening (at least not literally).↩︎

  3. Our data from data.world actually goes up through the 2016-2017 season.↩︎

  4. A derivation can be found here: https://math.stackexchange.com/questions/1531865/conditional-expectation-of-a-bivariate-normal-distribution↩︎

  5. As we increase the sample size, the parameter estimates for \((a,b,c)\) would converge to \((2, 0.8, 0.6)\).↩︎