Randomization Inference Part 1: The Wilcoxon Signed Rank Statistic

Introduction

Last semester, I made my way through the first half of Paul Rosenbaum’s textbook Design of Observational Studies (2010). Rosenbaum is a leading expert in causal inference, an area of statistical research concerned with the identification of causal effects through the analysis of observational data.

One piece of advice I’ve heard for researchers in the social sciences who are interested in measuring a causal effect is to form a clear image of the randomized experiment that, were it feasible, ethical, etc., would be used to test for the effect’s presence. In this hypothetical, randomization plays the role of a benchmark which can be approached using the tools of causal inference. In the fortunate circumstances in which a randomized experiment is possible, the task of conducting inference simplifies greatly, although there is still some work to be done. Rosenbaum discusses randomization inference in a matched pair experiment in Chapter 2 of his textbook, alongside an example study that he works through. In the next post or two, I would like to summarize the method he presents to solidify my own understanding.

In this post, I discuss two non-parametric ways of doing inference in a randomized matched pair study. The first naïvely uses the mean treated-minus-control difference among the pairs as its test statistic, whereas the second uses the more efficient Wilcoxon signed rank statistic, which has the added benefit of opening up other inferential possibilities. In a follow-up post, I hope to talk about a method for quantifying the effect of the treatment without making any simplifying assumptions about the form the effect takes. The resulting quantity is known as the “attributable effect” of a treatment.

Hypothesis Testing with the Mean Treated-Minus-Control Difference

Let’s establish some notation to describe the set-up of our hypothetical experiment. Suppose we have $I$ pairs of individuals matched on the basis of some observed covariates. We will use $i=1,2,\ldots,I$ to index the pairs and $j=1,2$ to index the individuals within each pair. Let $Z_{ij}$ be an indicator for whether individual $i,j$ is treated. Since we treat exactly one individual in each pair, we have $Z_{i1}+Z_{i2}=1$ for $i=1,2,\ldots,I$. Finally, let $R_{ij}$ represent the response of individual $i,j$.

With this notation, the treated-minus-control difference for matched pair $i$ is given by $Y_i=(Z_{i1}-Z_{i2})(R_{i1}-R_{i2})$. One inferential goal for this study might be to test the sharp hypothesis that treatment has no effect on the response of any individual. In order to formalize this hypothesis, we must imagine that each individual has two potential responses: the response they would exhibit if assigned the treatment, $r_{Tij}$, and the response they would exhibit if assigned the control, $r_{Cij}$. Then the hypothesis we are testing states precisely that $r_{Cij}=r_{Tij}$ for all $i, j$. Packaging the potential responses for all individuals into vectors, we write the null hypothesis as $\boldsymbol{r_T}=\boldsymbol{r_C}$.

If treatment assignment is random within each pair, we will be able to perform this test. In the words of Ronald Fisher, randomization provides us with a “reasoned basis for inference.” To see this, let’s draw out the implications of the null hypothesis. If treatment has no effect, then an individual’s observed response is equal to their potential response under control (this of course holds if the individual is given the control; the point is that, because of the null hypothesis, it is also true if she is treated). That is, $R_{ij}=r_{Cij}$ for each individual $i,j$. The treated-minus-control difference for pair $i$ is thus $$Y_i=(Z_{i1}-Z_{i2})(R_{i1}-R_{i2})=(Z_{i1}-Z_{i2})(r_{Ci1}-r_{Ci2})$$

Now we can see from this equation that the randomness in $Y_i$ comes only from the treatment assignment $Z$. In particular, if a fair coin is used to decide assignment, we will observe $Y_i=(r_{Ci1}-r_{Ci2})$ and $Y_i=(r_{Ci2}-r_{Ci1})=-(r_{Ci1}-r_{Ci2})$ with equal probability $\frac{1}{2}$. This fact allows us to determine the exact null distribution of the mean treated-minus-control difference $\bar{Y}=\frac{1}{I}\sum_{i=1}^{I}{Y_i}$ after observing our data. For each pair $i$, if we observe $Y_i=y_i$, the null hypothesis assures us that had we happened to swap treatment with control, we would have observed a treated-minus-control difference of $-y_i$. It follows that after we observe the $Y_i$’s, we can calculate the value $\bar{Y}$ takes for each of the $2^I$ equally likely treatment assignments $(Z_{11}, Z_{12}, \ldots , Z_{I1}, Z_{I2})$. We can then compare our observed $\bar{Y}$ to this null distribution in order to assess the evidence for a treatment effect, find one- or two-sided p-values, and draw the corresponding conclusion.
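To make this concrete, here is a minimal sketch in Python. The differences below are made-up numbers for $I=8$ pairs, purely for illustration; the code enumerates all $2^I$ sign assignments and computes the exact null distribution of $\bar{Y}$:

```python
import itertools
import numpy as np

# Hypothetical treated-minus-control differences for I = 8 matched pairs.
y = np.array([1.2, -0.4, 2.1, 0.3, 1.7, -0.9, 0.8, 1.5])
I = len(y)

observed_mean = y.mean()

# Under the sharp null, Y_i = +|y_i| or -|y_i| with probability 1/2 each,
# so all 2^I sign vectors are equally likely.
null_means = np.array([
    np.asarray(signs) @ np.abs(y) / I
    for signs in itertools.product([-1, 1], repeat=I)
])

# One-sided p-value: fraction of assignments with a mean at least as large.
p_one = np.mean(null_means >= observed_mean)
# Two-sided p-value, using the symmetry of the null distribution about 0.
p_two = np.mean(np.abs(null_means) >= abs(observed_mean))

print(f"observed mean: {observed_mean:.3f}")
print(f"one-sided p: {p_one:.4f}, two-sided p: {p_two:.4f}")
```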

Hypothesis Testing with the Wilcoxon Signed Rank Statistic

If we follow the method outlined above, there is no reason to doubt the validity of our inferences. However, we would also like to be efficient at detecting effects when the responses $R_{ij}$ are not well-behaved, and in this respect, there are better test statistics than the mean treated-minus-control difference. One of the most popular is due to Frank Wilcoxon, and it’s known as the signed rank statistic.

Like $\bar{Y}$, the Wilcoxon signed rank statistic $T$ is built from the treated-minus-control differences $Y_i$, but it uses only their signs and the ranks of their absolute values. To construct $T$, we rank the absolute differences $|Y_i|$ from smallest to largest and then sum the ranks of all pairs $i$ whose treated-minus-control difference is positive. That is, let $q_i$ be the rank of $|Y_i|$. Then $T\equiv\sum_{i=1}^I{\mathop{\mathrm{sgn}}(Y_i)q_i}$, where $\mathop{\mathrm{sgn}}(x)=1$ if $x>0$ and $\mathop{\mathrm{sgn}}(x)=0$ if $x\leq0$. As we’ve seen, with no treatment effect, $Y_i=(r_{Ci1}-r_{Ci2})$ and $Y_i=(r_{Ci2}-r_{Ci1})=-(r_{Ci1}-r_{Ci2})$ with equal probability $\frac{1}{2}$. Thus, under the null and assuming non-zero treated-minus-control differences, each rank $q_i$ has a 50-50 shot of being included in the sum $T$. This makes the null distribution of $T$ particularly easy to calculate. In fact, assuming there are no ties among the absolute treated-minus-control differences, we can write down the null distribution of $T$ before we even look at our data, since it is simply the distribution of the sum obtained by including each of the numbers $1$ through $I$ independently with probability $\frac{1}{2}$. Under the null hypothesis, the expected value of $T$ is therefore $$\mathrm{E}(T)=\frac{1}{2}\sum_{i=1}^I{i}=\frac{I(I+1)}{4}$$ and the variance is $$\mathrm{Var}(T)=\sum_{i=1}^I{i^2\,\mathrm{Var}[\mathop{\mathrm{sgn}}(Y_i)]}=\frac{1}{4}\sum_{i=1}^I{i^2}=\frac{I(I+1)(2I+1)}{24}$$
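Here is a minimal sketch, again with hypothetical data, that computes $T$ and its exact null distribution by enumerating which ranks enter the sum; it assumes no ties and no zero differences:

```python
import itertools
import numpy as np

# Hypothetical treated-minus-control differences (no ties, no zeros).
y = np.array([1.2, -0.4, 2.1, 0.3, 1.7, -0.9, 0.8, 1.5])
I = len(y)

# Rank the |Y_i| from smallest (rank 1) to largest (rank I).
ranks = np.abs(y).argsort().argsort() + 1
# T sums the ranks of the pairs with positive differences.
T_obs = ranks[y > 0].sum()

# Under the null, each rank enters the sum independently with probability
# 1/2, so the null distribution of T does not depend on the data at all.
null_T = np.array([
    sum(itertools.compress(range(1, I + 1), include))
    for include in itertools.product([0, 1], repeat=I)
])

print(f"T observed: {T_obs}")
print(f"E(T): formula {I*(I+1)/4}, enumeration {null_T.mean()}")
print(f"Var(T): formula {I*(I+1)*(2*I+1)/24}, enumeration {null_T.var()}")
print(f"one-sided p-value: {np.mean(null_T >= T_obs):.4f}")
```

For real data, scipy.stats.wilcoxon implements a version of this test, with options for handling ties and zeros and a large-sample approximation.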

Just as with $\bar{Y}$, the null distribution of $T$ is symmetric about its mean, and we can compare it to our observed $T$ to derive one-sided and two-sided p-values for the null hypothesis of no treatment effect. Furthermore, with only a slight modification, we can test the null hypothesis of a non-zero additive effect, i.e. we can test $H_0: r_{Tij} = r_{Cij} + \tau_{ij}$ for all $i,j$. In order to do this, we simply define $Y'_i=Y_i-(\tau_{i1}Z_{i1}+\tau_{i2}Z_{i2})$ and use these adjusted differences to calculate the signed rank statistic, ranking the $|Y'_i|$’s and summing up the ranks for which $Y'_i$ is positive.
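As a quick sketch (reusing the hypothetical data from above): for a constant additive effect $\tau$, the adjustment simplifies, since $Z_{i1}+Z_{i2}=1$ implies $Y'_i = Y_i - \tau$, and we rank the shifted differences exactly as before.

```python
import numpy as np

def signed_rank(y):
    """Wilcoxon signed rank statistic; assumes no ties or zero differences."""
    ranks = np.abs(y).argsort().argsort() + 1
    return ranks[y > 0].sum()

y = np.array([1.2, -0.4, 2.1, 0.3, 1.7, -0.9, 0.8, 1.5])

# Testing H_0: r_Tij = r_Cij + tau with constant tau. Since exactly one
# unit per pair is treated, Y'_i = Y_i - tau.
tau = 0.5
print(f"T at tau = {tau}: {signed_rank(y - tau)}")
```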

The Wilcoxon signed rank statistic thus opens up a world of opportunity for robust inference in matched pair studies. As opposed to the paired t-test, a staple of introductory statistics courses, the Wilcoxon signed rank test is non-parametric, meaning that it does not rest on assumptions about the distribution of the data. The t-test requires either that the data (the treated-minus-control differences $Y_i$) be normally distributed or that the number of pairs be large enough that the Central Limit Theorem guarantees the distribution of $\bar{Y}$ is approximately normal. Often, both of these conditions are unrealistic: the underlying distribution of our experimental data is non-normal, and our sample size is small. Personally, this is my experience of most science. We’re in luck, because in just such a case the Wilcoxon signed rank test still allows us to perform valid inference.

What’s more, by assuming that the additive effect $\tau_{ij}=\tau$ is constant across all individuals, we can invert hypothesis tests to form point estimates of a constant treatment effect (choose the $\tau$ that brings the signed rank statistic computed from the adjusted differences closest to its expected value if there really were a constant effect $\tau$, i.e. $I(I+1)/4$), as well as construct $1-\alpha$ confidence intervals for a constant treatment effect (include in the interval every $\tau$ that is not rejected by an $\alpha$-level test), as in the sketch below. The Wilcoxon signed rank statistic can also be used to detect and estimate a constant multiplicative treatment effect, i.e. to test $H_0: \mathbf{r_T}=\beta_0\mathbf{r_C}$ with $\beta_0\geq0$.
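A rough sketch of this inversion, under the constant-effect assumption and with the same hypothetical data. A grid search over candidate $\tau$s stands in for a more careful computation (the point estimate defined this way is the Hodges-Lehmann estimate):

```python
import itertools
import numpy as np

def signed_rank(y):
    """Wilcoxon signed rank statistic; assumes no ties or zero differences."""
    ranks = np.abs(y).argsort().argsort() + 1
    return ranks[y > 0].sum()

y = np.array([1.2, -0.4, 2.1, 0.3, 1.7, -0.9, 0.8, 1.5])
I = len(y)
expected_T = I * (I + 1) / 4

# Exact null distribution of T (data-independent, as noted above).
null_T = np.array([
    sum(itertools.compress(range(1, I + 1), inc))
    for inc in itertools.product([0, 1], repeat=I)
])

def p_two_sided(tau):
    # Two-sided p-value for H_0: constant effect tau, using the symmetry
    # of the null distribution of T about its expectation.
    t = signed_rank(y - tau)
    return np.mean(np.abs(null_T - expected_T) >= abs(t - expected_T))

taus = np.linspace(-2.0, 4.0, 1201)
stats = np.array([signed_rank(y - tau) for tau in taus])

# Point estimate: the tau whose adjusted statistic is closest to E(T).
tau_hat = taus[np.argmin(np.abs(stats - expected_T))]
# 95% confidence interval: all taus not rejected at level 0.05.
kept = taus[np.array([p_two_sided(tau) for tau in taus]) > 0.05]
print(f"point estimate: {tau_hat:.2f}")
print(f"95% CI: [{kept.min():.2f}, {kept.max():.2f}]")
```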

Unfortunately, in forming these (point and interval) estimates of the (additive or multiplicative) treatment effect, we are forced to make the possibly unrealistic assumption that the effect is constant for all individuals. In my next post, I will discuss how we can make a summary statement about the magnitude of the effect without this assumption.