Permutation Tests

Having found a pattern in data, it is natural to wonder whether that pattern is so unusual that it would be unlikely to arise by chance. This lab introduces one approach to this problem--the permutation test. But first, you need some basics.

A statistic and its sampling distribution

All data have noise. They are noisy because of measurement error, because of sampling error, and because of unpredictable factors involved in the phenomena we study. This is why we need statistical methods in the first place. Things that we calculate from data are called statistics. Familiar examples include the mean and variance of a sample. Because they are based on noisy data, statistics are noisy too. In this context, "noise" refers to random variation. Scientists deal with such randomness using probability theory, and you should be familiar with that before reading further. (See Just Enough Probability.)

Scientific hypotheses are often tested by comparing a statistic to its sampling distribution. To get this idea across, we re-visit a simple problem: that of tossing a coin. As you learned in Just Enough Probability, John Kerrich tossed a coin 10,000 times and observed 5070 heads. This number is a statistic. How could we use it to test a hypothesis?

Suppose we suspected that Kerrich's coin was slightly unfair, that the probability (p) of heads was 0.45 on each toss. To test this idea, we would need to know whether the value that Kerrich observed is unusual in similar experiments for which p really does equal 0.45. Suppose that we somehow managed to manufacture such a coin and then used it in 100,000 repetitions of Kerrich's experiment. Suppose in addition that in none of these was the number of heads as large as Kerrich's value, 5070. Then Kerrich's result would clearly rank as unusual, and this might lead us to doubt the original hypothesis that p=0.45. This is the reasoning that usually underlies tests of statistical hypotheses. The difficulty is that usually (as here) these repetitions are impossible to carry out. There is no way to manufacture a coin for which p is exactly 0.45. Even if that were possible, who would volunteer to toss that coin 100,000 X 10,000 = 1 billion times?

Nonetheless, let's pretend that we had really done 100,000 repetitions of Kerrich's experiment under conditions that guarantee that p=0.45. We could calculate the number of heads from each repetition and then summarize these values using a frequency distribution. Since the number of repetitions was so large, this frequency distribution would approximate a special sort of probability distribution called a sampling distribution.
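If you have R handy, you can get a feel for this pretend experiment with a short simulation. The sketch below uses rbinom to draw the number of heads in each of 100,000 repetitions of 10,000 tosses with p=0.45 and then summarizes the results with a histogram; the details of the plot are only a suggestion.

# Simulate 100,000 repetitions of Kerrich's experiment under p = 0.45.
# rbinom returns the number of heads in each repetition of 10,000 tosses.
nheads <- rbinom(100000, size=10000, prob=0.45)

# A frequency distribution of the simulated counts approximates the
# sampling distribution of the number of heads.
hist(nheads, breaks=50, main="Simulated sampling distribution, p = 0.45",
     xlab="number of heads in 10,000 tosses")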

The sampling distribution of a statistic is its probability distribution, but only in a restricted sense: it is the probability distribution implied by a particular hypothesis. Different hypotheses imply different sampling distributions, even for the same statistic. In our example, we assumed that p=0.45. A different hypothesis would imply a different sampling distribution.

In short, a sampling distribution is the probability distribution of a statistic, as implied by a particular hypothesis.

Using sampling distributions to test hypotheses

Let us postpone the question of how one gets a sampling distribution. For the moment, our focus is on using one to test a hypothesis. It is sometimes useful to distinguish between a random variable and the particular values it may take. In this section, we use upper case for the former and lower case for the latter. Thus, X represents a random variable and has a probability distribution, but x is just a number.

[Figure: Distribution Function]

The figure at the right shows two ways to graph the sampling distribution of a hypothetical statistic, X. In the top panel, the vertical axis shows the probability density function, f. Areas under the curve correspond to probabilities in the sampling distribution of X.

This density function shows that values are more likely to fall in the central region than in the shaded tail. If the observed value of our statistic falls in the shaded region, then either (a) we have observed an unusual chance event, or (b) our hypothesis and its implied sampling distribution are incorrect. When this happens, we say that the hypothesis has been rejected.

This does not mean it is wrong, for correct hypotheses may be rejected too. The probability of such an error is called alpha and is equal to the shaded area of the graph. To minimize these errors, we usually choose small values of alpha, such as 0.05 or 0.01. In the figure, I have set alpha to a very large value (0.1) so that the shading is easy to see.

The bottom panel of the figure shows another way to graph the same sampling distribution. The vertical axis shows F(x), the probability of observing a value less than or equal to x. This is called the cumulative distribution function. For example, in the figure's top panel, a fraction 0.1 of the area under the curve lies to the right of x0 and a fraction 0.9 lies to the left. Consequently, the bottom panel shows that F(x0) = 0.9.
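If the figure is not in front of you, the following sketch illustrates the same relationship numerically. It uses a standard normal density as a stand-in for the sampling distribution of X; qnorm and pnorm play the roles of the quantile function and of F.

alpha <- 0.1                 # the shaded tail area in the figure
x0 <- qnorm(1 - alpha)       # the point with a fraction alpha of the area to its right
pnorm(x0)                    # F(x0): the area to the left of x0, here 0.9
1 - pnorm(x0)                # the shaded area to the right of x0, here 0.1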

We explained above how f can be used to test a statistical hypothesis: if the observed value falls within the tail, then we reject. This idea can be re-expressed in terms of F because F and f contain the same information. In terms of F, we reject if F(xObs) is greater than or equal to 1 - alpha, where xObs is the observed value of the statistic, and alpha is the level of significance. This is just another way of saying the same thing, but the formulation in terms of F is often far more convenient. To test a hypothesis, we don't need to know the whole sampling distribution--just a single number, F(xObs). As we shall see, this value is easy to estimate by computer simulation.
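To make the rule concrete, here is a sketch that applies it to the coin example, estimating F(xObs) from simulated repetitions under p=0.45:

# Simulate the number of heads under the hypothesis p = 0.45.
sim.heads <- rbinom(100000, size=10000, prob=0.45)

# Estimate F(xObs) as the fraction of simulated values no larger than the
# observed count, then apply the rejection rule.
x.obs <- 5070
F.obs <- mean(sim.heads <= x.obs)
alpha <- 0.05
F.obs >= 1 - alpha           # TRUE, so the hypothesis p = 0.45 is rejected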

We have just been discussing what is called a one-tailed test, because it involves a single tail of the sampling distribution. In many cases, we would be surprised if our statistic had a value that was either unusually large or unusually small. In such cases, we reject the hypothesis if the test statistic falls either in the tail on the right side of the distribution or in the tail on the left. These are called two-tailed tests. In what follows, we will be concerned mainly with one-tailed tests.

Sampling distributions from computer simulations

Long ago there was only one way to figure out the sampling distribution implied by a hypothesis--it required a sharp pencil and a mathematical turn of mind. That approach doesn't always work, however, even for gifted mathematicians. Fortunately, there is now another way: we can estimate sampling distributions from computer simulations.

The idea is easier to illustrate than to explain, so let us start with an example. Consider the two columns of data below:

 x  y
50 51
52 66
56 60
47 50
52 55
51 50
45 64
47 54
43 47
51 50

The two columns are similar, yet it is easy to see a difference between them in the strip chart to the right. On the other hand, the samples are small, so perhaps the difference is an artifact of sampling. Let us investigate this possibility.

We first need a measure of the difference between the two samples. The medians of x and y are 50.5 and 52.5. Let us take the absolute difference between these two values, D=2, as our measure of the difference between the two samples. To carry out the sort of test described above, we would need to know the sampling distribution of D under the hypothesis that x and y are drawn from the same probability distribution. In classical statistics, the sampling distribution is derived from assumptions about the distributions of x and y. This can lead to errors when such methods are applied to data that violate those assumptions. Here, we'll assume as little as possible about the probability distributions of x and y. We assume only that the two samples were drawn independently and at random.
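If you want to reproduce the strip chart and the observed statistic yourself, here is a sketch using the data above. It uses stripchart from base R; the figure in this lab may have been drawn differently.

x <- c(50, 52, 56, 47, 52, 51, 45, 47, 43, 51)
y <- c(51, 66, 60, 50, 55, 50, 64, 54, 47, 50)

# A strip chart of the two samples, one row per sample.
stripchart(list(x=x, y=y), method="stack")

# The observed statistic: the absolute difference between the medians.
D.obs <- abs(median(x) - median(y))
D.obs                        # 2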

If x and y were drawn from the same distribution, then they are really just arbitrary subsets of a single larger sample. We can form arbitrary subsets of our own like this:

xy <- sample(c(x,y))
x <- xy[1:10]
y <- xy[11:20]

Here, the first step combines x and y into a single vector, scrambles this larger vector, and assigns the result to xy. From this vector of scrambled values, we assign the first 10 entries to x and the remaining 10 to y. These are not the same x and y we started with--each vector contains 10 random observations from the combined data. Yet if our hypothesis is correct--if x and y really were drawn from the same distribution--then the new x and y should have the same distribution (conditional on our particular data set) as the original x and y. This hypothesis is of interest not because we believe it. (Indeed, the strip plot above argues otherwise.) The point is that by rejecting it, we provide evidence for a real difference between x and y. When hypotheses are used in this way, they are called null hypotheses.

Under the null hypothesis, the simulated x and y have the same distributions as the observed versions, and the same is true of D. To experiment with this idea, try the following steps on your own computer:

x <- c(50, 52, 56, 47, 52, 51, 45, 47, 43, 51)
y <- c(51, 66, 60, 50, 55, 50, 64, 54, 47, 50)

for(i in 1:10) {
    xy <- sample(c(x,y))
    x <- xy[1:10]
    y <- xy[11:20]
    Dsim <- abs(median(x) - median(y))
    print(Dsim)
}

Each time you run this code, you will get different output because the sample function returns its argument in a (pseudo) random order. When I ran it, I got the following output:

[1] 0.5
[1] 2.5
[1] 0
[1] 1.5
[1] 0.5
[1] 2
[1] 2
[1] 0
[1] 1
[1] 1

Each of these numbers is a value of D simulated under the null hypothesis. The observed value (D = 2) would not look out of place in this collection. Three of the simulated values are as large as or larger than the observed value. There is nothing here to cast doubt on the null hypothesis. In an informal way, this exercise has tested the null hypothesis.

Let us now re-express this test using the formalism introduced above. To distinguish between the observed and simulated values of D, write them as Dobs and Dsim. In order to test the null hypothesis, we need F(Dobs), the probability (as implied by the null hypothesis) that D is no larger than Dobs (which equals 2).

As you may know, probabilities are estimated by relative frequencies. Because some simulated values tie the observed one, the convention in permutation tests is to count ties as part of the tail: we estimate F(Dobs) as the fraction of simulated values that fall below Dobs, or equivalently one minus the fraction that are at least as large as Dobs. This is true of 7 of our 10 simulated values. Thus, we estimate F(Dobs) as 7/10 = 0.7. Our sample is small, so this may not be a very good estimate. But let us take it at face value for the moment. To test a hypothesis, we must first specify alpha, so let alpha=0.05. With this significance level, we can reject only if F(Dobs) is 0.95 or larger. Our estimate (0.7) is clearly not this large, so we cannot reject the null hypothesis.
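In R, this arithmetic looks like the sketch below, with the ten simulated values copied from the output above:

D.obs <- 2
D.sim <- c(0.5, 2.5, 0, 1.5, 0.5, 2, 2, 0, 1, 1)   # the ten simulated values above

# Estimate F(D.obs) as the fraction of simulated values below D.obs,
# i.e. one minus the fraction at least as large as D.obs.
F.obs <- mean(D.sim < D.obs)
F.obs                        # 0.7
F.obs >= 1 - 0.05            # FALSE, so we cannot reject the null hypothesis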

This conclusion, however, is not convincing, because it is based on only 10 simulations. To get an accurate estimate of F(Dobs), we need a much larger number of simulations. Here is a function called mediantst, which removes the tedium.

mediantst <- function(x, y, nreps=100) {

   # Test for differences between two distributions.  The test
   # statistic is the absolute difference between medians.  Test is
   # based on nreps random permutations.

   # Calculate test statistic.
   d.obs <- abs(median(x) - median(y))

   nx <- length(x)
   ny <- length(y)
   tail.prob <- 0

   # Each pass through this loop randomly permutes the combined
   # sample from both distributions.  In the resulting vector, the
   # 1st nx entries are assigned to x and the 2nd ny to y.  These x
   # and y vectors are then used to calculate the test statistic.  If
   # this simulated statistic is at least as large as the observed
   # one, then we increment the tail probability.
   for(i in 1:nreps) {
       xy <- sample(c(x,y))           # permute combined list
       x <- xy[1:nx]                  # first nx are assigned to x
       y <- xy[seq(nx+1,nx+ny)]       # next ny to y
       d.sim <- abs(median(x) - median(y))
       if(d.sim >= d.obs)             # increment tail prob
           tail.prob <- tail.prob + 1
   }
   tail.prob <- tail.prob / nreps
   return(tail.prob)

}

You can download a copy of this code as mediantst.r. Note that the function does more than simply print out the values of Dsim. It defines a variable called tail.prob to keep track of the number of times that Dsim is greater than or equal to the observed value. Each time this condition is met, the program adds 1 to tail.prob. At the end, we divide by nreps (the number of simulations) so that tail.prob is the fraction of simulations in which Dsim equals or exceeds the observed value.

After you download this code, you can play with it like this:


source("mediantst.r")

x <- c(50, 52, 56, 47, 52, 51, 45, 47, 43, 51)
y <- c(51, 66, 60, 50, 55, 50, 64, 54, 47, 50)

mediantst(x,y, nreps=10)
mediantst(x,y, nreps=100)

Each time you run mediantst, you get an estimate of the tail probability. If that estimate is 0.05 or less, you can reject the hypothesis that x and y were drawn from the same distribution.

The larger you set nreps, the more accurate your estimate of the tail probability will be. Try running the program a few times at each of the following values of nreps: 100, 1000, 10000. Try it just once at 100000. This will give you a sense of how many repetitions you need.
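One quick way to see this is to run mediantst several times at each setting and watch how much the estimates vary. Continuing from the code above (with mediantst sourced and x and y defined), a sketch:

# The spread of the estimates shrinks as nreps grows.
replicate(5, mediantst(x, y, nreps=100))
replicate(5, mediantst(x, y, nreps=1000))
replicate(5, mediantst(x, y, nreps=10000))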

Exercise

The exercise below is based on the crabs data set, which you will find in the MASS package. That data set describes physical measurements on two species of crab, which are indicated in the sp variable by labels 'O' (for orange) and 'B' (for blue). We will focus on variable CL, which measures the length of the carapace.
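If you have not used this data set before, the following will load it and show its variables, assuming the MASS package is installed:

library(MASS)       # the crabs data set lives in the MASS package
str(crabs)          # sp, sex, and CL are among the variables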

Before getting into the exercise, I want to show you a couple of tricks. The first involves using stacked box plots to get a quick look at the species/sex pairs. The trouble is that the formula used by bwplot needs a single factor to the left of the "~", but we have two factors: sp and sex. The solution is to combine these using the "interaction" function:


spsex <- with(crabs, interaction(sp,sex))

Then you can use bwplot with a formula like "spsex~CL". This generates a plot that is OK but not perfect. It would be better if the male and female boxplots for each species were adjacent. To accomplish this, turn spsex into an ordered factor, using the "levels" argument to specify the ordering:


spsex <- ordered(spsex, levels=c("O.M","O.F","B.M","B.F"))

With these adjustments, your stacked boxplots will be easy to interpret.
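Putting the two tricks together, here is one way the plot might be drawn. This sketch assumes the lattice package (which provides bwplot) is loaded, and it passes spsex directly rather than adding it to the data frame:

library(lattice)    # bwplot comes from the lattice package
library(MASS)       # the crabs data

spsex <- with(crabs, interaction(sp, sex))
spsex <- ordered(spsex, levels=c("O.M", "O.F", "B.M", "B.F"))

# spsex is a free-standing vector, so refer to CL the same way rather
# than using a data argument.
bwplot(spsex ~ crabs$CL, xlab="carapace length (CL)")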

In this exercise, all of the discussion comes at the end. In steps 1-5, just note which species/sex pairs seem to differ.

  1. Compare the four species/sex distributions in a stacked boxplot. (Not all possible pairs: just "OM/OF", "BM/BF", "OM/BM", and "OF/BF".) Which pairs seem to differ?

  2. Compare the pairs using quantile-quantile plots. Once again, take note of the pairs that seem to differ.

  3. Use mediantst to test for differences between the medians of each pair. Which pairs differ significantly at the 0.05 significance level?

  4. The mediantst function is not especially sensitive, because it uses only the median. Here is a more sensitive statistic. Choose some set of quantiles, say 0.1, 0.25, 0.5, 0.75, and 0.9. Calculate these quantiles for x and y, subtract those of y from those of x, and sum the absolute values of these differences. Your statistic, in other words, is the sum of absolute differences between quantiles. Modify the mediantst function so that it uses this statistic rather than the difference of medians. Use this modified function to test for differences between the species/sex pairs. (A sketch of this statistic appears after this list.)

  5. The t-test is the classical way to test for differences between the means of small samples. Here is the way to do a t-test comparing the sexes of the orange species:

    with(subset(crabs, sp=="O"), t.test(CL ~ sex))

    The output includes a quantity called "p-value", which is equivalent to the tail probabilities we have been calculating by simulation. The hypothesis of no difference is rejected if the p-value is less than 0.05.

    Use this method to do t-tests of all sex/species pairs.

  6. Make a table with columns labeled "OM/OF", "BM/BF", "OM/BM", and "OF/BF" and with rows labeled "box plots", "qq plots", "median test", "quantile test", and "t test". Within the table, use "1" and "0" to indicate whether that method did or did not detect a difference for that sex/species pair. Then use this table to discuss the sensitivity of the various methods. Do the hypothesis tests find differences that the graphical methods miss, or is it the other way around? How do the various forms of hypothesis tests compare?
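For step 4, here is a sketch of the suggested statistic by itself, calculated on the small x and y samples used earlier in this lab; wiring it into your copy of mediantst is left to you.

x <- c(50, 52, 56, 47, 52, 51, 45, 47, 43, 51)
y <- c(51, 66, 60, 50, 55, 50, 64, 54, 47, 50)

# Sum of absolute differences between corresponding quantiles of x and y.
probs <- c(0.1, 0.25, 0.5, 0.75, 0.9)
d.obs <- sum(abs(quantile(x, probs) - quantile(y, probs)))
d.obs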