The Bootstrap

Introduction

The bootstrap is a method for estimating parameters and constructing confidence intervals. It makes few assumptions about the data and is therefore useful even when conventional statistical methods are inappropriate. In addition, it is remarkably flexible: you can use it to estimate essentially anything.

Let us start with an example. Consider the median of a sample of size N, drawn at random from the standard normal distribution. What is the variance of this statistic under random sampling? We cannot answer this question simply by plugging values into some standard formula, because there is no such formula. In planning an alternative approach, it helps to consider what this variance means. It refers to an imaginary experiment in which we repeat the same steps over and over. In each repetition, we sample N observations and calculate their median. We are interested in the variance of this long series of medians. To get an exact answer, the series would have to be infinite, but we can get a close approximation from a series that is merely large. Because this example involves normally distributed data, it is easy to simulate on a computer:

N <- 200                             # sample size in each replicate
nreps <- 1000                        # number of replicates
medvec <- NULL                       # to hold list of medians
for(i in 1:nreps) {
   y <- rnorm(N)                     # random data set
   medvec <- c(medvec, median(y))    # add median to medvec
}
var(medvec)                          # variance

Each time you run this code, you'll get a slightly different answer, but these answers should all be close to 0.0078. To improve accuracy, increase nreps. (As a rough check, large-sample theory puts the variance of the sample median from a standard normal at about pi/(2N), or 0.00785 when N = 200, in good agreement with the simulation.)

In this code, we produce a large number of data sets, each drawn from the same normal distribution. With each of these, we calculate the median. We use the variance of these estimates to estimate the variance of the median. This method works well when the number of replicates is large, because in that case two distributions become nearly the same: the (unknown) probability distribution of the median, and the empirical distribution of the medians in the simulation.

The trouble is that we often work with data whose distribution is unknown, and which seem poorly modeled by any of the standard distributions of statistics. There is thus no way to carry out the simulation just above. We can, however, carry out a similar calculation, which is called the bootstrap:

N <- 200                             # sample size in each replicate
nreps <- 1000                        # number of bootstrap replicates
x <- rnorm(N)                        # initial data set
medvec <- NULL                       # to hold list of medians
for(i in 1:nreps) {
   y <- sample(x, replace=TRUE)      # bootstrap data set
   medvec <- c(medvec, median(y))    # add median to medvec
}
var(medvec)                          # variance

Here, we start with an initial data set produced with a random number generator. In the real world, we would start with real data. After that, the procedure is just as before, except that within the loop, the line y <- rnorm(N) has been rewritten as y <- sample(x, replace=TRUE). Rather than sampling from the normal distribution, we are sampling (with replacement) from the data itself. The method works because the distribution of values within the data provides an approximation to the underlying probability distribution.

Having calculated a vector of bootstrap replicates, we can also use it to estimate a 90% confidence interval for the median:

quantile(medvec, probs=c(0.05, 0.95))

When I ran this code, the output looked like this:

        5%        95%
-0.1190897  0.1787684

These values estimate the 90% confidence interval of the median.

Because of its flexibility, the bootstrap is extremely seductive. One is tempted to use it all the time. But this flexibility comes at a price: bootstrap estimates are less accurate than those of parametric models. This is because the bootstrap entails two sources of error. First, we use the distribution of values within the data to approximate the probability distribution from which those values were drawn. Second, we use a finite series of bootstrap replicates to approximate an infinite series of samples from that approximating distribution. In addition, there is a small bias in bootstrap confidence intervals. Most of this bias can be removed, and there is an R package that does so. The bias-corrected intervals that result are known as "BCa" intervals. I'll show how to use this software in the example below.

The bootstrap can also be used to place error bounds around any curve that we fit to data. The idea is very simple. We generate a large number of bootstrap replicates of our data, and to each of these we fit a curve, such as a loess curve. To form the upper limit of a 95% confidence region, we consider each point on the horizontal axis, one at a time. At each point, we calculate the 0.975 quantile of the curves for all the bootstrap replicates. These quantiles define the upper confidence limit, and the lower limit is calculated in an analogous fashion. I have written a pair of functions (scatboot and scatboot.plot) that do this job; a sketch of the underlying calculation appears below. You can find these functions illustrated on the class examples page, and we'll also see them in the worked example below.
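
To make the idea concrete, here is a minimal sketch of that pointwise-quantile calculation, written for a hypothetical data frame dat with numeric columns x and y. (The names here are mine; the actual scatboot code may differ in detail.)

nreps <- 200                                     # number of bootstrap replicates
xgrid <- seq(min(dat$x), max(dat$x), length.out=50)
yhat <- matrix(NA, nrow=nreps, ncol=length(xgrid))
for(i in 1:nreps) {
    ndx <- sample(nrow(dat), replace=TRUE)       # resample rows with replacement
    fit <- loess(y ~ x, data=dat[ndx,], degree=1, span=2/3)
    yhat[i,] <- predict(fit, newdata=data.frame(x=xgrid))
}
upper <- apply(yhat, 2, quantile, probs=0.975, na.rm=TRUE)  # upper 95% limit
lower <- apply(yhat, 2, quantile, probs=0.025, na.rm=TRUE)  # lower 95% limit

The na.rm argument matters because loess won't extrapolate: a replicate whose resampled x values don't reach a given grid point contributes an NA there.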

A worked example: evaluating the relationship between two hormones

In the lab project on hormones, I asked you to look for relationships between all the pairs of hormones. It was a frustrating task because the assessments were so subjective. Several pairs of hormones showed what looked like relationships, but it was hard to tell. Were these relationships real or just artifacts of sampling? This exercise will illustrate several ways to answer this question.

We'll focus on two of the hormones--the variables freet and a--shown in the graph to the right. There appears to be a positive relationship here, but there are also several points that violate it. In view of these outliers, we should worry that we are over-interpreting the data.

Let's examine Spearman's correlation between these variables. We start by acquiring the data and cleaning it up:

h <- read.table("hormall.dat", header=TRUE)
af <- subset(h, select=c(freet,a))

This gives us a data set called af, which contains just the two variables of interest. Now we test for a significant association, using Spearman's rank-order correlation coefficient:

cor.test(af[[1]], af[[2]], method="spearman")

This test gave me the following output:

Spearman's rank correlation rho

data:  af[[1]] and af[[2]]
S = 2922.191, p-value = 0.007413
alternative hypothesis: true rho is not equal to 0
sample estimates:
      rho
0.4644078

Warning message:
In cor.test.default(af[[1]], af[[2]], method = "spearman") :
  Cannot compute exact p-values with ties

According to this result, there is a significant association here: the p-value of 0.007 is far below the conventional cutoff of 0.05. On the other hand, the output also warns us that there were problems in computing this p-value. Let's try a permutation test, using this same statistic.

You have not yet seen permutation tests applied to measures of association, so let me explain. It's easy to construct a data set with the same X and Y values but with no association between them: just scramble the order of one of the variables while leaving the other alone. If there were no association between our original X and Y, then these scrambled data sets would be drawn from the same probability distribution as our original. The permutation test of this hypothesis is almost the same as the permutation tests you have already seen. We use a different test statistic--some measure of association--and we permute just one of the two variables. Apart from that, the new permutation test is just like the old ones.

In the test below, our test statistic is Spearman's correlation coefficient. We could however have used any measure of association. Different statistics would yield tests of different power, but none would involve any restrictive assumptions. The only restrictive assumption is that entailed by the permutation procedure itself: that the (X,Y) pairs in our data were drawn independently and at random. Here is the code for a permutation test of association:

robs <- cor(af[[1]], af[[2]], method="spearman")   # observed r
tail.prob <- 0
nreps <- 500
for(i in 1:nreps) {
    y <- sample(af[[2]])   # randomly reorder values of 2nd variable
    rsim <- cor(af[[1]], y, method="spearman")     # simulated r
    if(abs(rsim) >= abs(robs)) {   # abs makes test 2-tailed
        tail.prob <- tail.prob + 1
    }
}
tail.prob <- tail.prob / nreps

When I ran this code, I got a tail probability of 0.004--even lower than the value from cor.test. We can be confident that the association differs significantly from 0, but how strong is it? What values of r are consistent with the data? To find out, let's construct a bootstrap confidence interval for Spearman's statistic. Rather than write our own bootstrap code, we'll use the facilities provided by the boot package to calculate a BCa confidence region. (This corrects the small bias that is otherwise present.) Because we'll be using this package, the code must begin with

library(boot)

To use the boot package, it's necessary to write a function that encapsulates the calculation of Spearman's statistic. Its first argument is the original data; in our case, this will be the data frame af that we defined above. The second argument is a vector containing the indices of the current bootstrap sample. It may contain the same index several times, because individual observations may appear several times in a bootstrap sample. Here's our function:

getcor <- function(x, ndx) {
    return(cor(x[ndx,1], x[ndx,2], method="spearman"))
}
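
As a quick sanity check, calling getcor with the full, unshuffled index vector should simply reproduce the observed correlation:

getcor(af, 1:nrow(af))               # should match rho from cor.test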

Next, we call the boot function to generate the bootstrap replicates. Here, the R argument sets the number of bootstrap replicates, and stype="i" tells boot to pass indices (rather than frequencies or weights) as the second argument of getcor.

result <- boot(af, getcor, R=1000, stype="i")

Once this finishes, we call another function, which processes the bootstrap replicates to produce a BCa confidence interval:

boot.ci(result, type="bca")

BOOTSTRAP CONFIDENCE INTERVAL CALCULATIONS
Based on 1000 bootstrap replicates

CALL :
boot.ci(boot.out = result, type = "bca")

Intervals :
Level       BCa
95%   ( 0.1309,  0.7770 )
Calculations and Intervals on Original Scale

In view of this output, we can be pretty sure not only that r differs from 0 but also that it lies within the interval (0.13, 0.78). Thus, the confidence interval provides more information than we got from the hypothesis tests.

We started on this analysis because the loess curve seemed to indicate a positive relationship. We can now be confident in that inference, but there is still room for skepticism about the detailed pattern. How reliable is the loess curve itself? To find out, we can put error bounds around the curve. First, let's try the default method of panel.smoother:

library(latticeExtra)                # provides layer and panel.smoother
xyplot(a~freet, data=h) +
    layer(panel.smoother(x, y, degree=1, span=2/3))

The result is shown to the right. I can't figure out exactly how this error bound is calculated or what it really means, but I suspect it's based on normal theory. For this reason, I don't trust it when there are outliers such as those in the present data.

Let's do it a different way, using the bootstrap. I've written functions for this purpose, which you can get through the "R Source Code" link on the main course web page. Alternatively, just use the following "source" statement:

source("http://www.anthro.utah.edu/~rogers/datanal/R/scatboot.r")

The next line does the bootstrap replicates and calculates the upper and lower confidence bounds at each point on the horizontal axis.

sb <- with(af, scatboot(freet, a, confidence=0.90, degree=1, span=2/3))

Finally, we call scatboot.plot to plot the result:

scatboot.plot(sb, xlab="freet", ylab="a")

The result is shown to the right. The confidence region is truncated on the right because too few points lie in that region to calculate a 90% confidence region. It is possible, however, to construct a 75% region: try it.
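
For instance, the call is the same except for the confidence argument (the name sb75 is arbitrary):

sb75 <- with(af, scatboot(freet, a, confidence=0.75, degree=1, span=2/3))
scatboot.plot(sb75, xlab="freet", ylab="a")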

Exercise

This exercise will involve the msleep data set, which contains several sleep-related variables scored in a variety of species of mammal. It's provided with the ggplot2 package, but don't use the command library(ggplot2): there is a layer function in ggplot2 which may mask the one in latticeExtra. Use this command instead:

data(msleep, package="ggplot2")

That will put a copy of msleep into your workspace without loading ggplot2 into your search path. You'll also need the quanttst and scatboot functions, which you can load directly from the web server like this:

source("http://www.anthro.utah.edu/~rogers/datanal/R/quanttst.r")
source("http://www.anthro.utah.edu/~rogers/datanal/R/scatboot.r")

The quanttst program does the same calculation that you did in a previous lab project: a permutation test of the hypothesis that two samples are drawn from the same population, using as a test statistic the sum of the absolute differences between quantiles. You are welcome to use your own code if you prefer. If you use mine, the call should look something like this:

tail.prob <- quanttst(x, y, pr=c(0.1, 0.25, 0.5, 0.75, 0.9),
                      nreps=1000)

The pr argument specifies the quantiles to be used, and nreps is the number of random permutations to do.
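
In case it helps, here is a rough sketch of the sort of calculation quanttst performs, assuming two numeric vectors x and y. (This is my reconstruction, not the quanttst code itself, which may differ in detail.)

quantstat <- function(x, y, pr) {    # sum of absolute quantile differences
    sum(abs(quantile(x, probs=pr) - quantile(y, probs=pr)))
}
pr <- c(0.1, 0.25, 0.5, 0.75, 0.9)   # quantiles to compare
sobs <- quantstat(x, y, pr)          # observed statistic
pooled <- c(x, y)                    # pool the two samples
nreps <- 1000
count <- 0
for(i in 1:nreps) {
    z <- sample(pooled)              # random permutation of the pool
    xsim <- z[1:length(x)]           # first chunk plays the role of x
    ysim <- z[-(1:length(x))]        # the rest plays the role of y
    if(quantstat(xsim, ysim, pr) >= sobs)
        count <- count + 1
}
count / nreps                        # tail probability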

  1. In the msleep data, the sleep_total variable measures the total time the animal spends sleeping. The vore variable divides the species into groups, with values such as "omni" (for omnivore) and "herbi" (for herbivore). Use these data to ask whether omnivores and herbivores differ in the time they spend sleeping. Attack this question using (a) a quantile-quantile plot, (b) a t-test, and (c) a permutation test using the quanttst function (or your own code) with f-values 0.1, 0.25, 0.5, 0.75, and 0.9. Based on these results, discuss how sleep_total differs between omnivores and herbivores.

  2. The rest of this exercise involves the relationship between two variables (sleep_rem and sleep_total) among omnivores. REM sleep is the stage of sleep in which dreams happen, so we are asking whether animals that sleep a lot also dream a lot. To get a sense of these data, make a scatter plot, with sleep_rem on the X axis, and draw a loess curve through it. Comment on what you think you see. Is there a relationship here?

  3. Use the cor.test function to test for a significant association between these variables. Try all three methods: pearson, spearman, and kendall. Bear in mind that the pearson method assumes that the observations are drawn from a bivariate normal distribution. Save comments until later.

  4. Do a permutation test of the hypothesis that Pearson's r equals 0. (You don't need to worry about normality here, because the permutation test makes no such assumption.) Save comments until later.

  5. Use the boot function of the boot package to construct a BCa confidence interval for Pearson's r. Save comments until later.

  6. Use scatboot to construct a bootstrapped confidence region for the loess curve with degree=1 and span=2/3. Make two graphs, one for confidence=0.5 and the other for confidence=0.95. Save comments until later.

  7. Write a paragraph summarizing these results. Comment both on the association between the two variables and on the consistency (or inconsistency) of the results.