legofit
Legofit is a system of programs written in C and Python,whichusescountsofnucleotidesitepatternstoinferthe
legofit Documentation

Introduction

Legofit is a computer package that uses counts of nucleotide site patterns to estimate the history of population size, subdivision, and gene flow. The package consists of the following programs

Nucleotide site patterns

These programs all operate on "nucleotide site patterns", which are summary statistics not influenced by recent population size but rich in information about ancestral populations. This section will define site patterns and explain how legofit tabulates their counts, and estimates expectations.

Consider a sample consisting of one haploid genome drawn from each of 3 populations, X, Y, and Z. Suppose that, at a given nucleotide site, the derived allele is present in the genomes from X and Y but not that from Z. If so, then this nucleotide position will be said to exhibit the "*xy* site pattern." We ignore cases in which the derived allele is present in none of the samples or in all of them. By default (without the -1 argument of legosim and legofit), we also ignore "singleton site patterns"—those in which the derived allele appears only once. In other words, we consider only polymorphic, non-singleton site patterns. For the special case of the 3-population sample just described, there are only 3 such site patterns: xy, xz, and yz.

In the general case, with samples from K populations, the number of site patterns (including singletons) is $2^K - 2$. For example, there are 14 site patterns in a sample involving $K=4$ populations. The table below shows data from a sample involving 4 populations, X, Y, N, and D.

SitePat     E[count]
      x   528808.000
      y   518558.875
      n   529267.375
      d   581255.250
    x:y   320273.625
    x:n    43345.375
    x:d    41517.750
    y:n    49486.000
    y:d    40366.375
    n:d   255980.625
  x:y:n   108149.000
  x:y:d    91739.375
  x:n:d    89563.375
  y:n:d    91388.250

The E[count] column shows numbers that can be thought of loosely as counts of site patterns in a genome-wide sample. The last line tells us that the ynd site pattern occurs at over 100,000 nucleotide sites.

These number cannot really be counts, because they aren't integers. This reflects the fact that our sample includes more than one haploid genome from each population, and a given SNP may contribute to several site patterns. The contribution to a given site pattern is the probability that a sub-sample, consisting of one haploid genome drawn at random from the larger sample of each population, would exhibit this site pattern. For example, suppose we have samples from three populations, X, Y, and N, and let $p_{iX}$, $p_{iY}$, and $p_{iN}$ represent the frequencies of the derived allele at the i*th SNP in these three samples. Then site pattern *xy occurs at SNP i with probability $z_i = p_{iX}p_{iY}(1-p_{iN})$ (Patterson et al 2010, Science, 328(5979):S129). Aggregating over SNPs, $I_{xy} = \sum_i z_i$ summarizes the information in the data about this site pattern. These are the numbers that appear in the 2nd column of the table above.

Tables such as the one above are generated by programs tabpat, sitepat, and scrmpat.

Installation and testing

The package is available at github. Before compiling, you must install two libraries: pthreads and gsl. You will need not only the libraries themselves but also several header files, such as pthread.h. I didn't need to install pthreads, because it came bundled with the Gnu C compiler. But the gsl was an extra. Under ubuntu Linux, you can install it like this:

sudo apt-get install libgsl0-dev

On the mac, using homebrew, the command is

brew install gsl

By default, the executable files will be copied into a directory named bin in your home directory. If you want them to go somewhere else, edit the first non-comment line of src/Makefile.

Then

  1. cd into the src directory.
  2. Type "make".
  3. Type "make install".

This will try to place the executables into directory "bin" in the user's home directory. Make sure this directory appears in your PATH, so that the shell can find it.

There is also a facility for installing a "global" version of the code for other users. The first step (under unix-like operating systems) is to set up a "soft link" to the directory just above the target bin directory. Then type "sudo make ginstall". You'll need administrative privileges. For example, to install the software in /usr/local/bin:

cd                     # move into home directory
ln -s /usr/local group # now "group" points to `/usr/local`

Then cd back into the legofit directory and type

sudo make ginstall     # install into /usr/local/bin

This installation will work under unix-like operating systems, such as linux and Apple's osx. I haven't tried to port this software to Windows.

The directory test contains a unit test for many of the .c files in directory src. Within this directory, type

  1. make xboot
  2. ./xboot

to test the source file boot.c. To run all unit tests, type "make". This will take awhile, as some of the unit tests are slow.

Genetic input data

Before doing data analysis with legofit, you must generate data files in one of two formats: "daf" (for derived allele frequency), or "raf" (for reference allele frequency). The first of these alternatives requires input data in which the ancestral allele has been previously called. The second ("raf") does not. Instead, the program sitepat uses an outgroup sequence to call ancestral alleles on the fly.

daf format

This was the only input data format in early versions of the legofit package. We began moving away from it when we realized that, under certain circumstances, it can introduce bias. To understand why, suppose we have data from 4 populations, A, B, C, and D. Suppose further that D has received gene flow from an unobserved population, which is distantly related to the 4 observed populations. This gene flow will introduce derived alleles into D, but at these sites A, B, and C will be fixed for the ancestral allele. If ancestral allele calls are based only on A, B, and C, then no ancestral allele calls will be available for sites at which derived alleles have been introduced into D. The gene flow into D will therefore be undetectable. To avoid such problems, it is best to call ancestral alleles during the process of tabulating site patterns, using the programs raf and sitepat. Nonetheless, we retain daf and tabpat for backwards compatibility.

The suffix "daf" stands for "derived allele frequency. See the @ref daf "daf" command for instructions on translating from "vcf" or "bcf" format into "daf".

The "daf" file is simple and looks like this:

#chr        pos aa da                  daf
   1     752566  g  a 0.835294117647058854
   1     754192  a  g 0.858823529411764652
   1     755225  t  g 0.000000000000000000
   1     755228  t  g 0.000000000000000000
   1     765437  g  a 0.000000000000000000

Columns are separated by one or more whitespace characters (tabs and spaces). The first line (beginning with "#") is an optional comment, which is used here to label the columns. The columns are as follows:

  1. Character strings that label chromosomes or scaffolds.
  2. Position of the SNP on the chromosome or scaffold, measured in base pairs. Daf format doesn't care whether nucleotide positions are numbered beginning with 0 or with 1, provided that they are consistent across files in a given analysis.
  3. Ancestral allele.
  4. Derived allele. Indels and loci with 3 or more alleles should be excluded.
  5. Frequency of the derived allele within the sample.

The lines should be sorted lexically by chromosome. Within chromosomes, they should be sorted in ascending numerical order of column 2. There should be no duplicate (chromosome, position) pairs.

raf format

The suffix "raf" stands for "reference allele frequency". See the raf command for instructions on translating from "vcf" or "bcf" format into "raf".

The "raf" file is simple and looks like this:

#chr  pos    ref  alt raf
1     752566 g    .   0.835294117647058854
1     754192 a    g   0.858823529411764652
1     755225 t    a   0.000000000000000000
1     755228 t    .   0.000000000000000000
1     765437 g    .   0.000000000000000000

Columns are separated by single tab characters. The first line (beginning with "#") is an optional comment, which is used here to label the columns. The columns are as follows:

  1. Character strings that label chromosomes or scaffolds.
  2. Position of the site on the chromosome or scaffold, measured in base pairs. Raf format doesn't care whether nucleotide positions are numbered beginning with 0 or with 1, provided that they are consistent across files in a given analysis.
  3. Reference allele.
  4. Alternate allele. Indels and loci with 3 or more alleles should be excluded.
  5. Frequency of the reference allele within the sample.

The lines should be sorted lexically by chromosome. Within chromosomes, they should be sorted in ascending numerical order of column 2. There should be no duplicate (chromosome, position) pairs.

Describing population history in an lgo file

The ".lgo" format describes the history of population size, subdivision, and gene flow. It also identifies the position within the population network of each genetic sample. It is a plain-text file, which should be constructed with a text editor.

In the paragraphs below, I will introduce a small .lgo file a few lines at a time. The first line in my file is

# Example .lgo file

This is a comment. Comments extend from the sharp character to the end of the line. The next two lines define two convenience variables, whose names are "zero" and "one"

time fixed    zero=0
twoN fixed     one=1

The first is a "time" variable, which I will use for the tips of branches, where time equals 0. I declare it "fixed", which means that it will not change. Variable names must begin with a letter. Subsequent characters may be letters, digits, or underscores. The second variable declaration begins with "twoN", which indicates that this variable refers to haploid population size–twice twice the size of the diploid population. When there is only one sample per population, the sizes of tip populations don't matter, so I set them all equal to "one". Next, three more time variables named "Txyn", "Tn", and "Txy".

time fixed    Tn=1897          # time of Neanderthal admixture
time free     Txyn=25920       # archaic-modern separation time
time free     Txy=3788         # Africa-Eurasia separation time

The comments hint at the meanings of these variables; the values are in generations. The first of these variables is "fixed" (see above). The other two are "free", which means that legofit will estimate their values. Numerical values may optionally be written in exponential notation. For example the last line above could have been written as

time free     Txy=3.788e3      # Africa-Eurasia separation time

Our measure of population size is twice the effective size of the population, and we define two such variables:

twoN free         twoNn=1e3            # archaic population size
twoN constrained twoNxy=1e4 - 1.2*Txy  # early modern population size

The first of these is a free parameter, but the second is a new category: "constrained". It defines "twoNxy" as a function of "Txy". Constraints are useful when analysis of bootstrap samples indicates a tight relationship between two or more free parameters. Constraints reduce the number of free parameters and allow more accurate estimates. In the constraint above, there are only two terms and one independent variable—"Txy". It is legal, however, to use any number of terms and independent variables. For example, we could have written

twoN constrained twoNxy=1e4 - 1.2*Txy + 0.01*Txy*Txyn # OK

It is safest to put spaces around plus and minus signs. For example, the following would fail:

twoN constrained twoNxy=1e4 -1.2*Txy  # Fails

The parser interprets "-1.2" as a number, not as an operator (-) followed by a number (1.2). Lacking an operator, the parser aborts. On the other hand, this works fine:

twoN constrained twoNxy=1e4 -Txy  # Fails

Because "-Txy" cannot be interpreted as a literal number, legofit treats it as an operator (-) followed by a variable name.

In previous versions of Legofit, only free parameters could be used in constraint formulas. In the current version, all types of parameter can be used. Independent variables must be declared in the .lgo file before they are used in a constraint equation. The parser can recognize complex mathematical expressions and knows about the standard mathematical functions. The y'th power of x can be written either as "x^y" or as "pow(x,y)". The natural log can be written either as "log" or as "ln". Parentheses are allowed, and operators have the usual precedence. For example, the following lines are equivalent:

twoN constrained x=exp(a)*pow(b,y)
twoN constrained x=e^a*b^y
twoN constrained x=exp(a + y*log(b))

To spread a constraint across several lines, break the line after a binary operator ("+", "-", "*", or "/"). For example,

twoN constrained twoNxy=1e4 -
  1.2*Txy +
  0.01*Txy*Txyn # Constraint spread across 3 lines.

We have one more variable to declare:

mixFrac free  mN=0.02           # Neanderthal admixture into y

The "mixFrac" command declares a "mixture fraction"—the fraction of a some population that derives from introgression. As above, it could have been fixed or constrained.

The next few lines of the input file declare the segments of the population network. The first of these is

segment x     t=zero   twoN=one    samples=1  # Africa

Here, "x" is the name of the segment, "zero" is the time at which it ends, and "one" is the population size. Note that "zero" and "one" are variables that we declared above. The "samples=1" phrase says that there is a genetic sample from the end of this segment. In other words, the date of the sample is "t=zero". If the segment has no samples, you can omit "samples=0". It is also possible to specify more than one sample, as in "samples=2". If you do this, the program will generate more complex site patterns, whose frequencies will depend on recent population size. In this situation, you would not want to set "twoN=one".

The next two lines are similar, and define two other terminal populations:

segment y     t=zero   twoN=one    samples=1  # Eurasia
segment n     t=Tn     twoN=twoNn  samples=1  # Neanderthal

Segment "n" does not end at time zero, but rather at the time, Tn, of Neanderthal admixture. It has one sample, whose date is also Tn. This is a bit of a stretch, because it assumes that the Neanderthal genome lived at the same time as the episode of admixture. I make this assumption for simplicity—this is only an example. There are 3 more segments to declare:

segment y2    t=Tn     twoN=one               # pre-mig eurasia
segment xy    t=Txy    twoN=twoNxy            # early modern
segment xyn   t=Txyn   twoN=twoNn             # ancestral

These segments don't have a "samples" component, because none of them have genetic samples. Segment y2 represents the Eurasian population before the episode of admixture. Note that it ends at the same time as segment n. This is necessary, because we will want to mix y2 and n below to model gene flow. Also note that the size of xyn equals twoNn—the same variable we used in setting the size of segment n. This establishes a constraint: the sizes of XYN and N will always be equal, no matter how the optimizer adjusts the value of twoNn.

The rest of the .lgo file defines relationships between segments. This involves two statements: "mix" and "derive". Consider the mix statement first:

mix    y  from y2 + mN * n      # y is a mixture of y2 and n

This says that y is a mixture of y2 and n, which must end at the same date. Specifically, a fraction mN of y comes from n and the remaining fraction comes from y2. Finally, we have 4 examples of the "derive" statement:

derive x  from xy               # x is child of xy
derive y2 from xy               # y2 is child of xy
derive xy from xyn              # xy is child of xyn
derive n  from xyn              # n is child of xyn

These statements establish ancestor-descendant relationships between segments. Note that x and y2 both derive from xy, so xy has two "children", but x and y2 each have only one parent. On the other hand, segment y has two parents, as defined in the "mix" statement above.

Segments cannot have more than two parents or more than two children. All segments should descend, eventually, from a single root.

The site patterns printed refer only to the segments that contain samples, and the sort order of site patterns is determined by the order in which segments are listed in the .lgo file. In the file discussed above, there are three segments with samples, and these are in order "x", "y", "n". For this reason, the output will contain a site pattern labeled "x:y:n" rather than, say, "x:n:y".

Using this .lgo file as input, legosim -i 10000 produces

############################################################
# legosim: generate site patterns by coalescent simulation #
############################################################

# Program was compiled: Jun  8 2017 12:41:14
# Program was run: Tue Jul 11 09:18:00 2017

# cmd: legosim -i 10000 input.lgo
# nreps                       : 10000
# input file                  : input.lgo
# not simulating mutations
# excluding singleton site patterns.
#       SitePat E[BranchLength]
            x:y   17493.5768947
            x:n       5.7110115
            y:n     467.3607849

The program reports the mean branch length in generations of three site patterns. For example, "x:y" refers to the pattern in which the derived allele is present in the samples x and y but not in n. The 2nd column gives the expected length in generations of the branch on which mutations would generate the corresponding site pattern.

In the example above, we saw parameters of three types: "twoN", "time", and "mixFrac". There is also a 4th type of variable, which is declared with the "param" command, and which is intended to represent quantities that are neither times, population sizes, nor admixture fractions. Here's an example, which also illustrates the how ranges can be specified for free parameters.

param free [ 0, 2] msum = 0.02  # mN + mD
param free [-1, 1] mdif = 0.0   # mN - mD
mixFrac constrained  mN = 0.5*(msum + mdif)
mixFrac constrained  mD = 0.5*(msum - mdif)

Here msum is the sum of two admixture fractions and mdif is the difference between them. We defined the parameters this way, because mN and mD were tightly correlated, but msum and mdif were nearly uncorrelated. This allowed the optimization algorithm to work with the uncorrelated pair of parameters rather than the correlated pair.

In the declaration of msum, "[0, 2]" specifies the range of this parameter. Ranges are used in initializing the random swarm of points used by the differential evolution algorithm. Initial points are randomly distributed within the range. A range can be specified for any free parameter, not just those defined with the param command. If no range is provided, legofit uses default ranges that differ, depending on the type of parameter. The default range is [1, 1e7] for twoN parameters, [0, 1e7] for time parameters, [0, 1] for mixFrac parameters, and [-DBL_MAX, DBL_MAX] for param parameters.

Model selection

In fitting models to data, it is important to avoid "overfitting". Complex models generally fit the data set better than simple ones do, but this improvement may be illusory. As models increase in complexity, we are eventually fitting the noise in the data rather than the signal. Consequently, the complex model may do a poor job of predicting new observations, which were not used to fit the model in the first place. Methods of model selection aim to penalize overly-complex models, so that the preferred model is the one that is best at predicting new observations.

The legofit package provides two methods of model selection:

Using the package

Predicting site pattern counts from assumptions about history

For this purpose, you want to use the program legosim. The first step is create a file in .lgo format, which describes the history of population size, subdivision, and gene flow. This format is described above. Then, you can execute legosim by typing:

legosim -i 10000 my_input_file.lgo

See the legosim documentation for details.

Simulating site patterns

The preferred approach is to do simulations using scrm, a software package written by Paul R. Staab, Sha Zhu, Dirk Metzler and Gerton Lunter. It does coalescent simulations with linkage and recombination. Our own program scrmpat tabulates site pattern frequencies from scrm output.

For less sophisticated simulations, use the -U option of legosim. This assumes free recombination between nucleotide sites. This is not ideal, because it ignores genetic linkage, but it reads the same input files as legosim, is very fast, and generates output in the form of site pattern frequencies.

Estimating parameters from genetic data

This involves several programs. The first step is to generate input files in ".daf" or ".raf" format, as described above. You will need one .daf or .raf file for each population. For .raf, you will in addition need a .raf file for an outgroup population. See the documentation of daf and raf for details.

The next task is to tabulate site pattern counts. For details, see tabpat or sitepat. Either of these programs will generate a small text file, with one row for each site pattern. If there are 4 populations in the analysis, there will be 10 site patterns.

You will also need a .lgo file, which describes the model of history you wish to explore. It also specifies which parameters will be estimated. Details are above.

Finally, legofit will estimate parameters. This program may take several hours to run, depending on the size of the analysis and the number of simulation replicates used to approximate the objective function.

To generate a bootstrap confidence interval, use the --bootreps option of tabpat or sitepat. This will generate not only the primary output (written to standard output), but also an additional output file for each bootstrap replicate. For example, --bootreps 50 would generate 51 output files: 1 for the normal output, and 1 for each bootstrap replicate.

These files should each be analyzed with a separate run of legofit. If you have access to a compute cluster, these jobs can be run in parallel.

You might be tempted to parallelize these legofit jobs on a single computer, using multiple threads of execution. Don't do it. A single legofit job makes use of all available threads, so nothing is gained by launching simultaneous legofit jobs on a single computer. Unless you have access to a compute cluster, legofit jobs should be run sequentially. As each job can take several hours, a full bootstrap may take several days.

Having run legofit on the real data and all bootstrap replicates, you can use bootci.py and flatfile.py to summarize the information in the resulting output files.

Finally, diverg.py can be used for comparing two sets of site-pattern counts or frequencies. It uses the Kullback-Leibler (KL) divergence to measure the discrepancy between the two distributions. It provides not only the overall KL divergence but also the contribution of each site pattern. Thus, it will tell you which site patterns are responsible for a poor fit between observed and expected site pattern frequencies.

Sort order of site patterns

The sort order of site patterns is determined by the order of command line arguments to tabpat and sitepat, and by the order in which segments are defined in the .lgo file. For example, if "x" precedes "y", then we get a site pattern labeled "xy" rather than "yx". These inputs should be ordered consistently in tabpat, sitepat, and .lgo, for otherwise it will be hard to compare observed site pattern frequencies with those predicted by legosim or legofit. I recommend arranging them in the order that populations labels appear in your figures.

Bias in legofit

Biases arising from constraints

Biases arise in legofit because of the constraint that a child population cannot be older than its parent. If the parent's age is fixed, then it represents an inequality constraint on the age of the child. If the child is only slightly younger than the parent, the optimizer has problems. It will propose new values for the child's age. Those that increase the child's age will be rejected if they violate the inequality constraint. Those that decrease the child's age do not encounter this constraint. Consequently, the optimizer is more likely to move away from the constraint rather than toward it. This biases estimates away from the constraint, whenever the true value is close to the constraint.

The magnitude of this bias depends on the spread of the sampling distribution of the child's age. If that sampling distribution is wide, then the DE algorithm will tend to take large steps and will encounter the boundary constraint sooner. This generates a large bias. If the sampling distribution is narrow, the bias is small.

This bias is exacerbated if the parent's age is modeled as a Gaussian variable, because then the constraint is an interval rather than a point, and the optimizer encounters it sooner. For this reason, I have not used Gaussian variables in recent work.

Bias arising from selection of variant sites

Genetic data are often restricted to sites that are polymorphic within some group of samples. For example, the 1000-Genomes Project distributes files containing genotypes that are polymorphic within modern humans. These data do not include sites at which the derived allele is present only in Neanderthals and/or Denisovans. An analysis using such data would be biased, because it would undercount the corresponding site patterns.

To avoid this bias, it is best to work with complete genomes rather than just the variant sites. Alternatively, one could call variant sites jointly using all genomes to be included in the analysis.