legofit
Legofit is a system of programs written in C and Python,whichusescountsofnucleotidesitepatternstoinferthe

Legofit is a computer package that uses counts of nucleotide site patterns to estimate the history of population size, subdivision, and gene flow. The package consists of the following programs
scrm
coalescent simulator.These programs all operate on "nucleotide site patterns", which are summary statistics not influenced by recent population size but rich in information about ancestral populations. This section will define site patterns and explain how legofit tabulates their counts, and estimates expectations.
Consider a sample consisting of one haploid genome drawn from each of 3 populations, X, Y, and Z. Suppose that, at a given nucleotide site, the derived allele is present in the genomes from X and Y but not that from Z. If so, then this nucleotide position will be said to exhibit the "*xy* site pattern." We ignore cases in which the derived allele is present in none of the samples or in all of them. By default (without the 1
argument of legosim and legofit), we also ignore "singleton site patterns"—those in which the derived allele appears only once. In other words, we consider only polymorphic, nonsingleton site patterns. For the special case of the 3population sample just described, there are only 3 such site patterns: xy, xz, and yz.
In the general case, with samples from K populations, the number of site patterns (including singletons) is . For example, there are 14 site patterns in a sample involving populations. The table below shows data from a sample involving 4 populations, X, Y, N, and D.
SitePat E[count] x 528808.000 y 518558.875 n 529267.375 d 581255.250 x:y 320273.625 x:n 43345.375 x:d 41517.750 y:n 49486.000 y:d 40366.375 n:d 255980.625 x:y:n 108149.000 x:y:d 91739.375 x:n:d 89563.375 y:n:d 91388.250
The E[count]
column shows numbers that can be thought of loosely as counts of site patterns in a genomewide sample. The last line tells us that the ynd site pattern occurs at over 100,000 nucleotide sites.
These number cannot really be counts, because they aren't integers. This reflects the fact that our sample includes more than one haploid genome from each population, and a given SNP may contribute to several site patterns. The contribution to a given site pattern is the probability that a subsample, consisting of one haploid genome drawn at random from the larger sample of each population, would exhibit this site pattern. For example, suppose we have samples from three populations, X, Y, and N, and let , , and represent the frequencies of the derived allele at the i*th SNP in these three samples. Then site pattern *xy occurs at SNP i with probability (Patterson et al 2010, Science, 328(5979):S129). Aggregating over SNPs, summarizes the information in the data about this site pattern. These are the numbers that appear in the 2nd column of the table above.
Tables such as the one above are generated by programs tabpat, sitepat, and scrmpat.
The package is available at github. Before compiling, you must install two libraries: pthreads
and gsl
. You will need not only the libraries themselves but also several header files, such as pthread.h
. I didn't need to install pthreads
, because it came bundled with the Gnu C compiler. But the gsl was an extra. Under ubuntu Linux, you can install it like this:
sudo aptget install libgsl0dev
On the mac, using homebrew, the command is
brew install gsl
By default, the executable files will be copied into a directory named bin
in your home directory. If you want them to go somewhere else, edit the first noncomment line of src/Makefile.
Then
This will try to place the executables into directory "bin" in the user's home directory. Make sure this directory appears in your PATH, so that the shell can find it.
There is also a facility for installing a "global" version of the code for other users. The first step (under unixlike operating systems) is to set up a "soft link" to the directory just above the target bin directory. Then type "sudo make ginstall". You'll need administrative privileges. For example, to install the software in /usr/local/bin
:
cd # move into home directory ln s /usr/local group # now "group" points to `/usr/local`
Then cd
back into the legofit
directory and type
sudo make ginstall # install into /usr/local/bin
This installation will work under unixlike operating systems, such as linux and Apple's osx. I haven't tried to port this software to Windows.
The directory test
contains a unit test for many of the .c files in directory src
. Within this directory, type
to test the source file boot.c
. To run all unit tests, type "make". This will take awhile, as some of the unit tests are slow.
Before doing data analysis with legofit
, you must generate data files in one of two formats: "daf" (for derived allele frequency), or "raf" (for reference allele frequency). The first of these alternatives requires input data in which the ancestral allele has been previously called. The second ("raf") does not. Instead, the program sitepat uses an outgroup sequence to call ancestral alleles on the fly.
This was the only input data format in early versions of the legofit package. We began moving away from it when we realized that, under certain circumstances, it can introduce bias. To understand why, suppose we have data from 4 populations, A, B, C, and D. Suppose further that D has received gene flow from an unobserved population, which is distantly related to the 4 observed populations. This gene flow will introduce derived alleles into D, but at these sites A, B, and C will be fixed for the ancestral allele. If ancestral allele calls are based only on A, B, and C, then no ancestral allele calls will be available for sites at which derived alleles have been introduced into D. The gene flow into D will therefore be undetectable. To avoid such problems, it is best to call ancestral alleles during the process of tabulating site patterns, using the programs raf and sitepat. Nonetheless, we retain daf and tabpat for backwards compatibility.
The suffix "daf" stands for "derived allele frequency. See the @ref daf "daf" command for instructions on translating from "vcf" or "bcf" format into "daf".
The "daf" file is simple and looks like this:
#chr pos aa da daf 1 752566 g a 0.835294117647058854 1 754192 a g 0.858823529411764652 1 755225 t g 0.000000000000000000 1 755228 t g 0.000000000000000000 1 765437 g a 0.000000000000000000
The first line (beginning with "#") is an optional comment, which is used here to label the columns. The columns are as follows:
The lines should be sorted lexically by chromosome. Within chromosomes, they should be sorted in ascending numerical order of column 2. There should be no duplicate (chromosome, position) pairs.
The suffix "raf" stands for "reference allele frequency". See the raf command for instructions on translating from "vcf" or "bcf" format into "raf".
The "raf" file is simple and looks like this:
#chr pos ref alt raf 1 752566 g . 0.835294117647058854 1 754192 a g 0.858823529411764652 1 755225 t a 0.000000000000000000 1 755228 t . 0.000000000000000000 1 765437 g . 0.000000000000000000
The columns are separated by tabs. The first line (beginning with "#") is an optional comment, which is used here to label the columns. The columns are as follows:
The lines should be sorted lexically by chromosome. Within chromosomes, they should be sorted in ascending numerical order of column 2. There should be no duplicate (chromosome, position) pairs.
The ".lgo" format describes the history of population size, subdivision, and gene flow. It also identifies the position within the population network of each genetic sample. It is a plaintext file, which should be constructed with a text editor.
In the paragraphs below, I will introduce a small .lgo file a few lines at a time. The first line in my file is
# Example .lgo file
This is a comment. Comments extend from the sharp character to the end of the line. The next two lines define two convenience variables, whose names are "zero" and "one"
time fixed zero=0 twoN fixed one=1
The first is a "time" variable, which I will use for the tips of branches, where time equals 0. I declare it "fixed", which means that it will not change. Variable names must begin with a letter. Subsequent characters may be letters, digits, or underscores. The second variable declaration begins with "twoN", which indicates that this variable refers to haploid population size–twice twice the size of the diploid population. When there is only one sample per population, the sizes of tip populations don't matter, so I set them all equal to "one". Next, three more time variables named "Txyn", "Tn", and "Txy".
time fixed Tn=1897 # time of Neanderthal admixture time free Txyn=25920 # archaicmodern separation time time free Txy=3788 # AfricaEurasia separation time
The comments hint at the meanings of these variables; the values are in generations. The first of these variables is "fixed" (see above). The other two are "free", which means that legofit
will estimate their values. Numerical values may optionally be written in exponential notation. For example the last line above could have been written as
time free Txy=3.788e3 # AfricaEurasia separation time
It is also possible to define variables as "gaussian". For example,
time gaussian x=123 sd=10
Gaussian variables are used to represent exogeneous parameters whose values are known only approximately. They are modeled as Gaussian random variables. In this case, the mean is 123 and the standard deviation is 10. Programs legosim
and legofit
use MonteCarlo integration to integrate across the uncertainty in Gaussian parameters. Although the package still supports Gaussian parameters, I don't recommend them, for reasons discussed below. The Gaussian variable "x" just defined will be ignored in what follows.
Our measure of population size is twice the effective size of the population, and we define two such variables:
twoN free twoNn=1e3 # archaic population size twoN constrained twoNxy=1e4  1.2*Txy # early modern population size
The first of these is a free parameter, but the second is a new category: "constrained". It defines "twoNxy" as a function of "Txy". Constraints are useful when analysis of bootstrap samples indicates a tight relationship between two or more free parameters. Constraints reduce the number of free parameters and allow more accurate estimates. In the constraint above, there are only two terms and one independent variable—"Txy". It is legal, however, to use any number of terms and independent variables. For example, we could have written
twoN constrained twoNxy=1e4  1.2*Txy + 0.01*Txy*Txyn # OK
All independent variables must be free parameters, and all must be defined before the constraint. The parser can recognize complex mathematical expressions and knows about the standard mathematical functions. The y'th power of x can be written either as "x^y" or as "pow(x,y)". The natural log can be written either as "log" or as "ln". Parentheses are allowed, and operators have the usual precedence. For example, the following lines are equivalent:
twoN constrained x=exp(a)*pow(b,y) twoN constrained x=e^a*b^y twoN constrained x=exp(a + y*log(b))
To spread a constraint across several lines, break the line after a "+" symbol. For example,
twoN constrained twoNxy=1e4 + 1.2*Txy + 0.01*Txy*Txyn # Constraint spread across 3 lines.
We have one more variable to declare:
mixFrac free mN=0.02 # Neanderthal admixture into y
The "mixFrac" command declares a "mixture fraction"—the fraction of a some population that derives from introgression. As above, it could have been fixed, Gaussian, or constrained.
The next few lines of the input file declare the segments of the population network. The first of these is
segment x t=zero twoN=one samples=1 # Africa
Here, "x" is the name of the segment, "zero" is the time at which it ends, and "one" is the population size. Note that "zero" and "one" are variables that we declared above. The "samples=1" phrase says that there is a genetic sample from the end of this segment. In other words, the date of the sample is "t=zero". If the segment has no samples, you can omit "samples=0". It is also possible to specify more than one sample, as in "samples=2". If you do this, the program will generate more complex site patterns, whose frequencies will depend on recent population size. In this situation, you would not want to set "twoN=one".
The next two lines are similar, and define two other terminal populations:
segment y t=zero twoN=one samples=1 # Eurasia segment n t=Tn twoN=twoNn samples=1 # Neanderthal
Segment "n" does not end at time zero, but rather at the time, Tn, of Neanderthal admixture. It has one sample, whose date is also Tn. This is a bit of a stretch, because it assumes that the Neanderthal genome lived at the same time as the episode of admixture. I make this assumption for simplicity—this is only an example. There are 3 more segments to declare:
segment y2 t=Tn twoN=one # premig eurasia segment xy t=Txy twoN=twoNxy # early modern segment xyn t=Txyn twoN=twoNn # ancestral
These segments don't have a "samples" component, because none of them have genetic samples. Segment y2 represents the Eurasian population before the episode of admixture. Note that it ends at the same time as segment n. This is necessary, because we will want to mix y2 and n below to model gene flow. Also note that the size of xyn equals twoNn—the same variable we used in setting the size of segment n. This establishes a constraint: the sizes of XYN and N will always be equal, no matter how the optimizer adjusts the value of twoNn.
The rest of the .lgo file defines relationships between segments. This involves two statements: "mix" and "derive". Consider the mix statement first:
mix y from y2 + mN * n # y is a mixture of y2 and n
This says that y is a mixture of y2 and n, which must end at the same date. Specifically, a fraction mN of y comes from n and the remaining fraction comes from y2. Finally, we have 4 examples of the "derive" statement:
derive x from xy # x is child of xy derive y2 from xy # y2 is child of xy derive xy from xyn # xy is child of xyn derive n from xyn # n is child of xyn
These statements establish ancestordescendant relationships between segments. Note that x and y2 both derive from xy, so xy has two "children", but x and y2 each have only one parent. On the other hand, segment y has two parents, as defined in the "mix" statement above.
Segments cannot have more than two parents or more than two children. All segments should descend, eventually, from a single root.
The site patterns printed refer only to the segments that contain samples, and the sort order of site patterns is determined by the order in which segments are listed in the .lgo file. In the file discussed above, there are three segments with samples, and these are in order "x", "y", "n". For this reason, the output will contain a site pattern labeled "x:y:n" rather than, say, "x:n:y".
Using this .lgo file as input, legosim i 10000
produces
############################################################ # legosim: generate site patterns by coalescent simulation # ############################################################ # Program was compiled: Jun 8 2017 12:41:14 # Program was run: Tue Jul 11 09:18:00 2017 # cmd: legosim i 10000 input.lgo # nreps : 10000 # input file : input.lgo # not simulating mutations # excluding singleton site patterns. # SitePat E[BranchLength] x:y 17493.5768947 x:n 5.7110115 y:n 467.3607849
The program reports the mean branch length in generations of three site patterns. For example, "x:y" refers to the pattern in which the derived allele is present in the samples x and y but not in n. The 2nd column gives the expected length in generations of the branch on which mutations would generate the corresponding site pattern.
In fitting models to data, it is important to avoid "overfitting". Complex models generally fit the data set better than simple ones do, but this improvement may be illusory. As models increase in complexity, we are eventually fitting the noise in the data rather than the signal. Consequently, the complex model may do a poor job of predicting new observations, which were not used to fit the model in the first place. Methods of model selection aim to penalize overlycomplex models, so that the preferred model is the one that is best at predicting new observations.
The legofit package provides two methods of model selection:
For this purpose, you want to use the program legosim. The first step is create a file in .lgo format, which describes the history of population size, subdivision, and gene flow. This format is described above. Then, you can execute legosim
by typing:
legosim i 10000 my_input_file.lgo
See the legosim documentation for details.
The preferred approach is to do simulations using scrm, a software package written by Paul R. Staab, Sha Zhu, Dirk Metzler and Gerton Lunter. It does coalescent simulations with linkage and recombination. Our own program scrmpat tabulates site pattern frequencies from scrm
output.
For less sophisticated simulations, use the U
option of legosim. This assumes free recombination between nucleotide sites. This is not ideal, because it ignores genetic linkage, but it reads the same input files as legosim
, is very fast, and generates output in the form of site pattern frequencies.
This involves several programs. The first step is to generate input files in ".daf" or ".raf" format, as described above. You will need one .daf or .raf file for each population. For .raf, you will in addition need a .raf file for an outgroup population. See the documentation of daf and raf for details.
The next task is to tabulate site pattern counts. For details, see tabpat or sitepat. Either of these programs will generate a small text file, with one row for each site pattern. If there are 4 populations in the analysis, there will be 10 site patterns.
You will also need a .lgo file, which describes the model of history you wish to explore. It also specifies which parameters will be estimated. Details are above.
Finally, legofit will estimate parameters. This program may take several hours to run, depending on the size of the analysis and the number of simulation replicates used to approximate the objective function.
To generate a bootstrap confidence interval, use the bootreps
option of tabpat or sitepat. This will generate not only the primary output (written to standard output), but also an additional output file for each bootstrap replicate. For example, bootreps 50
would generate 51 output files: 1 for the normal output, and 1 for each bootstrap replicate.
These files should each be analyzed with a separate run of legofit
. If you have access to a compute cluster, these jobs can be run in parallel.
You might be tempted to parallelize these legofit jobs on a single computer, using multiple threads of execution. Don't do it. A single legofit job makes use of all available threads, so nothing is gained by launching simultaneous legofit jobs on a single computer. Unless you have access to a compute cluster, legofit jobs should be run sequentially. As each job can take several hours, a full bootstrap may take several days.
Having run legofit
on the real data and all bootstrap replicates, you can use bootci.py and flatfile.py to summarize the information in the resulting output files.
Finally, diverg.py can be used for comparing two sets of sitepattern counts or frequencies. It uses the KullbackLeibler (KL) divergence to measure the discrepancy between the two distributions. It provides not only the overall KL divergence but also the contribution of each site pattern. Thus, it will tell you which site patterns are responsible for a poor fit between observed and expected site pattern frequencies.
The sort order of site patterns is determined by the order of command line arguments to tabpat and sitepat, and by the order in which segments are defined in the .lgo file. For example, if "x" precedes "y", then we get a site pattern labeled "xy" rather than "yx". These inputs should be ordered consistently in tabpat, sitepat, and .lgo, for otherwise it will be hard to compare observed site pattern frequencies with those predicted by legosim or legofit. I recommend arranging them in the order that populations labels appear in your figures.
legofit
Biases arise in legofit
because of the constraint that a child population cannot be older than its parent. If the parent's age is fixed, then it represents an inequality constraint on the age of the child. If the child is only slightly younger than the parent, the optimizer has problems. It will propose new values for the child's age. Those that increase the child's age will be rejected if they violate the inequality constraint. Those that decrease the child's age do not encounter this constraint. Consequently, the optimizer is more likely to move away from the constraint rather than toward it. This biases estimates away from the constraint, whenever the true value is close to the constraint.
The magnitude of this bias depends on the spread of the sampling distribution of the child's age. If that sampling distribution is wide, then the DE algorithm will tend to take large steps and will encounter the boundary constraint sooner. This generates a large bias. If the sampling distribution is narrow, the bias is small.
This bias is exacerbated if the parent's age is modeled as a Gaussian variable, because then the constraint is an interval rather than a point, and the optimizer encounters it sooner. For this reason, I have not used Gaussian variables in recent work.
Genetic data are often restricted to sites that are polymorphic within some group of samples. For example, the 1000Genomes Project distributes files containing genotypes that are polymorphic within modern humans. These data do not include sites at which the derived allele is present only in Neanderthals and/or Denisovans. An analysis using such data would be biased, because it would undercount the corresponding site patterns.
To avoid this bias, it is best to work with complete genomes rather than just the variant sites. Alternatively, one could call variant sites jointly using all genomes to be included in the analysis.