Installing and Using SMC++

All the commands below should be run from the Unix or Linux shell. On a Mac this is available through the Terminal application. You'll need various pieces of software, which you may not yet have. On a Mac, these are available through Homebrew, which you should install. On Linux, they are available through apt-get. On Windows, install the Windows Subsystem for Linux. The instructions below are oriented toward MacOS and Homebrew.

Get a data set

We'll use the wget program, so first install that using homebrew:

brew install wget

Then get the main data file from the 1000-genomes ftp site:

wget ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/release/20190312_biallelic_SNV_and_INDEL/ALL.chr22.shapeit2_integrated_snvindels_v2a_27022019.GRCh38.phased.vcf.gz

We'll also need the 1000-genomes mask file, which specifies the genomic positions that are accessible in 1000-genomes data. This file includes all chromosomes, so we'll have to restrict to chr22 later.

wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/working/20160622_genome_mask_GRCh38/StrictMask/20160622.allChr.mask.bed

Get the portion of this mask that refers to chromosome 22.

grep ^chr22 20160622.allChr.mask.bed > 20160622.chr22.mask.bed

Extract samples for the population of interest

I used zless -S to examine the .vcf.gz file and determined that the header line, which labels the columns of data, begins with "#CHROM". I also determined that the labels of samples begin with the 10th field of this header. The first sample is HG00096, and the last is NA21144. We can use those two labels to make sure we've got all the sample names. To get a list of samples in the entire data set, do this:

gzcat ALL.chr22*.vcf.gz | head -n 100 | grep "#CHROM" |
tr "\t" "\n" | tail +10 > samples.all

(On Linux, gzcat is called zcat.) Checking this file, I find that it begins with HG00096 and ends with NA21144, so it's clear that we've got all the samples.

Now I need a list of the samples that go with the particular population I'm interested in. I'm going to work with population CHB: Han Chinese from Beijing. Browse to https://www.internationalgenome.org/data/ and follow the link to data portal. Under "Filter by data collection", choose "1000 Genomes on GRCh38". Under "Filter by population", choose whatever population you want to study. Then choose "Download the list". This creates a file called igsr_samples.tsv in your Downloads directory, which I moved into my smcpp directory. In that file, every entry in column 4 says "CHB", which tells me that I did indeed get that data for Han Chinese from Beijing.

I only need the first column from this file, so I get rid of the rest like this:

cut -f 1 igsr_samples.tsv > samples.chb

Remove CHB samples that are not present in samples.all. We do this using the unix join command, and this requires that both files be sorted:

sort samples.chb > samples-sort.chb
sort samples.all > samples-sort.all
join samples-sort.chb samples-sort.all > samples.chb

We end up with a file called samples.chb, which has 106 entries rather than the original 109.

Next, I selected a subset of these 106 samples for analysis. If I were working on a server at the Center for High-performance Computing, I'd probably try to analyze the full data set. But many of us are doing this on laptops, so let's keep the project small. To do this, we can use the linux shuf (short for shuffle) command, which is called gshuf on macos. It's in the coreutils homebrew package, so first

brew install coreutils

Then

gshuf -n 25 samples.chb | sort > samples25.chb

This generates a file called samples25.chb, which contains 25 samples from the CHB population, all of which are present in the .vcf.gz file.

Use bcftools to subset the data

Install bcftools, unless you already have it:

brew install bcftools

Before making the new file, let's examine it to make sure it looks okay:

bcftools view --samples-file samples25.chb ALL.chr22*.vcf.gz | zless -S

This will pipe the first few lines of the new file into the zless command, where you can look at it. It should look like the original .vcf.gz file, but with only 25 samples. If it looks okay, then make the subsetted data file like this:

bcftools view --samples-file samples25.chb ALL.chr22*.vcf.gz \
--output-type z --output 1kgenomes-chb-chr22.vcf.gz

Here, "view" says that we want a new .vcf file, "--samples-file" says that the next argument is a file lists the samples we want to include, "--output-type z" says the output should be compressed, and the remainder specifies the name of the output file.

Smc++ requires that our data file have an index, so let's create one using the tabix command. This should already be on your computer, because you have already installed bcftools. Here's the command:

tabix 1kgenomes-chb-chr22.vcf.gz

Use bedtools to complement the mask we downloaded as a bed file

Above, we downloaded a file called 20160622.allChr.mask.bed, which is in bed format), and describes the regions of the human genome that are covered by 1000-genomes data. Below, we will need a bed file describing the regions not covered. To convert the one to the other, we'll use bedtools. If you don't have that on your computer, the first step is to download it:

brew install bedtools

We'll need to run bedtools complement, which you can find out about by typing

bedtools complement --help

Note that the -g argument specifies a file that describes the size of the chromosome. In our case, this file will contain only one line:

chr22 <chromosome size>

where the two fields are separated by a tab, and <chromosome size> is the size of chromosome 22 in genome assembly GRCh38. You can find that size here. It equals 50818468. Thus, you'll need to create a file called chr22.genome, which contains just the following line:

chr22 50818468

and where the two fields are separated by a tab.

Now we can use bedtools to construct the mask we need:

bedtools complement -L -i 20160622.chr22.mask.bed \
-g chr22.genome > tmp.bed

This creates a file called "tmp.bed", which specifies all the sites on chromosome 22 that are not present in the 1000-genomes data set. The -L option tells bedtools to restrict its output to chromosome 22.

Unfortunately, the chromosome labels in this file are not compatible with those in the .vcf file. In the .bed file, the chromosome label is "chr22". In the .vcf file it is just "22". So we need to rewrite "tmp.bed". Here's how

awk '{printf("22\t%s\t%s\n", $2, $3)}' tmp.bed > inaccessible-mask.bed

This creates "inaccessible-mask.bed", and now the chromosome labels are consistent.

smc++ will want this bed file to be compressed in bgzip format:

bgzip inaccessible-mask.bed

This generates file inaccessible-mask.bed.gz. We also need an index:

tabix inaccessible-mask.bed.gz

Install Docker and SMC++

Docker can be downloaded from from their website. Once docker is on your system, type

docker run --rm -v $PWD:/mnt terhorst/smcpp:latest

This generates an error message, which you can ignore.

Generate the smc++ input file

This file is created from the .vcf.gz file that we created above. The first step is to construct the smc++ command line. Mine looks like this:

docker run --rm -v $PWD:/mnt terhorst/smcpp:latest \
vcf2smc \
--mask inaccessible-mask.bed.gz \
--length 50818468 \
1kgenomes-chb-chr22.vcf.gz chr22.smc.gz 22 \
CHB:$(cat samples25.chb | xargs echo | tr " " ",")

Bash will treat this as a single command, because the "newline" characters that terminate lines have been "escaped" by putting a backslash ("\") at the end of all lines except the last. There should be no spaces or tabs following "\". The 1st line above has all the docker stuff. The 2nd (vcf2smc) specifies the program within smc++ that we want to execute. This one converts from vcf format to smc format. The 3rd line specifies the mask. The 4th gives the length of the chromosome. (This isn't necessary if that length is available in the ##contig line of the .vcf.gz file.) The 5th gives the input and output files and the chromosome label. The 6th line should have format

<population label>:sample1,sample2,sample3

In my case, "CHB" is the population label, and the sample names are stored in file samples25.chb. The stuff within $(...) is a shell command that generates a comma-separated list of sample names. The shell will append this to the command line just after "CHB:".

Before running the docker command at the shell command line, click on the docker icon, which looks like a whale. This will start the docker daemon, and then you can run the command above--the one that begins "docker run ..."--from the bash command line.

Estimation

Create a directory for the output:

mkdir output

Then run

docker run --rm -v $PWD:/mnt terhorst/smcpp:latest estimate \
-o output 1e-8 chr22.smc.gz

Here, 1e-8 is the mutation rate. In real research, we'd look for the latest and best estimate of this rate, and we'd justify that choice in the publication.

Plot

Run

docker run --rm -v $PWD:/mnt terhorst/smcpp:latest plot \
output/plot.png output/model.final.json

You'll find a new file in the output directory called plot.png. Mine looks like this: