ClonalOrigin install and usage
Introduction
Bacteria, unlike us, can reproduce on their own. They do, however, have mechanisms that transfer DNA between organisms, a process more formally known as recombination. The mechanisms by which recombination takes place have been studied extensively in the laboratory, but much remains to be understood concerning how, when and where recombination takes place within natural populations of bacteria and how it helps them to adapt to new environments. ClonalOrigin performs a comparative analysis of the sequences of a sample of bacterial genomes in order to reconstruct the recombination events that have taken place in their ancestry.
ClonalOrigin is described in the following paper:
Didelot X, Lawson D, Darling A, Falush D (2010) Inference of homologous recombination in bacteria using whole genome sequences. Genetics 186 (4), 1435-1449 doi:10.1534/genetics.110.120121 http://www.genetics.org/cgi/content/abstract/genetics.110.120121v1
Install
Xavier Didelot edited this page on 12 Jun 2015
Step 0. Obtain and install prerequisite software
- GNU Scientific Library
- Optional: Qt4 development libraries, needed for graphical interface
Step 1. Check out the ClonalOrigin source code
git clone https://github.com/xavierdidelot/ClonalOrigin
Step 2. Configure the ClonalOrigin build
We'll assume you want to install to your home directory.
cd ClonalOrigin/warg
./autogen.sh
./configure --prefix=$HOME
Or if you prefer the qmake manager:
cd ClonalOrigin/warg
qmake
Step 3. Build and install ClonalOrigin
make
make install
Optional Step 4. Build and install the graphical interface
cd ClonalOrigin/gui
qmake
make
and you should get an executable file called gui (Linux), gui.exe (Windows) or gui.app (MacOS).
Using ClonalOrigin
ClonalOrigin has now been installed to the directory $HOME/bin, as a program called 'warg'. The name is shorthand for 'weak Ancestral Recombination Graph', a diminutive term for the probabilistic graphical model on which ClonalOrigin does inference. If $HOME/bin is not already part of your PATH environment variable, now would be a good time to add it. Once done, the ClonalOrigin software can be run simply by typing 'warg'. Note that if the GNU Scientific Library was installed to a non-standard location, it may be necessary to add the path to the directory containing libgsl to the environment variable LD_LIBRARY_PATH (on Linux) or DYLD_LIBRARY_PATH (on Mac OS X).
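As a minimal sketch of the environment setup just described, the lines below could be added to a shell startup file such as ~/.bashrc. This assumes a Bourne-style shell and that libgsl lives under $HOME/lib; both are assumptions to adjust for your system. The helper name prepend_dir is hypothetical and is not part of ClonalOrigin.

```shell
# prepend_dir VALUE DIR: print VALUE with DIR prepended, unless DIR is already listed.
# (prepend_dir is a hypothetical helper name used only for this sketch.)
prepend_dir() {
    case ":$1:" in
        *":$2:"*) printf '%s\n' "$1" ;;      # DIR already present: leave VALUE unchanged
        "::")     printf '%s\n' "$2" ;;      # VALUE is empty: result is just DIR
        *)        printf '%s\n' "$2:$1" ;;   # otherwise put DIR in front
    esac
}

# Make 'warg' findable (install prefix $HOME, as in the configure step).
PATH=$(prepend_dir "$PATH" "$HOME/bin")
export PATH

# Only needed if libgsl is in a non-standard location (assumed here: $HOME/lib).
# On Mac OS X, set DYLD_LIBRARY_PATH instead.
LD_LIBRARY_PATH=$(prepend_dir "${LD_LIBRARY_PATH:-}" "$HOME/lib")
export LD_LIBRARY_PATH
```

Checking for an existing entry before prepending keeps the variables from growing each time the startup file is sourced.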
Usage
Xavier Didelot edited this page on 11 Jun 2015
Instructions for how to download and install ClonalOrigin are available at: https://github.com/xavierdidelot/ClonalOrigin/wiki/Install
Instructions for how to use ClonalOrigin once installed are available at: https://github.com/xavierdidelot/ClonalOrigin/wiki/Usage
Estimating strength of bias in the recombination process
We define biased recombination in contrast to free recombination, where all individuals in the population are equally likely to recombine. There are many factors contributing to recombination being biased rather than free. Laboratory experiments have shown that the recombination process is homology dependent, in that it tends to happen more often between individuals that are less diverged. Furthermore, the geographical and ecological structures observed in many bacterial populations imply a greater opportunity of recombination for pairs of cells that are closely related. Purifying selection may also effectively prevent recombination between distantly related bacteria. All these effects would clearly be hard to disentangle, and here we group them all under the single concept of biased recombination. The strength of this bias is an important factor to take into account in order to understand recombination in bacteria. In particular, it determines how often recombination happens within the diversity of the population under study rather than from other sources.
We have introduced a model for biased recombination which is based on the ClonalOrigin model. We use approximate Bayesian computation and whole genome data to infer the rate of bias in the recombination process in bacteria. The user guide and Matlab code can be downloaded from: http://www.stats.ox.ac.uk/~ansari/BiasedRecV1.tgz
Full details of the biased recombination model have been published in the following paper: Ansari MA, Didelot X. Inference of the Properties of the Recombination Process from Whole Bacterial Genomes. Genetics. 2014;196: 253–265. doi:10.1534/genetics.113.157172 http://www.genetics.org/content/early/2013/10/21/genetics.113.157172
Introduction
This example will demonstrate how to go from unaligned genome assemblies to aligned genomes with recombination maps ready for further analysis and summary.
Prerequisites
Throughout this example we will assume a familiarity and level of comfort with command-line software. If that doesn't sound like you, it may be worth tapping your local nerd for assistance. Buy them a large coffee, and expect the whole process to take a couple of weeks of compute time.
Required data:
- Genome sequences in FastA or GenBank format. For this example we'll assume 4 genomes, in files called genome1.gbk, genome2.gbk, genome3.gbk, and genome4.gbk
Required software:
- Mauve (including progressiveMauve)
- ClonalFrame
- ClonalOrigin
The above programs need to be downloaded, uncompressed, made executable, and installed to a directory in the binary PATH for your system.
The Analysis
Genome alignment
Starting in the directory where the data resides:
progressiveMauve --output=full_alignment.xmfa genome1.gbk genome2.gbk genome3.gbk genome4.gbk
stripSubsetLCBs full_alignment.xmfa full_alignment.xmfa.bbcols core_alignment.xmfa 500
The first command constructs a multiple genome alignment of the four genomes. The second command strips out variable regions from the alignment to leave only core alignment blocks longer than 500 nt.
Infer clonal genealogy
ClonalFrame -x 10000 -y 10000 -z 10 core_alignment.xmfa core_clonalframe.out.1 > cf_stdout.1 &
ClonalFrame -x 10000 -y 10000 -z 10 core_alignment.xmfa core_clonalframe.out.2 > cf_stdout.2 &
ClonalFrame -x 10000 -y 10000 -z 10 core_alignment.xmfa core_clonalframe.out.3 > cf_stdout.3 &
The above commands start three separate ClonalFrame runs in parallel on the core genome alignment in order to infer the clonal genealogy. The output from these three runs needs to be compared to ensure that each of the runs produced approximately the same tree. If not, the MCMC chain needs to run longer, and the runs should be started again with a higher value for the -y parameter.
Convert clonal genealogy to ClonalOrigin format
Check the output from the three ClonalFrame runs to ensure they have consistent topology and the consensus trees are fully resolved. This can be checked using the ClonalFrame GUI.
Assuming the ClonalFrame run produced a fully resolved consensus tree, that tree can be extracted from the ClonalFrame output with a command like the following:
getClonalTree core_clonalframe.out.1 clonaltree.nwk
Split alignment into one file per block
blocksplit.pl core_alignment.xmfa
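Before launching long runs, it can be worth checking how many blocks the core alignment contains, since that is the number of per-block files and ClonalOrigin jobs to expect. The sketch below relies on the XMFA convention that each alignment block ends with a line starting with "=". The function name xmfa_block_count is hypothetical and is not one of the ClonalOrigin scripts.

```shell
# xmfa_block_count FILE: print the number of alignment blocks in an XMFA file.
# (xmfa_block_count is a hypothetical helper name used only for this sketch.)
xmfa_block_count() {
    # Each block in an XMFA file is terminated by a line beginning with "=".
    # grep -c exits non-zero when the count is 0, so mask the status.
    grep -c '^=' "$1" || true
}

# Example: report the block count for the core alignment, if it exists here.
if [ -f core_alignment.xmfa ]; then
    echo "core_alignment.xmfa contains $(xmfa_block_count core_alignment.xmfa) blocks"
fi
```

If the number reported differs from the number of core_alignment.xmfa.N files produced by the split, something went wrong and is worth investigating before continuing.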
Run ClonalOrigin on each alignment block
warg -a 1,1,0.1,1,1,1,1,1,0,0,0 -x 1000000 -y 1000000 -z 10000 clonaltree.nwk core_alignment.xmfa.N core_co.phase2.N.xml
This command must be run once on each of the alignment blocks, with N replaced by the block number. Farming these jobs out on a compute cluster is highly recommended. On clusters where jobs are managed with Sun Grid Engine (SGE) the following script can be used to run the jobs:
#!/bin/sh
#$ -cwd
#$ -S /bin/bash
#$ -t 1-157
export WORKDIR=/state/partition1/warg
mkdir -p $WORKDIR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$HOME/lib
warg -a 1,1,0.1,1,1,1,1,1,0,0,0 -x 1000000 -y 1000000 -z 10000 clonaltree.nwk core_alignment.xmfa.$SGE_TASK_ID $WORKDIR/core_co.phase2.$SGE_TASK_ID.xml
bzip2 $WORKDIR/core_co.phase2.$SGE_TASK_ID.xml
mv $WORKDIR/core_co.phase2.$SGE_TASK_ID.xml.bz2 .
The -t 1-157 line will have to be changed to range from 1 to the number of blocks in your dataset (i.e. change the 157). The following line, where the WORKDIR variable is set, will probably also need to change to reflect a path to node-local scratch space. Usually node-local storage is at /state/partition1 on a ROCKS cluster, which is assumed here. /tmp or $TMPDIR might also work. This script can be submitted with qsub and will start one job with one task for each of the blocks.
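One way to find the right upper bound for the -t line is to count the per-block files in the working directory. The sketch below assumes the files are named core_alignment.xmfa.1, core_alignment.xmfa.2 and so on, as implied by the warg commands in this guide. The function name count_blocks is hypothetical.

```shell
# count_blocks PREFIX: print how many regular files are named PREFIX.<number>.
# (count_blocks is a hypothetical helper name; the PREFIX.N naming is taken
# from the warg commands in this guide.)
count_blocks() {
    n=0
    for f in "$1".*; do
        suffix=${f#"$1".}
        case "$suffix" in
            ''|*[!0-9]*) ;;                      # skip non-numeric suffixes
            *) if [ -f "$f" ]; then n=$((n + 1)); fi ;;
        esac
    done
    printf '%s\n' "$n"
}

# Example: print the array-range line to paste into the SGE script.
echo "#\$ -t 1-$(count_blocks core_alignment.xmfa)"
```

Filtering on a purely numeric suffix keeps other files that share the same prefix from being counted as blocks.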
Get estimates of population evolution parameters
In the previous step, the global evolutionary parameters theta, rho, and delta have been estimated independently on each block. This step computes the posterior median values for those parameters, weighted by the size of each block.
computeMedians.pl *.xml
Run ClonalOrigin with global parameters
warg -a 1,1,0.1,1,1,1,1,1,0,0,0 -x 1000000 -y 20000000 -z 200000 -D 123456 -T s7890 -R s66666 clonaltree.nwk core_alignment.xmfa.N core_co.phase3.N.xml
Again, this command must be run at least once for each block, and N needs to be replaced with the block number. Also, the values 123456, 7890, and 66666 must be replaced with the global parameter estimates for Delta, Theta, and Rho from the previous step.
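To avoid editing the command by hand for every block, the substitution can be scripted. The sketch below only prints the command lines rather than running them, so they can be inspected first. The helper name phase3_cmd is hypothetical, the parameter values are the placeholders from the command above rather than real estimates, and NBLOCKS must be set to the number of blocks in your dataset.

```shell
# phase3_cmd N DELTA THETA RHO: print the global-parameter warg command for block N.
# (phase3_cmd is a hypothetical helper name; the options mirror the command above.)
phase3_cmd() {
    printf 'warg -a 1,1,0.1,1,1,1,1,1,0,0,0 -x 1000000 -y 20000000 -z 200000 -D %s -T s%s -R s%s clonaltree.nwk core_alignment.xmfa.%s core_co.phase3.%s.xml\n' \
        "$2" "$3" "$4" "$1" "$1"
}

# Placeholder values only: replace with the medians from the previous step.
DELTA=123456
THETA=7890
RHO=66666
NBLOCKS=157   # number of alignment blocks in your dataset

# Print one command line per block.
n=1
while [ "$n" -le "$NBLOCKS" ]; do
    phase3_cmd "$n" "$DELTA" "$THETA" "$RHO"
    n=$((n + 1))
done
```

Redirecting the output to a file gives a list of commands that can be run one after another or adapted into a cluster job script like the one shown earlier.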
Create a recombination map viewable in Mauve
addUnalignedIntervals core_alignment.xmfa core_alignment_mauveable.xmfa
makeMauveWargFile.pl *phase3*.bz2
Run the Mauve GUI and profit!
In the future a version of Mauve will be released that can visualize the recombination maps generated by ClonalOrigin. This feature is still under development.
What next?
You will need to invent creative ways to summarize and make inference from the recombination maps. Possibilities are endless.
Of course you will eventually write a paper, and probably publish it in a respectable open-access journal. When you do, be sure to cite the following tools that facilitated your analysis:
Darling AE, Mau B, Blatter F, Perna NT (2004) Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Research 14(7):1394-1403 http://www.genome.org/cgi/content/full/14/7/1394
Darling AE, Mau B, Perna NT (2010) progressiveMauve: Multiple Genome Alignment with Gene Gain, Loss and Rearrangement. PLoS ONE 5(6): e11147 http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0011147
Didelot X, Falush D (2007) Inference of bacterial microevolution using multilocus sequence data. Genetics 175:1251-1266 http://www.genetics.org/cgi/content/abstract/175/3/1251
Didelot X, Lawson D, Darling A, Falush D (2010) Inference of homologous recombination in bacteria using whole genome sequences. Genetics doi:10.1534/genetics.110.120121 http://www.genetics.org/cgi/content/abstract/genetics.110.1201