Automating Genome Assembly: from Contigs to Annotated Genomes

Supplementary Methods.

PAGIT: two worked examples

Here we give a synopsis of the work-flow used for the two examples discussed in the “Anticipated Results” section.

E. colidata and initial assembly.

The reads for Escherichia coli K-12 strain MG1655 were obtained from the Short Read Archive at the NCBI (run ERR008613 under study ERP000092). They were prepared by Illumina as a test data set for development of de novo sequence assembly algorithms and related tools. They consisted of 101 bp paired-end Illumina reads.

The NCBI was searched for “Escherichia coli” in order to find the identifier of a suitable reference genome: “U00096” was found for strain K-12 substr. MG1655, and the sequence file U00096.fna was downloaded. The EBI was then searched with “U00096” and the annotation in EMBL format was downloaded.

Before applying the PAGIT protocol, we performed a de novo assembly. Velvet version 1.1.06 compiled to support k-mers of up to 99 bp was used to generate an initial assembly. The reads came in two files (“s_6_1.fastq” and “s_6_2.fastq”) which we first “shuffled” together using the appropriate Velvet perl script. We then ran “velveth”, followed by “velvetg” using a number of different k-mers. The best results were obtained with a k-mer of 81. A synopsis of the commands we used is as follows (for more detail please see the Velvet manual):

/path/to/velvet/shuffleSequences_fastq.pl s_6_1.fastq s_6_2.fastq shuffled.fastq

/path/to/velvet/velveth K81 81 -shortPaired -fastq shuffled.fastq

/path/to/velvet/velvetg K81 -exp_cov auto

The result of this assembly was 182 scaffolds with a combined sum of 4.58 Mb, a scaffold N50 of 70.3 Kb and an average size of 25.2 Kb. The largest scaffold was 209.8 Kb.

Applying PAGIT to the E. coli example.

ABACAS was executed with the following command:

abacas.pl -b -m -r U00096.fna -q contigs.fa -p nucmer -o U96mapped

After joining the results (ordered contigs and bin), IMAGE was initially executed with the following command:

image.pl -scaffolds ABACAS.fasta -prefix s_6 -iteration 1 -all_iteration 1 -dir_prefix ite -kmer 81

The initial IMAGE run took about 4 hours to complete. The following commands were then executed one after another (they were pasted into a file, the file was made executable and then executed).For these runs, each iteration took about 20 minutes:

restartIMAGE.pl ite3 71 3 partitioned

restartIMAGE.pl ite6 61 3 partitioned

restartIMAGE.pl ite9 51 3 partitioned

restartIMAGE.pl ite12 41 3 partitioned

restartIMAGE.pl ite15 31 3 partitioned

The following command was used to execute ICORN:

icorn.start.sanger.sh scaffolds18.fa 1 6 s_6_1.fastq s_6_2.fastq 400,800 600

The following command was used to execute RATT:

$RATT_HOME/start.ratt.sh ./EMBL scaffolds18.fa.7 ratted Strain.Global

To generate Figure 6 the following commands were used to first create a BLAST comparison file, and then to load the files into ACT:

formatdb -p F -i U00096.fna

blastall -p blastn -m 8 -e 1e-40 -dU00096.fna -i Sequences/ordered_gi.48994873-o comp.blast

act EMBL/ U00096.embl comp.blast ratted.ordered_gi.48994873.gb.U00096.2..final.embl

Initially ACT displayed the reference genome “U00096.embl” on top with the newly annotated genome below, and syntenic blocks between them indicated in red. Subsequently, the original entry “U00096.embl” was disabled and the file with non-transferred annotations, “ratt.U00096.NOTTransfered.embl”, was loaded in its place (this was performed via the following ACT menus: Entries -> U00096.embl -> disable U00096.embl).

C. Trachomatis data

The data used here (EMBL files, 454 assembly and the Illumina reads) can be found on the PAGIT website: ftp://ftp.sanger.ac.uk/pub4/resources/software/pagit/PAGIT.ChlamydiaExample.tgz

We took 15% of the reads to get coverage of around 100x from the library ERS001402 (available at the NCBI in the Short Read Archive), which is a 54 bp Illumina paired-end library with around 40% of mouse contamination.

The reference genome (AM884177) and its plasmid (AM886279) were downloaded from the EBI. We saved the two EMBL files into two different directories“dir1” and “dir2”. Next we extracted the FASTA sequence for each EMBL file with:

perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir1 AM884177.fasta

perl $PAGIT_HOME/RATT/main.ratt.pl Embl2Fasta dir2 AM886279.fasta

Applying PAGIT to the C. Trachomatis example

We generated a working directory for ABACAS and linked the FASTA files into it. As we have two reference sequences, we first mapped the contigs against the nuclearreference genome, and then ordered the contigs in the bin against the plasmid genome:

ln -s ../*.fasta .

ln -s ../454LargeContigs.fna

perl $PAGIT_HOME/ABACAS/abacas.pl -r AM884177.fasta -q 454LargeContigs.fna -p nucmer -c -m -b -o Res.Nuc

perl $PAGIT_HOME/ABACAS/abacas.pl -r AM886279.fasta -q Res.Nuc.contigsInbin.fas -p nucmer -c -m -b -o Res.Plasmid

Most prokaryotic genomes consist of a single circular chromosome. Circular genomes are represented linearly in sequence files with the origin of replication at the start: this can cause problems when using them in mapping or alignment tasks due to the artificially created ends of the linear sequence. When initially assembling a circular genome it is usually the case that the origin of replication is located in the middle of a contig: hence there may be problems when aligning that contig to the reference genome.

For this use-case, based on the assumption that the origin of replication is annotated correctly in the reference genome, weused ACT to identify the origin of replication in our query genome. On examining the synteny between the origin of replication in the reference to our new sequence, we found the following: “ATGACAAGGCTTCCATTACT”. We then loaded the ABACAS output file “Res.Nuc.fasta” into an editor and searched for those bases. On finding the pattern, we inserted a new line before it with a FASTA header (for example “>manual Cut”). In this way we manually divided the sequence into two according to where we suspect the origin of replication to be.

Next we reran the first ABACAS command, changing the input to use the new split sequence of the ABACAS pseudomolecule (just 2 sequences to map). This new mapping will have gained an additional gap, but the advantage is that the gaps can be more accurately closed by IMAGE.

The seven unmapped contigs from the first ABACAS execution were mapped against the plasmid sequence, giving a plasmid pseudomolecule. We joined the two pseudo contigs (from the genome and the plasmid) and the plasmid bin contigs. Now we have a pseudo sequence for the nucleus genome and the plasmid, rather than a joined sequence. Note that we could use the plasmid bin contigs for further analysis.

Here is a synopsis of the commands used:

formatdb -p F -i AM884177.fasta

blastall -p blastn -m 8 -e 1e-40 -d AM884177.fasta -i Res.Nuc.fasta -o comp.Nuc.crunch

act AM884177.fasta comp.Nuc.crunch Res.Nuc.fasta

perl $PAGIT_HOME/ABACAS/abacas.pl -r AM884177.fasta -q Res.Nuc.fasta -p nucmer -c -m -b -o Res.NucOrigin

cat Res.NucOrigin.fasta Res.Plasmid.fasta Res.Plasmid.contigsInbin.fas > Res.abacas.fasta

IMAGE was executed with two different k-mers. A synopsis of the commands used is as follows:

perl $PAGIT_HOME/IMAGE/image.pl -scaffolds Res.abacas.fasta -prefix 3807_partial -iteration 1 -all_iteration 3 -dir_prefix ite -kmer 49 -vel_ins_len 300 -smalt_minScore 45 > output.txt

perl $PAGIT_HOME/IMAGE/restartIMAGE.pl ite3 31 12 partitioned

image_run_summary.pl ite

perl $PAGIT_HOME/IMAGE/contigs2scaffolds.pl ite9/new.fa ite9/new.read.placed 300 10 Res.image

After IMAGE, ICORN was executed for 4 iterations. A synopsis of the commands used is as follows:

icorn.start.sh Res.image.fa 1 4 3807_partial_1.fastq 3807_partial_2.fastq 80,600 300

icorn.CollectResults.sh Res.image.fa 3807

We ran RATT using the sequences output from the 4th(and final) iteration of ICORN using the following command:

start.ratt.sh embl/ Res.image.fa.4 TransferPAGIT Strain > ratt.output.TransferPAGIT.txt

For transferring onto the original 454 assembly we ran the following command:

start.ratt.sh embl/ 454LargeContigs.fna Transfer454 Strain > ratt.output.Transfer454.txt

To gather the general transfer statistics we used the following command:

grep -iB1 -A1 "Gene model" ratt.out*txt

The command used for summarizing statistics on the incorrect models was the following. It prints all the lines from the report files of the 454 transfer. The grep -v Gene_ID will delete the header line of each file. The wc -l counts the amount of lines, therefore the amount of gene models that need correction after the transfer.

cat Transfer454*.Report.txt | grep -v Gene_ID | wc -l

And similarly for the PAGIT sequence:

cat TransferPAGIT*.Report.txt | grep -v Gene_ID | wc -l

To gather statistics on the transferred models with frameshifts on the 454 contigs we used the following command. The difference to the previous command is just to count where the fifth column has more than two counts. The fifth column is the numberof stop codons in the codon regions: therefore it is likely to be a frameshift if the number of stop codons is higher than 1.

cat Transfer454*.Report.txt | awk '$5>2' | grep -v Gene_ID | wc

And similarly for the PAGIT sequence:

cat TransferPAGIT*.Report.txt | awk '$5>2' | grep -v Gene_ID | wc