Class 3

Essentials of Next Generation Sequencing 2014

Genome Assembly

Newbler 2.9

Most assembly programs are run in a similar manner to one another. We will use the Newbler assembler for this exercise. Roche provides the program at no cost to the academic community. Newbler is well suited to assembling the longer NGS reads and, in our experience, the resulting assemblies are typically far superior to those produced by typical “short read” assemblers. The preferred data format is .sff (standard flowgram format, a proprietary Roche format), although it also accepts .fasta files (with quality) and, since version 2.6, .fastq files. Note also that, although we use the command-line version, a GUI version of this software is also available.


Change to the assembly directory

List the directory contents. You will see that it contains sequence reads (.sff files) from five different 454 runs.

To make life simpler for ourselves later on, let’s merge the five files into one big one using the sfffile utility in the Roche suite:

2.1 Start a new genome assembly project

To start our assembly project, we will first create a project directory using the newAssembly tool:

Usage: newAssembly [options] <project_name>

Note: the newAssembly command creates a genome assembly project. For a transcriptome assembly you need to use the -cdna option (newAssembly -cdna <project_name>). If instead we wanted to assemble a genome/transcriptome using a reference genome as a guide, we would use the newMapping tool:

Usage: newMapping -cdna <project_name>

Let’s tell Newbler that we’d like to start a new assembly project:

This created a project directory named My_Genome_Assembly. Change to this directory and list its contents:

You will find that it contains 2 subdirectories, assembly and sff. The assembly (or mapping) directory contains the 454AssemblyProject.xml project configuration file and will house the assembly results. The sff subdirectory contains symbolic links to all of the .sff files included in the project.

Let’s examine the list of parameters that govern the assembly process. These can be found in the 454AssemblyProject.xml file (see the APPENDIX for an interpretation of the parameter settings).

Note: There is a 454Project.xml file inside of the new project directory. It is not the same as the 454AssemblyProject.xml file inside of the assembly directory.

Now we need to tell the assembler where to find the input files. To do this, we will use the addRun tool:

Usage: addRun [options] <file_path>

Change to the project directory that was just created with the newAssembly command. The following step needs to be run from the top level of this directory tree.

Use the addRun tool to specify the path to the Merged_reads.sff file. Remember, it is in the original assembly directory (the one you cd’d to when you started this exercise, not the one created by newAssembly).

Finally, you can start the assembly from within the current directory:

Or you can cd up a level and run it from the parent assembly directory. In this case you must specify the path to the run directory:

Assembling the reads will take approximately 5 to 10 minutes.

2.2 Assembly output files

Examine the output files in the My_Genome_Assembly/assembly directory, i.e. the assembly subdirectory of the project directory created in section 2.1.

454AlignmentInfo.tsv contains position-by-position summary information about the consensus sequence for the contigs generated by the assembler application, listed one nucleotide per line.

.fna and .qual files: 454AllContigs, 454LargeContigs, 454Scaffolds

454ReadStatus.txt contains the status identifiers for all the reads used in the assembly computation, plus the 3’ and 5’ positions for each assembled read’s alignment within the contig results. Reads are listed one per line.

454NewblerMetrics.txt reports the key input, algorithmic and output metrics for the data analysis software. It contains input information, operation metrics, consensus results, and information about the scaffolds, large contigs and all contigs.
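As a rough illustration of how the read status file might be summarized, the Python sketch below tallies how many reads fall into each status category. It assumes the file is tab-delimited, starts with a header row, and holds the status in the second column; check your own 454ReadStatus.txt before relying on it. The function name count_read_status and the sample rows are hypothetical.

```python
from collections import Counter


def count_read_status(lines):
    """Tally the status column of a 454ReadStatus.txt-style table.

    Assumption: tab-delimited, first line is a header,
    second column holds the read status.
    """
    counts = Counter()
    for i, line in enumerate(lines):
        fields = line.rstrip("\n").split("\t")
        if i == 0 or len(fields) < 2:
            continue  # skip the header row and any malformed line
        counts[fields[1]] += 1
    return counts


# Hypothetical sample rows, for illustration only
sample = [
    "Accno\tRead Status\t5' Contig\t5' Position\t5' Strand\t3' Contig\t3' Position\t3' Strand\n",
    "read0001\tAssembled\tcontig00001\t12\t+\tcontig00001\t410\t-\n",
    "read0002\tSingleton\n",
    "read0003\tAssembled\tcontig00002\t5\t+\tcontig00002\t388\t-\n",
]

for status, n in count_read_status(sample).most_common():
    print(status, n)
```

To run it on real output, pass an open file instead of the sample list, for example count_read_status(open("454ReadStatus.txt")).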

2.3 Optional Exercise

If you are already comfortable with the command line, there is a good chance you have already completed your first assembly with plenty of time to spare. In this case, you may want to see if trimming the reads to remove poor quality sequence results in a better assembly.

Download Merged_reads.fastq from our web site (190 MB).

Run FastQC on Merged_reads.fastq.

Decide what sort of trimming/filtering you should perform on the data.

Run Trimmomatic following the procedures used in Class 2. Remember to change the trimlog, input and output file names. Save the trimmed file as Merged_reads_trimmed.fastq.

Note: the input file contains single end reads. Therefore, you will need to run Trimmomatic in single-end (SE) mode, which takes a single input file and makes a single output file.
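If Python is available on the machine where you ran Trimmomatic, a quick sanity check on the result could look like the sketch below. It reports the number of reads, mean read length and mean base quality of a FASTQ file, so that Merged_reads.fastq and Merged_reads_trimmed.fastq can be compared. It assumes plain four-line FASTQ records and Phred+33 quality encoding; the function name fastq_summary and the two sample reads are hypothetical.

```python
import io


def fastq_summary(handle, phred_offset=33):
    """Return (read count, mean read length, mean base quality).

    Assumptions: each record is exactly four lines and qualities
    are Phred+33 encoded.
    """
    n_reads = total_len = total_qual = 0
    while True:
        header = handle.readline()
        if header == "":
            break  # end of file
        if not header.strip():
            continue  # tolerate stray blank lines between records
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        n_reads += 1
        total_len += len(seq)
        total_qual += sum(ord(c) - phred_offset for c in qual)
    mean_len = total_len / n_reads if n_reads else 0.0
    mean_qual = total_qual / total_len if total_len else 0.0
    return n_reads, mean_len, mean_qual


# Two made-up reads, for illustration only
sample = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nAC\n+\n##\n")
reads, mean_len, mean_qual = fastq_summary(sample)
print(f"reads={reads} mean_length={mean_len:.1f} mean_quality={mean_qual:.1f}")
```

For the real files, replace the sample with open("Merged_reads.fastq") and open("Merged_reads_trimmed.fastq"). After trimming you would normally expect shorter reads with a higher mean quality.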

Mac OS X:

Transfer the trimmed file to the Linux server that you have been working on:

Enter your password at the prompt.

Now return to your original terminal window.

PC:

We will use the WinSCP utility to transfer the processed sequence file from the USB drive to the Linux server you have been working on:

Download and install WinSCP (follow the link for the “Installation package”). You can accept the defaults when installing.

Double-click on the WinSCP icon, enter the IP address of the machine on which you are working (128.163.192.150), fill in your username and password, and hit Log In. This will open a window that shows your local (PC) files in the left-hand panel and remote (Linux) files in the right-hand panel.

Navigate to the location where you saved the Merged_reads_trimmed.fastq file.

Now drag the Merged_reads_trimmed.fastq file from the left-hand panel (USB drive) into the assembly folder in the right-hand panel (Linux server).

Return to your original PuTTY terminal window.

However you copied the files:

Start a new assembly with your trimmed reads. Remember to give it a new name.

Compare the metrics of the two assemblies to see if trimming improves performance.
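454NewblerMetrics.txt is the primary place to look for this comparison. As an independent cross-check, the Python sketch below recomputes a few common contig statistics (number of contigs, total bases, largest contig, N50) directly from a FASTA file such as 454AllContigs.fna. The function names are hypothetical and the sample contigs are made up. N50 is defined here as the length of the contig at which the running total, counted from the longest contig downwards, first reaches half of the total assembly length.

```python
def contig_lengths(fasta_lines):
    """Return the length of every sequence in a FASTA stream."""
    lengths = []
    current = None
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0  # start counting a new contig
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length L such that contigs of length >= L hold at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def summarize(fasta_lines):
    """Collect a few summary statistics for one assembly."""
    lengths = contig_lengths(fasta_lines)
    return {
        "contigs": len(lengths),
        "total_bases": sum(lengths),
        "largest": max(lengths, default=0),
        "N50": n50(lengths),
    }


# Made-up contigs, for illustration only
sample = [">contig00001", "ACGTACGTAC", ">contig00002", "ACGTACGT", ">contig00003", "ACGT"]
print(summarize(sample))
```

Running summarize(open(path)) on the 454AllContigs.fna file of each project gives two dictionaries that can be compared side by side. A higher N50 with a similar total length is usually taken as a sign of a more contiguous assembly, though it says nothing about correctness.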

APPENDIX

The Newbler configuration file starts with the header, <FourFiveFourProject> and ends with the footer, </FourFiveFourProject>.

Part 1 contains project information: project directory, type (assembly or mapping), date of creation, and software version (2.9).

Part 2 contains configuration specifications – these are various parameters related to the assembly process:

XML tag / brief description / default / runProject option
minimumReadLength / all reads shorter are ignored / 50 / -minlen #
overlapSeedStep / assembly algorithm parameter / 12 / -ss #
overlapSeedLength / assembly algorithm parameter / 16 / -sl #
overlapMinSeedCount / assembly algorithm parameter / 1 / -sc #
overlapMinMatchLength / min length to be considered a sequence overlap / 40 / -ml #
overlapMinMatchIdentity / percent of identity / 90 / -mi #
overlapMatchIdentScore / match score / 2 / -ais #
overlapMatchDiffScore / mismatch score (penalty) / -3 / -ads #
allContigThresh / shortest contig length / 100 / -a #
largeContigThresh / min length of a large contig / 500 / -l #
expectedDepth / expected depth of coverage, 0 means any / 0 / -e #
cDNAMode / if transcriptome assembly / No / -cdna
largeGenome / if large or complicated genome / No / -large
ripMode / if read can belong to one contig only / No / -rip
numCPU / number of cores used; 0 for all available / 1 / -cpu #
finishMode / gaps to be filled with repeats and short reads / No / -finish
autoTrimming / trim low quality bases / Yes / -trim/-notrim
UseSerialIO / preserve order of input files / No / -sio
SerialIOMinimizeDisk / preserve order of input while minimizing disk space / No / -sid
SerialIOMemLimit / preserve order of input while limiting memory use / No / -sim #
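To check a handful of these settings without reading the whole file, something like the Python sketch below could be used. It is only a sketch: it assumes that each parameter is stored as a simple element whose tag name matches the first column of the table, which may not reflect the exact layout of your 454AssemblyProject.xml. The <Config> wrapper in the sample and the function name read_parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

# Tag names taken from the first column of the table above
PARAMS_OF_INTEREST = {
    "minimumReadLength",
    "overlapMinMatchLength",
    "overlapMinMatchIdentity",
    "largeContigThresh",
    "numCPU",
}


def read_parameters(xml_text, wanted=PARAMS_OF_INTEREST):
    """Return {tag: text} for every wanted element found anywhere in the document.

    Assumption: parameters are stored as element text, not as attributes.
    """
    root = ET.fromstring(xml_text)
    found = {}
    for elem in root.iter():
        if elem.tag in wanted and elem.text is not None:
            found[elem.tag] = elem.text.strip()
    return found


# Hypothetical fragment; the real file layout may differ
sample = """<FourFiveFourProject>
  <Config>
    <minimumReadLength>50</minimumReadLength>
    <overlapMinMatchLength>40</overlapMinMatchLength>
    <overlapMinMatchIdentity>90</overlapMinMatchIdentity>
    <numCPU>1</numCPU>
  </Config>
</FourFiveFourProject>"""

for tag, value in sorted(read_parameters(sample).items()):
    print(tag, value)
```

For a real project, read the file contents first, for example read_parameters(open("assembly/454AssemblyProject.xml").read()).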

Part 3 contains information on reference files, vector sequences, primer sequences, adaptor sequences and input files. Each file reference is placed between <File> and </File> tags.
