Module 10. Whole Genome Annotation: Misc

Background

·  Fathom (part of the snap gene predictor) analyzes annotation sets and can sort and categorize genes into different groups based on their annotation quality.

·  Grep is used to grab information from many files at once using Linux.

·  A shell script is used to run many commands automatically.

·  Command line Blast allows one to search the genome multifasta file for sequence motifs.

·  Remmina is a remote desktop client similar to PuTTY used for windows.

·  Installing software in Linux is an essential skill.

Goals

·  Assess and summarize gene quality from a whole genome annotation using Fathom

·  Use Grep and shell scripts to pull sequences out of a large multifasta file

·  Use a remote desktop client on Ubuntu to transfer files from a virtual box

·  Practice installing software in the Linux environment

V&C core competencies addressed

1) Ability to apply the process of science: Evaluation of experimental evidence, Developing problem-solving strategies

2) Ability to use quantitative reasoning: Applying statistical methods to diverse data, Managing and analyzing large data sets

3) Applying informatics tools, Managing and analyzing large data sets

GCAT-SEEK sequencing requirements

None

Computer/program requirements for data analysis

Linux OS, Fathom, Blast, GCATSEEK Linux Virtual Machine

If starting from Window OS: Putty

If starting from Mac OS: SSH

Protocols

I. Using Fathom to assess Annotation Quality

This tutorial uses the output from Maker after genome.fa was annotated. Also, running any of the listed commands without any additional input (or running them with the -h option) will display the help message including a description of the program, the usage, and all options.

1) Gather all of the GFFs into one inclusive file.

In the genome.maker.output directory

$gff3_merge -g -d genome_master_datastore_index.log

2) Convert the gff to zff (a proprietary format used by fathom)

$maker2zff -n genome.all.gff

This will create two new files genome.ann & genome.dna

The .ann file is a tab delimited file similar to a gff with sequence information. The .dna file is a fasta file that corresponds to the sequences in the .ann file.

3) First, determine the total number of genes and total number of errors there were. The results will appear on your screen. Enter them in the table below.

$fathom –validate genome.ann genome.dna

4) Next, calculate the number of Multi and Single exon genes as well as average exon and intron lengths. Enter them in the table below.

$fathom –gene-stats genome.ann genome.dna

5) Categorize splits the genes in the genome.ann and genome.dna files into 5 different categories: Unique (uni.ann uni.dna), Warning (wrn.ann uni.dna), Alternative splicing (alt.ann alt.dna), Overlapping (olp.ann olp.dna), and Errors (err.ann err.dna).

$fathom –categorize 1000 genome.ann genome.dna

Explanation of categories

Category / Explanation
Unique / Single complete gene with no problems:
Good start and stop codons, canonical intron exon boundaries (if a multi-exon gene) no other genes overlapping it or any other potential problems
Warning / Something about the gene is incorrect: missing start or stop codon, does not have a canonical intron exon boundary, or has a short intron or exon (or is a short gene if it has a single exon)
Alternative splicing / Genes that are complete, but have alternative splicing
Overlapping / Two or more genes that overlap each other
Error / Something occurred and the gene is incorrect. Usually caused by miss-ordered exons.

6) To complete the table, repeat 3) and 4) for each of the non-error categories.

Using Fathom, complete the following table using your data

Count / Percent of Total
Total
Complete Genes
Warnings
Alternative splice
Overlapping
Errors
Single Exon

II. Grep Tutorial

grep [options] “pattern” filename

Command / Function / Example Usage
Grep / Searches for lines in the file containing the search pattern / grep “GENE” annotation.gff
grep –i / Same as grep but is not case sensitive / grep -i “$think of something” $_
grep –c / Counts the number of times the pattern appears / grep -c “>” proteins.fasta
grep -An / Prints the line matching the pattern and n lines after the pattern / grep -A1 “scaffold1” genome.fasta


grep

Finds and prints every line that contains the pattern being searched for.

If you had a file with:

APPLE

pineapple

apples

And ran the command: grep “APPLE” file grep will print:

APPLE

$grep -i

The same as grep but is not case sensitive. So running the command: grep “APPLE” file grep will print:

APPLE

apples

$grep -c

Will count the number of lines the pattern appears in in a file. Useful for counting the number of sequences in a fasta file.

If you had a fasta file with sequence data:

>sequence1

GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG

>sequence 2

AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGTGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCTCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGCGCTGCTG

and ran the command: grep -c “>” file grep will print:

2 (the number of lines containing > in the file which is also the number of sequences)

$grep -A n

Will print the line matching the pattern and n number of lines after it. So, running the command: grep -A1 “sequence1” file grep will print:

>sequence1

GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG

Some notes on grep:

If a line contains the pattern more than one time, the line will be printed or counted only once.

Grep can also make use of regular expressions.

Other options:

-n / Print the line number for lines that match the pattern
-B n / Print lines before the pattern (opposite of -An)
-H / Print the file's name before each line (when searching multiple files)
-f file / Read the pattern from a file line by line
-v / Will print lines that don't match the pattern
-x / Print the exact match. The line can only
-m n / Stop after finding n matches
egrep / Uses the full set of regular expressions
fgrep / Does not use regular expressions and only searches of literal strings. (Faster than grep)

Regular expressions:

Regular expressions are certain characters that stand for other characters. Often times multiple types of characters or multiple characters.

Some (but definitely not all) Regular Expressions:

* / Wildcard, any character
. / Any character other than new line (\n)
\t / Tab
\s / Whitespace (space, Tab, new line)
^ / Beginning of line
$ / End of line

More regular expressions can be found at:

http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1

III. Creating a grep shell script

If you have a large number of search terms to extract with grep, executing a large number of grep commands can be repetitive and overly time consuming. One way around this is by making a shell script. A shell script is a list of commands that will be executed on after the other. Shell scripts are easy to create and can run multiple commands with only one command.

Earlier in this tutorial we saw how grep -A1 will extract a sequence title and the sequence.

Extracting 5 sequences using a grep shell script:

1) Make a list of the sequences you want to extract from a large genome.

$nano

Enter the sequence names below:

______

sequence_1

sequence_2

sequence_3

sequence_4

sequence_5

______

Control X to exit

Save the file

2) Open the list in vi (vi is a command line level text editor that can handle regular expressions. Nano cannot.).

$vi [filename]

3) In vi, add the grep -A1 command to the front of each line. This is a regular expression that will find the beginning of each line and replace it with “grep –A1 ” The colon is the command line for vi. The s option is search. The characters between the first two slashes (/^/) is the search term and the characters between the last two slashes (/grep -A1 “/) is the replace term. The special character ^ matches the beginning of each line.

:%s/^/grep -A1 “/

This will turn the list into:

grep -A1 “sequence_1

grep -A1 “sequence_2

grep -A1 “sequence_3

grep -A1 “sequence_4

grep -A1 “sequence_5

4) Now, add the rest of the grep command (” genome.fa > output) to the end of each line.

:%s/$/” genome.fa > output

The > characters will append the grep output to the end of the file “output”.

This will turn the list into:

grep -A1 “sequence_1” genome.fa > output

grep -A1 “sequence_2” genome.fa > output

grep -A1 “sequence_3” genome.fa > output

grep -A1 “sequence_4” genome.fa > output

grep -A1 “sequence_5” genome.fa > output

5) Save the file and close vi. The command below “shift-zz” will save and close the file in vi.

:ZZ

6) Open the file in nano

nano filename

7) Now, change the first “>” to “>”. “>” will append information to the end of a file. A single > will create a new file if it does not exist yet and clear the contents of the file if it does already.

The list should now look like:

grep -A1 “sequence_1” genome.fa > output

grep -A1 “sequence_2” genome.fa> output

grep -A1 “sequence_3” genome.fa > output

grep -A1 “sequence_4” genome.fa > output

grep -A1 “sequence_5” genome.fa > output

8) Now add “#! /bin/bash” to the first line of the file.

The #! (known as shebang) tells the computer that the commands are to be sent to the program bash which is a common UNIX shell and the one that both the virtual machines and the HHMI cluster use by default.

9) Make the file executable by changing the file permissions. “u+x” will give users executable permissions.

$sudo chmod u+x filename

10) Run the script using “./” below will tell the computer to look in the working directory rather than the path specified in your .bashrc file for the program to execute that you just made.

$./filename

The script will now run and extract the 5 sequence headers along with their DNA from genome.fa and put them in the file “output.”

IV. Transfer files from the HHMI cluster to your Linux virtual box using the Remmina client FTP server.

Connect to the HHMI Cluster from Virtual Box using Remmina Remote Desktop

Remmina remote desktop is the default remote desktop application for the Ubuntu OS. This is the equivalent of Putty on windows.

1) Select the “Dash Home” button which is similar to the “Start Menu” in Windows in the top left of your desktop screen.

2) Type in “remm”

3) Select Remmina Remote Desktop.

4) Select the icon showing a paper with a plus sign to create a new connection.

5) Under “profile”:

Name: HHMI Cluster

Protocol: ssh – Secure Shell

Server: 192.112.102.21 (10.39.6.10 for JC internal users)

username: [enter your cluster username]

6) Connect

7) If a warning message saying the server is unknown appears, click ok.

8) Enter in your password and click ok.

Transfer Files from the cluster to Virtual box using sFTP

1) Click on the icon showing three gears in the top of the Remmina Remote Desktop window.

2) Select “Open Secure File Transfer”

3) Navigate within the GUI to the folder containing the file you want to download.

4) Select the file you want to download and click on the “Download” near the top of the Remmina Remote Desktop window.

OPTIONAL: Upload files from your virtual machine to the cluster

1) Navigate to the directory you wish to upload your file to.

2) Click on “Upload”

3) Navigate to the file you wish to upload.

4) Select the file.

5) Click ok

V. Command line Blast Tutorial

The Basic Local Alignment Search Tool or BLAST is easily the most common and widely used piece of bioinformatical software. BLAST can perform several different types of alignments with several different forms of output. BLAST can use either publicly available, previously created databases to compare the sequence of interest too, or it can be used with a custom database. This tutorial assumes you are comparing a sequence file called genome.fa to another set of reference sequences called reference.fa.

Creating a custom BLAST database

1) Make a directory for blast from your home page, and copy your genome.fa file from your Maker folder to blast folder. We will use the genome file to make a custom database that we can then use as a reference subject to search for a protein query (tBlastn).

$mkdir blast

$cp maker/genome.fa blast/

2) Move to your blast folder and create a BLAST database from a reference set of fasta files using genome.fa.

$cd blast

$makeblastdb -in [reference.fa] -dbtype [nucl or prot] -title [reference] -out [reference]

This will create three database files all starting with “reference” and ending in certain file extensions. If the initial sequences were proteins, dbtype would be set to prot and if they were nucleotides, dbtype would be set to nucl.

Use BLAST to align the sequence of interest to the custom database

1) Run tBlastn as follows:

$[t]blast[n p x] -query proteins.fa -db reference -out blastresults.txt -outfmt 0

The outfmt (outformat) option tells BLAST what format to output the data in. Pairwise (default or –outfmt 0) is a good way to visualize results. 7 (-outfmt 7) is a tabular output with comments indicating column titles. This is a good format for exporting to a spreadsheet.

BLAST also has options to control stringency of match between subject and query. To see the options call up the help documentation for the type of blast you are running. For example:

$tblastn –help

For example, to run the same query, but change the e-value to 1 x 10E-10:

$blastn -query genome.fa -db reference –evalue 1e-10 -out blastresults.txt -outfmt 0

Blast types

Blastn / Nucleotide query to nucleotide database
Blastp / Protein query to protein database
Blastx / Translated nucleotide query to protein database
Tblastn / Protein query to translated nucleotide database
Tblastx / Translated nucleotide query to translated nucleotide database

VI. For Virtual Box Users: Installing software in a Linux OS

One of the most important, but often most difficult parts of doing computer work in Linux is installing new software. Since there are a wide variety of UNIX varieties (aka distributions) and each individual machine can have a different set up, programs are often written on one distribution but may not work on another. UNIX has some support for this, but installation oftentimes depends on the user first installing other programs that are required for the installed program to work properly (aka dependencies). The user must then provide certain information on their system to the program being installed in order for it to run from any directory.