Module 10. Whole Genome Annotation: Misc
Background
· Fathom (part of the snap gene predictor) analyzes annotation sets and can sort and categorize genes into different groups based on their annotation quality.
· Grep is used to grab information from many files at once using Linux.
· A shell script is used to run many commands automatically.
· Command line Blast allows one to search the genome multifasta file for sequence motifs.
· Remmina is a remote desktop client similar to PuTTY used for windows.
· Installing software in Linux is an essential skill.
Goals
· Assess and summarize gene quality from a whole genome annotation using Fathom
· Use Grep and shell scripts to pull sequences out of a large multifasta file
· Use a remote desktop client on Ubuntu to transfer files from a virtual box
· Practice installing software in the Linux environment
V&C core competencies addressed
1) Ability to apply the process of science: Evaluation of experimental evidence, Developing problem-solving strategies
2) Ability to use quantitative reasoning: Applying statistical methods to diverse data, Managing and analyzing large data sets
3) Applying informatics tools, Managing and analyzing large data sets
GCAT-SEEK sequencing requirements
None
Computer/program requirements for data analysis
Linux OS, Fathom, Blast, GCATSEEK Linux Virtual Machine
If starting from Window OS: Putty
If starting from Mac OS: SSH
Protocols
I. Using Fathom to assess Annotation Quality
This tutorial uses the output from Maker after genome.fa was annotated. Also, running any of the listed commands without any additional input (or running them with the -h option) will display the help message including a description of the program, the usage, and all options.
1) Gather all of the GFFs into one inclusive file.
In the genome.maker.output directory
$gff3_merge -g -d genome_master_datastore_index.log
2) Convert the gff to zff (a proprietary format used by fathom)
$maker2zff -n genome.all.gff
This will create two new files genome.ann & genome.dna
The .ann file is a tab delimited file similar to a gff with sequence information. The .dna file is a fasta file that corresponds to the sequences in the .ann file.
3) First, determine the total number of genes and total number of errors there were. The results will appear on your screen. Enter them in the table below.
$fathom –validate genome.ann genome.dna
4) Next, calculate the number of Multi and Single exon genes as well as average exon and intron lengths. Enter them in the table below.
$fathom –gene-stats genome.ann genome.dna
5) Categorize splits the genes in the genome.ann and genome.dna files into 5 different categories: Unique (uni.ann uni.dna), Warning (wrn.ann uni.dna), Alternative splicing (alt.ann alt.dna), Overlapping (olp.ann olp.dna), and Errors (err.ann err.dna).
$fathom –categorize 1000 genome.ann genome.dna
Explanation of categories
Category / ExplanationUnique / Single complete gene with no problems:
Good start and stop codons, canonical intron exon boundaries (if a multi-exon gene) no other genes overlapping it or any other potential problems
Warning / Something about the gene is incorrect: missing start or stop codon, does not have a canonical intron exon boundary, or has a short intron or exon (or is a short gene if it has a single exon)
Alternative splicing / Genes that are complete, but have alternative splicing
Overlapping / Two or more genes that overlap each other
Error / Something occurred and the gene is incorrect. Usually caused by miss-ordered exons.
6) To complete the table, repeat 3) and 4) for each of the non-error categories.
Using Fathom, complete the following table using your data
Count / Percent of TotalTotal
Complete Genes
Warnings
Alternative splice
Overlapping
Errors
Single Exon
II. Grep Tutorial
grep [options] “pattern” filename
Command / Function / Example UsageGrep / Searches for lines in the file containing the search pattern / grep “GENE” annotation.gff
grep –i / Same as grep but is not case sensitive / grep -i “$think of something” $_
grep –c / Counts the number of times the pattern appears / grep -c “>” proteins.fasta
grep -An / Prints the line matching the pattern and n lines after the pattern / grep -A1 “scaffold1” genome.fasta
grep
Finds and prints every line that contains the pattern being searched for.
If you had a file with:
APPLE
pineapple
apples
And ran the command: grep “APPLE” file grep will print:
APPLE
$grep -i
The same as grep but is not case sensitive. So running the command: grep “APPLE” file grep will print:
APPLE
apples
$grep -c
Will count the number of lines the pattern appears in in a file. Useful for counting the number of sequences in a fasta file.
If you had a fasta file with sequence data:
>sequence1
GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
>sequence 2
AGCCCTCCAGGACAGGCTGCATCAGAAGAGGCCATCAAGCAGGTCTGTTCCAAGGGCCTTTGCGTCAGGTGGGCTCAGGATTCCAGGGTGGCTGGACCCCAGGCCCCAGCTCTGCAGCAGGGAGGACGTGGCTGGGCTCGTGAAGCATGTGGGGGTGAGCCCAGGGGCCCCAAGGCAGGGCACCTGGCCTTCAGCCTGCTCAGCCCTGCCTGTCTCCCAGATCACTGTCCTTCTGCCATGGCCCTGTGGATGCGCCTCCTGCCCCTGCTGCGCTGCTG
and ran the command: grep -c “>” file grep will print:
2 (the number of lines containing > in the file which is also the number of sequences)
$grep -A n
Will print the line matching the pattern and n number of lines after it. So, running the command: grep -A1 “sequence1” file grep will print:
>sequence1
GCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATGCATG
Some notes on grep:
If a line contains the pattern more than one time, the line will be printed or counted only once.
Grep can also make use of regular expressions.
Other options:
-n / Print the line number for lines that match the pattern-B n / Print lines before the pattern (opposite of -An)
-H / Print the file's name before each line (when searching multiple files)
-f file / Read the pattern from a file line by line
-v / Will print lines that don't match the pattern
-x / Print the exact match. The line can only
-m n / Stop after finding n matches
egrep / Uses the full set of regular expressions
fgrep / Does not use regular expressions and only searches of literal strings. (Faster than grep)
Regular expressions:
Regular expressions are certain characters that stand for other characters. Often times multiple types of characters or multiple characters.
Some (but definitely not all) Regular Expressions:
* / Wildcard, any character. / Any character other than new line (\n)
\t / Tab
\s / Whitespace (space, Tab, new line)
^ / Beginning of line
$ / End of line
More regular expressions can be found at:
http://regexlib.com/CheatSheet.aspx?AspxAutoDetectCookieSupport=1
III. Creating a grep shell script
If you have a large number of search terms to extract with grep, executing a large number of grep commands can be repetitive and overly time consuming. One way around this is by making a shell script. A shell script is a list of commands that will be executed on after the other. Shell scripts are easy to create and can run multiple commands with only one command.
Earlier in this tutorial we saw how grep -A1 will extract a sequence title and the sequence.
Extracting 5 sequences using a grep shell script:
1) Make a list of the sequences you want to extract from a large genome.
$nano
Enter the sequence names below:
______
sequence_1
sequence_2
sequence_3
sequence_4
sequence_5
______
Control X to exit
Save the file
2) Open the list in vi (vi is a command line level text editor that can handle regular expressions. Nano cannot.).
$vi [filename]
3) In vi, add the grep -A1 command to the front of each line. This is a regular expression that will find the beginning of each line and replace it with “grep –A1 ” The colon is the command line for vi. The s option is search. The characters between the first two slashes (/^/) is the search term and the characters between the last two slashes (/grep -A1 “/) is the replace term. The special character ^ matches the beginning of each line.
:%s/^/grep -A1 “/
This will turn the list into:
grep -A1 “sequence_1
grep -A1 “sequence_2
grep -A1 “sequence_3
grep -A1 “sequence_4
grep -A1 “sequence_5
4) Now, add the rest of the grep command (” genome.fa > output) to the end of each line.
:%s/$/” genome.fa > output
The > characters will append the grep output to the end of the file “output”.
This will turn the list into:
grep -A1 “sequence_1” genome.fa > output
grep -A1 “sequence_2” genome.fa > output
grep -A1 “sequence_3” genome.fa > output
grep -A1 “sequence_4” genome.fa > output
grep -A1 “sequence_5” genome.fa > output
5) Save the file and close vi. The command below “shift-zz” will save and close the file in vi.
:ZZ
6) Open the file in nano
nano filename
7) Now, change the first “>” to “>”. “>” will append information to the end of a file. A single > will create a new file if it does not exist yet and clear the contents of the file if it does already.
The list should now look like:
grep -A1 “sequence_1” genome.fa > output
grep -A1 “sequence_2” genome.fa> output
grep -A1 “sequence_3” genome.fa > output
grep -A1 “sequence_4” genome.fa > output
grep -A1 “sequence_5” genome.fa > output
8) Now add “#! /bin/bash” to the first line of the file.
The #! (known as shebang) tells the computer that the commands are to be sent to the program bash which is a common UNIX shell and the one that both the virtual machines and the HHMI cluster use by default.
9) Make the file executable by changing the file permissions. “u+x” will give users executable permissions.
$sudo chmod u+x filename
10) Run the script using “./” below will tell the computer to look in the working directory rather than the path specified in your .bashrc file for the program to execute that you just made.
$./filename
The script will now run and extract the 5 sequence headers along with their DNA from genome.fa and put them in the file “output.”
IV. Transfer files from the HHMI cluster to your Linux virtual box using the Remmina client FTP server.
Connect to the HHMI Cluster from Virtual Box using Remmina Remote Desktop
Remmina remote desktop is the default remote desktop application for the Ubuntu OS. This is the equivalent of Putty on windows.
1) Select the “Dash Home” button which is similar to the “Start Menu” in Windows in the top left of your desktop screen.
2) Type in “remm”
3) Select Remmina Remote Desktop.
4) Select the icon showing a paper with a plus sign to create a new connection.
5) Under “profile”:
Name: HHMI Cluster
Protocol: ssh – Secure Shell
Server: 192.112.102.21 (10.39.6.10 for JC internal users)
username: [enter your cluster username]
6) Connect
7) If a warning message saying the server is unknown appears, click ok.
8) Enter in your password and click ok.
Transfer Files from the cluster to Virtual box using sFTP
1) Click on the icon showing three gears in the top of the Remmina Remote Desktop window.
2) Select “Open Secure File Transfer”
3) Navigate within the GUI to the folder containing the file you want to download.
4) Select the file you want to download and click on the “Download” near the top of the Remmina Remote Desktop window.
OPTIONAL: Upload files from your virtual machine to the cluster
1) Navigate to the directory you wish to upload your file to.
2) Click on “Upload”
3) Navigate to the file you wish to upload.
4) Select the file.
5) Click ok
V. Command line Blast Tutorial
The Basic Local Alignment Search Tool or BLAST is easily the most common and widely used piece of bioinformatical software. BLAST can perform several different types of alignments with several different forms of output. BLAST can use either publicly available, previously created databases to compare the sequence of interest too, or it can be used with a custom database. This tutorial assumes you are comparing a sequence file called genome.fa to another set of reference sequences called reference.fa.
Creating a custom BLAST database
1) Make a directory for blast from your home page, and copy your genome.fa file from your Maker folder to blast folder. We will use the genome file to make a custom database that we can then use as a reference subject to search for a protein query (tBlastn).
$mkdir blast
$cp maker/genome.fa blast/
2) Move to your blast folder and create a BLAST database from a reference set of fasta files using genome.fa.
$cd blast
$makeblastdb -in [reference.fa] -dbtype [nucl or prot] -title [reference] -out [reference]
This will create three database files all starting with “reference” and ending in certain file extensions. If the initial sequences were proteins, dbtype would be set to prot and if they were nucleotides, dbtype would be set to nucl.
Use BLAST to align the sequence of interest to the custom database
1) Run tBlastn as follows:
$[t]blast[n p x] -query proteins.fa -db reference -out blastresults.txt -outfmt 0
The outfmt (outformat) option tells BLAST what format to output the data in. Pairwise (default or –outfmt 0) is a good way to visualize results. 7 (-outfmt 7) is a tabular output with comments indicating column titles. This is a good format for exporting to a spreadsheet.
BLAST also has options to control stringency of match between subject and query. To see the options call up the help documentation for the type of blast you are running. For example:
$tblastn –help
For example, to run the same query, but change the e-value to 1 x 10E-10:
$blastn -query genome.fa -db reference –evalue 1e-10 -out blastresults.txt -outfmt 0
Blast types
Blastn / Nucleotide query to nucleotide databaseBlastp / Protein query to protein database
Blastx / Translated nucleotide query to protein database
Tblastn / Protein query to translated nucleotide database
Tblastx / Translated nucleotide query to translated nucleotide database
VI. For Virtual Box Users: Installing software in a Linux OS
One of the most important, but often most difficult parts of doing computer work in Linux is installing new software. Since there are a wide variety of UNIX varieties (aka distributions) and each individual machine can have a different set up, programs are often written on one distribution but may not work on another. UNIX has some support for this, but installation oftentimes depends on the user first installing other programs that are required for the installed program to work properly (aka dependencies). The user must then provide certain information on their system to the program being installed in order for it to run from any directory.