Running Local BLAST and Parsing Output

Computational Biology and Genomics Workshop

Todos Santos Center

April 18-22, 2016

Running Local BLAST and Parsing Output

Background. Identifying and aligning similar DNA and protein sequences is one of the most common tasks in computational biology, and BLAST (Basic Local Alignment Search Tool) still represents one of the most widely used and effective tools. While many users are familiar with running BLAST searches through the NCBI webserver, it is also possible and often preferable to download BLAST software and databases and run searches on a local computer or server. Advantages of running local BLAST include…

The ability to use customized databases including your own unpublished genome and transcriptome sequences that are not available on GenBank

The ability to submit searches for thousands of query sequences simultaneously (e.g., all contigs from a de novo transcriptome assembly)

Integration into custom scripts and pipelines

There are numerous specialized “flavors” of BLAST, but the following five programs representthe classic and most widely used methods. They differ based on whether the inputs are nucleotide or amino acid sequences and whether the alignments are based on nucleotide or (translated) amino acid sequences.

Program / Query Sequences / Database Sequences / Alignment
blastn / nucleotide / nucleotide / nucleotide
blastp / amino acid / amino acid / amino acid
blastx / nucleotide / amino acid / amino acid*
tblastn / amino acid / nucleotide / amino acid*
tblastx / nucleotide / nucleotide / amino acid*

*blastx, tblastn, and tblastx translate the input query and/or database sequences in all six possible reading frames before performing alignment.

Objectives. The goal of this exercise will be to make sequence comparisons between entire sets of proteins and entire genomes of closely related bacteria. This will involve generating databases and running searches with local BLAST, parsing the output with BioPerl scripts, and plotting data with R.

Software and Dependencies

BLAST+ Executables (installed and in your PATH)
Custom Perl scripts(dependent onBioPerl modules)
BLASTN_parse.pl
BLASTP_parse.pl
R
TextWrangler
Microsoft Excel

Protocol

Make BLAST databases. Prior to running a local BLAST search, you must first download or create a BLAST database. Familiar databases like nr can be downloaded directly from NCBI for use in local searches, but you can also create a custom BLAST database from any input file in FASTA format. In this exercise, we will make two BLAST databases. One will be made fromthe entire set of protein sequences from a particular strain of Escherichia coli, and the other will consist of a single nucleotide sequence of the entire E. coli genome.

First, open a Terminal sessionand enter the command to move into the directory with the relevant files for this exercise.

cd ~/TodosSantos/local_blast/

The directory contains FASTA files with E. coli protein sequences (Ecoli.proteins.fas) and the E. coli genome sequence (Ecoli.genome.fas). To convert these files into BLAST databases run the following commands.

makeblastdb-in Ecoli.proteins.fas-dbtypeprot

makeblastdb-in Ecoli.genome.fas-dbtypenucl

Note that you have now generated three files with additional extensions for each of the FASTA input files. These are the actual BLAST database files that will be used to run searches.

Ecoli.genome.fas

Ecoli.genome.fas.nhr

Ecoli.genome.fas.nin

Ecoli.genome.fas.nsq

Ecoli.proteins.fas

Ecoli.proteins.fas.phr

Ecoli.proteins.fas.pin

Ecoli.proteins.fas.psq

Run BLASTP search. Now use the BLASTP program to run a search for every protein sequence from a strain of Salmonella enterica(Salmonella.proteins.fas) against the E. coli protein database you just made. Enter the following command (all on one line).

blastp -query Salmonella.proteins.fas -dbEcoli.proteins.fas-evalue 1e-6 -num_threads 4 -out blastp.txt

Note that the -evalue option sets the significance threshold for reporting hits. This can be modified depending on how stringent/permissive you would like to be with your search. There are 5027 proteins in the Salmonella file, so this will run 5027 separate BLAST searches. The -num_threadsoption tells the program to use multiple cores (in this case 4) to make search go a little faster. It should take about 1 min. For more information on BLAST settings you can type: blastp -help

Once the search is complete, view the output file with the following command.

less blastp.txt

Summarize BLAST results by parsing output file with BioPerl script. The output text file from BLAST is easy to read, but when you do large numbers of searches it is far too much to view all of it. Extracting key information from BLAST output with Perl or Python scripts can be very valuable. Here, we will use a Perl script that replies on modules from BioPerl that make it easy to parse BLAST output.

./BLASTP_parse.pl blastp.txt > blastp.parsed.txt

You can view the parsed output file (blastp.parsed.txt) from the command line with less or by opening it in Microsoft Excel. Each line corresponds to a Salmonella protein and reports its top E. coli hit and the percent amino acid identity between the query and the hit. A portion of the output file…

prfA prfA 96.9529085872576

prfB prfB 95.3757225433526

prfC prfC 96.7924528301887

prfH prfH 82.7160493827161

prgH NO HIT NA

prgI NO HIT NA

prgJ NO HIT NA

prgK NO HIT NA

priA priA 90.5866302864939

priB priB 99.0476190476191

Note that we have only extracted a very small amount of information from the BLAST output (name of the query sequence, name of the top hit sequence, and percent identity of the top hit). But depending on your needs, you can extract essentially any of the information you could want from the file, including alignment lengths, positions, orientations, sequences, scores, etc.

Open the BLASTP_parse.pl script in TextWrangler. Review the Perl code and try changing line 25 from…

$value = $hsp_obj->percent_identity;

to…

$value = $hsp_obj->evalue;

If you save these changes and repeat the parsing of the BLAST output, you will find the parsed output file now reports the e-value for each hit rather than the percent amino-acid identity.

Run BLASTN search. Now try performing a nucleotide-based search to compare the entire Salmonella genome against the entire E. coli genome by entering the following command (all on one line).

blastn -task blastn -query Salmonella.genome.fas -dbEcoli.genome.fas -evalue 1e-20 -num_threads 4 -out blastn.txt

Note the use of the -task blastn flag. That may seem redundant because the program itself is BLASTN, but this program defaults to use a version of BLAST called MEGABLAST, which has very stringent default parameters (including a word size of 28). Therefore, for many or most applications it is necessary to either reduce the word size in MEGABLAST or to run classic BLASTN by using the task flag. Also note that we are using and unusually low e-value (1e-20). This will eliminate a large number of significant but short alignments from our output, which we are not interested in for this exercise.For more information on BLAST settings you can type: blastn -help

Once the search is complete, view the output file with the following command.

less blastn.txt

As above, we will use a BioPerl script to parse and summarize the BLAST output. But in this case, we are going to use a script that will report much more information about each alignment.

./BLASTN_parse.pl blastn.txt > blastn.parsed.txt

The parsed output file (blastn.parsed.txt) has numerous columns so open it in Microsoft Excel to view it. Note that the output file reports the positions of both the query and the hit for each BLAST alignment. Try using these to make a “dot-plot”.

Plot data in R. Launch R or RStudio. In the R console, enter the following command to import the parsed BLASTN output (all on one line):

blastnData = read.table("~/TodosSantos/local_blast/blastn.parsed.txt", sep = '\t', header=TRUE)

And then make an XY scatter plot showing the locations of the BLASTN hits in the two genomes with the following command. Note that the cex options simply sets the size of the points. The output should look like the image below.

plot (blastnData$Query_Start, blastnData$Hit_Start, cex = .25)