This document illustrates a number of bioinformatic computer programs and what the user interface might look like.

I keep a bookmark list of sites I have used at

NCBI Entrez.

The National Center for Bioinformatics is the main repository for DNA and protein sequence analysis in the US. ( It's main interface for looking up sequences (and many other things) is called Entrez.

A list of the databases searchable by Entrez at NCBI

A search of the Protein section for human beta globin:

Suppose one wants to find a sequence for the human beta globin protein. Naively typing in "human beta gobin" retrieves 557 proteins, many of which are obviously not human from the summary view, and many of which are obviously not complete proteins from the length. Part of the problem is that NCBI's protein database is a conglomerate of many different databases. One section, "called GenBank", contains all protein sequences exactly as their authors submitted them. This will include partial sequences, and mutant sequences. Another division is called RefSeq. RefSeq is a database curated by NCBI staff where they have attempted to filter out variants, mutants, and bad sequences, leaving just one copy to represent this protein from humans. The link in the upper right will filter out just the RefSeq entries, but still will produce 94 of them. You can also notice which entries are RefSeq entries because their names always have an underscore in them (see entry number 1). The other problem is revealed by the expanded "Search details" box on the lower left. The phrase "human beta globin" is ambiguous. Did you mean you only want proteins that are named exactly "human beta globin", or only proteins from humans that are named exactly "beta globin" or all proteins from humans containing either the word "beta" or "globin", etc. Entrez decided you meant all proteins that are either from humans or mentioned "human" somewhere in the documentation, AND which also had the phrase "beta globin" somewhere in the documentation.

If you know the query language, or use the "Advanced" tab to help you build a query, you could tighten this query up to human[organism] AND "beta globin"[Protein Name]. That retrieves 177 entries, none of which, unfortunately, are from RefSeq. You can solve the remaining problem by noticing the protein name that was assigned to the beta globin entries that were from RefSeq in the original list above. NCBI's curators decided not to call this protein "beta globin"; they decided to call it "hemoglobin subunit beta".

If the search is conducted as human[orgn] AND "hemoglobin subunit beta"[Protein Name], then only two entries are found. One is the RefSeq entry. The other is the entry from another curated database known as Uniprot (previously Swiss Prot).

The RefSeq entry may have links to related 3D structures, and many other resources, particularly for human genes. Beware that sometimes NCBI's links go to an NCBI maintained clone of a resource, rather than the actual resource. In that case, the actual resource will often have more or better information than NCBI.

For example, the structure links go to NCBI's structure system which has fairly limited information. This is one particular structure link:

They have their own structure viewer, Cn3D, which you have to download and install.

By comparison, putting the accession 3SZK at the PDB site gives a much bigger selection of information. Partially this is because NCBI considers the sequence entry to be the top page, and has linked most of the information to the sequence entry, not the structure entry. PDB considers the structure description page to be the top page, and has most of it's information linked directly to that. However, it's often true that you'll find more, or better quality, information if you look through a broader range of sites.

As a case in point, suppose we were interested in ontological information about beta-globin. From the RefSeq entry for beta globin, it's not obvious that there's any path to ontology information. There is a third party link to an ontology resource that doesn't seem to have any ontology information. You can get to ontology information through the 3rd party link to ensemble. However, NCBI Entrez does have links into some of the underlying pathway resources that GO draws upon, and some of the links are not easy to find through GO itself.

If you click the Uniprot entry, you get NCBI's clone of the Uniprot entry. In the left panel, the Uniprot -> NCBI conversion did capture linked ontology information in the form of xrefs. But the panel on the right is not updated to indicate that author-provided information exists in the entry.

If you retrieve the Uniprot accession number (P68871) and put it in the actual Uniprot site ( you'll get a much better presentation of this information.

A more generic entrance to Gene ontology is through

Use the "set filters" button to narrow down the results.

Then look for the "associations" with the gene of interest.

One can then see the list of genes for a particular function, or see a diagram if the source was one that provides diagrams, or else some other indication of how the genes were grouped into this function.

The Reactome database has diagrams of gene products collaborating on functions organized by the Uniprot curators. If the overall process referenced is too large for a single diagram, a selection of diagrams will be shown. To search for the sought-after protein in this collection of diagram, expand all the details in the section on the left, and then use the find function to identify what diagram includes the sought-after term.

Note that the NCBI may have pathways resources linked other than those found at the GO consortium, such as wikigenes, or the KEGG database.

Online Mendelian Inheritance in Man

One excellent resource searched through NCBI is Online Mendelian Inheritance in Man (OMIM). This resource gives a summary of genetic studies on each human gene particularly including phenotypes caused by defects in the gene and a summary of animal knockout experiments.

Here's an example of it's use. I was preparing a lecture to a class about collagen, which had the usual litany of facts about how procollagen is hydroxylated on lysine and proline, then secreted, processed and crosslinked. The standard story goes that in vitamin C deficiency, the hydroxylation does not occur and this results in a disease called scurvy. Since failure to hydroxylate the proline leads to failure to fold or secrete the procollagen, that would seem to explain scurvy. But what's the hydroxylation on the lysine for? The lysines are involved in subsequent crosslinking. Various textbooks show the lysines undergoing crosslinking as hydroxylated or not, and tend to be silent about how important is it for the lysines to be hydroxylated. Textbooks all state that collagen lysine hydroxylation is required for glycosylation, but tend to be silent on what that's for.

Searching OMIM reveals that humans have not one but three genes for this enzyme. Inherited genetic disorders have been ascribed to defects for two of them. An animal knockout of the third one is an embryonic lethal with basement membrane disintegration. The variety of symptoms described all seem to track to the hypothesis that without the hydroxylation the lysines can crosslink, but can not mature to an irreversible crosslink. Reading through the case studies gives a much different impression of why we need to hydroxylate collagen lysines than you'll get by looking at a reaction pathway.

Public Reaction pathway databases include KEGG, Wikipathways, and Reactome.

A relatively new BlastP interface at NCBI lists resources attached to each sequence matched.

Some resources related to DNA sequencing.

Individual sequence reads retrieved from a sequencing facility for sequence confirmation purposes should come with the chromatogram. There is a need to visualize the quality of the chromatograms to discover if apparent discrepancies are real or sequence ambiguities.

The most common format is an ab1 file.

Here is a view into an ab1 file with a free downloadable utility called Chromas:

For assembling Sanger sequence reads of a new sequence, one should have an assembler that's sensitive to quality values. Here's an example of the phredPhrap system.

Using phredPhrap at UTHSCSA is similar to a number of other high powered systems. The program has been installed on the bioinformatics computer known as alamo, and is maintained by basically one user, in this case Steve Hardies. So, to use it, you'd have to get an 'account' on alamo. [An 'account' means a username and password. At this time, there is no fee to UTHSCSA personnel associated with using the UTHSCSA computing facilities.]

Although the phredPhrap programs are installed on alamo, you'd have to edit some files in your home directory system to be able to use it. For that, you'd want to consult with a person currently using the program.

A short list of things you might need to do:

  • Install a program on your PC called Putty allowing you to log into your alamo account.
  • Edit the program path into a login file.
  • Create a set of directories and subdirectories according to program specifications for your sequencing project.
  • Copy the chromatograms from the sequence supplier to the specified program directory.
  • Obtain the cloning vector sequence and place in your system as a fasta file.
  • Create an environment variable informing the program of the location of the vector file.
  • Edit a script copied from a program directory that says how to discover paired end information from the filenames of your reads.
  • Obtain a file of repetitive sequences corresponding to the organism from which your sequence was obtained.
  • Create an environment variable informing the program of the location of that file.
  • Install a graphics interface program on your PC called vnc viewer.
  • Start a vnc server instance in your account.
  • Log in with vncviewer.
  • Run the phredPhrap script [call bases, assigns quality values, overlaps reads]
  • Run consed [an assembly viewer] to evaluate if more reads are needed, or to output the consensus of the finished sequence.

Accessing consed (the phredPhrap assembly viewer) through vnc to examine the quality value for given bases in the consensus, and the underlying chromatograms.

For next generation sequencing, if contracted to a commercial provider, they may do much of the requested bioinformatics for you. If done with equipment at UTHSCSA, you will need to do your own processing. Again, this involves using a sequence of programs installed on the UTHSCSA bioinformatics computer, and getting in contact with users that know how to use that software.

There is a user group that one can join and query by e-mail to find an existing user that can help:

Dear Colleagues:

Thank you all again very much for coming to yesterday's meeting, which

was very productive thanks to everyone's input. I have followed through

with the first suggestion, which was to set up the NGS User group, and

installed an initial mailing list for the NGS users at UTHSCSA. The address

to post to this list is . We can use this list to

pick up the discussion regarding NGS implementation on our computational

and storage resources.

I am attaching below the list of users subscribed to this

list. Please check if I accidentally left anyone out that you

think should also participate. If so, either send me their name

and e-mail address, or ask them to sign up for themselves at:

You can also manage your own subscription at this website. All messages

sent to this list are also archived on the web, which you can access with

a password for your subscription that you can retrieve from this page by

clicking on "Unsubscribe or edit Options" at the bottom of this page.

Please keep in mind that any message sent to

will be broadcast to everyone who is subscribed, even when replying.

Also: Please keep in mind that your posting may be rejected if you do not

post from the subscribed address (see below). If you have multiple e-mail

addresses from which you post, please send me the alternatives so I can

add them to the allowed lists to posts and so you will not get any bounces.

I will soon send a couple of messages to this list with valuable feedback

from Patricia Dahia and Yidong Chen to make sure everyone has a chance

to respond.

Thanks again for your participation!

Regards, -Borries

---

Borries Demeler, Ph.D.

For RNA-SEQ type experiments, the first step is to match the reads against a reference sequence. We are currently using bowtie for that purpose.

Bowtie output:

409276219chr13035061375M*00CTGGGTATGCCTCGTAGTTAAAACATTCCTGGGAACATCTTGACCATAAGATAAAGGGGACTGTGAAGACATAGC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0

4092762125chr13035074175M*00GTAGTTAAAACATTCCTGGGAACATCTTGACCATAAGATAAAGGGGACTGTGAAGACATAGCAGGGCTATATTAT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0

247957189chr1314439625575M*00CTCGGTTTGTGTTTTTTCATGAGATGAAGATGGAGCGCGGTGGCTGCCAGAGAGATTAATTCGTCAGATGAACAC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1

2649845617chr1352331525575M=1373159320GTCAGAGTTGTATTGAGACTGGATCTCACTGTGTCGCTCTGACTGGTCTGGAATTCTCTACATAGCCCAAACTGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:3

135203859chr14570806375M*00GTACACTGTAGCCGTCCTCAGATACTCCAGAAGAGGGCATCAGATTTCGTTACAGATGGTTGTGAGCCACCATGT IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:1

305500761chr1467800225575M=1355030840TGCACAGCTACAACTGAATCTCACGGTAGGCCCGCTTCTCCACCAGCTTCACTTTGTATTTGCGCTTGAACTTGG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:4

3055007649chr1467815325575M=1355067580CAGTGTCCTGAGTGAGCCAGACAACTCAGAGCTGAGCTGCACATCAATGTCAGGAGCCTGGTACTTGAGCCGTCC IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:3

47674541chr1467873325575M=1363195530GTTGTAGGAGGCTCCTGCAGGAATCACCTCCACTGCAGGCACCTGGGAAGGCTTGATGTGGAGGCGTTGTGGCCG IIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIIII NM:i:0

This output gives the coordinates in the reference sequence where each read matches. This output will then be reprocessed depending on the purpose of the experiment.

There are numbers of programs with advanced capabilities that use bowtie as their search engine. For example, tophat processing bowtie output to detect reads that jump across introns.

tophat output vs. a segment of human chromosome I.

chr11376357731376358250

chr11376358251376359001

chr11376359001376359440

chr11376359441376359461

chr11376359461376359492

chr11376359491376359521

chr11376359521376359562

chr11376359561376359641

chr11376359641376359682

chr11376359681376359691

chr11376359691376359702

chr11376359701376359718

chr113763597113763597231

chr113763597213763597362

chr1137635973137635974112

chr1137635974137635975135

chr1137635975137635976141

chr11376359761376359771146

chr11376359771376359781148

chr11376359781376359791165

chr11376359791376359801296

chr11376359801376359811317

chr11376359811376359821326

chr11376359821376359831341

chr11376359831376359841342

chr11376359841376359851346

chr11376359851376359861371

Tophat differs from bowtie in that it can detect when reads jump across an intron:

chr1135001650135002590JUNC000000582+135001650135002590 255,0,0 2 42,41 0,899

chr1135030133135033309JUNC0000005968+135030133135033309 255,0,0 2 59,70 0,3106

chr1135169235135170476JUNC000000601+135169235135170476 255,0,0 2 33,42 0,1199

chr1135176956135178406JUNC000000611+135176956135178406 255,0,0 2 28,47 0,1403

chr1135280461135281068JUNC000000621-135280461135281068 255,0,0 2 40,35 0,572

chr1135503085135505538JUNC00000063188-135503085135505538 255,0,0 2 71,65 0,2388

chr1135505469135506365JUNC0000006475-135505469135506365 255,0,0 2 67,71 0,825

chr1135506291135506812JUNC0000006572-135506291135506812 255,0,0 2 30,72 0,449

chr1135518312135518761JUNC0000006651-135518312135518761 255,0,0 2 66,68 0,381

You might have thought that deep sequencing just to quantitate the level of different transcripts would have consisted of comparing reads to a preconceived set of spliced cDNA sequences. But most packages instead prefer to search the entire genome, and then apportion the reads into genes based on an annotation file specifying the coordinates of named genes in the genome. This presumably aids in dealing with alternative splicing by keeping the exon boundaries in site until the last steps. This presumably also allows for discover of previously unknown transcripts, or previously unknown exons.

The file format for known gene annotation is called gtf. .gtf files for popular model organisms can be downloaded from sites like ensemble or UCSC Genome browser. Illumina has a site with numbers of gtf files. For a novel organism, you might have to find a converter to construct your gtf file from some other format. Typically the process would go GenBank->gff->gtf. The format is:

Structure is as GFF, so the fields are:
<seqname> <source> <feature> <start> <end> <score> <strand> <frame> [attributes] [comments]

Here is a simple example with 3 translated exons. Order of rows is not important.

381 Twinscan CDS 380 401 . + 0 gene_id "001"; transcript_id "001.1";

381 Twinscan CDS 501 650 . + 2 gene_id "001"; transcript_id "001.1";

381 Twinscan CDS 700 707 . + 2 gene_id "001"; transcript_id "001.1";

381 Twinscan start_codon 380 382 . + 0 gene_id "001"; transcript_id "001.1";

381 Twinscan stop_codon 708 710 . + 0 gene_id "001"; transcript_id "001.1";

From

As with most advanced bioinformatics processes, the data will be processed through a succession of computer programs, possibly derived from different sources, and chosen to progressively condense the data towards some particular goal.

Here is an example where the deep sequencing data was processed several steps and then fed to an assembly viewer called Artemis.(

From the Artemis web site.

For RNA-SEQ data, one software package to proceed towards tabular differential gene expression data is called cufflinks.

From

Sample data to work through an RNA-SEQ data processing exercise can be retrieved from the NCBI Sequence Read Archive. The output above was derived from accession SRP003951 in that archive.

PBS execution system.

Alamo is a supercomputer. It has about 100 cpu's available at any given time for carrying out computations. Eight of them are referred to as the "head node". The head node carries out the initial interactions with users, and supports the interactive graphics vnc interface and other housekeeping operations. The other cpu's are organized into "compute nodes". The object is to distribute all the intensive computing operations to the compute nodes. The job execution system is called PBS. Jobs to be executed under PBS are embedded in a submission file of the following format. Then the job is submitted with the command qsub.

#!/bin/sh

#PBS -r n

#PBS -V

#PBS -l nice=19

#PBS -j oe

#PBS -l nodes=1:ppn=8

cd /home/hardies/E25P/RNAP

hostname

blastpgp -i T7RNAP.fa -d /mnt/glusterfs/hardies/db9/nr -j3 -b0 -v 2000 -o T7RNAP.psiblast -a8

The lines beginning with # are directives to PBS. They indicate that 8 cpu's are requested, and give information about how the job is to be conducted. The commands starting with cd /home... are the actual commands that would be typed to execute this job. This is an 3 round psiblast search of the local copy of the NCBI nr database. If the computer is busy, the job will be queued until the requested cpu's are available, and then executed.

Revisiting PCR primer software.

I asked you on the midterm to think about how various changes in the PCR procedure might influence the performance of the following oligo:

Proposed primer:

GACTCAGCGCTATTGCGCATGATC

4*GC + 2*AT Tm = 74C

By inspection, hairpin:

GACTCAGCGCTATTGCGCATGATC

primer dimer

GACTCAGCGCTATTGCGCATGATC 3'

||||

3'CTAGTACGCGTTATCGCGACTCAG

Let's look at how different software packages agree or disagree on the properties of this primer.

By Oligo Version 6 (There is a version 7 now).

By default the program started with usual PCR conditions, except with primer concentrations of 1 nM. No one does PCR with 1 nM primers. The 1nM TM was 67.7C. Changing that to 0.1 uM gave 74C. Maybe someone dialed in the 1 nM primer to cause the program to automatically discount the Tm by 5C ???