Bioinformatics for Cell Biologists, 2011: Chip-Seq and RNA-Seq

Bioinformatics for cell biologists, 2011: ChIP-Seq and RNA-Seq

The programs in step 11 are slow and don't depend on previous steps, so you can start this step earlier and let it run.

While many steps use Excel, we would have used a programming language (Python in our case) to do these steps, as this is faster and allows more advanced calculations. But it takes longer to learn programming than spreadsheet programs.

1. Log into the server 130.237.142.51. You will need both a SSH client (PuTTy) for running programs, and an SCP client (WinSCP) for transferring files.

SSH will give you access to a Unix/Linux command line. Some useful commands:

cd folder (to change folder; cd .. to go up one level)

ls (shows the contents of the current folder)

mv source destination (for renaming a file)

cp source destination (copies a file)

rm filename (deletes a file; rm -r deletes a folder)

less filename (for reading a text file; q to exit, f and b to scroll)

mkdir folder (makes a new folder)

keys: ctrl+C (shuts down the running program), tab (auto-completes file name), arrow up (gives last command)

2. Under /media/quartz/danielr/exercise_data find SRX001942_1M.fastq and SRX015142_1M.fastq. These are shortened versions of files downloaded from the sequence read archive, and are named by accession ID. Go to http://www.ebi.ac.uk/ena/ and find what has been sequenced.

3. Look at one fastq file:

less /media/quartz/danielr/exercise_data/SRX001942_1M.fastq -> click q to quit

This file is in the Sanger FASTQ format. There is a description of it on http://en.wikipedia.org/wiki/FASTQ_format

4. Align with Bowtie.

(OBS! Start all runs from your own home folder!)

bowtie -m 1 -S -p 3 /media/quartz/danielr/Program/bowtie-0.12.7/indexes/hg19 input.fastq output.sam

Here, -m 1 makes bowtie only give uniquely mapping reads, -S gives SAM format output, -p 3 makes the program use 3 processors, and /.../hg19 is the indexed genome (downloaded from Bowtie's homepage, but could have been built with bowtie-build). One fastq file this size (1M reads) takes 3 minutes to align using a single processor.

5. Use samtools to convert to BAM format

samtools view -bS alignment.sam > alignment.bam

samtools sort alignment.bam alignment

samtools index alignment.bam

The sorting and indexing will be useful for visualization later.

Many peak calling programs want bed format, this conversion can be done with a program in the bedtools package: bamToBed <alignment.bam >alignment.bed

6. Call peaks with MACS

macs14 -t chipseq.bam -c control.bam -g hs

Look at NA_peaks.xls with less (it's tab-delimited plain text, despite the file ending), or copy it to your computer and import it in Excel.

Two fields are useful for sorting by: tags and FDR

The tags column is the peak height: the number of reads in the region. This is proportional to how often and strongly the transcription factor bound to the site, i. e. how many cells were captured in a state where the interaction occurred.

False discovery rate (FDR) is false positives/(true positives+false positives) at a particular cutoff. MACS calculates false discovery rate by inverting which sample uses an antibody with specificity and which is control, under the assumption that the two files have roughly the same number of reads (which is true in this case).

Sometimes it is also a good idea to filter based on fold enrichment over input, as peaks with a low enrichment (especially high peaks) is a sign of false positives.

7. Using Excel, filter the sites with a 5% FDR threshold and sort them by peak height. Copy the bam and bam index (.bam.bai) files to your computer and view the highest peaks in IGV. IGV wants positions in the format 'chromosome:start-end'.

8. Create a file in 3 column BED format with your filtered sites. This is a tab-delimited plain text file, where first column is chromosome, second is start and third is end coordinate.

See http://genome.ucsc.edu/FAQ/FAQformat.html#format1

9. Use PeakAnnotator to add putative target genes to peaks

create output directory (mkdir PeakAnnotator_out)

run PeakAnnotator to find gene with closest gene start to each peak (java -jar /media/quartz/danielr/Program/PeakAnnotator_Java_1.3/PeakAnnotator.jar -u TSS -a /media/quartz/danielr/Program/PeakAnnotator_Java_1.3/Data/Homo_sapiens.GRCh37.58.gtf -o PeakAnnotator_out -p peaksfromExcel.bed)

look at the output (less PeakAnnotator_out/NA_peaks.tss.bed)

10. Find enriched functional annotation using Genomic Regions Enrichment of Annotations Tool (GREAT)

Go to http://great.stanford.edu/public/html/index.php and enter your bed file with peak sites.

This tool compares the genes around the peaks to various gene lists to find such lists where there is above-background overlap. It performs two statistical tests, to make up for the deficiencies they have: one (the hypergeometric) of which accounts for biases in how far apart genes are (e.g. developmental genes are often in gene deserts), and another (the binomial) which only counts each gene once, thereby avoiding problems that come with small lists of peaks. The FDR is the multiple-testing corrected p-value, and fold enrichment is signal divided by expected.

11. Find gene expression values in the same cell type

The sample SRX000570 is from RNA-Seq for the same cell type.

11a. First specify a set of known junctions in 'Tophat format'. These link adjacent exons to each other, so that reads that come from two exons can be aligned. One way to generate these is with the UCSC genome browser's table browser. Here, choose output in BED format, use the refGene table under the track RefSeq Genes, group Genes and Gene Prediction Tracks. Click submit, and then specify that regions should be introns. Each line of this file contains genomic coordinates of the upstream exon end and downstream exon start.

To be able to use this file for Tophat, the file format needs to be changed (each start position needs to be subtracted by 1, and strand (+ or -) be moved to the fourth column). Since the file is too big to edit in one step in excel, we have already prepared a file in the right format for Tophat. You can find it here:

/media/quartz/danielr/exercise_data/hg19junctions_refGene.tophat.bed.

11b. Run tophat

mkdir tophatout

tophat -p 6 -o tophatout --no-novel-juncs -j /media/quartz/danielr/exercise_data/hg19junctions_refGene.tophat.bed /media/quartz/danielr/Program/bowtie-0.12.7/indexes/hg19 /media/quartz/danielr/exercise_data/SRX000570_7M.fastq

(the part before the fastq file in the genome index, same as with bowtie. --no-novel-juncs removes de novo intron discovery, which otherwise adds a lot of time)

11c. Run rpkmforgenes

rpkmforgenes is a program produced by Sandberg lab. It quantifies gene expression by calculating an RPKM-value for each gene. RPKM = total exon reads/(million mapped reads * exonlength (in kb))

rpkmforgenes -o geneexpr.txt -i tophatout/accepted_hits.bam -namecollapse -a /media/quartz/danielr/Program/rpkmforgenes/hg19_refGene.txt

12. Consider genes with >100 RPKM (reads per kilobase and million mappable reads, the gene expression metric in the output from rpkmforgenes) highly expressed. Make a list of such genes (use the sort function in Excel).

13. Look if your peaks are overrepresented around highly expressed genes

Import the output from PeakAnnotator, copy the gene list into an empty space on the sheet and use COUNTIF followed by sorting in Excel to split your peak list into peaks near highly expressed genes and peaks near other genes. Calculate the proportion of peaks close to highly expressed genes and compare to the proportion of highly expressed genes in the total gene set.

14. Find if any peaks overlap binding sites for Jun in the same cell type

Go to Gene Expression Omnibus, download the bed file with called peaks for GSM487425. In the bed file the peak location is shown with start and end of the peak. But in this analysis we only want the center of each peak, otherwise broad peaks will be more likely to show overlap than narrow peaks. Use excel to transform the files so that the start position is equal to the center of the peak and the end position is center position+1. Use windowBed in the BEDtools package to find Myc peaks and Jun peaks that are close to each other.

windowBed -a MycPeaks.bed -b JunPeaks.bed -w 500

The option -w 500 means that we will detect peaks from file b that are within 500bp upstream or downstream of peaks in file a.