In My Home Director on Cluster

Workflow for ChIP-seq & RNA-seq data analysis, GRS 20121

WORKFLOW FOR ANALYSIS OF CHIP-SEQ DATA INCLUDING ANALYSIS OF ENRICHED TRANCRIPTION FACTOR BINDING SITES

-Gavin Schnitzler 02/11/2013

1) QUALITY CONTROL ON SEQUENCING RESULTS

Open the fastqc_report.html file.

Check per base sequence quality. This should be in the green range throughout. Somewhat lower quality is expected in the first ~8bp (due to technical aspects of the sequencing method), which is why these basesare often excluded from the analysis. If other 5’ or 3’ end bases fall out of the green range you may also want to exclude these with the caveats that you’ll need at least ~25 bp to map to a mammalian genome & you will want to perform the same trim on all samples.

Check per sequence quality scores. Ideally there will be a sharp peak in the 30’s & relatively few reads below this.

Other metrics often don’t link properly to the .html report, but you can access each of these by opening the images subfolder of the fastqc results folder:

duplication_levels.png shows the number of exact duplicate reads. If most of your reads have a duplication value greater than 1 this indicates that you are sequencing multiples of the same fragment that were generated by the PCR amplification step in library production. Avoiding this is the major reason why you need to start with ~3ng of ChIP fragments and why you want to limit amplification to ~18 cycles. If there is a single curve centered around 2 or 3 your data should still be fine – but will be about as accurate as having 1/3 as many unique sequences.

The kmer content values in the .html report & kmer_profiles.png image describe certain short sequences that are over represented compared to chance predicted from GC content. Usually TTTTT & AAAAA show up & a few others, which are just normal characteristics of mammalian DNA. Don’t worry about this unless certain sequences show up 100s of times greater than background and/or the kmers in your input DNA sample differ greatly from those in your ChIP sample… in which case these kmers could be an indicator of contamination by very high copies of the same sequence – potentially a PCR fragment from some other experiment that contaminated your ChIP sample.

Per base GC contentsequence content .png files should show roughly straight lines with the exception of the first ~8 or 9 bp, which is expected (as mentioned above).

The per sequence GC content .png file should have a single peak centered over the expected %GC for the genome in question (~45% for mammals). Sharp peaks of any given GC content probably represent highly repeated sequences (resulting perhaps from contamination of your ChIP DNA with a PCR product of some specific sequence that was hanging around the lab). A second broad peak of higher GC content probably indicates bacterial DNA contamination. An example of contamination with a common soil bacterium that was probably growing in a ChIP buffer is shown here:

Note that none of these issues should completely prevent analysis of your data. For instance, non-mammalian sequences won’t map to the genome & over-amplified artifact sequences are removed by the recommended procedure of removing duplicate reads before calling peaks. These problems will, however, decrease your total number of reads and thus decrease the accuracy of peak calls. Similarly a library of too low a concentration will result in fewer than optimal numbers of sequenced clusters. What will prevent analysis is low sequence quality across all bases of reads, or high N (unknown base) calls, which will prevent any reliable mapping to your target genome. This sort of effect is rare & is likely to indicate a problem with the Illumina run itself, that you will want to contact your core about.

2) DOWNLOADING & UNPACKING ILLUMINA SEQUENCING RESULTS

#go to the web site provided by the core facility. Find the correct file ending in “fastq.gz”. Control click to get menu & select “copy link location”

#go to the Tufts computing cluster either using a shell program like TeraTerm or using “sftp ” from a Mac OSX shell.

#Navigate to an appropriate folder in your shared directory & do…

wget {pasted URL copied from website: e.g. ftp://gschnitzler:/120209-214/sequence_data_illumina/Unaligned/Sample_LiverE2_ERa_ChIP_man2.R1.fastq.gz}

#unpack using…

bsub gunzip FILE.fastq.gz

**Remember to create a backup of the QC and .fastq.gz files in someplace other than the cluster, such as an external hard drive.

3) CHIP-SEQ STEP 1: MAPPING READS TO TARGET GENOME WITH BOWTIE

#run bowtie using:

bsub -oo LiE_man2_bowtie.runinfo /cluster/shared/gschni01/bowtie-0.12.5/bowtie -n 1 -m 1 -5 8 -3 10 --best --strata mm9 FILE.fastq FILE.map

# “bsub –oo filename” submits the bowtie run as batch and records any output that would have been sent to the screen in the provided file. bsub is necessary for any job that will take more than a few seconds to run. “-n #” specifies the number of mismatches allowed in 1st 25 bp of read, “-m #” specifies the maximum number of different genomic locations a read can map to before being rejected, “-5 #” indicates the number of bp trimmed from 5’ end (8 generally advised), “-3 #” number of bp trimmed from 3’ end, “--best" & “--strata" are recommended parameters to find likely best fit in the genome.

#note, no commands to set path parameters are necessary to run bowtie, so long as the full path to the bowtie executable file is used. You can identify this path by going to that directory and typing “pwd”.

#Examine the bowtie.runinfo file & record the results, which will tell you what % of sequences mapped to the genome, what percent failed to map & what % were suppressed due to the --m parameter (e.g. mapping to more than one genomic location if --m is set to 1).

# To install bowtie go to: & follow instructions for installation. To install indexes for a genome of interest, right click on the link for the pre-assembled index you want (along right of page) & select copy link location. In UNIX go to the “indexes” folder in bowtie & do:

wget [pasted link location]

unzip [downloaded file]

4) CHIP-SEQ STEP 2: CONVERTING FROM MAP TO BED FORMAT

#MACs requires that the .map bowtie output be converted into a .bed file. Do:

awk 'OFS="\t" {print $4,$5,$5+length($6),$1,".",$3}' FILE.map > FILE.bed

# “\t” tells awk that entries within a line are separated by tabs, $4 means the 4th column, $5+length($6) means add the numeric value in the 5th column to the (text) length of the entry in column 6 & “.” means to insert a literal period in this tabbed position.

5) IDENTIFYING CHIP-SEQ PEAKS WITH MACS

#To prepare to run macs:

module load python/2.6.5

export PYTHONPATH=/cluster/shared/gschni01/lib/python2.6/site-packages:$PYTHONPATH

export PATH=/cluster/shared/gschni01/bin:$PATH

#note the paths used here depend on parameters specified when MACS was installed

#Running MACs using an input control (-c) and ChIP experimental (-t) files:

bsub -oo IPvINPUT.macsinfo macs14 --format=BED --bw=210 --tsize=33 --keep-dup=1 -B -S -c INPUT.bed -t IP.bed --name IPvINPUT

#--format=BED tells MACs that the input file is in .bed format, bw=210 tells MACs the expected size of sequenced fragments (before addition of linkers, which add an additional ~90 bp) from which value it attempts to build a model from sense and antisense sequence reads, --tsize=33 indicates that each read (after trimming in bowtie) is 33 bp long (not strictly necessary: MACS can figure this out on its own), --keep-dup=1 instructs MACS to consider only the first instance of a read starting at any given genomic base pair coordinate & pointing in the same direction – assuming that additional reads starting at the same base pair are due to amplified copies of the same ChIP fragment in the library (by default MACS estimates the number of duplicates that are likely to arise by linear amplification of all fragments from a limited starting sample, and sets the threshold to cut out replicate reads with a much higher number – likely artifacts), -B tells MACS to make a bedgraph file of read density at each base pair (which can be used to visualize the results on the UCSC browser) & -S tells MACS to make a single .bedgraph file instead of one for each chromosome, & finally --name gives the prefix name for all output files.

# Carefully scan through the MACS output file & examine the results for:

INFO @ Tue, 21 Feb 2012 18:30:07: #1total tags in treatment: 54992015

INFO @ Tue, 21 Feb 2012 18:30:07: #1tags after filtering in treatment: 38314984

#(reads left after keeping only one of each repeated sequence based on keep-dup=1)

INFO @ Tue, 21 Feb 2012 18:30:07: #1Redundant rate of treatment: 0.30

# also look at these numbers for your control file. A high redundant rate reduces your read count & your ability to detect peaks accordingly.

# MACS identifies + strand and – strand peaks, assumes that some of these come from left & right reads of binding site peak fragments & builds a model that determines the optimum separation between these peaks…

INFO @ Tue, 21 Feb 2012 18:31:19: #2 Build Peak Model...

INFO @ Tue, 21 Feb 2012 18:32:01: #2 number of paired peaks: 9916

INFO @ Tue, 21 Feb 2012 18:32:04: #2 predicted fragment length is 133 bps

# You ideally want >1000 peak pairs in the model, but MACS will run (giving a warning) so long as you have ~100. Importantly, the predicted fragment size should be within +/-10 or 20 bp of your estimate of the ChIP fragment size in your library (equal to your average final library fragment length minus the ~90 bp length of the adapters). If it is not, then the model is probably based on spurious read peaks resulting from, for instance, sonication sensitive sequences that get cleaved at high frequency.

#You can visualize the quality of the model as follows: copy the “.r” file generated by MACS (e.g. LiE_man2.33_v_Li_Input_bw210_dup1_model.r to your PC. Open R. Set the working directory to equal the folder with the .r file using:

setwd(“[file path]”)

#Then generate a viewable .pdf from the .r file using:

source(“filename.r”)

#This should show two broad peaks for plus strand & minus strand reads and a 3rd for the read density center called by MACs based on its predicted fragment length. Sharp peaks separated by less than 50 bp are peak call errors due perhaps to sonication-sensitive sequences. If you also see broad peaks flanking these sharp peaks it means that it may be possible to identify the real peaks if you adjust the MACS parameters. In particular the --mfold=X,Y parameter tells MACS to use peaks that are between X and Y fold over background in calculating the model (default 10 & 30). Sharp false peaks are often very high over background, so reducing the max --mfold can often help (e.g. --mfold=10,20). If signal for real binding events is not strong, you can also reduce the lower parameter (I’ve had good results sometimes with --mfold=7,12) – with the caveat that this is more likely to call true noise as a peak.

# Finally you get to the number of peaks called:

INFO @ Tue, 21 Feb 2012 18:38:19: #3 Finally, 55920 peaks are called!

# Then, as a control, MACS swaps treatment and control data, considering treatment as background and asking for the peaks detected in the control data.

INFO @ Tue, 21 Feb 2012 18:38:19: #3 find negative peaks by swapping treat and control

INFO @ Tue, 21 Feb 2012 18:39:08: #3 Finally, 1119 peaks are called!

#This is a measure of how many peaks result from random noise. Ideally you want “negative peaks” to be 10-fold or more lower than “positive peaks”. This is how MACS calculates empirical false discovery rate, so the lower the ratio of negative to positive peaks, the better your FDRs will be. If these values are roughly equal, this indicates that non-specific fragments in your ChIP DNA are more numerous than the fragments specifically brought down with your antibody & may suggest you need to optimize your ChIP conditions or try other antibodies.

#Keep tweaking --mfold ranges until you get the correct estimated fragment size & a high positive/negative peak ratio, if possible. Beware that the smaller the mfold range you allow the fewer qualifying peaks there will be. If the model won’t resolve, you can tell MACS to forget the model & just use your estimate of fragment size by adding these parameters to your MACS command “--bw=fragment_length --shiftsize=fragment_length/2 --nomodel" & hope that this gives a good positive/negative peaks ratio.

#To install MACS 1st go to: liulab.dfci.harvard.edu/MACS/& choose the download option & ctrl-click to copy the link location. If you need to login use default username=macs & password=chipseq

To download directly to the cluster, you will need to include login information in the wget command:

wget --http-user=macs --http-password=chipseq [paste url]

Follow the instructions for installation. To get it to work, you will probably need to change the default version of python with:

module add python/2.6.5

6) SUBSAMPLING BEFORE MACS, PLUSES and MINUSES

#I am told that the P-values and fold-enrichment values given by MACS are sensitive to the relative number of reads in the treatment vs control files, and if this number is very different (maybe > 1.5x different) MACs may not accurately report these numbers (even though the accuracy of peak calls should be mostly unaffected).

#One way to handle this is to run MACS with all of the data initially, and compare tags after filtering for treat and control. Then take a subsample of the dataset with the higher number of reads & re-run MACS. E.g. if control has 60M reads after filtering and treat has 20M, take a 33% subsample of the control & rerun MACS. The following small command-line perl script can be used to subsample reads, in this case 33% of lines in LiE_man2.bed are put into LiE_man2_33pct.bed.

perl -e 'open(F1,"FILE.bed");open(FH2,">","FILE_33pct.bed");while(<F1>){if(rand()>.666){print FH2 $_;} };close F1; close F2'

#This will not give an exactly equal number of reads, especially if % redundancy is high, but may get you close – especially if you try a few iterations. To get a truly equal number requires deduplicating the reads _outside_ of MACS and feeding MACS an equal number of these reads. Unfortunately, this is trickier than it might seem, involving converting your bed into a .sam file, running two samtools programs and then converting back to .bed.

#Be aware that subsampling is a delicate trade-off which increases the effective noise & reduces the accuracy of the subsampled dataset in order to improve MACS fold enrichment and p.value output accuracy.

7) COMPARING PEAK CALLS TO UNDERLYING READ DENSITY DATA

#If you included the -B & -S parameters in MACS it will create subfolders containing treat and control bedgraph (.bdg) files which map the read density base-pair per base-pair across the whole genome.

#Unfortunately MACS often extends predicted density a few bases past the UCSC-browser recognized chromosome ends, causing errors. To fix this you can modify and run a short perl program as follows:

vi trim_bdg_chrom_ends.pl

#cut and paste this content after adjusting the chromosome lengths to what the UCSC genome browser reports for the species and build you are using (you can change the number of base pairs per chromosome, add or subtract numbered chromosomes without problem, but you’ll need to make further adjustments to the program if you add a non-numbered chromosome other than X, Y and M):

=head1 Simple file to remove bed or bedgraph regions that excede chromosome ends in mouse mm9

=head1 Usage

Input: At command line, type:

> perl trim_bdg_chrom_ends.pl input_bed_or_begraph_file.bdg output_filename.bdg

=head1 Version Information

Gavin Schnitzler 6/29/2012

=cut

my %chromhash=(

chr1=>197195432,

chr2=>181748087,

chr3=>159599783,

chr4=>155630120,

chr5=>152537259,

chr6=>149517037,

chr7=>152524553,

chr8=>131738871,

chr9=>124076172,

chr10=>129993255,

chr11=>121843856,

chr12=>121257530,

chr13=>120284312,

chr14=>125194864,

chr15=>103494974,

chr16=>98319150,

chr17=>95272651,

chr18=>90772031,

chr19=>61342430,

chrX=>166650296,

chrY=>15902555,

chrM=>16299

);

#print "Chr1: $chromhash{chr1}\n";

open (FH1, "<", $ARGV[0]) or die ("Could not open input bed or begraph file $ARGV[0]\n");

open (FH2, ">", $ARGV[1]) or die ("Could not open output file $ARGV[1]\n");

while(<FH1>){

# chomp;

@tmp_dat = split(/\t/, $_);

if($tmp_dat[0] =~/track/){print FH2 $_; next;}

if($tmp_dat[2]>=$chromhash{$tmp_dat[0]}){next;}

print FH2 $_;

}

close FH1;

close FH2;

#finish editing in vi by

[esc]

:wq

#then run the program with:

perl trim_bdg_chrom_ends.pl filename.bdg outputfilename.bdg

#Next compress your file for uploading to the browser with:

gzip outpufilename.bdg

#copy this file to your desktop using WinSCP (in windows) or sftp within a shell window (in mac OSX)

#open the genome browser, select your genome, click add custom tracks, browse for the file name & hit upload (it will take some time, but should eventually finish). Do the same with your input control .bdg file.

#click add custom tracks again & upload the MACS result file ending with peaks.bed

#if you are comparing ChIP results from more than one condition or with more than one antibody you can upload .bdg and .bed files for each one.

#Now scan along any chromosome & examine the peaks MACS called. Are they believably above background? Make note of the coordinates of believable and non-believable peaks, then open the peaks.xls file made by MACS into Excel on your desktop. Is there a p-value or fold-enrichment threshold that separates most good peaks from most bad peaks? If so, you may want to apply this(these) threshold(s) to create your final list of peaks.

# Note that the height of the browser graphs for each track will be proportional to the number of reads after filtering that MACS used for your treat and control samples. If these differ considerably, you can adjust for this easily enough by just mentally multiplying the axis values for the sample with more reads by the ratio of [reads for the smaller sample]/[reads for the larger sample] (e.g. in the example above, you’d multiply the control sample axis values by 1/3).