Workshop hands on session: Basic RNA-Seqdata analysis with Tuxedo suite and ChIP-Seqdata analysis with MACS
Note: This is intended as a step by step guide for doing basic RNA-seq analysis with Illumina reads using the Tuxedo suite(Tophat2/Cufflinks/cummerbund). Due to the time and computation resource constraintsat the workshop,a subset of real datais used for the RNA-seqdata analysis. Therefore the results here are only for demonstration of workflow purpose.We are only using the default parameter setting. Pleases note that COPY & PASTE the commands may not work depending the computer you are using(font change in Linux). It is best to type the commands yourself.For tophat/cufflinks runs, please get on a compute node.
For details of parameter setting with Tophat/Cufflinks/cummeRbund, refer to:
Tophat2 manual:
Cufflinks manual:
Bowtie2 manual:
Samtools manual:
CummeRbundmanual:
For wget command, if you are not on a log in node, please change “hpc.oit.uci.edu”in the web links to login-1-2.
- Get reference genome: We are using Yeast as our organism here because of its small genome size. Download Yeast reference genome from UCSC genome browser download site (see class slides) to your computer.An easier alternative is to uswgetcommand on HPC for a direct download that eliminate the transfer from your local computer:
$ wget
This should downloadinto your current directory a zipped version of yeast genome sequences in FASTA format for each of the chromosomes. You can use ls to see it.If you are working with model organisms such as human or mouse, refer to class slides.
$gunzip chromFa.tar.gz
$tar -xvf chromFa.tar
Gunzip and tar should uncompress your downloaded file into 17 FASTA files for 17 chromosomes. Concatenate the17 FASTA files into one file as your reference genome using cat command.This way you do not have to index your 17 files individually. *.fa lists all files that end with .fa and > directs the output to a file called ref.fa. You can also use your own naming system. Just keep consistent with the names.
$cat *.fa > ref.fa
- Index reference genome: Use bowtie2 to index your reference genome ref.fa for alignment. Here –f indicates that your file is in FASTA format.ref is the prefix of the 6 output index files.
$ module load bowtie2/2.2.3
$ bowtie2-build -f ref.fa ref
This will take a couple of minutes. For mammalian genomes, it can take hours. After indexing, you should have six files of indexes now: ref.1-4.bt2 and ref.rev.1/2.bt2. Bowtie2 indexes the genome with a Burrows-Wheeler index to keep its memory footprint small. If you are interested in why you need to index your reference before mapping and what the .bt2 files mean, refer to
- Get annotation:Download Yeast reference genome annotation from UCSC genome browser table view (see class slides). To reduce the run time, download a sample annotation file in GTF formatwereprepared for the tutorial. This is a small subset of the available annotation so that your Tophat run will not take too long.
$ wget
- Get raw reads: Download thesample read files for condition 1 and 2, which are a small subset of the real files. It will take a few minutes depending on your internet speed. You should be able to use ls and lessto see them in your current directory. Notice it takes a couple of minutes to download just 8-9 million reads.
$ wget
$wget
- Short read alignment:Run Tophat2 with the download reads: s1.fastq for condition 1 and s2.fastq for condition 2 respectively. Here we use single read design to save run time. This should run much faster than a normal Tophat run, which could take hours to days. -T forces Tophat to only align the reads to the annotated transcriptome to save running time. –o specifies the output folder, c1 for condition1 and c2 for condition 2. –G specifies your annotation file.ref indicates the prefix of your 6 bowtie index files. You also need to load samtoolsand tophatin order to run tophat.
$ module load samtools/0.1.19
$ module load tophat/2.0.12
$ tophat2 -T -G ref.gtf -o c1 ref s1.fastq &
$tophat2 -T -G ref.gtf -o c2 ref s2.fastq &
Tophat run will take quite some time. You do not have to wait for the two Tophat runs to finish in order to moveonto the next step!
- Prepare mapping for visualization: Index and sort the 2 Tophat mappings using samtoolscommand so the mapping can be imported to Integrative Genomics Viewer (IGV) for visualization. First get the prepared BAM files instead of waiting for your Tophat runs to finish using wget. These two files should be the same as your tophat run output: c1/accepted_hits.bam and c2/accepted_hits.bam. You also need to sort and index your BAM files so IGV can quickly find the desired genomics location for visualization.
$ wget
$ wget
$ samtools sort s1.bam s1.sorted
$ samtools sort s2.bam s2.sorted
$ samtools index s1.sorted.bam
$ samtools index s2.sorted.bam
- Visualization:Visualize your indexed and sorted mapping in Integrative Genomics Viewer.
$ module load igv
$igv
Click File->load from file and choose s1.sorted.bam. Use the yeast genome SacCer2 as reference genome and go to chrIII:167,540-167,956(you can type it in the search box)and you should see your mapped reads! Load s2.sorted.bam in the same way. And both s1 and s2have one data track in your window. Mouse over and right click to change color scheme etc. Change your coordinates to wherever you like.
- Expression Quantification:Run cufflinks on the Tophat resulted BAM files to assemble transcripts and estimate abundance for each sample.–o specifies cufflinks output folder as cf1 for condition1 and cf2 for condition 2:
$ module load cufflinks
$ cufflinks-o cf1 s1.sorted.bam &
$ cufflinks-o cf2 s2.sorted.bam &
& means send the job to background. You should have two folders cf1 and cf2 for cufflinks output when you are done. Use
$ ls-ltrcf1
to see what is inside.
The file transcripts.gtfis your assembled transcripts. Genes.fpkm_tracking and isoforms.fpkm_tracking lists the FPKM values with its corresponding transcripts/genes ID and genomic location.
- Consolidate assembled transcriptome: Use cuffmerge to combine your assembled transcripts from all samples. The manifest.txt contains the path to the two assembled transcripts.gtf files. Use less command to take a look.
$ wget
$less manifest.txt
$ cuffmerge -s ref.fa manifest.txt
When the run is finished, you should have a folder generated by cuffmerge called merge_asm when you are done. There is a merged.gtf file in this folder which is your merged GTF format transcripts.
- Differential expression analysis: Use cuffdiffon the two Tophat mapping files to do differential expression analysis between condition 1 and 2. –o specifies the output directory of cuffdiff. –L gives labels to the 2 conditions as c1 and c2. –-FDR sets the false discovery rate threshold to 0.05. merged_asm/merged.gtf is the merged transcriptome annotation you generated in step 9. Note that this test data set does not have biological replicates. In reality, you usually will have a few replicates in each condition. Just list them in the manifest file.
$ cuffdiff -o cd_1_2 -L c1,c2 --FDR=0.05 merged_asm/merged.gtf s1.sorted.bam s2.sorted.bam
When cuffdiff is done, you should have a folder called cd_1_2 with your cuffdiff results.
$ ls -ltr cd_1_2
This will show you the results list. There are around 22 files in the folder and you should be mostly interested in those .diff files such as gene_exp.diff. These files are text files and can be directly imported to excel.
$ lesscd_1_2/gene_exp.diff
For details of cuffdiff output, see
------
For those of you with R basic knowledge or want more challenges, use the output results from cuffdiff you generated earlier (in the cd_1_2 folder) and follow the protocol here. A line-by-line instruction is included to show you how to generate graphs of the two samples.For details, see
You need to load Rfirst. You will also need x2go for graphics if you are using a windows laptop. Make sure you are in the directory where “cd_1_2” is. Change directory using cd.
$ module load R/2.15.2
$ R
library(cummeRbund)
cuff_data <- readCufflinks('cd_1_2')
Depending on the configuration on HPC, you might need to use the absolute path in the above command.
csDensity(genes(cuff_data))
csScatter(genes(cuff_data), 'c1', 'c2')
csVolcano(genes(cuff_data), 'c1', 'c2')
Quit R using quit().
For those of you who are interested in ChIP-Seqdata analysis and have extra time at the tutorial session, here is some sample data published by UCI researchers (SPEBP-1 ChIP-Seq Data)
Both the raw reads and processed data are accessible at this web. The paper that describes the dataset is here if you are interested:
Download the raw data from the first link using wget on HPC (this will take a few minutes)
$ wget
$ wget
Run MACS on HPC now.–g mm indicates the genome size is mouse. –t indicates the treatment file is srebp1_lane7_eland_multi.txt. –c gives the control file is IgG_lane8_eland_multi.txt. Use the perl command line to remove the “ref_” in front of the chromosome names so they are compatible with IGV genome names.
$module load enthought_python/7.3.2
$ module load macs
$ macs -t srebp1_lane7_eland_multi.txt -c IgG_lane8_eland_multi.txt -g mm
$perl -anle 's/ref_//g; print $_'NA_peaks.bed > NA_peaks1.bed
Your results are in the current folder. Use ls and less command to check them out! The BED files NA_peaks1.bed can be directly imported to IGV for visualization. Choose the mouse genome mm8 build. Check it out yourself!
For those of you with extra time and are up for a challenge, try writing a shell script that includes commands from section 1 to 10 and submit it the HPC Grid Engine using qsubcommand from the Linux session. Does your job run? This is your first RNA-seq pipeline!