IETS 2016

Fertility and genomics: comparison of gene expression in contrasting reproductive tissues of female cattle

PA McGettigan1, JA Browne1, SD Carrington2, MA Crowe2, T Fair1, N Forde1,6, BJ Loftus3, A Lohan3, P Lonergan1, K Pluta2, S Mamo1, A Murphy4, J Roche3, SW Walsh1,7, CJ Creevey5,8, B Earley5, S Keady5, DA Kenny5, D Matthews5, M McCabe5, D Morris5, A O’Loughlin5, S Waters5, MG Diskin5, ACO Evans1,4,*

1School of Agriculture and Food Science, 2School of Veterinary Medicine, 3School of Medicine and Medical Sciences, 4Conway Institute, University College Dublin, Belfield, Dublin 4, Ireland

5Animal and Grassland Research and Innovation Centre, Teagasc, Athenry, Co Galway, Ireland.

6Present Address: Division of Reproduction and Early Development, Leeds Institute of Cardiovascular and Metabolic Medicine, University of Leeds, Leeds, United Kingdom

7Present Address: Department of Chemical and Life Sciences, Waterford Institute of Technology, Waterford, Ireland

8Present Address: Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, Ceredigion, United Kingdom

*Correspondence: A Evans, .

Abstract

To compare gene expression among bovine tissues we used large bovine RNAseq datasets comprising 280 samples from 10 different bovine tissues (uterine endometrium, granulosa cells, theca cells, cervix, embryos, leukocytes, liver, hypothalamus, pituitary, muscle) generating 260 Gbases of data. We used twin approaches of an information-theoretic analysis of the existing annotated transcriptome to identify the most tissue-specific genes, as well as a de-novo transcriptome annotation to evaluate general features of the transcription landscape. We detected expression of 97% of the Ensembl transcriptome with at least one read in one sample and between 28% and 66% at a level of 10 Tags Per Million (TPM)or greater in individual tissues. Over 95% of genes exhibited some level of tissue-specific gene expression. This was mostly due to different levels of expression in different tissues rather than exclusive expression in a single tissue. Less than 1% of annotated genes exhibited a highly restricted tissue-specific expression profile and approximately 2% exhibited classic housekeeping profiles. We conclude that it is combined effects of the variable expression of large numbers of genes (73 to 93% of the genome) with the specific expression of a small number of genes (less than 1% of the transcriptome) that contributes to determining the outcome of the function of individual tissues.

Keywords: RNAseq, transcription, bovine, reproduction, uterus, endometrium, embryo, cervix, ovary, follicle, theca, granulosa, hypothalamus, pituitary, leukocyte, muscle, liver

Short title: Gene expression in bovine reproductive tissues

1. Introduction

Recently, there has been a flood of development of new ‘omic’ technologies such as proteomics, transcriptomics and metabolomics that are enabling the generation of vast amounts of novel data characterizing different aspects of cellular biology at a global level. These technologies have been used in an attempt to better understand the development of various tissues with transcriptomics studies producing the most data. A major challenge of this new era is to determine the biological importance of these data in the context of cell and tissue function. In this paper, we focus on the large volumes of data that we have produced in our studies of bovine tissues in recent years, with a focus on reproductive tissues.

2. Development of reproductive tissues

While most tissues in the body are continually engaged in turnover, many reproductive tissues are actively engaged in vigorous periods of tissue (cellular) proliferation that are often followed by periods of dramatic and whole-sale tissue degradation and regression (this is especially true in female reproductive tissues). Cumulatively, the coordination of the proliferation and regression of tissues determines the success of reproduction and fertility. It is for this reason that many investigators have chosen to focus their attentions on the development of specific tissues, usually during narrow timeframes during the reproductive process, to understand their contribution to the outcome of reproduction.

In taking a step back from the mass of data available on individual tissues, it is interesting to consider some of the similarities and differences among tissues during their developmental processes. For example, some reproductive tissues are relatively static in mass (although not always function) with the hypothalamus and pituitary being clear examples. Other tissues develop over long periods of time; oocytes and follicles being the best examples, developing over months (or years depending on your point of view) with the most dramatic changes occurring in the final days before ovulation. Uterine tissues undergo changes during the reproductive cycle in preparation for pregnancy and then changes again over the months of gestation. Others tissues develop much more rapidly, with the early embryo and the corpus luteum undergoing changes in size and morphology that change them almost unrecognizably over a period of just a few days.

To better understand these changes, and the factors controlling them, many studies have focused on measuring the expression of genes and significant amounts of data have been generated during the last few years. One of the key technologies enabling this has been RNA Sequencing (RNA-seq). Since its initial development in 2006 (Bainbridge et al. 2006), RNA-seq has rapidly displaced gene expression microarrays for large-scale transcriptional profiling and is now the technology of choice in many laboratories. Some of the advantages of RNA-seq over microarrays include global profiling of transcripts including currently unannotated transcripts, identification of novel transcript isoforms as well as more accurate quantification of transcript levels. In practice, many researchers, including ourselves, have used the technology as a direct replacement for microarrays and tended to restrict their initial analyses to the known annotated transcriptome of the organism of interest (Mamo et al. 2011; Foley et al. 2012; Forde et al. 2012; O'Loughlin et al. 2012; Pluta et al. 2012; Walsh et al. 2012; Keady In preparation; Matthews In preparation).

The availability of a complete bovine genome sequence has enabled the application of the RNA-seq protocol to bovine samples (Elsik et al. 2009). We have previously published several RNA-seq-derived transcriptomic studies focusing on aspects of bovine reproduction, fertility and productivity traits under various experimental conditions (see Table 1). However, the question of the completeness of the current bovine transcriptome annotation, and characteristic gene expression differences between tissue types remain unaddressed. There are also other gaps in our knowledge of the transcriptome; for example, to what extent are genes universally or uniquely expressed in individual tissue types. There are also long-standing questions about the existence or otherwise of so-called “housekeeping genes” with consistent expression levels in all tissue types at all times (which could be used as reference genes for calibration of global expression studies).

In order to address these questions the aim of this study was to conduct a de-novo annotation of the bovine transcriptome using data from 280 bovine samples taken from 10 distinct tissue types and to compare it to the Ensembl bovine annotation (Ensemble 65) to identify novel bovine transcripts. In addition, the patterns of transcription between different tissues types were compared to identify genes with highly tissue-specific expression patterns and housekeeping genes.

3. Materials and Methods

Animal Handling

All animal procedures performed for the generation of tissue samples in this study were conducted under experimental license from the Irish Department of Health and Children in accordance with the Cruelty to Animals Act 1876 and the European Communities (Amendment of Cruelty to Animals Act 1876) regulation 2002 and 2005 with approval from individual institutional ethics committees.

Tissue sources

The sequencing data were generated from 10 different tissue types as part of 8 separate experiments, some of which have been published. The details of tissue type, cattle breed and number of samples are listed in Table 1. In brief, the samples consisted of 20 uterine endometrium samples from mixed breed beef heifers collected on Day 13 and Day 16 after estrus of which 5 samples at each time point were confirmed pregnant and 5 were non-pregnant (Forde et al. 2012). The follicular granulosa and theca cell samples were paired samples taken from dominant pre-ovulatory follicles of Holstein-Friesian cows and heifers (37 animals in total) at 3 stages of follicular development: selection, differentiation and luteinization (Walsh et al. 2012). The cervical tissues were taken from 30 mixed breed beef heifers at 6 time points during the peri-estrus period (Pluta et al. 2012). The embryo samples consisted of 28 pooled samples from mixed breed beef cattle taken at 5 different days post-fertilization (Mamo et al. 2011). The leukocytes were taken from 16 Simmental male beef calves at 4 different time points post-weaning, resulting in 55 samples in total (not all animal yielded 4 samples each) (O'Loughlin et al. 2012). The liver samples were taken from 12 early post-partum Holstein-Friesian dairy cows in either mild or severe negative energy balance (McCabe et al. 2012). The hypothalamus and pituitary samples were taken from 23 mixed breed beef animals (Matthews In preparation). The muscle samples were taken from the M. longissimus dorsi of 27 Aberdeen Angus steers undergoing nutritional restriction and compensatory growth (Keady In preparation). In some cases multiple samples were not collected from all individual animals giving rise to the actual number of samples contributing to the study as shown in Table 1.

RNA-seq library preparation

All samples were sequenced on a GAIIx sequencer (Illumina) by the Conway Institute Transcriptomics laboratory at University College Dublin, Ireland. All libraries were non-strand specific and were processed as single read libraries (with the exception of the muscle samples that were processed as paired-end libraries). The library type, read length, total numbers of reads generated for each library type and Gene Expression Omnibus (GEO) ID (Edgar et al. 2002) (where available) are listed in Table 1.

Alignment and preprocessing

FASTQ files (Cock et al. 2010) from each library were converted to the Sanger FASTQ format and were then aligned individually to the UMD3.1/BosTau6 bovine genome assembly using the software Tophat version 1.4.0 (Trapnell et al. 2009). Individual alignments, in bam format, from each library were merged together using the samtools merge command (Li et al. 2009). Finally, all combined tissue bam files were merged together into a single file. De novo transcriptome annotation was carried out on each individual tissue and on the combined dataset using cufflinks v1.1.0 (Trapnell et al. 2010). The Ensembl v65 annotation of the bovine genome was taken as the reference transcriptome (Flicek et al. 2012). Coordinates of repetitive regions were downloaded for the UMD3.1 assembly from UCSC genome browser (Kent et al. 2002). Introns were identified from the alignments with Python code utilizing the pysam library. Visualization of the alignments was carried out using the Integrated Genome Viewer (IGV) v2.0 (Robinson et al. 2011). Eval version 2.2.8 (Keibler and Brent 2003) was used to generate summary statistics for the de-novo transcriptomes. The BEDTools (Quinlan and Hall 2010) intersectbed program as well as the GenomicRanges (Aboyoun 2015) and rtracklayer (Lawrence et al. 2009) packages from R/Bioconductor were used to identify overlap of exons with genomic features such as repetitive regions or annotated Ensembl/refseq exons.

Quality Control

Density plots of the logged Reads Per Kilobase per Million (RPKM) (Mortazavi et al. 2008) levels for each gene in each sample were generated. Samples with read depth of less than 2 million reads in total, or with >80% non-unique mapping reads > 80% were excluded from further analysis. None of the 280 samples were excluded from further analysis using these QC criteria.

Initial quality control checks identified a bias in the data, exhibited by the paired-end muscle data which had a much-reduced percentage of non-unique reads compared to the single read libraries. In order to ensure the libraries were comparable with each other for the purposes of identifying differential expression, all FASTQ files were trimmed to a common length of 36 bp following the approach of Anders et al. (Anders et al. 2012). In addition, only the first read in the paired-end muscle data was used for differential gene expression analysis.

Dimensional reduction and clustering

Hierarchical clustering of the individual samples was carried out using the ColorDendrogram function from the sparcl R package. The Eisen distance metric was used as implemented in the MADE4 bioconductor package (Culhane et al. 2005). Principal Component Analysis plots were generated using the function prcomp in R.

Quantifying the diversity/dispersion of gene expression in a tissue

The Gini coefficient, a measure of the unevenness of the distribution of reads, was calculated using the Gini function in the R library reldist (Handcock et al. 1999). It is most commonly used in the social sciences as a measure of income inequality across different segments of a population. It is defined as twice the area between the 45 degree line and the Lorenz curve where, in this case, the Lorenz curve is a graph describing the cumulative share of total reads assigned to the bottom x% of the gene universe. A tissue with an exactly equal distribution of reads among all genes would have a Gini coefficient of zero and a tissue where all reads come from a single gene would have a Gini of 1. The total count of reads for each gene from all samples of each tissue was obtained and the Gini index for each tissue was calculated separately. The Gini index has been used by others to measure skewness in other aspects of transcriptomics such as PolyA length (Morozov et al. 2012).

Gene expression measures of Ensembl 65 annotated transcripts

HTseq (Anders and Huber 2010) was used to generate raw read counts for each gene in each library using the Ensembl 65 bovine annotation as the reference. These raw counts were converted into TPM (Tags Per Million) (Li et al. 2010) and RPKM metrics.

Categorical tissue specificity

Following the method of Schug et al. (Schug et al. 2005), tissue specificity of each gene was measured using the categorical tissue specificity metric (Qgt). Qgt weights genes according to the degree to which the expression of a gene is skewed towards a single tissue; it is based on the Shannon entropy of the gene Hg (Shannon and Weaver 1949).

The following calculation was used: given expression levels of a gene in N tissues, the relative expression of a gene g in a tissue t was defined as:

where wg,t is the expression level of the gene in the tissue. In this case, either TPM or RPKM can be used as the expression level as for the purposes of this calculation they are equivalent. To avoid division by zero in later calculations, a count of 1 is added to the raw counts for each gene in each sample before calculation of TPM and RPKM. The relative expression is then the RPKM of the gene in the tissue divided by the sum of RPKMs for that gene across all tissues.