Supplementary information

Tissue processing and isolation of RNA. Total RNA was extracted using a Trizol reagent kit (Invitrogen, Carlsbad, CA, Belgium) according to the manufacturer’s protocol. The RNA was purified on RNeasy columns (Qiagen, Hilden, Germany) and its integrity was assessed by denaturing agarose gel electrophoresis and by staining of the 18S and 28S RNA bands for visualization. 5µg of total RNA from each sample were used for in vitro transcription amplification (Van Gelder et al., 1990), following the protocol published by Puskás et al. (Puskas et al., 2002), slightly modified: the incubation time for the transcription step was 5 hours instead of 3. This resulted in approximately 20 µg of antisense RNA (aRNA).

RNA labeling and hybridization. Microarray analyses were performed using commercial Human 1 cDNA Microarray slides (Agilent Technologies, Palo Alto, CA, USA). First-strand cDNA was synthesized from 4µg of aRNA with random primers (hexamers) and amino-allyl-dUTP (Sigma-Aldrich, Bornem, Belgium). Fluorescent dyes (Cy3 and Cy5) coupled with N-hydroxyl-succinimidyl-ester (Amersham Biosciences, Piscataway, NJ, USA) were used to label the tumor and the non-tumor samples, respectively. All hybridizations were replicated with dyes swapped to control for labeling biases. After stopping the labeling reaction with hydroxylamine 4M (Sigma-Aldrich, Bornem, Belgium), resin column purification with Microcon YM-30 (Millipore, Brussels, Belgium) and ethanol precipitation were realized. The samples were resuspended in 12.5 µl of water. All the tumor/non-tumoral tissue pairs (n=26) were hybridized on Agilent microarray slides according to the manufacturer’s protocol.

Pre-processing of microarray data. Microarrays were scanned with a Genepix 4000B scanner. Three scans at decreasing PMT gains were performed for each slide in order to reduce saturation and salvage low intensity spots (Dudley et al., 2002; Venet., 2003). The slide images were quantified with Genepix Pro version 5.0. The resulting fluorescence intensities were processed with functions from the R package for statistics, version 2.1 (R Development Core Team., 2004) and from the Bioconductor library version 1.6 (Gentleman et al., 2004). Spots below background, or flagged as “bad” by Genepix Pro, for which spot fluorescence cannot be distinguished from the background, were ignored in all pre-processing and analysis steps. After background subtraction, scans at multiple gains were merged with a custom procedure conceptually akin to the method of ref. (Dudley et al., 2002), resulting in an extended expression range. Expression ratios (tumor/normal) were then converted into base-two logarithms. Intensity-dependent and space-dependent biases were removed with the LOESS-based procedures of the marray package for R (Yang et al., 2002). Color-flip replicates were then averaged. Genes with expression measurement missing for >25% of the samples were not included in the analysis. If the number of missing measurements was <25%, missing data were replaced by 0 (i.e., no differential expression, a conservative assumption). Data are publicly available in the Gene Expression Omnibus database (www.ncbi.nlm.nih.gov/geo), accession number GSE3950.

Detection of differentially expressed genes. Jarzab et al. (Jarzab et al., 2005) data were downloaded from http://www.genomika.pl/thyroidcancer/PTCCancerRes.html, and Huang et al. (Huang et al., 2001) data from http://thinker.med.ohio-state.edu/. The later dataset was renormalized with the RMA procedure (Irizarry et al., 2003). Expression data were averaged in each of the three dataset across transcript mapping to the same gene as defined by its NCBI Entrez gene ID. Differentially expressed genes were detected within each dataset with the Significance Analysis of Microarray (SAM) procedure (Tusher et al., 2001) as implemented in the the samr package 1.22 for R, using 1000 permutation and default parameters. Genes with q<0.05 and an average fold change >1.5 in at least 2 datasets were selected.

Cross-study data set. Genes with q<0.05 and an average fold change >1.5 in at least 2 datasets were selected. In order to compute the significance of the resulting gene list, we adapted the resampling approach of Oncomine (Rhodes et al., 2004), a widely used meta-analysis database. The idea is to assess the size of the overlap between the three data sets under the null hypothesis that they are not correlated, i.e. we estimate the size of the three data set intersection expected by chance alone. Let Ni be the number of genes with q<0.05 and a fold change, fc>1.5, in data set i. We selected Ni genes randomly for each data set i, and then computed the number of genes selected in at least two permutation sets. This process was repeated 105 times for up and for down regulated genes. None of the runs gave genes lists as large or greater than those observed when genes are selected with the criterias q<0.05 and fc>1.5, thus p<10-5 for 451 gene up- and 233 down-regulated genes.

Real-time RT-PCR. Confirmation of the reliability of the reference gene list was performed by real-time RT-PCR on 11 to 20 tumor/non-tumoral samples for 15 selected genes. Additionally, 2 regulated genes in the Jarzab dataset were included. A 2µg aliquot of total RNA was incubated with DNase I (Ambion, Huntingdon, United Kingdom) and reverse-transcribed in a 100µl reaction volume with random primers and Superscript II (Invitrogen, Carlsbad, CA, USA). Approximately 20ng of reverse-transcribed RNA (based on the RNA concentration evaluated after the DNase treatment) were used in duplicate as a template for the PCR reaction in a total volume of 25µl consisting of 12.5µl of 2× reaction buffer and 0.75µl of Diluted SYBR® Green I (RT-SN2X-03T, Eurogentec, Liege, Belgium), forward and reverse primers (300nM as final concentration except for DGKI, 150nM). PCRs were performed on a ABI Prism® 7700 Sequence Detection System (Applied Biosystems) using the cycling profile recommended by the manufacturer’s protocol. All PCR efficiencies, obtained with 4 serial dilutions points (ranging from 20ng to 200pg), were above 90%. Amplification of the correct product was confirmed by agarose gel electrophoresis followed by a melting point curve determination. The multicomponent file obtained for the melting curve was analyzed with the “dissociation Curve” software (Applied Biosystems).

References

Dudley AM, Aach J, Steffen MA and Church GM. (2002). Measuring absolute expression with microarrays with a calibrated reference sample and an extended signal intensity range. Proc Natl Acad Sci U S A 99: 7554-7559.

Gentleman RC, Carey VJ, Bates DM, Bolstad B, Dettling M, Dudoit S et al. (2004). Bioconductor: open software development for computational biology and bioinformatics. Genome Biol 5: R80-

Huang Y, Prasad M, Lemon WJ, Hampel H, Wright FA, Kornacker K et al. (2001). Gene expression in papillary thyroid carcinoma reveals highly consistent profiles. Proc Natl Acad Sci U S A 98: 15044-15049.

Irizarry RA, Bolstad BM, Collin F, Cope LM, Hobbs B and Speed TP. (2003). Summaries of Affymetrix GeneChip probe level data. Nucleic Acids Res 31: e15-

Jarzab B, Wiench M, Fujarewicz K, Simek K, Jarzab M, Oczko-Wojciechowska M et al. (2005). Gene expression profile of papillary thyroid cancer: sources of variability and diagnostic implications. Cancer Res 65: 1587-1597.

Puskas LG, Zvara A, Hackler L, Jr. and Van Hummelen P. (2002). RNA amplification results in reproducible microarray data with slight ratio bias. Biotechniques 32: 1330-4, 1336, 1338, 1340.

R Development Core Team. (2004). A Language and Environment for Statistical Computing. R Foundation for Statistical Computing.

Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D et al. (2004). Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proc Natl Acad Sci U S A 101: 9309-9314.

Tusher VG, Tibshirani R and Chu G. (2001). Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A 98: 5116-5121.

Van Gelder RN, von Zastrow ME, Yool A, Dement WC, Barchas JD and Eberwine JH. (1990). Amplified RNA synthesized from limited quantities of heterogeneous cDNA. Proc Natl Acad Sci U S A 87: 1663-1667.

Venet D. (2003). MatArray: a Matlab toolbox for microarray data. Bioinformatics 19: 659-660.

Yang YH, Dudoit S, Luu P, Lin DM, Peng V, Ngai J et al. (2002). Normalization for cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Res 30: e15-