Farmer et al.

RNA extraction and microarray hybridization

The EORTC 10994/BIG 00-01 study is an intergroup trial of neoadjuvant chemotherapy in which patients are randomized to a fluorouracil/epirubicin/cyclophosphamide regimen or a docetaxel/epirubicin regimen. Ethical approval for the trial and associated translational projects was obtained in all participating institutions, and the patients gave their informed consent for the study. All patients had one incisional or two trucut biopsies taken before starting treatment. Biopsies were embedded in OCT and frozen in isopentane/liquid nitrogen in the originating hospitals. Frozen sections were made at ISREC, examined by one pathologist (V. Becette) and excluded if the tumour cell content was below 20%.

RNA was purified from 4 x 25 µm sections with an RNeasy kit (Qiagen, Hilden, Germany), digested with DNase I, extracted with phenol/chloroform and precipitated with isopropanol. The quality and yield of the RNA were assessed with an Agilent Bioanalyzer (Agilent Technologies, Palo Alto, USA). 100 ng of total RNA was amplified by an Eberwine T7 procedure according to the Affymetrix small sample protocol, and labelled in a T7 reaction using biotinylated nucleotides (Enzo Life Sciences, Farmingdale, USA). Labelled RNA was hybridized to Affymetrix U133A chips using the EukGE-WS2v3 fluidics protocol (Affymetrix, Santa Clara, USA).

Reduction mammoplasty tissue from Lausanne University Hospital was embedded in OCT after removal of excess fat, then frozen in liquid nitrogen and sectioned as for the tumour biopsies. Ethical approval was obtained for the normal tissue study (W. Raffoul, M. Fiche and C. Brisken) and the patients gave informed consent for the use of their tissue. Histological examination confirmed the presence of lobules and ducts as well as stroma and fat in the reduction mammoplasty samples. For the normal tissue samples, the amount of input RNA was smaller (50 ng) and the fluidics protocol slightly different (EukGE-WS2v4_450). Given the differences in tissue origin and processing, the normal tissue data were included in the intrinsic gene set comparisons (fig 2) but excluded from the main analysis.

Measurement of AR repeat length

DNA was purified from the RNeasy flow-through using a DNeasy column (Qiagen, Hilden, Germany). cDNA was synthesized using Superscript II reverse transcriptase (Invitrogen, Carlsbad, USA) and random hexamer primers. PCR was performed with primers flanking the AR CAG repeat at amino acid 57 in exon 1. The PCR product was run on an ABI sequencer and analyzed with GeneScan software (ABI, Foster City, USA).

Data analysis

CEL data files were generated with Affymetrix MAS5 software and expression levels were extracted with rma in the affy R package version 1.3.27 (Gautier et al., 2004). The CEL files and rma-normalized data have been deposited in the NCBI GEO database with series accession number GSE1561. Unsupervised clustering was performed with Cluster and Treeview (Eisen et al., 1998), using Cluster 3.0 for Mac (M de Hoon, Tokyo) and Java Treeview 1.0.8 (A Saldanha, Stanford). Filtering to eliminate redundancy was based on the Gene Symbol reported in the 11.10.04 Netaffx (Affymetrix) annotation file. The parameters for clustering were: filter for standard deviation >0.5, median centre and normalize 3x, uncentred correlation, centroid linkage (normalization was omitted for sup fig 4 & 6). Cross-platform mapping was performed with the Netaffx and Cleanex (Praz et al., 2004) databases. The 500 genes in the Stanford intrinsic set mapped to 790 Affymetrix U133A probesets, of which 269 non-redundant probesets with a standard deviation of log2 intensities >0.5 after rma normalization were used for clustering in fig 2a.
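The filtering and similarity steps named in the clustering parameters can be illustrated with a minimal Python sketch. It is not the Cluster 3.0 implementation: the probeset names and values are invented, the sample standard deviation is an assumption (the software may define it differently), and the "normalize" step is left out.

```python
import math
import statistics

def sd_filter(rows, cutoff=0.5):
    """Keep probesets whose sample standard deviation across arrays exceeds cutoff."""
    return {name: vals for name, vals in rows.items()
            if statistics.stdev(vals) > cutoff}

def median_centre(vals):
    """Subtract the row median from every value in the row."""
    med = statistics.median(vals)
    return [v - med for v in vals]

def uncentred_correlation(x, y):
    """Like Pearson correlation, but the means are not subtracted first."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

# Hypothetical log2 expression values for three probesets on four arrays.
expr = {
    "probe_a": [7.0, 9.0, 8.0, 10.0],
    "probe_b": [5.0, 5.1, 5.0, 5.1],   # nearly flat, removed by the SD filter
    "probe_c": [10.0, 8.0, 9.0, 7.0],
}
kept = {name: median_centre(vals) for name, vals in sd_filter(expr).items()}
similarity = uncentred_correlation(kept["probe_a"], kept["probe_c"])
```

Because the rows are median centred before the uncentred correlation is taken, two probesets with mirror-image profiles such as probe_a and probe_c come out strongly anticorrelated.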

The coefficient of determination (r²) is plotted in fig 2b because it indicates the proportion of variation explained under a linear model. The sign of the underlying correlation coefficient is preserved to show which group the tumour belongs to.
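A signed coefficient of determination of this kind can be sketched as follows. The function names and the toy vectors are illustrative assumptions; the sketch does not specify which profiles were correlated for fig 2b.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def signed_r2(x, y):
    """r squared, carrying the sign of r."""
    r = pearson(x, y)
    return math.copysign(r * r, r)

# Hypothetical profiles: one positively and one negatively related to the reference.
reference = [1.0, 2.0, 3.0]
positive = signed_r2(reference, [1.0, 3.0, 2.0])
negative = signed_r2(reference, [2.0, 3.0, 1.0])
```

Squaring shrinks weak correlations towards zero, so the plotted value reads directly as a fraction of variance explained while the sign still separates the two groups.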

For the Kolmogorov-Smirnov (KS) test (Lamb et al., 2003), the ranking for similarity to AR or ESR1 expression was based on correlation. The ranking for group membership was based on Wilcoxon tests for pairwise comparisons of the three groups, using the probesets with a standard deviation >0.5 on the U133A chip after normalization in rma (apocrine: PF14, 19, 22, 23, 39, 46; luminal: PF 1, 2, 3, 4, 5, 7, 9, 10, 11, 12, 13, 15, 17, 21, 26, 27, 30, 32, 33, 35, 37, 40, 43, 45, 47, 49, 50; basal: PF 6, 16, 18, 20, 24, 25, 28, 29, 31, 34, 36, 38, 41, 42, 44, 48). A t-test was used to break ties. The AR target list includes 49 non-redundant genes induced at 24 hours in Nelson et al. (2002). The ER target list includes 22 non-redundant genes induced at 4 or 8 hours in Frasor et al. (2003). The mapping to Affymetrix probesets was performed with Netaffx. The p values were calculated by repeating the KS test 1000 times with randomized target lists (for similarity to AR or ER) or with randomized sample labels in the Wilcoxon test (for group membership). The KS tests were performed in R.
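The permutation scheme for the target-list case can be sketched in Python (the analysis itself was run in R). The running-difference score below is one common one-sided KS-type statistic and is an assumption; the exact statistic of Lamb et al. (2003) may differ. Gene names, list sizes and the seed are invented for illustration.

```python
import random

def ks_score(ranked_genes, targets):
    """Largest value of (fraction of targets seen - fraction of non-targets seen)
    while walking down the ranked list from the top."""
    targets = set(targets) & set(ranked_genes)
    n_hit = len(targets)
    n_miss = len(ranked_genes) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("need both target and non-target genes in the ranking")
    hits = misses = 0
    best = 0.0
    for gene in ranked_genes:
        if gene in targets:
            hits += 1
        else:
            misses += 1
        best = max(best, hits / n_hit - misses / n_miss)
    return best

def permutation_p(ranked_genes, targets, n_perm=1000, seed=0):
    """Fraction of random target lists of the same size scoring at least as
    high as the real list."""
    rng = random.Random(seed)
    observed = ks_score(ranked_genes, targets)
    k = len(set(targets) & set(ranked_genes))
    as_extreme = sum(
        ks_score(ranked_genes, rng.sample(ranked_genes, k)) >= observed
        for _ in range(n_perm)
    )
    return observed, as_extreme / n_perm

# Hypothetical ranking of 100 genes with five targets at the very top.
ranking = ["g%d" % i for i in range(100)]
observed, p_value = permutation_p(ranking, ranking[:5])
```

For the group-membership case the same loop would instead reshuffle the sample labels and recompute the Wilcoxon-based ranking on each pass, which is not shown here.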

Gene ontology (GO) analysis was performed with the Isrec Ontologizer version 0.1 (Io, www.io.isb-sib.ch, P. Anderle and T. Sengstag, ISREC). The test list contained the top 1% of probesets best defining the apocrine group in a Wilcoxon test (PF14, 19, 22, 23, 39 & 46 vs all other tumours, using all 22215 probesets on the U133A chip after normalization in rma). A t-test was used to break ties. The table lists GO terms with a corrected p value < 0.05 in the Biological Process division of GO at a depth of 4. The raw p values were calculated by Fisher’s exact test and multiplied by 465, the number of tests performed at GO depth 4, to give the corrected p values. GO depth 4 was used to avoid testing very small numbers of hits.
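The correction described above is a Bonferroni adjustment of a Fisher exact p value, which can be sketched as follows. The one-sided (enrichment) form of the test, the cap at 1.0 and the example counts are assumptions; the Ontologizer may implement the test differently.

```python
from math import comb

def fisher_enrichment_p(hits_in_list, list_size, hits_on_chip, chip_size):
    """One-sided Fisher exact test: probability of drawing at least
    hits_in_list annotated probesets in a random list of list_size."""
    total = comb(chip_size, list_size)
    upper = min(list_size, hits_on_chip)
    tail = sum(
        comb(hits_on_chip, k) * comb(chip_size - hits_on_chip, list_size - k)
        for k in range(hits_in_list, upper + 1)
    )
    return tail / total

def bonferroni(p, n_tests=465):
    """Multiply the raw p value by the number of tests (capped at 1.0 here)."""
    return min(1.0, p * n_tests)

# Hypothetical GO term: 100 of 22215 probesets annotated, 8 of them among
# the 222 probesets in the test list (about 1% of the chip).
raw_p = fisher_enrichment_p(8, 222, 100, 22215)
corrected_p = bonferroni(raw_p)
```

A term would be reported under this scheme only if `corrected_p` falls below 0.05.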

The Luminal-Apocrine-Basal (LAB) gene set was defined by Wilcoxon test for pairwise comparison of the following groups: apocrine: PF14, 22, 23, 39 & 46; luminal: PF 1, 2, 3, 4, 5, 7, 9, 10, 11, 12, 13, 15, 17, 21, 26, 27, 30, 32, 33, 35, 37, 40, 43, 45, 47, 49, 50; basal: PF 6, 16, 18, 20, 24, 25, 28, 29, 31, 34, 36, 38, 41, 42, 44, 48, using the probesets with a standard deviation >0.5 on the U133A chip after normalization in rma. PF19 was omitted from the list because it is an outlier in the clustering studies (sup fig 3) and only has grade 2 apocrine features histologically (table 2a). The Affymetrix data were genewise mean-centred and scaled for comparison with data from other platforms. The LAB gene set is given in supplementary table sheet 2, which ranks the most discriminant genes for each pairwise comparison (e.g., apocrine vs luminal) by Wilcoxon U statistic. The list includes 400 probesets for each comparison, so that more genes can be used in those cases where few of the most highly ranked genes for a particular comparison could be identified in other data sets. After mapping, the number of probes selected for assignment of tumours to classes was 90 per pairwise comparison in each data set. To assign class membership, the decision rule was that the Spearman correlation coefficient should exceed the threshold at which a class would be assigned in <1% of tests with scrambled data. If more than one class exceeded this threshold, a class was assigned only if the correlation coefficient exceeded the next best value by the same threshold amount.
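The decision rule can be sketched as follows, assuming each class is represented by a template profile over the selected probes and that scrambling means permuting the gene order of the tumour profile. Both are assumptions, as are the function names, the six-gene profiles and the seed.

```python
import math
import random

def ranks(values):
    """1-based ranks, with ties given their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    out = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            out[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return out

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sxx = sum((a - mx) ** 2 for a in rx)
    syy = sum((b - my) ** 2 for b in ry)
    return sxy / math.sqrt(sxx * syy)

def scrambled_threshold(profile, template, n_perm=1000, alpha=0.01, seed=0):
    """Correlation exceeded by fewer than alpha of the scrambled profiles."""
    rng = random.Random(seed)
    null = []
    for _ in range(n_perm):
        shuffled = list(profile)
        rng.shuffle(shuffled)
        null.append(spearman(shuffled, template))
    null.sort()
    return null[n_perm - int(alpha * n_perm)]

def assign_class(profile, templates, threshold):
    """Return the best class name, or None if the rule is not satisfied."""
    scores = sorted(((spearman(profile, t), name)
                     for name, t in templates.items()), reverse=True)
    best_score, best_name = scores[0]
    if best_score <= threshold:
        return None
    n_passing = sum(score > threshold for score, _ in scores)
    if n_passing > 1 and best_score - scores[1][0] <= threshold:
        return None   # ambiguous: the margin over the runner-up is too small
    return best_name

# Hypothetical class templates and tumour profile over six genes.
templates = {
    "apocrine": [3.0, 2.0, 1.0, -1.0, -2.0, -3.0],
    "luminal": [-3.0, -2.0, -1.0, 1.0, 2.0, 3.0],
    "basal": [1.0, -1.0, 2.0, -2.0, 3.0, -3.0],
}
tumour = [2.5, 2.1, 0.9, -0.8, -2.2, -2.9]
threshold = scrambled_threshold(tumour, templates["apocrine"])
call = assign_class(tumour, templates, threshold)
```

Returning None rather than forcing a call mirrors the rule that a tumour is left unassigned when no class passes the threshold or when the two best classes are too close.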

Eisen, M.B., Spellman, P.T., Brown, P.O. & Botstein, D. (1998). Proc Natl Acad Sci U S A, 95, 14863-8.

Frasor, J., Danes, J.M., Komm, B., Chang, K.C., Lyttle, C.R. & Katzenellenbogen, B.S. (2003). Endocrinology, 144, 4562-74.

Gautier, L., Cope, L., Bolstad, B.M. & Irizarry, R.A. (2004). Bioinformatics, 20, 307-15.

Lamb, J., Ramaswamy, S., Ford, H.L., Contreras, B., Martinez, R.V., Kittrell, F.S., Zahnow, C.A., Patterson, N., Golub, T.R. & Ewen, M.E. (2003). Cell, 114, 323-34.

Nelson, P.S., Clegg, N., Arnold, H., Ferguson, C., Bonham, M., White, J., Hood, L. & Lin, B. (2002). Proc Natl Acad Sci U S A, 99, 11890-5.

Praz, V., Jagannathan, V. & Bucher, P. (2004). Nucleic Acids Res, 32 Database issue, D542-7.