Supplementary Tables:

Table S1 The downloaded Bombyx mori genome data

File name / File content / Number of sequences
glean_cds_on_chr.gff / CDS GFF file based on chromosomes / 13789 (num of CDS on chr)
silkworm_glean.gff *2 / CDS GFF file based on scaffolds / 14623 (num of CDS on scaf)
glean_cds.txt / CDS FASTA file / 14623 (num of coding CDS)
glean_pro.txt / Protein FASTA file / 14623 (num of coding pro)
integretedseq.txt / Chromosome FASTA file / 28 (num of chr)
SilkEST_GenBank.seq / EST FASTA file / 184509 (num of EST)
silkworm_ReAS_TE.gff *1 / TE GFF file / 1294270 (num of TE)

*1: ReAS, a de novo repeat annotation strategy, identified a total of 1668 types of repeat sequences, together with 17 known silkworm TEs in GenBank.

Overall, ~43.6% of the silkworm genome is occupied by TEs, compared with 16% in Anopheles gambiae, 1% in Apis mellifera and 2.7-25% in Drosophila melanogaster (S1).

*2: The approach used to annotate genes in the silkworm genome assembly:

a) A consensus non-redundant dataset of 14623 protein-coding genes was built with GLEAN by merging different gene datasets (S1).

b) Coding gene function (S1):

i) Finding homologs in other species: genes were searched with BLAST against the non-redundant databases downloaded from NCBI, with an E-value threshold of 1E-5. Homologs were found for 12246 (83.7%) genes.

ii) Finding protein domains: all silkworm genes were queried against the InterPro database. As a result, 8522 (58.2%) genes have 2509 kinds of known domains, and 5971 genes can be classified by GO terms.

iii) Identifying gene families: gene families were identified among B. mori, D. melanogaster, Aedes aegypti, A. gambiae, A. mellifera, Homo sapiens, Gallus, Fugu rubripes, and Caenorhabditis elegans using the TreeFam strategy. In total, 6669 silkworm genes are classified into 1779 gene families. ~400 families appear to be insect specific, of which 245 are silkworm specific.

c) Expression (S1):

i) 184509 EST and full cDNA sequences (the same number of ESTs as I downloaded; see the table above). ~9056 genes have ESTs under the thresholds of alignment length >100 and identity >80%. (In the EST analysis of my dataset I used stricter criteria; see my description in the SOM methods.)

ii) Xia et al. 2007 constructed a genome-wide microarray with 22987 silkworm gene probes covering the genes predicted from the old 6x draft genome sequences. In total, 10393 active transcripts were identified, including 1642 tissue-specific genes (Xia et al. 2007).

Table S2 The numbers of paralog pairs before and after masking TE-occupied regions

Non-masked / Inter-chromosomal M1 / Inter-chromosomal MM / Inter-chromosomal 11 / Intra-chromosomal M1 / Intra-chromosomal MM / Intra-chromosomal 11 / Total
TE / 1576 / 2083 / 511 / 145 / 385 / 134 / 4834
TE-free / 2304 / 9896 / 492 / 602 / 3466 / 524 / 17284
Masked / Inter-chromosomal M1 / Inter-chromosomal MM / Inter-chromosomal 11 / Intra-chromosomal M1 / Intra-chromosomal MM / Intra-chromosomal 11 / Total
TE* / 0 / 0 / 1 / 0 / 0 / 0 / 1
TE-free / 2054 / 8427 / 414 / 515 / 2896 / 384 / 14690

The numbers of putative paralog pairs in all movement categories dropped by at least one order of magnitude for TE-associated paralogs, but not for TE-free paralogs.

Table S3 Chi-square test of the difference between observed and expected movement patterns under more stringent criteria

a) Alignment of retrogenes and their parental genes ≥ 50% identity, 70% coverage; and at least one intron lost from the parental copies.

Direction / Expected % / Expected No. / Observed No. / Excess (%)
Z->A / 4.00 / 1.00 / 2 / 99.74
A->Z / 3.67 / 0.92 / 3 / 227.13
A->A / 92.33 / 23.08 / 20 / -13.35
df=2, P=0.0465, Pa=0.0397
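
The P value reported with df=2 can be checked with a short Pearson goodness-of-fit calculation. The following is a minimal Python sketch, not the script used for the analysis: the function name `chi_square_gof` is hypothetical, the expected percentages and observed counts are copied from the table in a), and the closed-form tail probability exp(-x/2) is valid only for 2 degrees of freedom. Because the tabulated percentages are rounded, the result should match the reported values only approximately.

```python
import math


def chi_square_gof(observed, expected_pct):
    """Pearson goodness-of-fit test of counts against expected percentages.

    Hypothetical helper for illustration; restricted to three categories
    (df = 2) so that the p-value has a simple closed form.
    """
    if len(observed) != 3 or len(expected_pct) != 3:
        raise ValueError("this sketch handles exactly three categories (df = 2)")
    n = sum(observed)
    total_pct = sum(expected_pct)
    # Expected counts: total number of pairs times the expected proportion.
    expected = [n * pct / total_pct for pct in expected_pct]
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of the chi-square distribution with 2 d.f.
    p_value = math.exp(-stat / 2.0)
    # Excess (%) as tabulated: relative deviation of observed from expected.
    excess = [100.0 * (o - e) / e for o, e in zip(observed, expected)]
    return stat, p_value, expected, excess


# Observed counts for Z-to-A, A-to-Z and A-to-A movements in table a).
stat, p_value, expected, excess = chi_square_gof([2, 3, 20], [4.00, 3.67, 92.33])
print(f"chi-square = {stat:.3f}, P = {p_value:.4f}")
```

The Excess (%) column corresponds to 100 * (observed - expected) / expected, which is why a category with more observed than expected movements has a positive value.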

b) Alignment of retrogenes and their parental genes ≥ 40% identity, 50% coverage; and at least two introns lost from the parental copies.

Parental gene exon No. ≥ 3
Direction / Expected % / Expected No. / Observed No. / Excess (%)
Z->A / 4.00 / 2.48 / 6 / 141.62
A->Z / 3.67 / 2.27 / 5 / 119.84
A->A / 92.33 / 57.24 / 51 / -10.91
df=2, P=0.0115, Pa=0.0175
Direction / Expected No. / Observed No. / Excess (%) / Pa
Z->A / 2.37 / 6 / 153.17 / 0.0307
A->A / 54.63 / 51 / -6.64
A->Z / 2.14 / 5 / 133.65 / 0.0626
A->A / 53.86 / 51 / -5.31

a. These P values were estimated with multinomial Monte Carlo simulation.
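
When expected counts are as small as those above, the chi-square approximation is unreliable, which is presumably why simulated P values are also reported. The sketch below shows one common way to obtain such a value. It is an illustration under stated assumptions, not the procedure actually used: the function names, the choice of the chi-square statistic as the test statistic, the 10000 replicates and the seed are all assumptions, so the estimate should be close to, but not identical with, the tabulated Pa.

```python
import random


def chi_square_stat(counts, expected):
    """Pearson chi-square statistic for one set of category counts."""
    return sum((c - e) ** 2 / e for c, e in zip(counts, expected))


def monte_carlo_p(observed, probs, n_sim=10000, seed=1):
    """Estimate a goodness-of-fit p-value by multinomial simulation.

    Hypothetical helper: draws n_sim multinomial samples under the expected
    proportions and counts how often the simulated statistic is at least
    as large as the observed one.
    """
    rng = random.Random(seed)
    n = sum(observed)
    expected = [n * p for p in probs]
    obs_stat = chi_square_stat(observed, expected)
    categories = list(range(len(probs)))
    hits = 0
    for _ in range(n_sim):
        counts = [0] * len(probs)
        for cat in rng.choices(categories, weights=probs, k=n):
            counts[cat] += 1
        # Small tolerance so ties with the observed statistic are counted.
        if chi_square_stat(counts, expected) >= obs_stat - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_sim + 1)


# Observed Z-to-A, A-to-Z and A-to-A counts from table b) and the
# expected proportions 4.00%, 3.67% and 92.33%.
p_sim = monte_carlo_p([6, 5, 51], [0.04, 0.0367, 0.9233])
print(f"simulated P = {p_sim:.4f}")
```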

c) The 44 pairs of retrogenes and their parental genes identified independently by TreeFam and by our approach, on which both agreed.

Direction / Expected % / Expected No. / Observed No. / Excess (%)
Z->A / 4.00 / 1.76 / 5 / 183.72
A->Z / 3.67 / 1.62 / 5 / 209.78
A->A / 92.33 / 40.62 / 34 / -16.30
df=2, P=0.0008540, Pa=0.0037
Direction / Expected No. / Observed No. / Excess (%) / Pa
Z->A / 1.62 / 5 / 208.35 / 0.022
A->A / 37.38 / 34 / -9.04
A->Z / 1.49 / 5 / 235.50 / 0.0160
A->A / 37.51 / 34 / -9.36

a. These P values were estimated with multinomial Monte Carlo simulation.

d) Chi-square test of the difference between observed and expected movement patterns for the 159 gene pairs in the non-conservative set

Alignment of retrogenes and their parental copies ≥ 30% identity, 40% coverage, and at least one intron lost from the parental copies.

Direction / Expected % / Expected No. / Observed No. / Excess (%)
Z->A / 4.01 / 6.37 / 12 / 88.43
A->Z / 3.67 / 5.83 / 7 / 20.02
A->A / 92.33 / 146.80 / 140 / -4.63
df=2, P=0.0630
Direction / Expected No. / Observed No. / Excess (%) / Pb
Z->A / 6.32 / 12 / 89.88 / 0.0210
A->A / 145.68 / 140 / -3.90
A->Z / 5.62 / 7 / 24.61 / 0.5520
A->A / 141.38 / 140 / -0.978

b. These P values were estimated with the chi-square test.

e) The 11 pairs of retrogenes and their parental genes shared between the set from Hahn's group and ours.

Direction / Expected % / Expected No. / Observed No. / Excess (%)
Z->A / 4.00 / 0.44 / 2 / 353.95
A->Z / 3.67 / 0.40 / 1 / 147.82
A->A / 92.33 / 10.16 / 8 / -21.23
df=2, P=0.0324, Pa=0.0831
Direction / Expected No. / Observed No. / Excess (%) / Pa
Z->A / 0.42 / 2 / 381.03 / 0.0622
A->A / 9.58 / 8 / -16.53
A->Z / 0.34 / 1 / 190.76 / 0.2952
A->A / 8.66 / 8 / -7.58

a. These P values were estimated with multinomial Monte Carlo simulation.

Table S4 Paired t-test comparing the expression level of retrogenes in testis with that in other tissues

Retrogenes / Ovary / Head.m / Head.f / Integ.m / Integ.f / Mpg_tube.m / Mpg_tube.f / Am_sg.m / Am_sg.f
P of Paired T test (vs. Testis) / 0.0429 / 0.0277 / 0.0445 / 0.0197 / 0.0150 / 0.1532 / 0.2251 / 0.0020 / 0.0041
Retrogenes / P_sg.m / P_sg.f / Fb.m / Fb.f / Mgut.m / Mgut.f / He.m / He.f
P of Paired T test (vs. Testis) / 0.0050 / 0.0068 / 0.0060 / 0.0145 / 0.2820 / 0.2822 / 0.0250 / 0.0603

m: male, f: female, Integ: integument, Mpg_tube: Malpighian tubule, Am_sg: anterior/median silk gland, P_sg: posterior silk gland, Fb: fat body, Mgut: midgut, He: hemocyte.

Table S5 Testing associations between expression and movement patterns: conservative set (Xia et al. 2007 normalization approach)

Alignment of retrogenes and their parental copies ≥ 40% identity, 50% coverage; and at least one intron lost from the parental copies.

Direction / Ovary biased / Non ovary biased
Z->A / 4 / 3
A->A / 10 / 44
Fisher's exact test p= 0.0425
Direction / Testis biased / Non-testis biased
A->Z / 4 / 1
A->A / 20 / 34
Fisher's exact test p=0.1476
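
The two P values in this table can be recomputed directly from the 2x2 counts. The following is a minimal Python sketch of a two-sided Fisher's exact test using only the standard library. The function name `fisher_exact_two_sided` is hypothetical, and the two-sided rule used here (summing all tables that are no more probable than the observed one) is an assumption about how the reported values were obtained, although it does agree with them for both comparisons.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Hypothetical helper: sums the hypergeometric probabilities of all
    tables with the same margins that are no more likely than the
    observed table.
    """
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2
    denom = comb(n, col1)

    def prob(k):
        # Probability that the top-left cell equals k given fixed margins.
        return comb(row1, k) * comb(row2, col1 - k) / denom

    lo = max(0, col1 - row2)
    hi = min(row1, col1)
    p_obs = prob(a)
    # Relative tolerance guards against floating-point ties.
    return sum(p for p in (prob(k) for k in range(lo, hi + 1))
               if p <= p_obs * (1 + 1e-9))


# Ovary-biased versus non-ovary-biased retrogenes, Z-to-A against A-to-A.
p_ovary = fisher_exact_two_sided(4, 3, 10, 44)
# Testis-biased versus non-testis-biased retrogenes, A-to-Z against A-to-A.
p_testis = fisher_exact_two_sided(4, 1, 20, 34)
print(f"ovary: p = {p_ovary:.4f}, testis: p = {p_testis:.4f}")
```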

Table S6 Testing associations between expression and movement patterns: non-conservative set

Alignment of retrogenes and their parental copies ≥ 30% identity, 40% coverage, and at least one intron lost from the parental copies.

Direction / Ovary biased / Non ovary biased
Z->A / 5 / 6
A->A / 25 / 105
Fisher's exact test p= 0.0561
Direction / Testis biased / Non testis biased
A->Z / 4 / 3
A->A / 54 / 76
Fisher's exact test p=0.4563

Table S7 Germline gene expression does not affect retrogene traffic

Type of genes / Total number of testis ESTs on the chromosome(s) / Number of genes on the chromosome(s)
Z-linked genes / 262 / 654
Autosome-linked genes / 5819 / 13135
Fisher's exact test p=0.1864
Type of genes / Total number of ovary ESTs on the chromosome(s) / Number of genes on the chromosome(s)
Z-linked genes / 328 / 654
Autosome-linked genes / 13369 / 13135
Fisher's exact test p<2.2e-16