Online Supplementary Material

Higher human mutation rates where DNA and RNA polymerase can collide

Frank Grønlund Jørgensen and Mikkel H. Schierup

Department of Genetics and Ecology and Bioinformatics Research Center, C.F. Møllers Alle, bldg. 1110, Aarhus University, Denmark. Corresponding author: .

Materials and Methods

Alignment data. The 28-way MultiZ alignments in human reference coordinates (hg18) were downloaded from the UCSC genome browser [18]. All alignment columns with unambiguous bases in the human autosomes (hg18), chimpanzee (panTro2) and rhesus macaque (rheMac2) were kept and annotated for further analysis.
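The column filter described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name `keep_unambiguous_columns` is hypothetical, "unambiguous" is assumed to mean A, C, G or T, and lower-case (soft-masked) bases are assumed to count as the same base.

```python
# Assumption: an "unambiguous" base is one of A, C, G, T (no gaps, no N).
UNAMBIGUOUS = set("ACGT")

def keep_unambiguous_columns(human, chimp, macaque):
    """Return (human, chimp, macaque) columns where all three bases are A/C/G/T."""
    if not (len(human) == len(chimp) == len(macaque)):
        raise ValueError("aligned sequences must have equal length")
    kept = []
    # Upper-casing discards soft-masking; repeat status would be tracked separately.
    for h, c, m in zip(human.upper(), chimp.upper(), macaque.upper()):
        if h in UNAMBIGUOUS and c in UNAMBIGUOUS and m in UNAMBIGUOUS:
            kept.append((h, c, m))
    return kept

# Toy alignment: the columns containing '-' or 'N' are dropped.
columns = keep_unambiguous_columns("ACG-TN", "ACGATA", "AtGAT-")
```

In this toy alignment four of the six columns survive, including the one where macaque differs from human and chimpanzee.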

Annotations. Using the human genome annotation, all alignment columns were assigned to one of the following functional categories: coding exon on the plus or minus strand, intron on the plus or minus strand, UTR, and intergenic. Alignment columns were also classified as repetitive or non-repetitive based on human repeat masking. Only non-repetitive sequence was used in the analysis. The fine-scale recombination rate was retrieved from hapmap.org. Human SNP density in a given window was determined as the number of polymorphic sites found in a filtered version of dbSNP129. The filters restricted the SNPs used here to those validated either by HapMap or by at least two other validation types. The filtered SNPs were divided into three groups: transition SNPs (Ts), transversion SNPs (Tv) and unknown SNPs (3 or 4 alleles present). Only Ts and Tv SNPs are used in the analyses.
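The three-way SNP grouping can be illustrated with a short sketch. The function name `classify_snp` and the labels it returns are hypothetical; the only rules assumed are the standard definitions (a transition stays within purines A/G or within pyrimidines C/T, a transversion crosses between them), plus the rule above that sites with 3 or 4 alleles are "unknown".

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def classify_snp(alleles):
    """Classify a SNP as 'Ts', 'Tv' or 'unknown' from its observed alleles."""
    alleles = {a.upper() for a in alleles}
    # Anything other than exactly two standard bases is treated as unknown.
    if len(alleles) != 2 or not alleles <= (PURINES | PYRIMIDINES):
        return "unknown"
    if alleles <= PURINES or alleles <= PYRIMIDINES:
        return "Ts"  # purine<->purine or pyrimidine<->pyrimidine
    return "Tv"      # purine<->pyrimidine

labels = [classify_snp(a) for a in (("A", "G"), ("C", "T"), ("A", "C"), ("A", "C", "G"))]
```

Only sites labelled `Ts` or `Tv` would then be counted towards the per-window SNP densities.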

Putative replication origins. The chromosomal positions of 1067 putative human germ line replication origins were mapped to the human genome (hg18) using the liftOver tool from the UCSC genome browser. The 1067 origins correspond to the borders of 671 N-shaped domains (some N-domains share a replication origin). These domains cover ~25% of the genome and have GC content and a percentage of repetitive material similar to those of the whole genome (Table S1).
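The analyses group sequence by distance to the nearest putative origin. A minimal sketch of that lookup is given here, assuming origins are available as a sorted, non-empty list of positions on a single chromosome; the function names and the choice to label a bin by its lower edge are illustrative assumptions, not taken from the source.

```python
from bisect import bisect_left

def distance_to_nearest_origin(position, sorted_origins):
    """Distance (bp) from position to the closest origin on the same chromosome."""
    i = bisect_left(sorted_origins, position)
    # Only the origins immediately left and right of the position can be nearest.
    candidates = sorted_origins[max(i - 1, 0): i + 1]
    return min(abs(position - o) for o in candidates)

def distance_bin(position, sorted_origins, window=25_000):
    """Lower edge (bp) of the distance bin; the bin width is an assumption here."""
    return distance_to_nearest_origin(position, sorted_origins) // window * window

origins = [100_000, 500_000]           # hypothetical origin coordinates
d = distance_to_nearest_origin(320_000, origins)
b = distance_bin(320_000, origins)
```

With these hypothetical origins, position 320,000 lies 180,000 bp from the nearer origin and falls in the bin starting at 175,000 bp.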

Statistical analysis. Data were analyzed in windows of 25,000 base pairs. Summary statistics were recorded for all windows with at least 1,000 aligned base pairs. For all analyses except the linear regression, data from windows at equal distance from the nearest replication origin were summarized together. Substitution rates on the three-way phylogeny of human, chimpanzee and macaque were estimated using p-distance (more complicated substitution models that account for multiple and parallel substitutions were also tried and yielded similar results). The alignments were filtered for ancestral CpGs, and CpG sites and non-CpG sites were treated independently in the substitution rate estimations. The relationship of substitution rate or SNP density with distance to the nearest putative replication origin, GC content and recombination rate was tested using linear regression on the individual windows. Differences between strands were tested for each distance by calculating a test statistic from L and G, where L and G are some measure (e.g. dN, dS or SNP density) on the leading (L) and lagging (G) strand, and then combining the statistics from k windows into a single statistic. These statistics were considered Z-distributed when deriving P-values. Reported P-values were not corrected for multiple testing.
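The exact formulas for the per-window statistic and for combining k windows are not recoverable from this text, so the following is only one plausible sketch under stated assumptions: a two-proportion z statistic per window, signed so that it is positive when the lagging-strand rate is higher, and a Stouffer-style combination (sum of the k window statistics divided by the square root of k). All function and parameter names are hypothetical.

```python
import math

def window_z(lag_subs, lag_sites, lead_subs, lead_sites):
    """ASSUMED per-window statistic: two-proportion z, positive if lagging > leading."""
    g = lag_subs / lag_sites    # rate on the lagging strand (G)
    l = lead_subs / lead_sites  # rate on the leading strand (L)
    # Binomial standard error of the difference; zero if both rates are 0 or 1.
    se = math.sqrt(g * (1 - g) / lag_sites + l * (1 - l) / lead_sites)
    return (g - l) / se

def combine_z(z_values):
    """ASSUMED combination of k windows: sum(Z_i) / sqrt(k) (Stouffer's method)."""
    return sum(z_values) / math.sqrt(len(z_values))

z_one = window_z(60, 10_000, 40, 10_000)   # toy counts, not data from the study
z_all = combine_z([1.0, 1.0, 1.0, 1.0])    # four windows with Z = 1 combine to 2
```

Under the null hypothesis of no strand difference, a statistic combined this way is approximately standard normal, which matches the statement that the statistics were treated as Z-distributed.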

Supplementary Table 1

Summary statistics of the non-coding parts of the 3-way alignments of human, chimpanzee and rhesus macaque for the whole genome and the part covered by putative pairs of replication origins (N-domains).

Region / Sequence / Data (Mb) / %GC / Rec. rate / Human branch / Chimp branch / SNP Ts density / SNP Tv density
Whole genome / Non-repetitive / ~1201 / 41.6 / 1.394 / 0.0056 / 0.0058 / 0.0015 / 0.0007
Whole genome / Repetitive / ~957 / 39.8 / 1.278 / 0.0059 / 0.0065 / 0.0011 / 0.0005
N-domains / Non-repetitive / ~360 / 38.5 / 1.45 / 0.0056 / 0.0058 / 0.0015 / 0.0007
N-domains / Repetitive / ~255 / 40.6 / 1.32 / 0.0060 / 0.0065 / 0.0012 / 0.0006

Ts and Tv are the densities of observed SNP transitions and transversions, respectively.

Supplementary Table 2

Statistical tests for strand differences

Window / dS Z / dS P-value / dN Z / dN P-value / Intron Z / Intron P-value / SNP Z / SNP P-value
25 / 2.63 / 4.3E-03 / 1.41 / 7.9E-02 / 0.80 / 2.1E-01 / 0.60 / 2.7E-01
50 / 3.29 / 5.0E-04 / 3.06 / 1.1E-03 / 1.83 / 3.4E-02 / 4.57 / 2.4E-06
75 / 6.25 / 2.1E-10 / 2.21 / 1.3E-02 / 5.44 / 2.6E-08 / 4.59 / 2.2E-06
100 / 1.94 / 2.6E-02 / 0.16 / 4.4E-01 / 0.30 / 3.8E-01 / 4.10 / 2.1E-05
125 / 5.27 / 7.0E-08 / 2.90 / 1.9E-03 / 0.92 / 1.8E-01 / 6.54 / 3.2E-11
150 / 4.89 / 5.1E-07 / 3.36 / 3.9E-04 / 1.63 / 5.2E-02 / 4.62 / 1.9E-06
175 / 2.62 / 4.4E-03 / 4.20 / 1.3E-05 / -1.64 / 5.0E-02 / 3.38 / 3.7E-04
200 / 0.87 / 1.9E-01 / 2.09 / 1.8E-02 / -1.97 / 2.5E-02 / 4.25 / 1.1E-05
225 / 1.43 / 7.6E-02 / 0.69 / 2.4E-01 / 1.30 / 9.6E-02 / 0.17 / 4.3E-01
250 / 2.31 / 1.0E-02 / 0.26 / 4.0E-01 / 1.59 / 5.6E-02 / -0.03 / 4.9E-01
275 / 1.09 / 1.4E-01 / 4.80 / 8.0E-07 / 0.19 / 4.3E-01 / 2.56 / 5.2E-03
300 / 0.12 / 4.5E-01 / 0.00 / 5.0E-01 / 1.87 / 3.1E-02 / 1.49 / 6.8E-02
325 / 0.54 / 3.0E-01 / 0.57 / 2.8E-01 / -0.79 / 2.1E-01 / 1.17 / 1.2E-01
350 / 2.41 / 7.9E-03 / -0.32 / 3.8E-01 / 3.45 / 2.8E-04 / 2.65 / 4.0E-03
375 / -0.36 / 3.6E-01 / 0.57 / 2.9E-01 / -4.06 / 2.5E-05 / 0.32 / 3.7E-01
400 / 1.13 / 1.3E-01 / 0.53 / 3.0E-01 / -2.50 / 6.2E-03 / -3.78 / 7.9E-05
425 / 0.81 / 2.1E-01 / -2.15 / 1.6E-02 / 1.82 / 3.5E-02 / -0.46 / 3.2E-01
450 / -2.44 / 7.4E-03 / 0.17 / 4.3E-01 / -0.03 / 4.9E-01 / -1.03 / 1.5E-01
475 / 0.16 / 4.4E-01 / 1.88 / 3.0E-02 / 3.71 / 1.0E-04 / 0.74 / 2.3E-01
500 / 0.17 / 4.3E-01 / -0.64 / 2.6E-01 / 1.83 / 3.4E-02 / 1.12 / 1.3E-01
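
The tabulated P-values appear consistent with a one-sided tail of the standard normal distribution, P = Φ(−|Z|): Z = 0.00 gives 0.5, and negative Z values receive the same P as their positive counterparts. The sketch below reproduces this relationship with the standard library only; the function name `one_sided_p` is an illustrative choice, and small discrepancies against the table are expected because the tabulated Z values are rounded to two decimals.

```python
import math

def one_sided_p(z):
    """Upper-tail probability of a standard normal beyond |z|, i.e. Phi(-|z|)."""
    return 0.5 * math.erfc(abs(z) / math.sqrt(2.0))

# A few Z values taken from the table above (25, 50 and 175 windows).
for z in (2.63, 3.29, -1.64):
    print(f"Z = {z:5.2f}  P = {one_sided_p(z):.1e}")
```

For a two-sided test the same values would simply be doubled.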

Supplementary Figure 1 – Human-Chimpanzee Divergence

X-axis is the distance to the nearest replication origin measured in kb.

Figures are lettered from left to right, one row at a time. Subfigures A and B show the intergenic substitution rates for non-CpG and CpG substitutions, respectively. Subfigure C shows the S -> W and W -> S substitutions in green circles and purple squares, respectively. Subfigures D-I show the 12 possible types of non-CpG substitutions. D: G->T (Green) and C->A (Purple). E: C->T (Purple) and G->A (Green). F: C->G (Purple) and G->C (Green). G: A->T (Purple) and T->A (Green). H: A->G (Purple) and T->C (Green). I: A->C (Purple) and T->G (Green).

Supplementary Figure 2A – Human branch length (non-repetitive intergenic)

X-axis is the distance to the nearest replication origin measured in kb.

In all subfigures red triangles represent the lagging strand while blue squares represent the leading strand.

Figures are lettered from left to right, one row at a time. Subfigures A and B show the intergenic substitution rates for non-CpG and CpG substitutions, respectively. Subfigure C shows the S -> W and W -> S substitutions in green circles and purple squares, respectively. Subfigures D-I show the 12 possible types of non-CpG substitutions. D: G->T (Green) and C->A (Purple). E: C->T (Purple) and G->A (Green). F: C->G (Purple) and G->C (Green). G: A->T (Purple) and T->A (Green). H: A->G (Purple) and T->C (Green). I: A->C (Purple) and T->G (Green).

Supplementary Figure 2B – Human branch length (non-repetitive strand-specific patterns)

X-axis is the distance to the nearest replication origin measured in kb.

In all subfigures red triangles represent the lagging strand while blue squares represent the leading strand.

Subfigures A-C show substitutions in exons: dS, dN and dN/dS, respectively. For both dS and dN the rate is significantly higher on the lagging strand (Z=3.8915, P=4.98×10^-5 and Z=5.1493, P=1.31×10^-7, respectively). Subfigures D and E show the non-CpG and CpG substitutions in introns, respectively. Again, for the non-CpG substitutions the rate is significantly higher on the lagging strand (Z=8.2317, P=1.1×10^-16). For CpG substitutions the pattern is reversed (Z=-4.0539, P=2.52×10^-5).

Supplementary Figure 3A – Chimpanzee branch length (non-repetitive intergenic)

X-axis is the distance to the nearest replication origin measured in kb.

Figures are lettered from left to right, one row at a time. Subfigures A and B show the intergenic substitution rates for non-CpG and CpG substitutions, respectively. Subfigure C shows the S -> W and W -> S substitutions in green circles and purple squares, respectively. Subfigures D-I show the 12 possible types of non-CpG substitutions. D: G->T (Green) and C->A (Purple). E: C->T (Purple) and G->A (Green). F: C->G (Purple) and G->C (Green). G: A->T (Purple) and T->A (Green). H: A->G (Purple) and T->C (Green). I: A->C (Purple) and T->G (Green).

Supplementary Figure 3B – Chimpanzee branch length (non-repetitive strand-specific patterns)

X-axis is the distance to the nearest replication origin measured in kb.

In all subfigures red triangles represent the lagging strand while blue squares represent the leading strand.

Subfigures A-C show substitutions in exons: dS, dN and dN/dS, respectively. For both dS and dN the rate is significantly higher on the lagging strand (Z=3.5685, P=0.00018 and Z=8.5794, P<2.2×10^-16, respectively). Subfigures D and E show the non-CpG and CpG substitutions in introns, respectively. Again, for the non-CpG substitutions the rate is significantly higher on the lagging strand (Z=3.5287, P=0.0002). For CpG substitutions the pattern is reversed (Z=-3.9894, P=3.31×10^-5).

Supplementary Figure 4A – Macaque branch length (non-repetitive intergenic)

X-axis is the distance to the nearest replication origin measured in kb.

Figures are lettered from left to right, one row at a time. Subfigures A and B show the intergenic substitution rates for non-CpG and CpG substitutions, respectively. Subfigure C shows the S -> W and W -> S substitutions in green circles and purple squares, respectively. Subfigures D-I show the 12 possible types of non-CpG substitutions. D: G->T (Green) and C->A (Purple). E: C->T (Purple) and G->A (Green). F: C->G (Purple) and G->C (Green). G: A->T (Purple) and T->A (Green). H: A->G (Purple) and T->C (Green). I: A->C (Purple) and T->G (Green).

Supplementary Figure 4B – Macaque branch length (non-repetitive strand-specific patterns)

X-axis is the distance to the nearest replication origin measured in kb.

In all subfigures red triangles represent the lagging strand while blue squares represent the leading strand.

Subfigures A-C show substitutions in exons: dS, dN and dN/dS, respectively. For both dS and dN the rate is significantly higher on the lagging strand (Z=8.2483, P=1.1×10^-16 and Z=17.0431, P<2.2×10^-16, respectively). Subfigures D and E show the non-CpG and CpG substitutions in introns, respectively. Again, for the non-CpG substitutions the rate is significantly higher on the lagging strand (Z=3.9525, P<3.87×10^-5). For CpG substitutions the pattern is reversed (Z=-20.631, P<2.2×10^-16).

Supplementary Figure 5 – Human SNP density

X-axis is the distance to the nearest replication origin measured in kb.

The top-row left figure shows the density of transition SNPs in non-repetitive intergenic regions. The top-row right figure shows the density of transversion SNPs in the same regions.

Supplementary Figure 6 – The effect of GC content and recombination rate

Panels A, B and C.

Figures S6A and S6B show the relationship between GC content (X-axis) and human-chimpanzee divergence (S6A) and human SNP density (S6B). Both correlations are positive and highly significant (Pearson r=0.0577, P<2.2×10^-16 and Pearson r=0.0639, P<2.2×10^-16, respectively). Figure S6C shows the relationship between GC content (X-axis) and the log-transformed per-base fine-scale recombination rate (Y-axis). The correlation is positive and highly significant (Pearson r=0.3645, P<2.2×10^-16).

Supplementary Note 1

Linear regression analysis of the effect of GC content, recombination and distance to replication origin on human-chimpanzee divergence

We contrast two linear models, one with Log(recombination rate), GC content and distance to replication origin (Model 1) and one with only Log(recombination rate) and GC content (Model 2). Model 1 explains more than twice as much of the variance (4.6%) as Model 2 (2.0%).

Model 1: Human-Chimpanzee divergence ~ Log(recombination rate) + GC + Distance

Residuals:
       Min         1Q     Median         3Q        Max
-0.0103228 -0.0018457 -0.0002180  0.0015365  0.0935259

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              1.173e-02  2.569e-04  45.647  < 2e-16 ***
Log(recombination rate)  2.740e-04  1.425e-05  19.220  < 2e-16 ***
GC                       3.275e-03  4.218e-04   7.765 8.54e-15 ***
Distance                 1.799e-09  7.466e-11  24.092  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.003001 on 21497 degrees of freedom
Multiple R-Squared: 0.04567, Adjusted R-squared: 0.04554
F-statistic: 342.9 on 3 and 21497 DF, P: < 2.2e-16

Model 2: Human-Chimpanzee divergence ~ Log(recombination rate) + GC

Residuals:
       Min         1Q     Median         3Q        Max
-0.0102281 -0.0018976 -0.0002005  0.0015934  0.0931574

Coefficients:
                          Estimate Std. Error t value Pr(>|t|)
(Intercept)              1.355e-02  2.488e-04  54.486   <2e-16 ***
Log(recombination rate)  2.754e-04  1.444e-05  19.067   <2e-16 ***
GC                       4.133e-04  4.101e-04   1.008    0.314
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.003041 on 21498 degrees of freedom
Multiple R-Squared: 0.0199, Adjusted R-squared: 0.01981
F-statistic: 218.3 on 2 and 21498 DF, P: < 2.2e-16
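
The reported summaries of the two models can be cross-checked with the standard relationships between R-squared and the F-statistic. The sketch below uses only the R-squared values and degrees of freedom printed above; the function names are illustrative. It also computes a partial F for adding Distance to Model 2, a comparison not reported in the original output. For one added term this should roughly equal the square of that term's t value (24.092 squared is about 580).

```python
def f_from_r2(r2, n_predictors, residual_df):
    """Overall F-statistic of a linear model from its R-squared."""
    return (r2 / n_predictors) / ((1.0 - r2) / residual_df)

def partial_f(r2_full, r2_reduced, n_added, residual_df_full):
    """F-statistic for the terms added when going from the reduced to the full model."""
    return ((r2_full - r2_reduced) / n_added) / ((1.0 - r2_full) / residual_df_full)

f_model1 = f_from_r2(0.04567, 3, 21497)             # reported: 342.9
f_model2 = f_from_r2(0.0199, 2, 21498)              # reported: 218.3
f_distance = partial_f(0.04567, 0.0199, 1, 21497)   # compare with 24.092 ** 2
```

Agreement is only expected to within rounding, since the R-squared values are printed to four significant digits.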