A central challenge in subfamily reconstruction is the accurate determination of subfamilies, which are defined by shared diagnostic mutations (Han et al. 2005; Konkel et al. 2010). A diagnostic mutation arises when a source element acquires a non-debilitating mutation (most commonly a SNP, although deletions and insertions also occur) and subsequently transmits it to its offspring elements. Other mutations arise randomly after insertion or, less commonly, during replication or target-primed reverse transcription (TPRT). Distinguishing diagnostic from random mutations is particularly difficult for older elements, which have had more time to accumulate substitutions. The problem is most pronounced at CpG sites, which generally mutate at a faster rate (Bird 1980; Nachman and Crowell 2000; Kong et al. 2012). CpG sites within Alu elements accumulate mutations about six times faster than non-CpG sites (Xing et al. 2004). At the same time, younger subfamilies tend to have accumulated more diagnostic mutations. Thus, a minimum number of elements per subfamily, and more than one independent substitution, are commonly required for subfamily classification (Price et al. 2004; Han et al. 2007). However, diagnostic mutations are acquired sequentially, one at a time. Consequently, requiring several substitutions may result in a simplified subfamily structure in which some intermediate subfamilies are absent. In addition, the youngest subfamilies may not be identified, and the average age of subfamilies is skewed toward more ancient propagation. To address these challenges, Coseg uses a multistep approach in which, in downstream analyses, single substitutions and intermediates between subfamilies are allowed under certain circumstances (Smit and Hubley 2008-2015).
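The conventional classification criteria described above (a minimum number of elements sharing more than one substitution, with fast-mutating CpG sites treated cautiously) can be illustrated with a minimal Python sketch. This is not the Coseg algorithm. The function names, the `min_elements` threshold, the `skip_cpg` option, and the assumption of gap-free sequences already aligned to a consensus are all hypothetical simplifications chosen for illustration.

```python
from collections import Counter
from itertools import combinations


def cpg_positions(consensus):
    """Return indices of every base that lies inside a CpG dinucleotide."""
    sites = set()
    for i in range(len(consensus) - 1):
        if consensus[i:i + 2].upper() == "CG":
            sites.update((i, i + 1))
    return sites


def candidate_subfamilies(consensus, elements, min_elements=3, skip_cpg=True):
    """Find pairs of substitutions shared by at least `min_elements` elements.

    Assumption: every sequence in `elements` is already aligned to
    `consensus` (same length, no gaps). Each key of the returned dict is a
    pair of (position, base) substitutions; each value is the number of
    elements carrying both substitutions.
    """
    masked = cpg_positions(consensus) if skip_cpg else set()
    pair_counts = Counter()
    for seq in elements:
        # Substitutions relative to the consensus, in positional order.
        diffs = [
            (i, b) for i, (a, b) in enumerate(zip(consensus, seq))
            if a != b and i not in masked
        ]
        # Count every pair of substitutions that co-occur in this element.
        pair_counts.update(combinations(diffs, 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_elements}


if __name__ == "__main__":
    consensus = "ACGTTAGCAT"
    elements = ["ACGTAAGTAT", "ATGTAAGTAT", "ACGTAAGTAT", "ACGTTAGCAA"]
    print(candidate_subfamilies(consensus, elements))
```

Because the sketch requires two substitutions to co-occur, an intermediate subfamily that carries only the first of the two would go undetected. This is the simplification of subfamily structure that the multistep approach is meant to mitigate.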
Bird AP. 1980. DNA methylation and the frequency of CpG in animal DNA. Nucleic Acids Res 8: 1499-1504.
Han K, Konkel MK, Xing J, Wang H, Lee J, Meyer TJ, Huang CT, Sandifer E, Hebert K, Barnes EW et al. 2007. Mobile DNA in Old World monkeys: a glimpse through the rhesus macaque genome. Science 316: 238-240.
Han K, Xing J, Wang H, Hedges DJ, Garber RK, Cordaux R, Batzer MA. 2005. Under the genomic radar: The Stealth model of Alu amplification. Genome Res 15: 655-664.
Kong A, Frigge ML, Masson G, Besenbacher S, Sulem P, Magnusson G, Gudjonsson SA, Sigurdsson A, Jonasdottir A, Jonasdottir A et al. 2012. Rate of de novo mutations and the importance of father's age to disease risk. Nature 488: 471-475.
Konkel MK, Walker JA, Batzer MA. 2010. LINEs and SINEs of primate evolution. Evol Anthropol 19: 236-249.
Nachman MW, Crowell SL. 2000. Estimate of the mutation rate per nucleotide in humans. Genetics 156: 297-304.
Price AL, Eskin E, Pevzner PA. 2004. Whole-genome analysis of Alu repeat elements reveals complex evolutionary history. Genome Res 14: 2245-2252.
Smit A, Hubley R. 2008-2015. RepeatModeler Open-1.0.
Xing J, Hedges DJ, Han K, Wang H, Cordaux R, Batzer MA. 2004. Alu element mutation spectra: molecular clocks and the effect of DNA methylation. J Mol Biol 344: 675-682.