Introduction to Sturgeon Systematics by Dr. Jeannette Kanefsky
You’ve now seen how morphological (physical) characters can be used to identify, classify and sometimes deduce evolutionary relationships among fish groups. However, using these kinds of characters to determine relationships among species has been difficult in sturgeon. Although morphological characters such as meristic measures (where one counts particular features of a fish, such as fin rays, scales or scutes) and body proportions are sometimes sufficient for distinguishing between the various sturgeon species (for example, see the illustrations of sturgeon and paddlefish species below), there is limited and sometimes confusing information available from morphological characters that can be used to describe the evolutionary history of this group. In other words, you can often find features that allow you to tell sturgeon species apart, but those characters may not provide enough evidence to determine evolutionary relationships among species.
Acipenser fulvescens (Lake sturgeon)
Acipenser brevirostrum (Shortnose sturgeon)
Acipenser oxyrinchus oxyrinchus (Atlantic sturgeon)
Polyodon spathula (North American paddlefish)
How are we related?
Researchers have now employed an additional type of characteristic to help them figure out how sturgeons evolved: the hereditary material DNA, which is passed on from parents to offspring throughout generations. As you remember from Biology class, DNA consists of 2 strands, each composed of a sequence of 4 different nucleotides or bases (A = adenine, T = thymine, G = guanine and C = cytosine) arranged in a linear order. Genes within the DNA contain the instructions, in the form of their nucleotide sequences, to make protein or RNA products. The specific sequence of nucleotides in a gene determines the structure and the function of the gene’s product. Some genes within organisms make products that are critical for survival, perhaps because they are important for proper development or cellular function; this type of gene will be present in most if not all living organisms. Because the products of these genes serve the same or similar functions in these varied organisms, the genes will show significant similarities in nucleotide sequence among species. In addition to their similarities, however, there will also be differences in the gene sequences found in different species, created when mutations (changes) occur within the DNA and are then preserved within a species (part of the process of evolution). Determining the sequences of these genes using DNA sequencing technologies allows them to be compared, so that their similarities and differences among species can be identified. These comparisons are informative because the sequences from different species share a lineage and come from a common ancestor (they have been passed down from parents to offspring throughout evolutionary history), and in a way contain a record of the change that has occurred among the species. This information can be used along with different algorithms to reconstruct the evolutionary relationships among those species.
This approach to deducing evolutionary relationships, termed molecular phylogenetics, is now widely used and has been applied to fish species, including paddlefish and sturgeon species.
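To make the idea of comparing gene sequences a little more concrete, here is a minimal Python sketch that counts the sites at which pairs of sequences differ. The three sequences and the names `species_1` to `species_3` are invented purely for illustration (they are not real sturgeon data), and the sketch assumes the sequences are already the same length and lined up site by site.

```python
def count_differences(seq_a, seq_b):
    """Count the sites at which two aligned, equal-length sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)


def pairwise_differences(seqs):
    """Return {(name_1, name_2): number of differing sites} for every pair."""
    names = sorted(seqs)
    result = {}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            result[(first, second)] = count_differences(seqs[first], seqs[second])
    return result


# Hypothetical 10-nucleotide sequences, made up for this example only.
sequences = {
    "species_1": "ATGCCGTAAC",
    "species_2": "ATGCCGTTAC",
    "species_3": "ATGACGTTGC",
}

for pair, n_diff in pairwise_differences(sequences).items():
    print(pair, n_diff)
```

In this made-up data set, species_1 and species_2 differ at fewer sites than either does from species_3, which is the kind of pattern that reconstruction methods try to turn into a statement about relationships.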
Using the molecular phylogenetic approach, evolutionary relationships are represented in a branching phylogenetic tree, like the trees you observed on the previous pages describing the evolutionary relationships of fishes based on morphological characteristics. We’ll just take a minute to define a few key features of phylogenetic trees using our example tree shown below. The lines that make up the tree are known as the branches, representing evolutionary pathways leading to groups or species. The letters found at the tips of the terminal branches represent the different groups or species (sometimes referred to as taxa) being studied; normally you would find the names of the taxa at the ends of the branches, but we’ve used letters here for simplicity. The branching pattern or shape of the tree is called the topology, and illustrates the evolutionary relationships among the taxa in a tree. In the tree below, we see that species A and B are more closely related to each other than they are to species C, as they are clustered together into one group separate from species C and D. You can also see that the branches leading to species A and B come together to meet at the intersection marked “X”. This point represents an ancestor shared by species A and B and is also referred to as a node. Species A and B are said to belong to the same clade (contained by the red oval labeled “clade 1”), a group of species that all share a common ancestor (“X”), an ancestor that is not shared by any other species outside of the clade. Therefore, neither species C nor species D would be a member of the clade containing species A and B because they do not have “X” as an ancestor. Species A, B and C can also be said to be part of a different clade (contained by the blue oval labeled “clade 2”) because they share the common ancestor “Y”. Within clade 2, species C is considered basal to species A and B because it branches off before these species.
In addition, species D can be said to be found at the base of the tree, as the most distantly related species of the group.
Example of a simple phylogenetic tree
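If it helps to see the same tree in another form, the short Python sketch below writes the example tree as nested pairs and lists its clades. The nested-tuple representation and the function names `leaves`, `clades` and `is_clade` are our own choices for illustration, not a standard tree format; the innermost pair corresponds to node “X” and the pair around it to node “Y”.

```python
def leaves(node):
    """Return the set of taxa found below a node (a tip is just a string)."""
    if isinstance(node, str):
        return frozenset([node])
    result = frozenset()
    for child in node:
        result |= leaves(child)
    return result


def clades(node):
    """Return the set of taxa under every internal node of the tree."""
    if isinstance(node, str):
        return []
    found = [leaves(node)]
    for child in node:
        found.extend(clades(child))
    return found


def is_clade(tree, taxa):
    """True if the given taxa are exactly the descendants of one node."""
    return frozenset(taxa) in clades(tree)


# The example tree: A and B join at X, C joins them at Y, D is at the base.
tree = ((("A", "B"), "C"), "D")

print(is_clade(tree, {"A", "B"}))       # clade 1
print(is_clade(tree, {"A", "B", "C"}))  # clade 2
print(is_clade(tree, {"B", "C"}))       # not a clade: A shares their ancestor
```

The last check fails because the only ancestor shared by B and C (node “Y”) is also an ancestor of A, so B and C alone do not form a clade.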
Now we’ll move on to describe (in a simplified way) a few different basic methods currently used for molecular phylogenetic reconstruction:
1) Distance-based methods calculate the evolutionary distances (basically representing the amount of nucleotide difference) among pairs of sequences from different species and use this information to reconstruct relationships in the form of a tree. Using this method, one tree is produced. It is worth noting here that there are a number of different models of sequence evolution (models of how nucleotides change over time) that may be used to calculate these evolutionary distances.
2) Maximum-likelihood methods use statistical methods to determine the likelihood (probability) of observing your DNA sequence data, for a specific model of sequence evolution, for many different sets of possible relationships (different possible trees). After comparison among trees, the set of particular relationships that is calculated to have the highest probability of producing your observed sequence data is chosen as the “best” tree. Again, a single tree is produced. This method is computationally intensive because it must build a large number of trees for comparison (the number depends on the number of taxa to be studied) and calculate the probabilities for each of these trees, and so can take a long time to complete.
3) Maximum parsimony methods are the oldest of the methods mentioned here, as they were developed originally for analyzing morphological data. Maximum parsimony is based on the idea that the simplest answer to a problem is the best answer (Occam’s razor). In a maximum parsimony analysis, different sets of possible relationships among the taxa under study (different trees) are constructed and the number of evolutionary changes (mutations) necessary to explain each of these trees is calculated. Then, the set of evolutionary relationships (tree) which requires the smallest number of changes to explain the nucleotide differences among your taxa is chosen as the best or maximum parsimony tree. It may turn out, though, that more than one tree requiring the same minimum number of evolutionary changes is found, and that no one unique tree can be inferred to be the best one. Because this method requires the construction of a large number of trees for comparison and the calculation of the number of evolutionary changes needed to explain each of these trees, maximum parsimony can also be a very computationally intensive method. Still, it is very useful, for reasons we will not go into here, for reconstructing evolutionary relationships using nucleotide sequences that have low levels of divergence (low amounts of sequence change present among them).
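As a rough illustration of the maximum parsimony idea described in method 3, the Python sketch below counts the minimum number of changes needed on each of three candidate trees and picks the tree needing the fewest. It uses Fitch's counting procedure, which is one common way to score a given tree and is our choice here, not necessarily what any particular program does. The 4-site alignment is invented for the example, and the trees are written as nested pairs with taxon D placed at the base; a real analysis would search far more trees.

```python
def fitch_site(node, states):
    """Return (possible ancestral states, minimum changes) for one site."""
    if isinstance(node, str):
        return {states[node]}, 0
    left, right = node
    left_set, left_cost = fitch_site(left, states)
    right_set, right_cost = fitch_site(right, states)
    shared = left_set & right_set
    if shared:
        # The two lineages can agree on a state: no extra change needed.
        return shared, left_cost + right_cost
    # No state in common: at least one change occurred on these branches.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, alignment):
    """Sum the minimum number of changes over every site in the alignment."""
    length = len(next(iter(alignment.values())))
    total = 0
    for site in range(length):
        states = {name: seq[site] for name, seq in alignment.items()}
        total += fitch_site(tree, states)[1]
    return total


# Hypothetical aligned sequences, made up for this example only.
alignment = {"A": "AAGT", "B": "AAGC", "C": "ACGC", "D": "GCTC"}

candidate_trees = [
    ((("A", "B"), "C"), "D"),
    ((("A", "C"), "B"), "D"),
    ((("B", "C"), "A"), "D"),
]

scores = {tree: parsimony_score(tree, alignment) for tree in candidate_trees}
best_tree = min(scores, key=scores.get)
print(best_tree, scores[best_tree])
```

With this made-up alignment only the second site separates the candidates: grouping A with B explains it with a single change, while the other two groupings each need two.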
Actually, all of these methods are too complex and computationally intensive to do by hand with real data, so computer programs have been written to carry them out. And as you can see even from these brief descriptions, these 3 methods all reconstruct species relationships in very different ways. Therefore, when the same data are analyzed using different methods, the methods may not always give the same answer! In practice, it can therefore be useful to compare the results produced by analyzing your sequences with the various reconstruction methods.
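To get a feel for why these analyses are left to computers, the small Python sketch below counts how many different rooted, fully branching (bifurcating) trees are possible for a given number of taxa. It relies on the standard counting result that n labeled taxa can be arranged into 1 × 3 × 5 × ... × (2n - 3) such trees; the function name `rooted_tree_count` is our own.

```python
def rooted_tree_count(n_taxa):
    """Number of distinct rooted bifurcating trees for n_taxa labeled taxa."""
    if n_taxa < 2:
        raise ValueError("need at least 2 taxa to form a tree")
    count = 1
    # Multiply the odd numbers 3, 5, ..., (2 * n_taxa - 3).
    for k in range(3, 2 * n_taxa - 2, 2):
        count *= k
    return count


for n in (4, 10, 20):
    print(n, rooted_tree_count(n))
```

Four taxa give only 15 possible rooted trees, but 10 taxa already give more than 34 million, which is why methods that compare many trees become so computationally demanding as taxa are added.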
In order to give you an example of how molecular phylogenetic reconstruction works, we’re going to consider an example using fabricated gene sequences for North American sturgeon and paddlefish species and the maximum parsimony method. What follows is an overview of important issues to consider and the steps necessary to carry out the analysis.
When conducting a molecular phylogenetic study, one of the first steps is to choose a gene for study. You must select a gene whose rate of evolution (sequence change) is appropriate for the group whose relationships you are trying to determine. Choose a gene that is evolving too quickly for your group of interest and there may be too much sequence variation, which can obscure species relationships. Choose a gene that is evolving too slowly for the group in question and there will not be enough sequence difference among species to determine species relationships. For the special case of the analysis of the evolutionary relationships of sturgeon and paddlefish, we have another issue to consider: their polyploid ancestry. Sturgeons and paddlefish belong to a group (the Order Acipenseriformes) that has undergone multiple polyploidization events in its evolutionary history. Polyploidization occurs when the number of chromosome sets in the genome of a species is increased. Most species, including humans, are considered diploid, meaning that they have 2 sets of chromosomes, one from each parent. But in some sturgeon species, polyploidization has increased the number of chromosomes to 4 sets (tetraploid species) or even 6 sets (hexaploid species)! This genome doubling or tripling results in more copies of genes than normal, and after a polyploidization event, changes often occur in the genome that remove or silence these excess gene copies and bring the gene copy number back to a diploid (2 copy) level. This process of subsequent genome reduction is referred to as diploidization. If during diploidization extra copies of a gene that we want to study are removed from the genome, this is not really a problem for molecular phylogenetic analysis: only 2 copies will be left for us to study, each still carrying out the same function.
Genes that are functional are what we call constrained; their sequences can only change so much and still produce a product that can do its intended job. But if extra copies of a gene of interest are left in the genome and are silenced, this can confound our efforts. Genes that are silenced are no longer expressed, and so are no longer functional; they are referred to as pseudogenes. Because these pseudogenes no longer need to maintain a specific sequence to retain their function (they are not constrained), they are free to change in sequence in somewhat random ways. To make it even more confusing, some copies of duplicated genes that remain in the genome and are not silenced may alternatively take on new functions. Their sequences can change over time to allow their products to better carry out that new function; that is, the gene copies are said to diverge (become different) from the original genes. The changes that accumulate in the extra gene copies as a result of both of these scenarios can confuse the determination of species relationships because they do not accurately reflect the evolutionary history of the original functional gene.
There are 2 different kinds of DNA found in eukaryotic cells: nuclear DNA and mitochondrial DNA (or mtDNA). Nuclear DNA consists of linear chromosomes and is found in the nucleus of a cell; it composes what we normally think of as the genome of an organism.
Mitochondrial DNA is a small, circular, maternally inherited (passed down only from the mother to her offspring) DNA molecule. It is found as multiple copies in the energy-producing organelle called the mitochondrion, which is itself found within the cells of the body.
The polyploidization discussed above, and the possible problems associated with it, affect the nuclear DNA and the genes within it, making it technically difficult to use genes found in the nuclear DNA for phylogenetic reconstruction in sturgeons. Polyploidization does not affect the mtDNA. It is for this reason that most molecular phylogenetic studies of sturgeon to date have examined genes found in the mitochondrial DNA. Aside from allowing us to avoid the potential problems introduced by polyploidization, mtDNA has another advantage. In animals, its genes change in sequence more quickly than nuclear genes do. This turns out to be important for the study of sturgeon evolution, as it appears that this group has a relatively slow rate of molecular evolutionary change. The faster rate of change in mitochondrial genes provides us with more information (more differences) for inferring relationships.
Okay, so say we’ve identified a good candidate gene for our study and we’ve determined the nucleotide sequences of our gene of choice from the species we are studying. For simplicity, in our example we are examining 4 species: 3 sturgeon species (which belong to the family Acipenseridae) and the North American paddlefish species (which belongs to the family Polyodontidae), and our pretend sequence is 10 nucleotides long in each species. In order to identify similarities and differences among the sequences, we must align the nucleotide sequences from these 4 species. When the gene sequences from each species are the same length, this is a pretty easy job: the first nucleotide in each sequence is put together at the first site or position within the alignment, the second nucleotide in each sequence is put together at the second site or position within the alignment, and so on until you reach the last nucleotide of the gene. Below is the alignment of our fictional gene sequences. Each of the rows labeled with a species name represents a sequence isolated from one of the species. The numbers at the top of the columns indicate the number of the site or position in our alignment, and below them are the nucleotides found at those sites in each of our 4 species.