Building a phylogenetic tree (article)

The logic behind phylogenetic trees. How to build a tree using data about features that are present or absent in a group of organisms.

Key points:

Phylogenetic trees represent hypotheses about the evolutionary relationships among a group of organisms.
A phylogenetic tree may be built using morphological (body shape), biochemical, behavioral, or molecular features of species or other groups.
In building a tree, we organize species into nested groups based on shared derived traits (traits different from those of the group's ancestor).
The sequences of genes or proteins can be compared among species and used to build phylogenetic trees. Closely related species typically have few sequence differences, while less related species tend to have more.

Introduction

We're all related, and I don't just mean us humans, though that's most definitely true! Instead, all living things on Earth can trace their descent back to a common ancestor. Any smaller group of species can also trace its ancestry back to a common ancestor, often a much more recent one.

Given that we can't go back in time and see how species evolved, how can we figure out how they are related to one another? In this article, we'll look at the basic methods and logic used to build phylogenetic trees, or trees that represent the evolutionary history and relationships of a group of organisms.

Overview of phylogenetic trees

In a phylogenetic tree, the species of interest are shown at the tips of the tree's branches. The branches themselves connect up in a way that represents the evolutionary history of the species—that is, how we think they evolved from a common ancestor through a series of divergence (splitting-in-two) events. At each branch point lies the most recent common ancestor shared by all of the species descended from that branch point. The lines of the tree represent long series of ancestors that extend from one species to the next.

For a more detailed explanation, check out the article on phylogenetic trees.

Even once you feel comfortable reading a phylogenetic tree, you may have the nagging question: How do you build one of these things? In this article, we'll take a closer look at how phylogenetic trees are constructed.

The idea behind tree construction

How do we build a phylogenetic tree? The underlying principle is Darwin’s idea of “descent with modification.” Basically, by looking at the pattern of modifications (novel traits) in present-day organisms, we can figure out—or at least, make hypotheses about—their path of descent from a common ancestor.

As an example, let's consider the phylogenetic tree below (which shows the evolutionary history of a made-up group of mouse-like species). We see three new traits arising at different points during the evolutionary history of the group: a fuzzy tail, big ears, and whiskers. Each new trait is shared by all of the species descended from the ancestor in which the trait arose (shown by the tick marks), but absent from the species that split off before the trait appeared.

There is a lot of information in the tree above! Let's go through step by step and see what the tree and the marks on it are telling us.

Let's start off with a tree with no traits marked on it. The tree shows the evolutionary relationships of a (made-up) group of mouse-like critters. The letters A, B, C, D, and E represent present-day species in the group. The base, or root, of the tree shows the most recent common ancestor of the group. This is an animal that lived long ago, whose traits we happen to know in this made-up example (though we might not in a real-life example).

Now, let's imagine that sometime after species A branched off from the lineage leading to the other species (B-E), a new trait arose on the B-E lineage: a fuzzy tail, more like the tail of a squirrel. If no other evolutionary changes happened in any of the lineages, we would expect species B, C, D, and E to have a fuzzy tail and expect species A to not have a fuzzy tail. That's because species B-E would inherit a fuzzy tail from the common ancestor in which the trait arose.

Now, let's imagine that another new trait arose. After lineage B had split off from the lineage leading to species C-E, let's imagine that the trait of big ears appeared in the C-E lineage. If no other evolutionary events affecting ear size took place in any of the groups, we would expect species C, D, and E to have big ears and species A and B to have small ears.

Finally, let's imagine that another new trait arose, this time after lineage C had split off from the lineage leading to species D and E. In a common ancestor shared by D and E only, let's imagine that the trait of whiskers arose. If no other events affecting whiskers occurred in any of the groups, we would expect species D and E to have whiskers, but all of the other species (A, B, and C) to be whisker-less.

The diagram with all three traits marked is just a condensed representation of the information that we broke down into separate trees above. The shaded regions nested inside of each other show which species we'd expect to have each trait, based on the point in evolutionary history where the trait arose. For instance, we'd expect species C, D, and E (blue shading) to have big ears.
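The same nested pattern can be sketched in a few lines of Python. This is only an illustration, not part of the original example: the way the tree is encoded and the names `TREE` and `expected_traits` are assumptions made for this sketch. Each node records the traits gained on the branch leading to it, and every species inherits whatever arose along its path from the root.

```python
# Each node is (traits gained on the branch leading to it, tip name or children).
# The shape follows the made-up mouse tree: A splits off first, then B, then C,
# leaving D and E as each other's closest relatives.
TREE = ([], [
    ([], "A"),
    (["fuzzy tail"], [
        ([], "B"),
        (["big ears"], [
            ([], "C"),
            (["whiskers"], [([], "D"), ([], "E")]),
        ]),
    ]),
])

def expected_traits(node, inherited=()):
    """Map each tip species to the traits it inherits, assuming no later changes."""
    gained, below = node
    traits = list(inherited) + list(gained)
    if isinstance(below, str):  # a tip: report everything gained along the path
        return {below: traits}
    result = {}
    for child in below:
        result.update(expected_traits(child, traits))
    return result

for species, traits in expected_traits(TREE).items():
    print(species, traits)
```

Species A comes out with no new traits, B with only the fuzzy tail, C with the fuzzy tail and big ears, and D and E with all three, matching the shaded regions described above.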

Example: Building a phylogenetic tree

If we were biologists building a phylogenetic tree as part of our research, we would have to pick which set of organisms to arrange into a tree. We'd also have to choose which characteristics of those organisms to base our tree on (out of their many different physical, behavioral, and biochemical features).

If we're instead building a phylogenetic tree for a class (which is probably more likely for readers of this article), odds are that we'll be given a set of characteristics, often in the form of a table, that we need to convert into a tree. For example, this table shows presence (+) or absence (0) of various features:

Feature	Lamprey	Antelope	Bald eagle	Alligator	Sea bass
Lungs	0	+	+	+	0
Jaws	0	+	+	+	+
Feathers	0	0	+	0	0
Gizzard	0	0	+	+	0
Fur	0	+	0	0	0

Table modified from Taxonomy and phylogeny: Figure 4, by Robert Bear et al., CC BY 4.0

Next, we need to know which form of each characteristic is ancestral and which is derived. For example, is the presence of lungs an ancestral trait, or is it a derived trait? As a reminder, an ancestral trait is what we think was present in the common ancestor of the species of interest. A derived trait is a form that we think arose somewhere on a lineage descended from that ancestor.

Without the ability to look into the past (which would be handy but, alas, impossible), how do we know which traits are ancestral and which derived?

In the context of homework or a test, the question you are solving may tell you which traits are derived vs. ancestral.
If you are doing your own research, you may have knowledge that allows you to identify ancestral and derived traits (e.g., based on fossils).
You may be given information about an outgroup, a species that's more distantly related to the species of interest than they are to one another.

If we are given an outgroup, the outgroup can serve as a proxy for the ancestral species. That is, we may be able to assume that its traits represent the ancestral form of each characteristic.

This assumption does not always hold, however. Outgroups, just like the species we are trying to study, continue to evolve over time. So, it's possible for the outgroup species to (independently) acquire some of the same derived traits found among our species of interest. In this case, we might accidentally identify a derived trait as ancestral.

For instance, in our example (data repeated below for convenience), the lamprey, a jawless fish that lacks a true skeleton, is our outgroup. As shown in the table, the lamprey lacks all of the listed features: it has no lungs, jaws, feathers, gizzard, or fur. Based on this information, we will assume that absence of these features is ancestral, and that presence of each feature is a derived trait.

Feature	Lamprey	Antelope	Bald eagle	Alligator	Sea bass
Lungs	0	+	+	+	0
Jaws	0	+	+	+	+
Feathers	0	0	+	0	0
Gizzard	0	0	+	+	0
Fur	0	+	0	0	0

Table modified from Taxonomy and phylogeny: Figure 4, by Robert Bear et al., CC BY 4.0
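Here is a minimal sketch of how the outgroup comparison could be written out in Python. The dictionary `PRESENT` simply re-encodes the table, with the lamprey entry left empty because the text states that it lacks every listed feature. The function name `derived_state_holders` is made up for this illustration. For each feature, it treats whatever state the outgroup shows as ancestral and lists the other species that differ from it.

```python
# Features present in each species, taken from the table. The lamprey
# (our outgroup) has none of the listed features.
PRESENT = {
    "lamprey": set(),
    "antelope": {"lungs", "jaws", "fur"},
    "bald eagle": {"lungs", "jaws", "feathers", "gizzard"},
    "alligator": {"lungs", "jaws", "gizzard"},
    "sea bass": {"jaws"},
}
FEATURES = ["lungs", "jaws", "feathers", "gizzard", "fur"]

def derived_state_holders(present, features, outgroup):
    """For each feature, list the species whose state differs from the outgroup's.

    Assumes the outgroup shows the ancestral state of every feature, which
    (as noted above) is a working assumption rather than a guarantee.
    """
    holders = {}
    for feature in features:
        outgroup_has = feature in present[outgroup]
        holders[feature] = sorted(
            species
            for species in present
            if species != outgroup and (feature in present[species]) != outgroup_has
        )
    return holders

for feature, species in derived_state_holders(PRESENT, FEATURES, "lamprey").items():
    print(feature, species)
```

Jaws turn out to be the derived trait shared by the most species (four), followed by lungs (three) and the gizzard (two), while feathers and fur each appear in a single species.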

Now, we can start building our tree by grouping organisms according to their shared derived features. A good place to start is by looking for the derived trait that is shared between the largest number of organisms. In this case, that's the presence of jaws: all the organisms except the outgroup species (lamprey) have jaws. So, we can start our tree by drawing the lamprey lineage branching off from the rest of the species, and we can place the appearance of jaws on the branch carrying the non-lamprey species.

Next, we can look for the derived trait shared by the next-largest group of organisms. This would be lungs, shared by the antelope, bald eagle, and alligator, but not by the sea bass. Based on this pattern, we can draw the lineage of the sea bass branching off, and we can place the appearance of lungs on the lineage leading to the antelope, bald eagle, and alligator.

Following the same pattern, we can now look for the derived trait shared by the next-largest number of organisms. That would be the gizzard, which is shared by the alligator and the bald eagle (and absent from the antelope). Based on this data, we can draw the antelope lineage branching off from the alligator and bald eagle lineage, and place the appearance of the gizzard on the latter.

You may be wondering which of the two remaining species, the alligator or the bald eagle, should go on the left and which on the right. Actually, it doesn't matter; the tree is the same either way. The important information in a phylogenetic tree is the pattern of its branches, not the ordering of the species along the top. We can rotate a phylogenetic tree about any branch point (for instance, switching the positions of the bald eagle and the alligator) without changing its meaning.

The phylogenetic trees article provides more details on what we can and can't infer from a phylogenetic tree.

What about our remaining traits of fur and feathers? These traits are derived, but they are not shared, since each is found only in a single species. Derived traits that aren't shared don't help us build a tree, but we can still place them on the tree in their most likely location. For feathers, this is on the lineage leading to the bald eagle (after divergence from the alligator). For fur, this is on the antelope lineage, after its divergence from the alligator and bald eagle.
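The grouping steps we just walked through can be sketched as a short Python function. This is an illustration under stated assumptions, not a standard algorithm from any library: the name `build_tree` and the nested-tuple output are made up for the example, and the function only works when the shared derived traits define groups that are nested or non-overlapping, as they are here. Traits found in a single species are left out, since they don't help us group anything.

```python
ALL_SPECIES = frozenset({"lamprey", "antelope", "bald eagle", "alligator", "sea bass"})

# Groups defined by shared derived traits (two or more species each).
SHARED_DERIVED = {
    "jaws": frozenset({"antelope", "bald eagle", "alligator", "sea bass"}),
    "lungs": frozenset({"antelope", "bald eagle", "alligator"}),
    "gizzard": frozenset({"bald eagle", "alligator"}),
}

def build_tree(group, clades):
    """Return nested tuples for `group`, splitting it along the given clades.

    Assumes the clades are nested or disjoint (no conflicting traits).
    """
    if len(group) == 1:
        return next(iter(group))
    inside = {c for c in clades if c < group}
    # Start with the largest subgroups: those not contained in another one.
    largest = [c for c in inside if not any(c < other for other in inside)]
    largest.sort(key=lambda c: (-len(c), sorted(c)))
    grouped = set().union(*largest)
    children = [build_tree(c, clades) for c in largest]
    children += sorted(group - grouped)  # species that branch off at this level
    return tuple(children)

tree = build_tree(ALL_SPECIES, set(SHARED_DERIVED.values()))
print(tree)
```

The printed result nests the alligator and bald eagle together, then adds the antelope, the sea bass, and finally the lamprey. As with any phylogenetic tree, the left-to-right order inside each pair of parentheses carries no meaning.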

Parsimony and pitfalls in tree construction

When we were building the tree above, we used an approach called parsimony. Parsimony essentially means that we are choosing the simplest explanation that can account for our observations. In the context of making a tree, it means that we choose the tree that requires the fewest independent genetic events (appearances or disappearances of traits) to take place.

For example, we could have also explained the pattern of traits we saw using the following tree:

This series of events also provides an evolutionary explanation for the traits we see in the five species. However, it is less parsimonious because it requires more independent changes in traits to take place. Because of where we've put the sea bass, we have to hypothesize that jaws independently arose two separate times (once in the sea bass lineage, and once in the lineage leading to antelopes, bald eagles, and alligators). This gives the tree a total of $6$ tick marks, or trait change events, versus $5$ in the more parsimonious tree above.
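Counting tick marks can also be done by a small program. The sketch below is one simple way to do it, not the method used by any particular phylogenetics package. It assumes every feature is absent in the root ancestor (as we inferred from the lamprey) and finds the fewest gains and losses needed to explain the table on a given tree. The topology called `ALTERNATIVE` is a guess at a tree matching the description above, with the sea bass grouped next to the lamprey; the names `min_changes` and `tree_length` are made up for this example.

```python
import math

PRESENT = {
    "lamprey": set(),
    "antelope": {"lungs", "jaws", "fur"},
    "bald eagle": {"lungs", "jaws", "feathers", "gizzard"},
    "alligator": {"lungs", "jaws", "gizzard"},
    "sea bass": {"jaws"},
}
FEATURES = ["lungs", "jaws", "feathers", "gizzard", "fur"]

# The tree built in the example above.
MORE_PARSIMONIOUS = ("lamprey", ("sea bass", ("antelope", ("alligator", "bald eagle"))))
# Assumed alternative: sea bass placed with the lamprey (hypothetical topology).
ALTERNATIVE = (("lamprey", "sea bass"), ("antelope", ("alligator", "bald eagle")))

def min_changes(node, state, feature):
    """Fewest changes below `node` if `node` itself has `state` (True = present)."""
    if isinstance(node, str):
        # A tip can only have the state we actually observe.
        return 0 if (feature in PRESENT[node]) == state else math.inf
    total = 0
    for child in node:
        total += min(
            min_changes(child, child_state, feature) + (child_state != state)
            for child_state in (True, False)
        )
    return total

def tree_length(tree):
    """Total trait changes, assuming every feature is absent at the root."""
    return sum(min_changes(tree, False, feature) for feature in FEATURES)

print(tree_length(MORE_PARSIMONIOUS))  # 5
print(tree_length(ALTERNATIVE))        # 6
```

On the assumed alternative tree, jaws account for two of the six changes, which is exactly the extra event described in the text.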

In this example, it may seem fairly obvious that there is one best tree, and counting up the tick marks may not seem very necessary. However, when researchers make phylogenies as part of their work, they often use a large number of characteristics, and the patterns of these characteristics rarely agree $100\%$ with one another. Instead, there are some conflicts, where one tree would fit better with the pattern of one trait, while another tree would fit better with the pattern of another trait. In these cases, the researcher can use parsimony to choose the one tree (hypothesis) that fits the data best.

You may be wondering: Why don't the trees all agree with one another, regardless of what characteristics they're built on? After all, the evolution of a group of species did happen in one particular way in the past. The issue is that, when we build a tree, we are reconstructing that evolutionary history from incomplete and sometimes imperfect data. For instance:

We may not always be able to distinguish features that reflect shared ancestry (homologous features) from features that are similar but arose independently (analogous features arising by convergent evolution).
Imagine that the tree below shows the actual evolutionary history of a group of rodents. In this tree, whiskers arise two independent times.
If we didn't know the true history of the group and were trying to reconstruct it, we might interpret the whiskers as arising from a single event. The whisker data would then conflict with data for the other traits.
Traits can be gained and lost multiple times over the evolutionary history of a species. A species may have a derived trait, but then lose that trait (revert back to the ancestral form) over the course of evolution.
Imagine that the tree below shows the actual evolutionary history of a group of rodents. In this tree, species E undergoes a genetic change that causes it to lose its bushy tail and gain the skinny tail present in the group's ancestor.
If we didn't know the true history of the group and were trying to reconstruct it, we might assume that species E was descended from an ancestor without a bushy tail. Under this assumption, the tail data would conflict with data for other traits.

Biologists often use many different characteristics to build phylogenetic trees because of sources of error like these. Even when all of the characteristics are carefully chosen and analyzed, there is still the potential for some of them to lead to wrong conclusions (because we don't have complete information about events that happened in the past).
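One way to spot this kind of conflict is to check whether the groups defined by two derived traits can both fit on a single tree. If each trait arose once and was never lost, the two groups must either be nested or not overlap at all. The sketch below applies that check. The observations are hypothetical (the whisker pattern in particular is invented to mimic convergent evolution), and the names `compatible` and `conflicts` are made up for this example.

```python
from itertools import combinations

def compatible(group_a, group_b):
    """Two trait groups fit one tree if they are nested or do not overlap."""
    return group_a <= group_b or group_b <= group_a or group_a.isdisjoint(group_b)

def conflicts(groups):
    """Return the pairs of traits whose groups cannot both fit one tree."""
    return [
        (trait_1, trait_2)
        for (trait_1, g1), (trait_2, g2) in combinations(groups.items(), 2)
        if not compatible(g1, g2)
    ]

# Hypothetical observations for made-up rodent species A-E.
OBSERVED = {
    "fuzzy tail": {"B", "C", "D", "E"},
    "big ears": {"C", "D", "E"},
    "whiskers": {"B", "D", "E"},  # invented pattern, as if whiskers arose twice
}

print(conflicts(OBSERVED))  # [('big ears', 'whiskers')]
```

A conflict like this tells us that at least one of the two traits must have changed more than once, but not which one. That is where comparing whole trees by parsimony comes in.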

Using molecular data to build trees

A tool that has revolutionized, and continues to revolutionize, phylogenetic analysis is DNA sequencing. With DNA sequencing, rather than using physical or behavioral features of organisms to build trees, we can instead compare the sequences of their orthologous (evolutionarily related) genes or proteins.

The basic principle of such a comparison is similar to what we went through above: there's an ancestral form of the DNA or protein sequence, and changes may have occurred in it over evolutionary time. However, a gene or protein doesn't just correspond to a single characteristic that exists in two states.

Instead, each nucleotide of a gene or amino acid of a protein can be viewed as a separate feature, one that can flip to multiple states (e.g., A, T, C, or G for a nucleotide) via mutation. So, a gene with $300$ nucleotides in it could represent $300$ different features existing in $4$ states! The amount of information we get from sequence comparisons, and thus the resolution we can expect to get in a phylogenetic tree, is much higher than when we're using physical traits.

To analyze sequence data and identify the most probable phylogenetic tree, biologists typically use computer programs and statistical algorithms. In general, though, when we compare the sequences of a gene or protein between species:

A larger number of differences corresponds to more distantly related species
A smaller number of differences corresponds to more closely related species

For example, suppose we compare the beta chain of hemoglobin (the oxygen-carrying protein in blood) between humans and a variety of other species. If we compare the human and gorilla versions of the protein, we'll find only $1$ amino acid difference. If we instead compare the human and dog proteins, we'll find $15$ differences. With human versus chicken, we're up to $45$ amino acid differences, and with human versus lamprey (a jawless fish), we see $127$ differences $^{1}$. These numbers reflect that, among the species considered, humans are most closely related to the gorilla and least closely related to the lamprey.
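To make the idea of counting differences concrete, here is a short Python sketch. The function `count_differences` (a made-up name) compares two aligned sequences position by position. The short DNA fragments are invented for illustration and are not real hemoglobin sequences. The second part simply sorts the amino acid difference counts quoted above from fewest to most.

```python
def count_differences(seq_a, seq_b):
    """Count positions at which two aligned sequences differ."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(a != b for a, b in zip(seq_a, seq_b))

# Invented, already-aligned fragments (not real data).
ALIGNED = {
    "species 1": "ATGGTGCACCTG",
    "species 2": "ATGGTGCACTTG",
    "species 3": "ATGCTGAACTTC",
}
print(count_differences(ALIGNED["species 1"], ALIGNED["species 2"]))  # 1
print(count_differences(ALIGNED["species 1"], ALIGNED["species 3"]))  # 4

# Amino acid differences from human beta hemoglobin, as quoted in the text.
DIFFERENCES_FROM_HUMAN = {"lamprey": 127, "dog": 15, "gorilla": 1, "chicken": 45}
closest_first = sorted(DIFFERENCES_FROM_HUMAN, key=DIFFERENCES_FROM_HUMAN.get)
print(closest_first)  # ['gorilla', 'dog', 'chicken', 'lamprey']
```

Real analyses go well beyond raw counts. As noted above, biologists typically rely on computer programs and statistical methods to turn sequence data into the most probable tree.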

You can see Sal working through an example involving phylogenetic trees and sequence data in this AP biology free response question video.

Attribution

This article is a modified derivative of "Taxonomy and phylogeny," by Robert Bear, David Rintoul, Bruce Snyder, Martha Smith-Caldas, Christopher Herren, and Eva Horne, CC BY 4.0. Download the original article for free at http://cnx.org/contents/db89c8f8-a27c-4685-ad2a-19d11a2a7e2e@24.18.

The modified article is licensed under a CC BY-NC-SA 4.0 license.

Works cited

David Baum, "Reading a Phylogenetic Tree: The Meaning of Monophyletic Groups," Nature Education 1, no. 1 (2008): 190, http://www.nature.com/scitable/topicpage/reading-a-phylogenetic-tree-the-meaning-of-41956.
