Exploring the Universe: November 2014

Wednesday, November 19, 2014

Ch. 4: Neighbor Joining Trees

Today, I continued learning about the major methods for estimating phylogenetic trees. The Neighbor Joining Method is one of the most popular distance-based algorithmic methods. It produces a "single, strictly bifurcating tree," which means that each internal node has exactly two branches descending from it. I downloaded another data file to practice constructing the tree.

First, I opened the file LargeData.meg in MEGA. The window shows the DNA sequence alignment.

Next, I had to determine whether the data was even suitable for estimating a Neighbor Joining Tree. The book says that if the average pairwise Jukes-Cantor distance is more than 1.0, the data isn't suitable for making NJ trees and another phylogenetic method should be used. To check this, I had MEGA compute the overall mean of the pairwise distances, and found the average distance to be 0.534, which is suitable for making NJ trees.
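To get a feel for what that number means, here is a small Python sketch of the check. It uses the standard Jukes-Cantor formula, d = -(3/4) ln(1 - 4p/3), where p is the proportion of sites that differ between two sequences. The function names and the toy alignment are made up for illustration, and skipping gapped columns pair by pair is my own assumption, not necessarily how MEGA handles gaps.

```python
import math
from itertools import combinations

def p_distance(seq_a, seq_b):
    """Proportion of differing sites, skipping columns where either sequence has a gap."""
    pairs = [(a, b) for a, b in zip(seq_a, seq_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("no comparable sites between the two sequences")
    return sum(a != b for a, b in pairs) / len(pairs)

def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    if p >= 0.75:
        # The formula is undefined here; the sequences are effectively saturated.
        return math.inf
    return -0.75 * math.log(1 - 4 * p / 3)

def mean_jc_distance(aligned_seqs):
    """Average Jukes-Cantor distance over all pairs of aligned sequences."""
    dists = [jukes_cantor(a, b) for a, b in combinations(aligned_seqs, 2)]
    return sum(dists) / len(dists)

# Toy alignment (hypothetical data, not LargeData.meg)
toy_alignment = ["ACGTACGTAC", "ACGTACGTAA", "ACGAACGTTC"]
mean_d = mean_jc_distance(toy_alignment)
print(f"mean JC distance = {mean_d:.3f}; suitable for NJ: {mean_d <= 1.0}")
```

The 1.0 cutoff in the last line is the rule of thumb from the book, applied to the mean over all pairs.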

I constructed the tree by choosing Construct/Test Neighbor Joining Tree from the Phylogeny menu. After selecting the "bootstrapping" method to test the reliability of the tree, and making sure that the tree was being constructed using the Neighbor Joining Method, I produced the following tree using the program:

There are many different ways to represent the same data in a tree. I played around with some of the various options and also got this circular tree:

Wednesday, November 12, 2014

Major Methods for Estimating Phylogenetic Trees

Today, I worked through Chapter 5 in my book, which gives a good overview of the major methods for estimating phylogenetic trees.

There are two primary approaches to tree estimation: algorithmic and tree-searching. The algorithmic approach uses an algorithm to estimate a tree from the data. The tree-searching method estimates many trees, then uses some criterion to decide which is the best tree.

The algorithmic approach has two advantages. It is fast, and it yields only a single tree from any given data set. The Neighbor Joining method is the most common algorithmic method, and I'll be learning how to use it next week.

All the other currently used approaches are tree-searching methods. They are generally slower, and some will produce several equally good trees. Methods such as Parsimony, Maximum Likelihood, and Bayesian analysis search for the tree that best meets the criteria by evaluating individual trees. Maximum Likelihood looks for the tree that, under some model of evolution, maximizes the likelihood of observing the data. Bayesian Inference is a recent variant of Maximum Likelihood. Instead of seeking the tree that maximizes the likelihood of observing the data, it seeks those trees with the greatest likelihoods given the data, and produces a set of trees with roughly equal likelihoods. Parsimony is the simplest method, and it looks for the tree or trees with the minimum number of changes.

It's almost impossible to evaluate each possible tree because of the sheer number of possibilities (even with only 10 taxa, there are more than 34 million rooted trees). Therefore, something called a "branch-addition algorithm" is used to find each of the possible trees, which I won't go into in detail here.
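The 34 million figure can be checked with a few lines of Python. For n labeled taxa, the number of rooted, strictly bifurcating trees is the product of the odd numbers 3 × 5 × ... × (2n - 3). The function name is my own; this is just a quick sketch to show how fast the count grows.

```python
def rooted_tree_count(n_taxa):
    """Number of rooted, strictly bifurcating trees for n labeled taxa: 3 * 5 * ... * (2n - 3)."""
    if n_taxa < 2:
        raise ValueError("need at least two taxa")
    count = 1
    for k in range(3, 2 * n_taxa - 2, 2):  # odd numbers 3, 5, ..., 2n - 3
        count *= k
    return count

for n in (3, 5, 10, 20):
    print(n, rooted_tree_count(n))
```

For 10 taxa this gives 34,459,425, which matches the "more than 34 million" in the book.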

It's important to realize that since we don't know what happened in the past, we can never be entirely sure how accurate the tree is. In addition, there is no "right" tree; we can only hope to find the tree that most closely approximates what happened in the past.

Friday, November 7, 2014

Aligning Sequences

Once you acquire sequences (which I covered in my previous blog post), you must then align them before constructing the phylogenetic tree. MEGA provides two alignment methods: ClustalW and MUSCLE. Either can be used, but in general MUSCLE is preferable. In the Alignment menu in MEGA, I chose MUSCLE. There were two choices: Align DNA and Align Codons. Since my sequence was a DNA coding sequence, I chose Align Codons, which ensures that the sequences are aligned by codons. This is a much more realistic approach than aligning the DNA sequences directly, because it avoids introducing gaps at positions that would result in frame shifts in the real sequences. Once I started the alignment in the program, it took about two minutes. Next, I exported my file in the correct format so I would be able to use it to estimate my phylogenetic tree later.
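To show why aligning by codons keeps the reading frame intact, here is a rough Python sketch. My understanding is that codon alignment works roughly by aligning the translated protein sequences and then mapping each amino acid back to its three-base codon, so every gap ends up three bases wide. This is my assumption about the general idea, not MEGA's actual code, and the function name and sequences are made up.

```python
def thread_codons(aligned_protein, coding_dna):
    """Map an aligned protein sequence (with '-' gaps) back onto its coding DNA.

    Each amino acid becomes its original codon and each protein gap becomes '---',
    so gaps never split a codon.
    """
    if len(coding_dna) % 3 != 0:
        raise ValueError("coding DNA length must be a multiple of 3")
    codons = [coding_dna[i:i + 3] for i in range(0, len(coding_dna), 3)]
    n_residues = sum(ch != "-" for ch in aligned_protein)
    if n_residues != len(codons):
        raise ValueError("protein and DNA lengths do not match")
    codon_iter = iter(codons)
    return "".join("---" if ch == "-" else next(codon_iter) for ch in aligned_protein)

# Hypothetical example: the second sequence is missing one amino acid
seq_a = thread_codons("MKAL", "ATGAAAGCTCTG")
seq_b = thread_codons("MK-L", "ATGAAACTG")
print(seq_a)
print(seq_b)
```

Both output lines have the same length, and the gap in the second sequence covers exactly one whole codon.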

Here is my final aligned file:

Once I completed the alignment, I noticed that gaps had been introduced into the sequences. Those gaps represent historical insertions or deletions, and their purpose is to bring homologous sites into alignment in the same column. Just as a phylogenetic tree is an “estimate” of relationships among sequences, an alignment is just an estimate of the positions of historical insertions and deletions.