BIO/CSC295 2011F, Class 20: Phylogenetics (1)

Overview:
* Why build phylogenetic trees?
* Bioinformatic approaches.
* Some challenges.
* Web exploration (maybe).
* Time for group discussions.

Admin:
* Don't forget: Project proposals and Paper responses due Thursday.
* Does anyone lack a project group?  Done.  See below.
* Thursday extra: Summer opportunities in CS.
* No bio seminar Friday.  But lots of job candidates.  EC for attending job candidate talks and providing comments through SEPC.  (We do care about what students say about SEPC.)
* Note: We will not put details of the job candidates in these eboards.

Groups:
* Michael, Kevin, Ben, James
* Jonah, Bill, Bonnie, Logan
* Abraham, Karissa, Dilan, Bogdan
* Guadalupe, Nancy, Pelle, Katherine v
* Ian, Ben, Chris, David, Bozo
* Jay, Radhika, Marcus, Noelle

Context
* It's the last chapter we're doing in the text!  Yay!
* And we're finishing the textbook with four weeks to go in the semester.  Yay!
* And Sam thinks you've learned enough programming that we don't need to do the programming project.

On to the wonder of phylogenetics ...
* Why do we care about phylogenetic analysis?
    * Medical issues with resistance
        * See the nice (although flawed) HIV studies that you folks did
    * We often find genes by looking at closely related genomes - Phylogenetic trees give you some expected information.  Kellis paper is a nice example.
    * Gives Professor Brown a research area - coevolution of plants and insects in Hawai'i.
        * Insert obligatory joke about why you do research in Hawai'i.
    * Mechanisms of speciation
    * It's cool
    * Might tell us something about "us"
    * Also explore how we've affected other species, particularly as we change the environment
        * E.g., tall grass prairie
* How do we measure evolution / how do we think about phylogenetics?
    * Did you do this kind of thing in Bio 252?  No one remembers.
    * Well, you're scientists, so you should be able to figure out a technique.
    * Underlying question: "What makes a species?"
    * Consider mostly conserved genes, and consider changes within those mostly conserved genes.  "Barcoding"
        * How do you select a gene?
    * Morphological differences - phenotype
        * Morphology - physical characteristics
        * Phenotype - How genes are expressed (Gene + Env -> Phenotype)
    * Morphology has been somewhat controversial, as has the genetic approach.
    * Speciation: Two things are the same species if they can mate and produce viable offspring.
        * But biology is messy, and there are all sorts of things that don't match normal rules; e.g., self-reproduction
        * And "goofy" things like genome duplication
    * Economic issues of speciation: If we're protecting a species, there are those who want to show that what is in protected habitat is not different from what is in another.
* Why is it complicated to look at "the same gene" in multiple species?
    * Gene duplication
    * Genome duplication
    * Need to make sure that you're actually looking at the same gene
    * Need something conserved, but also something that changes a bit.
    * Needs enough change to distinguish different species.
    * The choice of gene will depend on what species you are comparing
        * Harder when you have species that are more different
        * Subspecies within a population suggest a different gene
* What do we do once we've identified (a) Gene of interest (b) Conservation (c) Changes?
    * We want to infer an ancestral sequence
    * Accumulation of base pair differences is genetic difference
    * Note: Probably need enough samples that you can identify variation within a species vs. variation between species
        * In most cases, we think things are divergent enough that this isn't a significant issue.
* Comparing sequences is not difficult.  Why would it be hard to know what is going on evolutionarily?
    * Mutation and then reversion
    * Inferring the ancestor can be difficult
    * Not all changes are equal
        * Some are fatal
        * Position in a sequence is important
        * 3rd base changes are more easily tolerated
        * Changes within sequence are less likely than changes outside of sequence
    * Think about what approach you would take.
    * PAM matrix suggests that some changes are tolerated more than others.

Let's look at this computationally ...
Important biological problem
What is the useful computation that we have?
What are your inputs?  What are your outputs?
    Input: Here is a good gene.  Here are the sequences in a dozen species.
    What is the output?
        Evolutionary distance - how long ago did they share a common ancestor?
        For two - how long ago did they separate?
        Lots of sequences?  Ordering of distance between all sequences
        Tree - Length of line represents "time" (scaled branches)
Two questions - time and building a tree
Computation - a lot of statistics and biology knowledge

Formula: r = K/(2T), where T = time; r = substitution rate; K = actual number of substitutions
    At what rate does DNA mutate?
    Unfortunately - none of these are obvious.
    Substitution rate can be measured experimentally.
    K is hard to get a direct number for because, if you have 2 sequences, you can count the number of differences, but that may not actually represent the number of substitutions.
How do we go from number of differences to number of substitutions?
    2 common formulas - one (Jukes-Cantor) is K_ab = -(3/4) ln(1 - (4/3) D_ab), where D_ab is the fraction of sites that differ between sequences a and b
    Scaling of evolutionary distance

Now we want to consider the tree - Start with a chart

         S0   S1   S2  ...  Sx
    S0    0
    S1   12    0
    S2    2  ...
    Sx

Looks like PAM?  These are distances between things, so we use them to create the tree.
S1 and S5 close; S6 is far from everything else
Shortest superstring ...
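
A minimal sketch of how the chart above might be filled in, assuming the sequences are already aligned, the same length, and gap-free. It counts the fraction of differing sites, applies the Jukes-Cantor correction to estimate substitutions, and converts to a divergence time with T = K/(2r). The function names and the toy sequences are made up for illustration; they are not from the text or from any particular package.

```python
import math
from itertools import combinations


def p_distance(a, b):
    """Fraction of sites that differ between two aligned sequences (D_ab)."""
    if len(a) != len(b) or not a:
        raise ValueError("sequences must be aligned, equal-length, and non-empty")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def jukes_cantor(d):
    """Estimated substitutions per site: K = -(3/4) ln(1 - (4/3) d).

    The correction is undefined once d reaches 0.75 (saturation),
    so this sketch reports infinity in that case.
    """
    if d >= 0.75:
        return math.inf
    return -0.75 * math.log(1.0 - (4.0 / 3.0) * d)


def divergence_time(k, r):
    """Rearranged r = K/(2T): time since the two lineages separated."""
    return k / (2.0 * r)


def distance_chart(seqs):
    """Corrected distance for every pair of named sequences."""
    return {
        (a, b): jukes_cantor(p_distance(seqs[a], seqs[b]))
        for a, b in combinations(sorted(seqs), 2)
    }


# Toy sequences, invented for this example only.
toy = {"S0": "ACGTACGTAC", "S1": "ACGTTCGAAC", "S2": "ACGTACGTAT"}
for pair, k in distance_chart(toy).items():
    print(pair, round(k, 3))
```

The corrected K is always at least as large as the raw fraction of differences, since some sites may have changed more than once (or changed and reverted).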
Our algorithm will need to take this into account
Problem with slightly different input - table of distances; output is a tree
Algorithm is the same as for the greedy superstring problem
    Find closest - merge to new
    Find closest - merge to new
Problem: How do we deal with closest and merge, once we've started merging?
    S0 and S2 are the 2 closest (2 mil years apart)
    S0/S2 are merged
    S5 and S1 are similar (2 mil years ago)
    When I combine, which distance do I use?
Possible Solutions:
    1. Take the average - S0 to S1, S2 to S5, S2 to S1, S0 to S5
    2. Always take the shortest distance (with averages - can pull one more distant individual into an incorrect group)
    3. Longest
Compute the consensus sequence - consensus from everything else rather than a mathematical computation
    - Can't do the consensus with 2
    Average of A and C?  B
Biologist - decision making
Can do simulations to come up with a formula (rather than using statistics).

Lab! (Okay, Web exploration)
* You can find Phylip Web interface at http://bioweb2.pasteur.fr/phylogeny/intro-en.html
* For those who don't have the patience to come up with ten homologs, here are some sequences our authors suggest as interesting.
    + YP_026263
    + YP_002174466
    + ZP_02621494
    + NP_000048
    + XP_510594
    + NP_075529
    + NP_445977
    + CAC00282
    + ABA95537
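
Returning to the find-closest-and-merge algorithm: a rough sketch of how the three "which distance do I use?" choices (average, shortest, longest) might be coded. It assumes a complete table of pairwise distances and returns only the shape of the tree as nested tuples, with no branch lengths. The `cluster` function, the `linkage` parameter, and the toy distances are all hypothetical, chosen to mirror the S0/S2 and S1/S5 example from class; this is not what Phylip does internally.

```python
from itertools import combinations


def cluster(dist, linkage="average"):
    """Greedy merge of the two closest groups until one tree remains.

    dist maps frozenset({a, b}) -> distance for every pair of leaf names.
    linkage picks how to score the distance between two merged groups:
    "average", "shortest", or "longest" over all cross-group pairs.
    """
    combine = {
        "average": lambda ds: sum(ds) / len(ds),
        "shortest": min,
        "longest": max,
    }[linkage]

    leaves = sorted(set().union(*dist))
    # Each group is (tree so far, tuple of leaf names it contains).
    groups = [(name, (name,)) for name in leaves]

    def between(g1, g2):
        return combine([dist[frozenset((x, y))] for x in g1[1] for y in g2[1]])

    while len(groups) > 1:
        # Find closest ...
        i, j = min(
            combinations(range(len(groups)), 2),
            key=lambda ij: between(groups[ij[0]], groups[ij[1]]),
        )
        # ... merge to new.
        merged = ((groups[i][0], groups[j][0]), groups[i][1] + groups[j][1])
        groups = [g for k, g in enumerate(groups) if k not in (i, j)] + [merged]
    return groups[0][0]


# Toy distances (millions of years), invented for illustration.
toy_distances = {
    frozenset(("S0", "S2")): 2,
    frozenset(("S1", "S5")): 2,
    frozenset(("S0", "S1")): 12,
    frozenset(("S0", "S5")): 10,
    frozenset(("S2", "S1")): 11,
    frozenset(("S2", "S5")): 9,
}
print(cluster(toy_distances, "average"))
```

With the "average" choice, the distance from the merged S0/S2 group to the merged S1/S5 group is the mean of the four cross pairs listed in solution 1. On data this clean all three choices give the same tree; they can disagree when one individual sits between two groups.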