BIO/CSC295 2009F, Class 11: Protein Alignments (2)

Admin:
* Math department is organizing a trip to Iowa State to talk to folks about graduate school (biostats, engineering, etc.) the week after fall break. Talk to Fai about getting a chance to go along!
* Neverland Players doing a show this weekend (no EC, but fun)
* EC for Grinnell Singers performing Rachmaninoff's All-Night Vigil, 7:30 p.m. Friday, Sebring-Lewis
* EC for Tolga's talk on complex systems on Monday (it's fun): noon, 3821, LOTS OF FREE PIZZA!
* Oh no! Family weekend is here. Go see your colleagues' posters.
* Renaissance Faire on Mac Field this Saturday. Sword fighting in the rain.
* Readings for Tuesday: Chapter 5 and the HIV paper.
* One of the elementary schools in town was unexpectedly cancelled. We'll have children playing in the next room.

Overview:
* Exploring Markham's data.
* Lab: Project 4.5.

Presentations (ha ha! on your projects)
* What question were you exploring?
* What data did you use (what sequences did you choose)?
* What did you learn?
    * In relation to your question
    * Other interesting things

Group One:
* Exploring the question of how to categorize progression of the disease
* Subjects 3 and 15 seemed to be abnormal rapid progressors
    * Typical profile is diverse and seems to get more diverse over time
    * Subjects 3 and 15 had a fall in diversity
    * Not explained in the paper
* Goal: Compare rapid progressors
* Built trees for 3, 10, and 15, using all the visits for each
    * Lots and lots of trees
* Problem: Understanding the ClustalW output scores
    * In terms of these scores, 3 and 10 seemed more similar (much higher average scores than 15)
* Doing more work: Could have compared to a non-progressor
* Can you change your characteristics (from rapid progressor to non-progressor or ...)?

Group Two:
* How does diversity change over visits?
* Does diversity change at a constant rate or a faster rate over visits?
* Data: All of the visits of one person (patient 9).
* Measure of diversity: Average alignment score.
* Assumption: Lower scores meant more diversity.
* Didn't try to fit a curve (perhaps not enough data, and not worth the effort)
* Confounder: The snapshot at each visit might not include all of the strains

Group Three:
* Considered just one subject (subject 8)
* Compared one early visit to one late visit
* Looked at the number of mutations between strands and what those mutations were
    * Was it one strand that was off from the others, or multiple strands?
* In the early visit, a few sites seemed to be common (A->G mutation was common)
* More mismatches and more random later on

Group Four:
* Worked on patient 9
* Took all the strains from the 1st, 4th, and 8th visits
    * Different number of strains in each visit (5, 8, 3)
* Hypothesis:
    * Diversifies as pressure is put on system
    * Then specialization
* Problem: Paper suggests that there are more strains at the 8th visit
* Interesting shifts in what the mutations were
    * In the last visit, there were big insertions and deletions
* Coding tools:
    * Would be useful to have a file that has lots of FASTA entries and do pair-by-pair comparisons
* DON'T RELY ON SCORES ALONE!

Commonalities
* Understanding the ClustalW output scores
* Are there good ways to use the "align pairs" algorithm to align more than just pairs of sequences?
* Would have liked a score for the similarity between the two sequences

Questions from students
* How do you conduct a study on HIV resistance?
    * Split population into two groups: control and experimental
    * Give placebo to control, give treatment to experimental
    * Expose everyone to HIV [No, that's not ethical]
    * Wait a few years.
    * Count the number of people in each group that get the disease. (81 vs. 54)
    * Do statistical analyses.

Questions from Sam:
* Suppose you didn't have ClustalW:
    * How would you measure diversity?
    * How would you build it?
* One strategy:
    * Assume everything is equal length
    * Align in the normal sense (just write one on top of each other)
    * At each position, count the diversity
    * Average or sum or ...
* Critiques:
    * Need to take into account diversity in group
    * Need to be careful about the math - don't make the measure dependent on sequence length
    * Measure of diversity is affected by number of sequences
    * How do we prioritize lots of diversity at one position vs. less diversity at multiple positions?
    * Doesn't account for position (e.g., in codon), although it looks like ClustalW doesn't pay attention to this either.
* Note: Perhaps we should be looking at the protein level rather than the mRNA level
* Question: Can we weight things according to the effects of the mutation?
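
A rough sketch of the "one strategy" above, in Python. This is not something we wrote in class. Assumptions: the sequences are already the same length and stacked position by position, "count the diversity" is taken to mean the Shannon entropy of each column, and we average over positions so the result does not depend on sequence length. The function names and the toy sequences are made up for illustration. The second function is one possible take on the pair-by-pair comparison idea: the average fraction of mismatched positions over all pairs.

```python
import math
from collections import Counter
from itertools import combinations


def _check_equal_length(seqs):
    """Raise if the input is empty, has empty sequences, or mixes lengths."""
    lengths = {len(s) for s in seqs}
    if len(lengths) != 1 or 0 in lengths:
        raise ValueError("need non-empty sequences that all have the same length")


def column_entropy(column):
    """Shannon entropy (in bits) of the characters at one position."""
    total = len(column)
    counts = Counter(column)
    return sum((n / total) * math.log2(total / n) for n in counts.values())


def mean_column_entropy(seqs):
    """Average per-position entropy.

    Averaging (rather than summing) keeps the measure from growing
    simply because the sequences are longer.
    """
    _check_equal_length(seqs)
    columns = list(zip(*seqs))  # one tuple of characters per position
    return sum(column_entropy(col) for col in columns) / len(columns)


def mean_pairwise_mismatch(seqs):
    """Average fraction of mismatched positions over all pairs of sequences."""
    _check_equal_length(seqs)
    pairs = list(combinations(seqs, 2))
    if not pairs:
        return 0.0  # a single sequence has nothing to differ from
    length = len(seqs[0])
    fractions = [sum(a != b for a, b in zip(x, y)) / length for x, y in pairs]
    return sum(fractions) / len(pairs)


if __name__ == "__main__":
    # Hypothetical toy "visit": three strains, four positions each.
    visit = ["ACGT", "ACGA", "ACTA"]
    print(mean_column_entropy(visit))
    print(mean_pairwise_mismatch(visit))
```

Neither measure answers all of the critiques. Both still depend on how many strains were sampled at a visit, and neither weights a position by where it falls in a codon or by the effect of the mutation.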