BIO/CSC295 2011F, Class 10: Sequence Assembly (1)

Overview:
* Leftovers from Protein Alignment
    * Similarity Matrices.
    * Lab.
* Sequencing Assembly
    * A Biological Perspective
    * Shotgun Assembly: A Computational Perspective

Admin:
* Sam is back. Sorry for the extended absence, but citizenship requires it. He is likely to remain discombobulated for a few days.
* Thursday, we will meet downstairs in the Molecular lab, Science 1104.
    * Long pants, closed-toed shoes, hair tied back, no loose clothing
    * Contacts okay
    * Regular notebook okay. Leather-bound vellum lab notebook preferred
* Project 2.5 returned. General comments via email. BLAST papers will take a little more time.
* The lab instructions and protocol for the biology lab are now available.
* OPTIONAL Programming work for chapter 5 (due Tuesday): (Formal version to follow)
    * Write a procedure, generate_fragments(sequence, minlen, maxlen, coverage), that generates a bunch of random fragments from the sequence, each with a length between minlen and maxlen. You should generate enough fragments to get the average coverage specified.
    * Write a procedure, cut_fragments(fragments, pattern), that builds a new set of fragments by cutting each fragment in fragments at the portion that matches pattern.
    * Write a procedure, filter_fragments(fragments, minlen, maxlen), that takes a list of fragments and removes those that are smaller than minlen or larger than maxlen.
    * Write a procedure, determine_coverage(sequence, fragments), that aligns each fragment to the sequence and indicates how much each position is covered.
    * Present your program and some interesting sample runs of your program that show how well the two fragmentation techniques work.
* There will be a paper for a week from Thursday.
* CS Picnic, Friday, October 7. Sam should have the signup tickets on Thursday.
* EC: From Eternity to Here: Shrinkage in American Thinking about Higher Education. Today, 4:15, JRC 101.
* EC: Thursday Extra; Max Kaufmann '12 on generating parallel corpora. Thursday, 4:30, SCI 3821.
* EC: Biology Seminar Friday at noon. An epidemiologist from UIowa talking about careers in epidemiology.
* EC: Men's Soccer vs. St. Norbert, Saturday at 2:00ish p.m.
* EC: Math Presentation on Random Binary Sequences, Noon, FREE PIZZA

Not all amino acids are created equal
* As we do alignments, we are more likely to make some substitutions than others.
* Like amino acids substitute for each other: like charge, hydrophobicity, hydrophilicity, size, etc.
* Real data: statistics (a substitution matrix built from real data)
* ALG: how likely is it for one to substitute for the other?

PAM - Point accepted mutation
* phylogeny
* partial sequence
* take closely related sequences
* count changes for each amino acid
* Gives us real-world data for switches

Matrix has positive integers and negative integers. How does an integer represent probabilities?
* 2 amino acids, i & j
* f(i) and f(j): the frequency of each in the data set.
* M(i,j): the probability that i mutates to j (or j to i)
* Put this together and scale inversely by frequency: x / (f(i)f(j)), with x = f(i)M(i,j), which gives f(i)M(i,j) / (f(i)f(j))
* The range of these numbers is not useful, so we take the log. (A ratio above 1 gives a positive score, a ratio below 1 a negative one; scaling and rounding gives integers.)

PAM1: 1% mutation rate, but we often want to compare sequences with more mutation.
* PAM100: 100 accepted mutations per 100 positions (took 100x longer). That is not 100% of positions changed, since one position can change more than once.
* PAM1 = 1% mutation; PAM100 = 1% repeated 100x
* Fuzzy: multiply PAM1 x PAM1 x ... 100 times. Scale it!
* PAM250 gives us a good range for mutations in living creatures.

Matrix multiplication; look at the possibilities for A to G over two steps:
* A stays the same, then goes to G
* A goes to G, then G stays the same
* A goes to something else (lots of choices here), then that goes to G

What is wrong with using PAM250?
* We use this for sequence alignment (BLAST). Does starting with closely related sequences make sense?
* Model is too theoretical? The expansion (multiplying matrices) may be too divorced from real data.

How do we calculate M?
* Align; count the number of times A aligns to G; divide by the total number of pairs.
* If A is abundant, we will see more examples.
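
The counting, matrix-multiplication, and log steps in the notes above can be sketched in a few lines of Python. This is only a toy, not the actual PAM construction: the three-letter alphabet, the made-up alignments, and every function name are assumptions for illustration. It also assumes that M is normalized per row (so each row sums to 1 and matrix multiplication makes sense), that the log is base 10 with a scale factor of 10, and it skips the rescaling to exactly 1% mutation.

```python
import math
from collections import Counter

ALPHABET = "AGL"  # hypothetical 3-letter alphabet; real matrices cover 20 amino acids

# Made-up, gap-free alignments of "closely related" sequences (illustration only).
ALIGNED_PAIRS = [
    ("AAAAGGGGLLLL", "AAAGGGGGLLLL"),  # one A/G switch
    ("AAAAGGGGLLLL", "AAAAGGGLLLLL"),  # one G/L switch
    ("AAAAGGGGLLLL", "AAAAGGGGLLLA"),  # one L/A switch
]

def count_pairs(pairs):
    """Count how often residue a sits opposite residue b, in both directions."""
    counts = {a: {b: 0 for b in ALPHABET} for a in ALPHABET}
    for s, t in pairs:
        for a, b in zip(s, t):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def frequencies(pairs):
    """f(i): fraction of all residues in the data set that are i."""
    tally = Counter(ch for s, t in pairs for ch in s + t)
    total = sum(tally.values())
    return {a: tally[a] / total for a in ALPHABET}

def mutation_matrix(counts):
    """M(i,j): estimated probability that i is replaced by j (assumed row-normalized)."""
    M = {}
    for i in ALPHABET:
        row_total = sum(counts[i].values())  # assumes every residue appears in the data
        M[i] = {j: counts[i][j] / row_total for j in ALPHABET}
    return M

def multiply(A, B):
    """One more step: sum over every intermediate residue k."""
    return {i: {j: sum(A[i][k] * B[k][j] for k in ALPHABET) for j in ALPHABET}
            for i in ALPHABET}

def matrix_power(M, n):
    """Repeat the one-step matrix n times, starting from the identity."""
    result = {i: {j: 1.0 if i == j else 0.0 for j in ALPHABET} for i in ALPHABET}
    for _ in range(n):
        result = multiply(result, M)
    return result

def log_odds(M, f, scale=10):
    """Score = round(scale * log10(f(i)M(i,j) / (f(i)f(j)))) = round(scale * log10(M(i,j)/f(j)))."""
    return {i: {j: round(scale * math.log10(M[i][j] / f[j])) if M[i][j] > 0 else None
                for j in ALPHABET}
            for i in ALPHABET}

counts = count_pairs(ALIGNED_PAIRS)
f = frequencies(ALIGNED_PAIRS)
M = mutation_matrix(counts)
M2 = matrix_power(M, 2)                           # two steps of change
scores_1 = log_odds(M, f)                         # one-step scores
scores_250 = log_odds(matrix_power(M, 250), f)    # 250 steps

for name, table in (("1 step", scores_1), ("250 steps", scores_250)):
    print(name, {i + j: table[i][j] for i in ALPHABET for j in ALPHABET})
```

On this toy data the one-step scores are 4 on the diagonal and -9 off it, which is where the positive and negative integers come from. M2["A"]["G"] adds up exactly the three routes listed above (A stays then changes, A changes then G stays, A goes through L). After 250 steps every score on this tiny alphabet collapses to 0, one way to see why repeated multiplication can drift away from real data.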
This doesn't take into account that some parts of a protein are more likely to change.

BLOSUM
* Works like PAM, but they don't start with closely related sequences.
* Why 62 and why 250? They work well.
* There are multiple PAMs and BLOSUMs; select which one to try depending on what you expect.
* That seems circular...

Now we'll do LAB!!!!! and more lab!
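
Looking ahead to shotgun assembly and the optional chapter 5 programming work listed under Admin, here is one possible sketch of generate_fragments and determine_coverage. The signatures follow the Admin description; everything else is an assumption: "average coverage" is read as total fragment length divided by sequence length, fragments are error-free, "aligning" is just an exact string search that takes the first match, and the extra rng parameter is added only to make runs repeatable.

```python
import random

def generate_fragments(sequence, minlen, maxlen, coverage, rng=random):
    """Draw random substrings until their total length reaches coverage * len(sequence)."""
    if not 1 <= minlen <= maxlen <= len(sequence):
        raise ValueError("need 1 <= minlen <= maxlen <= len(sequence)")
    target = coverage * len(sequence)
    fragments = []
    total = 0
    while total < target:
        length = rng.randint(minlen, maxlen)            # inclusive on both ends
        start = rng.randint(0, len(sequence) - length)  # fragment must fit in the sequence
        fragments.append(sequence[start:start + length])
        total += length
    return fragments

def determine_coverage(sequence, fragments):
    """Return a list giving, for each position, how many fragments cover it.

    Assumption: each fragment is placed at its first exact match; a real
    aligner would have to cope with sequencing errors and repeats.
    """
    depth = [0] * len(sequence)
    for fragment in fragments:
        start = sequence.find(fragment)
        if start < 0:
            continue  # no exact match, so this sketch simply skips it
        for pos in range(start, start + len(fragment)):
            depth[pos] += 1
    return depth

rng = random.Random(295)  # fixed seed so the sample run is repeatable
genome = "".join(rng.choice("ACGT") for _ in range(200))  # made-up test sequence
fragments = generate_fragments(genome, 20, 40, coverage=5, rng=rng)
depth = determine_coverage(genome, fragments)
print(len(fragments), "fragments; mean depth", sum(depth) / len(genome))
print("uncovered positions:", depth.count(0))
```

Because start positions are uniform, the ends of the sequence tend to get less coverage than the middle, which is worth looking at in the sample runs.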