BIO/CSC295 2009F, Class 07: Gene Alignments (2)

Admin:
* For next Tuesday: Work on the On Your Own project from Chapter 2.
* You will not need to turn in today's Web Exploration.
* Response papers due!
* New groups for chapter 3

Overview:
* The BLAST Paper.
* Simulating the BLAST algorithm by hand.
* Web Exploration.

The BLAST Paper
* Overall, hard to criticize
    * How do you argue with math?
    * Hard to understand the math
        * That's why 75% of statistics are made up on the spot
    * Vague! How do they take into account insertions and deletions?
    * Lots of references to other disciplines
* Some criticisms
    * Compare to other algorithms!
        * Accuracy
        * They do time
    * Strange that they applied the same algorithm to DNA and AA sequences, given that in codons, mutations in the 3rd nucleotide are less significant
        * Insertions and deletions are likely to be different, too
    * Speed was clearly a big factor. Is it still a factor now?
        * Vida notes that 1.5G base pairs are added to the database each month
        * Currently 250G base pairs
    * But, hey, it's all parameters. You can change them (and they probably have)
    * Lots of statistics, but in the end, it's "4 feels right"
* Questions Sam asks about every paper
    * What's the problem domain that the paper is addressing? / Why are they doing this?
        * Similar to those that some dynamic programming algorithms address
    * What is their contribution?
        * It's faster!
        * And just as sensitive!

search for substrings in a large database
    exact matches?
    search for similarity? close matches
    Why do we care? We won't get an exact match?
Why do we care about sequence matches???
    evolutionary relationships
    find similar genes - look at relationships
    find similar genes - can infer function!!
Claim they can do it quickly!!!
Paper structure
    history of algorithms that search & align seqs
        advantages; disadv.
        how theirs is better
    describe methods - how it works; how to overcome problems
        algorithm, not method
    results - test it!
        simulated and real data
Why publish in Biology?
    not interesting CS
        Is it really not interesting CS???
    audience is biologists (not CS people)
FASTP vs. dynamic programming
    dynamic programming assigns scores
    FASTP is faster - no idea about scoring
    They want FAST and understandable SCORING
Three phases to the algorithm
Before we do that...
Measurement - how do they score an alignment?
    TLA: MSP - maximal segment pair
    What is this? The substrings that score highest using a matrix
        best alignment
    Does MSP consider length?
        identical
    One approach: compare all length-1 strings, length-2 strings, etc.
        not efficient?
        ATAG vs AATC
    Substitution matrices - biological vs. empirical
        PAM
        PLAN vs FLAM: -5 + 5 + 3 - 3 = 0
        DNA: +5 for a match, -4 for a mismatch
            4 matches balance 5 mismatches
            a parameter in BLAST that you can change!
Try all alignments and find the "best"
    Find all the high-enough MSPs quickly
    find something close to the MSP: Locally Maximal
    Start with a local max; expand non-conservatively (2 or more positions)
    How do you go from local maxima to the MSP?
Goal: find the MSP
3 steps
    1. Compile a list of high-scoring words
        inputs: query string AND threshold (to determine score)
        word length
        T - threshold, and S - score (is this a good enough MSP?)
        T is the score cutoff for variants of words in the string
        FLAMING
            FLAM LAMI AMIN MING
            ELAM, FLAM, FLAN, etc.
            keep those above T
        Scoring matrix
        can systematically make every variant; find its value
    2. Search for exact matches of the words in the database
        There are lots of words!!
    3. Expand these matches to approximate matches
        Use some expansion procedure
        Do it for every place you have a match
        Keep all those that score at least S
How do they expand?
    to economize time, extend in one direction until the score drops too much; then go in the other direction
    WHY 20? They don't tell us
Not productive to search for really common words - "and", "the", ...
Parameters
    What do they do?
    3 things, plus the lengths of the string & database (fixed): T, S, w
    T: if the score is set at a low value, get more MSPs
        more approx. pairs - more garbage
        more coverage, too!
    Look at the table on p. 407
        S is lower = more garbage, don't miss as much
        S is high = less garbage, more misses
    Parameters affect running time = more words take longer
        1 BLAST search took 10 s; 1 took 5 minutes - computer time
Are there defined repetitive elements?
    Yes - Alu, about 300 bp each, roughly 10% of the genome in humans
PLAY with BLAST - What does it do with deletions?
    NOTHING! That comes later...
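
For simulating the algorithm by hand, the three steps in the notes (word list above T, exact matches in the database, expansion kept if at least S) can be sketched in a few lines of Python. This is a toy sketch under stated assumptions, not the code from the paper: it uses the DNA scores from the notes (+5 match, -4 mismatch) in place of a PAM matrix, a short word length w = 4 so that every variant can be enumerated, and it expands each direction separately. The function names and the `drop` parameter are made up for illustration.

```python
from itertools import product

MATCH, MISMATCH = 5, -4  # DNA scores quoted in the notes
ALPHABET = "ACGT"


def score(a, b):
    """Ungapped score of two equal-length strings, position by position."""
    return sum(MATCH if x == y else MISMATCH for x, y in zip(a, b))


def neighborhood(query, w, T):
    """Step 1: every word scoring >= T against some w-letter word of the query."""
    words = {}
    for i in range(len(query) - w + 1):
        qword = query[i:i + w]
        for cand in map("".join, product(ALPHABET, repeat=w)):
            if score(qword, cand) >= T:
                words.setdefault(cand, []).append(i)
    return words  # word -> list of query offsets


def find_hits(words, database, w):
    """Step 2: exact matches of listed words, as (query offset, database offset)."""
    hits = []
    for j in range(len(database) - w + 1):
        for i in words.get(database[j:j + w], ()):
            hits.append((i, j))
    return hits


def extend_one_way(query, database, qi, di, step, drop):
    """Step 3 helper (hypothetical): walk from (qi, di) in direction step
    (+1 or -1) with no gaps; stop once the running score falls more than
    `drop` below the best seen. Returns (best gain, letters kept)."""
    best = run = kept = n = 0
    while 0 <= qi < len(query) and 0 <= di < len(database):
        run += score(query[qi], database[di])
        n += 1
        if run > best:
            best, kept = run, n
        elif best - run > drop:
            break
        qi += step
        di += step
    return best, kept


def blast_like(query, database, w=4, T=11, S=30, drop=20):
    """Return (score, query start, database start, length) for segments >= S."""
    found = set()
    for i, j in find_hits(neighborhood(query, w, T), database, w):
        base = score(query[i:i + w], database[j:j + w])
        gain_r, right = extend_one_way(query, database, i + w, j + w, +1, drop)
        gain_l, left = extend_one_way(query, database, i - 1, j - 1, -1, drop)
        total = base + gain_l + gain_r
        if total >= S:
            found.add((total, i - left, j - left, w + left + right))
    return sorted(found, reverse=True)


if __name__ == "__main__":
    # The query sits at offset 4 of the database; the top segment is the full match.
    print(blast_like("GATTACAGCT", "CCCCGATTACAGCTCCCC")[0])  # (50, 0, 4, 10)
```

Lowering T or S in the call lets more words and more segments through (more garbage, more coverage), which matches the trade-off noted under Parameters. Because nothing here inserts a gap, the sketch also does nothing with deletions.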