BIO/CSC295 2011F, Class 06: Gene Alignments (2)

Overview:
* The BLAST Paper.
* Exploring the BLAST algorithm.
* Needleman-Wunsch - The Basics.
* Using Dynamic Programming.
* An NW Example.
* Web Exploration.

Admin:
* Markham paper distributed: Due 22nd
* Remember: Project 2.6 is due Thursday.
* Due today: Response to Altschul et al. Email to Praitis and Rebelsky.
* Some general notes on response papers.
    * Please use a document title something like YourLastName-Response3.doc
    * Please use an email title something like "BIO/CSC 295: Response Paper 3"
    * If you want Sam to read your email, make sure that it has something in the subject that distinguishes it from a submission. "HELP" and "QUESTION" are good phrases.
    * Many good things. Kudos. Vida wishes more of you were in 251.
    * Liked the ways in which you drew upon your backgrounds.
    * But some of you are unable to make an argument with evidence. Or a specific enough argument.
        "It sucked because it had no statistics." and "They didn't do the important control."
        When you see a problem, identify, explain, and suggest improvements.
        vs.
        "It appears that if you are looking at random sequences, you will also get some matches. Hence, it would be important for the authors to consider the likely frequency of false positives and compare it to the frequency of positives they got in this document. Had they done this, we would have much more confidence that what they identified as a match was indeed a true match. While the authors did some analysis, there wasn't this much detail."
        Contrast with the BLAST paper, in which they carefully analyze the frequency of false and missed matches.
    * Talk to Prof. P. One way you learn about science is to react to papers, which forces you to read carefully.
* SandB profiles, revisited.
* Today is a very computational day.
* Programming as a video game.
    Level 1: It compiles (reads the source file without complaining).
    Level 2: It runs without crashing.
    Level 3: It runs and gives reasonable output on at least one input.
    Level 4: It runs and gives reasonable output on most inputs.
    Level 5: It runs and I've assured myself of its correctness with careful testing.
    Boss Level: I've written a formal proof that it's correct.
* EC for Kington's talk Thursday at 11 a.m.
* EC for DU and company on Robotics in CS Education, Thursday at 4:30 p.m. in 3821.
* EC for Friday's Biology Seminar: Bowers, "Break on Through (to the Other Side): The Assembly of Outer Membrane Proteins in Caulobacter crescentus."
* No home sporting events this weekend? (VB at Mac, FB at Ripon, Soccer at Principia)

Sam's favorite paper questions
* What is the problem domain?
* What are the authors' primary claims?
* What is the evidentiary structure of the paper?
* Why are we reading this paper? What is important?
* What critiques do you have of this paper?

A paper addresses an audience; some problem area

What problem area?
    Sequence comparisons are too slow.
    Input a query sequence - DNA or protein - and compare to database.
Why does this matter?
    Useful for looking at evolution.
    Other reasons last class....
Authors' claim / central thesis - a faster sequence alignment program; heuristic but good.
    Good tradeoff for sensitivity and computation time.
    Detect weak but biologically significant similarity
    (Better in terms of finding important sequences.)
    algorithm instead of program
What is the evidence that this works and it's faster?
    draw out what algorithm will do
    run data through it (real and made up data)
    runs fast and does a good job.
    Biological intuition? We need statistics
    Calculated real MSP vs heuristic score
    Run slow but accurate algorithm
    two approaches - stats (random data) and verify with actual organisms/data
How did they prove it was fast? Fig. 2
    This doesn't show us that it is faster than other algorithms
How does this work? 3 parts
    1. Compile a list of high scoring words
        input sequence
        database of lots of sequences
        What is a "word?"
        A word is a group of characters
        4 for protein; 13 for DNA?
        MVPKML
        first word is MVPK
        next is VPKM
        next is PKML, etc
        How are they high scoring?
    2. Scan database for matches for high scoring words
        MVPP is a good match ??? maybe
        Few high scoring words means less time... faster
    3. Once we have a match - we look on either side of a good match - extend the match in each direction
        If it's a poor match, we stop and move on
        Report best matches back to user

How does the scoring work??? Let's go back and discuss details
What is a PAM matrix?
* Sample score NDCQ vs NDCP = 2 + 4 + 12 + 0 = 18
* Sample score AAAA vs AAAA = 2 + 2 + 2 + 2 = 8
High scoring words are those that, matched to themselves, have "high scores"
Good to know which PAM is used (certain types of proteins might do better with a different PAM).

Step 2 Find matches to high scoring words

Step 3 How do we expand
    p. 97 text
    AMANAPLANPANAMA
    ZZZZZZZZMANNAFLANNANANA
    Score of PLAN? compare PLAN and FLAN
        -5 + 6 + 2 + 2 = 5
    expand to right?
        P to N   0  (total 5)
        A to A   2
        N to N   2
        N to M  -2
        A to A   2
        * vs X  -8
    When do we stop? Stop when score is more than 20 from highest score
        -8
        -8
        etc
        Stop
    Go the other direction!
        +2 = 13
        +2 = 15
        2 A's = 17
        2 M's are 6 = 23
        A vs Z = -8: 15
        -8: 7
        -8: -1
    report back best match; return score of 23
    It does nothing with deletions
    Set a threshold value to score

Why select 4 and 17? We played around with it and it worked well.

Additional hacks: Hard-code things to ignore, such as repeat stretches
* 10% of the genome is Alu sequences

Criticism:
* Doesn't discuss how it deals with deletions
    * But we can expand two high-scoring words around the deletion, and the thing reported back will show you what's happening, including the deletion
    * Amazingly, humans can synthesize data, too.
* For CS people: No formal big-O running time analysis
    * A way to formally compare the speed of algorithms
    * What does the curve of the algorithm look like: Is it quadratic, or cubic, or logarithmic?
    * Independent of the hardware you're running it on.
    * There is an equation somewhere in the paper: aW + bN + cNW/20^w (p. 408)
* No careful comparison vs. other algorithms. Just "it's faster" (data not shown)
* The DNA sequence stuff seems a little more ad hoc.

Side notes:
* The word idea is pretty new

---

On to Needleman-Wunsch

This is also a sequence comparison algorithm
Assumes you are working with 2 sequences
    you have a short input sequence and a long database
Goal is to find the best possible alignment
This takes into account insertions and deletions

How do we do the comparisons?
    Do every single alignment. Takes a long time....
Needleman-Wunsch tries a better approach
They improve upon the "try every alignment" strategy
    1. Systematic strategy
    2. Dynamic programming (makes it fast)

Basic strategy - start at the end of both sequences
3 possibilities -
    1. It's a MATCH! But is it a good match? Need some score for the alignment
    2. In order to match - shift to the left (delete the last thing in the database)
    3. In order to match - we should do a deletion in the search string.
Both 2 and 3 have scores & costs associated with them.
A shift is a -1 score; a match is +1. No match is 0.

Three choices - How do we compare them? We look at more sequence
    Value of case 1 (match) is 1 + value of (rest of the search string, rest of database)
    Value of case 2 is -1 + value of (all of the search string, rest of database)
    Value of case 3 is -1 + value of (rest of the search string, all of database)

For 2 strings of length 5: 3 cases; at 4,4, 3 more; at 3,3, 3 more; etc
Long database
    2^20 = 1 million (approx); computer does 1 million steps/sec
    2^40 is a trillion; this is a lot...
    2^80 is a trillion trillion - longer than we will be alive (heat death of the universe....)
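A minimal sketch of the three-case strategy as a direct recursion, using the scores from the notes (match +1, no match 0, shift -1). The names naive_score and calls are made up for illustration. The notes do not say what happens when one string runs out; this sketch assumes every leftover character costs one shift.

```python
def naive_score(query, database, calls=None):
    """Best alignment score, trying all three cases at the end of both strings."""
    if calls is not None:
        calls[0] += 1  # count how many times we are asked for a value

    # Assumption (not in the notes): when one string is used up,
    # each leftover character in the other costs one shift (-1).
    if not query:
        return -len(database)
    if not database:
        return -len(query)

    # Case 1: pair the last characters (+1 for a match, 0 for no match).
    pair = (1 if query[-1] == database[-1] else 0) \
        + naive_score(query[:-1], database[:-1], calls)
    # Case 2: shift, deleting the last thing in the database.
    skip_db = -1 + naive_score(query, database[:-1], calls)
    # Case 3: shift, deleting the last thing in the search string.
    skip_query = -1 + naive_score(query[:-1], database, calls)

    return max(pair, skip_db, skip_query)


calls = [0]
print(naive_score("ACGT", "AGCA", calls), "calls:", calls[0])
```

Each call makes three more calls, so even for two strings of length 4 the number of calls is far larger than the 25 distinct (prefix, prefix) pairs that exist. That gap is the repeated work.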
We often write long algorithms as O(3^n) or O(2^n)

If you do this by hand, we repeat a lot of work
    ACGT vs AGCA
        match T,A (ACG, AGC)
        delete top (ACG, AGCA)
            match
            delete top
            delete bottom
            Begin to see computations that repeat
        delete bottom

Instead of starting from back and working forward, let's build a table (dynamic programming)
A cell within table will tell us best match of substring
Store values in the Table
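
A minimal sketch of the table idea, using the same scores as above (match +1, no match 0, shift -1). The names nw_table and nw_score are made up for illustration. Two assumptions that are not in the notes: the first row and column charge one shift per skipped character, and the table is filled over prefixes (front to back), which is the usual way to write it and gives the same final score.

```python
def nw_table(query, database, match=1, mismatch=0, gap=-1):
    """Fill a table where table[i][j] is the best score for aligning
    the first i characters of query with the first j of database."""
    rows, cols = len(query) + 1, len(database) + 1
    table = [[0] * cols for _ in range(rows)]

    # Assumption: aligning a prefix against nothing costs one shift per character.
    for i in range(1, rows):
        table[i][0] = i * gap
    for j in range(1, cols):
        table[0][j] = j * gap

    for i in range(1, rows):
        for j in range(1, cols):
            same = query[i - 1] == database[j - 1]
            # Case 1: pair the two characters.
            pair = table[i - 1][j - 1] + (match if same else mismatch)
            # Case 2: skip a database character.
            skip_db = table[i][j - 1] + gap
            # Case 3: skip a search-string character.
            skip_query = table[i - 1][j] + gap
            table[i][j] = max(pair, skip_db, skip_query)
    return table


def nw_score(query, database):
    """Best score for the whole of both strings: the bottom-right cell."""
    return nw_table(query, database)[-1][-1]


for row in nw_table("ACGT", "AGCA"):
    print(row)
print(nw_score("ACGT", "AGCA"))
```

Each cell is computed once from three neighbors that are already filled in, so the work grows with the size of the table (length of query times length of database) rather than with 3^n.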