BIO/CSC295 2011F, Class 12: Sequence Assembly (2)

Overview:
* Sequencing DNA
* CS Detour: TSP and NP-Completeness
* Reassembling Shotgun Sequences
* Testing Assembly Algorithms

Admin:
* Chocolate! (Well, kinda)
* Returned: BLAST responses; HIV responses (most)
    * Sam's comments distributed electronically
    * Prof. Praitis will get the rest back to you later this week
    * Bonus! You get inconsistent feedback from multiple profs
* Reminder: You should be building a portfolio of code, responses, and reflections.
* Distributed electronically: Previous mid-semester examination. We will discuss the midsem next Tuesday.
* Response to Kellis et al. is now due next Tuesday.
    * It's long. So start early.
* Thursday is homecoming in Grinnell. If you've never seen a small-town Iowa homecoming parade, it's worth going at least once.
* EC for Thursday's Convocation on the Future of the Book.
* EC for Thursday's CS extra on Computer Vision (4:30, 3821).
* EC for Friday's Biology Seminar (noon, 2021).
* EC for Healthy Iowa Walk noon on Friday.
* EC for Friday's Volleyball game (7pm, Darby).
* EC for Saturday's Football game (1 pm).
* EC for Saturday's Men's Soccer game (1:30 pm).
* Please behave responsibly at 10-10 activities.
* EC for orchestra 10/8 at 2pm

Bioinformatics to date
* Analyze lots and lots of sequence data
    * DNA sequences
    * Proteins / amino acid sequences
* Making sense of the data really helped build the field
* To the computer scientist, it's just data
    * But someone had to come up with the data
* Today: How do you get DNA sequence data
* Past: Converting DNA known to be a gene to protein (or AA sequence)
* Tuesday: Identifying genes

On to DNA Sequence Data
* How do you get a full chromosome of data?
    * In Eukaryotic organisms, DNA is in chromosomes
    * Human DNA: 3 billion base pairs
* More than a decade ago: Move to sequence the human genome
* Dramatic changes in technology
    * Biology
    * CS analysis
* What's the basic process for getting sequence data?
    * Sequencing reaction
        * DNA polymerase - an enzyme that replicates DNA
        * Normal bases - dNTPs (deoxynucleotide triphosphates; chemistry coming)
        * Buffers and such
        * Alternative bases - ddNTPs (dideoxy)
        * Primers
        * Template DNA
    * When the sequencing reaction reaches a ddNTP, it stops
    * Similar to PCR
    * Process
        * Open up the DNA strand
        * Match up the primer
        * Add bases to the primer, matching as you go
    * Chemistry that Sam can't put in the EBoard
        * At the 3' end, you have a free hydroxyl, OH
        * The free hydroxyl is really important for the chemical reaction to proceed
        * It bonds with the phosphate on the next base
        * The polymerase helps it bond
        * All DNA synthesis reactions in the cell follow the same process.
    * Back to the process: At some point, we will add a "poison base" that does not have the 3' hydroxyl. You can't form any more bonds, so the chain terminates.
    * Note: This reaction is happening thousands of times. You should therefore get a termination at "every single position" along the way.
        * A strand ten bases long
        * A strand eleven bases long
        * A strand twelve bases long
        * ....
    * In the original Sanger method, you do one poison base per reaction - four different sets
    * Then run it out on a gel.
    * Poison bases are radioactive, so you can see where each thing is

[Gel sketch: four lanes labeled G, C, A, T, with bands at different heights]

You can do about 500 bases at a time, and you have to do it in four lanes.

Improvement:
* Run it in capillaries rather than on gels
* Fluorescent poison bases, all in the same tube
* And spectrophotometry lets you read off of the capillary
    * Sam can't put the cool graph of spec. output on the EBoard, and the book represents it in green. Deal.
* This allows us to do 750-1000 bases at a time
* Faster b/c four at once and using capillaries
* Cheaper, too

Note:
* Both gels and capillaries separate DNA sequences based on size
* Capillaries do it better, which lets you read further
* We're still capped toward the end because the longer fragments are less likely to happen
* For some other projects, gels remain fast and cheap

How do you go from sequencing fragments to sequencing the entire chromosome?
* These days, we can base our work on homologies to known systems
    * First Euk. genome: yeast (S. cerevisiae)
    * First multicellular: Some worm (C. elegans)
* One strategy: Directed sequencing
    * Cut it using restriction enzymes into lots of small (but not so small) pieces
        * Restriction enzymes recognize sites of 6, 8, or 10 bases - they don't tell you a lot
        * We're dealing with things between 40K and 300K base pairs
    * Put each into a bacterium, with a known area. And then start sequencing from the end. Once you've gotten 500 or so, create a new primer, and continue.
    * Problem: Large repetitive sequences in the human genome
        * And because of the repetitive sequences, there are some who say we have not really sequenced the human genome
        * And repeat sequences are highly variable; we each have different ones
    * Also relatively easy to align
    * Technique of the main human genome project
* Second strategy, by Venter@Celera, that savvy guy: Shotgun sequencing
    * Use restriction enzymes to cut up a DNA fragment
    * But into MUCH SMALLER PIECES, each of which you can sequence in one run
    * Then use the magic of computers to assemble all of these really small fragments
    * Also the problem of dealing with imperfect matches
    * This also fails miserably for repeats
* What's the distinction between the two?
    * Different fragment sizes
    * Easier to assemble things in directed sequencing
* Note: The Human Genome Project was impressive in people's willingness to share.
* Why is shotgun supposedly faster, given that you still have to sequence everything?
    * You can use the same primer set for every single piece in shotgun
    * In directed sequencing, you need the results of one run in order to set up the primer for the next

From the Biology part to the Computational Part

How do we assemble all the little fragments?
Get lots of original fragments
The hope is that we get the original string reassembled
Input: a whole bunch of little strings
Output: one large string (or several, as there are multiple chromosomes)
We want the large string to be correct!!!!
Some indication there is a lot of overlap.
We want the shortest possible string that contains all the little strings
The minimum superstring problem....
Even small repeats can screw you up.
    That Sam I am, that Sam I am, I do not like that Sam I am.
    Do you like green eggs and ham.
    I do not like them Sam I am.
How do we align them? Can we align them in a shorter string?
Venter/Celera did statistics - if you overlap 8-fold, you don't really have this problem
Clear to everyone that looks at this problem.
The biological problem is as hard as the CS problem.
We ask ourselves - How hard is this problem???
Sometimes we write algorithms and we are done
We write one - let's make it better.
We write one - we can't think how to make it better - we prove we can't make it better
Superstring - make every permutation. Align first, align next, align etc.
We can prove that one of these will give us the shortest superstring.
Scale: 1000 bp per strand, a million strands (x8?)
We can't find anything faster than this approach.
Shortest superstring is an NP-complete problem (this means REALLY REALLY HARD!!!!)
1. No known fast algorithm
2. (Known algorithms are slow: N factorial or 2^N)
3. If you find a fast solution, there are fast solutions for all other NP-complete problems.
300 identified useful problems with no known fast algorithms (solve one, solve another) - lots of money!!!!!!!
But people think these are not solvable quickly (but this hasn't been proven).
Traveling sales critter...
(man, gender)
A typical sales critter needs to visit many cities
In an optimal world you can get from any city to another
Cost associated with each trip: miles
Our sales critter wants to drive as little as possible
What is the shortest path?
Nice obvious approach:
    Make every permutation of the cities to establish paths
    Pick the smallest
Known slow, correct algorithm
N factorial
1! = 1; 2! = 2; 3! = 6; 4! = 24; etc.
20 cities: 20! is about 2 * 10^18 paths
At 10^6 paths/s - how many sec? 2 * 10^12 (tens of thousands of years)
Heat death of the universe!!!
What do we do? Give up???
Correct but slow
What do we do? Cheat?
Approximate the answer!
Not shortest, but close to the shortest (within 10%)
Another way - we have an algorithm that works some of the time.
Suppose we don't need to be optimal, but close.
Fast, but reasonable.
Privilege longer strings....
Take the 2 longest - see how much overlap
Start with some - ask for some level of overlap (min overlap/most overlap)
n + (n-1) + (n-2) + ... + 1 = n(n+1)/2
Have a great day!!!
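
Postscript: one way to make the "fast, but reasonable" idea concrete. This is only a sketch of a common greedy heuristic (repeatedly merge the two fragments with the largest suffix/prefix overlap), not necessarily the approach we will settle on in class. The names overlap, drop_contained, greedy_assemble, and the min_len parameter are made up for illustration. It assumes exact matches only (no sequencing errors), and it can give a longer-than-optimal answer, especially when there are repeats.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that is also a prefix of b.

    Returns 0 if the best overlap is shorter than min_len.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def drop_contained(frags):
    """Remove duplicates and fragments that sit entirely inside another one."""
    unique = list(dict.fromkeys(frags))
    return [f for f in unique if not any(f != g and f in g for g in unique)]


def greedy_assemble(fragments, min_len=1):
    """Greedy approximation to the shortest superstring (hypothetical helper).

    Repeatedly merges the pair with the largest overlap. Stops when no pair
    overlaps by at least min_len, so it may return several contigs.
    """
    frags = drop_contained(fragments)
    while len(frags) > 1:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_i is None:
            break  # nothing left to merge
        merged = frags[best_i] + frags[best_j][best_k:]
        rest = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags = drop_contained(rest + [merged])
    return frags


if __name__ == "__main__":
    # Four made-up overlapping reads from one short made-up sequence.
    reads = ["ATTAGACCTG", "CCTGCCGGAA", "AGACCTGCCG", "GCCGGAATAC"]
    print(greedy_assemble(reads))
```

Raising min_len is the "ask for some level of overlap" knob: with a higher threshold, chance one- or two-base overlaps no longer cause merges, at the cost of possibly leaving several separate contigs.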