BIO/CSC295 2009F, Class 13: Sequence Assembly (2)

Admin:
* Upcoming work:
    * Read Chapter 6 for Tuesday
    * Read Kellis et al. 2003 for Thursday. A longer paper, so start early
* Things that depress Sam: So few students at George Drake's history of the two Grinnells.
* Bio and CS talks at noon tomorrow!
    * Contact BIO SEPC ASAP if you'd like to go to lunch with the speaker afterwards
    * Further discussion at 11-11:40 in 2024
* Class will end early today (2:25).
* A topic of today's class was a subject of yesterday's NY Times.

Overview:
* Sequencing DNA, Continued
* CS Detour: TSP and NP-Completeness
* Reassembling Shotgun Sequences
* Testing Assembly Algorithms

Where were we on Tuesday?
* We have biological techniques for sequencing 300-1000 BP
    * There really aren't any other techniques in common use
    * Although there are some being developed
    * There is a big prize for figuring out better techniques in the 'personal genomics' era
    * Some of us are confident that the technology is getting close
* How do you go from the ability to sequence comparatively few BP to sequencing the whole genome?
    * Public Genome Project Method: Mark a starting place, sequence, mark another starting place near the end, start there, mark another starting place near the end, start there, and so on and so forth
        * Nice people worked on this and made their data available to everyone
    * Shotgun sequencing:
        * Blow the genome into lots of little itty bitty pieces
            * No known relationship of those pieces
        * Sequence each of the small pieces
        * Use computational techniques to reassemble
* Note: Both techniques cut the genome into pieces
    * PGP uses 50K-100K bases in their sequences
        * Cosmids and YACs
    * Shotgun uses much smaller pieces
        * Plasmids
* What are the advantages and disadvantages of each?
    * PGP method may take longer
    * PGP method is more likely to give you correct long sequences
    * Smaller sequences mean that you'll have more problems with repeats (affecting accuracy of assembled sequence)
    * Both require some reassembly
    * Both have problems with repeats
        * In part because it's hard to sequence stuff with repeats
        * Shotgun is likely to have more problems with repeats

"Assembling the strings is a lot like the traveling salescritter problem"
* An important CS task: Looking at the efficiency of algorithms
    * Needleman-Wunsch takes (approximately) some constant times N times M
        * N is the length of one sequence
        * M is the length of the other sequence
    * Can we do better?
        * Try to find a better algorithm
        * Prove something about the best possible algorithm
        * Easy proof that alignment requires at least N + M steps
* A series of complex problems were encountered
* Prototypical: TSP
    * Collection of cities and distances between them
    * Find the shortest route that visits all the cities
    * Easy solution: Make a list of every possible ordering, find the cost of each, choose the best
    * Given twenty cities, how many orderings are there?
        * 20! = 2.4 x 10^18
        * Suppose we have a trillion operations per second, 10^12
        * So, this calculation would take 2.4 x 10^6 seconds (about 28 days)
    * Hmmm ... not a very good solution
    * No efficient (polynomial-time) solution has been found (yet)
* Lots of similar problems appeared
    * No clear good solution other than "try all possibilities and take the best"
    * In spite of lots and lots of work
* Resolution one: Solve similar problems
* Resolution two: Develop a theory of hard problems
    * Classification:
        * P - "Polynomial time" - problems which have fast algorithms
        * NP - "Nondeterministic polynomial time" - problems in which you can quickly check a proposed solution
        * NP-complete - problems in NP provably as hard as any other problem in NP

A useful problem that is NP-complete: Minimum superstring
* Given a set of strings, find the shortest string that contains them all
* Sequence assembly is a form of the shortest/minimum superstring problem
* What do you do if it's NP-complete?
    * You approximate! (You may not get the shortest superstring, but you get an answer quickly.)

Given that "shortest superstring" may not be appropriate, and we're approximating it anyway, how do we have any confidence in the results?
* Check it with the first method
* Check by synthesizing

Empirical test of algorithm: Build fake data and see if it does the right thing
* Start with a sequence
* Break it up computationally, guaranteeing overlap
* Run the algorithm
* See if you end up with the starting sequence

What is our approximate solution to shortest superstring?
* Start with all of the contigs
* Find the two with the best alignment
* Merge them
* And start all over again
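
A rough sketch of the greedy merge and the fake-data test described above, in Python. This is not the code used in class; it makes some simplifying assumptions. "Best alignment" is replaced by the longest exact suffix/prefix overlap (no sequencing errors, no reverse complements), and the names overlap, fragment, and greedy_assemble, the sample sequence, and the window sizes are all made up for illustration.

```python
import random
from itertools import permutations


def overlap(a, b):
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def fragment(seq, size, step):
    """Cut seq into windows of length size, one every step bases.

    Choosing step < size guarantees that neighboring windows overlap.
    A final window is added so the end of the sequence is covered.
    """
    if len(seq) <= size:
        return [seq]
    starts = list(range(0, len(seq) - size, step)) + [len(seq) - size]
    return [seq[s:s + size] for s in starts]


def greedy_assemble(fragments):
    """Repeatedly merge the two contigs with the largest overlap."""
    # Drop exact duplicates and fragments contained inside another fragment.
    contigs = list(dict.fromkeys(fragments))
    contigs = [c for c in contigs
               if not any(c != d and c in d for d in contigs)]
    if not contigs:
        return ""
    while len(contigs) > 1:
        # Find the ordered pair (i, j) with the best suffix/prefix overlap.
        best_k, best_i, best_j = -1, 0, 1
        for i, j in permutations(range(len(contigs)), 2):
            k = overlap(contigs[i], contigs[j])
            if k > best_k:
                best_k, best_i, best_j = k, i, j
        # Merge them, then start all over again with the shorter list.
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)]
        contigs.append(merged)
    return contigs[0]


if __name__ == "__main__":
    # Empirical test: start with a sequence, break it up with guaranteed
    # overlap, shuffle (no known relationship between the pieces),
    # reassemble, and compare with the starting sequence.
    seq = "ATGCGTACCTGAAGTCGGATTCACGCTAGTTGCA"  # made-up test sequence
    frags = fragment(seq, size=10, step=4)
    random.Random(0).shuffle(frags)
    print(greedy_assemble(frags) == seq)
```

The made-up sequence was chosen to have no repeated stretch of four or more bases, so every true overlap beats any accidental one. With repeats longer than the overlaps, the greedy choice can merge the wrong pair, which is the repeat problem noted earlier for shotgun sequencing.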