BIO/CSC295 2009F, Class 15: Gene Prediction (2)

Admin:
* At least one student who frequents this floor is suspected of having the flu. Please make sure to use the wipes.
* Have a great break!
* No work for this class during break (other than self review).
* We'll have a review for the exam Tuesday after break.

Overview:
* Kellis et al. 2003.
* Project 6.5.

Kellis et al. 2003
* Why did Nature give them 14 pages?
    * Most Nature papers are about 3-4 pages (reports 2-3, articles 4-6)
    * They could have separated it into 3 or 4 articles
        * But those would have been so closely tied together that splitting would be awkward
* Why Nature?
    * A really important method (if we believe it can be used in other genomes)
* Why do we study S. cerevisiae? / Why is this a model organism?
    * Not a lot of junk DNA (shotgun works better, easier to study functional elements)
    * They chose these four organisms because they are relatively closely related
    * Lots of previous research on S. cerevisiae
        * Lets them check their conclusions against earlier work
        * The SGD (Saccharomyces Genome Database)
        * Rough estimates of the number of genes studied
    * First eukaryote sequenced.
        * But why?
        * Single-cell eukaryote
        * An 'old' organism
        * It's a very important yeast: baker's/brewer's yeast
            * Bread, beer
            * Yes: It has huge commercial import
            * If you study S. cerevisiae, you can get free beer!
    * Has been used to study a bunch of other things
        * Cell cycle - Dozens of important genes identified
            * Many of those genes are conserved in other organisms
            * Regulation of cell division is important in studying cancer
    * The 'textbook' single-cell eukaryotic organism
    * Whitworth studies intron/exon decisions in it
* What are their primary conclusions?
    * A method for finding likely genes
    * Side note:
        * The 'genes' they look at are primarily protein-coding
        * E.g., 'gene and regulatory motif'
* The S. cerevisiae genome was already done. How did they get the rest?
    * Shotgun: Blast it into bits, sequence the bits, and combine them.
    * How did they blast it into bits?
        * 'Restriction enzyme cocktails'
        * Different cocktails break it up differently, so that we can get n-fold coverage
    * How do we sequence the bits? Fun biological techniques
    * How do we assemble? Shortest superstring problem
        * Often, we use a greedy algorithm: Match the best overlaps first
        * They relied on known assembly algorithms
        * But could also use a known sequence
            * The theory of evolution suggests that the different yeast sequences will be similar
            * And that's fundamentally important
    * What did they end up with after they ran the assembly algorithm?
        * Lots of contigs: Not the whole genome
    * [Detour: The divergence between the two least similar yeasts in this study is similar to the divergence between human and mouse.]
* Exciting: The wet lab generated the sequences, but the testable hypotheses are generated completely 'de novo'
    * From descriptive to predictive
    * Not just 'what is' but 'what may be'

ORF analysis
* Large-scale alignment using matches of ORFs (see section titled 'genome alignment')
* We see synteny in the matching regions
    * That is, matching genes should be in the same arrangement on the same chromosome
        * Orientation and order
    * Within the same chunk they should be in the same order
* Problem: If you're using synteny, are you possibly losing out on inversions?
    * Yes, particularly across repeat areas
    * However: They align contigs (with 50 or so genes) rather than individual genes
    * And they do see some inversions
* See figure 1 for some details of alignment
* Figure 2:
    * Diversification seems to happen near the telomeres rather than near the center
    * The telomere is the end of the chromosome, often with repeat sequences
        * Protects the rest of the chromosome
    * There *are* genes near the end of the chromosome, so we can see lost genes, repair, and weird things
    * Analysis of the telomerase enzyme won the Nobel prize this year
* How did we get the translocation?
    * tRNAs are similar, making the translocation a bit easier
* What was your favorite discovery in this paper?
    * Seemed to miss only one gene out of approx. 5K
    * Hand-checked all the named genes to find those that seem to have been mistakenly named
    * Seems to be a useful technique
    * Conserved motifs for partners (combinatorial part)
    * A successful and new process
    * Discovery of introns, which normally seems hard to do
    * A 100% conserved gene - it feels like you should have one or two differences.
        * Is it evidence of recombination?
        * Vida's anecdote: Example of a gene that's 100% conserved across species: No observed effect of knockout
* What discoveries do you want to criticize?
    * Discovery of regulatory elements
        * Too short for their other technique
        * So they just use some statistics based on the GAL4 sequence
        * Can we really apply it elsewhere?
            * It seems to work ...
            * And seems like solid theory ...
            * And matches what others found in some cases ...
    * Things need to be conserved to be regulatory
* Hypothesis: Messy genomes are better to analyze, because there's much more potential mutation in the garbage, so statistics are easier
* Task:
    * What has been done recently on the yeast MATa2
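
Aside on assembly: the notes above describe assembly as a shortest superstring problem, often attacked greedily ("Match the best overlaps first"). The following is a minimal Python sketch of that greedy idea only. It is not the assembler Kellis et al. used. The function names (overlap, greedy_assemble), the min_len parameter, and the toy reads are made up for illustration, and the sketch ignores sequencing errors, reverse complements, and repeats, all of which real assemblers must handle.

```python
def overlap(a: str, b: str, min_len: int = 1) -> int:
    """Length of the longest suffix of a that equals a prefix of b (0 if below min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(fragments: list[str], min_len: int = 1) -> list[str]:
    """Repeatedly merge the pair of fragments with the largest overlap.

    Returns the remaining contigs once no pair overlaps by at least min_len.
    """
    # Drop exact duplicates, then fragments wholly contained in another fragment.
    frags = list(dict.fromkeys(fragments))
    frags = [f for f in frags if not any(f != g and f in g for g in frags)]

    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(frags):
            for j, b in enumerate(frags):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_len:
                        best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return frags  # nothing left to merge: these are the contigs
        # Merge the best pair: keep a whole, append the non-overlapping tail of b.
        merged = frags[best_i] + frags[best_j][best_len:]
        frags = [f for idx, f in enumerate(frags) if idx not in (best_i, best_j)]
        frags.append(merged)


if __name__ == "__main__":
    reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]  # hypothetical toy reads
    print(greedy_assemble(reads))
    print(greedy_assemble(["AAAA", "CCCC"]))  # no overlap: stays as two contigs
```

When reads do not overlap, the sketch returns several separate strings, which mirrors the note that assembly produced lots of contigs and not the whole genome. Greedy merging is a heuristic: it does not guarantee the shortest possible superstring.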