BIO/CSC295 2009F, Class 14: Gene Prediction (1)

Admin:
* Don't forget that we have a reading and reflection for Thursday.
    * Sorry that it's so long
    * But it's a really cool mechanism for gene prediction
    * And if Nature gave them this much space, it was probably worth it
* Isn't it great that everyone is getting sick right before break?

Overview:
* What can we do with long sequences?
* Gene finding
* Ways of predicting useful ORFs
* Obstacles to finding genes
* Web exploration

What's the point of the human genome project?
* Okay, we have 6 billion base pairs, what can we do with it?
* Find significant regions, regions that code for various things
    * And knowing about all the proteins and such that are coded for still won't tell us everything - we care about interactions too
* Compare them to other organisms
    * Evolutionary stuff, phylogenetic trees, etc.
    * Importance of other traits
    * Comparison can also be used to identify significant regions
    * And help identify genes
* Compare multiple individuals of the same organism
    * Variation! What's the same and what's different
    * Correlate alleles with diseases (or any trait)
* "Play around ..."
* Side question: How identical are identical twins?
    * They start with one fertilized egg
    * So they should have identical genomes
    * But there's potentially some mutation during growth (somatic mutation)
        * Variation will be small, and in specific cell populations
    * Female identical twins are less identical than male identical twins
        * Random choice of which X chromosome is expressed in each cell

Interesting bioinformatics questions:
* What portions of the alphabet soup code for something interesting?
* And how do you find them?
* And what do they code for?

Some notes:
* Total number of genes between 20K and 25K
* Only about 1.5% of the genome codes for protein

What is a gene?
* A sequence of the DNA that codes for a protein
    * Code that gets transcribed and translated to make a protein
* Are there other parts?
    * Regulatory sequences:
        * Promoters (binding sites for the transcription machinery)
        * Enhancers
        * These tend to be on the 5' end of the gene
        * Some on the 3' end of the gene control message stability
* Not all genes code for proteins; some code for functional RNA
* Classic experiment in fruit flies (Sam can't spell the d word)
    * Late 1960s/early 1970s
    * Gene cluster in the HOX region
    * "7" genes - mutate them, and you get fewer or more body pieces (wings, etc.)
    * Only 2 coding regions
    * The other changes affected regulatory elements!
    * Some functional RNAs are also being made in that region
    * The coding sequence may be messy (lots of introns)
* Followup question: How do you know that there are only two coding regions in a section? How do you know that an ORF is a coding region?
    * Do a comparison to model systems. (Whoops, not available to Drosophila researchers at the time)
    * Put the code in a bacterium; does it make a protein? (But what can we say about this protein?)
    * Gene predictions based on what we know
        * Conserved promoter, start codon, sequence, stop codon
    * Play with mutations
    * Experimental data!
        * Collect the cellular population of mRNA
        * Purify them
        * Make cDNAs (with reverse transcriptase)
        * Sequence them
        * Use alignment/comparison algorithms to figure out splicing and such
* The joy of introns/exons: One coding sequence can actually make many different products (e.g., different combinations of exons can be used)
    * What regulates alternative splicing?
        * Whitworth showed that stress can induce alternatives
    * Fruit fly example: Hundreds of exons, and thousands of products expressed
    * Spliceosomes
        * See picture on p. 194
* cDNA libraries are limited by how/when they are collected
* Once you understand a model system very well, computational approaches (e.g., the ability to compare) become much more useful
    * Yeast
    * Worms
    * Fruit Flies
    * Mice
    * Humans (?)
* Coming soon to a lab near you: Diagnosis through RNA expression with microarrays
* Note: There are clearly sequences that are important, but that don't have a product. Do we call those 'genes'?
    * E.g., telomeres
    * Potentially a moving definition
* Issue: Computationally finding things that are potentially genes
    * Why?
        * At some point, we want to know what's probably important (and what's probably not) in the genome
    * How hard is it to find potential genes? (Limiting ourselves to protein-producing genes)
        * Prokaryotes
        * Eukaryotes
    * Hunt for ATGs/AUGs (which give you potential coding sequences)
        * Lots of candidates
        * Statistically: Rough analysis says that 1/64 triplets will be AUG
            * Let's see: 2,000,000,000 / 64 ~= 31 million potential starts of open reading frames
            * 20-30K genes
            * Odds of an ATG starting a gene? Roughly 1/1000
    * Narrowing the list:
        * Look for a nearby Shine-Dalgarno (S-D) sequence: AGGAGG on the 5' side
            * Only good for prokaryotes
        * Look for some of the conserved sequences in eukaryotes
            * Example: TATA box - Sequence of T's and A's that comes about 25 to 30 base pairs upstream of the transcription start site
            * Might be a good pattern to look for - Doesn't typically code for anything else, may bend weirdly
            * A-T pairs have only two hydrogen bonds, so it's easier to separate the strands there to start the transcription
        * Computationally find other potential promoter regions
            * But a difficult problem
        * Some statistical techniques
            * Probably doesn't have too many repeats (maybe)
            * Characteristic ratio of codons (maybe)
            * CG-rich areas
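
A rough sketch of the ATG hunt and the 1/64 estimate above, in Python. Nothing here comes from class: the function names (`expected_start_codons`, `find_orfs`, `has_upstream_aggagg`), the minimum-length cutoff, the 20-base upstream window, and the toy sequence are all assumptions made for illustration. It scans only the forward strand and ignores introns, so it is closer to the prokaryote case than the eukaryote one.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def expected_start_codons(genome_length):
    """Rough count of chance ATGs: 1 in 64 triplets, assuming uniform random bases."""
    return genome_length / 64


def find_orfs(dna, min_codons=3):
    """Return (start, end) pairs for forward-strand ORFs.

    An ORF here runs from an ATG to the first in-frame stop codon.
    `end` is exclusive and includes the stop codon. `min_codons` is a
    hypothetical cutoff: codons counted from the ATG up to, but not
    including, the stop.
    """
    dna = dna.upper()
    orfs = []
    for start in range(len(dna) - 2):
        if dna[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in frame with this ATG, until a stop.
        for pos in range(start + 3, len(dna) - 2, 3):
            if dna[pos:pos + 3] in STOP_CODONS:
                if (pos - start) // 3 >= min_codons:
                    orfs.append((start, pos + 3))
                break
    return orfs


def has_upstream_aggagg(dna, start, window=20):
    """Check for AGGAGG within `window` bases upstream of `start` (assumed window size)."""
    upstream = dna.upper()[max(0, start - window):start]
    return "AGGAGG" in upstream


if __name__ == "__main__":
    print(round(expected_start_codons(2_000_000_000)))  # 31250000

    toy = "CCAGGAGGTTACATGAAACCCGGGTAACCATGTGA"  # made-up sequence
    for start, end in find_orfs(toy):
        print(start, end, toy[start:end], has_upstream_aggagg(toy, start))
```

Even on the toy sequence, the second ATG is followed almost immediately by a stop, so the length cutoff drops it. That is the same idea as "narrowing the list": most ATGs are not the start of anything useful, so extra evidence (length, an upstream AGGAGG, and so on) is needed to rank the candidates.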