BIO/CSC295 2009F, Class 16: Pause for Breath

Admin:
* Welcome back!
* Yes, the exam will be in-class on Thursday.
* EC:
    * Don't forget the hate crimes fora this week and next.
        * Thursday this and next at 8pm
    * Gender and science reading, whenever it happens
    * Noyce video (overlaps with fora)
    * Sustainable
* One act tryouts coming up, and Justin Thomas is looking for someone to do sound design
* Upcoming bioinformatics-related talks
    * Jun Ni from University of Iowa, November 5
    * Chris Tuggle from Iowa State, November 19
        * Tuggle will also visit with our class.

Overview:
* Basic information about the exam.
* Review, phase 1: Identify key topics.
* Review, phase 2: Discuss puzzling topics.
* (Potentially: Begin to consider projects).

What will the exam be like? (What will be on the exam?)
* Written exam (no computer)
* One page of notes (8.5x11, double-sided)
* About six problems, designed for about ten minutes each
* About everything we've talked about so far, from viruses to dinosaurs

Exam topics

Major Problems
* Gene finding
* Sequence alignments
* Sequence assembly
* Long repeats
* Dealing with messy data

Biology Background
* Central dogma (DNA->RNA->Protein)
* ORFs and gene structure
    * Exons and introns
* Gene function and coding
* Sanger sequencing and shotgun sequencing
    * Chain termination
* Mutation
* Horizontal and vertical gene transfer
* Differences between eukaryotes and prokaryotes (and viruses)
* Tracing genetic disorders => single genes

Important Algorithms
* Gene finding
* Sequence assembly and the shortest superstring problem
* Sequence alignment algorithms
    * Needleman-Wunsch
    * BLAST
* Substitution matrices and their construction
* Heuristics

Misc.
* How to criticize an argument
* Philosophy of interdisciplinarity

Things that need further discussion
* Needleman-Wunsch
* TSP: Would have trouble applying those ideas
* Compare and contrast various bioinformatics tools
* PAM matrix construction

Needleman-Wunsch

Purpose: find the best alignment between 2 sequences, seq1 and seq2

What affects the quality of an alignment?
    length - longer alignments are better
    indels (insertions/deletions) - fewer indels are better
    a match is good
    a mismatch is bad or neutral

Obvious best alignment strategy: try every possible alignment, find the value of each, pick the best
    Problem: there are a lot of possible alignments, on the order of 2^(n+m), so this is inefficient

NW strategy: find the best alignment of subsequences of the 2 sequences; expand a list/table of best alignments
    create a table with the elements of one sequence along each side
    look for the value of the best alignment at each position
    gradually grow knowledge

Problem: we have partial knowledge of the best match - how do we expand it?
    at each position there are three possibilities:
    1. assume s4 and t3 align
    2. in the best alignment, s4 is deleted
    3. in the best alignment, t3 is inserted
    find the value for each; the largest value is the best alignment
    the value of case 1 is the value of the best alignment for s1-s3 and t1-t2 plus the value of substituting s4 and t3 (if same - large number; if mismatch - small number)
    in cases 2 and 3: deletion (insertion) cost + value of the previous best alignment
    pick the best, continue to fill in squares in the table
    repeat at the next position in the table

Class exercise: how do we fix the algorithm to deal with small and big chunks?
    for sequences AAAAAAA and AAABBBBAAAA
    Align:
        AAA----AAAA
        AAABBBBAAAA
    problem: the deletion is likely only a "single" mutation; need to change the algorithm so we only charge 1 time for a deletion in 1 spot
    Rethink algorithm: instead of looking at a specific row, look at previous positions
    inefficient to go backwards every time; create limits

PAM

What does a substitution matrix do?
    matrix with amino acids or nucleotides on each axis; gives scores for substituting 1 amino acid for another
    score represents the probability of different substitutions
    why do we care? so we can align sequences by giving a value to the alignment

Construct:
    characteristics: naive methods - hydrophobicity; charge; size factors
    Use real data to determine how often substitutions occur:
        compare real sequences; assume they represent mutations of the same original sequence
        count every time you see a particular substitution: frequency varies for each amino acid
        now you have the frequency of occurrences for each substitution; stats!!!
        probability value: whole row = ways an amino acid can change; each position = frequency a given change has occurred
    PAM: also includes the probability that an amino acid appears at all
    probabilities less than 1 are not good to use in a table - convert to values from -20 to 20; multiply and take the log
    PAM 1
    PAM 2
    PAM 3 represents 3 cycles of evolution
    PAM 4 represents 4 cycles of evolution

BLAST - tool that does sequence alignment
    Key idea: local alignment
        look for small alignments (Table)
        Expand these alignments to get larger ones
        Utilizes a threshold score for scoring alignments (size)
    How do you find good local alignments?
        pick a length of sequence: look for alignments of that sequence
        Create a table of possible alignments, based on threshold score (length of sequence can be changed by user)
        Remove common/repeat elements
    How do you score a match?
        score from the substitution matrix!
    Next? extend!
        incorporate and assess the value of each additional amino acid - sometimes extending decreases the score
        go out to a limit - beyond it, extension is not worth it

Fun with bioinformatics tools
* Relationship between ClustalW and BLAST
    * BLAST is most useful for searching for similar sequences
    * ClustalW is most useful for aligning similar sequences you've identified
* NCBI
    * Provides a wide variety of tools for doing analysis
    * It's okay if you forget the specific names of things

Open Reading Frames and finding genes
* How broad do the tools get?
    * Start with ORFs
    * Incorporate other sequences that are indicative of genes
        * Preceded by promoter region
        * Common form of introns for various species
        * GC-rich sequences often contain genes
        * AT-rich sequences are often upstream of genes
    * Biologically: Collect mRNAs and match sequences
    * Yeast paper method: ORFs that are conserved in evolution are more likely to code.
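
For review, the "start with ORFs" step can be sketched in a few lines of Python. This is only a rough sketch under stated assumptions, not any particular tool's method: it scans the forward strand only, uses the standard start codon ATG and stop codons TAA/TAG/TGA, and ignores introns entirely. The names `find_orfs`, `gc_content`, and the `min_codons` cutoff are hypothetical, chosen just for illustration.

```python
START = "ATG"
STOPS = {"TAA", "TAG", "TGA"}


def find_orfs(dna, min_codons=2):
    """Return (start, end, frame) for each forward-strand ORF.

    start is the index of the ATG; end is the index just past the stop codon.
    min_codons (a hypothetical cutoff) counts codons before the stop, ATG included.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):                      # three forward reading frames
        start = None                            # index of the currently open ATG
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == START:
                start = i                       # open a candidate ORF
            elif start is not None and codon in STOPS:
                n_codons = (i - start) // 3
                if n_codons >= min_codons:
                    orfs.append((start, i + 3, frame))
                start = None                    # close it and keep scanning
    return orfs


def gc_content(dna):
    """Fraction of G and C bases; one crude extra signal beyond ORFs."""
    dna = dna.upper()
    if not dna:
        return 0.0
    return (dna.count("G") + dna.count("C")) / len(dna)


if __name__ == "__main__":
    print(find_orfs("CCATGGGGTAACC"))   # [(2, 11, 2)]
```

A real gene finder would also scan the reverse complement and combine ORFs with the other signals listed above (promoters, intron patterns, conservation); this sketch covers only the first bullet.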
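
Since Needleman-Wunsch was flagged as needing further discussion, here is a minimal Python sketch of the table-filling idea from the review: each cell takes the best of "align the two characters", "delete from s", and "insert from t". The scoring values (`match=1`, `mismatch=-1`, `gap=-2`) are assumptions picked for illustration, not values from class; a substitution matrix such as PAM would normally replace the match/mismatch pair.

```python
def needleman_wunsch(s, t, match=1, mismatch=-1, gap=-2):
    """Global alignment of s and t; returns (score, aligned_s, aligned_t)."""
    n, m = len(s), len(t)
    # score[i][j] = value of the best alignment of s[:i] with t[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap                   # s[:i] aligned against all gaps
    for j in range(1, m + 1):
        score[0][j] = j * gap                   # t[:j] aligned against all gaps

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if s[i - 1] == t[j - 1] else mismatch
            score[i][j] = max(
                score[i - 1][j - 1] + sub,      # case 1: the two characters align
                score[i - 1][j] + gap,          # case 2: character of s is deleted
                score[i][j - 1] + gap,          # case 3: character of t is inserted
            )

    # Trace back from the bottom-right corner to recover one best alignment.
    a, b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            sub = match if s[i - 1] == t[j - 1] else mismatch
            if score[i][j] == score[i - 1][j - 1] + sub:
                a.append(s[i - 1])
                b.append(t[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + gap:
            a.append(s[i - 1])
            b.append("-")
            i -= 1
        else:
            a.append("-")
            b.append(t[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(a)), "".join(reversed(b))


if __name__ == "__main__":
    total, top, bottom = needleman_wunsch("AAAAAAA", "AAABBBBAAAA")
    print(total)
    print(top)
    print(bottom)
```

On the class-exercise sequences this recovers the AAA----AAAA alignment, but it charges the gap cost four times for what is probably a single mutation. That is the weakness the exercise asks about; charging less for extending an existing gap than for opening a new one is one way to address it.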