BIO/CSC295 2009F, Class 10: Protein Alignments (1)

Admin:
* Due Thursday: Analysis of HIV env sequences
* Due Tuesday: Response to HIV env paper. Are there questions?
* Sam will be unavailable for office hours on Wednesday.
* Friday noon CS talk on history of programming languages. Pizza! Any takers?
* Visit on Nov. 19th from Bioinformatics at Iowa State. Use of class time?

Overview:
* Overview of chapter 4.
* From aligning DNA to aligning proteins.
* Building PAM Matrices.
* Lab.

Chapter 4
* Or "a few weeks of biochemistry in ten minutes"
* We assume that you've already looked at the chapter
* Chapter 3 was really about aligning DNA
    * But not all bases are created equal
* Side note: If you can come up with your own examples, we'd appreciate it.
* Central dogma review
    * DNA goes to mRNA goes to protein
    * Structure of DNA corresponds well to protein
    * In viruses and bacteria there's a direct correlation
    * In Eukaryotes (which Sam can't spell) the DNA has a regulatory sequence, exons, and introns.
        * The exons are "fused" to build the mRNA
        * So there's still a correlation, but it's a bit different
* Important issue (that we've been ignoring): We should be looking at alignment somewhat differently in different categories.
    * DNA is simple, proteins are more complex
* What's a protein?
    * A chain of amino acids
    * That folds into an interesting shape that lets it do things
        * "Clumped together"
        * "Globular"
    * Folding is affected by
        * Size of amino acids
        * Hydrophobicity/Hydrophilicity
        * Charge
        * Some funky things that happen in a few cases
            * Chaperones
            * Cleavage
            * ...
    * Two classes of folding are highly predictable
        * alpha helix
        * beta sheet
        * Folding of alpha helices is essentially instantaneous
    * Structure levels:
        * Primary: Sequence
        * Secondary: Basic folds, such as alpha helix and beta sheet
        * Tertiary:
        * Quaternary: Complex of proteins
    * These secondary structures fold into a tertiary structure
    * The tertiary structures combine into complexes
        * Homodimers have identical subunits
        * Heterodimers have different subunits
    * Proteins have lots of functions
        * Enzymes
        * Structure
        * Regulation ('allosteric')
        * ...
    * Cool thing: Odor recognition genes all produce proteins that have seven alpha helices. (Transmembrane domains.)
        * Concept from example: Mutations that 'mess up' one of these alpha helices are unlikely to 'survive'.
* What can happen
    * Deletion
    * Insertion
    * Substitution
* DNA mutation can create
    * Amino acid substitution
    * Premature stop (truncated)
    * Cryptic starts (elongated)

What can happen if you have an amino-acid substitution in a protein? What can these changes do?
* Alter active site
* Change charge
* Change function of protein (e.g., green fluorescent protein to red fluorescent protein is a single amino-acid change)
* No effect! Protein keeps the same activity
* Polarity/hydrophobicity

"Biology is messy, so computer scientists choose to simplify things when doing their analysis."

Interesting example of protein folding issues from the chapter: Repeats
* Huntington's disease (neurological)
* Autosomal dominant
    * "Autosomal" - not on X or Y; not sex-linked
    * "Dominant" - only need one copy
* Gene is huntingtin
* Part of the gene has a large set of repeats
* Age of onset and severity correlate with number of copies
* This stretch of DNA is unstable - the number of copies you have is not necessarily the same as your parents'. The number of repeats can grow or shrink from one generation to the next.
* Project idea: Searching for repeats!
* We don't know why this happens

Book stuff
* Valine (V) to Glutamate (E)
    * Size change
    * Charge change?
    * Big hydrophobicity change (V is strongly hydrophobic, E is highly hydrophilic)

Another interesting example: CFTR gene
* Cystic Fibrosis: A disease in which you have problems with mucus in your lungs
* Serious illness that affects epithelial cells (on the outside of your body)
* Trans-membrane receptor that passes ions
* We think mutations in CFTR gene cause cystic fibrosis
* The CFTR gene is related to a variety of genes called ABC transporters
    * We see it all across bacteria
    * And all across eukaryotic organisms
    * So it's been successfully utilized in a variety of organisms
    * In bacteria: Mutations in the similar genes confer antibiotic resistance (the really bad (for us) or really good (for them) resistance).
        * Antibiotics not transported into cell
    * Why would a mutation that confers this kind of resistance also cause mucus buildup?
* CFTR deletion 508 allele - gets rid of one phenylalanine
    * Not clear functionally why this matters
    * Hypothesis: Because of protein folding problems, it degrades before it gets to the cell membrane
* Should we test for it?
    * 508 deletion is a prevalent mutation
    * Another mutation
    * A third mutation
    * Each of these reduces functionality

So, what would you do as a treatment for this disease?
* E.g., gene therapy
    * A dream that we replace the DNA in stem cells and reintroduce the stem cells in their body
    * We may have the technology already
        * Homologous recombination lets us do gene replacement
        * Last year, people figured out how to make stem cells from non-stem cells
* Genetic testing: What allele do you look for?
    * How prevalent is it in the population?
    * How much does it correlate with the disease?
    * What about false negatives and false positives?
    * Some prenatal testing carries risk
    * It gets complicated
* Prenatal testing for CF: Lots of questions

From DNA match to AA matches

How should we align 2 AA sequences?
Basic strategy - Needleman-Wunsch
* Good but not great.

New and improved!
* Some amino acid changes are better than others
* When you have a mismatch, how amenable are these amino acids to the change?
* Similarity matrices: PAM, BLOSUM
* Properties to compare: acid - base; polar - nonpolar; hydrophobic - hydrophilic
* Build the table looking up, left, and diagonally
    * Same = add number
    * Different = are they similar? Number. Are they dissimilar? Other number.
    * If both are hydrophobic, look at previous and add +0.5
    * If both are hydrophilic, add +0.5
    * If mismatch, add -0.5

Appealing - go to raw data.
* Take a bunch of sequences that are evolutionarily related
* Identify all mutations/changes in AA sequence
* Determine probabilities
* Make a table of probabilities
    * A corresponds to A = high number
    * A rarely R = negative number
    * etc.
* Both PAM and BLOSUM do this.
* How do you know this is accurate? We hope...
    * What is the original data set?
* If you have a technique, you can make your own table!

How do we go from sequences to table?
* Question 1: Why is the table half full? Or half empty?
    * Redundant? Is A to R counted the same as R to A?
    * We don't know the directionality of the mutation
    * Could build something that took that into account...
* Why are the diagonal numbers different?
    * A to A; some are more likely to remain the same
    * Tryptophan
* How do you build the table? Magic? No. Math.
    * Log of probability of change compared to random: 0 = random; positive = likely; negative = unlikely
    * Log because the numbers are easier to read; linear rather than order-of-magnitude differences.
    * How can we have a probability p s.t. log(p) = 17? p = e^17. (Strictly, the entry is a scaled log of a ratio of two probabilities, and a ratio can exceed 1.)
    * Find probability of change
    * Find probability of random change
    * Take log
    * Scale
* It's all heuristics!

General technique

How does PAM work?
* Start with a collection with less than 1% mutation
* Assume no indels
* Build table
* PAM1 is 1%
    * This represents some time period of evolution
* Twice this period? PAM2 = PAM1 x PAM1
* But PAM250 is not 250% - corresponds to the period.
* We use it because it works...
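
The recipe above (find the probability of change, compare it with chance, take the log, scale) and the PAM2 = PAM1 x PAM1 idea can be sketched roughly as follows. This is only a toy: the three-letter alphabet, the background frequencies, and the matrix entries are invented for illustration and are not real PAM data. The scale factor of 10 with log base 10 is an assumed convention, and the helper names are hypothetical.

```python
import math

def mat_mul(a, b):
    """Multiply two square matrices given as lists of rows."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mat_pow(m, power):
    """Raise a square matrix to a positive integer power by repeated multiplication."""
    result = m
    for _ in range(power - 1):
        result = mat_mul(result, m)
    return result

def log_odds(m, freqs, scale=10):
    """Scaled, rounded log of (probability of change) / (probability by chance)."""
    n = len(m)
    return [[round(scale * math.log10(m[i][j] / freqs[j])) for j in range(n)]
            for i in range(n)]

# Hypothetical 3-letter alphabet; all numbers are made up for illustration.
LETTERS = ["A", "R", "W"]
FREQS = [0.5, 0.4, 0.1]        # assumed background frequencies (W is rare)
PAM1_TOY = [                   # row i, column j: probability that i becomes j
    [0.990, 0.008, 0.002],
    [0.010, 0.989, 0.001],
    [0.010, 0.004, 0.986],
]

pam2_toy = mat_pow(PAM1_TOY, 2)       # two evolutionary periods
pam250_toy = mat_pow(PAM1_TOY, 250)   # 250 periods; rows still sum to 1
scores1 = log_odds(PAM1_TOY, FREQS)   # score table for one period

for letter, row in zip(LETTERS, scores1):
    print(letter, row)
```

In this toy table the rare letter W ends up with the largest diagonal score, which is one way to see why the diagonal entries differ. After two periods A is unchanged with probability 0.9802 rather than 0.98, because some changes change back; the same effect is why PAM250 does not mean 250% change.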

Disadvantages
* Unidirectional
* Sources may not be representative of a sequence
* The no-indels assumption is dangerous - this will skew data.
* Not every position is equally probable for a change
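
Going back to the basic strategy, the simple property-based scoring from earlier in these notes (+0.5 when both residues are hydrophobic or both hydrophilic, -0.5 for a mismatch) can be dropped into a Needleman-Wunsch table along these lines. This is a minimal sketch under stated assumptions: the score of 1.0 for identical residues, the gap penalty of -1.0, and the particular hydrophobic set are not from the notes and are chosen only for illustration. It returns the best score, not the alignment itself.

```python
# Assumed hydrophobic residues, for illustration only; real scales differ.
HYDROPHOBIC = set("AVLIMFWC")
MATCH = 1.0   # assumed score for identical residues
GAP = -1.0    # assumed gap penalty

def score(a, b):
    """Score one pair of residues using the toy hydrophobicity rule."""
    if a == b:
        return MATCH
    if (a in HYDROPHOBIC) == (b in HYDROPHOBIC):
        return 0.5    # different residues, same class
    return -0.5       # different residues, different class

def needleman_wunsch(s, t):
    """Return the best global alignment score of s and t."""
    rows, cols = len(s) + 1, len(t) + 1
    table = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):          # first column: s aligned against gaps
        table[i][0] = i * GAP
    for j in range(1, cols):          # first row: t aligned against gaps
        table[0][j] = j * GAP
    for i in range(1, rows):
        for j in range(1, cols):
            table[i][j] = max(
                table[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # diagonal
                table[i - 1][j] + GAP,                            # up
                table[i][j - 1] + GAP,                            # left
            )
    return table[-1][-1]

print(needleman_wunsch("VE", "LE"))
```

Swapping `score` for a lookup in a PAM or BLOSUM table would be the only change needed to use a data-derived matrix instead of the toy rule.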