BIO/CSC295 2011F, Class 18: Microarrays (1)

Overview:
* Microarrays.
* Microarrays and Algorithms.
* Baggerly: Reproducibility.
* Time to work on Chou-Fasman.

Admin:
* Food!
* Exams to be returned on Thursday.
* Plan for Thursday: Discuss exam.  Discuss the project.  Do programming exploration for microarrays.
* Sam is still working through his stack of other grading.
* Upcoming work 1: Chou-Fasman programming due Thursday.
* Upcoming work 2: Read Chapter 8 for Tuesday.
* Upcoming work 3: Krings et al. review due next Thursday.
* Upcoming work 4: Project proposals due next Thursday.
* CONGRATULATIONS to Women's Volleyball
* No other CS extra this week.
* EC for Bio Seminar Friday: Microbiologist? from UIowa? on ?
* EC for Simon Estes concert or convo.
* EC for swim meet Friday at 6:30
* EC for Food Bazaar (cooking (exclusive or) attending)
* EC for Drag Show
* EC for Eid on Sunday
* Drag show: Waiting in line will be more exciting this year

Microarray (Chapter 9 in your textbook)
* Major question in many areas of biology
    * How do you go from a genome to differentiated cell types in a complex organism?
    * It's related to gene expression, so we are interested in gene expression
* Gene expression is important in many areas
    * Cell differentiation
    * Normal vs. tumor cells
    * Fitness differences can be due to differences in gene expression
* Techniques for studying gene expression
    * Including microarrays
    * Microarrays are one consequence of the human genome project
* Q: How do we study gene expression in cells?
    * A: Microarrays (we'll get to it)
    * A: Comparing two conditions
    * A: Western blotting (also Immunoblotting)
        * Uses antibodies - An antibody is a protein molecule that is produced by your immune system in response to an invader
        * Antibodies have high recognition for antigens
        * Vaccinations: Get exposed to an antigen; immune system increases the number of antibodies that respond to that antigen; when the virus appears in greater quantities (a real infection), you have enough antibodies to respond
        * Biologists can build antibodies that recognize specific antigens
        * So, run out stuff on a gel and then apply antibodies to see if they bind
        * Similar tissue technique: Immunofluorescence
    * A: Reporter genes
        * Fuse gene of interest with reporter (e.g., GFP - green fluorescent protein).  This lets you look at real-time expression of proteins of interest.
    * A: Northern blotting (RNA expression)
        * RNA gets expressed as part of gene transcription
        * Uses labelled probes that are complementary to the RNA of interest
        * Can be done with blot (run out on a gel) or a cell
* Why did we put this up?
    * These techniques have been around for a while
    * They are *incredibly* important for studying a gene of interest
    * A combination of the techniques lets you study a gene in particular detail
    * But genome-wide responses are also interesting.  How does the environment affect what is being expressed in the genome?
    * Microarrays were developed as a way of answering genome-wide questions.
* What is a microarray, and how does it work?
    * A little chip with lots of dots, with single-stranded DNA on it
    * You know what gene each dot is complementary to (need to know the genome in order to do this)
    * You need them laid down at precise locations.
    * Add mRNA and use reverse transcriptase to make cDNA, add fluorescent markers, and use them to probe the chip.
    * Similar theory to a Northern blot, but you can do it on thousands at once.
    * Permitted by precise robots to lay things out and laser scanners to get a precise image.
* One experiment: Just look at absolute fluorescence levels
    * Slight variations in the creation of mRNA can throw results off
    * The fluorescence gives absolute numbers (Affymetrix)
        * Or does it?
* So, how do we compare two conditions?
* And how do we deal with random variation?
    * Increase sample size - You want enough input that the random variation is just noise.
* Dangers:
    * There's variability in how well you lyse cells
    * There's variability in how well you treat the sample (even for practiced researchers)
    * There's variability in how well reverse transcriptase runs
    * There's variability in much more ...
* Solution: Normalize!  Build in controls
    * Things that should be blank
    * Things that should be bright
    * Things that should be at other levels
* A story about a genome-wide yeast study
    * 15 minute time points
    * Did study with 30 minute time points
    * Made full data set available
    * Papers people write: "We found a fifteen minute periodicity"
    * 30 minute ones were processed in one batch
    * The off-by-fifteen minutes were processed in another batch that doesn't work so well
    * It's "the batch effect" - you need to control for that
* How does an internal control work?
    * Pick genes whose expression levels we know are constant across conditions.
    * One would need to read the documentation a bit more.
* The book describes another approach which does relative levels, rather than absolute levels
    * You'll still have the same problems
    * So you still need internal controls
    * Absolute levels make it easier for you to reuse the information
* In case you couldn't tell, this is REALLY ASTONISHINGLY COOL TECHNOLOGY
* The Baggerly video shows one application:
    * Look at how different cell lines respond to different anti-cancer drugs.
    * An opportunity for "personalized genomics"
* The difficulty: We do not yet have intuition on the output
    * How do you know that your data set is consistent with what we already "know"?
    * How can you tell if something is out of whack?

Microarrays and Computer Science

Cool technology.  Cool biology.
What are the interesting computational problems?
    Correlating profiles - Building profiles.  Which genes react similarly?
    Statistical analysis is large and complicated, esp. for normalization.  How do we normalize?
    Filtering noise.
    Lots of number crunching....
    Algorithm - statistics - code
    Part of bioinformatics - statistical algorithms designed to identify patterns in the data - pattern searching.
Draw a "macroarray"
    Hope you get perfect circles of color.  In reality, not neat.
    Interesting computational problem - turn the scan into fluorescence data.
        Where the dot ends and the background begins.  Software can identify when this happens ....
        Your grid needs to match precisely.  Rotation, spacing, etc.
        Take squares, circles, etc. with 10,000 spots - hard to check whether mistakes were made
        Check the output to see that it makes sense
    A boring computational problem - data translation.
        Store data and someone else can use it.
        There are no standards for storing this data.
        You need to convert data so all of it is in the same form; you can learn techniques to convert (important).
        Look at tools that can do this.

The Baggerly Talk

* What did the original Nature Medicine paper do?
    The paper - you can look at gene profiles for different cells
    Certain markers correlate with sensitivity or resistance to a cancer drug
    Can look at patients and determine which treatment is most effective.
    Existing data
        Two sets - NCI cell lines - try cancer treatments - see how well they work.
        Then do microarray profiles of cell lines - which genes they express and which genes they don't express.
        Look at patients - compare
        Look at whole gene profile
* What are the criticisms?
    Line shift error - They know which genes are similar, but they labeled them off by one.
    They included some genes and the data wasn't there to support these genes (it wasn't transparent).  Threw in genes that made the data look good.
    Mislabeled samples - sensitive vs. resistant were mislabeled
    Duplicate samples
    Graph with dots - outliers were fraudulent
    Slings and arrows of outrageous sampling
    Different algorithm could yield different tails
    Duke may have responded that "he didn't really understand what we did"

A criticism of the scientific publishing process
* How journals respond to criticisms
    * What policies are in place?
    * What policies should be in place?
* How authors respond to criticisms

In the ideal, you should be able to redo the experiment
* But that's expensive/time consuming
* And if you do that much work, shouldn't you be a co-author?
* Is it really possible to have someone with a different background replicate it?
* That's a lot of work.
* Turnaround time for reviews typically needs to be relatively quick
* Even Baggerly says that he doesn't typically run code and data when they are supplied with a paper
* Or require authors to have someone else reproduce before publishing
* Ideally, you should do an independent verification of the data.
* And you as a reviewer should look for that independent verification.
* In this case, it may be the F-word.

Research should be reproducible
* When you publish, you should be required to supply the data
* When you publish, you should be required to supply the program
* When you publish, you should be required to document the code

We all know that off-by-one errors are common, because we've made them ourselves.

What were the incentives?
* A big paper.  (News report says that Duke is ready to withdraw 80 papers.)
* Grants (> $4 million)
* Patents and medical treatments put money in your pocket
* FDA gets lots of pressure to approve things quickly

How does this happen when you're on a team?
* Everyone else trusts one person on the team.
* In multi-institution projects, it's harder to keep as close track of things
* You can intentionally fool people, whether they are reviewers or peers

Why wouldn't you publish?
* Others might scoop you.
    * Using your data
    * Using your algorithm

Note: If you don't publish your algorithm
* What do you have that's worth publishing if you don't publish the algorithm?
    * In this case, the metagene/gene panel
* Others might replicate it themselves (as is evidenced by the Baggerly analysis)

Government increasingly says "If we fund it, you need to publish it."

Think about: How should you change your current or future practice based on what you've heard?
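
One possible answer to that last question, as a sketch only: the line-shift criticism from the Baggerly discussion is the kind of off-by-one error that creeps in when gene labels and expression values live in separate lists that have to stay aligned by position. The Python below uses made-up gene names and numbers (not the Duke data, and not anyone's actual analysis code). All three functions are hypothetical: `load_by_position` is a deliberately buggy loader that skips the header row for the values but not for the labels, `load_by_key` reads each label together with its value, and `shifted_labels` is a simple sanity check that compares two loads of the same data.

```python
import csv
import io

# Toy input, invented for illustration: a header row, then one row per gene.
RAW = """gene,expression
GENE_A,10.0
GENE_B,250.0
GENE_C,42.0
"""


def load_by_position(text):
    """Fragile loader: labels and values are pulled into separate lists.

    The header row is skipped for the values but (by mistake) not for the
    labels, so each label is paired with the following row's value and the
    last gene is silently dropped.
    """
    rows = list(csv.reader(io.StringIO(text)))
    labels = [row[0] for row in rows]             # bug: header still included
    values = [float(row[1]) for row in rows[1:]]  # header skipped here
    return dict(zip(labels, values))


def load_by_key(text):
    """Sturdier loader: each value is read together with its own label."""
    reader = csv.DictReader(io.StringIO(text))
    return {row["gene"]: float(row["expression"]) for row in reader}


def shifted_labels(trusted, suspect):
    """Return the genes whose value differs between two loads of the same data."""
    return sorted(
        gene for gene in trusted
        if suspect.get(gene) != trusted[gene]
    )


if __name__ == "__main__":
    good = load_by_key(RAW)
    bad = load_by_position(RAW)
    print("by key:     ", good)
    print("by position:", bad)
    print("disagree on:", shifted_labels(good, bad))
```

Running the file prints both loads and the genes on which they disagree. The design point is only that values keyed by gene name survive a stray header row, while position-aligned lists do not, and that comparing two independent loads is a cheap check to run before trusting the output.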