Exploring Bioinformatics with Python
You should have Section 9.4 of St. Clair and Visick handy while you do this project. St. Clair and Visick give much of the background on why you're doing what you're doing. This lab asks you to explore some alternative techniques for doing so.
It appears that the data we're using in this lab have changed since St. Clair and Visick designed the project. A few of the problems ask you to reflect on issues related to that change.
As St. Clair and Visick suggest, one of the important techniques any bioinformatician will need to master is preprocessing data, that is, taking data in the form that it is available and converting it to a form that you need.
St. Clair and Visick recommend that you use a spreadsheet to do much of the conversion. I expect that Baggerly would critique that approach, since it is hard to document precisely what you did, and equally hard to replicate it. Instead, we will use common Linux utilities to manipulate the data.
What are "common Linux utilities"? These are the small programs that come with every Linux distribution. You generally enter these in the terminal window, through what is called the command line.
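less file
    Page through file one screenful at a time.
less -N file
    Like less, but lines are preceded by the corresponding line number. When lines are wider than the screen, they get wrapped, so a single line number may cover multiple screen lines.
wc file
    Count the lines, words, and bytes in file.
wc -l file
    Count only the lines in file.
grep pattern file
    Print the lines of file that contain pattern.
grep -n pattern file
    Like grep, but each matching line is preceded by its line number.
head -n # file
    Print the first # lines of file.
tail -n # file
    Print the last # lines of file.
tail -n +# file
    Print file starting at line #.
cut -fcolumns file
    Extract the given tab-separated columns from file. For example, cut -f1,3,4 data would extract columns 1, 3, and 4 from the data file.
uniq file
    Remove adjacent duplicate lines from file.
join -t '<ctrl-v> <tab>' files
    Combine two files on a shared first column, using tab as the separator. (Typing ctrl-v and then tab enters a literal tab character.)
sort file
    Sort the lines of file. join may like you to sort files before joining.
gedit file
    Open file in a graphical text editor.

As you may have guessed, when you run a Linux command, the output just appears on the screen. When you have a lot of output, it scrolls by quickly. What can you do? The easiest thing to do is to save your result to a file. You do this with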
command > filename
For example, if I wanted all but the first ten lines of file
data1, and I wanted to save the result in temp1,
I would write
tail -n +11 data1 > temp1
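If you would like to see that command work before trying it on real data, here is a minimal sketch. It assumes a made-up fifteen-line file standing in for data1, built in a scratch directory so that nothing of yours gets overwritten.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Build a 15-line toy file (a stand-in for data1): line1 ... line15.
seq 1 15 | sed 's/^/line/' > data1

# Keep everything from line 11 onward, that is, drop the first ten lines.
tail -n +11 data1 > temp1

# Inspect the result without flooding the screen.
wc -l < temp1      # how many lines were kept
head -n 1 temp1    # the first surviving line
```

Because the toy file has fifteen lines, temp1 should end up with five lines, the first of which is line11.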
Linux also provides a special file concatenation command,
cat. You use cat to join files together.
cat file1 file2 > combined-file
Be careful with these kinds of commands. If you put an existing file after the greater-than sign, Linux will happily overwrite the target file!
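One way to protect yourself is the shell's noclobber option, which is not part of St. Clair and Visick's instructions but is a standard feature of bash and other POSIX shells. The sketch below uses made-up files (file1, file2) in a scratch directory.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Two small made-up input files.
printf 'a\nb\n' > file1
printf 'c\n' > file2

# Concatenate them into a new file (three lines in total).
cat file1 file2 > combined-file

# Ask the shell to refuse to overwrite existing files with ">".
set -o noclobber

# This redirection now fails, so combined-file keeps its three lines.
if ! cat file2 > combined-file; then
    echo "the shell refused to overwrite combined-file"
fi

# Restore the default behavior.
set +o noclobber
```

While noclobber is on, you can still overwrite a file deliberately by writing >| instead of >.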
Please keep a running log of the commands that you type for each step.
The files sample1
and sample2 contain
some sample information from microarray runs. In particular, the first
column labels the probe for one element in a microarray and the second
column gives the data from one run.
a. Using less, scroll through one of the files.
b. Using grep, determine where 267625_at falls in each
file.
c. Using wc, find out how many lines are in each file.
d. Using head, extract the first ten lines of sample1.
e. Using head, extract all but the last ten lines of sample1.
f. Using tail, extract the last ten lines of sample1.
g. Using tail, extract the lines starting with 267625_at from
sample1.
h. You may have noticed that the two files have identical column headers.
That's somewhat inconvenient. Using gedit, update the
headers so that the VALUE and ABS_CALL columns
include the sample number.
i. It will be easier to process the data if we combine them in one file.
Using join, combine the files, calling the result
combined-samples.
j. Figure out how many lines combined-samples has. Can
you explain why? Look at the first few lines of your input samples.
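Before you try join on the sample files, it may help to watch it work on something tiny. The following is a sketch only: toy1 and toy2 are made-up two-column, tab-separated files, and the TAB variable is just a convenient way to pass a tab character without typing ctrl-v.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Hold a literal tab character so it can be passed to sort and join.
TAB="$(printf '\t')"

# Two made-up tab-separated files: an ID in column 1, a value in column 2.
printf 'p3\t30\np1\t10\np2\t20\n' > toy1
printf 'p2\t200\np3\t300\np4\t400\n' > toy2

# join expects both inputs to be sorted on the join field (column 1 here).
sort -t "$TAB" -k 1,1 toy1 > toy1.sorted
sort -t "$TAB" -k 1,1 toy2 > toy2.sorted

# Join on column 1, using tab as the field separator.
join -t "$TAB" toy1.sorted toy2.sorted > toy-joined

cat toy-joined
```

In this toy case only p2 and p3 appear in both inputs, so those are the only two lines join writes.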
In their exercises 1-7, St. Clair and Visick ask you to download the file GSE9311_series_matrix.txt and to extract some lines from that file.
a. Make a copy of GSE9311_series_matrix.txt.
b. Using the wc -l tool, determine how many lines are in
the file.
c. Using head and tail explore the beginning and
end of the file.
d. St. Clair and Visick tell us that the line that begins with
ID_REF indicates the start of the actual data. Use
grep -n to find that line.
e. You should now have enough information to extract just the useful
data. Write a command (or series of commands) to extract the data
from GSE9311_series_matrix.txt, storing the result
in the file GSE9311_temp1.txt.
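If you are unsure how to turn line numbers into an extraction, here is one possible sketch on a made-up file. The file toy-matrix.txt, its contents, and its !table_end marker are all invented for illustration; the real file is much larger and you should check its markers yourself.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Made-up file: two metadata lines, a small table, then an end marker.
printf '!meta one\n!meta two\nID_REF\tS1\np1\t10\np2\t20\n!table_end\n' > toy-matrix.txt

# grep -n prints "N:text"; cut -d: -f1 keeps just the line number N.
start=$(grep -n '^ID_REF' toy-matrix.txt | cut -d: -f1)
end=$(grep -n '^!table_end' toy-matrix.txt | cut -d: -f1)

# Skip ahead to the header row, then keep only the lines before the marker.
tail -n +"$start" toy-matrix.txt | head -n $((end - start)) > toy-table.txt

cat toy-table.txt
```

Here the header is on line 3 and the marker on line 6, so the pipeline keeps lines 3 through 5.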
f. You may note that there are nine columns in the file. The first column gives the probe. The remaining eight columns represent the expression data for each of the eight experiments. It may be useful to extract the individual experiment data for easier processing.
Using cut create a separate file for a few of the
experiments, with a name like experiment-data.txt.
For example, the file GSM237282-data.txt should contain
columns 1 (labels) and 4 (GSM237282 data) from the original matrix.
You should certainly make GSM237280-data.txt.
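If you find yourself typing nearly the same cut command several times, a short loop can do the repetition for you. This is a sketch on a made-up four-column table; the column names GSM_A, GSM_B, and GSM_C are placeholders, not real sample identifiers.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Made-up table: a label column followed by three data columns.
printf 'ID_REF\tGSM_A\tGSM_B\tGSM_C\np1\t1\t2\t3\np2\t4\t5\t6\n' > toy-table.txt

# For each data column, read its name from the header row and write
# the label column plus that column to a file named after it.
for col in 2 3 4; do
    name=$(head -n 1 toy-table.txt | cut -f"$col")
    cut -f1,"$col" toy-table.txt > "${name}-data.txt"
done

cat GSM_C-data.txt
```

Taking the file name from the header row means the name and the contents cannot drift apart, which is the kind of mistake that is easy to make by hand.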
As St. Clair and Visick suggest, these data alone do
not suffice for analysis. In particular, it is useful to
have the "call" for each expression datum: is the gene
Present, Absent, or Marginal? The file GSE9311_family.soft.gz
contains those data, although in a different format. In particular,
the experiments are listed sequentially in the file, rather than
combined in a single table.
a. Make a copy of that file and uncompress it with
gunzip GSE9311_family.soft
b. Use less to skim through the file. What information does
it seem to contain? Use /SAMPLE to find the first sample.
c. The file contains a variety of tables. As you may have noted, tables
typically start with ID_REF. In this file, the end
of tables is marked with !sample_table_end. Using
grep -n, find out where the tables in the file start
and end. You can also use grep SAMPLE to find out which
table corresponds to which sample (the line number for the sample will
precede, but be near, the line number for the start of the table).
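You can ask grep for several patterns at once by repeating the -e option, which puts the sample names, table starts, and table ends in one listing. The sketch below assumes a made-up file, toy.soft, that only imitates the layout described above; its sample names GSM_X and GSM_Y are invented.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Made-up file imitating the layout: two samples, each with one small table.
{
    printf '%s\n' '^SAMPLE = GSM_X' '!Sample_title = first toy sample' '!sample_table_begin'
    printf 'ID_REF\tVALUE\tABS_CALL\np1\t10\tP\np2\t20\tA\n'
    printf '%s\n' '!sample_table_end'
    printf '%s\n' '^SAMPLE = GSM_Y' '!Sample_title = second toy sample' '!sample_table_begin'
    printf 'ID_REF\tVALUE\tABS_CALL\np1\t11\tA\np2\t21\tP\n'
    printf '%s\n' '!sample_table_end'
} > toy.soft

# List, with line numbers, every sample line, table start, and table end.
# The backslash makes the second ^ a literal caret rather than an anchor.
grep -n -e '^\^SAMPLE' -e '^ID_REF' -e '^!sample_table_end' toy.soft > toy-markers.txt

cat toy-markers.txt
```

Saving the listing to a file also gives you something to paste into your log of commands and results.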
d. Look at St. Clair and Visick's instruction 9 on p. 296. Note anything you find interesting or surprising about that instruction. Then, talk about your answer with your instructor.
e. Extract the table that corresponds to GSM237280 and call the result
GSM237280-soft.txt. Using grep -n, see what
you find out about where 244901_at is located in
GSM237280-data.txt and GSM237280-soft.txt.
f. Look at St. Clair and Visick's instruction 10 on pp. 296-297. What do you see as possible complications in this instruction?
g. Determine a way to appropriately merge GSM237280-data.txt
and GSM237280-soft.txt.
Because Section 9.4 is an "On Your Own" project, it has a lot
more steps. We're going to end with a simpler task. Let's try to
identify a few genes in which we get different results in two experiments.
a. Pick two of the experiments/conditions. (You can simply choose these by number, or you can be a more sensible scientist and think about two conditions that you would like to compare.)
b. Extract data on those two experiments from GSE9311_family.soft.
c. Design a technique for finding out when the gene product is present
in the first condition, but not in the second condition. You should be
able to do this with a combination of cut, grep,
and join (although not necessarily in that order).
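Here is one possible shape for such a pipeline, sketched on made-up data. It assumes each table has three tab-separated columns (probe, value, call) and that "not present" means a call of A; whether Marginal calls should also count is a decision for you to make. The file names cond1 and cond2 are placeholders.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Hold a literal tab character so it can be passed to join and grep.
TAB="$(printf '\t')"

# Made-up tables, already sorted by probe: probe, value, call.
printf 'p1\t10\tP\np2\t20\tA\np3\t30\tP\np4\t40\tM\n' > cond1
printf 'p1\t11\tA\np2\t21\tP\np3\t31\tP\np4\t41\tA\n' > cond2

# After the join each line is: probe, value1, call1, value2, call2.
# Keep only the probe and the two calls (columns 1, 3, and 5), then
# keep the lines whose calls are P in cond1 and A in cond2.
join -t "$TAB" cond1 cond2 \
    | cut -f1,3,5 \
    | grep "${TAB}P${TAB}A\$" > present-then-absent

cat present-then-absent
```

In the toy data only p1 is Present in the first table and Absent in the second, so it is the only line that survives.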
Once you've identified some differences, you should explore the noted purpose of that gene.
a. Download and uncompress affy_ATH1_array_elements-2010-12-20.txt.gz, information on the array elements in Affymetrix ATH1, the microarray used in this experiment.
b. Download and uncompress ATH_GO_GOSLIM.20080712.txt, the gene ontology data file for ATH.
c. Using grep, find out what the AGI identifier for the gene of interest is, and then look it up in the gene ontology file.
# grep 244901_at affy_ATH1_array_elements-2010-12-20.txt
244901_at oligonucleotide Arabidopsis thaliana no ATMG00640 hydrogen ion transporting ATP synthases, rotational mechanism;zinc ion binding 188160 188619
#
# grep ATMG00640 ATH_GO_GOSLIM.20080712.txt
ATMG00640 504952169 ORF25 mitochondrial proton-transporting ATP synthase complex, coupling factor F(o) GO:0000276 359 comp mitochondria IDA Publication:501705871|PMID:12681508 TAIR 2003-07-07
...
d. You may also want to see what information is provided in the original
.soft file.
grep 244901_at GSE9311_family.soft
244901_at orf25 Arabidopsis thaliana Mar 11, 2009 Exemplar sequence Affymetrix Proprietary Database hypothetical protein orf25 0015986 // ATP synthesis coupled proton transport // inferred from electronic annotation 0000276 // mitochondrial proton-transporting ATP synthase complex, coupling factor F(o) // inferred from electronic annotation /// 0005739 // mitochondrion // inferred from electronic annotation 0015078 // hydrogen ion transmembrane transporter activity // inferred from electronic annotation
...
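If you have several probes to look up, you can chain the two lookups so that the identifier found by the first grep feeds the second. The sketch below uses made-up files and identifiers (probe_b, GENE_B), and it assumes the gene identifier sits in column 3 of the toy elements file; check which column holds it in the real file before adapting this.

```shell
# Work in a scratch directory so no real files are overwritten.
cd "$(mktemp -d)"

# Hold a literal tab character so it can be used in grep patterns.
TAB="$(printf '\t')"

# Made-up array-element file: probe, element type, gene identifier.
printf 'probe_a\toligo\tGENE_A\nprobe_b\toligo\tGENE_B\n' > toy-elements.txt

# Made-up ontology file: gene identifier, annotation.
printf 'GENE_A\tterm one\nGENE_B\tterm two\nGENE_B\tterm three\n' > toy-go.txt

# Find the probe's line and keep only its gene identifier (column 3 here).
# Anchoring the pattern with ^ and a trailing tab avoids partial matches.
agi=$(grep "^probe_b${TAB}" toy-elements.txt | cut -f3)

# Look up every annotation line for that gene.
grep "^${agi}${TAB}" toy-go.txt > toy-annotations.txt

cat toy-annotations.txt
```

The anchored pattern matters more on real data, where one identifier can appear inside a longer one.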
This page was generated by
Siteweaver on Sat Nov 26 19:05:56 2011.
The source to the page was last modified on Thu Nov 3 15:00:10 2011.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/ExBioPy/project-9.4a.html.
Samuel A. Rebelsky