Exploring Bioinformatics with Python

Preliminary Notes

You should have Section 9.4 of St. Clair and Visick handy while you do this project. St. Clair and Visick give much of the background on why you're doing what you're doing. This lab asks you to explore some alternative techniques for doing so.

It appears that the data we're using in this lab have changed since St. Clair and Visick designed the project. A few of the problems ask you to reflect on issues related to that change.

Background

As St. Clair and Visick suggest, one of the important techniques any bioinformatician will need to master is preprocessing data, that is, taking data in the form that it is available and converting it to a form that you need.

St. Clair and Visick recommend that you use a spreadsheet to do much of the conversion. I expect that Baggerly would critique that approach, since it is hard to document precisely what you did, and equally hard to replicate it. Instead, we will use common Linux utilities to manipulate the data.

Linux Utilities

What are common Linux utilities? These are the small programs that come with every Linux distribution. You generally enter these in the terminal window, through what is called the command line.

less file
Page through a file. Space bar advances a screen. Return advances a line. Q quits. Slash-pattern searches for a pattern.
less -N file
Like less, but each line is preceded by its line number. When a line is wider than the screen, it wraps, so a single numbered line may occupy several rows on the screen.
wc file
Determine how many lines, words, and characters are in the file.
wc -l file
Determine how many lines are in the file.
grep pattern file
Find the lines in file that include pattern.
grep -n pattern file
Find out where pattern appears in the file. Each matching line is printed, preceded by its line number.
head -n # file
Grab the first # lines of the file.
tail -n # file
Grab the last # lines of the file.
tail -n +# file
Grab all of the lines of the file, starting with line #.
cut -fcolumns file
Extract the given columns from the file. For example, cut -f1,3,4 data would extract columns 1, 3, and 4 from the data file.
uniq file
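Remove repeated lines from the file. Note that uniq only collapses repeats that are adjacent, so you will usually want to sort the file first.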
join -t '<ctrl-v> <tab>' files
Join two columnar data files, ensuring that labels match.
sort file
Sort the lines of the file. Note that join expects both of its input files to be sorted on the column you are joining on, so sort them before joining.
gedit file
A simple editor.
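
To get a feel for these tools before you touch real data, you might try something like the following sketch. The file toy.txt and its contents are made up for illustration; they are not one of the lab files. The sketch uses > to create a file and >> to append to it.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Made-up tab-separated data: probe label, value, call.
# > creates the file; >> appends a line to it.
printf 'ID_REF\tVALUE\tABS_CALL\n'  > toy.txt
printf 'probe_a\t10.5\tP\n'        >> toy.txt
printf 'probe_b\t3.2\tA\n'         >> toy.txt
printf 'probe_c\t7.9\tP\n'         >> toy.txt
printf 'probe_d\t1.1\tM\n'         >> toy.txt

wc -l toy.txt             # reports 5 lines: the header plus four rows
grep -n probe_c toy.txt   # the matching line, preceded by "4:"
head -n 2 toy.txt         # the header and the first data row
tail -n 2 toy.txt         # the last two rows
tail -n +2 toy.txt        # everything from line 2 on, that is, no header

# Distinct calls: drop the header, keep column 3, sort, then collapse repeats.
tail -n +2 toy.txt > rows.txt
cut -f3 rows.txt > calls.txt
sort calls.txt > calls-sorted.txt
uniq calls-sorted.txt > calls-distinct.txt   # leaves A, M, P
```

The sort before uniq matters here: without it, the two P lines are not adjacent and both survive.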

Saving Output from a Linux Command

As you may have guessed, when you run a Linux command, the output just appears on the screen. When you have a lot of output, it scrolls by quickly. What can you do? The easiest thing to do is to save your result to a file. You do this with

command > filename

For example, if I wanted all but the first ten lines of file data1, and I wanted to save the result in temp1, I would write

tail -n +11 data1 > temp1

Linux also provides a special file concatenation command, cat. You use cat to join files together.

cat file1 file2 > combined-file

Be careful with these kinds of commands. If you put an existing file after the greater-than sign, Linux will happily overwrite the target file!
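
If you would like the shell to protect you from that mistake, one option is the noclobber setting. The following is a small sketch, assuming a POSIX-style shell such as bash; the file names are made up.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

printf 'important results\n' > results.txt

# With noclobber set, the shell refuses to redirect onto an existing file.
set -o noclobber
if (printf 'oops\n' > results.txt) 2>/dev/null; then
    echo "results.txt was overwritten"
else
    echo "overwrite refused; results.txt is intact"
fi

# Appending with >> is still allowed, as is writing to a new file.
printf 'more results\n' >> results.txt
cat results.txt > copy.txt

set +o noclobber    # go back to the default behaviour
```

The setting lasts only for the current terminal session, so you would need to set it again each time you open a new one.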

Exercises

Please keep a running log of the commands that you type for each step.

Exercise 0: Exploring Linux Tools

The files sample1 and sample2 contain some sample information from microarray runs. In particular, the first column labels the probe for one element in a microarray and the second column gives the data from one run.

a. Using less, scroll through one of the files.

b. Using grep, determine where 267625_at falls in each file.

c. Using wc, find out how many lines are in each file.

d. Using head, extract the first ten lines of sample1.

e. Using head, extract all but the last ten lines of sample1.

f. Using tail, extract the last ten lines of sample1.

g. Using tail, extract the lines starting with 267625_at from sample1.

h. You may have noticed that the two files have identical column headers. That's somewhat inconvenient. Using gedit, update the headers so that the VALUE and ABS_CALL columns include the sample number.

i. It will be easier to process the data if we combine them in one file. Using join, combine the files, calling the result combined-samples.

j. Figure out how many lines combined-samples has. Can you explain why? Look at the first few lines of your input samples.

Exercise 1: Extracting a Table

In their exercises 1-7, St. Clair and Visick ask you to download the file GSE9311_series_matrix.txt and to extract some lines from that file.

a. Make a copy of GSE9311_series_matrix.txt.

b. Using the wc -l tool, determine how many lines are in the file.

c. Using head and tail, explore the beginning and end of the file.

d. St. Clair and Visick tell us that the line that begins with ID_REF indicates the start of the actual data. Use grep -n to find that line.

e. You should now have enough information to extract just the useful data. Write a command (or series of commands) to extract the data from GSE9311_series_matrix.txt, storing the result in the file GSE9311_temp1.txt.

f. You may note that there are nine columns in the file. The first column gives the probe. The remaining eight columns represent the expression data for each of the eight experiments. It may be useful to extract the individual experiment data for easier processing.

Using cut, create a separate file for a few of the experiments, with a name like experiment-data.txt. For example, the file GSM237282-data.txt should contain columns 1 (labels) and 4 (GSM237282 data) from the original matrix. You should certainly make GSM237280-data.txt.
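
If you are unsure how the pieces fit together, here is a sketch of the general approach on a tiny made-up file. The file toy_matrix.txt, its contents, and the marker !toy_table_end are invented for illustration; the line numbers you find in the real file will be different.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Made-up file: two lines of metadata, a table, and a closing marker.
printf '!note\tfirst line of metadata\n'   > toy_matrix.txt
printf '!note\tsecond line of metadata\n' >> toy_matrix.txt
printf 'ID_REF\tGSM_A\tGSM_B\n'           >> toy_matrix.txt
printf 'probe_a\t10.5\t11.0\n'            >> toy_matrix.txt
printf 'probe_b\t3.2\t2.9\n'              >> toy_matrix.txt
printf '!toy_table_end\n'                 >> toy_matrix.txt

# grep -n reports the header preceded by "3:", so the table starts on line 3.
grep -n ID_REF toy_matrix.txt

# Keep everything from line 3 onward (four lines, including the marker) ...
tail -n +3 toy_matrix.txt > toy_temp.txt

# ... then keep only the first three of those, which drops the marker.
head -n 3 toy_temp.txt > toy_table.txt

# Keep the label column and column 3 (the GSM_B data).
cut -f1,3 toy_table.txt > GSM_B-data.txt
```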

Exercise 2: Adding Predictive Data

As St. Clair and Visick suggest, these data alone do not suffice for analysis. In particular, it is useful to have the call for each expression datum: is the gene Present, Absent, or Marginal? The file GSE9311_family.soft.gz contains those data, although in a different format. In particular, the experiments are listed sequentially in the file, rather than combined in a single table.

a. Make a copy of that file and uncompress it with

gunzip GSE9311_family.soft.gz

b. Use less to skim through the file. What information does it seem to contain? Use /SAMPLE to find the first sample.

c. The file contains a variety of tables. As you may have noted, tables typically start with ID_REF. In this file, the end of each table is marked with !sample_table_end. Using grep -n, find out where the tables in the file start and end. You can also use grep -n SAMPLE to find out which table corresponds to which sample (the line number for the sample will precede, but be near to, the line number for the start of the table).

d. Look at St. Clair and Visick's instruction 9 on p. 296. Note anything you find interesting or surprising about that instruction. Then, talk about your answer with your instructor.

e. Extract the table that corresponds to GSM237280 and call the result GSM237280-soft.txt. Using grep -n, see what you find out about where 244901_at is located in GSM237280-data.txt and GSM237280-soft.txt.

f. Look at St. Clair and Visick's instruction 10 on pp. 296-297. What do you see as possible complications in this instruction?

g. Determine a way to appropriately merge GSM237280-data.txt and GSM237280-soft.txt.
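
As a hint for parts c, e, and g, here is a sketch of one way to pull a table out of the middle of a file and merge it with a second file. Everything in it is made up: toy.soft only imitates the layout described above, and the probe names and values are invented. Treat it as one possible approach, not the required one.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"
TAB=$(printf '\t')    # a literal tab character, for join -t

# Made-up stand-in for a multi-table file: two samples, each with a table.
printf '^SAMPLE = toy1\n'            > toy.soft
printf 'ID_REF\tVALUE\tABS_CALL\n'  >> toy.soft
printf 'probe_b\t3.2\tA\n'          >> toy.soft
printf 'probe_a\t10.5\tP\n'         >> toy.soft
printf '!sample_table_end\n'        >> toy.soft
printf '^SAMPLE = toy2\n'           >> toy.soft
printf 'ID_REF\tVALUE\tABS_CALL\n'  >> toy.soft
printf 'probe_b\t2.9\tA\n'          >> toy.soft
printf 'probe_a\t11.0\tP\n'         >> toy.soft
printf '!sample_table_end\n'        >> toy.soft

grep -n ID_REF toy.soft                # headers are on lines 2 and 7
grep -n '!sample_table_end' toy.soft   # tables end on lines 5 and 10

# The rows of the second table sit on lines 8-9, between header and marker.
head -n 9 toy.soft > upto9.txt
tail -n +8 upto9.txt > toy2-rows.txt

# Made-up two-column file playing the role of the matrix extract.
printf 'probe_a\t11.0\n'  > toy2-data.txt
printf 'probe_b\t2.9\n'  >> toy2-data.txt

# join expects both inputs sorted on the join field (column 1 by default).
sort toy2-data.txt > data.sorted
sort toy2-rows.txt > rows.sorted
join -t "$TAB" data.sorted rows.sorted > toy2-merged.txt
```

Leaving the header lines out before sorting keeps them from being sorted into the middle of the data; you can always add a header back afterward with cat.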

Exercise 3: Identifying Genes of Interest

Because section 9.4 is an On Your Own project, it has a lot more steps. We're going to end with a simpler task. Let's try to identify a few genes in which we get different results in two experiments.

a. Pick two of the experiments/conditions. (You can simply choose these by number, or you can be a more sensible scientist and think about two conditions that you would like to compare.)

b. Extract data on those two experiments from GSE9311_family.soft.

c. Design a technique for finding out when the gene product is present in the first condition, but not in the second condition. You should be able to do this with a combination of cut, grep, and join (although not necessarily in that order).
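
If you get stuck on part c, the following sketch shows one possible combination of cut, join, and grep on made-up data. The files cond1.txt and cond2.txt are invented, and the sketch assumes that a Marginal call counts as "not present"; you may decide otherwise.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"
TAB=$(printf '\t')    # a literal tab character, for join -t and grep

# Made-up tables for two conditions: probe label, value, call (P, A, or M).
printf 'probe_a\t10.5\tP\n'  > cond1.txt
printf 'probe_b\t8.1\tP\n'  >> cond1.txt
printf 'probe_c\t1.2\tA\n'  >> cond1.txt
printf 'probe_d\t6.4\tP\n'  >> cond1.txt

printf 'probe_a\t11.0\tP\n'  > cond2.txt
printf 'probe_b\t0.9\tA\n'  >> cond2.txt
printf 'probe_c\t7.7\tP\n'  >> cond2.txt
printf 'probe_d\t2.0\tM\n'  >> cond2.txt

# Keep only the label and the call, sorted by label for join.
cut -f1,3 cond1.txt > c1.tmp
sort c1.tmp > calls1.txt
cut -f1,3 cond2.txt > c2.tmp
sort c2.tmp > calls2.txt

# One line per probe: label, call in condition 1, call in condition 2.
join -t "$TAB" calls1.txt calls2.txt > both-calls.txt

# Keep probes called P in condition 1 and A or M in condition 2.
# The trailing $ anchors the pattern to the end of the line.
grep "${TAB}P${TAB}[AM]\$" both-calls.txt > present-then-not.txt
```

With these toy data, present-then-not.txt ends up listing probe_b and probe_d.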

Exercise 4: Exploring Genes of Interest

Once you've identified some differences, you should explore the noted purpose of that gene.

a. Download and uncompress affy_ATH1_array_elements-2010-12-20.txt.gz, information on the array elements in Affymetrix ATH1, the microarray used in this experiment.

b. Download and uncompress ATH_GO_GOSLIM.20080712.txt, the gene ontology data file for ATH.

c. Using grep, find out what the AGI identifier for the gene of interest is, and then look it up in the gene ontology file.

# grep 244901_at affy_ATH1_array_elements-2010-12-20.txt
244901_at	oligonucleotide	Arabidopsis thaliana	no	ATMG00640	hydrogen ion transporting ATP synthases, rotational mechanism;zinc ion binding	188160	188619
# grep ATMG00640 ATH_GO_GOSLIM.20080712.txt
ATMG00640	504952169	ORF25	mitochondrial proton-transporting ATP synthase complex, coupling factor F(o)	GO:0000276	359	comp	mitochondria	IDA	Publication:501705871|PMID:12681508	TAIR	2003-07-07
...

d. You may also want to see what information is provided in the original .soft file.

grep 244901_at GSE9311_family.soft
244901_at	orf25		Arabidopsis thaliana	Mar 11, 2009	Exemplar sequence	Affymetrix Proprietary Database	hypothetical protein	orf25	0015986 // ATP synthesis coupled proton transport // inferred from electronic annotation	0000276 // mitochondrial proton-transporting ATP synthase complex, coupling factor F(o) // inferred from electronic annotation /// 0005739 // mitochondrion // inferred from electronic annotation	0015078 // hydrogen ion transmembrane transporter activity // inferred from electronic annotation
...
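
If you have many probes to look up, you may prefer to let cut pull out the identifier rather than copying it by hand. Here is a sketch on made-up files; toy_elements.txt and toy_go.txt only imitate the layout of the sample output above, in which the identifier appears to be the fifth tab-separated column, and the gene names and terms are invented.

```shell
# Work in a scratch directory so no real files are touched.
cd "$(mktemp -d)"

# Made-up stand-ins for the array-element and gene-ontology files.
printf 'probe_a\toligonucleotide\tspecies\tno\tGENE0001\tsome description\n'   > toy_elements.txt
printf 'probe_b\toligonucleotide\tspecies\tno\tGENE0002\tother description\n' >> toy_elements.txt
printf 'GENE0001\tterm one\n'     > toy_go.txt
printf 'GENE0002\tterm two\n'    >> toy_go.txt
printf 'GENE0001\tterm three\n'  >> toy_go.txt

# Find the probe's row, then keep only column 5 (the identifier in this layout).
grep probe_a toy_elements.txt > hit.txt
cut -f5 hit.txt > gene-id.txt

# $(cat gene-id.txt) substitutes the file's contents as the search pattern.
grep "$(cat gene-id.txt)" toy_go.txt > go-hits.txt
```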

This page was generated by Siteweaver on Sat Nov 26 19:05:56 2011.
The source to the page was last modified on Thu Nov 3 15:00:10 2011.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/ExBioPy/project-9.4a.html.

Samuel A. Rebelsky
rebelsky@grinnell.edu