Exploring Bioinformatics with Python
As we have recently explored, the discovery of putative protein-coding genes in the genome begins with the identification of Open Reading Frames (ORFs). At first glance, an open reading frame is pretty straightforward: It's a sequence of codons beginning with the start codon AUG (ATG in DNA) and ending with one of the stop codons (UAA, UAG, or UGA; TAA, TAG, or TGA in DNA). In practice, other issues come into play, such as introns and alternative start codons. For this project, however, we will focus on this straightforward view of ORFs.
0. For preparation, make a copy of the pACYC184 sequence used in the corresponding Web exploration. We will use this as the basis of our searching.
Note: For all of the following, assume that we represent our sequences as strings.
1. We should begin by identifying potential start codons.
a. Write a procedure, findStartCodonsForward(seq),
that identifies the positions of the start codons in seq, assuming
that we're only looking forward in the sequence. Your procedure should
return one list of positions.
There are two obvious strategies for finding the beginnings of open
reading frames: one can use the string method find to locate each copy
of AUG (or ATG, depending on the file format), or one can step through
the string, comparing each triplet in turn to AUG (or ATG).
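The two strategies above might be sketched as follows. This is only one possible approach, not the expected solution. It assumes an uppercase DNA string (so the start codon is ATG), and the second function name, findStartCodonsByStepping, is made up for illustration.

```python
def findStartCodonsForward(seq, start_codon="ATG"):
    """Return the 0-based positions of every start codon in seq, using str.find."""
    positions = []
    pos = seq.find(start_codon)
    while pos != -1:
        positions.append(pos)
        # Resume the search one character past the previous hit.
        pos = seq.find(start_codon, pos + 1)
    return positions


def findStartCodonsByStepping(seq, start_codon="ATG"):
    """Return the same positions by comparing each triplet in turn (hypothetical name)."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == start_codon]


print(findStartCodonsForward("CATGATGTTATGA"))     # [1, 4, 9]
print(findStartCodonsByStepping("CATGATGTTATGA"))  # [1, 4, 9]
```

Both versions report positions in every frame. Applying len to the result is one way to answer a "how many" question.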
b. How many potential start codons are there if we look at the pACYC184 sequence forward? What instructions did you use to determine this?
c. Right now, we're only looking forward in the sequence. We should also examine the paired strand. How many potential start codons are there if we look at the paired strand? (You'll need to read the string backwards and to replace each nucleotide with its pair; that is, you'll need the reverse complement.) What instructions did you use to determine this?
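One way to look at the paired strand is to build its reverse complement and then search it as if it were a forward sequence. The sketch below assumes an uppercase DNA string containing only A, C, G, and T. The names reverseComplement and countStartCodonsReverse are illustrative, not part of the assignment.

```python
# Translation table that maps each nucleotide to its pair (uppercase DNA assumed).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverseComplement(seq):
    """Return the paired strand, written in its own reading direction."""
    return seq.translate(COMPLEMENT)[::-1]


def countStartCodonsReverse(seq, start_codon="ATG"):
    """Count start codons that appear when reading the paired strand."""
    return reverseComplement(seq).count(start_codon)


print(reverseComplement("AACAT"))        # ATGTT
print(countStartCodonsReverse("AACAT"))  # 1
```

Note that a position found in the reverse complement is counted from the other end of the original string: a hit at index i in a sequence of length n covers forward indices n-3-i through n-1-i.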
d. At times, we want to classify our start codons by the
frame (offset 0 from starting point, offset 1 from starting
point, or offset 2 from starting point). Write a procedure
separateFrames(positions) that builds a list of three
lists, corresponding to the positions that belong in the offset-zero
frame, the offset-one frame, and the offset-two frame.
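A minimal sketch of the frame classification described above, assuming that positions are 0-based indices so that the frame of a position is simply its remainder when divided by 3:

```python
def separateFrames(positions):
    """Split positions into three lists by reading frame (position modulo 3)."""
    frames = [[], [], []]
    for pos in positions:
        frames[pos % 3].append(pos)
    return frames


print(separateFrames([1, 4, 9, 12, 14]))  # [[9, 12], [1, 4], [14]]
```

Each inner list keeps the positions in the order in which they appeared in the input.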
2. Let's assume, for the moment, that there are no introns in our reading frames.
a. Write a procedure, findStopCodon(seq, pos),
that finds the stop codon for a sequence that has a start codon at
pos. (You can return -1 if there is no stop codon
after the position.)
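The search for a stop codon has to stay in the same frame as the start codon, which means stepping three nucleotides at a time. The sketch below assumes an uppercase DNA string (stop codons TAA, TAG, and TGA) and returns the index of the first nucleotide of the stop codon. Both choices are assumptions, not requirements of the assignment.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA alphabet assumed


def findStopCodon(seq, pos):
    """Return the index of the first in-frame stop codon after the start codon at pos, or -1."""
    for i in range(pos + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return -1


print(findStopCodon("ATGAAATGATTT", 0))  # 6
print(findStopCodon("ATGACCC", 0))       # -1
```

In the second call, the TGA at index 1 is ignored because it is not aligned with the start codon.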
b. Write a procedure, findSimpleORFsForward(seq), that
finds all the open reading frames in seq, looking
forward in seq. You can represent each open
reading frame as a two-element list.
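Putting the two searches together gives one possible shape for the ORF finder. This sketch is self-contained, so it repeats the start and stop searches inline. It assumes uppercase DNA and represents each ORF as [start, stop], where stop is the index of the first nucleotide of the stop codon (an assumed convention).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}  # DNA alphabet assumed


def findSimpleORFsForward(seq):
    """Return a list of [start, stop] pairs, one for each ATG followed by an in-frame stop codon."""
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        # Walk codon by codon, in the frame set by this start codon.
        for stop in range(start + 3, len(seq) - 2, 3):
            if seq[stop:stop + 3] in STOP_CODONS:
                orfs.append([start, stop])
                break
    return orfs


print(findSimpleORFsForward("CCATGAAATAGATGCCCTGA"))  # [[2, 8], [11, 17]]
```

With this design, two in-frame start codons that share a stop codon produce two overlapping ORFs. Whether to keep only the longest is a separate decision.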
3. As you may have noted, many of the simple open reading frames returned by the previous procedure will be too small to code a protein.
a. Write a procedure, filterSmallORFs(list_of_orfs,
min_size), that filters out all of the small reading frames.
b. How many ORFs in the pACYC184 sequence have a size of at least 300? What instruction did you use to determine this answer?
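For the filter in part 3a, a list comprehension is probably enough. The sketch below assumes each ORF is a [start, stop] pair and measures size in nucleotides as stop minus start (that is, excluding the stop codon). Other definitions of size are equally reasonable.

```python
def filterSmallORFs(list_of_orfs, min_size):
    """Keep only the ORFs whose length in nucleotides is at least min_size."""
    return [orf for orf in list_of_orfs if orf[1] - orf[0] >= min_size]


# Made-up positions, for illustration only.
orfs = [[2, 8], [11, 317], [40, 340]]
print(filterSmallORFs(orfs, 300))       # [[11, 317], [40, 340]]
print(len(filterSmallORFs(orfs, 300)))  # 2
```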
4. Our book notes that some codons appear rarely in some organisms. Hence, we might want to filter out ORFs that contain that codon (or too many copies of that codon).
a. Write a procedure, countCodons(codon, seq,
start, stop), that counts how many copies of
codon appear in seq, starting
at position start and stopping before position
stop. (Be careful: You don't want to
count mis-aligned triplets as codons.)
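The warning about mis-aligned triplets is the reason a plain seq.count(codon) is not enough. A sketch that steps through the region one codon at a time, assuming that start is the first nucleotide of a codon:

```python
def countCodons(codon, seq, start, stop):
    """Count in-frame copies of codon in seq[start:stop], stepping three nucleotides at a time."""
    count = 0
    for i in range(start, stop - 2, 3):
        if seq[i:i + 3] == codon:
            count += 1
    return count


seq = "ATGAGGCAGGAGAGGTAA"  # ATG AGG CAG GAG AGG TAA
print(countCodons("AGG", seq, 0, 15))  # 2
print(seq.count("AGG"))                # 3, because it also counts the mis-aligned AGG at index 7
```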
b. Write a procedure, filterUncommonCodons(seq,
list_of_orfs, codon, percent), that filters
out any ORFs in which codon makes up a greater percentage
of the codons than percent.
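One possible reading of the percentage is "copies of codon divided by the total number of codons in the ORF, times 100". The sketch below makes that assumption, treats each ORF as a [start, stop] pair that excludes the stop codon, and repeats a small countCodons helper so that it runs on its own. The parameter is spelled list_of_orfs because hyphens are not legal in Python names.

```python
def countCodons(codon, seq, start, stop):
    """Count in-frame copies of codon in seq[start:stop]."""
    return sum(1 for i in range(start, stop - 2, 3) if seq[i:i + 3] == codon)


def filterUncommonCodons(seq, list_of_orfs, codon, percent):
    """Keep the ORFs in which codon makes up at most percent of all codons."""
    kept = []
    for start, stop in list_of_orfs:
        total = (stop - start) // 3
        share = 100.0 * countCodons(codon, seq, start, stop) / total if total else 0.0
        if share <= percent:
            kept.append([start, stop])
    return kept


seq = "ATGAGGCAGGAGAGGTAA"  # five codons before the stop; two of them are AGG (40%)
print(filterUncommonCodons(seq, [[0, 15]], "AGG", 50))  # [[0, 15]]
print(filterUncommonCodons(seq, [[0, 15]], "AGG", 30))  # []
```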
5. As St. Clair and Visick note, we gain additional confidence in an open reading frame when it is preceded by an appropriate promoter region. In a prokaryotic cell, this is the Shine-Dalgarno sequence: a sequence similar to 5'-AGGAGG, six or seven nucleotides before the start codon.
a. Write a procedure, sdp(seq, list_of_orfs),
that finds only the ORFs in the list that have a Shine-Dalgarno
prokaryotic promoter. You may choose any reasonable metric you want
for "similar to".
b. How many of the ORFs that you got in 3b have a promoter sequence? What code did you use to answer this question?
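For sdp, one reasonable similarity metric is the number of mismatched nucleotides against AGGAGG. The sketch below is only one interpretation: it assumes uppercase DNA, assumes that "six or seven nucleotides before the start codon" means six or seven nucleotides lie between the end of the motif and the start codon, and allows one mismatch by default. The helper mismatches and the parameters max_mismatches and gaps are made up for illustration.

```python
SD_MOTIF = "AGGAGG"


def mismatches(a, b):
    """Count the positions at which two equal-length strings differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def sdp(seq, list_of_orfs, max_mismatches=1, gaps=(6, 7)):
    """Keep the ORFs that have a near-match to AGGAGG at an allowed gap before the start codon."""
    kept = []
    for orf in list_of_orfs:
        start = orf[0]
        for gap in gaps:
            begin = start - gap - len(SD_MOTIF)
            if begin < 0:
                continue  # not enough sequence upstream for this gap
            candidate = seq[begin:begin + len(SD_MOTIF)]
            if mismatches(candidate, SD_MOTIF) <= max_mismatches:
                kept.append(orf)
                break
    return kept


with_sd = "AGGAGG" + "CCCCCC" + "ATGAAATAA"
without_sd = "TTTTTT" + "CCCCCC" + "ATGAAATAA"
print(sdp(with_sd, [[12, 18]]))     # [[12, 18]]
print(sdp(without_sd, [[12, 18]]))  # []
```

Loosening max_mismatches or widening gaps trades missed promoters for false positives. Either choice is defensible as long as it is stated.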
6. Once we've identified the open reading frames that are more likely to represent genes, it is useful to determine what sequences of amino acids they represent. For example, once we have those sequences, we can compare them to known sequences using BLAST or another technique.
a. Write a procedure, potentialProteins(seq,
list_of_orfs), that builds a list of amino acid
strings (that is, potential proteins), one for each ORF in
list_of_orfs.
b. Pick two of those potential proteins, BLAST them, and summarize your results.
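For part 6a, translation needs a codon table. The sketch below builds the standard genetic code from a compact 64-letter string (codons ordered with bases T, C, A, G), assumes uppercase DNA, and assumes each ORF is a [start, stop] pair that excludes the stop codon. The helper name translate and the use of X for unrecognized triplets are illustrative choices.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in TTT, TTC, TTA, TTG, TCT, ... order.
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(seq, start, stop):
    """Translate seq[start:stop] codon by codon; unknown triplets become X."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(start, stop - 2, 3))


def potentialProteins(seq, list_of_orfs):
    """Return one amino acid string for each [start, stop] ORF."""
    return [translate(seq, start, stop) for start, stop in list_of_orfs]


seq = "CCATGAAATAGATGCCCTGA"
print(potentialProteins(seq, [[2, 8], [11, 17]]))  # ['MK', 'MP']
```

The resulting strings use one-letter amino acid codes, which is the form that a protein BLAST search accepts.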
This page was generated by
Siteweaver on Sat Nov 26 19:05:56 2011.
The source to the page was last modified on Sat Nov 26 18:51:12 2011.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/ExBioPy/project-6.5.html.
Samuel A. Rebelsky