Exploring Bioinformatics with Python

Project 6.5: Exploring Open Reading Frames

As we have recently explored, the discovery of putative protein-coding genes in the genome begins with the identification of Open Reading Frames (ORFs). At first glance, an open reading frame is pretty straightforward: It's a sequence of codons beginning with AUG (or ATG) and ending with one of the stop codons. Of course, more issues come into play. However, for this project, we will focus on this straightforward view of ORFs.

0. For preparation, make a copy of the pACYC184 sequence used in the corresponding Web exploration. We will use this as the basis of our searching.

Note: For all of the following, assume that we represent our sequences as strings.

1. We should begin by identifying potential start codons.

a. Write a procedure, findStartCodonsForward(seq), that identifies the positions of the start codons in seq, assuming that we're only looking forward in the sequence. Your procedure should return one list of positions.

There are two obvious strategies for finding the beginnings of open reading frames. One can use the string method find (as in seq.find) to locate each copy of AUG (or ATG, depending on the file format), or one can step through the string, comparing each triplet in turn to AUG (or ATG).
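The two strategies above might be sketched as follows. This is only one possible approach, assuming an upper-case DNA string (so the start codon is ATG) and 0-based positions. The start_codon parameter and the name findStartCodonsByStepping are illustrative additions, not part of the assignment.

```python
def findStartCodonsForward(seq, start_codon="ATG"):
    """Strategy 1: repeatedly call str.find, resuming just past each hit."""
    positions = []
    pos = seq.find(start_codon)
    while pos != -1:
        positions.append(pos)
        # Resume one character later so that no later match is skipped.
        pos = seq.find(start_codon, pos + 1)
    return positions


def findStartCodonsByStepping(seq, start_codon="ATG"):
    """Strategy 2: compare every triplet in turn (hypothetical helper name)."""
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == start_codon]


print(findStartCodonsForward("ATGCCATGA"))     # [0, 5]
print(findStartCodonsByStepping("ATGCCATGA"))  # [0, 5]
```

Both versions return the same list; the first leans on the built-in search, while the second makes the triplet-by-triplet comparison explicit.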

b. How many potential start codons are there if we look at the pACYC184 sequence forward? What instructions did you use to determine this?

c. Right now, we're only looking forward in the sequence. We should also examine the paired strand. How many potential start codons are there if we look at the paired strand? (You'll need to read the string backwards and to replace each nucleotide with its pair.) What instructions did you use to determine this?
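One way to obtain the paired strand is to build its reverse complement and then search that string exactly as before. The sketch below assumes an upper-case DNA string containing only A, C, G, and T; the names COMPLEMENT and reverseComplement are illustrative, not part of the assignment.

```python
# Pairing rules for DNA; any other character raises a KeyError in this sketch.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverseComplement(seq):
    """Read seq backwards, replacing each nucleotide with its pair."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))


print(reverseComplement("ATGC"))                 # GCAT
# CAT on the forward strand reads as ATG on the paired strand.
print(reverseComplement("CATCAT").count("ATG"))  # 2
```

Positions found in the reversed string are offsets into that string, so they would need converting if you want to report them relative to the original sequence.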

d. At times, we want to classify our start codons by the frame (offset 0 from starting point, offset 1 from starting point, or offset 2 from starting point). Write a procedure separateFrames(positions) that builds a list of three lists, corresponding to the positions that belong in the offset-zero frame, the offset-one frame, and the offset-two frame.
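A minimal sketch of the frame separation, assuming positions are 0-based offsets from the start of the sequence, so that a position's frame is simply its remainder modulo 3:

```python
def separateFrames(positions):
    """Group positions into [offset-zero, offset-one, offset-two] lists."""
    frames = [[], [], []]
    for pos in positions:
        frames[pos % 3].append(pos)
    return frames


print(separateFrames([0, 5, 9, 13, 14]))  # [[0, 9], [13], [5, 14]]
```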

2. Let's assume, for the moment, that there are no introns in our reading frames.

a. Write a procedure, findStopCodon(seq, pos), that finds the stop codon for a sequence that has a start codon at pos. (You can return -1 if there is no stop codon after the position.)
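The search for a stop codon has to stay in the same frame as the start codon, which means stepping three nucleotides at a time. The sketch below assumes an upper-case DNA string (stop codons TAA, TAG, and TGA) and chooses to return the position of the first nucleotide of the stop codon; both choices are assumptions, not requirements of the assignment.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def findStopCodon(seq, pos):
    """Return the position of the first in-frame stop codon after pos, or -1."""
    # Skip the start codon itself, then examine one whole codon per step.
    for i in range(pos + 3, len(seq) - 2, 3):
        if seq[i:i + 3] in STOP_CODONS:
            return i
    return -1


print(findStopCodon("ATGAAATAGCC", 0))  # 6
# The TAG at position 4 is out of frame, so it is ignored.
print(findStopCodon("ATGATAGCCC", 0))   # -1
```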

b. Write a procedure, findSimpleORFsForward(seq), that finds all the open reading frames in seq, looking forward in seq. You can represent each open reading frame as a two-element list.
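Putting the start-codon and stop-codon searches together might look like the sketch below. It assumes an upper-case DNA string and represents each ORF as [start, stop], where stop is the position of the first nucleotide of the stop codon. Start codons with no in-frame stop codon are skipped. These representation choices are assumptions made for illustration.

```python
STOP_CODONS = ("TAA", "TAG", "TGA")


def findSimpleORFsForward(seq, start_codon="ATG"):
    """Return a list of [start, stop] pairs, one per start codon with a stop."""
    orfs = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != start_codon:
            continue
        # Walk forward codon by codon until the first in-frame stop codon.
        for stop in range(start + 3, len(seq) - 2, 3):
            if seq[stop:stop + 3] in STOP_CODONS:
                orfs.append([start, stop])
                break
    return orfs


print(findSimpleORFsForward("ATGAAATAGATGCCCTGA"))  # [[0, 6], [9, 15]]
```

Note that this version reports every start codon separately, so two start codons in the same frame can share one stop codon and yield nested ORFs.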

3. As you may have noted, many of the simple open reading frames returned by the previous procedure will be too small to code a protein.

a. Write a procedure, filterSmallORFs(list_of_orfs, min_size), that filters out all of the small reading frames.
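A sketch of the size filter, assuming each ORF is a [start, stop] pair of positions and that size means the number of nucleotides from the start position up to (but not including) the stop position. Other definitions of size, such as counting codons or including the stop codon, are equally defensible.

```python
def filterSmallORFs(list_of_orfs, min_size):
    """Keep only the ORFs whose length in nucleotides is at least min_size."""
    return [orf for orf in list_of_orfs if orf[1] - orf[0] >= min_size]


print(filterSmallORFs([[0, 6], [9, 315], [20, 320]], 300))
# [[9, 315], [20, 320]]
```

With a filter like this, counting the ORFs of a given minimum size is a matter of taking len of the result.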

b. How many ORFs in the pACYC184 sequence have a size of at least 300? What instruction did you use to determine this answer?

4. Our book notes that some codons appear rarely in some organisms. Hence, we might want to filter out ORFs that contain that codon (or too many copies of that codon).

a. Write a procedure, countCodons(codon, seq, start, stop), that counts how many copies of codon appear in seq, starting at position start and stopping before position stop. (Be careful: You don't want to count mis-aligned triplets as codons.)
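The warning about mis-aligned triplets is the reason a plain substring count is not enough. The sketch below steps through the region one codon at a time, assuming 0-based positions and counting only codons that lie entirely before stop.

```python
def countCodons(codon, seq, start, stop):
    """Count in-frame copies of codon in seq[start:stop]."""
    count = 0
    # Step by 3 from start so that only aligned triplets are compared.
    for i in range(start, stop - 2, 3):
        if seq[i:i + 3] == codon:
            count += 1
    return count


seq = "ATGAGAAGATAG"                  # codons: ATG AGA AGA TAG
print(countCodons("AGA", seq, 0, 9))  # 2
print(countCodons("GAA", seq, 0, 9))  # 0
print(seq.count("GAA"))               # 1, because it sees the mis-aligned GAA
```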

b. Write a procedure, filterUncommonCodons(seq, list_of_orfs, codon, percent), that filters out any ORFs that contain a percentage of codon greater than percent.
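One possible reading of this filter is sketched below. It assumes each ORF is a [start, stop] pair with stop at the first nucleotide of the stop codon, that percent is on a 0 to 100 scale, and that the percentage is the share of the ORF's codons (stop codon excluded) equal to codon. All three are assumptions about an underspecified problem.

```python
def filterUncommonCodons(seq, list_of_orfs, codon, percent):
    """Keep ORFs in which codon makes up at most percent of the codons."""
    kept = []
    for start, stop in list_of_orfs:
        total = (stop - start) // 3
        if total == 0:
            continue  # defensive: an empty region has no codons to judge
        matches = sum(1 for i in range(start, stop - 2, 3)
                      if seq[i:i + 3] == codon)
        if 100.0 * matches / total <= percent:
            kept.append([start, stop])
    return kept


seq = "ATGAGAAGATAG"  # codons: ATG AGA AGA TAG, so AGA is 2 of 3 codons
print(filterUncommonCodons(seq, [[0, 9]], "AGA", 50))  # []
print(filterUncommonCodons(seq, [[0, 9]], "AGA", 70))  # [[0, 9]]
```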

5. As St. Clair and Visick note, we gain additional confidence when an open reading frame is preceded by an appropriate promoter region. In a prokaryotic cell, this is a sequence similar to 5'-AGGAGG, six or seven nucleotides before the start codon: the Shine-Dalgarno sequence.

a. Write a procedure, sdp(seq, list_of_orfs), that finds only the ORFs in the list that have a Shine-Dalgarno prokaryotic promoter. You may choose any reasonable metric you want for "similar to".
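Since the metric is left open, the sketch below picks one simple option: count mismatches against AGGAGG and accept a candidate with at most one. It interprets "six or seven nucleotides before the start codon" as the gap between the end of the candidate and the start codon, and it treats the sequence as linear. The max_mismatches and gaps parameters, and these interpretations, are assumptions for illustration only.

```python
SHINE_DALGARNO = "AGGAGG"


def sdp(seq, list_of_orfs, max_mismatches=1, gaps=(6, 7)):
    """Keep ORFs preceded by a sequence similar to AGGAGG."""
    kept = []
    for orf in list_of_orfs:
        start = orf[0]
        for gap in gaps:
            begin = start - gap - len(SHINE_DALGARNO)
            if begin < 0:
                continue  # not enough upstream sequence for this gap
            candidate = seq[begin:begin + len(SHINE_DALGARNO)]
            mismatches = sum(1 for a, b in zip(candidate, SHINE_DALGARNO)
                             if a != b)
            if mismatches <= max_mismatches:
                kept.append(orf)
                break
    return kept


seq = "AGGAGG" + "CCCCCC" + "ATGAAATAG"
print(sdp(seq, [[12, 18]]))  # [[12, 18]]
```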

b. How many of the ORFs that you got in 3b have a promoter sequence? What code did you use to answer this question?

6. Once we've identified the open reading frames that are more likely to represent genes, it is useful to determine what sequences of amino acids they represent. For example, once we have those sequences, we can compare them to known sequences using BLAST or another technique.

a. Write a procedure, potentialProteins(seq, list_of_orfs), that builds a list of amino acid strings (that is, potential proteins), one for each ORF in list_of_orfs.
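Translation needs a codon table. The sketch below builds the standard genetic code from the usual compact layout (bases in the order T, C, A, G, amino acids as one-letter codes, * for stop). It assumes an upper-case DNA string and ORFs given as [start, stop] pairs with stop at the first nucleotide of the stop codon, so the stop codon itself is not translated. The names BASES, AMINO_ACIDS, and CODON_TABLE are illustrative.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon in the order TTT, TTC, TTA, ...
AMINO_ACIDS = ("FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def potentialProteins(seq, list_of_orfs):
    """Translate each ORF into a string of one-letter amino acid codes."""
    proteins = []
    for start, stop in list_of_orfs:
        codons = (seq[i:i + 3] for i in range(start, stop - 2, 3))
        proteins.append("".join(CODON_TABLE[codon] for codon in codons))
    return proteins


print(potentialProteins("ATGAAATAGATGCCCTGA", [[0, 6], [9, 15]]))
# ['MK', 'MP']
```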

b. Pick two of those potential proteins, BLAST them, and summarize your results.


This page was generated by Siteweaver on Sat Nov 26 19:05:56 2011.
The source to the page was last modified on Sat Nov 26 18:51:12 2011.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/ExBioPy/project-6.5.html.

Samuel A. Rebelsky
rebelsky@grinnell.edu