Project 7.5: Exploring Structure Prediction with the Chou-Fasman Algorithm

Note: You will be expected to turn in the final version of your algorithm for assessment.

0. Make your own copies of the following files:

ChouFasman.py, which includes preliminary code for this assignment.
read_fasta.py, the procedure we explored previously for reading FASTA data.
NP_005408-fasta.txt, the FASTA file for human SRC.
CF_test.txt, the test data provided by our authors.

1. Your first goal is to understand what parts of the Chou-Fasman algorithm are already implemented, and how you can use them.

What are the primary steps in the ChouFasman procedure?
Which of those steps look like they still have to be implemented?
What procedures still need to be implemented?
What does the for aa in aa_names loop at the start of the code do? (Make sure to look at the body.)
What is the relationship between CF_find_alpha, CF_extend_alpha, and CF_good_alpha?

You may want to call each of the procedures to better understand its purpose. For example, you can call CF_find_alpha or ChouFasman on the sequence you get from NP_005408.

>>> ChouFasman(read_fasta('NP_005408-fasta.txt')[1])

2. Develop a set of test sequences that you think will be useful. Your set should include:

The Human SRC sequence we explored with PSIPRED.
The sample sequence our book provides.
A few other short sequences (twenty amino acids or so) that you have designed to test particular aspects of Chou-Fasman (e.g., what happens when a likely alpha helix overlaps a likely beta sheet).
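As one way to organize the short, hand-designed sequences, here is a minimal sketch. The names and sequences are hypothetical. They assume the commonly published Chou-Fasman propensities, in which E, A, and M favor helices, V, I, and Y favor strands, L, F, and M score above 100 for both, and G, P, and N favor turns. Check each sequence against the table in our book before relying on it.

```python
# Hypothetical test sequences; adjust them to match the propensity table in our book.
TEST_SEQUENCES = {
    "helix_rich": "EAEMA" * 4,               # strong helix formers throughout
    "strand_rich": "VIYVI" * 4,              # strong strand formers throughout
    "ambiguous": "LIFVM" * 4,                # residues that favor both helix and strand
    "breaker_heavy": "GPNG" * 5,             # residues that favor turns
    "helix_at_start": "EAEMAE" + "GP" * 7,   # region touches the first position
    "helix_at_end": "GP" * 7 + "EAEMAE",     # region touches the last position
    "shorter_than_window": "EAE",            # too short for any window
}

STANDARD_CODES = set("ACDEFGHIKLMNPQRSTVWY")


def invalid_sequences(sequences):
    """Return the names of sequences that contain non-standard one-letter codes."""
    return [name for name, seq in sequences.items()
            if not set(seq) <= STANDARD_CODES]
```

Running each of these through your procedures before you try the full SRC sequence makes it easier to tell which step misbehaves.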

3. The code to extend a potential alpha helix (step 1b on p. 218) is not yet written. Write that code.

Note: As our book notes, one difficulty with extending a region is that you may hit the beginning or end of the sequence. Be careful about those situations.
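The boundary problem can be sketched as follows. This is not the code from ChouFasman.py. It assumes a half-open range [start, end), a four-residue window, and a cutoff of 100, which is one common reading of the extension step. The names window_average and extend_region and the two-entry table are hypothetical.

```python
def window_average(seq, start, table, width=4):
    """Average propensity of seq[start:start+width]; the slice is clipped at the end."""
    window = seq[start:start + width]
    return sum(table[aa] for aa in window) / len(window)


def extend_region(seq, start, end, table, width=4, cutoff=100):
    """Extend the half-open range [start, end) in both directions.

    Each step moves one edge outward by one residue, as long as the window
    of `width` residues at that edge still averages at least `cutoff`.
    The `start > 0` and `end < len(seq)` checks stop at the sequence ends.
    """
    while start > 0 and window_average(seq, start - 1, table, width) >= cutoff:
        start -= 1
    while end < len(seq):
        left = max(0, end - width + 1)   # never index before position 0
        if window_average(seq, left, table, width) < cutoff:
            break
        end += 1
    return start, end


# Illustrative values only; use the full table from ChouFasman.py.
TOY_P_ALPHA = {"A": 142, "G": 57}
print(extend_region("GGGGAAAAAAAAGGGG", 4, 10, TOY_P_ALPHA))
print(extend_region("AAAAAAAA", 1, 7, TOY_P_ALPHA))
```

In the second call the region grows until it reaches both ends of the sequence and then stops, rather than indexing past them.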

4. The code for checking whether a range is likely to be an alpha helix (step 1c on p. 218) is incomplete. Complete that code.

5. As written, CF_find_alpha does some unnecessary checking, and therefore finds duplicate regions. In particular, once it has identified a potential alpha helix in the range (X,Y), it starts again near X+1. However, it need not look for the next alpha helix before position Y+1. Update your code so that the search is more efficient.
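One way to picture the more efficient search is a loop that jumps past each region it reports. The sketch below is not CF_find_alpha itself. The name find_regions and its is_nucleus and extend parameters are hypothetical stand-ins for whatever your code uses, and ranges are half-open, so an exclusive end of Y+1 corresponds to the inclusive range (X,Y).

```python
def find_regions(seq, is_nucleus, extend, window=6):
    """Scan seq for regions, resuming after the end of each region found.

    is_nucleus(seq, start, end) -> bool   decides whether a window can seed a region
    extend(seq, start, end) -> (s, e)     grows a seeded region in both directions
    """
    regions = []
    i = 0
    while i + window <= len(seq):
        if is_nucleus(seq, i, i + window):
            start, end = extend(seq, i, i + window)
            regions.append((start, end))
            i = max(end, i + 1)   # skip ahead; max() guarantees progress
        else:
            i += 1
    return regions


# Toy helpers for demonstration only: a nucleus is six A's in a row,
# and a region grows for as long as the neighboring residue is A.
def toy_nucleus(seq, start, end):
    return set(seq[start:end]) == {"A"}


def toy_extend(seq, start, end):
    while start > 0 and seq[start - 1] == "A":
        start -= 1
    while end < len(seq) and seq[end] == "A":
        end += 1
    return start, end


print(find_regions("GGAAAAAAAAGGGAAAAAAG", toy_nucleus, toy_extend))
```

Note that a later region may still extend leftward into an earlier one, so you may want to decide how to treat regions that touch or overlap.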

6. There is not yet a procedure to find beta strands. Implement that procedure. (You will find that it is very similar to the one for finding alpha helices.)

7. There is not yet a procedure to find beta turns. Implement that procedure. (You will find that this procedure is a bit different: it does not expand the region, the contribution of an amino acid to turn probability depends on its position in the region, and turns are just one unit long.)
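The position-dependent calculation might look like the sketch below. It assumes the commonly published form of the rule, which may differ in detail from our book: for each four-residue window, multiply the positional frequencies, require the product to exceed 0.000075, and require the average turn propensity to exceed both 100 and the average helix and strand propensities. The name find_turns and all four table parameters are hypothetical.

```python
from math import prod


def find_turns(seq, pos_freq, p_turn, p_alpha, p_beta, threshold=0.000075):
    """Return the start index of every four-residue window predicted as a turn.

    pos_freq[aa] is a 4-tuple: the frequency of aa at each of the four turn positions.
    p_turn, p_alpha, and p_beta map each amino acid to its propensity.
    """
    turns = []
    for i in range(len(seq) - 3):
        window = seq[i:i + 4]
        # Each residue contributes the frequency for its own position in the window.
        bend = prod(pos_freq[aa][k] for k, aa in enumerate(window))
        avg_t = sum(p_turn[aa] for aa in window) / 4
        avg_a = sum(p_alpha[aa] for aa in window) / 4
        avg_b = sum(p_beta[aa] for aa in window) / 4
        if bend > threshold and avg_t > 100 and avg_t > avg_a and avg_t > avg_b:
            turns.append(i)
    return turns
```

Because no extension step follows, each reported position stands on its own, unlike the helix and strand searches.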

8. The ChouFasman procedure currently fails to do step 4 of the algorithm (finding and handling overlaps, p. 219). Implement that portion of the algorithm.
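As a starting point for overlap handling, here is a minimal sketch. It assumes that helix and strand regions are lists of half-open (start, end) ranges and that an overlap goes to whichever structure has the larger summed propensity over the overlapping residues. Check that rule against step 4 in our book. The name resolve_overlaps and the H, E, and - labels are hypothetical.

```python
def resolve_overlaps(seq, helices, strands, p_alpha, p_beta):
    """Label each residue H (helix), E (strand), or - (neither)."""
    labels = ["-"] * len(seq)
    for start, end in helices:
        for i in range(start, end):
            labels[i] = "H"
    for start, end in strands:
        for i in range(start, end):
            # "?" marks a residue claimed by both a helix and a strand.
            labels[i] = "E" if labels[i] in "-E" else "?"

    # Resolve each maximal run of "?" as a unit.
    i = 0
    while i < len(labels):
        if labels[i] != "?":
            i += 1
            continue
        j = i
        while j < len(labels) and labels[j] == "?":
            j += 1
        sum_a = sum(p_alpha[aa] for aa in seq[i:j])
        sum_b = sum(p_beta[aa] for aa in seq[i:j])
        winner = "H" if sum_a > sum_b else "E"   # ties go to strand in this sketch
        labels[i:j] = [winner] * (j - i)
        i = j
    return "".join(labels)
```

A per-residue label string like this is also convenient to line up against PSIPRED output or a known structure.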

9. Implement any other pieces you consider necessary for the full algorithm.

Part 2: Experimentally Analyze the Chou-Fasman Algorithm

Pick three proteins for which there is a known structure. For each protein, run your version of the Chou-Fasman algorithm and analyze how well (or poorly) it predicted the protein's secondary structure.

For example, here are some basic analyses related to alpha helices.

What percentage of the alpha helices did your algorithm identify, at least partially?
How many additional alpha helices did your algorithm identify?
What percentage of the amino acids in alpha helices did your algorithm identify?
What percentage of the amino acids not in alpha helices did your algorithm indicate were in alpha helices?
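The four analyses above can be computed along the following lines. This sketch assumes that both the known and the predicted helices are lists of half-open (start, end) ranges over the same sequence. The name helix_metrics and the dictionary keys are hypothetical.

```python
def residues(segments):
    """Set of all positions covered by a list of half-open (start, end) ranges."""
    return {i for start, end in segments for i in range(start, end)}


def overlaps(a, b):
    """True if half-open ranges a and b share at least one position."""
    return a[0] < b[1] and b[0] < a[1]


def helix_metrics(known, predicted, length):
    """Compare predicted helices with known helices for a sequence of `length` residues."""
    known_res, pred_res = residues(known), residues(predicted)
    found = sum(any(overlaps(k, p) for p in predicted) for k in known)
    extra = sum(not any(overlaps(p, k) for k in known) for p in predicted)
    non_helix = length - len(known_res)
    return {
        # Known helices identified at least partially.
        "helices_found_pct": 100 * found / len(known) if known else 0.0,
        # Predicted helices that touch no known helix.
        "extra_helices": extra,
        # Helix residues that were predicted as helix.
        "helix_residues_found_pct":
            100 * len(known_res & pred_res) / len(known_res) if known_res else 0.0,
        # Non-helix residues that were predicted as helix.
        "false_positive_residue_pct":
            100 * len(pred_res - known_res) / non_helix if non_helix else 0.0,
    }


print(helix_metrics([(0, 10), (20, 30)], [(5, 12), (40, 45)], 50))
```

The same function works for beta strands if you pass strand ranges instead.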

You might also explore how well your algorithm did as compared to PSIPRED.

Part 3: Improve the Algorithm

Explore the literature about Chou-Fasman and describe three possible improvements to the algorithm. (You need not implement these improvements. However, you should describe them at a level that your colleagues in the class could understand.)

Optionally: Implement one of these improvements and analyze the effect it has on the algorithm (does it really make it better?).

What to Turn In

Your finished Chou-Fasman implementation from the first part of the project (steps 0-9).
The FASTA files for the three proteins you chose.
Links to the protein structures for those three proteins.
Your experimental analysis of your original algorithm.
Your description of possible improvements.