Exploring Bioinformatics with Python

Guided Project 2.5: Genetic Disease

One goal of our programming in this chapter is to do a simple version of sequence matching: Giving a long sequence (most typically, a known sequence or collection of sequences) and a shorter sequence (most typically, created by sequencing something that has been found to be biologically interesting), find the best alignment of the shorter sequence to the longer sequence.

But wait! We're going to add a complicating factor. We want to support alignment even if there's a deletion. That is, we acknowledge that in creating sequences (e.g., in transcribing DNA), sometimes parts of the template sequence get skipped. In practice, there can be many such deletions (and even some insertions). For now, we want to see what happens with no more than one deletion, but a deletion that can be arbitrarily long.

Looking Ahead

While it is tempting to step right in and try to solve that problem, we'll find it helpful to build some utility procedures along the way, as well as to learn some new programming techniques. So, what are some things we should do on our way to solving this simple version of sequence matching?

In terms of utility procedures, we'll build:

There will be others, too. You'll find it useful to put all of the procedures together into a single file. (You'll want to use many of them in future activities.)

Preparation

To get started,

a. Review the list of activities you did in the Web exploration.

b. Create a new file that will serve as your procedure library for this project (and, likely, for future projects).

c. Add import string to the top of that library, so that you always get the string library when you load it.

d. Load that file into the Python shell.

Exercise 1: Simple Problems

a. Using cut-and-paste, create a string containing the sequence of the HBB gene.

>>> hbb_dna = 'ACATTT...GC'

b. Determine the length of that string. Does it match your answer from the Web exploration?

c. Convert that string to an mRNA string by translating T's to U's. Recall that the string.replace procedure does that conversion.

d. Determine the position of the first start codon in that string. Does it match the position from the GenBank record? (Recall that Python starts its positions at 0, rather than 1.)

e. The coding sequence runs from positions 51 to 494 in the string. Extract that subsequence using the Python sublist/substring operation.

Exercise 2: A Few Utilities

Let's start our library with a few utilities.

a. In order to make our code robust (and, at times, to make it polymorphic, in the computer science sense), we will often need to check the type of a value. Add is_list and is_string to your library.

b. I find it helpful to give consistent error messages when things go wrong. Hence, you will find that some of the code that I provide for you calls a procedure called parameter_error. Add parameter_error to your library.

c. Although it is straightforward to convert a coding sequence to mRNA, it may still be useful to create a procedure to do so for clarity. That is, a client may find it clearer to see

>>> seq = transcribe_coding_strand(strand)

rather than

>>> seq = strand.replace('T','U')

In addition, as we look at richer DNA sequence representations, we may find that we want to do more than just translate T's to U's.

Write a procedure, transcribe_coding_strand, that does just that: Transcribes a coding strand of DNA (in 5' to 3' order), represented as a string, to a corresponding strand of mRNA (again, in 5' to 3' order), also represented as a string.

d. We've seen that it's useful to work with sequences as both lists and strings. Let's think about making transcribe_coding_strand work with both. One trick we'll use it to create an extra variable, sequence, that we know is a string. Here's some pseudocode mixed Python.

    if strand is a string then
        sequence = strand
    else if strand is a sequence then
        sequence = the conversion of strand to a string
    else
         report an error; the sequence isn't a type we expect
    result = sequence.replace('T','U')
    if strand is a list then
         return the conversion of result to a list
    else
         return result

Turn that psuedocode into Python code.

e. Compare your answer to our answer.

Exercise 3: From Codon to Amino Acid

One of the early Web exploration questions asks you to find the first few amino acids in the coding sequence. To do so, it would be helpful to have a procedure that does the conversion for us.

In Python, the most convenient technique for handling such conversion is the dictionary data structure.

a. Read the short reference page on dictionaries in Python.

b. Add codons, a dictionary of codons, to your library.

c. Try looking up a few codons in that dictionary. Try looking up a few non-codons.

d. Add translate_codon, a wrapper for that dictionary, to your library.

e. Try a few examples to verify that it seems to work correctly.

f. Read through the code for translate_codon and explain what it seems to be doing.

Exercise 4: Getting Full Amino Acid Names

One disadvantage of translate_codon is that it only gives us the one-letter code for an amino acid. We may find it more convenient to use the full name.

a. Create a table aa_name_dict, that can be used to translate one-letter codes to the full amino acid name. The first few lines of your code will be something like the following:

aa_name_dict = {}
aa_name_dict['A'] = 'Alanine' 
aa_name_dict['R'] = 'Arginine' 
aa_name_dict['N'] = 'Asparagine'  

b. Test the dictionary on a few values.

c. Here's a procedure, aa_name, that provides an appropriate wrapper for that table. Add it to your library.

d. amino_acids is a string that lists all of the one-letter codes for amino acids. Add it to your library. Then, test your table and the aa_name procedure with

>>> map(aa_name, amino_acids)

Exercise 5: Reading FASTA Files

As you've probably noted, cutting and pasting is not the easiest way to work with DNA data. We'd rather have the data in files, and use procedures to read and write those files.

a. Read the short reference page on reading data from files in Python.

b. Read the NCBI reference on FASTA format.

c. In psuedocode, English, or approximate Python, explain the steps you would do to read a sequence from a FASTA file using the Python file input operations.

d. Add read_fasta to your library.

e. Look through the code for read_fasta and determine whether it meets your specification from step c.

f. Try read_fasta on the FASTA files you downloaded as part of the Web exploration.

g. Extract just the coding sequence of the wild type HBB DNA. (Recall that sequence is at positions 51..295, which would be expressed as 50:295 in Python.)

Exercise 6: Writing FASTA Files

a. Read the short reference page on writing data to files in Python.

b. Review the NCBI reference on FASTA format.

c. In psuedocode, English, or approximate Python, explain the steps you would do to write a sequence to a FASTA file using the Python file input operations.

d. Add write_fasta to your library.

e. Look through the code for write_fasta and determine whether it meets your specification from step c.

f. Try write_fasta on the FASTA files you downloaded as part of the Web exploration.

Exercise 7: Reading other DNA Files

As you saw in the Web Exploration, sometimes people choose to represent DNA in other file formats, most typically by grouping the DNA into units of length ten and prefixing each line of DNA with the offset, as in

1   ATGGTGCATC TGACTCCTGT GGAGAAGTCT GCCGTTACTGC CCTGTGGGGC
51  AAGGTGAACG TGGATGAAGT TGGTGGTGAG GCCCTGGGCAG GCTGCTGGTG
101 GTCTACCCTT GGACCCAGAG GTTCTTTGAG TCCTTTGGGGA TCTGTCCACT

a. Suppose we've loaded the file into a string (you should know how to do that). Explain the steps you'd use to turn this string into the DNA sequence (also represented as a string).

b. Suppose we want to support a more general strategy: The sequence may be DNA, RNA, Protein, or other sequence of letters. Our goal is therefore to extract just the letters in some code (ACGT for simplified DNA, ACGU for simplified RNA, ARNDCEQGHILKMFPSTWYV* for amino acids). Develop an algorithm to turn the string and alphabet into a valid sequence. Write your algorithm in pseudo-Python (that is, it doesn't have to be working Python, but it should be close to the form of the Python you'd expect to write.)

c. Add cleanup_sequence to your library.

d. Compare the algorithm used in cleanup_sequence to your own algorithm.

e. Using the Python file I/O operations and cleanup_sequence, read the sequence data from the sickle cell DNA file.

Exercise 8: Comparing DNA Sequences

We now have the two DNA sequences, one extracted from the FASTA wild type file and one from the other format sickle cell file. It's time to compare them. But how? Well, we need to write a comparison algorithm.

a. In approximate Python (that is, your code need not work, but it should look like Python), develop an algorithm that compares two sequences and returns a list of all the positions in which the two sequences do not match. If one sequence is shorter, you should only match up to the length of the shorter sequence.

b. Add sequence_compare_1 to your library.

c. Read through the code to sequence_compare_1 to ensure that you understand the techniques used.

c. Try sequence_compare_1 on a few hand-coded examples.

d. Try sequence_compare_1 on the wild type and sickle cell sequences.

Exercise 9: From RNA to Protein

As we noted in the Web Exploration, it is often more important to compare resulting proteins than to compare the underlying DNA. After all, different sequences of DNA can result in the same amino acid sequences. (In particular, there are a number of cases in which the last nucleotide in a codon does not affect which amino acid is produced.) So, let's consider what tools we have for this conversion.

So, our primary goal is to break up a large string into triplets, convert each triplet to the amino acid (using translate_codon), and join them all together.

a. In approximate Python (that is, your code need not work, but it should look like Python), develop an algorithm that converts an RNA sequence (represented as a string) to the corresponding amino acid sequence (also represented as a string).

b. Add translate_rna to your library.

c. Read through the code for translate_rna and explain, in English, what the purpose of each step is.

d. Try translate_rna on your two coding sequences (after converting them to RNA, of course).

e. Use sequence_compare_1 to compare those two amino acid sequences. Do you get the same results as in the Web exploration?

f. Do exercise 14 on p. 43.

Execise 10: Improved Translation

a. Update translate_rna so that it does not start coding until it encounters a start codon.

b. Test your updated code on some simple strings, such as 'AAAGAUGGCACCAUAA'.

c. Update translate_rna so that it stops coding at the stop codons.

d. Test your updated code on some simple strings.

e. Test your updated code on the full wild-type HBB sequence.

Exercise 11: Alignment

This exercise is really the first step of the on-your-own project. It is included here for those with extra time (and to give advice for everyone).

In the on-your-own project, you'll be looking at aligning a short sequence to a longer sequence. That alignment is complicated by a possible deletion in the longer sequence. However, it turns out that it is helpful to solve the simpler problem of aligning a short sequence to a longer sequence, assuming no deletions.

a. Write a procedure, alignment_value(long_sequence, i, short_sequence) that finds the value of the alignment of short_sequence to long_sequence, starting at position i in long_sequence. At this point, the value of an alignment is simply the number of mismatches. (Smaller values are better.)

b. Write a procedure, best_alignment(long_sequence, i, short_sequence), that finds the starting position of in long_sequence of the best alignment of short_sequence to long_sequence. You should only start looking for the best alignment starting at position i.

History

Fall 2009 [Samuel A. Rebelsky]

Monday, 5 September 2011 [Samuel A. Rebelsky]


This page was generated by Siteweaver on Thu Sep 8 12:57:46 2011.
The source to the page was last modified on Thu Sep 8 12:57:45 2011.
This page may be found at http://www.cs.grinnell.edu/~rebelsky/ExBioPy/project-2.5.html.

You may wish to validate this page's HTML

Samuel A. Rebelsky
rebelsky@grinnell.edu