[an error occurred while processing this directive]

by Eric Peterson and Jen Barnet

Basics

Recent research like the Human Genome project has generated reams upon reams of genetic sequence data which must now be analyzed and understood. Some of the questions that the field of bioinformatics attempts to answer are: How is genetic information transmitted? Which parts of the DNA sequence code for genes? How has evolution changed sequences over time and betweeen species?

A common problem in bioinformatics is the problem of protein sequence alignment. Many, many sequences of amino acids of proteins have been determined either from DNA, RNA, crystal structures or other methods and can be obtained freely from online databases. Algorithms have been developed which can find the best possible alignment of two sequences when given a scoring system while still allowing for gaps which arise from the constant snipping and insertion that occurs in DNA.

There are two schemes for aligning sequences: global and local. In a global alignment we seek to align all members of the sequences, whereas in a local alignment we only seek the best matching subsequences. This can be illustrated by a simple example (taken from BLAST by Korf, Yandell and Bedell). Suppose we want to align the sequences PELICAN and COELACANTH. In a global alignment we must use each letter from both sequences, but we allow ourselves to use gaps. One such global alignment is:

P-ELICAN--
COELICANTH

where a - character indicates a gap. With a local alignment we are only concerned with subsequences that match well; in this case the best matching subsequences are:

ELICAN
ELICAN

A typical situation is that you have a sequence of your favorite protein, XYZ, and you would like a tool that will compare its alignment with an entire database of proteins and quickly tell you closely related hits. That's where BLAST comes in.

Introducing BLAST

BLAST (Basic Local Alignment Search Tool) is built to do just exactly that: search a database quickly for proteins sequences that are closely related to a query sequence. BLAST uses some heuristic tricks to speed up your search so that even if the database contains millions of protein sequences you can get results back in a matter of minutes.

When using BLAST there are (at least) a couple of important considerations to keep in mind. The first is that BLAST must have a system for scoring pairings of amino acids; these scores are put into large matrices (20 by 20) which BLAST then uses to calculate an alignment. These matrices are called substitution or scoring matrices and they have been called "evolution in a nutshell" because they tell us nature's affinity for changing one amino acid in favor of another as proteins evolve. The matrices are symmetric but even so there are 210 parameters to be decided on--if this sounds like a lot, it is. Different methods have been applied to generate scoring matrices and two of the most popular are the PAM and BLOSUM matrices. For most purposes, the BLOSUM62 matrix does a fine job of finding related proteins and it is the default matrix in most alignment programs, including BLAST. But you should be aware that these matrices are based on heuristics and if you are going to use programs like BLAST a lot you should learn more about how these matrices were generated so that you use them properly.

The second key concept to understand is the statistics of sequence alignment. There is always a chance that two proteins completely unrelated by evolution will happen to have parts that align well by chance. Understanding the likelihood that a particular alignment could happen by chance is one key to interpreting BLAST's results. Another key is common sense: it always pays to know something about the proteins that are returned in BLAST's results so that those results can be properly interpreted. BLAST quantifies the chances of getting a result by chance with a number called an "E-value" or "Expect value". This number, roughly speaking, is the number of alignments of this quality that would be expected given the size of the database searched. If you have a hit with an expect value of 1e-10, there is a good bet that the query sequence and the hit sequence are related by evolution. If the expect value is 10 or 100, then it is just statistical noise and BLAST is probably not telling you anything useful (although, again, common sense has the last word).

The BLAST homepage is found at:

http://www.ncbi.nlm.nih.gov/blast

There are many resources listed on the BLAST page; besides the web interface for submitting queries there are links to download genomic data and one can download the entire BLAST package (source and binaries). BLAST is really a whole suite of programs that can compare protein and nucleotide sequences in various ways. We will try out a couple of these in the problems below.

Finally you should know that while BLAST is popular and efficient it is by no means the only game in town... other packages include WU-BLAST, FASTA, amd ClustalW to name a few. Each of them has their individual strengths and weaknesses.

Sequence data resources

There are many places where protein sequence data can be retrieved; here are a few useful sites:

Using BLAST

Example: tubulin

In eukaryotic cells the microtubule skeleton works with actin and other filaments to provide structure within the cell required for basic cellular processes such as segregation of genetic material, intracellular transport, positioning of cell organelles, and intracellular movement of proteins and vesicles. Microtubules are long, hollow cylinders with an outer diameter just under 30 nm and wall thickness of 7 nm; the length varies since the subunits can add on to one another like Lego blocks. Elimination of microtubules within the cell causes cell death.

Tubulin is the main component of microtubules, and exists as heterodimers formed by an α- and β-tubulin. This creates a polarity on the resulting subunit, which is used to help the overall microtubule grow and shrink in length.

To get an idea of the necessity of tubulin in living cells, we'll consider a tubulin gene that is conserved across multiple species.

The gene of interest is CG3401 from Drosophila (its full name is "β-Tubulin at 60D"). On the Entrez Pubmed website, select search method as Nucleotide and enter CG3401. A selection of sequences associated with CG3401 will appear. The one we want is the 4th on the list, which codes for a CG3401 mRNA transcript; click on that link and scroll to the bottom of the page that appears. Highlight all 2521 nucleotides and select "Copy" from the Edit menu. Now go to the BLAST website and select Quickly Search for Divergent Sequences under the Nucleotide heading. Paste the sequence of CG3401 into the top box labeled "Search", push the BLAST! button and then the FORMAT! button on the new page that appears. A new window will pop up which will give you updates on your query until it is finished and the final results are displayed. Congratulations! You've made your first BLAST report. The list that appears contains all the sequences similar to CG3401 that diverge in different species.

The graph shows which segments of the gene contain sequence similarity with the original gene. Looking down the list of genes, you can see that CG3401 is similar to Drosophila genes that code for other forms of β-tubulin. Go down the list to Canis familiaris. The BLAST search shows that the Drosophila gene is most similar to a scientifically predicted dog gene that codes for tubulin.

Name 3 species that show sequence similarity to the β-tubulin gene found in Drosophila. For each of these, identify the percentage of sequence similarity present (hint: scroll downwards). Explain how this number is determined. What percentage of the full length of the CG3401 gene is conserved across the species you selected? Explain why the segment from nucleotides 500 to 1500 for β-tubulin would be conserved in these different species.

Make BLAST yours

Besides submitting BLAST queries on the web it is also possible to download the software and do searches on your own computer. Go to the BLAST page and download the BLAST package for your flavor of OS (check the left sidebar under "Software"). You want the "blast" package, not "netblast" or "wwwblast". We are going to use the bl2seq program (short for "BLAST two sequences") to generate the alignment. (Note: if you have difficulty getting this to run on your local machine, there is a link on the BLAST homepage under "Special" to "Align two sequences" with bl2seq on the web which you can use as well.)

Next go to the UniProt site and download sequences for the Pax6 gene (found in many organisms including humans) and the Eyeless gene from Drosophila which are both involved in the development of the eye. You will want to download them in FASTA format. Open the FASTA files for Pax6 and Eyeless in any text editor and take a look at the sequence data for the two proteins. As in the tubulin example, the first line begins with ">" followed by information about the source of the data, followed on the next line by the protein sequence in the single letter format. Now use bl2seq to align the two protein sequences; the command to use will look something like:

bl2seq -i Pax6.fasta -j eyeless.fasta -p blastp

The option -p blastp tells BLAST that we are aligning two protein sequences. BLAST should return a few of the highest scoring pairs of subsequences. Look over your results to get a feel for what is going on and refer to the BLAST documentation if needed, then answer the following questions:

For your top high scoring pair (or HSP as it is commonly referred to) how many hits with this high of a score would have been expected by chance (according your our BLAST output)?

You will notice that BLAST has used gaps in many of your HSPs... what is the evolutionary significance of these gaps?

Extra credit: In some of your HSPs you may notice X's in the Query or Subject sequences. What do these mean and what are their significance?

Print out your BLAST results for the top 3 HSPs in your Pax6/Eyeless alignment and turn these in with your homework.

[an error occurred while processing this directive]