Bioinformatics Tutorial
by Eric Peterson
Basics
The concept of information plays a prominent role in biology. The most famous polymer in the world, DNA, can be thought of as just a simple sequence of 4 letters: A, T, C and G, and yet put a few billion or so of these letters together and we can have the code for making a tree, a frog or Jack Black. Modern initiatives like the Human Genome project have generated reams upon reams of genetic sequence data which must now be analyzed and understood. Bioinformatics is the field of science that attempts to tackle this enormous problem. Some examples of research questions in bioinformatics are: Which parts of the DNA sequence code for genes? How has evolution changed sequences over time and betweeen species? How do we determine the evolutionary relationship between living things based on their DNA?As discussed in Chapter 1 of "Physical Biology of the Cell", both DNA and proteins are linear polymers that can be thought of usefully in terms of their sequences: DNA as a sequence of nucleotides and proteins as a sequence of amino acids. Most of you are familiar with the alphabet of DNA; the less commonly known alphabet of amino acids is shown in this table.
One useful way to compare two sequences (DNA or protein) is to align them so that the conserved parts of the sequence (the parts that have stayed the same) line up with one another. Methods to do this quickly and accurately are an active area of research in bioinformatics, but most of them rely on one of two main schemes for comparing sequences: global alignment and local alignment. In a global alignment we seek to align every letter in the sequences, whereas in a local alignment we only seek the best matching subsequences. This can be illustrated by a simple artificial example (taken from the excellent book BLAST by Korf, Yandell and Bedell and published by O'Reilly). Suppose we want to align the sequences PELICAN and COELACANTH. In a global alignment we must use each letter from both sequences, but we allow ourselves to use gaps. One such global alignment is:
P-ELICAN--
COELICANTH
where a - character indicates a gap. With a local alignment we are only concerned with subsequences that match well; in this case the best matching subsequences are:
ELICAN
ELICAN
Local alignments can be executed more quickly than a global alignment and are well-adapted to finding the similarities in two sequences. A typical situation is that you have a sequence for your favorite protein, XYZ, and you would like a tool that will compare your sequence with an entire database of proteins and quickly find closely related sequences. That's where BLAST comes in.
Introducing BLAST
BLAST (Basic Local Alignment Search Tool) is built to do just exactly that: search a database quickly for proteins sequences that are closely related to a query sequence and return the results to the user. BLAST has become such a popular and useful tool that people talk about "blasting" sequences in just the same way that they talk about "googling" web searches. BLAST is the Google of biology.What makes BLAST so useful is that it is very fast. It uses some computational shortcuts to speed up searches so that even if the database being queried contains millions of sequences you can get results back in a matter of a few seconds. However there are at least a couple of important concepts to keep in mind when using this wonderful tool. The first is that BLAST must have a system for scoring pairings of letters in the two sequences being compared; these scores are put into large matrices (20 by 20 for proteins) which BLAST then uses to calculate the score for an alignment. These matrices are called substitution or scoring matrices and they have been called "evolution in a nutshell" because they tell us nature's affinity for substituting one amino acid in favor of another as proteins evolve. Different methods have been applied to generate scoring matrices and two of the most popular are the PAM and BLOSUM matrices. For most purposes the default matrix used by BLAST, BLOSUM62, does a fine job of finding related proteins. But you should be aware that the numbers in these matrices were arrived at by heuristic methods and if you are going to use programs like BLAST regularly it would behoove you to learn how these matrices are generated so that you know their limitations.
The second key concept to understand is the statistics of sequence alignment. There is always a possibility that two proteins completely unrelated by evolution will happen to have subsequences that align well by chance. Understanding the likelihood that a particular alignment reflects random chance rather than a genuine evolutionary relationship is a major key to correctly interpreting BLAST's results. Of course it always pays to know the biology of the proteins that are returned in BLAST's results as well. BLAST quantifies the probability of getting a result by chance with a number called an "E value" or "Expect value". This number, roughly speaking, is the number of alignments with a given score or better that would be expected by pure random chance given the size of the database searched. If you have a hit with an expect value of 1e-10, there is a good bet that the query sequence and the hit sequence are related to one another. If the expect value is 10 or 100, then that hit can't be distinguished from matches due to random chance and BLAST is not telling you anything useful.
The BLAST homepage is found at:
http://www.ncbi.nlm.nih.gov/blast
There are many resources listed on the BLAST page; besides the web interface for submitting queries there are links to download genomic data and one can download the entire BLAST package (source and binaries). BLAST is really a whole suite of programs that can compare protein and nucleotide sequences in various ways.Finally you should know that while BLAST is popular and efficient it is by no means the only game in town... other packages include WU-BLAST, FASTA, amd ClustalW to name a few. Each of them has their individual strengths and weaknesses.
Sequence data resources
There are many places where protein sequence data can be retrieved; here are a few useful sites:Using BLAST
Example: gp120
HIV virions wrap themselves with a lipid bilayer membrane as they bud off from infected cells; in this viral membrane envelope are "spikes" composed of two different proteins (actually glycoproteins), gp41 and gp120. The gp denotes glycoprotein, and the number indicates their molecular weight (in kilodaltons). The function of gp120 is to help HIV avoid detection by the immune system until it encounters a CD4 receptor found on many kinds of cells in the immune system, where it can bind and infect again.In this exercise we will grab a sequence for gp120 and blast it to find related proteins. Use the LANL HIV Sequence Database site to find a sequence for gp120; the complete gp120 molecule has about 500 amino acids so a complete DNA sequence will have about 1500 bases.
Open a new window with the BLAST website and select "Translated query vs. protein database" under the "Translated" heading. Copy and paste your ~1500 nucleotide sequence of gp120 into the top box labeled "Search" and make sure that you have selected "TRANSLATED query - PROTEIN database [blastx]". For the database choose "pdb"; this database has all the sequences of proteins with known 3d structures in the Protein Data Bank (PDB). What we are doing is having BLAST translate the gp120 nucleotide sequence into an amino acid sequence and then compare it with amino acid sequences in the PDB. Finally, push the "BLAST!" button and then the "FORMAT!" button on the new page that appears. A new window will pop up which will give you updates on your query until it is finished and the final results are displayed.
How does BLAST determine the ranking of the results from your search? For your top search result, identify the percentage of sequence similarity in the alignment with your sequence. Explain how this number is determined. What percentage of the full length of the gp120 sequence is conserved in this alignment? You will notice that BLAST has used gaps in many of your alignments... what is the evolutionary significance of these gaps?
For your top hit, how many alignments with this high of a score or better would have been expected by chance? Looking at your ranked list of results and using only the "E value" which hits do you expect to (possibly) have a genuine evolutionary relationship with your gp120 sequence and why? What further information beyond the E value would you gather about the sequences returned in your BLAST hit list to better determine the evolutionary relationship of these proteins?
Print out your BLAST results and turn these in with your homework.
[an error occurred while processing this directive]