Bioinformatics and Homology Modeling:
A Student-Tested Tutorial for Beginners

Exploring Human Visual Pigments

  1. Introduction
  2. Bioinformatics
  3. Homology Modeling
  4. Test Your Bioinformatics Skills

Last revision: 2008/03/16, to here.

NEW PLAN: I am currently (2008-09-20) making an entirely new version of this tutorial. Watch for it to appear here in the next few weeks. In the meantime, the current version is workable, for the most part, but somewhat messy.

Some of the bioinformatics tools and servers used in this tutorial have changed a lot since the last revision. I think I have figured out how to do the troublesome steps using other servers, but the newest instructions are sketchy (you will find warnings at such sections). Sections in green type are obsolete; I am replacing them with the material that precedes them. The obsolete green sections will remain until I can incorporate the useful explanations from them into the new instructions.

Thanks to biochemistry students at USM, as well as to many other people around the world, who have used this tutorial and made helpful suggestions.

Gale Rhodes
Chemistry Department
University of Southern Maine
Contact Information

If you would like to use this tutorial in your teaching,
please ask permission. Here's why.

Do you know about the Grameen Foundation?

 


Introduction

This tutorial allows you to explore opsins -- the proteins that catch light for our eyes -- and the genes that code for opsins. But the real subject of this exercise is bioinformatics -- the use of computers to search for, explore, and use information about genes, nucleic acids, and proteins. While learning about the human opsins, you will use some of today's most powerful bioinformatics tools. You can follow up this tutorial with a study of opsins from other organisms, or by exploring any class of biomolecules that interest you.

I assume that you are conversant with biochemistry and molecular biology. If you see unfamiliar terms pertaining to the genes, mRNAs, and proteins used as examples here, break out your biochemistry text, head for the index, and review, review, review.

For more information about each database or tool, go to its home page and read, read, read.

If you are a student in my biochemistry course (CHY 463 at the University of Southern Maine), you will find that this tutorial follows closely my classroom demonstration of bioinformatics tools applied to finding desired information in databases.

History

This web page was originally composed of somewhat sketchy procedures that I pasted in here as I devised them by playing* with bioinformatics tools on the web. Each year my biochemistry students have run through the tutorial, which has led to many improvements. Long about March each year, I get it ready for them by trying it out to see what has changed in this fast-changing world. Take a look at the "last revision" date above, to see the last time I checked it.

*My play with bioinformatics tools was initiated by the book Bioinformatics for Dummies (really!), by Claverie and Notredame, Wiley Publishing, Inc., 2003 (a new edition appeared in December of 2006). Not considering myself a dummy in most arenas, I had never looked very hard at Wiley's "Dummies" books. I'm so glad I looked at this one. The authors are on the frontiers of the field, and they have produced a serious, high quality book. If you want to work through lots of clear tutorials in all areas of bioinformatics, buy it. It's the best $30 I have spent on a book in quite a few years. Just click the title to learn more about the latest edition.


Cast of Characters

I. The Databases (and their acronyms!)

  • Genbank, operated by NCBI (National Center for Biotechnology Information)
    Contains all publicly available sequences of DNA, with annotations
    Same DNA sequence content as EMBL (European Molecular Biology Laboratory) and DDBJ (DNA Data Bank of Japan)
  • UniProt Knowledgebase (Swiss-Prot and TrEMBL), operated by SIB (Swiss Institute of Bioinformatics) and EBI (European Bioinformatics Institute)
    Contains most of the publicly available sequences of proteins. Sequences in Swiss-Prot are annotated manually, and provide or link you to just about all published information about the sequence. Sequences in TrEMBL are collected automatically from sequence databases, and are on their way to Swiss-Prot, but are not yet fully annotated.
  • Protein Data Bank
    Contains all publicly available experimentally determined structural models of proteins and nucleic acids (determined by x-ray crystallography and NMR)
  • Swiss-Model Repository
    Contains many theoretical structural models of proteins (determined by automated homology modeling)
  • Online Mendelian Inheritance in Man (woman, too, I am sure)
    A catalog of human genes and genetic disorders, linked to gene entries in GenBank.

II. The Tools

  • NCBI Map Viewer
    For finding genes and gene products (RNAs and proteins) that interest you
  • BLAST
    For finding genes or proteins with sequences similar to yours
  • ClustalW
    For comparing your sequence with others, and lots of sequences with each other
  • Phylip
    For making phylogenetic trees, which show how sequences are related to each other
  • Treeprint
    For printing phylogenetic trees
  • PSIPRED
    For predicting the location of helices, pleated sheets, and transmembrane elements of proteins of unknown structure
  • Swiss-Model
    For automated building of theoretical structural models of your sequence based on known structures (homology modeling)
  • Deep View (also known as Swiss-PdbViewer)
    For seeing and exploring macromolecular models in three dimensions, and for manual and semiautomated homology modeling
  • PubMed
    For searching ALL the literature of the life sciences
  • ExPASy (Expert Protein Analysis System)
    Not so much a tool as a tool box -- a very complete set of protein analysis tools

Here We Go

Our subject is human opsins, those proteins, found in the cells of your retina, that catch light and begin the process of vision. We will proceed by asking questions about opsins and opsin genes, and then using bioinformatics to answer them.

When I provide a web address, I'll also make it a link -- just click it to go to the site in a new browser window. Then make it a bookmark so you can find it again. This tutorial will still be open in the window behind the new one.

WARNING: Bioinformatics tools evolve rapidly, faster than I can make changes to this tutorial. So if a page does not look exactly like I say it should, or if its title is different, look around and try to do what the tutorial says. You should find the same links, but names may be slightly different. If the differences are so great that you can't proceed, send me email (see contact link at top of this page), and I'll adapt the instructions to the changes as soon as I can.

Where are the opsin genes in the human genome?

Point your browser to http://www.ncbi.nlm.nih.gov/mapview/. You find a list of species for which genome information is available.

To look at a genome viewer for an organism, click the latest Build listing, or the magnifying glass symbol farther to the right. Hold your mouse pointer on the other tool symbols for a brief description of what they do.

Find Homo sapiens (human), and click on the OLDEST "build" or version of the genome.

NOTE: Why should you use the oldest build in this tutorial? Because sometimes not all links are hooked up to the tools for the newest build.

You see a diagram of the human chromosomes, and a search box at the top. Enter "opsin" in the box next to Search for.

Click Find.

You see the diagram again, with red marks at your "hits", the locations of genes whose entries contain "opsin" as a whole or partial word. Below the diagram is a list of the indicated genes. Among them are the rhodopsin gene (RHO), and three cone pigments, short-, medium-, and long-wavelength sensitive opsins (for blue, green, and red light detection). Four hits look like visual pigments, which probably does not surprise you. To the left of each entry is the chromosome number, allowing you to tell which red mark corresponds to each entry. Note that several hits are on the X chromosome, one of the sex-determining chromosomes. You can pursue multiple hits on the same chromosome with the all matches link for that chromosome.

NOTE: In the human genome lists, you will often see duplicates marked "reference" or "Celera", referring to the results from two major efforts to sequence the human genome. At first, these two efforts were separate, but eventually they came together. When you have a choice, choose "reference," so you will be following the same path I followed in setting up the tutorial.

Click all matches next to X. Be patient: the next page may load slowly--it's packed with information.

You see a very complicated display (don't sweat -- we're going to use only a part of this now). On the left is a diagram of the X chromosome, with red marks at the positions of the gene(s) you've followed to this page -- in our case, the two opsins, medium- and long-wave, which are located near the bottom tip of the X chromosome. To the right are various representations of the X chromosome, with listings of annotated areas. The two opsin genes are highlighted in pink. If you pass your cursor over this page without clicking, you will find that some symbols provide brief information, mostly about regions that are not yet characterized well enough to have a full entry.

As you can see, there is a tremendous amount of information on this page, with links to much more. If you want full information about the meanings of abbreviations and symbols on this page, as well as the kinds of information linked to the page, you can use Map Viewer Help at the top of the page. You will find abundant information about the Map Viewer, explanations of all symbols and links, and even tutorials about how to ask and answer all kinds of questions about the genome.

For now, note the information provided for the first of the two highlighted opsin genes, OPN1LW (this is called the gene symbol). You see that this is the long-wavelength-sensitive (red) opsin, and that it's a gene involved in color blindness (a sex-linked trait -- no surprise).

What do scientists know about the opsins?

Click OPN1LW.

You have entered Entrez Gene, which is a sort of highway interchange with routing to all sorts of information about this gene. Scan down the page. Some of the information is very plain and understandable, while some is very cryptic. One of the most accessible links is to OMIM (for Online Mendelian Inheritance in Man), a catalog of human genes and genetic disorders. Despite the name, the database includes genes of women, too.

Look down the page and find Phenotypes, and notice the links marked MIM. These are links to OMIM entries. Click one of them.

Each OMIM entry tells you about this gene and types of colorblindness, genetic disorders associated with mutations in this gene. Read as much as your interest dictates. Follow links to other information. For more information about OMIM itself, click the OMIM logo at the top of the page. Once you've satisfied your appetite, return to the Entrez Gene page (use the Back button of your browser or your browser's history list -- if you're lost, click HERE).

Next to the Display button, pull down the menu and select PubMed Links.

You have entered PubMed, a free database of scientific literature, and arrived at a list of articles directly associated with this gene locus. By clicking on the authors of an article, you can see the abstract of that article. If you are on a university campus where there is online access to specific journals, you might also see links to full articles. PubMed is your entry point to a wide variety of scientific literature in the life sciences. On the left side of any PubMed page, you will find links to a description of the database, help, and tutorials on searching. Use the Find tool of your browser to find the name Nathans on this page. Read the abstract of the article by Nathans and co-workers before returning to Entrez Gene.

What is the nucleotide sequence of this gene?

Remember that we are looking at the gene for the red-sensitive opsin in human vision, and it is located near the bottom tip of the X chromosome. Scroll down (way down!) to NCBI Reference Sequences (RefSeq). In the first section, mRNA and Protein(s), all of the following are available:

  • the protein sequence (sequence of this gene's protein product, the red opsin), here listed as NP_064445.1 (P for protein);
  • the mRNA Sequence (sequence of nucleotide bases in the messenger RNA), here listed as NM_020061.3 (M for mRNA);
  • the source sequences (entire sequence(s) of the genome fragment(s) containing this sequence, from GenBank).

Note that the two links to mRNA sequence and protein sequence are given as NM_020061.3→NP_064445.1, the arrow implying that the sequence of the NM entry is translated (by protein synthesis) to give the sequence of the NP entry.

Click the entry number for the mRNA sequence: NM_020061.3

This is a typical GenBank nucleotide file, and a lot of it is hard to read, but a few things are clear. First note, under references, citations to the publication of this sequence in the scientific literature. To see an abstract of the article in which this gene was described, click the PubMed link (a number) below the first reference and read it. Or instead, find the word Nathans on the page, and click the PubMed link below the related article. As you see, you've been here before. There are many ways to move from one database to another, which is both a blessing and a curse. You have to keep your eyes open for useful links, and when you find a path that you think you might use again, make a note of it and bookmark the web pages. It is frustrating to know there's an easier way to do something, and not remember how you did it.

NB to GR: point back to this abstract when you get the phylogenetic tree. (did you?)

Scroll to the bottom of this long page. The last thing, labeled ORIGIN, is the sequence of this messenger RNA. You are seeing the actual list of As, Ts, Gs, and Cs that make up the message for synthesis of this opsin. But wait! You know that RNA contains no T. In most nucleotide databases, U from RNA is represented as T, to make for easy comparison of DNA and RNA sequences. This sequence information is not in the form that is most useful for searching in databases, say, searching for related genes. Let's display this entry in a form more useful for searching.

At the top of the page, beside the Display button, pull down the menu that says GenBank (the default display format for each entry), and select FASTA (note that several other display options are available). Now you see one descriptive or "comment" line that begins with ">", followed by the nucleotide sequence. This little file is just what you need to search nucleotide databases for similar sequences. Let's keep it for future use.
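If you would rather fetch the same FASTA display from a script, here is a minimal sketch. It is not part of the original point-and-click procedure: it assumes NCBI's E-utilities efetch service, with the database name nuccore for nucleotide entries. The function names fasta_url and fetch_fasta are made up for this illustration.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of NCBI's E-utilities efetch service.
EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def fasta_url(accession, db="nuccore"):
    """Build an efetch URL that asks for one entry as plain FASTA text."""
    params = {"db": db, "id": accession, "rettype": "fasta", "retmode": "text"}
    return EFETCH + "?" + urlencode(params)


def fetch_fasta(accession, db="nuccore"):
    """Download the FASTA text for an accession (needs a network connection)."""
    with urlopen(fasta_url(accession, db)) as response:
        return response.read().decode("utf-8")


# Only the URL is built here; call fetch_fasta("NM_020061.3") to download.
print(fasta_url("NM_020061.3"))
```

For a protein entry such as NP_064445.1, the same sketch should work with db="protein".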

Click and drag on the web page to select everything from the ">" through the last nucleotide. Be careful not to select anything else. From your browser's Edit menu, select Copy to make a copy of this information on your clipboard, for pasting elsewhere. Now start your favorite word processor, make a new document, and paste. The FASTA comment and sequence should appear. Select all of the text and change the font to Courier or Monaco -- these "typewriter" fonts make it easy to align letters into columns, because all letters are the same width. Save this file, choosing text or plain text as the file type. Call it mrnared.txt. Save it to a convenient location for the files you'll be making later. Click your browser's Back button until you return to the Entrez Gene page for this gene.
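To make the FASTA layout concrete, here is a small sketch that splits FASTA text into its comment line and sequence, and then writes the sequence the way RNA is really spelled, with U in place of T. The record shown is a made-up miniature, not the real opsin mRNA, and parse_fasta is a name invented for this example.

```python
def parse_fasta(text):
    """Split FASTA text into a list of (comment, sequence) pairs."""
    records = []
    comment, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new comment line closes the previous record, if any.
            if comment is not None:
                records.append((comment, "".join(chunks)))
            comment, chunks = line[1:], []
        elif comment is not None:
            chunks.append(line)
    if comment is not None:
        records.append((comment, "".join(chunks)))
    return records


# Made-up miniature record, NOT the real contents of mrnared.txt.
example = ">demo_mRNA hypothetical example\nATGGCCCAG\nCAGTGGAGC\n"

comment, dna_style = parse_fasta(example)[0]
rna_style = dna_style.replace("T", "U")  # databases show U as T
print(comment)
print(dna_style)
print(rna_style)
```

To try it on your own file, read mrnared.txt into a string and pass that string to parse_fasta.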

What is the amino-acid sequence of this gene's product?

Under NCBI Reference Sequences (RefSeq), click the entry number NP_064445.1 for the protein sequence.

Things look a lot like before, but this is a protein entry (the classical view is that gene products are proteins, but not all of them are), containing the amino-acid sequence in one-letter abbreviations. Just as with the mRNA entry, turn this into a FASTA display, and copy it into a new word-processor document. Save it in text format as protred.txt. Return to Entrez Gene.

What does the neighborhood of this gene look like?

Click the first entry number beside Source Sequence.

This entry shows the sequence of the specific DNA clone that contains the opsin gene, along with information about how this clone was produced. This entry thus shows the gene in the slightly larger context of the cloned fragment in which this gene was found. This sequence would allow you to see flanking regions around the gene, and perhaps to design PCR primers for making useful quantities of the nucleotide sequence so you could express this gene in a cloning vector. From this page, you could also find neighboring sequences if you wanted to look farther afield. As before, display this entry in FASTA format. You will get several entries, each a different clone that was found to contain this region of the genome. Save the first FASTA entry (from the ">" to the end of the nucleotide sequence) as a word processor text document entitled GBred.txt. (Why GB? Because the last time I looked [still true 2006/03/21], these entries were called GenBank entries. But things change fast in this business.)

What proteins in humans are similar to the red opsin?

Now return to the NCBI Map Viewer. We're going to search the human genome for sequences similar to that of the red opsin.

Click the Blast symbol (circled B) next to Homo sapiens (human), OLDEST build.

This is the NCBI's BLAST search tool. BLAST is a widely used program for finding sequences similar to a "query" sequence that you're interested in. Pick these options from the various menus:

  • Database: Build Protein for OLDEST build (look at bottom of the Database menu). (This means that you will search the protein sequences in this build of the database.)
  • Program: BLASTP (Use the version of BLAST that compares protein sequences, unlike BLASTN, which compares nucleotide sequences.)
  • Other Parameters, Expect: 10 (The higher the number, the less stringent the matching, and the more hits you'll get)

Next, copy the FASTA data from your file protred.txt to your clipboard, and paste it into the BLAST search box, above which it says, "Enter an accession..." Check to be sure that the first character in the box is the ">" at the beginning of the FASTA data. Then click Begin Search.

The next page is for formatting your search results. We will take all defaults, and just click the View Report button. When your results are ready, the results of BLAST page appears. Look down the page to the graphical display, a box containing lots of colored lines. Each line represents a hit from your blast search. If you pass your mouse cursor over a red line, the narrow box just above the box gives a brief description of the hit. You'll find that the first hit is your red opsin. That's encouraging, because the best match should be to the query sequence itself, and you got this sequence from that gene entry. The second hit is the green opsin -- remember that the PubMed entry reported that the red and green pigments are the most similar. The third and fourth hits are the blue opsin and the rod-cell pigment rhodopsin. Other hits have lower numbers of matching residues, and are color coded according to a score of matches. If you click on any of the colored lines, you'll skip down to more information about that hit, and you can see how much similarity each one has to the red opsin, your original query sequence. As you go down the list, each succeeding sequence has less in common with red opsin. Each sequence is shown in comparison with red opsin in what is called a pairwise sequence alignment. Later, you'll make multiple sequence alignments from which you can discern relationships among genes.

See what you can figure out about what the scores mean. Identities are residues that are identical in the hit and the query (red opsin), when the two are optimally aligned. Positives are residues that are very similar to each other (see residue number 1 in the blue opsin -- it's threonine in red opsin, and the very similar serine in the blue). Gaps are sometimes introduced into a hit to improve its alignment with the query. The more identities and positives, and the fewer gaps, the higher the score. Note that blue opsin and rhodopsin are only about 45% identical to the red opsin. Other proteins, which are apparently not visual pigments, have even lower scores. Now let's take a look at where all these hits are in the human genome.
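A short sketch may help fix the ideas of identities and gaps. It counts both in a pairwise alignment given as two equal-length strings, with "-" marking a gap. The two fragments are invented for illustration; they are not real opsin sequences. Positives are left out on purpose, because counting them requires a substitution matrix, which this sketch does not include.

```python
def alignment_stats(query, hit):
    """Return (identities, gaps, percent identity) for two aligned strings."""
    if len(query) != len(hit):
        raise ValueError("aligned sequences must have the same length")
    # An identity is a column where both residues match and neither is a gap.
    identities = sum(1 for q, h in zip(query, hit) if q == h and q != "-")
    # A gap column is one where either sequence shows "-".
    gaps = sum(1 for q, h in zip(query, hit) if q == "-" or h == "-")
    percent = 100.0 * identities / len(query)
    return identities, gaps, percent


# Made-up aligned fragments, for illustration only.
query = "MAQQWSLQRL"
hit = "MAQ-WSLHRL"

ident, gaps, pct = alignment_stats(query, hit)
print(f"Identities = {ident}/{len(query)} ({pct:.0f}%), Gaps = {gaps}")
```

Here the percentage is taken over the full alignment length, including gap columns, which is one common convention; other tools may divide by a different length.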

Where are all the genes for these other proteins?

Click the Genome View button just below the introductory information at the top of this result page. If this button does not appear, go back and make sure you are searching the database for the OLDEST build of the human genome.

You have come full circle. You are back at the human chromosome diagram, and all the hits of your search, in the colors that signify their BLAST scores, are located for you on the diagram. Notice that there are about 100 proteins (discovered so far, that is) that have 40% or more positives in alignment with red opsin. The opsins are members of the very large family of G protein-coupled receptors, key players in signal transduction.

How are the opsin genes related to each other?

Answering this question requires making a multiple sequence alignment and then using it to make a phylogenetic tree. For these tasks, we move to another database where it's a little easier to gather a bunch of sequences into a single FASTA file.

Point your browser to http://us.expasy.org.

You see the home page of ExPASy, the Expert Protein Analysis System. As I said earlier, ExPASy is a complete protein tool box. With ExPASy, you can do almost any imaginable analysis or comparison of protein sequences and structures. In my humble opinion, Swiss sequence database tools are among the easiest ones to use.

Click UniProt Knowledgebase (SwissProt and TrEMBL) under Databases.

Read the introduction to these databases. They are high quality protein sequence databases with abundant annotation, minimal redundancy, and many connections to other databases.

Click Advanced search in the UniProt Knowledgebase.

With advanced searching, you can limit your search to specific genes and organisms, and you can search on descriptive information in the entries.

Set up a search for human opsins, as follows:

  • Search UniProtKB/Swiss-Prot only.
  • Enter Description: opsin
  • Organism: Choose "Human" from the pull-down menu
  • Check "Append and prefix * to query terms." The * is a "wild card". You are searching for all entries that contain "opsin" as a whole or partial word.
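To see what the wild card does, here is a tiny sketch using Python's standard fnmatch module, in which * also stands for any run of characters. The description strings are hypothetical stand-ins, not actual database output, and the search engine behind UniProt may well differ in details such as case handling.

```python
from fnmatch import fnmatchcase

# Hypothetical description strings, for illustration only.
descriptions = [
    "rhodopsin",
    "red-sensitive opsin",
    "visual pigment-like receptor peropsin",
    "hemoglobin subunit alpha",
]

# Appending and prefixing * turns "opsin" into "*opsin*".
pattern = "*opsin*"
hits = [d for d in descriptions if fnmatchcase(d, pattern)]
print(hits)
```

Without the wild cards, the pattern "opsin" would match only a description that is exactly that word.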

Click Submit.

The page UniProtKB/Swiss-Prot description is your search result page.

Look over the results. On 2008/03/16, this search gave 15 hits, including the rod pigment rhodopsin (OPSD), along with the three cone pigments (OPSB, OPSG, OPSR). There is also a "visual pigment-like receptor peropsin", OPSX. Sounds mysterious. Let's find out more about it, and in the process, see a typical Swiss-Prot entry.

Next to OPSX_HUMAN, click on the number (O14718) in the column headed AC (accession code?).

You see the UniProtKB/Swiss-Prot View of entry O14718. Peruse this entry and try to find out just what this rhodopsin-like protein is thought to do. Under Comments, you'll learn that it's found in the retina (the RPE or retinal pigment epithelium), and that it may detect light, or perhaps monitor levels of retinoids, the general class of compounds that are the actual light absorbers in opsins. Also under Comments •Similarity, you see, as mentioned earlier, that this protein is a member of the large family of G protein-coupled receptors (GPCRs). If you click G-protein coupled receptor 1 family. Opsin subfamily, you find a list of all purported members of this subfamily in SwissProt. Return from that page and click the adjacent View Classification to produce a list of all GPCRs in SwissProt, with summary Statistics at the bottom indicating that the human genome alone contains 809 of them (as of 2008/03/16)! It is a big family.

Now back up to the UniProtKB/Swiss-Prot entry page for O14718, OPSX_HUMAN.

Under References click the journal citation, "Proc. Natl. Acad. Sci. U.S.A. 94:9893-9898(1997)". From the resulting page, you can read a full article in the Proceedings of the National Academy of Sciences (PNAS) about this protein. Like many journals, PNAS puts full articles online just 6 to 12 months after publication.

Return from the PNAS reference, and look further down the entry page, where you find cross-references to this protein or its gene in other databases, predicted structural features of the protein, and last, the sequence. Note also, at the bottom of the page, links to a number of ExPASy tools listed for further analysis of this sequence. Try some of them. For example, I just learned in about ten seconds from Compute pI/MW that the isoelectric pH (or pI) of this protein is 8.78, and its molar mass is 37422.92. And I learned in no time at all from ScanProSite that the sequence contains signatures indicating that the protein is probably a G protein-coupled receptor (no surprise, but comforting) and that it has a retinal binding site. ProSite is a tool for finding signatures of function in new sequences. When you finish playing with these powerful tools, return to your SwissProt search results by use of the back button of your browser. If you're lost, go back to ExPASy and do the search again.
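The molar-mass half of a tool like Compute pI/MW can be sketched in a few lines: add up the average mass of each residue, then add one water for the free ends of the chain. The residue masses below are commonly tabulated average values; this is only an illustration of the arithmetic, and the server itself may use slightly different values or handle modified residues differently. The peptide in the example is made up.

```python
# Commonly tabulated average residue masses (g/mol), i.e. amino acid minus water.
RESIDUE_MASS = {
    "A": 71.0788, "R": 156.1875, "N": 114.1038, "D": 115.0886,
    "C": 103.1388, "E": 129.1155, "Q": 128.1307, "G": 57.0519,
    "H": 137.1411, "I": 113.1594, "L": 113.1594, "K": 128.1741,
    "M": 131.1926, "F": 147.1766, "P": 97.1167, "S": 87.0782,
    "T": 101.1051, "W": 186.2132, "Y": 163.1760, "V": 99.1326,
}
WATER = 18.01528  # one water per chain accounts for the two free ends


def molar_mass(sequence):
    """Sum residue masses for a one-letter sequence and add one water."""
    return sum(RESIDUE_MASS[aa] for aa in sequence.upper()) + WATER


# Made-up short peptide, for illustration only.
print(f"{molar_mass('MAQQWSL'):.2f}")
```

Paste the sequence from protred.txt (without its ">" comment line) into molar_mass to compare your number with the server's.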

Now let's compare the sequences with each other. We'll use the program ClustalW to make a multiple sequence alignment.

Return to the search result page, UniProtKB/Swiss-Prot description: opsin. Scroll down the result page and check the boxes at the left of these entries

  • OPSB (blue-sensitive opsin)
  • OPSD (rhodopsin)
  • OPSG (green-sensitive opsin)
  • OPSR (red-sensitive opsin)
  • OPSX (visual pigment-like receptor peropsin)

At the top of the page, at Send selected sequences to, select Clustal W (multiple alignment) from the menu, and click Submit.

ClustalW has been implemented at many web sites. This one, located at EMBnet.org, automatically receives the FASTA files from the selected entries, allows you to make some settings of the alignment criteria, and then does the alignment. We will just accept the default alignment settings. First, scroll in the Input Sequences box and verify that it contains five FASTA files, one right after the other. To make them easier to identify in subsequent outputs, edit the name of each FASTA comment line (begins with ">") as follows:

  • Change "sp|P03999|OPSB_HUMAN Blue-sensitive opsin (Blue cone photoreceptor pigment) - Homo sapiens (Human)." to "Blue".
  • Change "sp|P08100|OPSD_HUMAN Rhodopsin (Opsin 2) - Homo sapiens (Human)." to "Rhodopsin".
  • Change "sp|P04001|OPSG_HUMAN Green-sensitive opsin (Green cone photoreceptor pigment) - Homo sapiens (Human)." to "Green".
  • Change "sp|P04000|OPSR_HUMAN Red-sensitive opsin (Red cone photoreceptor pigment) - Homo sapiens (Human)." to "Red".
  • Change "sp|O14718|OPSX_HUMAN Visual pigment-like receptor peropsin - Homo sapiens (Human)." to "Peropsin".

In all cases, be sure to leave the ">" in the first line of each FASTA entry. To save some work in case something goes wrong, select the edited contents of the Input Sequences box, copy it, and paste it onto an empty word-processor page, and save the file in text format. Name it Opsins.txt.
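If you ever need to do this kind of renaming for more than a handful of sequences, a script saves typing. The sketch below swaps each long comment line for a short name, keyed on the entry name that appears in it. The function name rename_headers is invented for this example, and the sequences are short placeholders, not the real opsin sequences.

```python
# Short names from the list above, keyed by the entry name in each comment line.
SHORT_NAMES = {
    "OPSB_HUMAN": "Blue",
    "OPSD_HUMAN": "Rhodopsin",
    "OPSG_HUMAN": "Green",
    "OPSR_HUMAN": "Red",
    "OPSX_HUMAN": "Peropsin",
}


def rename_headers(fasta_text, short_names):
    """Replace each FASTA comment line with '>' plus its short name."""
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            for key, short in short_names.items():
                if key in line:
                    line = ">" + short  # keep the leading ">"
                    break
        out.append(line)
    return "\n".join(out) + "\n"


# Placeholder sequences (ACDE, FGHI), NOT real opsin data.
example = (
    ">sp|P03999|OPSB_HUMAN Blue-sensitive opsin (Blue cone photoreceptor pigment) - Homo sapiens (Human).\n"
    "ACDE\n"
    ">sp|O14718|OPSX_HUMAN Visual pigment-like receptor peropsin - Homo sapiens (Human).\n"
    "FGHI\n"
)
print(rename_headers(example, SHORT_NAMES))
```

Comment lines that contain none of the listed entry names pass through unchanged.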

Do not enter your email address.

Click Run ClustalW.

The resulting page is called ClustalW query receipt. Once the task is complete, links to results files will appear.

Click clustalw (aln), the link to your multiple-sequence alignment file.

You see the typical ClustalW alignment file, showing our five protein sequences aligned to maximize identical and similar residues. Below each line of five sequences are symbols to show the extent of similarity among the sequences. An asterisk (*) means that the same residue is always (that is, for all of these sequences) found at that location; for example, the first asterisk marks a location where only N (asparagine) is found. Colon (:) means that all residues at this location are very similar; for example, the first colon is where only F (phenylalanine), I (isoleucine), and L (leucine) -- residues with large, nonpolar sidechains -- occur. Period (.) means somewhat similar residues; for example, at the first period, serine, threonine, and glutamine occur -- all polar, but varied in size. If there is no mark then the residues at that location display no predominant common properties.
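Here is a rough sketch of how such a line of symbols can be worked out, column by column. It handles only the asterisk and the colon. The list of "strong" groups is the one I understand ClustalW to use for the colon; treat it as an assumption and check the ClustalW documentation before relying on it. The period, which uses a second set of weaker groups, is deliberately left out. The three short sequences are made up.

```python
# Assumed ClustalW "strong" similarity groups (used for the ":" symbol).
STRONG_GROUPS = ["STA", "NEQK", "NHQK", "NDEQ", "QHRK", "MILV", "MILF", "HY", "FYW"]


def conservation_line(aligned):
    """Return one symbol per alignment column: '*', ':', or a space."""
    symbols = []
    for column in zip(*aligned):
        residues = set(column)
        if "-" in residues:
            symbols.append(" ")  # columns with gaps get no mark
        elif len(residues) == 1:
            symbols.append("*")  # the same residue in every sequence
        elif any(residues <= set(group) for group in STRONG_GROUPS):
            symbols.append(":")  # all residues fall within one strong group
        else:
            symbols.append(" ")
    return "".join(symbols)


# Made-up aligned fragments, for illustration only.
aligned = ["NFSG", "NITW", "NLA-"]
for row in aligned:
    print(row)
print(conservation_line(aligned))
```

The second column of the example holds F, I, and L, the same trio as the first colon described above, and all three fall in the group MILF.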

Once more, as a safety measure, copy this alignment to your clipboard, and paste it onto an empty word-processor page. Then save the file in text format. Name it OpsMSA.txt. Remember that it is still on your clipboard, for pasting at our next stop. This multiple sequence alignment is one type of input you can use to make a phylogenetic tree.

What does the family tree of human opsins look like?

 

NOTICE 2008-03-16: The tools used for the next parts of this tutorial have changed so much since 2007 that I am still working out an efficient way to get a tree to conclude this part of the tutorial.

Sorry, will try to fix this. Meanwhile, all sections in green refer to servers that are no longer available.

In black, following, are replacement instructions for the old green stuff farther below. The new instructions will be very sketchy until I finish them and incorporate more complete explanations from the old green stuff.

Point your browser to http://bioweb2.pasteur.fr/ and click the colored flag (half Stars and Stripes, half Union Jack) to get to the English version of this page.

Click the heading Phylogeny, and then click protdist.

Enter your email as requested, and paste the contents of your ClustalW output (saved as OpsMSA.txt) into the box under Alignment File. Under Bootstrap Options, check Perform a bootstrap before analysis, enter an odd number after Random number seed (must be odd), leave 100 in the replicates box, and make no other changes on the page. Near the top, click Run. When the output page appears, copy and save the contents of the box labeled protdist.outfile.

Click the back button on your browser to return to the Phylogeny page. Find and click neighbor.

Enter your email as requested. Into the box labeled Distances matrix file, paste the contents of the file protdist.outfile. Under Bootstrap options, check Analyze multiple data sets, for How many set? enter 5, put in an odd number for a seed, and check Compute a consensus tree. Make no other changes on the page. Click Run.

There are four output files. Copy and save the contents of the last one, neighbor.outtree.

Skip to black text below.

The green text that comes next contains obsolete instructions, but far more useful and detailed explanations of what you just did. I hope to incorporate the explanations into the previous instructions soon. For now, reading the green sections, but ignoring specific instructions and filenames, will help you to understand how you just produced the information that you need to make a family tree of the 5 opsins using the tree-printing program Phylodendron (below).


DON'T Point your browser to http://bioweb.pasteur.fr/seqanal/phylogeny/phylip-uk.html

This is one home of the program Phylip, one of the most rigorous tools for constructing phylogenetic trees from aligned sequences.

Under Proteins, next to protdist, click "advanced form."

You are about to run protdist, a program that computes the "distance" of sequences from each other. These so-called distance matrices will be used by Phylip to construct your tree.

Enter your email into the top box.

In the alignment file box, paste your multiple sequence alignment from ClustalW. Include everything you copied from the clustalw (aln) page, including the identifying information preceding the data.

Click "Bootstrap Options" to move down this page, and then make these settings:

  • Check the box for "Perform a bootstrap before analysis"
  • Enter any odd number for a seed
  • Enter 100 replicates

Scroll back up to the top of the page, and click "Run protdist".

protdist constructs distance matrices by a process called "bootstrapping". Bootstrapping is a bias-reducing procedure in which protdist builds an alignment of pseudosequences by picking residue positions at random (the same position may be picked more than once) and stringing the residues at those positions together until each pseudosequence is the same length as the original ClustalW alignment. From this pseudosequence alignment, protdist determines the relative number of sequence differences among the five proteins, as determined from a random sampling of their sequences. The result of the process is called a distance matrix, and you will see it soon. This process is repeated, 100 times in our case, to make 100 distance matrices. The tree we will ultimately produce represents a consensus of the 100 matrices.
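If you would like to see the resampling idea in action, here is a minimal Python sketch. It is only an illustration, not the code the Phylip server runs: the three short aligned sequences are made up, and the function name bootstrap_alignment is my own invention.

```python
import random

def bootstrap_alignment(alignment, seed):
    """Return one pseudosequence alignment by sampling columns with replacement.

    alignment: dict mapping sequence name -> aligned sequence (all the same length).
    """
    rng = random.Random(seed)
    length = len(next(iter(alignment.values())))
    # Pick `length` column positions at random; a position may be picked more than once.
    columns = [rng.randrange(length) for _ in range(length)]
    # Use the SAME columns for every sequence, so the pseudosequences stay aligned.
    return {name: "".join(seq[c] for c in columns) for name, seq in alignment.items()}

# Toy alignment (hypothetical sequences, not real opsins).
toy = {
    "Red":   "MAQQWSLQRL",
    "Green": "MAQQWSLQRL",
    "Blue":  "MRKMSEEEFY",
}

# 100 replicates, each started from a different odd seed.
replicates = [bootstrap_alignment(toy, seed) for seed in range(1, 200, 2)]
print(len(replicates))
print(replicates[0])
```

Each replicate has the same length as the original alignment, but some columns appear more than once and others not at all. A distance matrix computed from each replicate gives one of the 100 matrices.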

There may be a delay of a few minutes before the result page appears. If the server is busy, you may be informed that results are being sent by email. If so, check your email in two or three minutes. You will receive five messages. The one whose subject starts with "access" contains a link to your result page. Click the URL, or paste it into your browser and press <return> to open the page.

On the Phylip: protdist page that results, click outfile to see the output from protdist. The file contains 100 matrices containing numbers that represent the relative number of differences among the five sequences. Each matrix has the sequence names in the first column, and you should imagine that these sequence names are also the headings for the remaining columns. The number at the intersection of the row Blue and the column with the imaginary heading Peropsin gives the relative magnitude of the sequence differences between the blue opsin and peropsin. The matrices have zeros on the diagonal because each pseudosequence is identical to itself.
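Reading a matrix this way is easy to mimic in Python. The sketch below assumes a small square matrix in the layout described above (a count on the first line, then one row per sequence, name first). The numbers are invented for illustration, and the parser assumes that names contain no spaces and that each row fits on one line, which may not hold for larger Phylip files.

```python
# Hypothetical three-sequence matrix; the distances are made up.
matrix_text = """
    3
Red         0.000000  0.040000  0.900000
Green       0.040000  0.000000  0.910000
Blue        0.900000  0.910000  0.000000
"""

def parse_distance_matrix(text):
    """Return (names, rows) from a simple square distance matrix."""
    lines = [ln for ln in text.strip().splitlines() if ln.strip()]
    n = int(lines[0])                      # first line: number of sequences
    names, rows = [], []
    for ln in lines[1:n + 1]:
        fields = ln.split()
        names.append(fields[0])            # row label
        rows.append([float(x) for x in fields[1:]])
    return names, rows

def distance(names, rows, a, b):
    """Look up the entry at row `a`, imaginary column heading `b`."""
    return rows[names.index(a)][names.index(b)]

names, rows = parse_distance_matrix(matrix_text)
print(distance(names, rows, "Blue", "Red"))
print(distance(names, rows, "Green", "Green"))
```

The row names double as the imaginary column headings, which is why one list of names serves for both lookups.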

Click the Back button of your browser to return to the Phylip: protdist page.

On the first pull-down menu of the Phylip: protdist page, pick "neighbor." Read the menu carefully: don't pick "weighbor".

Click "Run the selected program on outfile" to run Phylip with the output file of matrices you just examined. You are running a procedure called "neighbor joining" to construct an evolutionary tree.

On the Phylip: neighbor page that appears next, beside "Distance method?", make sure "Neighbor-joining" is selected.

Click "Bootstrap options" to move down the page, and then make these settings:

  • Check "Analyze multiple data sets (M)"
  • Enter 100 data sets (using all of the replicates from protdist)
  • Enter an odd number for a seed
  • Check "Compute a consensus tree"

 Scroll down to "Other options".

This entry area gives you the option of designating an outgroup for the root of your tree. An outgroup is the sequence you think is most distant from the others, possibly the common ancestor of all. We don't know that in this case, so leave the default of 1.

 At the top of the page, click "Run neighbor".

 The resulting files are

outfile.consense -- your tree, in a text file, and outtree.consense -- your tree in a format used by tree-printing programs.

Click on outfile.consense to see the tree.

Scroll down to the bottom of this file to see the consensus tree. This tree is "unrooted", meaning that we do not know the ancestor of all these sequences. We learn from this tree which sequences are most alike and which are most different. We also learn how often the connections of this tree were made the same way in the 100 trees made from those 100 distance matrices. The numbers on the branches indicate the number of times that partition of the species into the two sets separated by that branch occurred among the 100 trees. For example, the separation of Red and Green from the other three, indicating that Red and Green are more similar to each other than to the other three, occurred in all 100 trees. The separation of Blue and Peropsin from the other three occurred in only 62 of the 100 trees. In the other 38 trees, Rhodopsin and Peropsin were separated from the other three. (Can you extract this information from this file?) In the tree branching shown, the majority rules, and the results of 38 of the trees are discarded.
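The majority-rules bookkeeping can be sketched in Python. This is an illustration under assumptions, not the consense program itself: each tree is represented simply as a list of the partitions made by its internal branches, and the 62/38 counts are the ones from my run described above.

```python
from collections import Counter

TAXA = frozenset(["Red", "Green", "Blue", "Rhodopsin", "Peropsin"])

def canonical(side):
    """Label a partition by its smaller side, so both sides give the same label."""
    side = frozenset(side)
    other = TAXA - side
    return min(side, other, key=lambda s: (len(s), sorted(s)))

# Two tree shapes, each described by the partitions its internal branches make.
tree_a = [{"Red", "Green"}, {"Blue", "Peropsin"}]
tree_b = [{"Red", "Green"}, {"Rhodopsin", "Peropsin"}]
trees = [tree_a] * 62 + [tree_b] * 38

# Count how many of the 100 trees contain each partition.
counts = Counter(canonical(split) for tree in trees for split in tree)

# Majority rule: keep only partitions found in more than half of the trees.
majority = {split: n for split, n in counts.items() if n > len(trees) / 2}

for split, n in sorted(counts.items(), key=lambda item: -item[1]):
    print(sorted(split), n)
```

The Rhodopsin and Peropsin partition appears in only 38 trees, so it is left out of the consensus, just as the text above describes.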

Note: Your results may be slightly different from mine. Because of the random choices made in constructing the tree, the percentages in the paragraph above may vary. I have gotten as high as 82% consensus on the separation of Blue and Peropsin from the other three.

You can save this file by selecting all and pasting it into a word-processor document. Call it outfile_consense.txt.

Return to the Phylip: neighbor page and click on outtree.consense. This is your tree in Newick format, which is widely used by tree-printing programs like Phylodendron. Let's use this program to give us a tree in attractive graphics, rather than text. Select:All and copy this little file to your clipboard.
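Newick format is just nested parentheses, with commas between sister branches and an optional branch length after each colon. The Python sketch below pulls the sequence names out of such a string. The tree and its branch lengths are made up for illustration (your own file will differ), and the regular expression is a shortcut that assumes unquoted names without spaces.

```python
import re

# Hypothetical Newick tree; the branch lengths are invented.
newick = "((Red:0.02,Green:0.02):0.5,Rhodopsin:0.4,(Blue:0.45,Peropsin:0.9):0.1);"

def leaf_names(tree_text):
    """Return leaf names: any label that directly follows '(' or ','."""
    return re.findall(r"[(,]\s*([^():,;\s]+)", tree_text)

def parentheses_balanced(tree_text):
    """Quick sanity check that every '(' has a matching ')'."""
    depth = 0
    for ch in tree_text:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0

print(leaf_names(newick))
print(parentheses_balanced(newick))
```

Each pair of parentheses encloses one group of the tree, so (Red, Green) and (Blue, Peropsin) are the two groupings in this toy example.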


Note: With a consense.outtree or neighbor.outtree file from a Phylip server, you can continue here:

Point your browser to http://iubio.bio.indiana.edu/treeapp/treeprint-form.html.

Paste the contents of your consense.outtree or neighbor.outtree file into the Tree Data box. Select Phenogram from among the Tree Styles. From the menu at Extra Options, Output, select GIF image (not Gif Image map) format for your output file. Give your tree a title, such as "Human Visual Opsins and Opsin-Like Proteins". Set Tree Growth to horizontal. Finally, click Submit.

Your GIF-format tree appears in your browser window. To keep it, choose Save As ... from the File menu. Call the file OpsinTree.gif. My tree looks like this:

Look again at this abstract in the light of your tree.

What is the structure of an opsin?

By now, I'm particularly curious about peropsin, but it's not likely that the structure of a recently discovered protein of unknown function has been determined. But it is likely that all opsins are similar in structure, so let's see if we can find an opsin in the database for macromolecular structures, the Protein Data Bank (PDB). It will give us an idea of what kind of thing an opsin is.

In fact, the PDB does not contain molecular structures at all. It is better to say that it contains models of macromolecules. These models are interpretations of data from one of the two main methods of macromolecular structure determination: x-ray crystallography and NMR spectroscopy. When researchers determine the structure of a macromolecule, they deposit a file containing the three-dimensional coordinates of all the atoms in the model. This coordinate file -- along with an online molecular graphics tool (like the PDB's Jmol Viewer) or a computer graphics program like DeepView -- is all that you need to see and study the molecule on your computer. Next we will retrieve a model from the PDB and view it with an online graphics tool. We'll also visit the home of a topnotch computer graphics program that you can download FREE and use on your home computer.
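To give you a feeling for what a coordinate file holds, here is a small Python sketch that reads atom lines laid out in the fixed columns of the PDB format. The two atoms and their coordinates are made up for illustration, and atom_line is a hypothetical helper whose only job is to build sample lines with the columns in the right places; a real file from the PDB contains thousands of such lines plus header information.

```python
import math

def atom_line(serial, name, res_name, chain, res_seq, x, y, z):
    """Build a PDB-style ATOM line (hypothetical helper, for sample data only)."""
    name_field = f" {name:<3}" if len(name) < 4 else name
    return (f"ATOM  {serial:>5} {name_field} {res_name:>3} {chain}{res_seq:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")

def parse_atom(line):
    """Read the fixed-column fields of an ATOM line."""
    return {
        "serial": int(line[6:11]),          # columns 7-11
        "name": line[12:16].strip(),        # columns 13-16
        "res_name": line[17:20].strip(),    # columns 18-20
        "chain": line[21],                  # column 22
        "res_seq": int(line[22:26]),        # columns 23-26
        "xyz": (float(line[30:38]), float(line[38:46]), float(line[46:54])),
    }

# Two made-up atoms of one residue in chain A (coordinates in angstroms).
lines = [
    atom_line(1, "N", "MET", "A", 1, 27.340, 24.430, 2.614),
    atom_line(2, "CA", "MET", "A", 1, 26.266, 25.413, 2.842),
]

atoms = [parse_atom(ln) for ln in lines]
bond = math.dist(atoms[0]["xyz"], atoms[1]["xyz"])
print(lines[0])
print(atoms[1]["name"], atoms[1]["chain"], round(bond, 2))
```

Everything a graphics program draws, from ribbons to spacefilling spheres, is computed from lists of atoms like these.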

Point your browser to http://www.rcsb.org/pdb/.

The PDB home page contains a simple search box at the top. You can search for models using simple keywords or PDB ID codes. A PDB code has four characters, like 1CYO. How would you ever know a model by its code? When a new structure is published, the authors usually give the PDB code in the last reference of the bibliography. With that code, you can go straight to the model you want to see. But more often, your question, like ours, is more general. For such cases, PDB also provides forms for more sophisticated searches. For now, let's just see if any opsin models are available. Type "opsin" into the search box, make sure the PDB ID or keyword is selected, and click Search.

On 2007/01/05, this search returned only one model, which is quite puzzling, because a search for "rhodopsin" returns 48 models. So this tells me that the quicky (quirky?) search tool at the PDB still needs some work. But this shortcoming is a gift for us right now. We've bagged a live opsin of known structure; the PDB contains only models derived experimentally, either by x-ray crystallography or NMR spectroscopy. Let's take a look at it.

Click the PDB file code 1GZM above the tiny image of the model.

You have come to the Structure Summary page for this model, which is its home page at the PDB. This page is connected to just about everything you could possibly do with this model. At the PDB, your first goal is always to get to the Structure Summary page for the model you are seeking.

NOTE: Structure Summary does not exactly jump out at you on this page. It's the tab selected over the main part of the entry. Those tabs should be more prominent—they are what distinguishes each of the important pages for a model. If you want to know where you are in the PDB, look at two sets of tabs at the top of the page. The other tabs open LOTS of information about this model, but we will stick with structure.

In the left column of all PDB pages, you find a set of nested menus. Click Display Molecule to open the PDB display options. If you already own or use one of the listed viewers, like the free program DeepView, you are in business. Click your viewer to download the model and view it in a familiar environment. But let's assume that you are new to all this, and use a handy viewer that works in your browser.

Click Jmol Viewer. Assuming that your computer has up-to-date Java software, your browser will load the viewer, and it will load the file 1GZM. You should see models of two rhodopsin molecules -- with backbones shown as ribbon-like cartoons, one green, one blue -- and several ball-and-stick models of smaller molecules. Is rhodopsin a dimer? No, but the crystals of rhodopsin from which this model was derived contained two rhodopsin molecules per asymmetric unit (the smallest portion from which the entire crystal is constructed). PDB files usually show the full contents of the asymmetric unit. If more than one molecule is present, they are referred to as chains in the model.

NOTE ON VIEWERS: The viewer on display is the widely used Jmol, which you will find in use as a molecular viewer at many web sites. If you take time to get to know this viewer fairly well, you will get more out of the many sites that use it. Like most of the other viewers listed at PDB, Jmol is quite limited in its capacity for analysis of protein structure.
In my humble opinion, the most powerful protein-analysis tool listed at PDB is DeepView. DeepView may be the only protein-structure viewing and analysis tool you will ever need. You will learn about it in the Homology Modeling section, below.

Here are some other things you can do to get to know models in a Jmol frame (to get back to the original rendition, reload the page):

  • Click/drag (left button if you have more than one) on the image to rotate the structure. You should be able to tell that it has a lot of alpha helix.
  • Hold down option (for Macintosh; alt for Windows) and click/drag to zoom in (drag towards you) or out (drag away) or to rotate the model in the plane of the screen (drag left or right).
  • Hold down ctrl (or right-click) the image: up pops a set of menus, and if you browse around on them, you'll see that there is much more to Jmol. Let's try just a couple of things to give you some general ideas.
  • Using the pop-up menus, Select:Protein:All
    Nothing appears to happen. You have selected part of the model (the protein part, but not the small molecules). Now let's change it.
  • Color:Cartoon:By Scheme:Secondary Structure
    The cartoons become red (well, bright pink) for alpha helix, and yellow for beta sheet. I'll bet you had not noticed the beta sheet in the models before. Look one of the chains over carefully to get a feeling for its structure. How many helices are present? How many strands of beta sheet? Are the strands parallel or antiparallel?
  • Do you know how to view stereo pairs? (If not, click HERE to learn how.) Then Render:Stereographic:(choose your favorite mode). Due to a bug in Jmol, pick the opposite mode from what you really want. Cross-eyed viewing gives you wall-eyed, and vice versa. Now you can see the models as solid objects with convincing depth. If you are ever going to do anything serious with protein structure, you'll need to find a way to view them in 3D.
  • Work in stereo or not, as you prefer. Clear the display: Select:None; then Select:Display Selected Only. The display goes blank; nothing is selected and you are displaying only the selection (very logical!).
  • Select:Protein:All (means select both backbone and sidechains). Then Render:Scheme:CPK Spacefilling. The protein portion is now shown as a spacefilling model. In this rendition, you get a good idea of the overall shape of the protein. Unfortunately, the Jmol menu does not allow you to color the two chains separately or get rid of one of them.
  • Render:Scheme:Wireframe. Now you see all of the protein parts of this model in wireframe. This is not as impressive as some other schemes, but is actually the most useful when you start exploring models in detail, because the wires do not hide each other like ball and sticks or spacefilling models.

To learn more about Jmol, consult the help links at PDB below the display. You can also find extensive help for all viewers listed there.

Finding Opsin Homologs in the PDB

Now, let's try to find other models in the PDB that are homologous to the human opsins. You will ask the PDB, "List all models whose sequences can be aligned with that of human red opsin, in order of sequence similarity." In PDB terminology, the red opsin sequence is the query, and similar models found are called subjects.

First, open your query file protred.txt (FASTA sequence of red human opsin), and copy the sequence portion only to the clipboard; omit the comment line that begins with >.

At the top right of any PDB page, click Search. From the list of search types, click Sequence. On the resulting page, click the button next to use Sequence, and paste your red opsin sequence into the box just below. Note that the search tool is your new friend Blast, and that an E cut-off value of 10 is given as a default. This will limit the results to subjects of quite high similarity. Higher E-value cutoffs will be less restrictive, and give more hits or subjects. Click the search button. The search tool is now looking for PDB models whose sequences are similar to the human red opsin sequence.

On 2007/01/07, I got 14 subjects, or 14 PDB models whose sequences are homologous to the search sequence. Each is listed with an E-value (not the same as the E cut-off value on the previous page), which is, roughly speaking, the probability that the sequence similarity between query and subject is a coincidence. (Strictly, it is the number of equally good matches you would expect to find by chance in a database of this size; for very small values, the two are practically the same.) The first result or subject is PDB model 1F88, a model of bovine rhodopsin. The E-value is 2.4 x 10^-76. In other words, while the probability that a coin flip and your call will agree just by chance is 0.5, the probability that the similarity between human red opsin and bovine rhodopsin is just a chance occurrence is

0.00000000000000000000000000000000000000000000000000000000000000000000000000024,

which means, to any sane biologist, that these two molecules descended from a common ancestor. There is no chance that, in the history of the universe, two proteins could arrive at sequences this similar by chance. This also means that the structure of the bovine rhodopsin is a sure bet to be very similar to that of the human red opsin, whose structure is unknown (if it were known, this search would have found it).
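If you want to check that string of zeros yourself, here is a short Python sketch. It assumes the usual BLAST statistics, in which the chance of finding at least one match this good by accident is P = 1 - e^(-E); for an E-value this tiny, P and E are indistinguishable. The variable names are mine.

```python
import math
from decimal import Decimal

e_value = 2.4e-76   # E-value reported for 1F88 in my search

# Probability of at least one chance match this good, assuming P = 1 - exp(-E).
p_value = -math.expm1(-e_value)

# Write the number out in full decimal form and count the zeros after the point.
decimal_text = format(Decimal("2.4e-76"), "f")
zeros_after_point = len(decimal_text) - len("0.") - len("24")

print(p_value)
print(decimal_text)
print(zeros_after_point)
```

The same two lines of arithmetic work for any E-value in your own results. Only when E gets up near 1 do P and E begin to differ noticeably.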

Now look down the list of the models you found. They are all models of the same substance: bovine rhodopsin. The only molecule of known structure that is highly homologous to the human red opsin is this one. As you know by now, these molecules are in the family of G-protein coupled receptors, membrane proteins whose structures are proving to be quite elusive.

Use the results page to answer these questions about the comparison between human red opsin and the bovine rhodopsin in PDB 1F88:

  1. How many corresponding residues, and what percent of the residues, do the two proteins have in common (exact matches)?
  2. How many and what percent of corresponding residues are similar in chemical properties?
  3. How many gaps did the alignment program introduce, and how many residues in each gap, to get best alignment between human red opsin and 1F88?
  4. Find the longest string of exact matches between the two proteins. How many matches does it contain, and what are the beginning and ending residue numbers?
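The results page reports most of these numbers for you, but it is instructive to see how they could be counted from the alignment itself. The Python sketch below works on a pair of aligned strings with gaps shown as hyphens. The two short sequences are made up, positions are reported as alignment columns rather than residue numbers, and percent identity is taken over the full alignment length (gaps included), which is how I understand Blast to report it. Counting chemically similar residues would require a scoring matrix, so that is left out.

```python
import re

def gap_runs(seq):
    """Lengths of each run of gap characters in one aligned sequence."""
    return [len(run) for run in re.findall(r"-+", seq)]

def alignment_stats(query, subject):
    if len(query) != len(subject):
        raise ValueError("aligned sequences must be the same length")
    matches = [q == s and q != "-" for q, s in zip(query, subject)]
    identities = sum(matches)

    # Find the longest unbroken run of exact matches (first one wins a tie).
    best_len = best_start = run = 0
    for i, is_match in enumerate(matches):
        run = run + 1 if is_match else 0
        if run > best_len:
            best_len, best_start = run, i - run + 1

    return {
        "identities": identities,
        "percent_identity": 100.0 * identities / len(query),
        "gap_lengths": gap_runs(query) + gap_runs(subject),
        # (length, first column, last column), counting columns from 1
        "longest_match": (best_len, best_start + 1, best_start + best_len),
    }

# Hypothetical aligned pair, not real opsin sequences.
query   = "MAQQWSLQ--RLAGRHPQ"
subject = "MNGTEGLQFYRLAG--PQ"

stats = alignment_stats(query, subject)
print(stats)
```

For this toy pair there are 9 identities in 18 columns (50%), two gaps of two residues each, and a longest exact match of 4 residues in columns 11 through 14.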

Now you know how to search the PDB for models whose sequences are similar to a target or query sequence. Structural biologists use such searches when they have a new protein sequence and want to know its structure. If the structure is known, this search would find it. If not, any hits with high sequence similarity can tell researchers the overall fold of the newly discovered protein.

Where Do You Go From Here?

Well, there you have a basic introduction to bioinformatics. With the tools you've tried out, you can explore the vast stores of genetic and structural information available on the Internet. Every page we've visited has many more links to other tools. You can figure out a lot just by visiting them and playing around, and there's usually built-in help and tutorials. I hope this tutorial spurs you to learn more about how to use bioinformatics in your explorations. If you want to learn more about using databases in structural investigations of proteins, see the Encore! next.

Test Yourself

Click HERE for a series of questions to answer using the tools of this tutorial. These questions make an appropriate assignment for assessing your (or your students') ability to use the tools of bioinformatics.


Encore! Exploring Protein Structure by Homology Modeling

How would you like to create a model of that mysterious peropsin we found among sequences homologous to the human opsins, and to find out whether it contains a binding site for retinal? This is a reasonable goal, if you are willing to learn how to use a more powerful tool for protein viewing and analysis.

Jmol is a great structure viewer for beginners, but among free (my favorite price) viewers, there are far more sophisticated, powerful, and versatile viewers than Jmol. My all-time favorite is DeepView (also known as Swiss-PdbViewer). The encore section of this tutorial requires DeepView. We will use it to look more closely at bovine rhodopsin, including exploring its bound retinal molecule. We will also probe more deeply into that mysterious peropsin we noticed among the sequences similar to the human opsins. In fact, we will make a model of peropsin, by threading its sequence onto the three dimensional structure of bovine rhodopsin. (You can't do that with any other free protein viewer!!)

This process is called homology modeling, and it can provide sort of an educated guess about the structure of a protein (the target) from its sequence, as long as one or more structures of homologous proteins (templates) are available in the Protein Data Bank. How good a guess? In short, the more similar are the sequence and function of your target and your template(s), the better the model. So in our case, if we can find a template of similar sequence and function (a homologous retinal-binding opsin, that is), we should be able to get a decent model of peropsin, and see whether it's really feasible to think that it binds retinal. We might even be able to tell whether it binds retinal covalently or noncovalently, depending on whether an appropriate covalent-binding residue is suitably located in the homology model. At the very best, however, you cannot learn fine details of side-chain conformations, as you can from determining protein structure by X-ray crystallography or NMR. But homology models can be useful for preliminary exploration, and might also point to useful target residues for chemical analyses of structure, such as site-directed mutagenesis.

NOTE: Creating a model is loosely called structure determination, but in fact, we never really determine the structures of molecules. What we call structure determination is really creating molecular models that fit data or fit what we know about a substance. A model based on data from x-ray crystallography or spectroscopy is called an experimental model. A model not based on experimental data is a theoretical model, and a theoretical model based on homology to one or more experimental models is a homology model.

If you already have and know how to use DeepView, just continue with this tutorial. If not, take time to learn DeepView from this Beginner's Tutorial, and then return here.

Exploring Rhodopsin and Peropsin

If you came here directly from the DeepView Tutorial, you might want to read a bit about the target of this modeling project, peropsin, which is the subject of the somewhat dated bioinformatics tutorial (above). Peropsin is found in the retina and shares modest sequence homology with the visual pigments of the retina (like rhodopsin), but its function is unknown (as of early 2008). In the tutorial that follows, you will make a homology model and see whether peropsin is likely to have a binding site for the prosthetic group retinal, as the visual pigments do.

For this part of the tutorial, I'm assuming that you know how to use DeepView, also known as Swiss-PdbViewer, and how to download models (called coordinate files) from the Protein Data Bank. If not, please work through sections 1-11 of the DeepView Tutorial and come back to this section. The conventions used in the rest of this tutorial are the same as those in the DeepView Tutorial.

Quick Homology Models (but you don't learn anything).

Many genomics databases include homology models generated by automated servers, or quick and easy ways to submit a sequence or database entry for modeling. You get a model quickly, but you don't learn much about the process. Let's get a peropsin model by this route, and then we will do it by a more hands-on method that helps you to see a little bit about what the automated servers are doing. Then later on, if an automated server gives you a poor model (one that conflicts with things that other research has told you, or one with obvious errors, like charged residues in its interior), you will know how to intervene in the modeling in hopes of getting a better model (one that agrees with reliable evidence about the structure).

Go to http://us.expasy.org/, select UniProt Knowledgebase (Swiss-Prot and TrEMBL), and enter OPSX, the protein name for peropsin, in the Search Swiss-Prot and TrEMBL for box. Click Go. On the result page, click OPSX Human. You are now at UniProtKB/Swiss-Prot entry O14718. Look over the wealth of information that makes up the annotations for this entry. Through this page, you are connected to the information about this gene in databases all over the world of structural biology.

After looking over this page, and perhaps visiting some other sites to see what they have on OPSX, scroll to the bottom and click on Submit a homology modeling request to SWISS-MODEL.

At Swiss-Model, you see that the server has already started filling out the form for you, by displaying the accession number O14718 at the top. Fill in your email address and your name. Click Submit to SWISS-MODEL. You just asked Swiss-Model to make a homology model for you. The server will search (using pBlast) for templates with sufficient sequence similarity; gather the templates and thread your sequence onto them; search databases of protein loops to build the parts that do not match the templates well, using loops similar to those in the databases where possible; build any remaining loops with reasonable conformations, although these are mostly guesswork; and finally, optimize the model by minimizing its conformational energy.

NOTE: You can also paste any FASTA sequence into a form for a homology modeling request by going to http://swissmodel.expasy.org/ and clicking on First Approach Mode under Modeling Requests.

Swiss-Model will send your model, along with the template models, to you by email. The file will be all ready for viewing in DeepView, and it will include all templates used in the modeling.

Note: From the UniProtKB/Swiss-Prot entry O14718 page, you can also get homology models of peropsin using various templates by clicking O14718 next to ModBase under Cross References, 3D Structure Databases. Again, you get a model, but don't learn much.

Homology Modeling Step by Step

Now let's make a peropsin homology model by the do-it-yourself (well, partially, at least) method. This will give you a much clearer picture of what's going on when an automated server makes a model for you.

First, we'll need a FASTA file for human peropsin. Go to http://us.expasy.org/, select UniProt Knowledgebase (Swiss-Prot and TrEMBL), and enter OPSX, the protein name for peropsin, in the Search Swiss-Prot and TrEMBL for box. Click Go. On the result page, click OPSX Human. Near the bottom of the resulting UniProtKB/Swiss-Prot entry, click on O14718 in FASTA format. Save the resulting text file as peropsin.txt. (I selected and copied the file from the browser display, pasted it into a new word processor file, and saved it in text format).
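A FASTA file is nothing more than a comment line that begins with >, followed by the sequence in one-letter code, usually broken over several lines. If you want to check what you saved, a few lines of Python will do it. This sketch reads a made-up record held in a string; the header and sequence are hypothetical, and read_fasta is my own name for the function.

```python
def read_fasta(text):
    """Return (header, sequence) for the first record in FASTA-format text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                break                      # a second record starts here; stop
            header = line[1:]
        else:
            chunks.append(line)            # sequence lines are simply joined
    return header, "".join(chunks)

# Hypothetical record, not the real peropsin entry.
sample = """>toy|T00001|TOY_HUMAN Hypothetical example protein
MAQQWSLQRL
AGRHPQDSYE
"""

header, sequence = read_fasta(sample)
print(header)
print(len(sequence), sequence)
```

To try it on your own file, replace sample with the text read from peropsin.txt, for example open("peropsin.txt").read(), and compare the length with the residue count that DeepView reports.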

NOTE: You can also paste any FASTA sequence into a form for homology modeling by going to http://swissmodel.expasy.org/ and clicking on First Approach Mode under Modeling Requests.

Start DeepView. Then Cancel the initial dialog box, which is expecting you to load a PDB file. Your peropsin file is in FASTA format, so you have to load it by a different procedure.

SwissModel: Load Raw Sequence to Model...
A reminder about DeepView Tutorial format: the instruction above this line tells you to select the command Load Raw Sequence to Model.. from the SwissModel menu.

The resulting dialog box looks just like one for loading a PDB file, but now DeepView is looking for a sequence file in FASTA format. Navigate to your peropsin.txt file, select it, and click Open.

DeepView displays the sequence of peropsin as an alpha helix. This is a compact way to get the 337 residues onto the screen. First, let's see what Prosite (above) has to say about the nature of this protein. DeepView contains an internal link to ProSite through the command Edit: Search for ProSite Patterns. Because ProSite works on sequence only, we don't need to know anything about structure to see what signatures of protein function ProSite can find.

Edit: Search for ProSite Patterns

  • If DeepView returns an error message, your DeepView installation may not have included the ProSite data. Go HERE and follow instructions for this installation. You'll probably have to restart DeepView, but not your computer, after installing.

This command elicits a small window listing signatures or patterns found in peropsin. Note the last two entries, indicating that ProSite recognizes patterns indicating that peropsin is a G protein-coupled receptor (GPCR) with a retinal-binding site. Click the black descriptions of sites to highlight the residues that ProSite recognized. Click the red ProSite entry numbers to download a full description of the ProSite documentation for recognizing a specific type of protein. The entry for GPCRs contains a full list of sequences from this family in SwissProt/TrEMBL. It's a huge list. The entry for retinal binding sites contains a list of sequences that appear to contain this site, followed by a description of the types of proteins in which this pattern is found. All entries end with the specific patterns that ProSite looks for to recognize a particular type of protein. Study these lines to find out what ProSite is looking for when it scans a sequence.
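ProSite patterns are written in a compact notation: hyphens separate positions, x means any residue, square brackets list allowed residues, curly braces list forbidden ones, and numbers in parentheses give repeat counts. The Python sketch below translates that notation into an ordinary regular expression and scans a sequence with it. The pattern and the sequence are invented for illustration (they are not an actual ProSite entry), and the translation handles only the common parts of the syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a ProSite-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        element = element.replace("{", "[^").replace("}", "]")   # {P}  -> [^P]
        element = element.replace("(", "{").replace(")", "}")    # x(2) -> x{2}
        element = element.replace("x", ".")                      # x    -> any residue
        element = element.replace("<", "^").replace(">", "$")    # sequence ends
        parts.append(element)
    return "".join(parts)

# Hypothetical pattern and sequence, for illustration only.
toy_pattern = "[ST]-x(2)-{P}-K-x(1,3)-[LIV]"
toy_sequence = "AAASGGAKMMLAAA"

regex = prosite_to_regex(toy_pattern)
match = re.search(regex, toy_sequence)
print(regex)
print(match.start() + 1, match.group())
```

Here the pattern is found starting at residue 4 of the toy sequence. When you study the pattern lines in the ProSite entries for peropsin, try reading them in the same way, position by position.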

Now you know that there is likely to be a retinal-binding site in peropsin. Let's see if we can find it.

SwissModel: Find Appropriate ExPDB Templates

  • If DeepView returns an error message, you need to Preferences: Swiss-Model, and make the following settings:
    • Modeling Server: http://swissmodel.expasy.org/cgi-bin/sm-submit-request.cgi
    • Template Server: http://swissmodel.expasy.org/cgi-bin/blastexpdb.cgi
    • Also specify your name, an email address where you can receive large files, and your preferred browser.
    • Then repeat the command SwissModel: Find Appropriate ExPDB Templates

DeepView starts up your preferred browser and completes a form containing the peropsin sequence in FASTA format. Click Submit to conduct a search for proteins of known three-dimensional structures that are homologous to peropsin. The Swiss-Model Template Server returns a list of models that should serve as suitable templates for making a model of peropsin. These models are in a special structure database called ExPDB, for excerpts of PDB models. An ExPDB entry is usually one domain from a multidomain protein, or one chain from a PDB model that contains more than one chain.

On 2007/01/08, I got sixteen possible templates, all of which were various models of bovine rod rhodopsin. As of this date, rhodopsin from the rod cells of good ol' Bos taurus (whence "Come, Bossy!") is the only visual pigment of known structure. The first recommended template, 1f88A, is chain A from PDB entry 1F88, a model determined by X-ray crystallography. The BLAST score for 1f88A indicates that the odds that the sequences of peropsin and model 1f88A are similar by chance are 9 x 10^-37, which is strong evidence that the similarity is no accident. Biologists would say that the only reasonable explanation for such similarity is evolution: peropsin and bovine rhodopsin evolved from a common ancestor.

Because all the potential templates are the same protein, we will work with just one template, 1f88A. Click its file name in the download ExPDB column of the Template Selection table. Depending on how your browser and DeepView are set up, the file might open automatically in DeepView, or you may have to specify DeepView as the helper application, or you may have to save the file to your desktop. If you've been using DeepView before, you've probably worked out the best way to handle PDB downloads. Use your favorite method, and then open the file in DeepView (by File: Open PDB File, if you saved the file to your desktop).

The model should appear near your large alpha helix of peropsin residues.

Wind: Sequences Alignment
DeepView displays the sequences of both models, left-justified -- that is, aligned at their N-terminal residues. Now for some real magic.

Fit: Magic Fit
In the blink of an eye, it appears that your peropsin helix is gone. But its sequence has been aligned with that of 1f88A, and each of its residues has been superimposed upon the residue with which it aligns sequentially. In short, the peropsin chain has been threaded onto the 3-dimensional model of bovine rhodopsin.

<control-tab-tab-tab...>
Holding down the control key and pressing tab tells DeepView to "blink" between the structures. You are seeing, alternately, 1f88A and the target peropsin homology model. Your target might still be colored gray, with a ProSite pattern highlighted in cyan; if so, display the target, and Color: CPK to give it "normal" color. As you blink back and forth between the models, you might see some strange things about the target. For example, look for a very long peptide bond at one end (between residues 315 and 316). Such features are obviously not chemically realistic; they just represent the best DeepView could do at aligning the sequences. Such a problem suggests that, in this region, the two proteins are not structurally very similar. But the overall alignment seems to work well.

In the Sequences Alignment window, notice that the sequences are no longer left-justified; they are aligned by homology. Straight vertical lines connect residues that are identical in the two models, two dots connect quite similar residues (like valine and leucine, both bulky nonpolar), one dot connects less similar pairs (serine and glutamine, both polar), and dissimilar pairs are unconnected (glycine and isoleucine). To see the full alignment conveniently, click the little document icon at the left end of the Sequences Alignment window. You can save this diagram as a plain text file for printing (File: Save: Sequence Alignment).
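If you want to see how such a marker line could be built, here is a minimal Python sketch. The similarity groups are my own hypothetical choices for illustration; they are not DeepView's actual scoring scheme, and the function names are made up for this example.

```python
# Hypothetical similarity groups, for illustration only.
# DeepView's own similarity scoring is NOT reproduced here.
STRONG_GROUPS = [set("VLIM"), set("KR"), set("DE"), set("ST"), set("FYW")]
WEAK_GROUPS = [set("STNQ"), set("AG"), set("HKR")]


def marker(a: str, b: str) -> str:
    """Return the alignment marker for one pair of aligned residues."""
    if a == "-" or b == "-":
        return " "          # a gap is never connected
    if a == b:
        return "|"          # identical residues
    if any(a in g and b in g for g in STRONG_GROUPS):
        return ":"          # quite similar (two dots)
    if any(a in g and b in g for g in WEAK_GROUPS):
        return "."          # less similar (one dot)
    return " "              # dissimilar pair, unconnected


def marker_line(seq1: str, seq2: str) -> str:
    """Build the marker line for two aligned sequences of equal length."""
    if len(seq1) != len(seq2):
        raise ValueError("aligned sequences must have the same length")
    return "".join(marker(a, b) for a, b in zip(seq1, seq2))


# Made-up aligned fragments; '-' marks a gap.
top = "AVSG-K"
bottom = "ALQIWK"
print(top)
print(marker_line(top, bottom))   # prints "|:.  |"
print(bottom)
```

With these groups, valine/leucine gets two dots, serine/glutamine gets one dot, and glycine/isoleucine is left unconnected, matching the examples given above.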

Wind: Layer Infos
This window gives information about the display properties of all models currently loaded. With what we are doing, it's a handy way to be more quantitative about similarities. According to the Sel column (far right), 299 residues are currently selected. This is the number of residues in the aligned regions of the two models. Blink to display 1f88A. Select: All. The Sel column tells you that 1f88A contains 346 residues. Blink to display peropsin. How many residues does it contain?

With the peropsin model remaining on display, Select: aa Identical to ref. Structure. DeepView tells you that 88 residues of peropsin align with identical amino-acid residues of 1f88A. This is 88/346 or only 25% sequence identity. Select: aa Similar to ref. Structure. The percentage of aligned residues that are chemically similar (double dots in the Sequences Alignment window) is 182/346 or 53%. If two proteins show more than about 35% sequence similarity under best alignment, then they are almost certain to be of similar structure.
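The arithmetic behind these percentages is simple enough to check with a few lines of Python. This is only a sketch using the counts reported above; the function name is my own.

```python
def percent(count: int, total: int) -> float:
    """Return count as a percentage of total."""
    if total <= 0:
        raise ValueError("total must be positive")
    return 100.0 * count / total


template_length = 346   # residues in 1f88A, from the Layer Infos window
identical = 88          # from Select: aa Identical to ref. Structure
similar = 182           # from Select: aa Similar to ref. Structure

print(f"identity:   {percent(identical, template_length):.1f}%")   # 25.4%
print(f"similarity: {percent(similar, template_length):.1f}%")     # 52.6%
```

Substitute your own counts if your alignment gives slightly different numbers.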

Select: aa Making Clashes
A large number of residues in the homology model are trying to occupy the same space. This is obviously unrealistic. Make the model more feasible by using Tools: Fix Selected Sidechains: Quick and Dirty. Again, Select: aa Making Clashes to see if DeepView has improved these problems. Press <return> to show only the selected residues. Pink dotted lines reveal the clashes. Some of them look pretty serious.

You can solve all of the problems of this primitive model by sending it off to the Swiss-Model server for optimization.

SwissModel: Submit Modeling Request
Your browser appears again, this time with a new form. The form tells you that a project file has been created, and gives you its location (on a Macintosh, the location is /Applications/SPDBV_3.9b1.01_univ/temp/, and the file name is SwissModelRequest-xxx.spd). On the form, under Your Swiss-Model project file can be found in:, click Browse, and navigate to the file listed on the form. This is the project file that your browser will send to Swiss-Model for optimization. (The other file in the same location, SwissModelRequest-xxx.htm, is just the web page form you are viewing. Don't select this file by accident.) Complete the form by checking your email information, selecting Swiss-PdbViewer mode for the format of your final model, and unchecking the option of getting a WhatCheck report of the final model. Then click Send Request.

NOTE: After clicking the Browse button on the form, you could select and send ANY DeepView project file that you created by using Fit: Magic Fit to fit a FASTA sequence onto one or more templates. So your procedure to this point does not have to follow the tutorial exactly. But the project file must contain one target sequence, loaded as the first model, and one or more templates loaded afterward. If you are working on several modeling projects, you might want to move a copy of SwissModelRequest-xxx.spd to a working folder for this model, and use the same folder for the results files that you will receive.

ALTERNATIVE REQUEST METHOD: If you have problems sending the project file by way of your browser, you can submit it directly to Swiss-Model. Go to http://swissmodel.expasy.org/. Under Modeling Requests, click Project (optimise) mode. You will see a form similar to the one that DeepView creates. Fill in your email address, name, and a project title. Click Choose File, navigate to your project file, and choose it. Select Swiss-PdbViewer mode for the format of your final model, and uncheck the option of getting a WhatCheck report of the final model. Then click Send Request.

By either submission method, your browser should return a message indicating successful uploading of your project file (385787 bytes for this project), and provide further information. You will receive your optimized model by email. It may take several hours. Once you receive several email files from Swiss-Model, you are ready to resume this tutorial.

NOTE TO USERS: 2007/01/13 -- revised to here, then found that this project was failing at Swiss-Model. The folks there are trying to track down the problem. Tutorial beyond this point is revised to fit the expected results from the hands-on method, and the actual results from the automated method, but it might need additional changes.

On 2007/01/11, my automated modeling request elicited four email messages from Swiss-Model, with one that contains this subject line: SwissModel-Model-AAAa080xq. The AAA number is a Swiss-Model project number (yours will be different, of course), and this is the email that contains your homology model as an attachment (mine was named AAAa080xq.pdb). Save your model file to a convenient location, and then start DeepView and open the file.

NOTE: The other emails contain information about how your modeling project was carried out, and some news for Swiss-Model users. If a modeling project fails, one of these emails will tell you (perhaps cryptically) just what happened. The mail whose subject line includes the word TraceLog gives a list of all operations in making your model.

ANOTHER NOTE: For an automated request, there may be more than one template. In the remaining instructions, I will assume more than one template, in order to include instructions for looking at only the ones you want.

Studying and Evaluating Your Model

Blink (hold down ctrl and press tab repeatedly) to see the target (peropsin) and the templates in sequence. In the cyc column of the Layer Infos window, for all models except the target and the first template, click each checkmark once to turn it to +, then again to turn it to -. This prevents the other templates from appearing during blinking. Now blinking will simply alternate between showing the target and the first template.

By default, templates are displayed as backbone only. Turn on all of the first template's side chains by shift-clicking anywhere in the side column of the Control Panel. With peropsin on display, shift-click any checkmark in the show column to turn off display of all residues, leaving a ribbon model. By default, the ribbon is colored to show the quality of the model. Most of the ribbon is blue or green, and some short segments are red. Blue indicates residues that fit very well with the template, green means not bad, while red indicates residues that did not match up well with template residues. (The menu command option for showing this color scheme is Color: B-Factor, although the term B-Factor does not apply here.)

It is typical in a modeling project like this that scaffold residues, such as those in the seven helices, model well, but surface loops, which define the specific function of the protein, constitute the most significant differences between target and template, and do not model as well. Ironically, you learn mostly what you already know about your target (in this case, that it's a seven-helix bundle), and you learn least about the most interesting parts, the parts that differ most from your template, and that give your target protein a different function from that of your template.

Let's see whether optimization really improved our hand-made model noticeably. With the peropsin layer displayed, and the Layer Infos window open, Select: aa Making Clashes. If you find any, fix them as outlined above. Look for other funny stuff, such as long peptide bonds. If such things are all gone, the model is at least structurally realistic. To learn more about judging the quality of models -- homology models and others -- visit these two resources:

  1. Principles of Protein Structure, Comparative Protein Modelling and Visualisation, by Nicolas Guex (creator of DeepView) and Manuel C. Peitsch
  2. Judging the Quality of Macromolecular Models, by Gale Rhodes
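Looking for long peptide bonds by eye can be tedious, so here is a rough Python sketch of the same check done on a PDB-format file. A normal peptide C-N bond is about 1.33 Å long. The 1.5 Å cutoff, the function names, and the demo coordinates are all my own assumptions for illustration; nothing here comes from DeepView or from the actual peropsin model.

```python
import math


def pdb_line(record, serial, atom, res, chain, num, x, y, z):
    """Format one fixed-column PDB coordinate record (demo helper only)."""
    return (f"{record:<6}{serial:>5} {atom:<4} {res:>3} {chain}{num:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def backbone_atoms(pdb_text):
    """Map residue number -> {'N': xyz, 'C': xyz} using ATOM records."""
    residues = {}
    for line in pdb_text.splitlines():
        if not line.startswith("ATOM"):
            continue
        name = line[12:16].strip()
        if name in ("N", "C"):
            num = int(line[22:26])
            xyz = (float(line[30:38]), float(line[38:46]), float(line[46:54]))
            residues.setdefault(num, {})[name] = xyz
    return residues


def long_peptide_bonds(pdb_text, max_length=1.5):
    """List (residue i, residue i+1, C-N distance) for suspiciously long bonds.

    max_length is an assumed cutoff in angstroms, not a DeepView setting.
    Only consecutively numbered residues in a single chain are compared.
    """
    residues = backbone_atoms(pdb_text)
    flagged = []
    for num in sorted(residues):
        nxt = num + 1
        if nxt in residues and "C" in residues[num] and "N" in residues[nxt]:
            d = math.dist(residues[num]["C"], residues[nxt]["N"])
            if d > max_length:
                flagged.append((num, nxt, round(d, 2)))
    return flagged


# Made-up coordinates: a normal bond (314-315) and a stretched one (315-316).
demo = "\n".join([
    pdb_line("ATOM", 1, "C", "ALA", "A", 314, 0.0, 0.0, 0.0),
    pdb_line("ATOM", 2, "N", "ALA", "A", 315, 1.33, 0.0, 0.0),
    pdb_line("ATOM", 3, "C", "ALA", "A", 315, 2.5, 0.0, 0.0),
    pdb_line("ATOM", 4, "N", "ALA", "A", 316, 8.5, 0.0, 0.0),
])
print(long_peptide_bonds(demo))   # [(315, 316, 6.0)]
```

To try it on your own model, read the file contents (for example with open("your_model.pdb").read(), where the file name is whatever you saved) and pass the text to long_peptide_bonds.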

Now let's see whether it appears that peropsin contains a pocket for retinal binding. Blink to the 1f88A layer (in my project, the first template). In the Control Panel, scroll down to the bottom and click RET977, to select the retinal molecule in the bovine rhodopsin model. The line RET977 should turn red when you click it. Press <return> to remove all but retinal from the display.

Select: Neighbors of Selected aa..: Select groups that are within 4.0 Å of the picked atom. Click OK and press <return>. Click the labl heading in the Control Panel to label all displayed residues, and click the side heading to add side chains. Next to RET977 at the bottom of the Control Panel, add a checkmark to the column headed by four dots and a small v (van der Waals surface column). You should now have a dramatic display of the retinal-binding pocket of bovine rhodopsin. Note that LYS296 forms a covalent imine (Schiff base) link with retinal, the result of addition of the LYS296 amino group to the aldehyde group of retinal. Note the exquisite fit of the retinal among other residues, with most hydrophobic side chains snuggling up to the hydrophobic retinal molecule.
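The neighbor selection you just made amounts to a distance search, which can be sketched in Python as follows. This is a simplified illustration, not DeepView's implementation: the function names are mine, the demo coordinates are made up (they are not taken from 1f88), and the code assumes a single-chain file such as 1f88A.

```python
import math


def pdb_line(record, serial, atom, res, chain, num, x, y, z):
    """Format one fixed-column PDB coordinate record (demo helper only)."""
    return (f"{record:<6}{serial:>5} {atom:<4} {res:>3} {chain}{num:>4}    "
            f"{x:8.3f}{y:8.3f}{z:8.3f}")


def parse_atoms(pdb_text):
    """Read residue name, residue number, and coordinates of every atom."""
    atoms = []
    for line in pdb_text.splitlines():
        if line.startswith(("ATOM", "HETATM")):
            atoms.append({
                "res_name": line[17:20].strip(),
                "res_num": int(line[22:26]),
                "xyz": (float(line[30:38]), float(line[38:46]),
                        float(line[46:54])),
            })
    return atoms


def neighbors_of(atoms, ligand_name, cutoff=4.0):
    """Return residues with any atom within cutoff angstroms of the ligand."""
    ligand = [a["xyz"] for a in atoms if a["res_name"] == ligand_name]
    found = set()
    for a in atoms:
        if a["res_name"] == ligand_name:
            continue
        if any(math.dist(a["xyz"], p) <= cutoff for p in ligand):
            found.add((a["res_name"], a["res_num"]))
    return sorted(found, key=lambda r: r[1])


# Made-up coordinates: two residues near the retinal atom, one far away.
demo = "\n".join([
    pdb_line("HETATM", 1, "C15", "RET", "A", 977, 0.0, 0.0, 0.0),
    pdb_line("ATOM", 2, "NZ", "LYS", "A", 296, 1.3, 0.0, 0.0),
    pdb_line("ATOM", 3, "CB", "ALA", "A", 10, 0.0, 3.9, 0.0),
    pdb_line("ATOM", 4, "CA", "GLY", "A", 20, 0.0, 0.0, 9.0),
])
print(neighbors_of(parse_atoms(demo), "RET"))   # [('ALA', 10), ('LYS', 296)]
```

Changing cutoff lets you see how the size of the pocket depends on the distance you choose.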

Now let's see whether our model of peropsin allows such a pocket, and provides an appropriately placed lysine residue for covalent bonding. Select: Extend to other layers. This selects the residues in model peropsin that are aligned with the displayed residues of 1f88A. Blink to the peropsin layer, press <return> to display the selected residues, shift-click any checkmark in the ribn column to remove the ribbon display, and click the labl heading to label all displayed residues. Now blink between the layers, looking for similarities and differences. You should see that LYS284 of peropsin is perfectly placed to link to retinal, that no pocket residues of peropsin intrude upon the retinal, and that many of the pocket residues are identical to those in bovine rhodopsin. It appears that peropsin could accommodate a retinal molecule, and could attach it by an imine link as well. Finally, Color: Layer to give the target and template different colors, and then make both layers visible (in the Layer Infos window, click to put check marks in the vis column for both). With the models superimposed, you can readily see that the target model has a nicely formed retinal pocket.

So. Does peropsin carry retinal in the human eye? I don't know. This homology model certainly suggests that retinal binding is feasible. Proving that peropsin is a retinal carrier in the eye requires more than just building models. It requires purifying peropsin from retinal tissue, followed by chemical analysis to detect retinal. Finding out if the binding is just like what we are seeing in the model would require determining the structure of the peropsin-retinal complex by X-ray crystallography or NMR. Or a researcher could use the model we've made to select residues to change (by site-directed mutagenesis) and see if the changes affect binding. Modeling only begins the quest to determine what peropsin actually does. To fully understand peropsin will require a conversation between theory (which includes model building) and experiment (chemical analysis, spectroscopy, structure determination, monitoring peropsin gene expression). This powerful dialog is the engine that propels science, and our growing understanding of nature.

 


Test Your Bioinformatics Skills

Exam #3 for USM Biochemistry (CHY463/563), spring semester, 2004.
See class schedule for instructions.

The subject of this assignment is homeodomain genes, which are involved in numerous aspects of development, including establishment of body plan and differentiation of various types of stem cells into specialized cells.

If your instructor assigns this test, follow these instructions:

  • Copy and paste the Questions sections into email.
  • Fill in answers in the spaces provided.
  • If the answer is a file, save your file with the name provided. Submit all requested files in a format acceptable to your instructor.
    For USM Biochemistry students, the following are acceptable:
    • email attachment
    • CD (full size only)
    • floppy disk
    • USB memory device

Questions

Part 1. Sequence Work

Requires completion of all of this tutorial EXCEPT the Encore! structural work.

  1. Start at the NCBI Map Viewer. How many genes in the human genome contain the term "homeo" in their name? To be sure you find them all, search for "*homeo*". The asterisks are wild cards, which means that you are searching for "homeo" preceded or followed by any other characters.
    Number found: ______ .
  2. Which chromosome contains the largest number of these genes? How many?
    Chromosome # ______ ; Number of "homeo" genes on this chromosome: ______ .
  3. Among the genes found in question 1, find one that has a role in insulin action.
    Name of the gene: _____________________________. Four-character ID: ______ .
  4. What chromosome contains this gene?
    Chromosome # ______ .
  5. According to OMIM, what is the role of the protein encoded by this gene?
    Role (limit to 25 words): ________________________________________________ .
  6. Obtain the protein sequence of this gene, in FASTA format.
    File name:
    HmPrt.txt (You will use this file in Part 2.)
  7. Go to ExPASy. How many annotated human genes in SwissProt and TrEMBL contain the term "homeo"? Note that "*" is automatically used as prefix and suffix unless you specify otherwise.
    Number found: ______ .
  8. Make a phylogenetic tree of the first 24 of these genes plus the insulin-related gene found in question 3, a total of 25 sequences. Use the insulin-related gene as the "outgroup". (NOTE: ClustalW at EMBnet accepts no more than 30 sequences for alignment.)
    Hint: After conducting the search, you will need to use, in sequence, ClustalW, Phylip, and Phylodendron to complete this task.
    Files:
    ClustalW input file name:
    HmCWIn.txt
    ClustalW output file name for Phylip [clustalw (aln) file]:
    HmCWOut.txt
    Phylip output file name [text tree (outfile.consense) file]:
    HmCons.txt
    Phylip input for Phylodendron [Newick (outtree.consense) file]: HmOutTree.txt
    Phylodendron output file name: HmTree.gif
  9. According to your tree, what two SwissProt entries in this group are the most similar?
    SwissProt entry numbers ______ and ______ .
  10. What SwissProt entry is most similar to the insulin-related gene?
    SwissProt entry number ______ .
  11. What can you find out about the function of this similar gene?
  12. Go to the Protein Data Bank. Search for models of human homeodomain proteins.
    1. How many models do you find?
      Number: ______ .
    2. What method of structure determination produced the first of these models?
      PDB ID code: ______ . Method: _______________________ .
    3. View the first model on the list with QuickPDB or your favorite molecular viewer.
      What are the main secondary structural elements (helix, sheet, coil) in this protein?
      Secondary structural elements: ___________ .
    4. Give beginning and ending residue numbers of three secondary structural elements.
      1. start: ______ end: ______ .
      2. start: ______ end: ______ .
      3. start: ______ end: ______ .
  13. Find a model of a human homeodomain/DNA complex.
    1. How many models do you find?
      Number: ______ .
    2. What method of structure determination produced the first of these models?
      PDB ID code: ______ . Method: _______________________ .
    3. View the first model on the list with QuickPDB or your favorite molecular viewer.
      What secondary structural element(s) (helix, sheet, coil) interact with DNA?
      Secondary structural elements: ___________ .
    4. Give beginning and ending residues of main secondary structural element in contact with DNA.
      Residue start: ______ end: ______ .

Part 2. Structural Work

Requires completion of the Encore! structure section of this tutorial.

Search the Protein Data Bank for human homeodomain/DNA complexes. View the first model you find with DeepView (Swiss-PdbViewer).

  1. PDB File ID: _________ .
  2. What patterns does ProSite recognize in this protein?
    List ProSite patterns:
    ________________________
    ________________________
    etc
  3. What secondary-structural element (helix, sheet, coil) contains most of the residues that ProSite recognizes as a homeodomain?
    Element: ________________ ; Residue numbers, start: ______ end:______ .
  4. What secondary-structural element(s) of the homeodomain protein interact(s) directly with DNA?
  5. List three residues involved with hydrogen bonds to DNA.
    List interactions like this: "LYS89-T6" means a hydrogen bond between lysine-89 of the protein and thymine-6 of the DNA.
    Interactions:
    _______________________
    _______________________
    _______________________
  6. Comment on the quality of this model, especially in areas of protein-DNA interaction.
    Quality criteria (limit to 10 words): ___________________________________________ .
    Comments: (limit to 25 words): _______________________________________________ .
  7. Use DeepView and the file HmPrt.txt, which you saved in Part 1, to make a homology model of the insulin-related human homeodomain protein (the target), using the best available template. Submit your project file returned from Swiss-Model.
    Project file name: _________________.pdb
  8. How many residues does the target protein contain?
    Number of residues in file HmPrt.txt: ______ .
  9. How many residues of the target are modeled by the best template?
    Number of residues in the homology model: ______ .
  10. Why are the remaining residues missing?
    Reason(limit to 25 words): ___________________________________________________ .
  11. How many residues of the template are identical to corresponding residues of the target? How many are similar?
    Number identical: ______ .
    Number similar: ______ .
  12. How many residues of the target are modeled well by the template?
    Number modeled well: ______ .
    Criteria (limit to 25 words): __________________________________________________ .

