Instructions for users

The Gene Function Similarity Search Tool (GFSST) is a search engine web server 
(URL: http://gfsst.nci.nih.gov) for mining the human and mouse proteomes using the 
functional information provided by the Gene Ontology (GO), a structured controlled 
vocabulary. Users can search with a protein identifier (either an accession number 
or an Entry name) from the UniProt database, or with one or several GO terms. The 
search can be carried out against the human or the mouse proteome. The search engine 
can also be used as a retrieval tool for the human and mouse UniProt databases.

Search Options

For a given protein with known functions, or a set of protein functions, and a given 
proteome, the GFSST search engine can find not only proteins in the proteome with the 
same functions (shared GO terms) but also proteins with similar functions (GO terms 
that are not necessarily shared, but are very similar).

Search with a source protein

Given a source protein, GFSST can retrieve functionally related proteins from a 
specific proteome. To illustrate this process, we use human BRCA1 as an example. 
GFSST provides two ways to input a source protein: by accession number or by 
Entry name. BRCA1 has Entry name BRCA1_HUMAN and accession number P38398.

Type P38398 into the protein accession number box, then click the "Submit Query" button. 
The results from GFSST's query of the human proteome (the default database) usually 
appear within seconds, and may take up to about a minute. All proteins in the human 
proteome with a D-value < 0.05 (the default threshold value) against the source protein 
in all three GO categories (Biological Process, Cellular Component, Molecular Function) 
will be listed. The D-value, similar to the P-value, varies from 0 to 1. D stands for 
distribution. To retrieve any protein that has the same functions as the source protein, 
set the D-value to 0.
 
Accession numbers and properly formatted names should be copied and pasted from UniProt, 
one of the major protein databases (for convenience, there is a direct link to UniProt 
from the main page). The Entry name is not a universal ID. If the source protein is a 
mouse protein, select Mouse from the pull-down menu on the left of the entry name box 
before entering its entry name. For example, given the source protein BRCA1_MOUSE, select 
Mouse from the pull-down input menu and type BRCA1_MOUSE into the entry name box. 
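
Before pasting an identifier into the form, it can help to check which box it belongs 
in. The following is a minimal sketch, not part of GFSST itself: the function names 
classify_identifier and species_for_entry_name are hypothetical, the accession pattern 
is the one published by UniProt, and the entry-name pattern (mnemonic, underscore, 
species code) is a simplified assumption.

```python
import re

# Accession pattern as published by UniProt (6 or 10 characters).
ACCESSION_RE = re.compile(
    r"^(?:[OPQ][0-9][A-Z0-9]{3}[0-9]"
    r"|[A-NR-Z][0-9](?:[A-Z][A-Z0-9]{2}[0-9]){1,2})$"
)

# Simplified assumption: entry names look like BRCA1_HUMAN,
# i.e. a mnemonic, an underscore, then a species code.
ENTRY_NAME_RE = re.compile(r"^[A-Z0-9]{1,10}_[A-Z0-9]{1,5}$")


def classify_identifier(text):
    """Return ('accession' | 'entry_name' | 'unknown', cleaned_text)."""
    cleaned = text.strip().upper()
    if ACCESSION_RE.match(cleaned):
        return ("accession", cleaned)
    if ENTRY_NAME_RE.match(cleaned):
        return ("entry_name", cleaned)
    return ("unknown", cleaned)


def species_for_entry_name(entry_name):
    """Return the pull-down choice implied by the entry-name suffix, or None."""
    suffix = entry_name.strip().upper().rsplit("_", 1)[-1]
    return {"HUMAN": "Human", "MOUSE": "Mouse"}.get(suffix)


if __name__ == "__main__":
    for raw in ("P38398", "BRCA1_HUMAN", "BRCA1_MOUSE"):
        kind, cleaned = classify_identifier(raw)
        print(cleaned, kind, species_for_entry_name(cleaned))
```

An identifier classified as "accession" goes into the protein accession number box. An 
"entry_name" goes into the entry name box, with the pull-down menu set to match its 
species suffix.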

Users may also search the mouse proteome by selecting mouse instead of human from the 
Search Against pull-down menu near the bottom, above the Input D-value box.


Search by GO terms

GFSST also provides a robust retrieval tool for genes and gene products based on their 
associated GO terms. It is more flexible than searching by a source protein. Users 
can design their search targets with a single GO term or a combination of GO terms. 
Given GO terms, GFSST retrieves gene products from a specific proteome (mouse or human). 
Hence, users design their search target by gene functions (GO terms). GFSST can find 
genes or gene products matched by those exact functions and/or similar functions.

Click on the links provided for each of the GO categories to try out the system with 
three real cases, or set up your own searches with GO terms or groups of terms. There is 
the option of searching with GO terms in multiple categories, but be aware that the 
categories are currently set up as independent (And/Or) functions. In particular, if only 
one GO term is selected in each major category, the search may give a very large number 
of hits, with most of them being specific to one of the terms but not the other.

Examples of GFSST Search using GO terms:

Glucose metabolism is a critical pathway in the study of diabetes. Target proteins with the 
glucose metabolism function (GO term GO:0006006) in the Biological Process category will thus 
be relevant to diabetes. GFSST delivered 19 exact matches for this GO term in the UniProt human 
proteome. Insulin is at the top of the list of search results.
 
DNA damage response, signal transduction by p53 class mediator (GO:0030330) is a very 
important function in cancer research. A GFSST search for this GO term in the Biological 
Process category finds no exact matches to gene products in the UniProt human proteome. 
Four proteins, including BRCA1_HUMAN, are matched by GO:0006978 (a child of GO:0030330) 
with D-value 0.0000376. P53_HUMAN and P73_HUMAN are matched by GO:0008630 with D-value 
0.0000752370. It is not surprising that there are no exact matches for the term 
GO:0030330, since more specific child terms below GO:0030330 are assigned to these gene 
products instead. 

Users can also query with a set of GO terms to obtain target proteins for the biological 
functions of interest. Angiogenesis inhibitors are designed to stop tumors from developing 
a blood supply, a prerequisite for tumor growth and metastasis. Four GO terms, GO:0016525 
(negative regulation of angiogenesis), GO:0008285 (negative regulation of cell proliferation), 
GO:0042981 (regulation of apoptosis), and GO:0006917 (induction of apoptosis), describe the 
biological processes in which angiogenesis inhibitors may participate. GFSST found 402 proteins 
with D-value < 0.05 for the above four GO terms. For convenience, a web page for this example 
has been linked from the Biological Process GO IDs (example) at the GFSST main page.
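
When assembling a set of GO terms for a query, identifiers copied from different sources 
may vary in spacing or punctuation. The sketch below tidies such a list into the 
canonical form, "GO:" followed by seven digits. It is an illustration only: the helper 
names normalize_go_id and normalize_go_ids are hypothetical, and which input variants 
the GFSST form itself accepts is not assumed here.

```python
import re

# Canonical GO identifiers are "GO:" followed by seven digits.
# This pattern also tolerates a missing colon or stray spaces after "GO".
GO_ID_RE = re.compile(r"^GO[:\s]*(\d{7})$", re.IGNORECASE)


def normalize_go_id(raw):
    """Return the identifier as 'GO:nnnnnnn' or raise ValueError."""
    match = GO_ID_RE.match(raw.strip())
    if match is None:
        raise ValueError("not a GO identifier: %r" % raw)
    return "GO:" + match.group(1)


def normalize_go_ids(raw_ids):
    """Normalize a list of identifiers, dropping duplicates but keeping order."""
    seen = set()
    result = []
    for raw in raw_ids:
        go_id = normalize_go_id(raw)
        if go_id not in seen:
            seen.add(go_id)
            result.append(go_id)
    return result


if __name__ == "__main__":
    # The four Biological Process terms from the angiogenesis example.
    terms = ["GO:0016525", "GO:0008285", "GO:0042981", "GO:0006917"]
    print(normalize_go_ids(terms))
```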

Example of Output Format

The following is a single unit (block) of the output generated after searching with BRCA1_HUMAN.

The first line consists of the Entry Name of the target gene (VHL_HUMAN), the GO category, and the 
D-value between the source gene (BRCA1_HUMAN) and the target gene. The Entry Name is linked to the 
Expert Protein Analysis System (ExPASy). The subsequent lines are paired GO terms and their D-values. 
The first column of GO terms is derived from BRCA1_HUMAN, the second GO column from VHL_HUMAN. All GO 
terms are linked to the EMBL-EBI QuickGO database. 

VHL_HUMAN Biological Process : 0.0094604740
GO0016567 GO0016567 0.0000000000
GO0045786 GO0045786 0.0000000000
GO0006357 GO0000122 0.0002758720
GO0042981 GO0006916 0.0003009510
GO0042127 GO0008285 0.0005015860
GO0006978 GO0006950 0.0041380860
GO0046600 GO0000902 0.0067714140
GO0045739 GO0006508 0.0096429960
GO0006359 GO0045597 0.0635133610
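
For users who save the results as plain text, a block like the one above can be read 
into a small data structure. This is a minimal sketch that assumes the saved text has 
exactly the layout shown here (a header line ending in " : D-value", then rows of two GO 
terms and a D-value). The names parse_block, ResultBlock, and TermPair are hypothetical 
and are not part of GFSST.

```python
from typing import List, NamedTuple


class TermPair(NamedTuple):
    source_go: str   # GO term from the source protein (first column)
    target_go: str   # GO term from the target protein (second column)
    d_value: float


class ResultBlock(NamedTuple):
    target: str      # Entry Name of the target protein
    category: str    # GO category, e.g. "Biological Process"
    d_value: float   # overall D-value on the header line
    pairs: List[TermPair]


def parse_block(text):
    """Parse one output block laid out as in the example above."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    header, rows = lines[0], lines[1:]
    # Header: "<Entry Name> <GO category> : <D-value>"
    left, _, d_text = header.rpartition(":")
    target, _, category = left.strip().partition(" ")
    pairs = []
    for row in rows:
        source_go, target_go, row_d = row.split()
        pairs.append(TermPair(source_go, target_go, float(row_d)))
    return ResultBlock(target, category.strip(), float(d_text), pairs)


SAMPLE = """\
VHL_HUMAN Biological Process : 0.0094604740
GO0016567 GO0016567 0.0000000000
GO0045786 GO0045786 0.0000000000
GO0006357 GO0000122 0.0002758720
GO0042981 GO0006916 0.0003009510
GO0042127 GO0008285 0.0005015860
GO0006978 GO0006950 0.0041380860
GO0046600 GO0000902 0.0067714140
GO0045739 GO0006508 0.0096429960
GO0006359 GO0045597 0.0635133610
"""

if __name__ == "__main__":
    block = parse_block(SAMPLE)
    # Pairs where both proteins carry the identical GO term.
    shared = [p for p in block.pairs if p.source_go == p.target_go]
    print(block.target, block.category, block.d_value, len(shared))
```

Rows in which both columns hold the same GO term correspond to shared functions. In this 
block those are the two rows with a D-value of 0.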


For questions and comments, please send e-mail to Peisen Zhang at zhangpeis@mail.nih.gov