Instructions for users

We have developed the Gene Function Similarity Search Tool (GFSST), a search engine web server (URL: http://gfsst.nci.nih.gov), for mining the human and mouse proteomes using the functional information provided within the Gene Ontology (GO), a structured controlled vocabulary. The user can search with a protein identifier (either a protein accession number or an Entry name) from the UniProt database, or with one or several GO terms. The search can be carried out against the human or the mouse proteome. The search engine can also be used as a retrieval tool for the human and mouse UniProt databases.

Search Options

For a given protein with known functions, or a set of protein functions, and a given proteome, the GFSST search engine can find not only proteins in the proteome with the same functions (shared GO terms) but also proteins with similar functions (GO terms that are not necessarily shared but are very similar).

Search with a source protein

Given a source protein, GFSST can retrieve functionally related proteins from a specific proteome. To illustrate this process, we use human BRCA1 as an example. GFSST provides two ways to input a source protein: by accession number or by Entry name. BRCA1 has Entry name BRCA1_HUMAN and accession number P38398. Type P38398 into the protein accession number box, then click the "Submit Query" button. Usually within seconds, and at most in about a minute, the results of GFSST's query of the human proteome (the default database) will be presented. All proteins in the human proteome with a D-value < 0.05 (the default threshold value) against the source protein in all three GO categories (Biological Process, Cellular Component, Molecular Function) will be listed. The D-value, similar to the P-value, varies from 0 to 1. D stands for distribution. To retrieve proteins that have the same functions as the source protein, set the D-value to 0.
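The effect of the D-value threshold can be illustrated with a minimal sketch. This is not the server's implementation: the function name `filter_hits`, the hit list, and the EXAMPLE entry names are hypothetical, and the comparison is assumed to be inclusive so that a threshold of 0 keeps exact matches (how the server handles the boundary is not stated here).

```python
from typing import List, Tuple

DEFAULT_THRESHOLD = 0.05  # default D-value threshold described above

Hit = Tuple[str, float]  # (Entry name, D-value against the source protein)


def filter_hits(hits: List[Hit], threshold: float = DEFAULT_THRESHOLD) -> List[Hit]:
    """Keep hits whose D-value does not exceed the threshold, best match first.

    Assumption: the comparison is inclusive, so threshold=0 returns only
    proteins with a D-value of exactly 0 (the same functions).
    """
    if not 0.0 <= threshold <= 1.0:
        raise ValueError("D-value threshold must lie between 0 and 1")
    kept = [(entry, d) for entry, d in hits if d <= threshold]
    return sorted(kept, key=lambda hit: hit[1])


# Hypothetical hit list for one GO category. Only the VHL_HUMAN value comes
# from the output example in this document; the other entries are invented.
hits = [
    ("EXAMPLE_A_HUMAN", 0.0),
    ("VHL_HUMAN", 0.0094604740),
    ("EXAMPLE_B_HUMAN", 0.2),
]

default_hits = filter_hits(hits)               # default threshold of 0.05
exact_hits = filter_hits(hits, threshold=0.0)  # same functions only
```

Lowering the threshold narrows the result list toward proteins whose GO terms are closest to those of the source protein.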
Accession numbers and properly formatted names should be copied and pasted from UniProt, one of the major protein databases (for convenience, there is a direct link to UniProt from the main page). An Entry name is not a universal ID. If the source protein is a mouse protein, Mouse should be selected from the pull-down menu to the left of the Entry name box before the Entry name is entered. For example, given the source protein BRCA1_MOUSE, select Mouse from the pull-down input menu and type BRCA1_MOUSE into the Entry name box. Users may also search the mouse proteome by selecting mouse instead of human from the Search Against pull-down menu near the bottom, above the Input D-value box.

Search by GO terms

GFSST also provides a robust retrieval tool for genes and gene products based on their associated GO terms. It is more flexible than searching by a source protein. Users can design their search targets with a single GO term or a combination of GO terms. Given GO terms, GFSST retrieves gene products from a specific proteome (mouse or human). Hence, users effectively design their search target by gene functions (GO terms). GFSST can find genes or gene products that match those exact functions and/or similar functions. Click on the links provided for each of the GO categories to try out the system with three real cases, or set up your own searches with GO terms or groups of terms. There is the option of searching with GO terms in multiple categories, but be aware that this is currently set up as independent (And/Or) functions. In particular, if only one GO term is selected in each major category, the search may return a very large number of hits, most of them specific to one of the terms but not the others.

Examples of GFSST Search using GO terms:

Glucose metabolism is a critical pathway in the study of diabetes.
Target proteins with the glucose metabolism function (GO term GO:0006006) in the Biological Process category will thus be relevant to diabetes. GFSST delivered 19 exact matches for this GO term in the UniProt human proteome. Insulin is at the top of the list of search results.

DNA damage response, signal transduction by P53 class mediator (GO:0030330) is a very important function in cancer research. A GFSST search for this GO term in the Biological Process category finds no exact matches to gene products in the UniProt human proteome. Four proteins, including BRCA1_HUMAN, are matched by GO:0006978 (a child of GO:0030330) with a D-value of 0.0000376. P53_HUMAN and P73_HUMAN are matched by GO:0008630 with a D-value of 0.0000752370. It is not surprising that there are no exact matches for the term GO:0030330, since more specific child terms below GO:0030330 are assigned to these gene products instead.

Users can also query with a set of GO terms to obtain target proteins for the biological functions of interest. Angiogenesis inhibitors are designed to stop tumors from developing a blood supply, a prerequisite for tumor growth and metastasis. Four GO terms, GO:0016525 (negative regulation of angiogenesis), GO:0008285 (negative regulation of cell proliferation), GO:0042981 (regulation of apoptosis), and GO:0006917 (induction of apoptosis), describe the biological processes in which angiogenesis inhibitors may participate. GFSST found 402 proteins with a D-value < 0.05 for the above four GO terms. For convenience, a web page for this example has been linked from the Biological Process GO IDs (example) at the GFSST main page.

Example of Output Format

The following is a single unit (block) of the output generated after searching with BRCA1_HUMAN. The first line gives the Entry Name of the target gene (VHL_HUMAN), the GO category, and the D-value between the source gene (BRCA1_HUMAN) and the target gene. The Entry Name is linked to the Expert Protein Analysis System (ExPASy).
The subsequent lines are paired GO terms and their D-values. The first column of GO terms is derived from BRCA1_HUMAN, the second GO column from VHL_HUMAN. All GO terms are linked to the EMBL-EBI QuickGO database.

VHL_HUMAN Biological Process : 0.0094604740
GO0016567 GO0016567 0.0000000000
GO0045786 GO0045786 0.0000000000
GO0006357 GO0000122 0.0002758720
GO0042981 GO0006916 0.0003009510
GO0042127 GO0008285 0.0005015860
GO0006978 GO0006950 0.0041380860
GO0046600 GO0000902 0.0067714140
GO0045739 GO0006508 0.0096429960
GO0006359 GO0045597 0.0635133610

For questions and comments, please send e-mail to Peisen Zhang at zhangpeis@mail.nih.gov
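For users who copy results out of the browser, the output block described under Example of Output Format can be read into a small data structure. The following is a minimal sketch, assuming the block has been saved as plain text with one record per line; the names `OutputBlock`, `GoPair`, and `parse_block` are hypothetical and not part of GFSST.

```python
import re
from dataclasses import dataclass
from typing import List


@dataclass
class GoPair:
    source_go: str   # GO term from the source protein (first column)
    target_go: str   # GO term from the target protein (second column)
    d_value: float   # D-value for this pair of GO terms


@dataclass
class OutputBlock:
    target_entry: str  # Entry Name on the first line
    category: str      # GO category on the first line
    d_value: float     # D-value on the first line
    pairs: List[GoPair]


# First line: "<Entry Name> <GO category> : <D-value>"
HEADER_RE = re.compile(r"^(\S+)\s+(.+?)\s*:\s*([0-9.]+)$")
# Following lines: "<GO term> <GO term> <D-value>", GO terms written as in the example
PAIR_RE = re.compile(r"^(GO\d{7})\s+(GO\d{7})\s+([0-9.]+)$")


def parse_block(text: str) -> OutputBlock:
    """Parse one output block given as plain text, one record per line."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    if not lines:
        raise ValueError("empty block")
    header = HEADER_RE.match(lines[0])
    if header is None:
        raise ValueError(f"unrecognized first line: {lines[0]!r}")
    pairs = []
    for ln in lines[1:]:
        match = PAIR_RE.match(ln)
        if match is None:
            raise ValueError(f"unrecognized GO pair line: {ln!r}")
        pairs.append(GoPair(match.group(1), match.group(2), float(match.group(3))))
    return OutputBlock(header.group(1), header.group(2), float(header.group(3)), pairs)


# The block from the example above, as plain text.
SAMPLE = """
VHL_HUMAN Biological Process : 0.0094604740
GO0016567 GO0016567 0.0000000000
GO0045786 GO0045786 0.0000000000
GO0006357 GO0000122 0.0002758720
GO0042981 GO0006916 0.0003009510
GO0042127 GO0008285 0.0005015860
GO0006978 GO0006950 0.0041380860
GO0046600 GO0000902 0.0067714140
GO0045739 GO0006508 0.0096429960
GO0006359 GO0045597 0.0635133610
"""

block = parse_block(SAMPLE)
# GO terms shared by both proteins appear as identical pairs.
shared_terms = [p.source_go for p in block.pairs if p.source_go == p.target_go]
```

In this example the shared GO terms are the pairs whose two columns are identical, which are also the pairs listed with a D-value of 0.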