FuSeS predict function of unknown proteins - RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY

FUSES PREDICT FUNCTION OF UNKNOWN PROTEINS

Sponsoring Institution

National Institute of Food and Agriculture

Project Status

TERMINATED

Funding Source

HATCH

Reporting Frequency

Annual

Accession No.

0228906

Grant No.

(N/A)

Project No.

NJ01150

Proposal No.

(N/A)

Multistate No.

(N/A)

Program Code

(N/A)

Project Start Date

Jul 1, 2012

Project End Date

Jun 30, 2017

Grant Year

(N/A)

Project Director
Bromberg, Y.

Recipient Organization
RUTGERS, THE STATE UNIVERSITY OF NEW JERSEY
3 RUTGERS PLZA
NEW BRUNSWICK,NJ 08901-8559

Performing Department
Biochemistry & Microbiology

Non Technical Summary
Extensive sequencing is increasingly leaving us with a potential goldmine of "not-quite-useful-yet" data. Existing tools that gauge the specifics of protein sequence-encoded functionality lack both functional specificity and sensitivity. Our long-range goal is to develop novel computational methods that can accurately identify residues specifically relevant for protein function, reveal the range of functions encoded by a given meta-, gen-, ex-, transcript-ome, and reduce the experimental work needed to describe the variome-mediated functional differences. The particular objective of this proposal is to elucidate the molecular functional make-up of the currently available meta-proteomes using per-residue functional significance predictions to profile/cluster protein sequences. We suggest a three-tiered approach: first, predict functional sequence (FuSe) residues using in silico mutagenesis. Then, align experimentally annotated orthologues and close paralogues to extract FuSe Signatures (FuSeS) - sets of FuSe residues representative of specific protein functions. Use FuSeS to gauge functions of available un-annotated sequences. Finally, cluster the pool of FuSeS-less proteins to build a collection of new FuSeS defining yet unknown functions. Note that while all aims are logically interconnected, the project is modular and the completion of one aim/module is sufficiently independent of the others; i.e. data collection and proofs of concept for all aims may proceed simultaneously, while modules unsuccessful in development may be replaced. The expected outcome of this project is a database of protein functional signatures (FuSeS) and a corresponding computational tool (FuSeScanner) for protein function annotation from sequence alone. The innovation of FuSeS is in building on established methodologies to create a completely unique, novel and highly informative functional view of existing proteome data. 
This is also highly significant as FuSeS can be used to generate new experimentally testable hypotheses about the make up and optimization of specific microbiotic environments. Understanding the human gut microbiome, for instance, could facilitate research in the directions of food safety and childhood obesity. Deeper knowledge of electron transfer chains in microbial communities could potentially aid research and development of sustainable energy resources. FuSeS will also be easily, cheaply, and accurately applicable to any -omic study requiring a more succinct annotation of protein function.

Animal Health Component

(N/A)

Research Effort Categories

Basic

50%

Applied

10%

Developmental

40%

Classification

Knowledge Area (KA)	Subject of Investigation (SOI)	Field of Science (FOS)	Percent
201	7299	1000	25%
201	7299	2080	25%
304	7299	1000	25%
304	7299	2080	25%

Knowledge Area
201 - Plant Genome, Genetics, and Genetic Mechanisms; 304 - Animal Genome;

Subject Of Investigation
7299 - Research equipment and methods, general/other;

Field Of Science
2080 - Mathematics and computer sciences; 1000 - Biochemistry and biophysics;

Keywords

sequence analysis

function annotation

protein annotation

computational mutagenesis

functional signatures

Goals / Objectives
#1. Identify protein functional residues using in silico mutagenesis. Defining a functional role for every residue facilitates understanding of protein molecular/cellular functions. However, quantitative ways are lacking to compare the importance of a "catalytic residue" to a "binding hot-spot", and to differentiate these from structurally important core positions. Existing tools define functional sequence (FuSe) residue importance using evolutionary/family and structural information. We will develop a generic way to score the importance of FuSe positions by evaluating the functional effects of their modifications. Experimentally probing each residue is costly and not feasible on a large scale. We have previously built an accurate/efficient sequence-based computational tool (SNAP) to predict functional effects of nsSNPs. A 20-mer vector of SNAP scores describes the functional impact of having every possible residue at a given sequence position. We will use these vectors to create a single well-calibrated index of each position's functional importance. #2. Identify functional sequence signatures (FuSeS). The ability to recognize FuSeS, sets of residues that together are responsible for a specific function, elucidates functional mechanisms. Sequence or structural families/domains are often used to indirectly approximate functionality of unknown proteins. FuSeS will instead characterize molecular function directly. Using multiple sequence alignments we will identify sets of FuSe residues consistently present in all proteins of experimentally annotated similar function. We will quantitatively evaluate the validity of FuSeS by their ability to precisely identify other functionally related sequences. For qualitative inference support, we'll also map FuSeS onto available protein 3D structures and manually inspect their correlation with the probable locations of functionality-defining sites. 
We'll extend the FuSeS concept to sequences not yet studied experimentally for further elucidation of molecular functions of the human proteome. #3. Build a database of FuSeS and implement a protein sequence-scanning tool. FuSeS and their functional annotations, gleaned from source sequences, will be stored in a freely available database - DFuSeS. Additionally, FuSeScanner, a methodology for searching protein sequences for known FuSeS, will be developed and tested. We will use FuSeScanner to annotate all protein sequences predicted from existing metagenomes. The resulting annotations will be stored in DFuSes and referenced to the corresponding FuSe Signatures.
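As an illustration of Goal #1, the sketch below collapses a 20-value vector of per-substitution scores into simple per-position summaries. It is a minimal sketch under stated assumptions, not the SNAP-based index the project proposes: the score scale (higher means a stronger predicted functional effect), the `threshold` parameter, and the function name `position_importance` are all hypothetical.

```python
from statistics import mean

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def position_importance(scores, native, threshold=0.0):
    """Summarize one sequence position from its 20 substitution scores.

    scores: dict mapping each of the 20 amino acids to a score
            (assumed scale: higher = stronger predicted effect).
    native: the residue actually present at this position.
    Returns (mean score of the 19 non-native substitutions,
             fraction of those substitutions scoring above threshold).
    """
    substitutions = [scores[aa] for aa in AMINO_ACIDS if aa != native]
    mean_effect = mean(substitutions)
    fraction_effect = sum(s > threshold for s in substitutions) / len(substitutions)
    return mean_effect, fraction_effect
```

A position at which nearly every substitution is predicted to change function would score high on both summaries, while a tolerant position would score low; a calibrated index would additionally require fitting against annotated functional sites.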

Project Methods
We'll evaluate the accuracy of annotating functional sites based on experimental mutagenesis data. We'll use pre-defined explicitly functional sites, structurally functional sites, and negative controls (all other residues) to evaluate FuSe (functional sequence) predictive abilities of experimental mutagenesis. We'll further collect protein per-residue functional annotations from various sources to create a training/testing set. We have previously developed SNAP, a neural network based method for evaluating the functional effect of single amino acid substitutions. We'll train a standard feed-forward neural network to recognize FuSe residues from SNAP vectors of all possible substitutions plus additional features, such as conservation scores and predicted secondary structure. We'll vary feature sets, AI algorithms, and their parameters to create multiple prediction methods. These will be compared to each other, to the residue conservation baseline, and to other methods using our testing sets and external data of the same type. We expect to develop a fast/accurate in silico mutagenesis-based method for computing well-calibrated scores representative of per-residue FuSe propensities. We'll extract from SwissProt all enzymes with EC numbers. The sequences will be split into subsets at every digit of all assigned ECs. We'll also extract proteins with manually assigned GO "molecular function" terms. First, we'll build MSAs for all full four-digit EC number subsets and all GO subsets using MAFFT. For all protein groups we'll extract FuSeS: 1) Select from the MSA columns where >90% of sequences contain a FuSe. 2) Eliminate from this set columns where at least one FuSe residue isn't SNAP conserved. 3) Randomly split the sequence set into ten subsets and iteratively recombine nine of ten, leaving a different one out each time; build MSAs of the ten 90%-subsets and repeat steps 1,2 on each. 
4) Define FuSeS per protein as a set of residues in columns selected by all ten MSAs to which it belongs. For each protein of known function, we expect to determine a set of functional sequence signatures (FuSeS) defining said function. Newly created FuSeS will be stored in the DFuSeS database together with their corresponding functions. We will use these FuSeS to annotate functions of all UniProt proteins as follows (FuSeScanner): 1) PSI-BLAST all sequences used in FuSeS building against UniProt, 2) validate each hit against its query sequence, checking SNAP conservation of all FuSeS residues, 3) in hits where the FuSeS are not conserved, re-align sequences first using ClustalW and, failing that, Smith-Waterman, and 4) transfer function annotation from the query sequence to the hit if all FuSeS residues are conserved. We'll evaluate FuSeScanner's ability to annotate the function of SwissProt proteins and that of the electronically GO-annotated proteins. Numerous recent metagenomic studies have produced tens of millions of protein sequences. We will use FuSeScanner to annotate functions of as many proteins predicted from these data as possible. We also expect to build DFuSeS, a database of FuSeS, their functional annotations, and references to known protein sequences.
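The FuSeS extraction steps 1, 3, and 4 above can be sketched as follows. This is a simplified illustration under stated assumptions, not the project's implementation: the input `flags` (one list of booleans per aligned sequence, True where that sequence has a predicted FuSe residue in the column) is a hypothetical representation, the SNAP-conservation filter of step 2 is omitted, and a single alignment is reused for every subset instead of rebuilding an MSA each time.

```python
import random

def fuse_columns(flags, min_fraction=0.9):
    """Step 1: columns where more than min_fraction of sequences have a FuSe."""
    if not flags:
        return set()
    n_seqs = len(flags)
    n_cols = len(flags[0])
    selected = set()
    for col in range(n_cols):
        count = sum(1 for row in flags if row[col])
        if count / n_seqs > min_fraction:
            selected.add(col)
    return selected

def per_protein_signature_columns(flags, n_folds=10, min_fraction=0.9, seed=0):
    """Steps 3 and 4: leave one fold out at a time, then keep, for each
    protein, the columns selected in every subset that contains it."""
    order = list(range(len(flags)))
    random.Random(seed).shuffle(order)
    folds = [set(order[i::n_folds]) for i in range(n_folds)]

    # Columns selected when each fold in turn is held out.
    fold_columns = []
    for held_out in folds:
        subset = [flags[i] for i in range(len(flags)) if i not in held_out]
        fold_columns.append((held_out, fuse_columns(subset, min_fraction)))

    signatures = {}
    for i in range(len(flags)):
        columns = None
        for held_out, selected in fold_columns:
            if i in held_out:
                continue  # protein i is not part of this subset
            columns = set(selected) if columns is None else columns & selected
        signatures[i] = columns if columns is not None else set()
    return signatures
```

Requiring agreement across the leave-one-out subsets is what makes a signature robust to the particular set of sequences that happened to be aligned; columns that pass the 90% cutoff only for some subsets are dropped.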

Progress 07/01/12 to 06/30/17

Outputs
Target Audience: The main target audience for our work is research scientists working on molecular functions of microbes and microbiomes. We also hope to have reached scientists interested in evaluating functional effects of genomic variants regardless of organism. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? In the course of this project one Ph.D. student has defended his thesis on the basis of his fusion and mi-faser work. One more Ph.D. thesis in collaboration with a group at Technical University of Munich is in progress. Multiple undergrads were involved in developing various components of the tools (including two undergrads, whose names appear on resulting publications). How have the results been disseminated to communities of interest? All tools were presented at relevant conferences and appeared in corresponding publications. All are also available online for free to non-commercial entities at services.bromberglab.org. What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? Overall, we accomplished all goals of this project -- but with different names of resources and slightly different implementations of the main ideas. A summary of the accomplishments is below: 1. We built a predictor, fun-TRP (functional toggle rheostat predictor), for finding functional residues in proteins. We identified two types of functional residues -- those that are critical for function (on/off toggle switches) and those that are necessary for fine-tuning the activity (rheostats). We showed that our approach facilitates prediction of functional effects of variants in protein sequences, allows tracing the evolutionary history of molecular functions, and facilitates targeted synthetic biology construction of specific functions in enzymes. 2. Rather than focusing on functional signatures of proteins, we opted for building a predictor of functional similarity between proteins. This predictor was optimized to compare short peptides (which may or may not carry functional signatures) to full-length proteins in order to identify the functional similarity of the peptides' parent proteins. This approach makes our method, faser (functional annotation of sequence reads), applicable to the analysis of metagenomic data, potentially leading to discovery of new functions. 3. We built fusionDB, a database of functional clusters of all bacterial proteins, by first comparing them for functional similarity and then clustering the network of functionally similar proteins. This approach allows for mapping new microorganism sequences into a framework of older/already-annotated bacteria to better understand molecular functionality encoded in the new genomes.
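The two-stage idea in item 3 (pairwise functional similarity, then clustering of the resulting network) can be illustrated with a toy example. The report does not state which clustering algorithm fusionDB uses; the sketch below substitutes plain connected components (via union-find) as a stand-in, and the function name and input format are hypothetical.

```python
def cluster_similarity_network(proteins, similar_pairs):
    """Group proteins into clusters, treating each functionally similar
    pair as an edge and each connected component as one cluster."""
    parent = {p: p for p in proteins}

    def find(p):
        # Follow parent links to the cluster representative (path halving).
        while parent[p] != p:
            parent[p] = parent[parent[p]]
            p = parent[p]
        return p

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = {}
    for p in proteins:
        clusters.setdefault(find(p), set()).add(p)
    # Sort clusters by their members for a reproducible order.
    return sorted(clusters.values(), key=lambda c: sorted(c))
```

Under this simplification, a newly sequenced protein that is similar to any member of an existing cluster joins that cluster, which is one way to picture how new genomes are mapped into a framework of already-annotated bacteria.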

Publications

Type: Journal Articles Status: Published Year Published: 2017 Citation: Zhu, C., Mahlich, Y., Miller, M., Bromberg, Y. (2017) fusionDB: assessing microbial diversity and environmental preferences via functional similarity networks. Database: http://services.bromberglab.org/fusiondb. Nucleic Acids Research, gkx1060.
Type: Journal Articles Status: Published Year Published: 2017 Citation: Miller, M., Bromberg, Y., Swint-Kruse, L. (2017) Variant effect prediction methods fail for rheostat positions. Nat Scientific Reports 7, 41329
Type: Journal Articles Status: Published Year Published: 2017 Citation: Zhu, C., Miller, M., Marpaka, S., Vaysberg, P., Ruhlemann, M.C., Wu, G., Heinsen, F.A., Tempel, M., Zhao, L., Lieb, W., Franke, A., Bromberg, Y. (2017) Functional sequencing read annotation for high precision microbiome analysis. Nucleic Acids Research, gkx1209

Progress 10/01/15 to 09/30/16

Outputs
Target Audience: A wide range of microbiologists interested in accessing functionality of their proteins and researchers interested in type III secretion systems and bacterial pathogenicity. We also organized a microbiology workshop at the Pacific Symposium on Biocomputing, which likely attracted a diverse audience. Changes/Problems: We have decided to move away from the functional-signature approach to functional annotation of proteins and to switch to a more reliable alignment-based technique, which allows for exploration of microbiome contents. This is a major change in algorithms, but conceptually very similar to the approach originally described in the proposal. What opportunities for training and professional development has the project provided? There is a graduate student and two visiting research scientists working on the project. The graduate student will use this work to defend his thesis at the end of this academic year. The research scientists will be applying the developed tools in their home labs in Germany, helping advance their own research. How have the results been disseminated to communities of interest? Via published journal articles and the organized workshop mentioned above (as well as via our lab website). What do you plan to do during the next reporting period to accomplish the goals? We will continue with development and refinement of the software and the database.

Impacts
What was accomplished under these goals? We have demonstrated that there are two types of functionally important positions in the protein sequence -- rheostats and toggles. We have demonstrated that these have different effects on function when mutated, and that recognizing the type of position prior to further analysis is both necessary and not currently accomplished by any available method. We are in the process of building a computational classifier to recognize these types of positions. We are also continuing with our efforts to build read-based annotation software and to create a database of functions available to microbes and microbiomes in our training sets. These will be accessible via an interface that is also currently under development.

Publications

Type: Journal Articles Status: Published Year Published: 2016 Citation: Greene, C.S., Foster, J.A., Stanton, B.A., Hogan, D.A., Bromberg, Y. (2016) Computational approaches to study microbes and microbiomes. Pac Symp Biocomput 2016: 557-567
Type: Journal Articles Status: Published Year Published: 2016 Citation: Goldberg, T., Rost, B., Bromberg, Y. (2016). Computational prediction shines light on type III secretion origins. Nat Sci Rep, 6, 34516.
Type: Journal Articles Status: Published Year Published: 2016 Citation: Rost, B., Radivojac, P, Bromberg, Y. (2016) Protein function in precision medicine: deep understanding with machine learning. FEBS Lett. 590(15): p. 2327-41

Progress 10/01/14 to 09/30/15

Outputs
Target Audience: Microbiologists interested in analyzing functional similarity of bacteria as encoded by bacterial proteins. Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? Via a publication and the database referenced above. What do you plan to do during the next reporting period to accomplish the goals? We will continue with development and refinement of the database.

Impacts
What was accomplished under these goals? We built a new system for recognizing functional similarity of microorganisms on the basis of the proteins that their genomes encode. While not a direct goal for this project, it is a useful contribution to the goals. This system is located at http://bromberglab.org/databases/fusiondb and was developed by Chengsheng Zhu at the Bromberg Lab. fusionDB currently contains 1,374 bacterial genomes annotated with temperature, oxygen requirement and habitat metadata. Bacterial proteins are assigned to functional clusters, and each organism is thus mapped to a set of functions. fusionDB allows searching for organism names combined with specific environment metadata, and creates an XML-formatted network file (fusion+ network, C. Zhu et al.) of selected organisms that can be visualized with Gephi. In fusion+ networks, organisms cluster on the basis of shared function, which allows for exploration of the specific environmental factor(s) that drive microbial diversification. It offers a fast and simple way to detect pan-function (all functions of a set of organisms) and core-function (all functions found in every organism of a set) repertoires, as well as traces of horizontal gene transfer.
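The pan-function and core-function repertoires defined above correspond to a set union and a set intersection over the organisms' function sets. The sketch below illustrates only those definitions; the organism names, function identifiers, and function names are made up, and this is not fusionDB's own code or API.

```python
def pan_functions(organisms):
    """All functions found in at least one organism of the set."""
    return set().union(*organisms.values())

def core_functions(organisms):
    """Functions found in every organism of the set."""
    function_sets = [set(functions) for functions in organisms.values()]
    if not function_sets:
        return set()
    return set.intersection(*function_sets)

# Hypothetical mapping of organisms to functional-cluster identifiers.
example = {
    "organism_A": {"f1", "f2", "f3"},
    "organism_B": {"f2", "f3", "f4"},
    "organism_C": {"f3", "f5"},
}
print(sorted(pan_functions(example)))
print(sorted(core_functions(example)))
```

Functions that appear in the pan repertoire but in only a few distantly related organisms are the kind of pattern one might then inspect as a possible trace of horizontal gene transfer.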

Publications

Type: Journal Articles Status: Published Year Published: 2015 Citation: Zhu C, Delmont TO, Vogel TM, Bromberg Y (2015) Functional Basis of Microorganism Classification. PLoS Comput Biol 11(8): e1004472. doi: 10.1371/journal.pcbi.1004472

Progress 10/01/13 to 09/30/14

Outputs
Target Audience: Protein scientists, interested in in-depth annotation of their proteins or peptides Changes/Problems: We found more promising directions in identifying per-residue protein activity from structural alignments. We hope to be able to transition the information we learn from these alignments into a sequence-based annotation, but we have not yet developed a framework as to how this will be done. This new direction will contribute to the overall goals of the project, but may prevent us from completing the aims as described. We are also experiencing issues recruiting students/post-docs interested in and qualified for computational tool development as described in this project. We hope to overcome this challenge by reaching out to bioinformatics communities world-wide, but, at this point, most of the project work is accomplished via collaborations with other labs. What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? Results have been disseminated via publications, conferences, and via an informal report at the VarI-SIG'14 (former SNP-SIG), a meeting co-organized by PI-Bromberg and attended by >100 computational biologists in Boston, Jul 2014. What do you plan to do during the next reporting period to accomplish the goals? Nothing Reported

Impacts
What was accomplished under these goals? Further work was done in development of the metric (SaHLe) for measuring functional similarity of protein structural folds. Once functionally similar folds are identified, we hope to be able to transfer this knowledge into sequence-based identification of active sites and motifs. This work does not follow the originally outlined project goals but, we believe, will contribute significantly to the ultimate goal of the project -- per-residue classification of protein activity. Additionally, we implemented the visualization of SNAP predictions for whole-protein in silico mutagenesis as part of the PredictProtein pipeline. The pipeline is freely available to all academic researchers and could be used for in-depth study of specific proteins.

Publications

Type: Journal Articles Status: Published Year Published: 2014 Citation: Yachdav, Guy, et al. "PredictProtein - an open resource for online prediction of protein structural and functional features." Nucleic acids research (2014): gku366.

Progress 10/01/12 to 09/30/13

Outputs
Target Audience: Nothing Reported Changes/Problems: Nothing Reported What opportunities for training and professional development has the project provided? Nothing Reported How have the results been disseminated to communities of interest? Results have been disseminated via publications in major journals, and via an informal report at the SNP-SIG'13, a meeting co-organized by PI-Bromberg and attended by >100 computational biologists in Berlin, Jul 2013. What do you plan to do during the next reporting period to accomplish the goals? We will continue developing our methods for identifying protein functional sites.

Impacts
What was accomplished under these goals? Significant progress was made for the goals of aim 1 -- we have developed a method for annotating protein functional site residues. We also started working towards identifying the functional site signatures -- so far in structure and limited to metal-containing proteins, but moving towards broader sequence-based annotation.

Publications

Type: Journal Articles Status: Published Year Published: 2013 Citation: Senn, S., Nanda, V., Falkowski, P., and Bromberg, Y. (2013). Function-based assessment of structural similarity measurements using metal co-factor orientation. Proteins.
Type: Journal Articles Status: Published Year Published: 2013 Citation: Bromberg, Y., Kahn, P.C., and Rost, B. (2013). Neutral and weakly nonneutral sequence variants may define individuality. Proc Natl Acad Sci U S A 110, 14255-14260.
Type: Book Chapters Status: Published Year Published: 2013 Citation: Bromberg, Y. (2013). Chapter 15: disease gene prioritization. PLoS Comput Biol 9, e1002902.
Type: Journal Articles Status: Published Year Published: 2013 Citation: Bromberg, Y. (2013). Building a Genome Analysis Pipeline to Predict Disease Risk and Prevent Disease. J Mol Biol 425, 3993-4005.
Type: Journal Articles Status: Published Year Published: 2013 Citation: Hecht, M., Bromberg, Y., and Rost, B. (2013). News from the Protein Mutability Landscape. J Mol Biol.
Type: Journal Articles Status: Published Year Published: 2013 Citation: Capriotti, E., Altman, R.B., and Bromberg, Y. (2013). Collective judgment predicts disease-associated single nucleotide variants. BMC Genomics 14 Suppl 3, S2.

Progress 10/01/11 to 09/30/12

Outputs
OUTPUTS: As a direct result of my research for the FuSeS project, the annual SNP-SIG meeting that I co-chair (2012 edition, Long Beach, CA) had a specific subfocus on the impact of mutations in functionally significant sites. PARTICIPANTS: Yana Bromberg -- Principal Investigator, Rutgers. Chris Rusnak -- Undergraduate student, Rutgers; data extraction and data model building. Burkhard Rost -- informal collaborator, Technical University of Munich. Christian Schaefer -- co-supervised Ph.D. student in the lab of Dr. Burkhard Rost, Technical University of Munich; data collection, manuscript write-up. TARGET AUDIENCES: Nothing significant to report during this reporting period. PROJECT MODIFICATIONS: Nothing significant to report during this reporting period.

Impacts
The first stage of the FuSeS approach requires identifying functionally significant residues in protein sequences. We started this project with the assumption that, in human proteins, sequence positions often altered in disease are likely functionally significant. We further looked for a way to quantify and augment this significance with computational methods (which could be further used in non-human organisms). Relatively few human mutations, reported in databases such as OMIM, PMD and Swiss-Prot, are experimentally assessed for their disease-causing impact. We made computational predictions of functional impact of disease-annotated mutations and non-disease variants collected from these databases using SNAP, our in-house neural network based annotation program. Most disease-causing mutations were predicted to severely impact protein function. In fact, the raw prediction scores for disease-causing mutations were higher than the scores for the function-altering data set originally used for SNAP development. This finding means that, on average, disease mutations are severely deleterious to the affected protein function, as indicated by the absolute value of the SNAP score. The neutral SNAP score enrichment in the set of nsSNPs not currently linked to disease suggests that strong disease associations among these are unlikely. Our research suggests that (1) disease-causing nsSNPs are well identified by SNAP, even though it was developed to predict the impact of mutations on protein function, and (2) screening naturally-occurring variants (whether in wild-type or phenotypically different organisms) for high SNAP scores can serve as an initial filter for functionally significant sites. Using a gold standard set of functional site residues, extracted from the Catalytic Site Atlas and Swiss-Prot, we will compare the computational site predictions made in this manner to the approach we proposed in the initial write-up and are currently developing (i.e. in silico mutagenesis). 
We expect our in silico mutagenesis technique to outperform this baseline method. (Note that SNAP scores for all mutants used in this study are available via SNPdbe, a database developed previously at http://www.rostlab.org/services/snpdbe/)

Publications

Schaefer, C., Bromberg, Y., Achten, D., and Rost, B. (2012) Disease-related mutations predicted to impact protein function. BMC Genomics 13 Suppl 4, S11