CORUM: the comprehensive resource of mammalian protein complexes—2009
Abstract
CORUM is a database that provides a manually curated repository of experimentally characterized protein complexes from mammalian organisms, mainly human (64%), mouse (16%) and rat (12%). Protein complexes are key molecular entities that integrate multiple gene products to perform cellular functions. The new CORUM 2.0 release encompasses 2837 protein complexes, offering the largest and most comprehensive publicly available dataset of mammalian protein complexes. The CORUM dataset is built from 3198 different genes, representing ~16% of the protein coding genes in humans. Each protein complex is described by a protein complex name, subunit composition and function, as well as the literature reference that characterizes the respective protein complex. Recent developments include mapping of functional annotation to Gene Ontology terms as well as cross-references to Entrez Gene identifiers. In addition, a ‘Phylogenetic Conservation’ analysis tool was implemented that analyses the potential occurrence of orthologous protein complex subunits in mammals and other selected groups of organisms. This allows one to predict the occurrence of protein complexes in different phylogenetic groups. CORUM is freely accessible at http://mips.helmholtz-muenchen.de/genre/proj/corum/index.html.
INTRODUCTION
Major cellular processes such as the cell cycle, protein folding and protein degradation depend on the activity of protein complexes (1). To date there are no reliable estimates of the total number of protein complexes in cells (the complexome), but data from single-cell organisms provide evidence that more than half of the gene products are involved in the formation of protein complexes (2). With the advent of protein network analyses, the topological properties of protein complexes gave rise to terms such as ‘party hubs’ (3) or ‘multi-interface hubs’ (4). Bioinformatics analysis of protein–protein interaction (PPI) datasets revealed that protein complex subunits are more strongly conserved in evolution and show a higher essentiality than proteins from other interactions (4).
As the most comprehensive PPI and protein complex data are available for Saccharomyces cerevisiae, most of these discoveries were obtained using data from yeast. In addition to a manually curated dataset of protein complexes (5), tag-based high-throughput approaches were performed in order to define the yeast complexome (6,7). The importance of manually curated gold standards was demonstrated by analyses of results from high-throughput experiments. An assessment of different high-throughput technologies for the analysis of PPIs showed that each method, depending on its physicochemical constraints, captures interactions for different subsets of proteins (8). Thus, none of the existing methods is able to detect all interactions, and even the combined dataset of five different methods missed ~40% of experimentally validated, manually curated interactions (9).
For mammals, no comprehensive high-throughput dataset of protein complexes is publicly available. Bioinformatics analyses of the mammalian complexome can be performed either by using artificially constructed protein complexes (10) or data from manually curated datasets (11,12). In 2008, the CORUM database was introduced as the most comprehensive catalogue of mammalian protein complexes. All data are manually curated, including information on protein complex subunits and methods of purification as well as additional information such as functional annotation using the Functional Catalogue (FunCat) annotation scheme (13), stoichiometry of the subunits and information about association with diseases (14). Analyses of the CORUM dataset have shown (i) that mammalian protein complexes are most frequently composed of 3 or 4 different subunits and (ii) that proteins tend to be reused in up to 53 protein complexes (15).
The CORUM dataset has been used for a number of bioinformatics analyses, such as the analysis of tissue-specific expression of proteins (16), the functional interpretation of high-throughput data (17–19) and the prediction of interactions of protein regions (20). In addition, the dataset contributes to web-based applications such as the DICS database of functional modules (21) and the COFECO tool for composite function annotation (22).
CORUM Release 2.0 presents a significantly extended dataset that now consists of 2837 mammalian protein complexes. In addition to the existing cross-references, the dataset was mapped to Entrez Gene identifiers and to Gene Ontology (GO) terms for functional annotation. In order to enable more specific search results in comments, the content is now distributed into the three sections ‘Disease Comment’, ‘Functional Comment’ and ‘Subunit Comment’. Finally, an analysis tool was implemented that allows one to predict the occurrence of orthologous protein complex subunits in other mammals and other groups of organisms. The ‘Phylogenetic Conservation’ tool provides an estimate of the probability that a protein complex occurs in the analysed model organisms. CORUM is freely accessible at http://mips.helmholtz-muenchen.de/genre/proj/corum/index.html.
NEW DEVELOPMENTS
Dataset and cross-references
In 2008 the CORUM dataset consisted of 1750 mammalian protein complexes, mainly characterized in human (60%), mouse (14%) and rat (14%) (14). While the relative abundance of the respective organisms has remained stable in the meantime, the number of protein complexes had grown to 2837 by September 2009. Thus, CORUM is the largest publicly available set of mammalian protein complexes.
However, compared to data from single-cell organisms, only a minor fraction of the mammalian complexome has been discovered so far. Data from yeast have shown that at least 45% of the gene complement functions as subunits in protein complexes (14). Considering that no comprehensive mammalian high-throughput dataset is available to date, the fraction of genes known to be involved in protein complex formation is comparatively low. These estimates are based on the number of different complex subunit genes divided by an assumed total of 20 488 human genes (14). Compared to the first CORUM release, this fraction increased moderately from 12% (2400 genes) to 16% (3198 genes). The slow increase in novel protein complex subunits presumably results from the reuse of subunits (Figure 1) in different protein complexes or protein complex variants (15). Data from the CORUM ‘Core Set’ (see below) show that proteins such as ‘integrin beta-1’, ‘histone deacetylase 1’ and ‘histone deacetylase 2’ appear in 54, 51 and 38 different human protein complexes, respectively. Multiple reutilization of protein complex subunits is found particularly in large protein complex families such as SNARE complexes and ubiquitin E3 ligases. The ubiquitin E3 ligase subunit ring-box 1 (Rbx1), for example, was identified in 35 complexes.
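The coverage estimate quoted above is a simple ratio, and the following minimal Python sketch reproduces the arithmetic. The function and variable names are illustrative assumptions; only the gene counts are taken from the text.

```python
# Total number of human genes assumed in the text (20 488).
HUMAN_GENE_COUNT = 20488


def complex_gene_fraction(subunit_genes: int, total_genes: int = HUMAN_GENE_COUNT) -> float:
    """Return the percentage of genes that encode protein complex subunits."""
    return 100.0 * subunit_genes / total_genes


# Gene counts reported for the first release and for release 2.0.
first_release = complex_gene_fraction(2400)
release_2_0 = complex_gene_fraction(3198)

print(f"first release: {first_release:.1f}%")  # rounds to the 12% quoted in the text
print(f"release 2.0:   {release_2_0:.1f}%")    # rounds to the 16% quoted in the text
```

Because the denominator is a fixed gene count, the fraction only grows when a newly annotated complex contributes subunit genes that are not yet in the dataset.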
In addition to the complete dataset, CORUM now offers a reduced ‘Core Dataset’ for download and searches that avoids redundancies in the data. Thoroughly investigated protein complexes such as ‘SNARE complex (Vamp2, Snap25, Stx1a, Cplx1)’, ‘succinyl-CoA synthetase, ADP-forming’ and ‘cytochrome bc1-complex (EC 1.10.2.2), mitochondrial’ are characterized in more than one mammalian organism. Due to the close phylogenetic relationship between mammals, it can be assumed that the majority of protein complexes are conserved in mammals. However, as the aim of CORUM is to provide a comprehensive dataset, evolutionarily conserved protein complexes from different organisms (interologous protein complexes) are also annotated in CORUM. To some extent this introduces redundancies, but it also demonstrates that the same protein complex in fact exists in different organisms.
Results from several laboratories that investigated the same protein complex but characterized the molecule with a different composition are another source of dataset expansion. These differences may stem from experimental conditions that yield different complex compositions depending on the stringency of the experimental procedures, or from the different biomaterial that was used for the characterization. Bioinformatics applications such as machine learning require non-redundant datasets. For these users we offer the ‘Core Set’ of 2084 distinct protein complexes. For this set, only one representative of each group of interologous protein complexes or protein complex variants was selected. We chose protein complexes that were thoroughly characterized, preferably from Homo sapiens.
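The selection of one representative per redundancy group can be sketched as follows. This is a simplified illustration, not the curation procedure itself: the record layout, the group labels and the example entries are hypothetical, and the ‘thoroughly characterized’ criterion, which relies on curator judgement, is not modelled. Only the stated preference for Homo sapiens is applied.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComplexEntry:
    complex_id: int  # hypothetical identifier
    group: str       # hypothetical label shared by interologs and variants
    organism: str


def select_core_set(entries):
    """Keep one representative per group, preferring Homo sapiens entries."""
    representatives = {}
    for entry in entries:
        current = representatives.get(entry.group)
        if current is None:
            # First entry seen for this group becomes the provisional representative.
            representatives[entry.group] = entry
        elif current.organism != "Homo sapiens" and entry.organism == "Homo sapiens":
            # A human entry replaces a non-human representative.
            representatives[entry.group] = entry
    return list(representatives.values())


# Invented example records, for illustration only.
entries = [
    ComplexEntry(1, "SNARE complex", "Rattus norvegicus"),
    ComplexEntry(2, "SNARE complex", "Homo sapiens"),
    ComplexEntry(3, "SNARE complex", "Mus musculus"),
    ComplexEntry(4, "cytochrome bc1-complex", "Bos taurus"),
]
core_set = select_core_set(entries)
print([(e.complex_id, e.organism) for e in core_set])
```

Groups without a human entry keep the first entry encountered, so no group is dropped from the reduced set.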
Annotation of protein complex subunits in CORUM is performed with UniProt identifiers. Since some users prefer identifiers from Entrez Gene, we mapped the UniProt identifiers to the corresponding Entrez Gene identifiers. This was done in a semi-automatic procedure using the CRONOS tool (23). CRONOS allows the mapping of identifiers, gene names and protein names from various resources such as UniProt, RefSeq and Ensembl. In total, 4310 out of 4336 distinct subunits (99%) could be mapped to corresponding Entrez Gene identifiers. For 26 gene products, such as MRPS15 from Bos taurus or SPCS1 from Canis familiaris, no corresponding identifier was available in Entrez Gene.
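The bookkeeping behind such an identifier mapping can be illustrated with a short Python sketch. It does not use CRONOS; the lookup table and all identifiers below are hypothetical placeholders, and the sketch only shows how mapped and unmapped subunits might be separated and the coverage computed.

```python
def map_identifiers(uniprot_ids, uniprot_to_entrez):
    """Split identifiers into those with and without an Entrez Gene mapping."""
    mapped = {}
    unmapped = []
    for uid in uniprot_ids:
        gene_id = uniprot_to_entrez.get(uid)
        if gene_id is None:
            unmapped.append(uid)
        else:
            mapped[uid] = gene_id
    return mapped, unmapped


# Hypothetical placeholder identifiers, not real accessions.
lookup = {"UNIPROT_A": 101, "UNIPROT_B": 102, "UNIPROT_C": 103}
subunits = ["UNIPROT_A", "UNIPROT_B", "UNIPROT_C", "UNIPROT_D"]

mapped, unmapped = map_identifiers(subunits, lookup)
coverage = 100.0 * len(mapped) / len(subunits)
print(f"{len(mapped)} of {len(subunits)} mapped ({coverage:.0f}%), unmapped: {unmapped}")
```

Keeping the unmapped identifiers in a separate list makes it easy to review them manually, which is the kind of step a semi-automatic procedure would need.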
CORUM is the only resource of protein complexes that includes functional annotation of the molecules. We use the FunCat annotation scheme for protein and protein complex function characterization (13). The FunCat has been used for genome annotation and was also frequently used for the analysis of protein networks and high-throughput experiments (13). The hierarchical structure of the FunCat allows browsing for protein complexes with particular cellular functions or localizations. In recent years, GO has become a widely used tool for the annotation of eukaryotic genomes (24). In contrast to the FunCat annotation scheme, the GO is constructed as a set of directed acyclic graphs, allowing more than one parent class per child (24). In order to enable bioinformatics analyses of protein complexes based on GO terms, the new CORUM release provides a mapping from FunCat to GO. The mapping was performed using the table that is available for download at http://www.geneontology.org/external2go/mips2go. As a result, 840 FunCat categories could be mapped to 896 GO terms. Manual inspection of 100 randomly chosen protein complexes revealed that FunCat categories and GO terms are in agreement.
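A mapping table of this kind can be read into a dictionary with a few lines of Python. The sketch below assumes the usual external2go line layout (‘prefix:identifier name > GO:term name ; GO:identifier’, with comment lines starting with ‘!’) and the prefix ‘MIPS_funcat:’; both are assumptions about the file, and the sample lines are illustrative rather than copied from the actual mips2go table.

```python
from collections import defaultdict


def parse_external2go(lines, prefix="MIPS_funcat:"):
    """Map each external category identifier to a list of GO identifiers."""
    mapping = defaultdict(list)
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith("!"):
            continue  # skip blank lines and comments
        left, sep, right = line.partition(" > ")
        if not sep or not left.startswith(prefix):
            continue  # not a mapping line in the assumed layout
        tokens = left[len(prefix):].split()
        if not tokens:
            continue
        funcat_id = tokens[0]
        # The GO identifier follows the last " ; " separator.
        _, sep2, go_id = right.rpartition(" ; ")
        go_id = go_id.strip()
        if sep2 and go_id.startswith("GO:"):
            mapping[funcat_id].append(go_id)
    return dict(mapping)


# Illustrative sample lines in the assumed layout.
sample = [
    "! illustrative lines only",
    "MIPS_funcat:01 METABOLISM > GO:metabolic process ; GO:0008152",
    "MIPS_funcat:10.03 cell cycle > GO:cell cycle ; GO:0007049",
    "MIPS_funcat:10.03 cell cycle > GO:cell division ; GO:0051301",
]
funcat_to_go = parse_external2go(sample)
print(funcat_to_go)
```

Storing a list of GO identifiers per FunCat category reflects the fact that the mapping is not one-to-one, as the counts of 840 categories and 896 terms indicate.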
Some valuable information concerning protein complexes cannot be covered by systematic annotation schemes but is represented as free-text comments in CORUM. This information includes protein complex composition (e.g. additional subunits of unknown identity), association of protein complexes with diseases or particular functional properties. In the first CORUM release this additional information was collected in a single comment field. In CORUM release 2.0 this content is now distributed among the three comment fields ‘Functional Comment’, ‘Disease Comment’ and ‘Subunit Comment’. This separation allows users to search within a particular type of information or, for instance, to use the wildcard ‘_’ to retrieve all 223 protein complexes with information about disease association.
Phylogenetic analysis of protein complexes
Subunits of protein complexes such as ribosomes and chaperonins are highly conserved in evolution. Besides ribosomal RNAs, subunits from complexes such as RNA polymerases (25) and F1-ATPases (26) were used in the early days of sequence-based phylogenetic analysis. Two years ago, a novel endeavor was started to investigate highly conserved proteins for phylogenetic analysis, based on data from 191 sequenced genomes (27). The analysis revealed 31 highly conserved proteins that allow a new reconstruction of the tree of life; 28 of these proteins are known to be protein complex subunits (23 ribosomal proteins). To enable scientists to obtain some insight into the phylogenetic conservation of subunits, the ‘Phylogenetic Conservation’ tool has been developed for comparative proteome analysis. The ‘Phylogenetic Conservation’ tool is based on sequence similarity data that are obtained from the SIMAP database (28). The Similarity Matrix of Proteins (SIMAP) database provides a comprehensive and up-to-date dataset of the pre-calculated sequence similarity matrix and sequence-based features such as InterPro domains for all proteins contained in the major public sequence databases.
The ‘Phylogenetic Conservation’ tool in CORUM presents the similarity of the protein complex subunits to proteins from other organisms as tables (Figure 2). By default, comparisons to 18 organisms are shown: four mammals (Homo sapiens, Mus musculus, Rattus norvegicus and Bos taurus), three other vertebrates (Xenopus laevis, Danio rerio and Takifugu rubripes), two invertebrates (Caenorhabditis elegans and Drosophila melanogaster), two plants (Arabidopsis thaliana and Oryza sativa), three fungi (Neurospora crassa, Schizosaccharomyces pombe and S. cerevisiae), one slime mold (Dictyostelium discoideum) and three prokaryotes (Thermoplasma acidophilum, Escherichia coli and Bacillus subtilis). In addition to the numerical values, the degree of protein sequence similarity is colour coded.
Protein complexes appear to be conserved among phylogenetically related organisms, whereas the degree of conservation separates organisms of distant phylogenetic relation, depending on the respective complex. This can be illustrated with the proteasome and three proteasome activator complexes. Two subunits of the ‘Modulator (PA700-dependent proteasome activator)’ are highly conserved (red colour) within all eukaryotes, whereas the ‘PA28 gamma complex’ is only highly conserved within vertebrates (Figure 2). Finally, high conservation of the ‘11 S REG complex’ is restricted to the four mammalian proteomes. The 20 S proteasome complex is a high-molecular-weight protease that is essential for protein degradation in mammals. Results of the ‘Phylogenetic Conservation’ tool reveal weak similarity to proteins in the archaeon T. acidophilum (Supplementary Figure S1). In fact, an archetype of proteasomes consisting of only two different subunits is frequently found in archaea (29). On the other hand, sophisticated proteasome architectures such as the 26 S proteasome, or the presence of several proteasome activator complexes, are not found in Thermoplasma or other prokaryotes. In agreement with this observation, the three above-mentioned proteasome activators show no similarity to proteins from Thermoplasma (Figure 2). Results of the ‘Phylogenetic Conservation’ tool can be retrieved for single protein complexes or for multiple complexes that were found by one of the search options in CORUM.
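The reasoning behind such a conservation table can be sketched in Python: a complex is a candidate for occurrence in an organism only if every subunit has a sufficiently similar protein there. The scoring scheme actually used by the tool is not described here, so the similarity scale (0 to 1), the threshold and all values below are hypothetical assumptions for illustration.

```python
def organisms_with_all_subunits(similarity, threshold=0.5):
    """Return organisms in which every subunit has a best hit at or above threshold.

    `similarity` maps subunit name -> {organism: best-hit similarity score}.
    A missing organism is treated as "no similar protein found" (score 0.0).
    """
    organisms = sorted({org for hits in similarity.values() for org in hits})
    return [
        org
        for org in organisms
        if all(hits.get(org, 0.0) >= threshold for hits in similarity.values())
    ]


# Invented scores for a two-subunit complex.
similarity = {
    "subunit A": {"Mus musculus": 0.95, "Danio rerio": 0.70, "Saccharomyces cerevisiae": 0.10},
    "subunit B": {"Mus musculus": 0.90, "Danio rerio": 0.40},
}

print(organisms_with_all_subunits(similarity))                 # strict threshold
print(organisms_with_all_subunits(similarity, threshold=0.3))  # relaxed threshold
```

Requiring all subunits to pass the threshold is a deliberately conservative choice; a real analysis might instead report the fraction of conserved subunits, since a complex variant can exist with a reduced subunit set.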
FUNDING
ERA-NET PathoGenoMics ‘Pathomics’ grant (BMBF) (to B.W.). Funding to open access charge: Helmholtz Center Munich (Helmholtz Zentrum München).
Conflict of interest statement. None declared.
Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
Read article at publisher's site: https://doi.org/10.1093/nar/gkp914