ChEMBL: a large-scale bioactivity database for drug discovery
Abstract
ChEMBL is an Open Data database containing binding, functional and ADMET information for a large number of drug-like bioactive compounds. These data are manually abstracted from the primary published literature on a regular basis, then further curated and standardized to maximize their quality and utility across a wide range of chemical biology and drug-discovery research problems. Currently, the database contains 5.4 million bioactivity measurements for more than 1 million compounds and 5200 protein targets. Access is available through a web-based interface, data downloads and web services at: https://www.ebi.ac.uk/chembldb.
INTRODUCTION
A wealth of information on the activity of small molecules and biotherapeutics exists in the literature, and access to this information can enable many types of drug discovery analysis and decision making. For example: selection of tool compounds for probing targets or pathways of interest; identification of potential off-target activities of compounds which may pose safety concerns, explain existing side effects or suggest new applications for old compounds; analysis of structure–activity relationships (SAR) for a compound series of interest; assessment of in vivo absorption, distribution, metabolism, excretion and toxicity (ADMET) properties; or construction of predictive models for use in selection of compounds potentially active against a new target (1–5). Access to this information is especially important due to the continuing shift in fundamental research on disease mechanisms from the private to public sectors.
However, bioactivity data published in journal articles are usually found in a relatively unstructured format and are labour-intensive to search and extract. For example, compound structures are frequently depicted only as images and are not therefore searchable, protein targets may be referred to by a variety of synonyms or abbreviations with no reference to any database identifiers, and details of assays may be included only in Supplementary Data or by reference to previous publications. In addition, there is not currently any requirement by most journals for authors to deposit small-molecule assay results in public databases (as is the case for sequence, protein structure and gene expression data). Historically, therefore, the majority of the published small-molecule bioactivity data have only been readily available via commercial products.
In recent years, in response to the growing demand for open access to this kind of information, a variety of public-domain bioactivity resources have been developed. PubChem BioAssay (6) and ChemBank (7) are large archival databases providing access to millions of deposited screening results, typically from high-throughput screening (HTS) experiments. A number of other primary resources extract bioactivity data from literature, but tend to focus on particular thematic areas, and primarily on binding affinity information. For example, BindingDB contains quantitative binding constants manually extracted from publications, focusing chiefly on proteins that are considered to be potential drug targets (8). PDBBind (9), Binding MOAD (10) and AffinDB (11) contain binding affinity information for protein–ligand complexes found in the Protein Data Bank (PDB, 12). The PDSP Ki database stores screening data from the National Institute of Mental Health's Psychoactive Drug Screening Program (13). BRENDA provides binding constants for enzymes (14), IUPHAR contains ligand information for receptors and ion channels (15), while GLIDA (16) and GPCRDB (17) provide information specifically for G-protein-coupled receptors. Other resources, such as DrugBank, provide detailed annotation around the properties and mechanism of action of approved drugs (18).
However, in order to make informed decisions in drug discovery or to design experiments to probe a biological system with chemical tools, it is important to consider not only the binding affinity of a compound for its target, but also its selectivity, efficacy in functional assays or disease models and the likely ADMET properties of the compound. Moreover, researchers need the ability to intelligently cluster relevant information across studies (based on target or compound similarities, for example) and to integrate data across therapeutic areas. ChEMBL aims to bridge this gap by providing broad coverage across a diverse set of targets, organisms and bioactivity measurements reported in the scientific literature, together with a range of user-friendly search capabilities (19).
DATA CONTENT
Data extraction and curation
The core activity data in the ChEMBL database are manually extracted from the full text of peer-reviewed scientific publications in a variety of journals, such as Journal of Medicinal Chemistry, Bioorganic & Medicinal Chemistry Letters and Journal of Natural Products. The set of journals covered is by no means comprehensive, but is selected to capture the greatest quantity of high-quality data in a cost- and time-effective manner. From each publication, details of the compounds tested, the assays performed and any target information for these assays are abstracted.
Structures for small molecules are drawn in full, in machine-readable format, despite the structure often being provided as a scaffold and a list of R-group substituents, or referred to only by name in the original publication. Information about the particular salt form tested is also captured, where available, although this is often inconsistent in the literature. Before loading to the database, structures are checked for potential problems (e.g. unusual valence on atoms, incorrect structures for common compounds/drugs), then normalized according to a set of rules, to ensure consistency in representation (e.g. compounds are neutralized by protonating/deprotonating acids and bases to ensure a formal charge of zero where possible). Preferred representations are used for certain common groups (e.g. sugars, sulphoxides and nitroxides). Some chemical structures are typically only reported in an implicit format, and this is checked and assigned on registration—for example, the stereochemistry of the steroid framework is invariably not published, but is assumed to be that of the naturally occurring configuration, unless otherwise defined. Common salts are also stripped from the extracted compounds, and both the salt form and the parent compound are entered into the database. This allows users to view all data associated with the same parent compound, regardless of the salt form tested, while still retaining the salt information if required.
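The parent–salt relationship described above can be illustrated with a minimal sketch. It assumes structures are supplied as dot-separated SMILES strings and uses a small, hand-picked set of counter-ion fragments; both the fragment list and the function names are hypothetical, and the curation rules actually used by ChEMBL are not reproduced here. A real pipeline would compare canonical structures with a cheminformatics toolkit rather than raw strings.

```python
from typing import Dict, List, Tuple

# Illustrative counter-ion fragments only (an assumption of this sketch);
# the actual ChEMBL salt list is not given in the text.
COMMON_COUNTERIONS = {"Cl", "Br", "[Na+]", "[K+]", "[Cl-]", "[Br-]"}


def split_parent_and_salt(smiles: str) -> Tuple[str, List[str]]:
    """Split a dot-separated SMILES string into (parent, counter-ions)."""
    fragments = smiles.split(".")
    counterions = [f for f in fragments if f in COMMON_COUNTERIONS]
    parents = [f for f in fragments if f not in COMMON_COUNTERIONS]
    if not parents:
        # Every fragment matched the counter-ion list (e.g. sodium chloride):
        # treat the whole input as its own parent instead of returning nothing.
        return smiles, []
    return ".".join(parents), counterions


def group_by_parent(tested_forms: List[str]) -> Dict[str, List[str]]:
    """Map each parent structure to every tested form that shares it."""
    groups: Dict[str, List[str]] = {}
    for form in tested_forms:
        parent, _ = split_parent_and_salt(form)
        groups.setdefault(parent, []).append(form)
    return groups
```

Grouping the tested forms by parent in this way is what allows data for a free base and its hydrochloride salt to be viewed together while the salt information is still kept on each record.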
Details of all types of assays performed are extracted from each publication, including binding assays (measuring the interaction of the compound with the target directly), functional assays (often measuring indirect effects of the compound on a pathway, system or whole organism) and ADMET assays (measuring pharmacokinetic properties of the compound, interaction with key metabolic enzymes or toxic effects on cells/tissues). The activity endpoints measured in these assays are recorded with the values and units as given in the paper, but for the purposes of improved querying are also standardized, where possible, to convert them to a preferred unit of measurement for a given activity type (e.g. IC50 values are displayed in nM, rather than µM/mM/M, and half-life is reported in hours rather than minutes/days/weeks). This enables the user to more easily compare data across different assays.
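The unit standardization step can be sketched as follows. Only the two examples mentioned above (IC50 in nM, half-life in hours) are covered; the conversion tables, the activity-type labels and the function name are assumptions of this sketch rather than the rules applied in ChEMBL. Values whose type or unit is not recognized are returned exactly as published.

```python
# Conversion factors to the preferred unit for each activity type.
# These tables are illustrative assumptions, not the ChEMBL rule set.
TO_NANOMOLAR = {"M": 1e9, "mM": 1e6, "uM": 1e3, "µM": 1e3, "nM": 1.0}
TO_HOURS = {"min": 1 / 60, "h": 1.0, "day": 24.0, "week": 168.0}

# activity type -> (preferred unit, factor table)
PREFERRED = {
    "IC50": ("nM", TO_NANOMOLAR),
    "T1/2": ("h", TO_HOURS),
}


def standardize(activity_type, value, units):
    """Return (standard_value, standard_units), or the input if no rule applies."""
    if activity_type not in PREFERRED:
        return value, units
    target_unit, factors = PREFERRED[activity_type]
    if units not in factors:
        return value, units
    return value * factors[units], target_unit
```

Keeping the published value and units alongside the standardized pair, as the text describes, means that a conversion rule can later be corrected without loss of the original data.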
To maximize the utility of bioactivity data, the targets of assays need to be represented robustly and consistently, in a manner independent of the various adopted names and synonyms used across different sources. To this end, detailed manual annotation of targets is carried out within ChEMBL. Where the intended molecular target of an assay is reported in a publication, this information is extracted, together with associated details of the relevant organism in which the assay was performed (or the organism from which the protein/cell-line was derived for an in vitro assay). Target assignments are carefully checked by our curators, and corrected where necessary, then further annotated where any ambiguity exists. For example, for an in vitro binding assay, it is often possible to determine the precise protein target with which the compound is interacting and assign a single relevant protein to the assay. However, in other cases this may not be possible. For example, an assay may describe interaction of a compound with a target which is known to be a protein/biomolecular complex (e.g. ribosomes, GABA-A receptors or integrins). In this case, several protein subunits may be assigned to the assay, but a ‘complex’ field in the database is used to record the fact that these proteins are associated as a specific protein complex. In other cases, the assay performed may not allow elucidation of the precise protein subtypes with which a compound is interacting (e.g. cell/tissue-based assays where several closely related subtypes of the protein are likely to be expressed, or those reported prior to the discovery of particular receptor/enzyme subtypes). Again, the assay may therefore be mapped to each of the possible protein targets, but a ‘multi’ field in the database records the fact that it is not clear whether the compound is interacting non-specifically with all of these proteins, and consequently less confidence should be placed in these assignments.
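One way to picture the three mapping cases (single protein, protein complex, ambiguous subtypes) is the small record type below. The class, its field names and the labels it returns are hypothetical; they only mirror the ‘complex’ and ‘multi’ fields mentioned in the text and do not reproduce the actual database columns.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class AssayTargets:
    """Hypothetical record of the proteins assigned to one assay."""

    assay_id: str
    protein_accessions: Tuple[str, ...]  # protein accessions assigned to the assay
    is_complex: bool = False  # proteins are subunits of one defined complex
    is_multi: bool = False    # precise subtype could not be determined

    def assignment_kind(self) -> str:
        """Summarize how much weight the protein assignment can bear."""
        if self.is_complex:
            return "protein complex"
        if self.is_multi:
            return "ambiguous subtypes (lower confidence)"
        if len(self.protein_accessions) == 1:
            return "single protein"
        return "unclassified"
```

A user filtering for high-confidence protein assignments would, under these assumptions, keep only records whose kind is "single protein" or "protein complex".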
In many cases, such as whole organism-based phenotypic assays, it is not possible to unambiguously determine the protein target that is responsible for the observed effect of the compound. In these cases, the assay will be mapped to a ChEMBL target representing the non-molecular system on which an effect is observed. For example, an assay measuring the cytotoxicity of a compound against the human breast carcinoma-derived MCF-7 cells would be mapped to a ChEMBL cell-line target representing MCF-7. An in vitro assay measuring inhibition of growth of Mycobacterium tuberculosis would be mapped to a ChEMBL organism target representing M. tuberculosis. This allows users to easily retrieve information about other assays performed on the same systems, even though the underlying mechanism of action of the compounds might be different. Protein targets are further classified into a manually curated family hierarchy, according to nomenclature commonly used by drug discovery scientists (e.g. ligand-based classification of G-protein-coupled receptors, division of enzymes into proteases/kinases/phosphatases etc.), and organisms are classified according to a simplified subset of the NCBI taxonomic structure (20). This also allows data to be queried at a higher level (e.g. for all protein kinases or Mycobacterium species).
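Querying at a higher level of the family hierarchy amounts to walking up a tree from each target's assigned class. The sketch below assumes a toy parent table; the class names echo the examples in the text, but the tree itself, the target labels and the function names are hypothetical.

```python
# Toy family hierarchy: child class -> parent class (None marks a root).
# The tree is an illustrative assumption, not the curated ChEMBL classification.
PARENT = {
    "Enzyme": None,
    "Kinase": "Enzyme",
    "Protease": "Enzyme",
    "Phosphatase": "Enzyme",
    "Protein kinase": "Kinase",
}


def is_within(family, ancestor):
    """Return True if `family` equals `ancestor` or lies beneath it in the tree."""
    while family is not None:
        if family == ancestor:
            return True
        family = PARENT.get(family)
    return False


def targets_in_family(target_families, ancestor):
    """List the targets whose assigned class falls under `ancestor`, sorted by name."""
    return sorted(
        target for target, family in target_families.items()
        if is_within(family, ancestor)
    )
```

The same upward walk applies to the simplified taxonomy, for instance to collect every organism target classified under a given genus.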
Approved drugs
In addition to literature-derived data, ChEMBL also contains structures and annotation for Food and Drug Administration (FDA)-approved drugs. For each drug entry, any information about approved products (from the FDA Orange Book, 21) including their trade names, administration routes, dosage information and approval dates is included in the database. Structures for novel drug ingredients are manually assigned, and for protein therapeutics, amino-acid sequences may be included, where available. Each drug is also annotated according to the drug type (synthetic small molecule, natural product-derived small molecule, antibody, protein, oligosaccharide, oligonucleotide, inorganic etc.), whether there are ‘black box’ safety warnings associated with a product containing that active ingredient, whether it is a known prodrug, the earliest approval date (where known), whether it is dosed as a defined single stereoisomer or racemic mixture, and whether it has a therapeutic application (as opposed to imaging/diagnostic agents, additives etc.). This information allows users of the bioactivity data to assess whether a compound of interest is an approved drug and is therefore likely to have an advantageous safety/pharmacokinetic profile or be orally bioavailable, for example.
Data model
The most important entity types within ChEMBL are documents (from which the data are extracted), compounds (substances that have been tested for their bioactivity), assays (individual experiments that have been carried out to assess bioactivity) and targets (the proteins or systems being monitored by an assay). Each extracted document has a list of associated compound records and assays, which are linked together by activities (i.e. the actual endpoints measured in the assay with their types, values and units).
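The relationships between these entities can be sketched with two small record types. The class and field names are hypothetical and far simpler than the released schema; the point is only that an activity links one compound record to one assay, and that both must belong to the same document.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Activity:
    """One measured endpoint linking a compound record to an assay."""

    compound_record_id: int
    assay_id: int
    activity_type: str
    value: float
    units: str


@dataclass
class Document:
    """A publication with its compound records, assays and activities."""

    doc_id: int
    compound_record_ids: List[int] = field(default_factory=list)
    assay_ids: List[int] = field(default_factory=list)
    activities: List[Activity] = field(default_factory=list)

    def add_activity(self, activity: Activity) -> None:
        # An activity may only join a compound record and an assay
        # that were both extracted from this document.
        if activity.compound_record_id not in self.compound_record_ids:
            raise ValueError("compound record does not belong to this document")
        if activity.assay_id not in self.assay_ids:
            raise ValueError("assay does not belong to this document")
        self.activities.append(activity)
```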
Since the same compound may have been tested multiple times in different assays and publications, the compound records are collapsed, based on structure, to form a non-redundant molecule dictionary. Standard IUPAC Chemical Identifier (InChI) representation (22) is used to determine which compounds are identical and which should be registered with new identifiers. In general, the Standard InChI representation distinguishes stereoisomers of a compound, but not tautomers. Hence, stereoisomers will be given unique identifiers, but tautomers will not. We have taken the view that although a particular binding interaction may involve a specific ionization or tautomer state, in a biological assay, there will be interconversion and equilibration across these forms. A smaller number of protein therapeutics and substances with undefined structures are also included in the molecule dictionary. Additional information is then associated with the entries in this table, such as structure representations, calculated properties, synonyms, drug information and parent–salt relationships.
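Collapsing compound records into a non-redundant dictionary reduces, in essence, to keying on the Standard InChI string. The sketch below assumes the InChI strings have already been computed by an external tool; the function name and the sequential integer identifiers are hypothetical.

```python
def build_molecule_dictionary(compound_records):
    """Collapse (record_id, standard_inchi) pairs into one entry per distinct InChI.

    Returns two dicts: InChI -> molecule id, and record id -> molecule id.
    """
    inchi_to_molecule = {}
    record_to_molecule = {}
    for record_id, standard_inchi in compound_records:
        if standard_inchi not in inchi_to_molecule:
            # First time this structure is seen: register a new identifier.
            inchi_to_molecule[standard_inchi] = len(inchi_to_molecule) + 1
        record_to_molecule[record_id] = inchi_to_molecule[standard_inchi]
    return inchi_to_molecule, record_to_molecule
```

Because the comparison is made on the whole string, two records that differ only in the stereochemistry layers of the InChI receive different identifiers, whereas tautomers that yield the same Standard InChI are merged, consistent with the behaviour described above.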
Similarly, a non-redundant target dictionary stores a list of the proteins, nucleic acids, subcellular fractions, cell-lines, tissues and organisms that are subject to investigation. Each assay is then mapped to one or more entries in this dictionary, as described above. Further information, such as protein family classification, is also linked to the target dictionary.
Each record in the documents, assays, molecule dictionary and target dictionary tables is assigned a unique ChEMBL identifier, which takes the form of a ‘CHEMBL’ prefix followed immediately by an integer (e.g. CHEMBL25 is the compound aspirin, CHEMBL210 is the human β-2 adrenergic receptor target). In addition, external identifiers are recorded for these entities where possible. For example, all small molecule compounds with defined structures are assigned ChEBI identifiers (23) and Standard InChIKeys. Where data are taken from other resources, the original identifiers are also retained (e.g. SIDs and AIDs for PubChem substances and assays, HET codes for PDBe ligands). PubMed identifiers or Digital Object Identifiers (DOIs) are stored for documents (20,24). Protein targets are represented by primary accessions within the UniProt protein database (25), and organism targets are assigned NCBI taxonomy IDs and names.
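The identifier format described above (a ‘CHEMBL’ prefix followed immediately by an integer) is simple to validate. The helper below is a minimal sketch; the function name is hypothetical, and whether leading zeros or lower-case prefixes should be accepted is not stated in the text, so the pattern takes the strict reading for case and accepts any run of digits.

```python
import re

# 'CHEMBL' followed immediately by one or more digits, and nothing else.
CHEMBL_ID_PATTERN = re.compile(r"CHEMBL(\d+)")


def parse_chembl_id(identifier):
    """Return the integer part of a ChEMBL identifier, or raise ValueError."""
    match = CHEMBL_ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError("not a ChEMBL identifier: %r" % (identifier,))
    return int(match.group(1))
```

Note that the identifier alone does not reveal the entity type: CHEMBL25 is a compound and CHEMBL210 is a target, so the type has to be looked up rather than inferred from the string.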
Data exchange
The PubChem BioAssay database accepts deposited results from many laboratories and screening centres and contains a large quantity of data, primarily from high-throughput screening experiments, measuring inhibition of a target by large numbers of compounds, often at a single compound concentration. As such, the number of data points within PubChem is huge, but a very small proportion of these represent compounds with dose–response measurements (e.g. IC50, Ki) of an affinity likely to specifically perturb a biological system. In contrast, due to extraction from published pharmacology and drug discovery literature, ChEMBL contains a much larger proportion of active compounds identified using dose–response assays. The number of distinct protein targets with dose–response measurements recorded in PubChem is also smaller (currently fewer than 700 proteins, compared with more than 4000 in ChEMBL). However, there are also novel protein targets in PubChem that are not currently included in ChEMBL. Therefore, the types of data reported in PubChem and ChEMBL are distinct and complementary. To maximize the utility of the two data sets to users, we have worked with the PubChem group to develop a data exchange mechanism. All ChEMBL literature-derived assays are now included in PubChem BioAssay, and a subset of PubChem assays (confirmatory and panel assays with dose–response endpoints) has been loaded into ChEMBL. Assays from PubChem are clearly marked, both on the ChEMBL interface and in the database, allowing users to easily determine where data have originated, while benefiting from being able to retrieve more information through a single point of access.
Similarly, compounds and binding measurements from ChEMBL have been integrated into BindingDB, and the reciprocal incorporation of BindingDB data into ChEMBL is planned.
Current content
Release 11 of the ChEMBL database contains information extracted from more than 42500 publications, together with several deposited datasets, and data drawn from other databases (Table 1). In total, there are more than 1 million distinct compound structures represented in the database, with 5.4 million activity values from more than 580000 assays. These assays are mapped to 8200 targets, including 5200 proteins (of which 2388 are human).
Table 1.
Data Source | Number of compound structures | Number of assays | Number of activity results | Number of targets | Number of protein targets | Number of organisms |
---|---|---|---|---|---|---|
ChEMBL literature extraction | 629943 | 580624 | 3282945 | 7957 | 5104 | 1552 |
PubChem BioAssay [a] | 364203 | 1636 | 2079974 | 681 | 647 | 63 |
GSK TCAMS Malaria Data (32) | 13467 | 6 | 81198 | 3 | 0 | 2 |
PDBe Ligands | 12337 | 0 | 0 | 0 | 0 | 0 |
Novartis-GNF Malaria Data (33) | 5675 | 4 | 22788 | 3 | 0 | 2 |
St Jude Children's Hospital Malaria Data [b] (34) | 1524 | 16 | 5456 | 8 | 0 | 5 |
Guide to Receptors and Channels (35) | 560 | 344 | 801 | 239 | 239 | 6 |
Sanger Institute Genomics of Drug Sensitivity in Cancer | 17 | 352 | 5984 | 352 | 0 | 1 |
[a] The PubChem BioAssay set includes only confirmatory/panel assays from PubChem that have dose–response endpoints.
[b] Only compounds with dose–response measurements from the St Jude malaria screening data set have been incorporated into ChEMBL, but the full high-throughput screening data can be downloaded from the ChEMBL-NTD website: https://www.ebi.ac.uk/chemblntd.
DATA ACCESS
The ChEMBL interface
The ChEMBL database is accessible via a simple, user-friendly interface at: https://www.ebi.ac.uk/chembldb. This interface allows users to search for compounds, targets or assays of interest in a variety of ways.
For example, users wishing to retrieve potential tool compounds for a target of interest can perform a keyword search of the database using a protein name, synonym, UniProt accession or ChEMBL target identifier of interest. Alternatively, targets can be browsed according to protein family (e.g. to retrieve all chemokine receptors), or organism (e.g. to retrieve all Plasmodium falciparum targets). Since the database only includes protein targets for which bioactivity data are available, users can also perform a BLAST search of the ChEMBL target dictionary with a protein sequence of interest. This can be useful to identify closely related proteins with activity data, even if the sequence of interest is not represented in the database (e.g. activity data for a mouse orthologue of a human target).
Having retrieved a target, or multiple targets, of interest, a simple drop-down menu allows users to display all associated bioactivity data, or to filter the available data to select activity types of interest (for example, to include only IC50 and Ki measurements below a given concentration threshold, or only certain ADMET endpoints; see Supplementary Figure 1). The resulting bioactivity table gives details of each compound that was tested (together with the particular salt form used in the assay), the measured activity type, value and units, a description of the assay, details of the target (including the organism) and, importantly, a link to the publication from which the data have been extracted. Data from this view can be exported as a text file or spreadsheet for further analysis.
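The same filter-then-export workflow can be reproduced on a downloaded bioactivity table. The sketch below assumes each row is a dictionary with hypothetical keys (compound, type, value, units) and that values have already been standardized to nM; it does not describe how the web interface itself is implemented.

```python
import csv
import io

EXPORT_COLUMNS = ["compound", "type", "value", "units"]


def filter_activities(rows, activity_types, max_value_nm):
    """Keep rows of the requested types whose nM value is below the threshold."""
    return [
        row for row in rows
        if row["type"] in activity_types
        and row["units"] == "nM"
        and row["value"] < max_value_nm
    ]


def to_tab_delimited(rows):
    """Serialize rows as tab-delimited text with a header line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer, fieldnames=EXPORT_COLUMNS, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, calling filter_activities(rows, {"IC50", "Ki"}, 100.0) and passing the result to to_tab_delimited yields text that opens directly in a spreadsheet program.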
Alternatively, users may have a particular compound of interest and wish to retrieve potency, selectivity or ADMET information for this, or closely related compounds. Again, users can search for compounds using a keyword search with names/synonyms or ChEMBL identifiers. However, a more effective strategy will often be to search by compound structure. The interface provides a choice of several different drawing tools (26), allowing users to sketch in a structure or substructure of interest (Figure 1). A compound similarity or substructure search of the database (implemented using the Accelrys Direct Oracle Cartridge: http://accelrys.com/products/informatics/cheminformatics/accelrys-direct.html) can then be carried out to retrieve ChEMBL compounds similar to, or containing, the input structure.
Having retrieved a list of compounds of interest, a variety of calculated properties such as molecular weight, calculated lipophilicity (AlogP, 27) and polar surface area (28) can be viewed and filtered via a graphical display. This may be useful to restrict the set of compounds to those that are likely to have appropriate drug-like properties (29), before retrieving or filtering the associated bioactivity data.
For each of the main data types in ChEMBL (compounds, targets, assays and documents), report card pages are available. These provide further details about the entity of interest, such as names and synonyms (for targets and compounds), journal/article details (for documents), drug annotation, structures and calculated physicochemical properties (for compounds), together with cross-references to other resources (e.g. UniProt, PDBe, ChEBI, DrugBank and CiteXplore: http://www.ebi.ac.uk/citexplore). Each report card also contains a series of clickable graphical ‘widgets’ summarizing and providing rapid access to all of the bioactivity data available for that entity (Figure 2).
A table view of approved drugs is also provided, with relevant annotation (e.g. drug type, administration route, ‘black box’ safety warnings) indicated by a series of sortable icons (see Supplementary Figure 2). Users can download the structures for these drugs or go to report cards to access further information, such as bioactivity data.
Downloads and web services
While the ChEMBL interface provides the functionality required for many common use-cases, some users may prefer to download the database and query it locally (for use in large-scale data mining, to integrate with their own proprietary data, or due to data security policies around the use of chemical structures at their institutions, for example). Each release of ChEMBL is freely available from our ftp site in a variety of formats, including Oracle, MySQL, an SD file of compound structures and a FASTA file of the target sequences, under a Creative Commons Attribution-ShareAlike 3.0 Unported license (http://creativecommons.org/licenses/by-sa/3.0).
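Local querying of a downloaded release can be illustrated with an in-memory SQLite database. The four-table schema, the table and column names and the demonstration rows below are all hypothetical and heavily simplified; they are not the released ChEMBL schema, and the values are invented placeholders rather than data from the database.

```python
import sqlite3

# Hypothetical, simplified schema used only for this illustration.
SCHEMA = """
CREATE TABLE molecules  (molecule_id TEXT PRIMARY KEY);
CREATE TABLE targets    (target_id TEXT PRIMARY KEY);
CREATE TABLE assays     (assay_id INTEGER PRIMARY KEY, target_id TEXT);
CREATE TABLE activities (molecule_id TEXT, assay_id INTEGER, type TEXT, value_nm REAL);
"""


def open_demo_database():
    """Create an in-memory database holding a few placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO molecules VALUES (?)", [("MOL_1",), ("MOL_2",)])
    conn.execute("INSERT INTO targets VALUES (?)", ("TARGET_1",))
    conn.execute("INSERT INTO assays VALUES (?, ?)", (1, "TARGET_1"))
    conn.executemany(
        "INSERT INTO activities VALUES (?, ?, ?, ?)",
        [("MOL_1", 1, "IC50", 12.0), ("MOL_2", 1, "IC50", 25000.0)],
    )
    return conn


def potent_compounds(conn, target_id, max_nm):
    """Return molecule ids with an IC50 or Ki below max_nm against the target."""
    rows = conn.execute(
        """
        SELECT DISTINCT a.molecule_id
        FROM activities AS a
        JOIN assays AS s ON s.assay_id = a.assay_id
        WHERE s.target_id = ? AND a.type IN ('IC50', 'Ki') AND a.value_nm < ?
        ORDER BY a.molecule_id
        """,
        (target_id, max_nm),
    ).fetchall()
    return [row[0] for row in rows]
```

Parameterized queries of this kind carry over to the Oracle and MySQL distributions, apart from the driver's placeholder syntax, once the table and column names are replaced with those of the actual release.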
In addition, a set of RESTful web services is provided (together with sample Java, Perl and Python clients), to allow programmatic retrieval of ChEMBL data in XML or JSON formats (see https://www.ebi.ac.uk/chembldb/ws for more details).
Finally, to allow greater interoperability of the ChEMBL data with molecular interaction and pathway data (e.g. for annotation of pathways with chemical tools), a subset of the database (compounds active in binding assays against protein targets) is available in PSI-MITAB 2.5 format (30) via PSICQUIC web services (31).
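Records in PSI-MITAB 2.5 format are tab-delimited lines of 15 columns, in which "-" marks an empty field and "|" separates multiple values. A minimal reader might look like the sketch below; the short column labels and the function name are this sketch's own, and the naive split on "|" assumes that no quoted value contains a pipe character.

```python
# Short labels for the 15 MITAB 2.5 columns, in order (labels chosen for this sketch).
MITAB25_COLUMNS = [
    "id_a", "id_b", "alt_id_a", "alt_id_b", "alias_a", "alias_b",
    "detection_method", "first_author", "publication_id",
    "taxid_a", "taxid_b", "interaction_type", "source_db",
    "interaction_id", "confidence",
]


def parse_mitab25_line(line):
    """Parse one MITAB 2.5 line into a dict of column label -> list of values."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(MITAB25_COLUMNS):
        raise ValueError(
            "expected %d columns, found %d" % (len(MITAB25_COLUMNS), len(fields))
        )
    record = {}
    for label, raw in zip(MITAB25_COLUMNS, fields):
        # '-' denotes an empty field; otherwise values are pipe-separated.
        record[label] = [] if raw == "-" else raw.split("|")
    return record
```

In the subset described above, one interactor of each record would be a compound and the other a protein target, so id_a and id_b carry identifiers from different namespaces.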
FUNDING
A Strategic Award for Chemogenomics from the Wellcome Trust [086151/Z/08/Z]; and the European Molecular Biology Laboratory. Funding for open access charge: European Molecular Biology Laboratory.
Conflict of interest statement. None declared.
ACKNOWLEDGEMENTS
We are grateful to former colleagues at Inpharmatica Ltd., our data extractors, part-time curators and interns for their contributions to the database. We thank Yanli Wang and Evan Bolton for their assistance with the PubChem data integration. We also greatly appreciate and acknowledge the feedback from users on data content and organization of the database.
Articles from Nucleic Acids Research are provided here courtesy of Oxford University Press
Read article at publisher's site: https://doi.org/10.1093/nar/gkr777