PheneBank: Processed Medline Abstracts and PMC full articles
- 1. University of Cambridge
- 2. Queen Mary University
Description
The PheneBank project:
Free-text scientific literature is a potentially valuable source of data for uncovering the often hidden relationships between genes, diseases and phenotypes. Phenotypic descriptions cover abnormalities in anatomical structures, processes and behaviours, for example 'growth delay' and 'body weight loss'. Such descriptions form the basis for determining the existence and treatment of a disease but, because of their inherent complexity, have previously received less attention from the text mining community. In recent years, a small number of expert curators have spent significant effort creating coding systems for phenotypes (called "ontologies"), such as the Human Phenotype Ontology (HP) and the Mammalian Phenotype Ontology (MP). The PheneBank project proposes to support and speed up curation using terms discovered directly from the literature and to automatically integrate them with such standard ontologies.
The project seeks to harness texts to extract statistically significant associations between phenotypes, diseases and genes. Earlier approaches have suffered from not providing deep semantic representations of the phenotypes they tried to target. Our deep learning-based approach is an attempt to overcome this issue by reducing the uncertainty between textual and ontological forms of phenotypes. Specifically, the model treats each multi-token named entity as a single token, which allows more reliable handling of multiword expressions. The approach builds on groundbreaking research at the European Bioinformatics Institute by the PI (Nigel Collier) and the Co-investigator (Damian Smedley, Queen Mary University of London), including terminology alignment of phenotypes using pairwise scoring of the conceptual elements that make up the phenotype.
https://sites.google.com/site/nhcollier/projects/phenebank
The dataset:
As an output of the PheneBank project, we release a set of 24 million MEDLINE abstracts and 3.8 million open-access PMC full articles annotated with nine classes of entity: Phenotype, Disease, Anatomy, Cell, Cell_line, GPR, Gene_variant, Molecule, and Pathway. The entities have been mapped to five major ontologies: SNOMED, HPO, MeSH, PRO, and FMA.
Processing:
The NER tagging has been done using a BiLSTM-CRF neural model trained on expert-annotated data (to be released for research). The grounding to ontologies relies on semantic embedding of concepts and entities in a unified semantic space.
Data format:
The "PheneBank_Processed_PubMed.tar.gz" file contains 24359010 .txt files organized into 812 directories. Each .txt file is named with a PubMed article ID and contains the corresponding article's abstract and its annotations.
The "PheneBank_Processed_PMC.tar.gz" file has 6180 directories, each named after the journal from which its articles were drawn. Each of the 3751770 distinct articles has three .txt files, containing text from different parts of the article: .title.txt, .abstract.txt, and .body.txt.
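The PMC layout described above can be walked with a short script. The following is a minimal sketch, assuming that the extracted archive places the journal directories directly under one root directory and that each file is named `<article_id>.<part>.txt`; the function name `group_pmc_files` and the exact nesting are illustrative assumptions, not part of the release.

```python
from collections import defaultdict
from pathlib import Path
from typing import Dict, Tuple

# The three per-article parts named in the dataset description.
PARTS = ("title", "abstract", "body")


def group_pmc_files(root: Path) -> Dict[Tuple[str, str], Dict[str, Path]]:
    """Group PMC files by (journal directory, article ID).

    Assumes the layout root/<journal>/<article_id>.<part>.txt, which is
    an assumption about how the archive extracts.
    """
    articles: Dict[Tuple[str, str], Dict[str, Path]] = defaultdict(dict)
    for path in sorted(root.glob("*/*.txt")):
        for part in PARTS:
            suffix = f".{part}.txt"
            if path.name.endswith(suffix):
                # Everything before the part suffix is taken as the article ID.
                article_id = path.name[: -len(suffix)]
                articles[(path.parent.name, article_id)][part] = path
                break
    return dict(articles)
```

Calling `group_pmc_files(Path("extracted_pmc"))` (a hypothetical path) returns a dictionary whose values map "title", "abstract" and "body" to the corresponding files; files that do not end in one of the three suffixes are ignored.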
In every file, each line starts with a word; for words identified as entities, the entity type and mapping information follow on the same line (tab-separated), in the following format:
word <TAB> ::: <TAB> entity_type <TAB> entity_concept_ID_1##confidence_score_1 entity_concept_ID_2##confidence_score_2 ...
Note that the concepts are sorted according to their mapping confidence scores.
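The line format above can be read with a few lines of Python. This is a minimal sketch, assuming that fields are separated by single tab characters, that concept entries within the last field are separated by whitespace, and that confidence scores are plain decimal numbers; the `Token` class and `parse_line` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Token:
    word: str
    entity_type: Optional[str] = None
    # (concept_id, confidence) pairs, kept in the order given in the file.
    concepts: List[Tuple[str, Optional[float]]] = field(default_factory=list)


def parse_line(line: str) -> Token:
    """Parse one annotated line into a Token.

    Lines without the ':::' marker are treated as plain words.
    """
    fields = [f.strip() for f in line.rstrip("\n").split("\t")]
    word = fields[0]
    if len(fields) < 3 or fields[1] != ":::":
        return Token(word)

    concepts: List[Tuple[str, Optional[float]]] = []
    if len(fields) > 3:
        for item in fields[3].split():
            concept_id, sep, score = item.rpartition("##")
            if sep:
                # Assumption: the score is a decimal number.
                concepts.append((concept_id, float(score)))
            else:
                # No '##' separator found: keep the ID without a score.
                concepts.append((item, None))
    return Token(word, fields[2], concepts)
```

For instance, a made-up line such as `"growth_delay\t:::\tPhenotype\tCONCEPT_A##0.93 CONCEPT_B##0.81"` (placeholder IDs and scores, not taken from the dataset) yields a `Token` with `entity_type` set to `"Phenotype"` and two concept pairs, while a line containing only a word yields a `Token` with no entity type.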
Files
PheneBank_Processed_PubMed_v2.zip
Files (41.4 GB)
MD5 checksum | Size |
---|---|
8b25087e6f14cedab84a659fd1e64556 | 16.7 GB |
2f636513372638c681471070efcca593 | 24.7 GB |
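The MD5 checksums listed above can be used to confirm that a download completed intact. The following is a minimal sketch using Python's standard `hashlib` module; the helper names `md5_of_file` and `matches_checksum` are illustrative, and which checksum belongs to which archive has to be taken from the record page itself.

```python
import hashlib
from pathlib import Path


def md5_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks
    so that multi-gigabyte archives do not have to fit in memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: Path, expected: str) -> bool:
    """Compare a file's MD5 digest with an expected value.

    Accepts the expected value with or without a leading 'md5:' prefix.
    """
    expected = expected.strip().lower()
    if expected.startswith("md5:"):
        expected = expected[len("md5:"):]
    return md5_of_file(path) == expected
```

A call such as `matches_checksum(Path("archive_file"), "md5:8b25087e6f14cedab84a659fd1e64556")`, where `archive_file` stands for the downloaded file, returns `True` only if the file's digest equals the listed value.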