Transformer-based approach for symptom recognition and multilingual linking
Abstract
This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with Bidirectional Long Short-Term Memory and conditional random field layers on an augmented train set, achieving an F1 score of 0.73. Entity linking is performed with a hybrid approach that combines dictionary lookup, candidate generation from a knowledge base of Unified Medical Language System aliases using cross-lingual SapBERT, and reranking of the top candidates with GPT-3.5. The entity linking approach shows consistent results across languages, with 0.73 accuracy on the SympTEMIST multilingual dataset, and also achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask.
Database URL: https://github.com/svassileva/symptemist-multilingual-linking
Introduction
Clinical named entity recognition (NER) and entity linking are the tasks of automatically detecting important terms in clinical text, such as symptoms, diseases, procedures, and diagnoses, and identifying the correct concept from a standard medical ontology or classification for each detected term. Detecting and normalizing clinical terms is an important task in clinical Natural Language Processing, used for extracting structured information from clinical texts and enabling its subsequent automatic processing for downstream tasks. Due to the complexity of clinical texts, extracting the terms and linking them to the correct concepts from medical ontologies is very challenging. The search space for entity linking is usually hundreds of thousands of concepts, and the majority of these concepts are in the long tail of the distribution, i.e. they have very few representatives in the train datasets and medical vocabularies. Transformer-based models have been successfully applied to entity recognition, especially in high-resource languages like English. However, due to limited labeled resources in other languages, clinical NER remains challenging for many languages.
Different shared tasks are organized each year to advance the methods for clinical information extraction by providing reference datasets that different teams can use to train their methods. The BioCreative Challenge and Workshop presents researchers with different challenges and allows them to compete and present their work at the BioCreative Workshop (1).
SympTEMIST is a shared task, part of the BioCreative VIII Challenge, aimed at detecting and normalizing symptoms, signs, and findings in Spanish clinical texts (2). The organizers provided a Spanish clinical dataset for training and evaluation of the participants’ methods. It consists of three subtasks—NER (Subtask 1), entity linking (Subtask 2), and multilingual entity linking (Subtask 3) for several languages—English, Portuguese, French, Italian, and Dutch (https://temu.bsc.es/symptemist/). The shared task aims to identify all signs, symptoms, non-numerical descriptions of test results, findings from imaging procedures, as well as patient death events. The first subtask focuses on NER in Spanish clinical case reports of entities from the aforementioned categories. The second subtask consists of identifying the correct Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) code associated with each mention from Subtask 1. SNOMED CT (https://www.nlm.nih.gov/healthit/snomedct/us_edition.html) is an internationally used medical ontology and collection of terms used for clinical documentation and reporting (3). The list of possible SNOMED CT codes is provided by the organizers, and it is possible for some mentions to be missing from the knowledge base (KB). The third subtask includes identifying the correct SNOMED CT code in five different languages for automatically transferred entities in 350 case reports. The target list of SNOMED CT codes is the same as in Subtask 2.
This paper describes our approach for symptom recognition in Spanish as well as multilingual entity linking in six languages. The contributions of this paper are as follows: we propose a system for Spanish clinical NER using a data augmentation approach, which achieves a 0.73 F1 score on symptom recognition. We propose a hybrid approach for improving entity linking in multiple languages using dictionaries, cross-lingual SapBERT candidate generation, and GPT-3.5 reranking, which shows a consistently high accuracy of 0.73 on average across five languages, improving on the best score on the SympTEMIST dataset (4). The approach can be adapted to other languages as well. We have published our code on GitHub (https://github.com/svassileva/symptemist-multilingual-linking).
Related work
Approaches based on transformer architectures are frequently utilized for NER tasks. Notably, the top-ranked system in the SympTEMIST challenge incorporates an ensemble of transformer models for Spanish clinical text and achieved an F1 score of 0.74 (strict) (5). Prior research on NER for Spanish text demonstrated that Spanish RoBERTa (6) enhanced with a conditional random field (CRF) head outperforms other NER models (7, 8). Moreover, the integration of a Bidirectional Long Short-Term Memory (BiLSTM) layer has been shown to improve model performance, achieving an F1 score of 0.79 in the context of the procedure challenge (7).
A widely adopted solution for entity linking in clinical texts is cosine similarity search over embeddings from the cross-lingual SapBERT model, introduced by Liu et al. (9). Such an approach was utilized by the top three teams in the MedProcNER challenge (https://temu.bsc.es/medprocner/) (10, 11, 8), as well as by top-ranked teams in the SympTEMIST challenge, augmented with a reranking model based on Bidirectional Encoder Representations from Transformers (BERT), achieving an accuracy of 0.60.
The most prominent multilingual end-to-end solution for NER and entity linking is mReFinED (12). The proposed two-stage solution is based on a semisupervised method for annotating entity mentions within Wikipedia articles, subsequently leveraging this annotated corpus to train a multifaceted BERT-oriented model performing NER, entity disambiguation, and entity linking.
Some multilingual general-domain approaches based on sequence-to-sequence models successfully perform entity linking, as exemplified by the mGENRE system (13). mGENRE is designed to predict the normalized entity name for each mention in a specified language using an auto-regressive model (13) and demonstrated state-of-the-art results on the benchmark dataset Mewsli-9 (14).
In the biomedical and clinical domain, Zhu et al. propose a system that maps language-specific mentions to Unified Medical Language System (UMLS) using a controllable contrastive generation framework using a template-based UMLS concept summary to guide the decoder to generate the correct entity (15).
Data
SympTEMIST dataset
The SympTEMIST corpus (2) contains 1,000 clinical case reports in Spanish with labeled symptom spans and their corresponding concept code from SNOMED CT. The corpus contains labeled train and test sets and a gazetteer of SNOMED CT codes and different aliases in Spanish. The train set has 750 documents, 11,899 sentences, and 343,243 tokens. The test set has 250 documents, 3,986 sentences, and 114,536 tokens. The train set contains 3,484 annotated entities with 1,534 unique entity codes. Fifty-nine mentions have no SNOMED CT code assigned, and the rest have a single corresponding code. An additional set of composite mentions was released after the end of the challenge. There is one nested mention. The Spanish SympTEMIST gazetteer contains a total of 164,817 aliases for terms in multiple categories, including findings, disorders, morphologic abnormalities, and others.
As part of the experimental multilingual subtask, the organizers automatically translated 350 Spanish clinical case reports into five languages and then transferred the labeled entities from Spanish into the target language using lexical annotation transfer techniques (2). The goal is to predict the correct SNOMED CT code based on the identified entity mention in the text. Figure 1 shows the number of entities in the train and test sets in different languages in the SympTEMIST dataset. Due to differences in the automatic translation quality, the number of labeled entities is slightly different in the different languages (2). Figure 2 shows the distribution of SNOMED CT concept types in the SympTEMIST gazetteer. The majority of concepts are disorders, followed by findings and morphologic abnormalities.
Language pretraining dataset
To evaluate the effect of further pretraining of the language model, we compiled an additional dataset consisting of Spanish UMLS synonyms of all terms included in the SympTEMIST gazetteer, for a total of 337,039 aliases.
UMLS KB
Using the Spanish SNOMED CT terms in UMLS (3), we created a KB of all aliases of the SympTEMIST gazetteer concepts. In addition, data from the gazetteer and the train set for Subtask 2 were added. The KB consists of 289,734 symptom aliases. Figure 3 shows the frequency distribution of concept aliases in the Spanish KB; the majority of concepts have up to two aliases.
For the multilingual subtask, we compiled all UMLS SNOMED CT aliases for the different languages and combined them with the train set entities. Figure 4 shows the number of aliases in the KB for each language.
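The per-language dictionaries can be compiled roughly as in the sketch below. We assume the standard pipe-delimited MRCONSO.RRF layout (LAT in column 1, SAB in column 11, CODE in column 13, STR in column 14) and a language-appropriate SNOMED CT source (e.g. SNOMEDCT_US for English, SCTSPA for Spanish); these assumptions should be verified against the concrete UMLS release.

```python
# Sketch: build an alias -> SNOMED CT code dictionary for one language.
def build_kb(mrconso_path: str, lang: str, sab: str,
             train_pairs: dict[str, str]) -> dict[str, str]:
    alias2code: dict[str, str] = {}
    with open(mrconso_path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("|")
            # cols[1]=LAT, cols[11]=SAB, cols[13]=CODE, cols[14]=STR (assumed layout)
            if cols[1] == lang and cols[11] == sab:
                alias2code.setdefault(cols[14].lower(), cols[13])
    # train set annotations take precedence over UMLS aliases
    alias2code.update({m.lower(): c for m, c in train_pairs.items()})
    return alias2code

# e.g. spanish_kb = build_kb("MRCONSO.RRF", "SPA", "SCTSPA", train_pairs)
```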
Methods
Subtask 1—NER
The clinical report texts are first split into sentences using the SPACCC Sentence Splitter (https://github.com/PlanTL-GOB-ES/SPACCC_Sentence-Splitter), as about 33% of the case reports exceed the 512 input token limit of the employed models. Following the method we used in the original competition (16), we approach the NER subtask as a token classification task and label the train set using the IOB2 scheme (17). We train a transformer-based model with an additional two-layer BiLSTM, followed by a linear layer and a conditional random field (CRF) layer for token classification, using the negative conditional log-likelihood of the label sequence as the loss function. Figure 5 shows the NER pipeline architecture. We trained the model for 20 epochs using the following hyperparameters: learning rate, 5e-5 (recommended by the original BERT paper (18)); Adam beta1, 0.9; Adam beta2, 0.999 (the Adam default configuration of the huggingface library (https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules)); weight decay, 0.1 (recommended by the original RoBERTa paper (19)); batch size, 8; and gradient accumulation steps, 2.
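A minimal sketch of this token-classification head is given below, assuming the `pytorch-crf` package for the CRF layer; the BiLSTM hidden size is illustrative rather than our exact configuration.

```python
# Sketch: transformer encoder -> two-layer BiLSTM -> linear -> CRF,
# trained with the negative conditional log-likelihood of the labels.
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # pip install pytorch-crf

class BiLstmCrfTagger(nn.Module):
    def __init__(self, model_name: str, num_labels: int, lstm_hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.lstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                            num_layers=2, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        emissions = self.classifier(self.lstm(hidden)[0])
        mask = attention_mask.bool()
        if labels is not None:
            # pad label positions with a valid tag id (e.g. the O tag), not -100
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)
```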
After splitting the training dataset into sentences, we augment it by randomly replacing some of the annotated mentions with a synonym from the Spanish UMLS. This results in 1,672 additional example sentences for a total of 13,571.
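The augmentation step can be sketched as below, assuming a prebuilt mapping from SNOMED CT code to Spanish UMLS synonyms and character-offset mention spans; the function name is hypothetical, and the IOB2 labels must be re-aligned after each replacement.

```python
# Sketch: replace an annotated mention with a random UMLS synonym.
import random

def augment_sentence(text: str, start: int, end: int, code: str,
                     synonyms: dict[str, list[str]]) -> str | None:
    """Return a copy of `text` with the mention span replaced by a random
    UMLS synonym of its concept, or None if no alternative exists."""
    mention = text[start:end]
    candidates = [s for s in synonyms.get(code, []) if s.lower() != mention.lower()]
    if not candidates:
        return None
    return text[:start] + random.choice(candidates) + text[end:]
```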
Classification model selection
For the token classifier base model, we performed experiments with different Spanish BERT-based models—CLIN-X-ES (20) and PlanTL-GOB-ES/roberta-base-biomedical-clinical-es (6) (Spanish RoBERTa). We chose these models as they had performed well in previous challenges on NER in Spanish clinical texts showing competitive results in MedProcNER (21) and DisTEMIST (22).
CLIN-X-ES is based on XLM-RoBERTa Large, a cross-lingual language model, and is additionally trained on a Spanish clinical corpus using the masked language modeling objective. The corpus combines the MeSpEN dataset (23) and the SciELO archive (https://scielo.org/). We use this model as-is, without further language pretraining.
Spanish RoBERTa (PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) is a monolingual Spanish model, based on RoBERTa and trained on a large Spanish biomedical–clinical corpus of more than 1B tokens. Systems using this model have achieved very good results on previous Spanish biomedical–clinical tasks. We further pretrain the Spanish RoBERTa model on the language pretraining dataset to improve the language model’s training on symptoms.
Language model pretraining
We evaluate the effect of further pretraining the base transformer model by training the Spanish RoBERTa model on the language pretraining dataset with the masked language modeling objective for four epochs. Hyperparameter values are the same as those used for pretraining RoBERTa (base) in the original RoBERTa paper (19).
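A compact sketch of this continued pretraining with the huggingface `Trainer` is shown below; the alias file name is hypothetical, and only the epoch count is set explicitly, leaving the remaining arguments at library defaults rather than the exact RoBERTa values.

```python
# Sketch: continued masked-language-model pretraining on UMLS aliases.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

name = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoModelForMaskedLM.from_pretrained(name)

# one alias per line (hypothetical file produced from the UMLS synonyms)
data = load_dataset("text", data_files={"train": "umls_aliases_es.txt"})["train"]
data = data.map(lambda b: tokenizer(b["text"], truncation=True), batched=True)

Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-out", num_train_epochs=4),
    train_dataset=data,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
).train()
```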
Subtask 2—entity linking
The entity linking task aims to predict the correct SNOMED CT code, using the gold entities provided in the SympTEMIST dataset. The dataset originally provided by the organizers had all composite mentions, which correspond to more than one code, removed from the train and test sets; therefore, our system outputs a single code per mention. Some entity mentions have no corresponding code in SNOMED CT and are marked as NO_CODE (59 mentions in the train set).
We use the model developed for the competition (16), which did not have any composite mentions in its training data and therefore does not address the composite code problem. Entity linking is performed in two steps: first, we try to match the mention to an alias in the KB by using an exact match string search of the lowercase symptom name. Second, for entities that did not match a KB alias, we perform a cosine similarity search on the cross-lingual SapBERT (9) embeddings, retrieving the closest alias from the KB. The architecture of the entity linking model is shown in Figure 6. We gather symptom synonyms from different data sources, including the SympTEMIST gazetteer, train data, and UMLS symptoms (3). We also augment the rare symptom concepts that have fewer than five aliases by adding/removing random characters to generate five new aliases for each concept.
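A condensed sketch of the two-step linking follows. The SapBERT checkpoint name and the [CLS] pooling are our assumptions about a typical cross-lingual SapBERT setup, and `kb_embeddings` is assumed to be precomputed (in batches) with the same `embed` helper over `kb_aliases`.

```python
# Sketch: exact lowercase dictionary match, then cosine-similarity search.
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

NAME = "cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR"  # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(NAME)
encoder = AutoModel.from_pretrained(NAME).eval()

def embed(texts: list[str]) -> torch.Tensor:
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        cls = encoder(**batch).last_hidden_state[:, 0]  # [CLS] pooling
    return F.normalize(cls, dim=-1)  # unit vectors: dot product = cosine

def link(mention: str, alias2code: dict[str, str],
         kb_aliases: list[str], kb_embeddings: torch.Tensor) -> str:
    key = mention.lower()
    if key in alias2code:                    # step 1: exact match
        return alias2code[key]
    sims = embed([key]) @ kb_embeddings.T    # step 2: nearest KB alias
    return alias2code[kb_aliases[int(sims.argmax())]]
```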
Subtask 3—multilingual entity linking
For the multilingual entity linking task, the organizers provided automatically translated symptom mentions in English, French, Italian, Dutch, and Portuguese, with corresponding SNOMED CT codes derived from the original gold Spanish annotations used in the second subtask. The pipeline used to assign a SNOMED CT code to the automatically translated mentions is shown in Figure 7. First, each mention is transformed to lowercase. Then, a lowercase exact match string search is performed against a language-specific dictionary of all UMLS SNOMED CT aliases, combined with entities from the corresponding train set. If a mention is found in the dictionary, it is assigned the corresponding code. Mentions not found in the dictionary are further processed by a cross-lingual SapBERT to find the five most similar entities (in terms of cosine similarity) from the UMLS KB. Finally, each mention, its five candidate entities, and the text of the Spanish patient case report from which the mention is derived are used to construct a one-shot prompt for GPT-3.5, which selects the best candidate. In case GPT-3.5 returns an entity that is not included in the five candidates provided, NO_CODE is assigned, which addresses a limitation of the basic cross-lingual SapBERT approach.
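The reranking step can be sketched as follows. The prompt wording is illustrative (our actual one-shot prompt, including a train set example, is shown in Figure 8), and the fallback to NO_CODE handles answers outside the candidate list.

```python
# Sketch: GPT-3.5 reranking of SapBERT candidates with a NO_CODE fallback.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def rerank(mention: str, candidates: list[tuple[str, str]], report: str) -> str:
    """candidates: (alias, SNOMED CT code) pairs proposed by SapBERT."""
    options = "\n".join(f"- {alias}" for alias, _ in candidates)
    prompt = (f"Patient case report:\n{report}\n\n"
              f"Symptom mention: {mention}\n"
              f"Candidate terms:\n{options}\n\n"
              "Reply with the single candidate term that best matches the mention.")
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content.strip()
    for alias, code in candidates:
        if alias.lower() == reply.lower():
            return code
    return "NO_CODE"  # the model answered outside the candidate list
```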
Experiments and results
Train and validation datasets
We compile the train and validation sets by first splitting the original texts into sentences and then creating three bins of sentences based on the longest mention contained in the sentence: short (less than 38 symbols), medium (between 38 and 90 symbols), and long (over 90 symbols). Finally, 80% of the sentences in each bin are used as train examples and the rest for validation. Bin margins are determined based on the mention length distribution in the original train set. The validation set is used for model comparison in different experiments, while the full train set is used to train the final models for test set evaluation.
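The length-stratified split can be implemented as in the sketch below; the sentence representation and the fixed seed are illustrative.

```python
# Sketch: bin sentences by longest mention length, then split 80/20 per bin.
import random

def stratified_split(sentences, seed=42):
    """sentences: (text, longest_mention_char_length) pairs."""
    bins = {"short": [], "medium": [], "long": []}
    for sent in sentences:
        n = sent[1]
        bins["short" if n < 38 else "medium" if n <= 90 else "long"].append(sent)
    rng, train, val = random.Random(seed), [], []
    for group in bins.values():
        rng.shuffle(group)
        cut = int(0.8 * len(group))
        train += group[:cut]
        val += group[cut:]
    return train, val
```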
Micro-averaged precision, recall, and F1-score are used for the NER subtask. Accuracy is used for entity linking.
Subtask 1—NER
Table 1 presents the results of the models and fine-tuning combinations that were submitted in the original challenge. Models based on the Spanish RoBERTa (PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) perform best, likely because it is specialized for the Spanish biomedical–clinical domain. The addition of a two-layer BiLSTM increases both recall and precision, perhaps due to its ability to consider long-term dependencies (7). Table 2 shows the strict vs overlapping precision, recall, and F1 scores on the test set. The overlapping scores are higher than the strict scores by 13%–15%, which could be explained by differences in the tokenization rules used by the annotators and our models. The augmented Spanish RoBERTa + BiLSTM + CRF model shows an overlapping F1 score of 0.8683, which ranked second in the competition.
Table 1. Validation and test set results (micro-averaged precision, recall, and F1) of the submitted NER models.
Model | Val P | Val R | Val F1 | Test P | Test R | Test F1 |
---|---|---|---|---|---|---|
Spanish RoBERTa + CRF | 0.730 | 0.746 | 0.738 | - | - | - |
Augmented Spanish RoBERTa + CRF | 0.773 | 0.729 | 0.750 | 0.732 | 0.718 | 0.725 |
Augmented Spanish RoBERTa + BiLSTM + CRF | 0.744 | 0.732 | 0.738 | 0.739 | 0.725 | 0.732 |
Pretrained augmented Spanish RoBERTa + CRF | 0.749 | 0.721 | 0.735 | 0.715 | 0.720 | 0.718 |
CLIN-X-ES + CRF | 0.757 | 0.717 | 0.737 | 0.718 | 0.703 | 0.710 |
Augmented CLIN-X-ES + CRF | 0.722 | 0.704 | 0.713 | 0.724 | 0.699 | 0.712 |
Models based on the Spanish RoBERTa perform best. The addition of a two-layer BiLSTM increases both recall and precision. The best score on the test set is achieved by the Augmented Spanish RoBERTa with BiLSTM and CRF layers—0.732 F1.
Table 2. Strict vs overlapping precision, recall, and F1 scores on the test set.
Model | Test P | Test R | Test F1 | Test overlap P | Test overlap R | Test overlap F1 |
---|---|---|---|---|---|---|
Augmented Spanish RoBERTa + CRF | 0.7324 | 0.7178 | 0.725 | 0.8702 | 0.8528 | 0.8614 |
Augmented Spanish RoBERTa + BiLSTM + CRF | 0.7393 | 0.7255 | 0.7324 | 0.8766 | 0.8602 | 0.8683 |
Pretrained augmented Spanish RoBERTa + CRF | 0.7149 | 0.7207 | 0.7178 | 0.8603 | 0.8673 | 0.8638 |
CLIN-X-ES + CRF | 0.7245 | 0.6991 | 0.7116 | 0.8748 | 0.8441 | 0.8592 |
Augmented CLIN-X-ES + CRF | 0.7177 | 0.7026 | 0.7101 | 0.8651 | 0.847 | 0.8559 |
The overlapping scores are higher than the strict scores by 13%–15%, which could be explained by differences in the tokenization rules used by the annotators and our models. The Augmented Spanish RoBERTa + BiLSTM + CRF model shows an overlapping F1 score of 0.8683, which ranked second in the competition.
Effect of data augmentation
We compare the models trained on the original and the augmented dataset which contains 15% more examples generated by replacing symptom mentions with their UMLS synonyms. The model results on the augmented dataset are shown in Table 1. Using the augmented train set significantly improves the precision and F1 of the Spanish RoBERTa model after fine-tuning. A small performance drop is observed in the augmented CLIN-X-ES model compared to its nonaugmented version on the validation set. However, on the test set, the two models show very close F1 scores, with the augmented CLIN-X-ES having a small precision lead.
Effect of language pretraining
Further pretraining of the Spanish RoBERTa degrades model performance when fine-tuned on the augmented training set. Using only UMLS term synonyms out of context appears to be insufficient for conditioning the model for the NER task.
Subtask 2—entity linking
For the Spanish entity linking task, we experiment with various KBs by combining different resources: the gazetteer, the train set, and the UMLS synonym dataset. In addition, we augment the KB by adding/removing random characters in the aliases of rare concepts that have fewer than five records in the KB. For each rare concept, five new records are added to the KB. For validation purposes, we use the full train set and exclude examples that exactly match aliases in the KB. Accuracy is used to measure entity linking performance.
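The character-level perturbation for rare aliases might look like the sketch below; the even split between insertion and deletion is an illustrative choice.

```python
# Sketch: generate a noisy alias by inserting or deleting one character.
import random
import string

def perturb(alias: str, rng: random.Random) -> str:
    i = rng.randrange(len(alias))
    if rng.random() < 0.5 and len(alias) > 1:
        return alias[:i] + alias[i + 1:]  # delete one character
    return alias[:i] + rng.choice(string.ascii_lowercase) + alias[i:]  # insert one

# e.g. five new KB records per rare concept:
# new_aliases = [perturb(alias, random.Random(0)) for _ in range(5)]
```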
We noticed that the straightforward cosine similarity search using cross-lingual SapBERT embeddings struggles with longer mentions, so we performed an experiment tackling this challenge with a sliding window approach. For each alias in the KB, we determine the final similarity score as a linear combination of cosine similarities of three parts: the full mention, the first 75% of tokens in the mention, and the last 75% of tokens in the mention. Again, we select the KB alias with the highest combined cosine similarity score. Using grid search over the train set, we identify the optimal coefficients for the three parts in the linear combination to be 0.75, 0.17, and 0.08, respectively. This approach achieves 2% higher accuracy than the basic cross-lingual SapBERT model on the same KB, suggesting that the information needed to find the correct code is more concentrated in the first part of the mention.
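Expressed as code, the combined score against one KB alias embedding could be computed as below, assuming `embed_fn` returns L2-normalized vectors (such as the `embed` helper above) so that dot products equal cosine similarities.

```python
# Sketch: sliding-window similarity with grid-searched weights 0.75/0.17/0.08.
def windowed_score(mention_tokens: list[str], alias_emb, embed_fn) -> float:
    k = max(1, int(0.75 * len(mention_tokens)))
    parts = [" ".join(mention_tokens),        # full mention
             " ".join(mention_tokens[:k]),    # first 75% of tokens
             " ".join(mention_tokens[-k:])]   # last 75% of tokens
    e_full, e_head, e_tail = embed_fn(parts)
    return float(0.75 * (e_full @ alias_emb)
                 + 0.17 * (e_head @ alias_emb)
                 + 0.08 * (e_tail @ alias_emb))
```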
The results for the entity linking subtask submitted in the challenge are presented in Table 3. The best model we submitted for the challenge achieved an accuracy of 0.589 on the test set and has the richest KB, with additional data from UMLS. The majority of the models show close accuracy results in the range of 0.56–0.58. We also measured the effect of performing a dictionary match before the cosine similarity search: excluding the dictionary matching lowered the score by 1.4% on the validation set and 0.3% on the test set. There was a bug in the code for generating the predictions of this model, and it scored 0.017 in the official evaluation. After the challenge ended, we fixed the bug and reran the test, resulting in an accuracy of 0.586 on the test set.
Table 3. Entity linking accuracy on the validation and test sets for different KB combinations.
Model | KB | Val Acc | Test Acc |
---|---|---|---|
Cross-lingual SapBERT | Gazetteer + Train | 0.514 | 0.588 |
Cross-lingual SapBERT | Gazetteer + Train + Aug. | 0.533 | 0.565 |
Cross-lingual SapBERT | Gazetteer + Train + UMLS | 0.524 | 0.589 |
Cross-lingual SapBERT + sliding window | Gazetteer + Train + Aug. | 0.536 | 0.587 |
Cross-lingual SapBERT without dictionary match | Gazetteer + Train + UMLS | 0.510 | 0.017/0.586* |
The majority of the models show close accuracy results in the range of 0.56–0.58. Excluding the dictionary match from the pipeline resulted in a lower score on both the validation and test sets.
*The first value is the original submission, which had a bug in the code; the second is the accuracy after the post-competition bugfix.
Subtask 3—multilingual entity linking
As a baseline for the multilingual entity linking task, we perform a lowercase exact match string search within the KB for each language. Table 4 includes the baseline results and the best results across languages, including Spanish (part of Subtask 2). All experiments for this subtask were performed after the challenge ended.
Table 4. Multilingual entity linking accuracy on the test set per language, pipeline configuration, and preprocessing.
Model | Preprocessing | Language | Test set accuracy |
---|---|---|---|
Dictionary | Lowercase mentions | French | 0.527 |
Dictionary + Cross-lingual SapBERT | - | French | 0.708 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | French | 0.7143 |
Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | French | 0.708 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | French | 0.7284 |
Dictionary | Lowercase mentions | English | 0.596 |
Dictionary + Cross-lingual SapBERT | - | English | 0.7362 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | English | 0.7437 |
Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | English | 0.7462 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | English | 0.7525 |
Dictionary | Lowercase mentions | Dutch | 0.549 |
Dictionary + Cross-lingual SapBERT | - | Dutch | 0.7261 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Dutch | 0.7309 |
Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | Dutch | 0.7365 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | Dutch | 0.7301 |
Dictionary | Lowercase mentions | Italian | 0.523 |
Dictionary + Cross-lingual SapBERT | - | Italian | 0.6943 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Italian | 0.7156 |
Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | Italian | 0.713 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | Italian | 0.726 |
Dictionary | Lowercase mentions | Portuguese | 0.503 |
Dictionary + Cross-lingual SapBERT | - | Portuguese | 0.7047 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Portuguese | 0.7271 |
Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | Portuguese | 0.7094 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | Portuguese | 0.7251 |
Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Spanish | 0.6123 |
The baseline dictionary approach shows the lowest score, 0.5396 on average across all languages. Enhancing the dictionary with a cosine similarity search over cross-lingual SapBERT embeddings significantly improves the accuracy, by 14%–20% depending on the language. Using GPT-3.5 as a reranker of the final candidates adds a further 1%–2%, and translating the mentions into Spanish as a preprocessing step yields the best results for French, English, Italian, and Spanish.
For all languages, we use cross-lingual SapBERT with the Spanish UMLS KB to assign a SNOMED CT code to symptom mentions that are not present in the dictionary. The model produces an embedding vector for the query symptom mention, which is then used to find the most similar (in terms of cosine similarity) entities from the KB. The Spanish UMLS KB was chosen because it achieved our highest result in the second subtask while offering a good balance between code variety and the number of aliases. For French, using the augmented KB instead leads to a lower accuracy score (0.682 vs 0.708 for the UMLS KB). Unsurprisingly, naively translating the KB into the target language (e.g. French) also causes performance degradation.
Figure 8 shows an example of the instructions we provide to GPT-3.5-turbo (gpt-3.5-turbo-0125) to rerank the candidates generated by the cross-lingual SapBERT. The prompt is constructed from a symptom mention, candidates from SapBERT, and the text of the patient report from which the symptom mention is derived. The example in the prompt varies between languages and is taken from the corresponding train set. The number of candidate mentions in the prompt has a significant impact on the reranking performance. Having fewer candidates is better (0.728 accuracy with five candidates for French), even though the correct candidate is found more often in the top 10 (0.712 accuracy with 10 candidates for French). Reranking performance is significantly affected by the presence of the correct entity among the top five candidates found by SapBERT. For English, the correct entity is among the top five candidates for about 52% of the entities processed by SapBERT, and 76% of those are reranked correctly. Last but not least, the Spanish entity linking (Subtask 2) also benefits from the reranking step, achieving a new best accuracy score of 0.6123 (achieved after the challenge deadline).
Translating the symptom mention from the original language into Spanish brings a stable performance improvement across languages. Since the KB consists of Spanish UMLS terms, this likely explains why translating the search term into Spanish improves the system’s performance. Figure 10 shows the prompt used to translate a symptom mention into Spanish via GPT-3.5-turbo. Table 4 shows the impact of symptom mention translation and candidate reranking.
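A hypothetical call mirroring the Figure 10 translation step is sketched below; the exact prompt wording we used differs.

```python
# Sketch: translate a mention to Spanish before the dictionary/SapBERT search.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def translate_to_spanish(mention: str) -> str:
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user",
                   "content": "Translate this clinical symptom mention into Spanish. "
                              f"Reply with the translation only: {mention}"}],
    )
    return reply.choices[0].message.content.strip()
```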
Error analysis
For the multilingual entity linking task, across all languages, about 2% of the test set symptom mentions match a term from the corresponding dictionary, but their code in the test set differs from the one in the dictionary. For example, in the Portuguese dictionary, the term macrocitose has a code of 397073000 (finding), while in the test set, it is assigned 72826005 (morphologic abnormality), both being valid candidate codes, depending on the context.
Conclusion
We explore transformer-based approaches to solving the SympTEMIST NER and linking tasks. For NER, systems based on a specialized monolingual model achieve the best results. The addition of a BiLSTM layer after the last transformer layer, and train data augmentation, improves performance on the test set. Further pretraining on a UMLS synonyms dataset did not prove beneficial. For the entity linking task, we employ a hybrid approach, including UMLS dictionary matching, generating candidates using a cross-lingual SapBERT, and reranking the top five candidates using GPT-3.5. The choice of a KB has the highest impact on system performance—our highest accuracy model combines the SympTEMIST gazetteer, UMLS synonyms, and train set annotations. We performed experiments for entity linking with Spanish and five other European languages which showed a stable accuracy of 0.73 on average. The approach also improved the entity linking for Spanish and achieved an accuracy of 0.6123 on the test set, thus improving the current top score from the SympTEMIST challenge of 0.607.
All the challenge participants in the NER subtask use BERT-based models in some fashion, predominantly for token classification, and the best results are shown by ensemble models (5, 2). We could improve our system by training multiple BERT-based models in an ensemble. For entity linking, many solutions use cross-lingual SapBERT but combine it in different ways. The best score on the entity linking challenge combines TF-IDF and SapBERT for candidate generation and a trained BERT-based cross-encoder for reranking (24). Our proposed method improves on their score but requires more compute resources and uses a proprietary model (GPT-3.5). As future work, we could explore fine-tuning an open-source model with fewer parameters, such as Llama3-8b (25), to rerank the suggested candidates, which would allow using the model in a lower-resource environment.
Limitations
The entity linking experiments covered six European languages that are represented in UMLS. Applying the same approach to languages from different language families, or to languages that are not represented in UMLS, will probably result in lower performance. Since the dataset was automatically translated from Spanish and the annotations were automatically transferred, the mentions that were successfully transferred may represent concepts that are easier to link, since they are lexically similar. Therefore, the approach may not perform as well on a manually labeled dataset that includes mentions that are “harder” to link.
Contributor Information
Sylvia Vassileva, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria.
Georgi Grazhdanski, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria.
Ivan Koychev, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria.
Svetla Boytcheva, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria. Ontotext, Ontotext, ul. “Nikola Gabrovski” 79, Sofia 1700, Bulgaria.
Funding
This work was supported by the European Union-NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria (Grant Project No. BG-RRP-2.004-0008 - Project “Sofia University - Marking Momentum for Innovation and Technological Transfer”) and was partially funded by Horizon Europe research and innovation programme project RES-Q plus (Grant Agreement No. 101057603) and funded by the European Union.
Conflict of interest
Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the Health and Digital Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.
References