Database (Oxford). 2024; 2024: baae090.
Published online 2024 Sep 11. https://doi.org/10.1093/database/baae090
PMCID: PMC11389607
PMID: 39259689

Transformer-based approach for symptom recognition and multilingual linking

Abstract

This paper presents a transformer-based approach for symptom Named Entity Recognition (NER) in Spanish clinical texts and multilingual entity linking on the SympTEMIST dataset. For Spanish NER, we fine-tune a RoBERTa-based token-level classifier with Bidirectional Long Short-Term Memory (BiLSTM) and conditional random field (CRF) layers on an augmented train set, achieving an F1 score of 0.73. Entity linking is performed via a hybrid approach: dictionary matching, candidate generation from a knowledge base of Unified Medical Language System aliases using the cross-lingual SapBERT, and reranking of the top candidates using GPT-3.5. The entity linking approach yields consistent results across languages, averaging 0.73 accuracy on the SympTEMIST multilingual dataset, and achieves an accuracy of 0.6123 on the Spanish entity linking task, surpassing the current top score for this subtask.

Database URL: https://github.com/svassileva/symptemist-multilingual-linking

Introduction

Clinical named entity recognition (NER) and entity linking are the tasks of automatically detecting important terms in clinical text, such as symptoms, diseases, procedures, and diagnoses, and identifying the correct concept from a standard medical ontology or classification that each term represents. Detecting and normalizing clinical terms is an important task in clinical Natural Language Processing, used for extracting structured information from clinical texts and enabling its subsequent automatic processing for downstream tasks. Due to the complexity of clinical texts, extracting the terms and linking them to the correct concepts from medical ontologies is very challenging. The search space for entity linking is usually hundreds of thousands of concepts, and the majority of these concepts lie in the long tail of the distribution, i.e. they have very few representatives in the train datasets and medical vocabularies. Transformer-based models have been successfully applied to entity recognition, especially in high-resource languages like English. However, due to the limited labeled resources in other languages, clinical NER remains challenging for many languages.

Different shared tasks are organized each year to advance the methods for clinical information extraction by providing reference datasets that different teams can use to train their methods. The BioCreative Challenge and Workshop presents researchers with different challenges and allows them to compete and present their work at the BioCreative Workshop (1).

SympTEMIST is a shared task, part of the BioCreative VIII Challenge, aimed at detecting and normalizing symptoms, signs, and findings in Spanish clinical texts (2). The organizers provided a Spanish clinical dataset for training and evaluation of the participants’ methods. It consists of three subtasks—NER (Subtask 1), entity linking (Subtask 2), and multilingual entity linking (Subtask 3) for several languages—English, Portuguese, French, Italian, and Dutch (https://temu.bsc.es/symptemist/). The shared task aims to identify all signs, symptoms, non-numerical descriptions of test results, findings from imaging procedures, as well as patient death events. The first subtask focuses on NER in Spanish clinical case reports of entities from the aforementioned categories. The second subtask consists of identifying the correct Systematized Nomenclature of Medicine-Clinical Terms (SNOMED CT) code associated with each mention from Subtask 1. SNOMED CT (https://www.nlm.nih.gov/healthit/snomedct/us_edition.html) is an internationally used medical ontology and collection of terms used for clinical documentation and reporting (3). The list of possible SNOMED CT codes is provided by the organizers, and it is possible for some mentions to be missing from the knowledge base (KB). The third subtask includes identifying the correct SNOMED CT code in five different languages for automatically transferred entities in 350 case reports. The target list of SNOMED CT codes is the same as in Subtask 2.

This paper describes our approach for symptom recognition in Spanish as well as multilingual entity linking in six languages. The contributions of this paper are as follows: we propose a system for Spanish clinical NER using a data augmentation approach, which achieves a 0.73 F1 score on symptom recognition. We propose a hybrid approach for improving entity linking in multiple languages using dictionaries, cross-lingual SapBERT candidate generation, and GPT-3.5 reranking, which shows a consistently high accuracy score of 0.73 on average across five languages, improving on the best score on the SympTEMIST dataset (4). The approach can be adapted to other languages as well. We have published our code on GitHub (https://github.com/svassileva/symptemist-multilingual-linking).

Related work

Approaches based on transformer architectures are frequently utilized for NER tasks. Notably, the top-ranked system in the SympTEMIST challenge incorporates an ensemble of transformer models for Spanish clinical text and achieved an F1 score of 0.74 (strict) (5). Prior research on NER for Spanish text demonstrated that Spanish RoBERTa (6) enhanced with a conditional random field (CRF) head outperforms other NER models (7, 8). Moreover, the integration of a Bidirectional Long Short-Term Memory (BiLSTM) layer has been shown to improve model performance, achieving an F1 score of 0.79 in the context of the procedure challenge (7).

The widely adopted solution for entity linking in the context of clinical texts is the application of cosine similarity search alongside embeddings from the cross-lingual SapBERT model, as introduced by Liu et al. (9). Such an approach was utilized by the top three teams in the MedProcNER challenge (https://temu.bsc.es/medprocner/) (10, 11, 8), as well as by top-ranked teams in the SympTEMIST challenge, augmented with a reranking model based on Bidirectional Encoder Representations from Transformers (BERT), achieving an accuracy of 0.60.

The most prominent multilingual end-to-end solution for NER and entity linking is mReFinED (12). The proposed two-stage solution is based on a semisupervised method for annotating entity mentions within Wikipedia articles, subsequently leveraging this annotated corpus to train a multifaceted BERT-oriented model performing NER, entity disambiguation, and entity linking.

Some multilingual general-domain approaches based on sequence-to-sequence models successfully perform entity linking, as exemplified by the mGENRE system (13). The latter is designed to predict the normalized entity name for each mention in a specified language using an auto-regressive model (13) and demonstrated state-of-the-art results on the benchmark dataset Mewsli-9 (14).

In the biomedical and clinical domain, Zhu et al. propose a system that maps language-specific mentions to Unified Medical Language System (UMLS) using a controllable contrastive generation framework using a template-based UMLS concept summary to guide the decoder to generate the correct entity (15).

Data

SympTEMIST dataset

The SympTEMIST corpus (2) contains 1,000 clinical case reports in Spanish with labeled symptom spans and their corresponding concept codes from SNOMED CT. The corpus contains labeled train and test sets and a gazetteer of SNOMED CT codes and different aliases in Spanish. The train set has 750 documents, 11,899 sentences, and 343,243 tokens. The test set has 250 documents, 3,986 sentences, and 114,536 tokens. The train set contains 3,484 annotated entities with 1,534 unique entity codes. Fifty-nine mentions have no SNOMED CT code assigned, and the rest have a single corresponding code. An additional set of composite mentions was released after the end of the challenge. There is one nested mention. The Spanish SympTEMIST gazetteer contains a total of 164,817 aliases for terms in multiple categories, including findings, disorders, morphologic abnormalities, and others.

As part of the experimental multilingual subtask, the organizers automatically translated 350 Spanish clinical case reports into five languages and then transferred the labeled entities from Spanish into the target language using lexical annotation transfer techniques (2). The goal is to predict the correct SNOMED CT code based on the identified entity mentioned in the text. Figure 1 shows the number of entities in the train and test sets in different languages in the SympTEMIST dataset. Due to differences in the automatic translation quality, the number of labeled entities is slightly different in the different languages (2). Figure 2 shows the distribution of SNOMED CT concept types in the SympTEMIST gazetteer. The majority of concepts are disorders, followed by findings and morphologic abnormalities.

Figure 1. The number of entities in the train and test sets in different languages in the SympTEMIST dataset.

Figure 2. The number of concepts per type in the gazetteer in the SympTEMIST dataset.

Language pretraining dataset

To evaluate the effect of further pretraining of the language model, we compiled an additional dataset consisting of Spanish UMLS synonyms of all terms included in the SympTEMIST gazetteer, for a total of 337,039 aliases.

UMLS KB

Using the UMLS Spanish SNOMED CT (3), a KB of all aliases of SympTEMIST gazetteer concepts was created. In addition, data from the gazetteer and the train set for Subtask 2 were added. The KB consists of 289,734 aliases of symptoms. Figure 3 shows the frequency distribution of concept aliases in the Spanish KB—the majority of concepts have up to two aliases.

Figure 3. The concept alias frequency in the Spanish KB.

For the multilingual subtask, we compiled all UMLS SNOMED CT aliases for the different languages and combined them with the train set entities. Figure 4 shows the number of aliases in the KB for each language.

Figure 4. The number of aliases in the KB for the different languages.

Methods

Subtask 1—NER

The clinical report texts are first split into sentences using the SPACCC Sentence Splitter (https://github.com/PlanTL-GOB-ES/SPACCC_Sentence-Splitter), as about 33% of the case reports exceed the 512 input token limit of the employed models. Following the method we used in the original competition (16), we approach the NER subtask as a token classification task and label the train set using the IOB2 scheme (17). We train a transformer-based model with an additional two-layer BiLSTM, followed by a linear layer and a CRF, on top of the token classification task, using the negative conditional log-likelihood of the label sequence as the loss function. Figure 5 shows the NER pipeline architecture. We trained the model for 20 epochs using the following hyperparameters: learning rate, 5e-5 (recommended by the original BERT paper (18)); Adam beta1, 0.9; Adam beta2, 0.999 (the Adam default configuration of the huggingface library (https://huggingface.co/docs/transformers/en/main_classes/optimizer_schedules)); weight decay, 0.1 (recommended by the original RoBERTa paper (19)); batch size, 8; and gradient accumulation steps, 2.

Figure 5. The architecture of the NER model.
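As an illustration of this architecture, the sketch below assembles a RoBERTa + BiLSTM + CRF token classifier in PyTorch with the pytorch-crf package. It is a minimal sketch, not the exact competition code: the three-label IOB2 tag set and the LSTM hidden size are illustrative assumptions.

```python
import torch.nn as nn
from transformers import AutoModel
from torchcrf import CRF  # from the pytorch-crf package


class RobertaBiLstmCrf(nn.Module):
    """Token classifier: RoBERTa encoder -> two-layer BiLSTM -> linear -> CRF."""

    def __init__(self,
                 model_name="PlanTL-GOB-ES/roberta-base-biomedical-clinical-es",
                 num_labels=3,       # IOB2: O, B-SYMPTOM, I-SYMPTOM (assumed label set)
                 lstm_hidden=256):   # assumed hidden size, not reported in the paper
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        enc_dim = self.encoder.config.hidden_size
        self.bilstm = nn.LSTM(enc_dim, lstm_hidden, num_layers=2,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_labels)
        self.crf = CRF(num_labels, batch_first=True)

    def forward(self, input_ids, attention_mask, labels=None):
        hidden = self.encoder(input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        emissions = self.classifier(lstm_out)
        mask = attention_mask.bool()
        if labels is not None:
            # Loss: negative conditional log-likelihood of the gold label sequence.
            # Padded positions must carry a valid label index (e.g. O), not -100.
            return -self.crf(emissions, labels, mask=mask, reduction="mean")
        return self.crf.decode(emissions, mask=mask)  # best label sequence per sentence
```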

After splitting the training dataset into sentences, we augment it by randomly replacing some of the annotated mentions with a synonym from the Spanish UMLS. This results in 1,672 additional example sentences, for a total of 13,571.
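A minimal sketch of this synonym-replacement augmentation is given below. The span representation and the `synonyms` mapping (SNOMED CT code to Spanish UMLS aliases) are assumed data structures, and the IOB2 labels must be regenerated after replacement because synonym lengths differ.

```python
import random


def augment_sentence(tokens, entity_spans, synonyms, p_replace=0.5):
    """Return a copy of the sentence with some mentions swapped for UMLS synonyms.

    tokens       -- list of word tokens in the sentence
    entity_spans -- (start, end, snomed_code) spans over `tokens`
    synonyms     -- assumed mapping: SNOMED CT code -> list of Spanish aliases
    """
    out, prev_end = [], 0
    for start, end, code in sorted(entity_spans):
        out.extend(tokens[prev_end:start])
        mention = tokens[start:end]
        if code in synonyms and random.random() < p_replace:
            mention = random.choice(synonyms[code]).split()  # swap in a synonym
        out.extend(mention)
        prev_end = end
    out.extend(tokens[prev_end:])
    # IOB2 labels must be regenerated afterwards, as synonym lengths differ.
    return out
```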

Classification model selection

For the token classifier base model, we performed experiments with different Spanish BERT-based models: CLIN-X-ES (20) and PlanTL-GOB-ES/roberta-base-biomedical-clinical-es (6) (Spanish RoBERTa). We chose these models as they had performed well in previous challenges on NER in Spanish clinical texts, showing competitive results in MedProcNER (21) and DisTEMIST (22).

CLIN-X-ES is based on XLM-RoBERTa Large, a cross-lingual language model, and is additionally trained on a Spanish clinical corpus using the masked language modeling objective. The corpus combines the MeSpEN (23) dataset and the Scielo archive (https://scielo.org/). We use this model as-is, without further language pretraining.

Spanish RoBERTa (PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) is a monolingual Spanish model, based on RoBERTa and trained on a large Spanish biomedical–clinical corpus of more than 1B tokens. Systems using this model have achieved very good results on previous Spanish biomedical–clinical tasks. We further pretrain the Spanish RoBERTa model on the language pretraining dataset to improve the language model’s training on symptoms.

Language model pretraining

We evaluate the effect of further pretraining the base transformer model by training the Spanish RoBERTa model on the language pretraining dataset with the masked language modeling objective for four epochs. Hyperparameter values are the same as those used for pretraining RoBERTa (base) in the original RoBERTa paper (19).
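The sketch below shows what this further pretraining step could look like with the Hugging Face Trainer. The output directory and the two example aliases are placeholders (the real dataset has 337,039 aliases), and unspecified hyperparameters are left at library defaults rather than the exact RoBERTa values.

```python
from datasets import Dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "PlanTL-GOB-ES/roberta-base-biomedical-clinical-es"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Illustrative aliases; the actual pretraining dataset has 337,039 UMLS synonyms.
aliases = ["dolor torácico opresivo", "cefalea intensa"]
dataset = Dataset.from_dict({"text": aliases}).map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=64),
    batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="mlm-symptoms", num_train_epochs=4),
    train_dataset=dataset,
    # Mask 15% of tokens, the standard RoBERTa masking rate.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
)
trainer.train()
```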

Subtask 2—entity linking

The entity linking task aims to predict the correct SNOMED CT code, using the gold entities provided in the SympTEMIST dataset. The dataset originally provided by the organizers had all composite mentions, which correspond to more than one code, removed from the train and test sets; therefore, our system outputs a single code. Some entity mentions have no corresponding code in SNOMED CT and are marked as NO_CODE (59 mentions in the train set).

We use the model developed for the competition (16), which did not have any composite mentions in its training data and therefore does not address the composite code problem. Entity linking is performed in two steps: first, we try to match the mention to an alias in the KB using an exact match string search on the lowercase symptom name; second, for entities that did not match a KB alias, we perform a cosine similarity search on the cross-lingual SapBERT (9) embeddings, retrieving the closest alias from the KB. The architecture of the entity linking model is shown in Figure 6. We gather symptom synonyms from different data sources, including the SympTEMIST gazetteer, train data, and UMLS symptoms (3). We also augment the rare symptom concepts that have fewer than five aliases by adding/removing random characters to generate five new aliases for each concept.

Figure 6. The architecture of the entity linking model.
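The two-step lookup can be sketched as follows, assuming the publicly released cross-lingual SapBERT checkpoint and a toy one-entry KB; the real KB maps 289,734 aliases from the gazetteer, the train set, and UMLS to SNOMED CT codes.

```python
import torch
import torch.nn.functional as F
from transformers import AutoModel, AutoTokenizer

SAPBERT = "cambridgeltl/SapBERT-UMLS-2020AB-all-lang-from-XLMR"  # cross-lingual SapBERT
tokenizer = AutoTokenizer.from_pretrained(SAPBERT)
model = AutoModel.from_pretrained(SAPBERT).eval()


def embed(texts, batch_size=128):
    """L2-normalised [CLS] embeddings, so a dot product equals cosine similarity."""
    chunks = []
    for i in range(0, len(texts), batch_size):
        batch = tokenizer(texts[i:i + batch_size], padding=True,
                          truncation=True, max_length=32, return_tensors="pt")
        with torch.no_grad():
            chunks.append(model(**batch).last_hidden_state[:, 0])
    return F.normalize(torch.cat(chunks), dim=-1)


# Toy KB: alias -> SNOMED CT code (29857009 is "chest pain").
kb = {"dolor torácico": "29857009"}
aliases = list(kb)
alias_vecs = embed(aliases)


def link(mention):
    key = mention.lower()
    if key in kb:                                # step 1: exact dictionary match
        return kb[key]
    sims = alias_vecs @ embed([mention])[0]      # step 2: cosine similarity search
    return kb[aliases[int(sims.argmax())]]
```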

Subtask 3—multilingual entity linking

For the multilingual entity linking task, the organizers provided automatically translated symptom mentions in English, French, Italian, Dutch, and Portuguese with the corresponding SNOMED CT codes, derived from the original gold Spanish annotations used in the second subtask. The pipeline used to assign a SNOMED CT code to the automatically translated mentions is shown in Figure 7. First, each mention is transformed to lowercase. Then, a lowercase exact match string search is performed against a language-specific dictionary of all UMLS SNOMED CT aliases, combined with entities from the corresponding train set. If a mention is found in the dictionary, it is assigned the corresponding code. Mentions not found in the dictionary are further processed by the cross-lingual SapBERT to find the five most similar entities (in terms of cosine similarity) from the UMLS KB. Finally, each mention, its five candidate entities, and the text of the Spanish patient case report from which the mention is derived are used to construct a one-shot prompt for GPT-3.5, which selects the best candidate. In case GPT-3.5 returns an entity that is not included in the five candidates provided, NO_CODE is assigned, which addresses a limitation of the basic cross-lingual SapBERT approach.

Figure 7. A pipeline for assigning SNOMED CT codes to automatically translated symptom mentions (Subtask 3), consisting of four main steps: first, symptom mentions are preprocessed and a dictionary lookup is performed; second, if a mention is found in the dictionary, it is assigned the corresponding code; third, any mentions not found in the dictionary are processed by the cross-lingual SapBERT to find the five most similar candidates in a KB; and finally, for each mention, GPT-3.5 is prompted to determine the best candidate from the five, using the mention and the original Spanish document from which the mention is derived.
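A simplified version of the reranking call is sketched below with the OpenAI Python client. The prompt is condensed and zero-shot for brevity (the actual one-shot prompt is shown in Figure 8), and the candidate matching logic is an illustrative assumption.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def rerank(mention, candidates, report_text):
    """Pick one of the top-5 SapBERT candidates with GPT-3.5.

    candidates -- list of (alias, snomed_code) pairs from the similarity search
    """
    options = "\n".join(f"- {alias} ({code})" for alias, code in candidates)
    prompt = (
        "You are linking clinical symptom mentions to SNOMED CT concepts.\n"
        f"Patient report (Spanish):\n{report_text}\n\n"
        f"Mention: {mention}\nCandidates:\n{options}\n"
        "Reply with exactly one candidate from the list."
    )
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    ).choices[0].message.content
    for alias, code in candidates:
        if alias.lower() in reply.lower():
            return code
    return "NO_CODE"  # the reply does not match any provided candidate
```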

Experiments and results

Train and validation datasets

We compile the train and validation sets by first splitting the original texts into sentences and then creating three bins of sentences based on the longest mention contained in the sentence: short (fewer than 38 characters), medium (between 38 and 90 characters), and long (over 90 characters). Finally, 80% of the sentences in each bin are used as train examples and the rest for validation. Bin margins are determined based on the mention length distribution in the original train set. The validation set is used for model comparison in different experiments, while the full train set is used to train the final models for test set evaluation.
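A sketch of this stratified split is shown below; the sentence representation and the exact treatment of the bin boundaries are our assumptions.

```python
import random


def split_by_mention_length(sentences, seed=42):
    """80/20 train/validation split, stratified by the longest-mention length bin."""
    bins = {"short": [], "medium": [], "long": []}
    for sent in sentences:  # sent: dict with "text" and "mentions" (list of strings)
        longest = max((len(m) for m in sent["mentions"]), default=0)
        if longest < 38:
            bins["short"].append(sent)
        elif longest <= 90:
            bins["medium"].append(sent)
        else:
            bins["long"].append(sent)
    rng = random.Random(seed)
    train, val = [], []
    for group in bins.values():  # 80% of each bin goes to train, the rest to val
        rng.shuffle(group)
        cut = int(0.8 * len(group))
        train += group[:cut]
        val += group[cut:]
    return train, val
```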

Micro-averaged precision, recall, and F1-score are used for the NER subtask. Accuracy is used for entity linking.

Subtask 1—NER


Table 1 presents the results of the models and fine-tuning combinations that were submitted in the original challenge. Models based on the Spanish RoBERTa (PlanTL-GOB-ES/roberta-base-biomedical-clinical-es) perform best, likely because it is specialized for the Spanish biomedical–clinical domain. The addition of a two-layer BiLSTM increases both recall and precision, perhaps due to its ability to capture long-term dependencies (7). Table 2 shows the strict vs overlapping precision, recall, and F1 scores on the test set. The overlapping scores are higher than the strict scores by 13%–15%, which could be explained by differences in the tokenization rules used by the annotators and our models. The augmented Spanish RoBERTa + BiLSTM + CRF model shows an overlapping F1 score of 0.8683, which ranked second in the competition.

Table 1.

Subtask 1 results on the validation and test sets of the models and fine-tuning combinations that were submitted in the original challenge (best scores are marked with *) (Subtask 1—NER section)

| Model | Val P | Val R | Val F1 | Test P | Test R | Test F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Spanish RoBERTa + CRF | 0.730 | 0.746* | 0.738 | - | - | - |
| Augmented Spanish RoBERTa + CRF | 0.773* | 0.729 | 0.750* | 0.732 | 0.718 | 0.725 |
| Augmented Spanish RoBERTa + BiLSTM + CRF | 0.744 | 0.732 | 0.738 | 0.739* | 0.725* | 0.732* |
| Pretrained augmented Spanish RoBERTa + CRF | 0.749 | 0.721 | 0.735 | 0.715 | 0.720 | 0.718 |
| CLIN-X-ES + CRF | 0.757 | 0.717 | 0.737 | 0.718 | 0.703 | 0.710 |
| Augmented CLIN-X-ES + CRF | 0.722 | 0.704 | 0.713 | 0.724 | 0.699 | 0.712 |

Models based on the Spanish RoBERTa perform best. The addition of a two-layer BiLSTM increases both recall and precision. The best score on the test set is achieved by the Augmented Spanish RoBERTa with BiLSTM and CRF layers—0.732 F1.

Table 2.

Subtask 1 results on the test sets of the models submitted in the challenge—strict and overlapping precision, recall, and F1 scores (best scores are marked with *) (Subtask 1—NER section).

| Model | Test P | Test R | Test F1 | Test overlap P | Test overlap R | Test overlap F1 |
| --- | --- | --- | --- | --- | --- | --- |
| Augmented Spanish RoBERTa + CRF | 0.7324 | 0.7178 | 0.725 | 0.8702 | 0.8528 | 0.8614 |
| Augmented Spanish RoBERTa + BiLSTM + CRF | 0.7393* | 0.7255* | 0.7324* | 0.8766* | 0.8602 | 0.8683* |
| Pretrained augmented Spanish RoBERTa + CRF | 0.7149 | 0.7207 | 0.7178 | 0.8603 | 0.8673* | 0.8638 |
| CLIN-X-ES + CRF | 0.7245 | 0.6991 | 0.7116 | 0.8748 | 0.8441 | 0.8592 |
| Augmented CLIN-X-ES + CRF | 0.7177 | 0.7026 | 0.7101 | 0.8651 | 0.8470 | 0.8559 |

The overlapping scores are higher than the strict scores by 13%–15%, which could be explained by differences in the tokenization rules used by the annotators and our models. The augmented Spanish RoBERTa + BiLSTM + CRF model shows an overlapping F1 score of 0.8683, which ranked second in the competition.

Effect of data augmentation

We compare the models trained on the original and the augmented dataset which contains 15% more examples generated by replacing symptom mentions with their UMLS synonyms. The model results on the augmented dataset are shown in Table 1. Using the augmented train set significantly improves the precision and F1 of the Spanish RoBERTa model after fine-tuning. A small performance drop is observed in the augmented CLIN-X-ES model compared to its nonaugmented version on the validation set. However, on the test set, the two models show very close F1 scores, with the augmented CLIN-X-ES having a small precision lead.

Effect of language pretraining

Further pretraining of the Spanish RoBERTa degrades model performance when fine-tuned on the augmented training set. Using only UMLS term synonyms out of context appears to be insufficient for conditioning the model for the NER task.

Subtask 2—entity linking

For the Spanish entity linking task, we experiment with various KBs by combining the different resources: the gazetteer, the train set, and the UMLS synonym dataset. In addition, we augment the KB by adding/removing random characters in the aliases of rare concepts that have fewer than five records in the KB. For each rare concept, five new records are added to the KB. For validation purposes, we use the full train set and exclude examples that exactly match aliases in the KB. Accuracy is used to measure the entity linking performance.
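The rare-concept augmentation can be sketched as follows; the in-place KB update and the lowercase alphabet used for character insertion are illustrative choices.

```python
import random
import string


def augment_rare_aliases(kb, min_aliases=5, n_new=5, seed=0):
    """Add noisy aliases for concepts with fewer than `min_aliases` KB records."""
    rng = random.Random(seed)
    by_code = {}
    for alias, code in kb.items():  # group existing aliases by concept code
        by_code.setdefault(code, []).append(alias)
    for code, aliases in by_code.items():
        if len(aliases) >= min_aliases:
            continue
        for _ in range(n_new):
            chars = list(rng.choice(aliases))
            pos = rng.randrange(len(chars))
            if rng.random() < 0.5 and len(chars) > 1:
                del chars[pos]                                         # remove a character
            else:
                chars.insert(pos, rng.choice(string.ascii_lowercase))  # add a character
            kb["".join(chars)] = code
    return kb
```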

We noticed that the straightforward cosine similarity search using cross-lingual SapBERT embeddings struggles with longer mentions, so we performed an experiment tackling this challenge with a sliding window approach. For each alias in the KB, we determine the final similarity score as a linear combination of the cosine similarities of three parts: the full mention, the first 75% of tokens in the mention, and the last 75% of tokens in the mention. Again, we select the KB alias with the highest combined cosine similarity score. Using grid search over the train set, we identify the optimal coefficients for the three parts in the linear combination to be 0.75, 0.17, and 0.08, respectively. This approach achieves 2% higher accuracy than the basic cross-lingual SapBERT model on the same KB, suggesting that the information needed to find the correct code is concentrated in the first part of the mention.
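A sketch of the combined score computation is shown below, reusing the `embed` helper from the earlier linking sketch (an assumption); it returns the weighted cosine similarity of one alias vector against the three views of the mention.

```python
def windowed_similarity(mention_tokens, alias_vec, embed, weights=(0.75, 0.17, 0.08)):
    """Combined cosine similarity over three views of the mention:
    the full mention, the first 75% of its tokens, and the last 75%.
    `embed` is assumed to return L2-normalised vectors (see the earlier sketch),
    so a dot product equals cosine similarity."""
    n = len(mention_tokens)
    k = max(1, round(0.75 * n))
    views = [" ".join(mention_tokens),
             " ".join(mention_tokens[:k]),
             " ".join(mention_tokens[-k:])]
    sims = embed(views) @ alias_vec  # one cosine similarity per view
    return sum(w * float(s) for w, s in zip(weights, sims))
```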

The results for the entity linking subtask submitted in the challenge are presented in Table 3. The best model we submitted for the challenge achieved an accuracy of 0.589 on the test set and has the richest KB, with additional data from UMLS. The majority of the models show close accuracy results in the range of 0.56–0.58. We also measured the effect of performing a dictionary match before the cosine similarity search: excluding the dictionary matching lowered the score by 1.4% on the validation set and 0.3% on the test set. There was a bug in the code for generating the predictions of this model, and it scored 0.017 in the official evaluation. After the challenge ended, we fixed the bug and reran the test, resulting in an accuracy of 0.586 on the test set.

Table 3.

Subtask 2 results on the validation and test sets of the models submitted in the challenge show that the best model achieved an accuracy of 0.589 on the test set, using a UMLS-enriched KB (best scores are marked with *) (Subtask 2—entity linking section).

| Model | KB | Val Acc | Test Acc |
| --- | --- | --- | --- |
| Cross-lingual SapBERT | Gazetteer + Train | 0.514 | 0.588 |
| Cross-lingual SapBERT | Gazetteer + Train + Aug. | 0.533 | 0.565 |
| Cross-lingual SapBERT | Gazetteer + Train + UMLS | 0.524 | 0.589* |
| Cross-lingual SapBERT + sliding window | Gazetteer + Train + Aug. | 0.536* | 0.587 |
| Cross-lingual SapBERT without dictionary match | Gazetteer + Train + UMLS | 0.510 | 0.017/0.586† |

The majority of the models show close accuracy results in the range of 0.56–0.58. Excluding the dictionary match from the pipeline resulted in a lower score on both validation and test sets.

† The first result is the original submission, which had a bug in the code; the second is the result after the post-competition bugfix.

Subtask 3—multilingual entity linking

As a baseline for the multilingual entity linking task, we perform a lowercase exact match string search within the KB for each language. Table 4 includes the baseline results and the best results across languages, including Spanish (part of Subtask 2). All experiments for this subtask were performed after the challenge ended.

Table 4.

Subtask 3 results on the test set for different languages after the challenge ended (best scores per language are marked with *) (Subtask 3—multilingual entity linking section).

| Model | Preprocessing | Language | Test set accuracy |
| --- | --- | --- | --- |
| Dictionary | Lowercase mentions | French | 0.527 |
| Dictionary + Cross-lingual SapBERT | - | French | 0.708 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | French | 0.7143 |
| Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | French | 0.708 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | French | 0.7284* |
| Dictionary | Lowercase mentions | English | 0.596 |
| Dictionary + Cross-lingual SapBERT | - | English | 0.7362 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | English | 0.7437 |
| Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | English | 0.7462 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | English | 0.7525* |
| Dictionary | Lowercase mentions | Dutch | 0.549 |
| Dictionary + Cross-lingual SapBERT | - | Dutch | 0.7261 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Dutch | 0.7309 |
| Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | Dutch | 0.7365* |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | Dutch | 0.7301 |
| Dictionary | Lowercase mentions | Italian | 0.523 |
| Dictionary + Cross-lingual SapBERT | - | Italian | 0.6943 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Italian | 0.7156 |
| Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | Italian | 0.713 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | Italian | 0.726* |
| Dictionary | Lowercase mentions | Portuguese | 0.503 |
| Dictionary + Cross-lingual SapBERT | - | Portuguese | 0.7047 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Portuguese | 0.7271* |
| Dictionary + Cross-lingual SapBERT | GPT-3.5 Spanish translation | Portuguese | 0.7094 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | GPT-3.5 Spanish translation | Portuguese | 0.7251 |
| Dictionary + Cross-lingual SapBERT + GPT-3.5 Reranking | - | Spanish | 0.6123* |

The baseline dictionary approach shows the lowest score, 0.5396 on average across all languages. Enhancing the dictionary with a cosine similarity search using cross-lingual SapBERT embeddings significantly improves the accuracy, by 14%–20% depending on the language. Using GPT-3.5 as a reranker of the final results adds a further 1%–2%, and finally, preprocessing the original text by translating it into Spanish shows the best results for French, English, Italian, and Spanish.

For all languages, we use the cross-lingual SapBERT with the Spanish UMLS KB to assign a SNOMED CT code to symptom mentions that are not present in the dictionary. The model produces an embedding vector for the query symptom mention, which is then used to find the most similar (in terms of cosine similarity) entities in the KB. The Spanish UMLS KB was chosen as it achieved our highest result in the second subtask while offering a good balance between code variety and the number of aliases. For French, using the augmented KB instead leads to a lower accuracy score (0.682 vs 0.708 for the UMLS KB). Unsurprisingly, naively translating the KB into the target language (e.g. French) also causes performance degradation.

Figure 8 shows an example of the instructions we provide to GPT-3.5-turbo (gpt-3.5-turbo-0125) to rerank the candidates generated by the cross-lingual SapBERT. The prompt is constructed from a symptom mention, candidates from SapBERT, and the text of the patient report from which the symptom mention is derived. The example in the prompt varies between languages and is taken from the corresponding train set. The number of candidate mentions in the prompt has a significant impact on the reranking performance: having fewer candidates is better (0.728 accuracy with five candidates for French), even though the right candidate is found more often in the top 10 (0.712 accuracy with 10 candidates for French). Reranking performance is significantly affected by the presence of the correct entity among the top five candidate entities found by SapBERT. For English, the correct entity is in the top five candidates for about 52% of the entities that are processed by SapBERT, and 76% of those are reranked correctly. Last but not least, the Spanish entity linking (Subtask 2) also benefits from the reranking step, achieving a new best accuracy score of 0.6123 (achieved after the challenge deadline).

Figure 8. Example prompt for reranking candidate entities.

Translating the symptom mention from the original language into Spanish brings a stable performance improvement across languages. Since the KB consists of Spanish UMLS terms, this could explain why translating the search term into Spanish improves the system's performance. Figure 10 shows the prompt used to translate a symptom mention into Spanish via GPT-3.5-turbo. Table 4 shows the impact of symptom mention translation and candidate reranking.

Figure 10. The prompt used for translating a symptom mention into Spanish.
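A minimal version of this translation call might look as follows, reusing the OpenAI `client` from the reranking sketch; the prompt wording is illustrative rather than the exact text shown in Figure 10.

```python
def translate_to_spanish(mention):
    """Translate a symptom mention into Spanish before the KB search."""
    reply = client.chat.completions.create(
        model="gpt-3.5-turbo-0125",
        messages=[{"role": "user",
                   "content": "Translate this clinical symptom mention into "
                              f"Spanish. Reply with the translation only:\n{mention}"}],
        temperature=0,
    )
    return reply.choices[0].message.content.strip()
```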

Figure 9. Example reranking result.

Error analysis

For the multilingual entity linking task, across all languages, about 2% of the test set symptom mentions match a term from the corresponding dictionary, but their code in the test set differs from the one in the dictionary. For example, in the Portuguese dictionary, the term macrocitose has a code of 397073000 (finding), while in the test set, it is assigned 72826005 (morphologic abnormality), both being valid candidate codes, depending on the context.

Conclusion

We explore transformer-based approaches to solving the SympTEMIST NER and linking tasks. For NER, systems based on a specialized monolingual model achieve the best results. The addition of a BiLSTM layer after the last transformer layer, together with train data augmentation, improves performance on the test set. Further pretraining on a UMLS synonyms dataset did not prove beneficial. For the entity linking task, we employ a hybrid approach, including UMLS dictionary matching, candidate generation using the cross-lingual SapBERT, and reranking of the top five candidates using GPT-3.5. The choice of KB has the highest impact on system performance: our highest-accuracy model combines the SympTEMIST gazetteer, UMLS synonyms, and train set annotations. We performed entity linking experiments with Spanish and five other European languages, which showed a stable accuracy of 0.73 on average. The approach also improved the entity linking for Spanish, achieving an accuracy of 0.6123 on the test set and thus improving the current top score from the SympTEMIST challenge of 0.607.

All the challenge participants in the NER subtask use BERT-based models in some fashion, predominantly for token classification, and the best results are shown by ensemble models (5, 2). We could improve our system by training multiple BERT-based models in an ensemble. For entity linking, many solutions use the cross-lingual SapBERT but combine it in different ways. The best score on the entity linking challenge combines TF-IDF and SapBERT for candidate generation and a trained BERT-based cross-encoder for reranking (24). Our proposed method improves on their score but requires more compute resources and uses a proprietary model (GPT-3.5). As future work, we could explore fine-tuning an open-source model with fewer parameters, such as Llama 3 8B (25), to rerank the suggested candidates, which would allow using the model in a lower-resource environment.

Limitations

The entity linking experiments covered six different European languages, all of which are covered by UMLS. Applying the same approach to languages from different language families or languages that are not represented in UMLS will probably result in lower performance. Since the dataset was automatically translated from Spanish and the annotations were automatically applied, the mentions that were successfully transferred may represent concepts that are easier to link, since they are lexically similar. Therefore, the approach may not perform as well on a manually labeled dataset that includes mentions that are harder to link.

Contributor Information

Sylvia Vassileva, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria.

Georgi Grazhdanski, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria.

Ivan Koychev, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria.

Svetla Boytcheva, Faculty of Mathematics and Informatics, Sofia University St. Kliment Ohridski, Blvd “James Bourchier” 5, Sofia 1164, Bulgaria. Ontotext, Ontotext, ul. “Nikola Gabrovski” 79, Sofia 1700, Bulgaria.

Funding

This work was supported by the European Union-NextGenerationEU, through the National Recovery and Resilience Plan of the Republic of Bulgaria (Grant Project No. BG-RRP-2.004-0008 - Project “Sofia University - Marking Momentum for Innovation and Technological Transfer”) and was partially funded by Horizon Europe research and innovation programme project RES-Q plus (Grant Agreement No. 101 057 603) and funded by the European Union.

Conflict of interest

Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the Health and Digital Executive Agency. Neither the European Union nor the granting authority can be held responsible for them.

References

1. Islamaj R, Arighi C, Campbell I, et al. Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the Era of Generative Models. AMIA 2023 Annual Symposium, New Orleans, LA, USA. Zenodo, 2023. https://doi.org/10.5281/zenodo.10103191
2. Lima-López S, Farré-Maduell E, Gasco-Sánchez L, et al. Overview of the SympTEMIST shared task at BioCreative VIII: detection and normalization of symptoms, signs and findings. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the Era of Generative Models. New Orleans, LA, USA. Zenodo, 2023.
3. Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004;32:D267–70. https://api.semanticscholar.org/CorpusID:205228801
4. Lima-López S, Gasco-Sánchez L, Farré-Maduell E, et al. SympTEMIST corpus: gold standard annotations for clinical symptoms, signs and findings information extraction. 2023. https://zenodo.org/records/10635215 (1 March 2024, date last accessed).
5. Gallego F, Veredas F. ICB-UMA at BioCreative VIII @ AMIA 2023 Task 2 SYMPTEMIST (Symptom TExt Mining Shared Task). In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the Era of Generative Models. Zenodo, 2023. https://doi.org/10.5281/zenodo.10104058
6. Carrino C, Armengol-Estapé J, Gutiérrez-Fandiño A, et al. Biomedical and clinical language models for Spanish: on the benefits of domain-specific pretraining in a mid-resource scenario. 2021. https://arxiv.org/abs/2109.03570 (1 March 2024, date last accessed).
7. Almeida T, Jonker R, Poudel R, et al. Discovering medical procedures in Spanish using transformer models with MCRF and augmentation. In: Working Notes of CLEF 2023—Conference and Labs of the Evaluation Forum. CEUR-WS, Vol. 3497, Greece, September 2023, pp. 60–72.
8. Chizhikova M, Collado-Montañez J, Díaz-Galiano M, et al. Coming a long way with pre-trained transformers and string matching techniques: clinical procedure mention recognition and normalization. In: Working Notes of CLEF 2023—Conference and Labs of the Evaluation Forum. CEUR-WS, Vol. 3497, Greece, September 2023, pp. 91–101.
9. Liu F, Vulić I, Korhonen A, et al. Learning domain-specialised representations for cross-lingual biomedical entity linking. In: Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing, Vol. 2, Online, August 2021, pp. 565–74. https://doi.org/10.18653/v1/2021.acl-short.72
10. Zotova E, García-Pablos A, Cuadros M, et al. VICOMTECH at MedProcNER 2023: transformers-based sequence-labelling and cross-encoding for entity detection and normalisation in Spanish clinical texts. In: Working Notes of CLEF 2023—Conference and Labs of the Evaluation Forum. CEUR-WS, Vol. 3497, Greece, September 2023, pp. 206–18.
11. Vassileva S, Grazhdanski G, Boytcheva S, et al. Fusion @ BioASQ MedProcNER: transformer-based approach for procedure recognition and linking in Spanish clinical text. In: Working Notes of CLEF 2023—Conference and Labs of the Evaluation Forum. CEUR-WS, Vol. 3497, Greece, September 2023, pp. 190–205.
12. Limkonchotiwat P, Cheng W, Christodoulopoulos C, et al. mReFinED: an efficient end-to-end multilingual entity linking system. In: Findings of the Association for Computational Linguistics: EMNLP 2023. Association for Computational Linguistics, Singapore, December 2023, pp. 15080–89. https://doi.org/10.18653/v1/2023.findings-emnlp.1007
13. De Cao N, Wu L, Popat K, et al. Multilingual autoregressive entity linking. Transactions of the Association for Computational Linguistics 2022;10:274–90. https://aclanthology.org/2022.tacl-1.16
14. Botha J, Shan Z, Gillick D. Entity linking in 100 languages. In: Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics, Online, November 2020, pp. 7833–45. https://doi.org/10.18653/v1/2020.emnlp-main.630
15. Zhu T, Qin Y, Chen Q, et al. Controllable contrastive generation for multilingual biomedical entity linking. In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Singapore, December 2023, pp. 5742–53. https://doi.org/10.18653/v1/2023.emnlp-main.350
16. Grazhdanski G, Vassileva S, Koychev I, et al. Team Fusion@SU @ BC8 SympTEMIST track: transformer-based approach for symptom recognition and linking. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the Era of Generative Models. New Orleans, LA, USA. Zenodo, 2023. https://doi.org/10.5281/zenodo.10103750
17. Krishnan V, Ganapathy V. Named Entity Recognition. 2005. https://cs229.stanford.edu/proj2005/KrishnanGanapathy-NamedEntityRecognition.pdf (1 March 2024, date last accessed).
18. Devlin J, Chang M-W, Lee K, et al. BERT: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). Association for Computational Linguistics, USA, June 2019, pp. 4171–86. https://doi.org/10.18653/v1/N19-1423
19. Liu Y, Ott M, Goyal N, et al. RoBERTa: a robustly optimized BERT pretraining approach. 2019. https://arxiv.org/abs/1907.11692 (1 March 2024, date last accessed).
20. Lange L, Adel H, Strötgen J, et al. CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinformatics 2022;38:3267–74. https://doi.org/10.1093/bioinformatics/btac297
21. Lima-López S, Farré-Maduell E, Gasco L, et al. Overview of MedProcNER task on medical procedure detection and entity linking at BioASQ 2023. In: Working Notes of CLEF 2023—Conference and Labs of the Evaluation Forum. CEUR-WS, Vol. 3497, Greece, September 2023, pp. 1–18.
22. Miranda-Escalada A, Gasco L, Lima-López S, et al. Overview of DisTEMIST at BioASQ: automatic detection and normalization of diseases from clinical texts: results, methods, evaluation and multilingual resources. In: Working Notes of the Conference and Labs of the Evaluation Forum (CLEF 2022). CEUR-WS, Vol. 3180, Italy, September 2022, pp. 179–203.
23. Villegas M, Intxaurrondo A, Gonzalez-Agirre A, et al. The MeSpEN resource for English–Spanish Medical Machine Translation and Terminologies: census of parallel corpora, glossaries and term translations. In: Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018). European Language Resources Association (ELRA), Japan, May 2018, pp. 32–39.
24. Borchert F, Llorca I, Schapranow M. HPI-DHC @ BC8 SympTEMIST track: detection and normalization of symptom mentions with SpanMarker and xMEN. In: Proceedings of the BioCreative VIII Challenge and Workshop: Curation and Evaluation in the Era of Generative Models. Zenodo, 2023.
25. AI@Meta. Llama 3 Model Card. 2024. https://github.com/meta-llama/llama3/blob/main/MODEL_CARD.md (1 March 2024, date last accessed).
