Hybrid natural language processing tool for semantic annotation of medical texts in Spanish
BMC Bioinformatics volume 26, Article number: 7 (2025)
Abstract
Background
Natural language processing (NLP) enables the extraction of information embedded within unstructured texts, such as clinical case reports and trial eligibility criteria. By identifying relevant medical concepts, NLP facilitates the generation of structured and actionable data, supporting complex tasks like cohort identification and the analysis of clinical records. To accomplish those tasks, we introduce a deep learning- and lexicon-based named entity recognition (NER) tool for texts in Spanish. It performs medical NER and normalization, medication information extraction, detection of temporal entities, negation and speculation, and annotation of temporality or experiencer attributes (Age, Contraindicated, Negated, Speculated, Hypothetical, Future, Family_member, Patient and Other). We built the tool with a dedicated lexicon and rules adapted from NegEx and HeidelTime. Using these resources, we annotated a corpus of 1200 texts, with high inter-annotator agreement (average F1 = 0.841 ± 0.045 for entities, and average F1 = 0.881 ± 0.032 for attributes). We used this corpus to train Transformer-based models (RoBERTa-based models, mBERT and mDeBERTa). We integrated them with the dictionary-based system in a hybrid tool, and distributed the models via the Hugging Face hub. For an internal validation, we used a held-out test set and conducted an error analysis. For an external validation, eight medical professionals evaluated the system by revising the annotation of 200 new texts not used in development.
Results
In the internal validation, the models yielded F1 values up to 0.915. In the external validation with 100 clinical trials, the tool achieved an average F1 score of 0.858 (± 0.032); and in 100 anonymized clinical cases, it achieved an average F1 score of 0.910 (± 0.019).
Conclusions
The tool is available at https://claramed.csic.es/medspaner. We also release the code (https://github.com/lcampillos/medspaner) and the annotated corpus used to train the models.
Introduction
The substantial volume of medical data contained within electronic health records (EHRs), articles or clinical trials represents a potential source of evidence and knowledge discovery [1, 2]. However, this information is predominantly stored in an unstructured format, which poses challenges for effective extraction and analysis. The application of natural language processing (NLP) techniques has facilitated more efficient text mining in the medical domain [3].
Comprehensive NLP tools can extract relevant information and alleviate the manual curation of data by healthcare professionals, which is time-consuming and error-prone [4]. Such a system can automate cohort definition tasks for clinical trials by extracting medical concepts from eligibility criteria [5], and can enhance the discovery of disease-drug pairs for drug repurposing [6, 7]. Likewise, it can be used for information extraction from clinical cases, with the goal of enhancing the analysis of large volumes of data [8] and identifying phenotype variance [9], a step towards precision medicine [10]. Potential applications range from automatically extracting co-occurrences of drug-disease pairs in real-world data [11, 12] and detecting comorbidities, disease recurrence and risk factors (e.g., in dementia [13], cardiovascular disease [14, 15] or mental health [13, 16, 17]) to EHR-based pharmacovigilance [18]. The reuse of clinical data can also help identify potential participants who meet the eligibility criteria of clinical trials [19]. In these tasks, the detection of negation and speculation is essential for accurately excluding pathological states and comorbidities that are not associated with specific conditions [20]. Additionally, the processing of temporal information and the history of medical conditions is critical for predicting clinical events and managing chronic diseases [21].
A challenge in implementing this type of NLP system is the considerable effort required to tailor it to each use case [22]. Ideally, an open-source medical NLP tool should be flexible enough to allow other teams to customize it for their particular needs. Additionally, such a system should be capable of processing a wide range of information, encompassing disorders, treatments, age groups, negated and speculated events, drug contraindications, and the history of conditions or procedures. Although there has been a rise in the number of teams conducting research in Spanish medical NLP, integrated systems remain limited. A notable exception is the framework developed by [23], which facilitates the expansion of medical terminologies, enabling the reuse of information available in EHRs for advanced data analysis.
In this context, we introduce MEDSPANER, a Medical Semantic Python-Assisted Named Entity Recognizer for the Spanish language. The tool performs medical named entity recognition (NER), medication information extraction, temporal entity annotation and detection of negation and speculation. The system also normalizes concepts according to Concept Unique Identifiers (CUIs) from the Unified Medical Language System (herein, UMLS) [24] or SNOMED CT codes [25], using a dedicated lexicon [26].
We developed the tool to automate the processing and analysis of clinical trials by medical professionals. While it was initially designed for this specific text genre, we are releasing it with the intention that it can also be applied to other medical sub-genres or adapted for additional tasks in the future. A demo video is provided as supplementary material. Our contributions are:
- An integrated tool that can be used with a UNIX-based terminal or with a graphical user interface (GUI; Fig. 1). Annotations can be normalized to UMLS CUIs or SNOMED CT codes. The output can be JSON (which enables exporting data to a database such as MongoDB) or the ann format (which enables loading the annotated files in the BRAT tool [27]).
- A set of fine-tuned Transformer-based models already available for the above-mentioned sub-tasks.
- An error analysis of the system's output, and a human evaluation by medical professionals, to show its strengths and limitations.
- An enriched annotation of the CT-EBM-SP corpus [28], used to develop the tool and to train the deep learning models. We also share the 200 annotated texts used in the human evaluation.
Fig. 1 Graphical User Interface. A text is pasted into the text box (A) and several annotation options can be selected (B). The annotated entities are shown with different colors (C). Hovering over an entity shows information about its type and normalization (e.g. UMLS CUIs), if available (D). Results can be downloaded in BRAT format (E)
A supplementary graphical abstract summarizes the contents of this work [see Additional-File-1].
Background
NLP libraries in the Python language, e.g., Stanza [29] or spaCy [30], currently perform fast automatic tokenization, sentence splitting and linguistic analysis (part-of-speech, syntactic or semantic tagging). In addition, advances in neural-network-based frameworks [31, 32] provide modules to integrate the output of those linguistic analyses with contextual information available in pre-trained language models [33] or contextual embeddings [34]. In summary, the current NLP ecosystem enables the construction of complex pipelines for the processing of medical language. Some examples of Python libraries are those developed to look up the UMLS Metathesaurus [35] or to process EHRs and other clinical texts [36,37,38,39,40,41].
Multipurpose medical annotation tools feature medical entity annotation and classification (e.g., MedLEE [42], MetaMap [43] or MetaMap Lite [44]) as well as negation and assertion detection (e.g., cTAKES [4] or CLAMP [22]). For medical entity recognition, systems can use dictionaries [43]. These have the advantage of identifying entities in unsupervised settings and of coping with infrequent term mentions, e.g., microbial species [45, 46]. Lexicon methods can be combined with rules and machine learning [4]. Recently, general models have been adapted or pre-trained on large-scale biomedical and clinical data [47,48,49,50]. Recent experiments have also tested dictionary-based approaches that feed lexicon output into neural models [51, 52].
With regard to temporal information [53], rule-based systems such as HeidelTime [54] and machine-learning classifiers have been applied, yielding good results on clinical texts [55, 56]. Similar methods were used to process temporal entities and relations in clinical trials [57, 58]. For the detection of negation, uncertainty, temporality and experiencer attributes, methods have evolved from rules, e.g., NegEx [59] and ConText [60], to supervised neural-network models [61,62,63,64,65]. For the extraction of medication information, a similar transition has occurred from rule-based systems to neural-based or hybrid approaches [66, 67].
At the same time, recent advances in large language models (LLMs) and GPT-based methods are being applied to parse trial eligibility criteria and match patients to EHR data [68, 69]. Nonetheless, the need to keep patients' protected information within each health institution poses a major limitation to commercial systems such as ChatGPT. In this context, Transformer-based models and open-source projects are a feasible alternative, given that they can be executed locally and their results are competitive, sometimes with higher performance according to some studies [70]. Combining such a framework with a physician-in-the-loop who checks the system output [71] seems a suitable approach.
Most tools are available for the English language and few tools are aimed at eligibility criteria of clinical trials [5] (no system exists for the Spanish language, to the best of our knowledge). Numerous teams have contributed to the task of medical entity recognition in Spanish; herein, we review only the works most related to ours (Table D.9 in Appendix D gathers more details). Initiatives include detecting drug effects in social media [72], drugs in clinical cases [73], or adverse drug reactions in EHRs by applying random forests (RFs) [74]. ICD-10 coding of disorders in EHRs was explored using RFs [75] or BERT models on radiology reports [76]. Also in radiology reports, a wide range of entities such as findings or body parts were annotated [77]. Detecting disabilities in Orphanet data was tested by means of SVM classifiers or Bi-LSTM networks [78] and BERT models [79]. In clinical referrals, entities were extracted by using a Bi-LSTM-CRF framework [80]; the same method combined with rules was applied to health reports of disease outbreaks [81]. Temporal entity recognition was performed by means of Bi-LSTM-CRF architectures in clinical narratives [82] or RoBERTa models in clinical cases [83]. As far as we know, detecting medication information (e.g. dosage) in Spanish was only explored on summaries of product characteristics by employing dictionary-based methods [84]. Negation detection was performed using syntactic methods on radiology reports [85], and negation/uncertainty was detected by applying Bi-LSTM-CRF or BERT-based models on product reviews, journal articles and clinical notes [62, 64, 65].
Overall, most methods in recent years have relied on Bi-LSTM networks or BERT models. However, no comprehensive system exists for fast and actionable processing of this type of relevant information, and most works have focused on specific semantic aspects. No open-source tool has been developed that integrates annotation models for medical entities, negation and uncertainty, temporal information, and experiencer attributes within a unified framework for Spanish medical texts. In addition, medical attributes of event temporality (e.g. History_of) or experiencer (e.g. Patient or Family_member) have not been considered in other NLP projects for the clinical literature in Spanish.
This work contributes an integrated tool for medical language processing of texts in Spanish. It extracts UMLS entities, medication-related data, temporal information, negation or speculation, experiencer or event temporality attributes, and miscellaneous entities such as observations, results or qualifiers. We combined a dedicated lexicon, the knowledge available in rule-based methods [54, 59] adapted to medical Spanish, and state-of-the-art Transformer-based models. Originally developed for clinical trials, the tool is now being released for use with other medical sub-genres.
Implementation
The implementation involved several stages. The first steps were creating resources and rules for named entity recognition of the entities of interest. These resources enabled the pre-annotation of a task-specific corpus, which was revised by experts and then used to train deep learning (DL) models in a supervised setting. The MedLexSp lexicon of Spanish medical terms [26] with lemmas and form variants is used for domain entity recognition by employing a dictionary-based approach. The Unified Medical Language System (UMLS) [24] is the main source of domain terminology in MedLexSp. The tool normalizes concepts using SNOMED CT codes [25] or the UMLS concept unique identifiers (CUIs) recorded in MedLexSp.
For temporal entities, we developed rules by adapting HeidelTime [54]. For negation and speculation, we created rules by translating NegEx and ConText [59, 60] and reusing former work for Spanish [86]. Rules for medication-related information were manually developed and refined iteratively in corpus annotation. We also used the lists of administration routes and dosage forms from the Spanish Medicines Agency Nomenclátor [87]. We created rules for the following entity types: Dose, Concentration or Strength; Route or Mode of administration; and Dosage form.
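As an illustration only (a simplified pattern and unit list, not the tool's actual rule set), a regular-expression rule for Dose, Concentration or Strength mentions might look like this:

```python
import re

# Hypothetical, simplified rule for Dose/Concentration/Strength mentions;
# the MEDSPANER rules are more extensive and were refined iteratively.
DOSE_PATTERN = re.compile(
    r"\b\d+(?:[.,]\d+)?\s*(?:mcg|µg|mg|g|ml|UI)(?:\s*/\s*(?:ml|kg|día|dosis))?",
    flags=re.IGNORECASE,
)

def find_dose_entities(text: str):
    """Return (start, end, surface form) tuples for candidate Dose entities."""
    return [(m.start(), m.end(), m.group()) for m in DOSE_PATTERN.finditer(text)]

# Prints candidate dose spans such as '125 mg' and '0,5 mg/kg'
print(find_dose_entities("Se administraron 125 mg de prednisona y 0,5 mg/kg de heparina."))
```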
The development of rules and the pre-annotation of domain texts were iterative processes: rules were redefined after analyzing the errors of the pre-annotated texts and after being revised by experts, then the improved rules were applied again to new batches of texts to revise, and so on. Figure 2 explains the workflow.
To develop the tools, we employed the Clinical Trials for Evidence-Based Medicine in Spanish (CT-EBM-SP) corpus [28]. This is a collection of 1200 texts about clinical trials (292173 tokens). A subset of 500 abstracts comes from journals available in PubMed or the Scientific Electronic Library Online (SciELO). Another subset of 700 clinical trial announcements was published in the European Clinical Trials Register. The quality and consistency of the annotations were assessed by computing the inter-annotator agreement (IAA) between annotator pairs, which showed very good agreement (§ Annotation and IAA). The corpus and guidelines are available at the companion repository. Figure 3 is a sample of an annotated text.
We conducted an internal and external validation of the tool, as recommended [88]. The internal evaluation used a held-out test set, i.e. not used in model training. For the external validation, eight medical professionals revised the system annotations on 200 new texts (not used in system development): 100 clinical trial announcements, and also 100 anonymized clinical cases (to assess the performance of other medical sub-genres). The next sections provide more details of each component.
UMLS entities
In the first version of the corpus [28], only four UMLS semantic groups were annotated: anatomic entities (ANAT; e.g., arm), pharmacological and chemical substances (CHEM; e.g., aspirin), pathological conditions (DISO; e.g., diabetes) and laboratory, diagnostic and therapeutic procedures (PROC; e.g., radiotherapy). In the second version (distributed here), we annotated these additional entity types: medical devices (DEVI; e.g., probe), genes and genetic material (GENE; e.g., BRAF), physiological processes (PHYS; e.g., breathing) and living beings (LIVB, e.g., patient; note that this category also includes viruses). Nevertheless, we did not use the GENE category to train the models, given the scarce number of instances and the low IAA scores achieved.
Temporal entities
We followed the TimeML annotation scheme [89] and annotated the following entity types: Date (e.g., 2022), Duration (e.g., dos horas, ‘two hours’), Frequency (called ‘set’ in TimeML; e.g., semanalmente, ‘weekly’) and Time (e.g., noche, ‘night’). We did not merge entities beginning with pre-/post- prefixes, as in other annotation projects [90], which mark them as PREPOSTEX. PREPOSTEX generally gathers Date or Time entities that are ambiguous or unspecified (i.e. when there is not enough context to decide if postoperative refers to an amount of time expressed in days or hours). However, we kept the distinction between Date and Time in these expressions since it is more valuable for the clinical practice. Finally, we added the entity type Age (e.g., 18 years old), given that age is an important factor for inclusion or exclusion in clinical trials. In addition, living being entities (LIVB) that also convey age information were annotated with the Age attribute (e.g., adolescent, adults).
Medication-related information
We considered the guidelines of previous corpora [66, 91, 92] and marked these types: Dosage form (Form, e.g., píldora, ‘pill’), Route or Mode of administration (Route, e.g., intravenoso, ‘intravenous’) and Dose, Concentration or Strength (Dose, e.g., 125 mg). We merged the categories of Dose and Concentration or Strength. This decision might provide coarse-grained results, but seemed more suitable for a general usage of the tool, which can be adapted if needed. Lastly, we annotated Contraindicated in some chemical and pharmacological entities (CHEM) or procedures (PROC). This is typically found in the exclusion criteria of trial announcements (e.g., Pacientes con contraindicación a corticoesteroides, ‘Patients with a contraindication to corticosteroids’; the entity corticosteroids is Contraindicated).
Negation and speculation
We marked negation or speculation only on concepts or events. This choice was applied in other biomedical corpora [74, 93,94,95,96] and some annotation tools [4, 37]. We did not mark the full negated or speculated scope, as done in other works [61, 63, 78, 97]. This criterion can be illustrated with the following sentence: Los pacientes no habían recibido tratamiento antibiótico ('The patients had not received antibiotic treatment'). The scope of the negation would imply annotating 'had not received antibiotic treatment'; however, we only annotated the concept 'antibiotic treatment' as Negated. This difference has impacted the performance of the NER models (see § Results). We marked negation and speculation cues (Neg_cue, Spec_cue), and events or entities within the scope of the negation or speculation were marked with Negated or Speculated, respectively. These were mostly marked on UMLS entities (ANAT, CHEM, DEVI, DISO, GENE, LIVB, PHYS and PROC).
Event temporality and Experiencer
Event temporality attributes are marked on entities referring to procedures and pathological or physiological conditions. These attributes specify whether the event occurred in the past (History_of or Family_History_of), if it will take place in the future (Future) or if it is Hypothetical. Experiencer attributes are only marked on LIVB entities referring to human beings. These attributes indicate if the entity is the experiencer of the condition or procedure (Patient), if it is a patient’s relative (Family_member) or if he/she has another role (Other). We adapted the scheme from previous works [4, 60, 98] to the Spanish language. Importantly, we adopted entity and attribute types used in former annotation projects focused on EHR or clinical case data. This was motivated by the development of a comprehensive annotation framework aimed at improving the mapping of clinical patient data to relevant clinical trials. Nonetheless, we did not use the Hypothetical or Family_History_of attributes to train the models, due to the few instances in the corpus.
Miscellaneous entities
Lastly, we also annotated a set of clinical entities deemed necessary for the task: CONC (concepts), Food/Drink (e.g. soy), Observation/Finding (e.g. relapse), Quantifier/Qualifier (e.g. at least 4, severe) and Result/Value (e.g. < 3 UNL).
Annotation and IAA
We applied the dictionary-based and rule-based tool to clinical trials texts to pre-annotate the data and then revise the output. Three experts annotated the data: a practicing medical doctor with 20 years of medical practice, a medical lexicographer with 18 years of experience in data curation and corpus annotation, and a computational linguist with 18 years of experience in data annotation and natural language processing. To define and learn the annotation guidelines, the experts annotated 12 texts, in several consensus rounds. We used the BRAT annotation tool [27]. After adequate agreement scores were reached, 112 texts were doubly annotated, again with meetings to fix disagreements and refine the guidelines iteratively. In the final step, the leading researcher annotated the remaining texts. Approximately 10% of the corpus (124 texts) was annotated by two or three experts.
We used the F-measure to assess the inter-annotator agreement (IAA) between pairs of doubly annotated sets. The F-measure is considered adequate for contexts where entities can have disparate spans [99]. The IAA values of four UMLS semantic groups (ANAT, CHEM, DISO and PROC) were reported in [28]. In the first version of the corpus, the IAA scores for these four entity types had an average F1 score of 0.856 (±0.048) in strict match (i.e. the full span and semantic class of the entity must match) after consensus annotations.
For the new annotated entities, the average IAA was an F1 score of 0.841 (±0.045) with consensus annotations (strict match). For attributes (Age, Negated, Speculated, Contraindicated, Hypothetical, Future, Family_member, Patient and Other), the IAA had an average F1 score of 0.881 (±0.032) with consensus (strict). Note that no IAA was computed for Family_History_of, because no instances appeared in the documents used for double annotation. IAA values were very good, given the large number of different entity types. However, IAA was lower for categories such as Contraindicated, Form, Spec_cue or Speculated, which had an impact on the lower F1 results obtained with the tested models.
We annotated 86 389 entities (average of 71.99 per text) and 16 590 attributes (average of 13.82 per text). Figure 4 reports the number of annotations per entity type and the IAA scores. For named entity recognition, the annotations were converted to the standard BIO format (B stands for ‘beginning’ of entity; I, ‘Inside’, and O, ‘Out’). The entity attributes were also processed to this format.
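As a minimal sketch (not the exact preprocessing script we used), character-offset entity annotations can be projected onto token-level BIO tags as follows:

```python
def to_bio(tokens, spans):
    """Project (start, end, label) character spans onto token-level BIO tags.

    `tokens` is a list of (text, start, end) tuples; `spans` is a list of
    (start, end, label) entity annotations, e.g. read from a BRAT .ann file.
    """
    tags = ["O"] * len(tokens)
    for ent_start, ent_end, label in spans:
        inside = False
        for i, (_, tok_start, tok_end) in enumerate(tokens):
            if tok_start >= ent_start and tok_end <= ent_end:
                tags[i] = ("I-" if inside else "B-") + label
                inside = True
    return tags

# "Pacientes con diabetes tipo 2" ('patients with type 2 diabetes')
tokens = [("Pacientes", 0, 9), ("con", 10, 13), ("diabetes", 14, 22),
          ("tipo", 23, 27), ("2", 28, 29)]
spans = [(0, 9, "LIVB"), (14, 29, "DISO")]
print(to_bio(tokens, spans))
# ['B-LIVB', 'O', 'B-DISO', 'I-DISO', 'I-DISO']
```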
Transformer architecture in BERT models
We used the neural-network-based Transformer architecture, as implemented in the Bidirectional Encoder Representations from Transformers (BERT) model [33]. The Transformer architecture replaces recurrent neural networks (e.g., Long Short-Term Memory models) with attention layers [100], and contains an encoder and a decoder (Fig. 5, a). Typically, 6 to 12 encoder layers and 6 to 12 decoder layers are stacked. The encoder converts the input sequence (\(x_1, \ldots ,x_n\)) into a sequence of continuous representations (\(z=(z_1, \ldots ,z_n)\)). Encoding layers generate token embeddings and positional embeddings, which carry information about the location of each word and capture dependencies between words. Note that BERT and RoBERTa also learn representations of segment embeddings. Each layer has a self-attention mechanism that considers the entire context of the sequence and weighs the importance of each word in relation to the others. The multi-head self-attention component employs a set of attention heads, run in parallel, which calculate Query (Q), Key (K) and Value (V) matrices to compute attention scores. The attention output is passed through a feed-forward neural network with two fully connected layers and a ReLU activation in between. Layer normalization is applied, and residual connections are added to the self-attention and feed-forward components, which helps stabilize training. With these z representations, the decoder outputs a sequence of symbols (\(y_1, \ldots , y_n\)). The decoder consists of components similar to those in the encoder; since BERT models do not rely on the decoder for inference, we omit an exhaustive explanation for reasons of space and refer to [100] for more details. Once the sequence passes through all decoder layers, the output is fed into a linear layer followed by a SoftMax activation, which produces probability distributions over the vocabulary to predict the next token.
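For reference, the scaled dot-product attention computed by each head follows the standard formulation of [100] (a restatement, not a new derivation), where \(d_k\) is the dimensionality of the key vectors:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$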
BERT and its variants, such as RoBERTa [102] or DeBERTa [101], are the most widely used models with the attention mechanism for sequence labeling tasks. These models are composed of a stacked multi-layer bidirectional Transformer encoder, and separate training into two steps. The first is pre-training a large task-independent language model with a masked language modeling objective (i.e. the model has to predict a subset of hidden words). This pre-trained model is, secondly, fine-tuned with data for each task (e.g., named entity recognition). BERT models have an input length limit of 512 tokens at a time (vectors are truncated or padded to that fixed length) and may have 12 or 24 layers.
Fig. 6 General scheme of Transformer-based medical NER, as implemented in BERT [33]
Fig. 7 Token, segment and position embeddings in the BERT and RoBERTa Transformer encoder [33]
Figure 6 depicts the Transformer-based processing of a sentence, as implemented in BERT [33] or RoBERTa [102]. First, a [CLS] symbol is added at the beginning of every input sentence (for classification purposes) and a [SEP] (separator) token is inserted to separate each segment from the next. The input text is split into word units (typically, at punctuation signs and white-space characters) and sub-word units, using the WordPiece tokenizer [103]. Sub-words are generally used if the original words were not seen in the pre-training corpus, as often occurs in the medical domain: e.g., homocisteína ('homocysteine') is split into homo, ##ci, ##ste, ##ína. Then, a Transformer block processes each sentence element and represents it with token, segment and position embeddings (Fig. 7 depicts this process for BERT and RoBERTa models). The segment embedding is a value corresponding to the sentence or phrase where each token occurs, separated by the [SEP] symbol. In Fig. 7, two segments appear: niveles de homocisteína en pacientes en hemodiálisis ('homocysteine levels in hemodialysis patients', segment A) and Ensayo clínico ('Clinical trial', segment B). Every token in the same segment has the same segment embedding vector. The input representation is obtained by summing the token, segment and position embeddings, which yields a hidden representation with contextual information; note that DeBERTa models encode the input with two vectors (for token content and relative position), as explained in the next section. Lastly, a token classification head for named entity recognition, made up of a dense layer and a SoftMax activation, outputs the semantic class and span of the named entities.
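The subword splitting described above can be reproduced, for instance, with the Hugging Face tokenizer of the multilingual BERT model we tested (a minimal sketch; the exact subword pieces depend on each model's vocabulary):

```python
from transformers import AutoTokenizer

# WordPiece tokenization with the multilingual BERT vocabulary; domain terms
# unseen during pre-training are split into subword units prefixed with "##".
tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
print(tokenizer.tokenize("niveles de homocisteína en pacientes en hemodiálisis"))
# e.g. ['ni', '##veles', 'de', 'homo', '##cis', '##te', '##ína', 'en', ...]
```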
Training of models
The annotated corpus was used in a supervised setting to train Transformer-based DL models. We fine-tuned six model variants for: (1) UMLS semantic groups; (2) medication information; (3) temporal entities; (4) negation and speculation; (5) miscellaneous medical entities; and (6) experiencer and temporality attributes.
First, we tested domain-specific RoBERTa-based [102] pre-trained models available for medical Spanish. RoBERTa (Robustly Optimized BERT Pre-training Approach) is a model variant that was pre-trained for longer (125000 additional steps), with a larger batch size (8000 samples per batch), using dynamic masking, and on a larger dataset (160 gigabytes) than the collection used to train BERT. RoBERTa also removes the Next Sentence Prediction task used for pre-training BERT, which makes it a more optimized model. We tested the bsc-bio-ehr-es model, which further pre-trained RoBERTa with medical texts in Spanish from SciELO, Wikipedia, EMEA or PubMed, and also with clinical data from EHRs [49]. This model uses the Byte-Pair Encoding (BPE) tokenizer, which is more efficient for subword tokenization. To analyze the influence of bilingual, domain pre-training data, we also tested EriBERTa [104] (base version), a bilingual RoBERTa-based model. EriBERTa base was pre-trained on an English and Spanish corpus. The English medical pre-training data include texts from EMEA, PubMed and ClinicalTrials.gov; the subsets of the Spanish collection come from SciELO, EMEA, PubMed, SNOMED CT and a Spanish clinical cases corpus. We also compared both models to the RoBERTa XLM Spanish Clinical model (hereafter, CLIN-X-ES) [50], large version. CLIN-X-ES is derived from the XLM RoBERTa multilingual model (originally pre-trained on 2.5 terabytes of the CommonCrawl corpus for 100 languages) by continued pre-training on a corpus of medical texts from SciELO, MedlinePlus, EMEA or PubMed. This model implements the XLM-R tokenizer and computes the input at the subword level (not at the word level); cross-sentence context is also incorporated in the input. Lastly, a Conditional Random Field (CRF) layer computes the output.
Second, we assessed multilingual, general-domain models, to examine further whether the difference in performance across models might be related to domain-specific or multilingual pre-training data. We evaluated the multilingual BERT (from here on, mBERT) base model, which was pre-trained on a corpus derived from Wikipedia in 104 languages. We also tested the Decoding-enhanced BERT with Disentangled Attention (DeBERTa) model, which presents two innovations over the BERT architecture. First, DeBERTa features a disentangled attention mechanism: each word is represented by separate vectors (not only by one summed vector, as in the original BERT) that encode the content and relative position separately (Fig. 5 b). With this approach, the self-attention layer better models the dependencies between nearby tokens. Second, DeBERTa features an enhanced mask decoder for pre-training (Fig. 5 c; Q stands for 'Query'; K, for 'Key'; V, for 'Value'; I, for 'Input'; and H, for 'Hidden state'). The enhanced mask decoder employs the information about the content and the position for masked language modeling. Information about absolute positions is introduced in the decoding layer (before the SoftMax layer) to predict masked tokens during pre-training. By means of this technique, the model gives more importance to the absolute position of words. In our experiments, we specifically used the multilingual DeBERTa (hereafter, mDeBERTa) v3 model [101], base version, which is a multilingual, general-domain model pre-trained on the CC-100 corpus (CommonCrawl for over 100 languages).
Table 1 summarizes the characteristics of the Transformer-based models we tested and the pre-training details. To fine-tune and release the models, we used the Hugging Face Transformers library [32], which facilitates the reuse and adaptation of available models and the replication of experiments. We used the AutoModelForTokenClassification class for NER. All the tested models are shared at the Hugging Face hub.
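A condensed sketch of such a fine-tuning run with the Hugging Face API is shown below; the hub identifier, the abbreviated label set, and the dataset and metric objects (whose preparation is omitted) are assumptions, not our exact configuration:

```python
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          Trainer, TrainingArguments, EarlyStoppingCallback)

labels = ["O", "B-DISO", "I-DISO", "B-PROC", "I-PROC"]  # abbreviated label set
model_name = "PlanTL-GOB-ES/bsc-bio-ehr-es"  # assumed hub id of the RoBERTa EHR base model

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(model_name, num_labels=len(labels))

args = TrainingArguments(
    output_dir="umls-ner",
    evaluation_strategy="epoch",   # evaluate after each epoch
    save_strategy="epoch",
    load_best_model_at_end=True,   # keep the best checkpoint
    metric_for_best_model="f1",    # selection criterion for early stopping
)

# train_ds / dev_ds: tokenized datasets with BIO labels aligned to subwords;
# compute_f1: a seqeval-based metric function (both omitted for brevity).
trainer = Trainer(model=model, args=args,
                  train_dataset=train_ds, eval_dataset=dev_ds,
                  compute_metrics=compute_f1,
                  callbacks=[EarlyStoppingCallback(early_stopping_patience=5)])
trainer.train()
```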
As baselines for comparison, we applied the rules we developed for temporal annotation and medication information extraction. We also tested the annotated corpus on a Bidirectional Long Short-Term Memory (Bi-LSTM) architecture with Conditional Random Fields (CRF) for sequence labeling. We used the implementation in the FLAIR framework, which models the linguistic context through contextual embeddings [34], but does not include any local or global attention mechanism [105, 106]. The FLAIR implementation has one bidirectional LSTM layer with 1024 hidden states per direction (2048 in total).
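As an illustration only (the corpus layout, embedding choices and training options below are assumptions, not our exact configuration), a FLAIR Bi-LSTM-CRF tagger can be set up roughly as follows:

```python
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-formatted files with one token and one tag per line (assumed layout)
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

# Contextual string embeddings for Spanish stacked with classic word embeddings
embeddings = StackedEmbeddings([
    WordEmbeddings("es"),
    FlairEmbeddings("es-forward"),
    FlairEmbeddings("es-backward"),
])

tagger = SequenceTagger(hidden_size=1024, embeddings=embeddings,
                        tag_dictionary=tag_dictionary, tag_type="ner", use_crf=True)

ModelTrainer(tagger, corpus).train("bilstm-crf-ner", max_epochs=100)
```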
To select the best configurations, we used the train (60%) and development (20%) splits from [28] (respectively, 720 and 240 texts). Once the best models were obtained, we tested them on a held-out set (20%, 240 texts). Because preliminary experiments for negation and speculation yielded poor results, we also incorporated the NUBEs corpus [61] into our data: we used its train/dev/test splits and added them to the corresponding splits of our corpus. For the negation/speculation model, we manually converted the format and labels of the NUBEs corpus to fit our criteria for labeling negation and speculation (e.g., label names were changed).
Table 2 describes the data splits used to train, validate and test the models (in the internal evaluation and in the human evaluation, HE); we include the count of tokens and texts in the training data enriched with the NUBEs corpus (+518068 tokens). We also count the subsets of augmented data for applying the RoBERTa EHR and the CLIN-X-ES models to clinical cases (§Extending the models to clinical cases). These augmented data contain 100 clinical cases (CC, +29808 tokens) and summaries of product characteristics (SPCs, +51057 tokens).
We used a Zotac GeForce RTX 3090 GPU with 24 GB of RAM for the experiments. The models were trained with early stopping and a patience of 5 (i.e. training stopped if the F1 score did not improve after 5 epochs). Table 3 shows the hyperparameters.
Annotation pipeline
Given an unstructured medical text (e.g., a clinical trial announcement or a clinical case), stage 1 involves pre-processing (e.g., changing characters that cause problems for the Transformer-based models), sentence splitting, tokenization and part-of-speech tagging by means of spaCy or Stanza (we used spaCy in this work). Then, the UMLS medical entities are detected with the dictionary (MedLexSp) [26] or with a BERT-based model. A list of exceptions can be applied to the output; for example, a specific semantic group can be excluded if it is unnecessary for a task, e.g. ACTI (activity entities), which are recognized only with the dictionary. Stage 2 involves named entity recognition of temporal entities, medication information, miscellaneous clinical entities, and attributes expressing negation/speculation or temporality/experiencer. This stage can be customized according to the user's needs (the output of stage 1 can be passed to stage 3 straightaway) and is processed sequentially with dedicated models. Lastly, stage 3 converts the output to JSON or BRAT format. Figure C.17 (Appendix) outlines the steps.
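The three stages can be pictured with the simplified sketch below; the model identifier, the example sentence and the output fields are illustrative assumptions rather than the tool's exact implementation:

```python
import json
import spacy
from transformers import pipeline

# Stage 1: pre-processing, sentence splitting and tokenization with spaCy,
# then UMLS entity recognition with a fine-tuned Transformer model.
nlp = spacy.load("es_core_news_sm")  # Spanish pipeline (installed separately)
ner = pipeline("token-classification",
               model="medspaner/roberta-es-clinical-trials-umls-7sgs-ner",  # assumed hub id
               aggregation_strategy="simple")

text = "Pacientes adultos con diabetes tipo 2 tratados con metformina."
doc = nlp(text)

annotations = []
for sent in doc.sents:
    # Stage 2 would chain further models (temporal, medication, negation...).
    for ent in ner(sent.text):
        annotations.append({"text": ent["word"], "label": ent["entity_group"],
                            "start": sent.start_char + ent["start"],
                            "end": sent.start_char + ent["end"]})

# Stage 3: serialize to JSON (or convert to BRAT .ann format).
print(json.dumps(annotations, ensure_ascii=False, indent=2))
```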
Extending the models to clinical cases
To test the generalizability of the trained models to other medical text types, we used 200 clinical cases that were already anonymized and prepared by medical residents, or published under a Creative Commons license (see Notes). We used 100 texts (comprising 29808 tokens) to further fine-tune the RoBERTa EHR and CLIN-X-ES models and adapt them to the new subdomain, and 100 texts (27332 tokens) for a human evaluation by medical professionals.
For the medication information model applied to clinical cases, we also included in the fine-tuning data 265 text samples (51057 tokens) of summaries of product characteristics (SPCs) from the Spanish Drug Information Center (CIMA). We selected SPCs from a subset of drugs included in the WHO Model List of Essential Medicines [107]. We used excerpts from sections corresponding to the Medication name, Dosage form, Administration mode and Contraindications. We also applied a data-augmentation technique for synonym replacement [108] using UMLS Concept Unique Identifiers available in MedLexSp [26]. For example, full forms (intravenoso, ‘intravenous’) were replaced with abbreviations/acronyms (IV). Some contexts required revision to fix gender agreement and replacement errors. In summary, we trained the models for medication information entities with the CT-EBM-SP corpus, 100 clinical cases and 265 text samples from SPCs (Table 2).
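A minimal sketch of this synonym-replacement augmentation is shown below, with a toy mapping and placeholder identifiers; in practice, the variants come from the UMLS CUIs recorded in MedLexSp:

```python
import random
import re

# Placeholder identifiers; MedLexSp stores real UMLS CUIs with their variants.
variants_by_cui = {
    "CUI_IV_ROUTE": ["intravenoso", "intravenosa", "IV", "i.v."],
}

def augment(sentence: str, cui: str, surface: str) -> str:
    """Replace one annotated term with a random synonym/abbreviation variant."""
    alternatives = [v for v in variants_by_cui[cui] if v.lower() != surface.lower()]
    replacement = random.choice(alternatives)
    return re.sub(re.escape(surface), replacement, sentence, count=1)

print(augment("Administrar por vía intravenosa cada 8 horas.", "CUI_IV_ROUTE", "intravenosa"))
# e.g. "Administrar por vía IV cada 8 horas."
```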
After inspecting the output of the models, we found that the RoBERTa EHR models were the most suitable for integration into the annotation pipeline of the medical entity recognizer (see §Results); accordingly, medical professionals evaluated only these models. We did not fine-tune the EriBERTa, mBERT or mDeBERTa models with the 100 clinical cases nor with the SPCs for the current version of the tool.
Evaluation
Internal validation
We computed precision (P, or positive predictive value), recall (R, or sensitivity) and F1 with the seqeval library [109]:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}$$

(TP: true positives; FP: false positives; FN: false negatives). We report micro-average scores because of the class imbalance: e.g., anatomic entities (ANAT) or devices (DEVI) are scarce compared to chemical entities (CHEM). Results are reported on exact match at named-entity level (i.e. a true positive is counted only if the model's prediction and the reference match in both span and class). We provide the average and standard deviation of five experimental rounds with different initialization seeds. Carbon emissions of the models were estimated with an online calculator [110].
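For illustration, seqeval computes these entity-level metrics over BIO-tagged sequences, as in the following minimal example with made-up tags:

```python
from seqeval.metrics import precision_score, recall_score, f1_score

# Gold and predicted BIO tag sequences for two toy sentences.
y_true = [["B-DISO", "I-DISO", "O", "B-PROC"], ["O", "B-CHEM", "O"]]
y_pred = [["B-DISO", "I-DISO", "O", "O"],      ["O", "B-CHEM", "O"]]

# Entity-level exact match: span and class must both be correct.
print(precision_score(y_true, y_pred), recall_score(y_true, y_pred), f1_score(y_true, y_pred))
# 1.0 0.666... 0.8  (2 TP, 0 FP, 1 FN)
```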
External validation (human evaluation)
As a use case in a real setting, eight medical professionals revised the annotations obtained with the tool (using the RoBERTa EHR models) on 200 new texts not employed for system development. First, 100 texts (29851 tokens) announcing trials published in 2022-23 were downloaded from EudraCT after the system had been developed. Second, to test the generalizability to other medical sub-genres, the evaluators checked the annotation of 100 clinical cases with a Creative Commons license (27332 tokens). Three medical doctors, four medical interns and one nurse participated; their average age was 33.0 years and their average length of medical practice was 6.88 years. Evaluators were instructed about the entity types and scope. The evaluation texts are available at the companion repository. Participation was remunerated afterwards.
Results
Results of the internal validation
Table 4 shows the results of the compared methods. Rules for medication information extraction and temporal entity recognition performed below the neural-network-based models, although the recall for the medication information entities (R = 0.786) was higher than that of the Bi-LSTM-CRF model (R = 0.740). For their part, Transformer-based models showed better performance than the Bi-LSTM-CRF model across all subtasks. The RoBERTa EHR and EriBERTa models performed similarly well, with the latter tending to achieve slightly higher F1 scores. When comparing the performance of multilingual general-domain models, mBERT tended to yield lower scores than the RoBERTa-based models, but the mDeBERTa v3 model outperformed the other models for all tasks. However, the CLIN-X-ES models exhibited comparable performance and outperformed the rest of the RoBERTa-based models, except for the recognition of temporal entities. In Appendix B, Figure B.8 plots the F1 measure per model (averaged over 5 experimental runs); and Table B.8 presents the results per entity type of the RoBERTa EHR models, which were used in the human evaluation.
Regarding the training time on the clinical trial texts, the RoBERTa EHR model needed fewer training epochs; the estimated carbon footprint of each experimental round for the RoBERTa EHR models was 63 g (eq. CO\(_{2}\)) per model (this value and the following are averages of 5 rounds). The EriBERTa models required a similar number of training epochs and, given their shorter training times, they had an estimated carbon footprint of 33 g (eq. CO\(_{2}\)) per model. The CLIN-X-ES models also needed a similar number of training epochs, but each one had longer training times; their estimated carbon footprint was 209 g (eq. CO\(_{2}\)) per model. The mBERT and mDeBERTa models showed a similar trend, with fewer training epochs (although each epoch in mDeBERTa was longer). The estimated carbon footprint of each mBERT model was 62 g (eq. CO\(_{2}\)), and that of each mDeBERTa model was 115 g (eq. CO\(_{2}\)). Lastly, the Bi-LSTM-CRF models needed more training epochs and longer training times; their estimated carbon footprint was 263 g (eq. CO\(_{2}\)) per model. Table A.6 in Appendix A includes all the details.
With regard to the annotation times, we compared both RoBERTa-based models using the 100 new EudraCT texts and the 100 clinical cases for the human evaluation. Processing speed was measured on a laptop with an Intel Core i7 CPU (2.8 GHz), 16 GB of RAM and Mac OS. Table A.7 in Appendix A shows that the RoBERTa EHR and EriBERTa models were faster than mDeBERTa and CLIN-X-ES; we did not compare them with the Bi-LSTM or the mBERT models, given the worse performance of the latter.
We inspected the errors made by these models on a subset of the test split (§Discussion). The CLIN-X-ES models made critical errors in medication and temporal information, and the mDeBERTa models did not label several medical abbreviations and acronyms. Moreover, these models had longer annotation times and a higher environmental impact of fine-tuning. We thus selected the RoBERTa EHR models to be evaluated by medical professionals (note that the EriBERTa model was publicly released after the human evaluation had been conducted). In the current version of the medical entity recognition tool, we integrated the RoBERTa EHR models; however, we release the rest of the fine-tuned models in the Hugging Face hub.
Results of the human evaluation
Table 5 shows the comparison between the evaluators' revisions and the system output on 100 new texts from EudraCT and 100 clinical cases. The average F1 score (strict match) of the tool was 0.858 (± 0.032) on the 100 trial announcements, and 0.910 (± 0.019) on the 100 clinical cases. Interestingly, performance on the 100 cases was higher than on the 100 new EudraCT texts. The fact that the RoBERTa EHR model was trained on clinical data may account for this outcome.
System predictions regarding the seven UMLS entity types and the temporal annotations were in line with the results obtained in our held-out test set. In the 100 new EudraCT trials, results for medication information and negation or speculation were moderately below our scores on the test set, although their standard deviations were higher: the system performed very well in some texts, but not in others. A similar trend is observed in the models for miscellaneous medical entities and for experiencer and temporality attributes. The models for medication information, negation and speculation, miscellaneous medical entities, and experiencer and temporality attributes seem to vary widely in a real-world setting.
Upon analyzing the discrepancies between the system’s predictions and the human evaluations, we observed that the tool failed to annotate certain Route entities that were missing in the training data (e.g., systemic, which caused inconsistencies across annotators). Negation and speculation caused several misinterpretations regarding the scope of Negated or Speculated. In particular, contexts involving laboratory tests, or where the scope is unclear, were often ambiguous for human evaluators. Several mismatches affected the PHYS and Observation categories, especially in entities expressing normal findings (e.g., normoperfundido, ‘normal perfusion’), and generally in the 100 clinical cases. The PROC semantic group also caused ambiguities in diagnostic tools or scales (e.g., ECOG) or observable entities that can be interpreted as an observation (e.g., left ventricular ejection fraction).
The medical professionals also gave us some feedback. Some aspects revolved around the scope of the annotation, which needed discontinuous entity marking (e.g., pregnant or lactating patients). However, discontinuous annotations are currently not supported in this version of the tool.
Discussion
Overall, Transformer-based models achieved the highest scores compared to our manually-crafted rules or the Bi-LSTM-CRF framework. In the comparison between the RoBERTa-based models, we observed that a bilingual, domain-specific model (EriBERTa) slightly surpassed the monolingual medical model RoBERTa EHR (except for the recognition of medication information). This suggests that cross-lingual transfer learning has a positive impact. Indeed, CLIN-X-ES outperformed both models, and a plausible explanation might be that it was pre-trained on the multilingual XLM RoBERTa model and further pre-trained on Spanish medical corpora. Interestingly, we identified contradictory findings when comparing multilingual general models to monolingual domain models. Although mBERT tended to show a lower performance, the mDeBERTa model generally outperformed RoBERTa EHR and CLIN-X-ES, which were both trained with medical data. One explanation could be the specific architecture of the mDeBERTa model, namely the disentangled attention mechanism and the use of absolute position information, which enables the model to learn better dependencies between words. To confirm whether this performance is due to the multilingual pre-training data or to the model architecture, a medical mDeBERTa model should also be tested on our data; however, such a model is currently unavailable. Nonetheless, the mDeBERTa and CLIN-X-ES models did not extract key medical entities, abbreviations and acronyms for our task; therefore, we used the RoBERTa EHR models in the external evaluation by medical professionals.
We analyzed the errors in the output of the RoBERTa EHR model for the human evaluation. Figure B.9 in Appendix B includes a detailed error analysis based on a standard taxonomy [111]; Figures B.10-B.15 include examples; and Figure B.16 shows the confusion matrices of predicted and gold standard labels per category. Many errors involved the scope of DISO, PROC, Duration, Observation, Negated or Speculated entities. These occurred when the model labeled modifiers as part of the entity (e.g., small children), or vice versa. False negatives occurred in classes with scarce instances in the corpus (ANAT, DEVI, PHYS, Contraindicated, Form, Food or Future, and lexical negation/speculation cues). Proper names and acronyms caused both false positives (e.g., dialyzer AN69ST®) and false negatives (GE, 'experimental group'). The semantic class was often confused between CHEM and PROC, Date and Time, DISO and Observation, or Negated and Speculated.
This analysis revealed that a contributing factor to the observed errors is the insufficient number of annotated samples for specific categories within the training dataset. Another source of errors occurred under the zero-shot setting, when unseen instances in the training data (e.g., acronyms or brand names) were unannotated or misclassified. These would be the major weaknesses of the current version of the tool, especially if false negative errors cause valuable information loss. Errors affecting the scope of entities seem to arise owing to ambiguous contexts, in particular of negated and speculated phrases. Nonetheless, human evaluators did not find severe errors in this respect; and in the case of multi-word entities, missing tokens might be considered a minor error if the entity head is labeled (e.g. gastric bleeding vs. upper gastric bleeding). The error analysis also provided us with insights into future improvements. To alleviate the false negative and zero-shot issues, we would need to conduct more annotations—specifically for the infrequent classes—and train the models with enriched annotated data. To improve the detection of negation and speculation, we could try methods based on syntax [112] or data augmentation [113].
We next present a comparative analysis of our results with those reported in other studies for similar tasks. We refer to recent works for a comparison of systems on EHRs [114] and an analysis of encoder-based models on clinical datasets [115]. Table D.9 in Appendix D includes specific figures reported by the authors, with an exact-match criterion, except for the work by [67], which reported relaxed-match results. Replicating other teams' results was out of the scope of this work.
In general, direct comparison of our results is not feasible due to the substantial variability among annotation schemes. We used a subset of UMLS semantic groups; however, the UMLS subset of cTAKES differs from our scheme (e.g., we did not separate signs/symptoms and disorders into different categories). For English clinical trial data, the UMLS was not used [116,117,118]; and Criteria2Query [5] was evaluated on the OMOP scheme [119].
Regarding temporal entities, our results with a Transformer model are slightly higher than those reported by other teams, although we used the TimeML scheme plus an additional Age class. Other groups restricted their schemes to the four categories of TimeML and achieved competitive results using rule-based methods (generally based on HeidelTime [120, 121]), machine learning (as implemented in the cTAKES temporal expression extractor [55]) and neural models [82, 83].
Our annotation scheme also differs from other studies regarding drug-related entities. We merged Dose and Strength, and did not include classes such as Condition or Reason, considered in other works [66, 92]. Results cannot be compared to those by other teams who applied rule-based methods [122], deep learning-based algorithms [66] or ensemble methods [67]. Nonetheless, to the best of our knowledge, no similar work has been done on medication data in Spanish texts using deep learning.
Regarding negation and speculation, we annotated this type of information exclusively for concepts or events, which is similar to the cTAKES approach [4]. Although we did not annotate the scope of negation or speculation, most state-of-the-art works achieved similar outcomes. Recognizing cues generally achieves higher scores compared to scopes, and this is consistent across medical reports [61, 64], medical literature [62] or clinical trials [5]. This trend is also observed with neural-network-based approaches in Spanish, French or Brazilian Portuguese [63]. We also obtained higher scores recognizing negation rather than speculation.
As for event temporality attributes (History_of, Family_History_of and Future), comparison with other studies is challenging, as these classes are frequently merged into more general categories. Some examples are status attributes in [4] and historical attributes in ConText [60]. Few projects considered experiencer attributes (Patient, Other and Family_member), but consistent with our findings, these generally demonstrated high scores.
Miscellaneous medical entities show a broader variability in results. Observations or Findings did not achieve the highest scores, which is consistent with the results reported by other teams [77, 95, 123]. This class can be confused with disorders or results, and models tend to perform poorly. The same trend occurs in quantifier/qualifier entities, which may include a wide range of adjectives or expressions that other schemes represent with more specific classes: e.g. Multiplier [117] or Modifier [123]. In contrast, entities expressing results or values have high scores.
Limitations
A limitation is that we only used texts about clinical trials to develop the tool. We trained new models with 100 anonymized clinical cases and tested them on 100 different cases, but we need to confirm the performance on other text genres, especially real EHRs. Moreover, our annotation scheme might be coarse for some tasks. The DISO category merges signs, symptoms and disorders; and Dose includes both dosage and concentration or strength. We largely depend on the UMLS categories, which might be inadequate for other tasks, and the tool is not compatible with other standards, e.g., OHDSI OMOP [119]. However, the tool can be used to pre-annotate texts using labels that can be adapted to specific tasks. Further improvement is required in the recognition of drug-related information; achieving this will necessitate additional annotated data. Discontinuous entities (i.e. entities separated by non-annotated tokens) were discarded to train the models: e.g., vacuna anual contra la gripe ('influenza virus [..] vaccine'). Furthermore, new architectures and language models need to be tested. Lastly, although the tool supports concept normalization to UMLS CUIs or SNOMED CT codes, it does not provide sense disambiguation. In the future, we plan to evaluate the normalization performance and will address relation extraction tasks by annotating the corpus with relations.
Usage
First, download the models from the Hugging Face hub. Then, import the AutoModelForTokenClassification class and load the NER model; for example, to annotate texts with the RoBERTa bsc-bio-ehr-es model for UMLS semantic groups, use: model = "roberta-es-clinical-trials-umls-7sgs-ner" (Figure C.18). Along with the graphical user interface (Fig. 1), the code can be executed via a UNIX-based terminal (Figure C.19), either for a single document or for a batch of files included in a folder. A configuration file (Figure C.20) facilitates the specification of annotation options, e.g., use of the lexicon, temporal entities, negation or speculation, or the output format of the annotations (JSON or BRAT ann). The code is available at: https://github.com/lcampillos/medspaner.
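For instance, a minimal way to load and apply one of the released NER models from a Python script (assuming the model identifier is hosted under the medspaner organization linked in the Data Availability section) is:

```python
from transformers import AutoModelForTokenClassification, AutoTokenizer, pipeline

model_id = "medspaner/roberta-es-clinical-trials-umls-7sgs-ner"  # UMLS semantic groups model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForTokenClassification.from_pretrained(model_id)

# Aggregate subword predictions into whole-entity spans
ner = pipeline("token-classification", model=model, tokenizer=tokenizer,
               aggregation_strategy="simple")
print(ner("Ensayo clínico de metformina en pacientes con diabetes tipo 2."))
```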
Conclusions
We have introduced a comprehensive NLP tool designed to automate the processing of clinical trials in Spanish and demonstrated its efficacy in extracting medical information from clinical cases. The Medical Semantic Python-Assisted Named Entity Recognizer (MEDSPANER) is open source, supports both lexicon- and Transformer-based annotation of medical entities, and also normalizes entities to UMLS Concept Unique Identifiers or SNOMED CT codes. The tool integrates Transformer-based NER models that can be adapted to other sub-genres (e.g., clinical reports or medical journal articles). In addition, the models for negation or speculation and temporal entity recognition can be re-adapted for non-medical domains. We conducted an internal validation, with F1 scores up to 0.915 (strict match). We also performed an external validation in which eight medical professionals evaluated the system annotations on 100 new clinical trials texts (average F1 = 0.858) and 100 anonymized clinical cases (with an average F1 = 0.910). To the best of our knowledge, this is one of the few tools for comprehensive processing of Spanish medical texts, including the processing of medication information, experiencer and event temporality. The tool can be adapted to other languages for which similar text data and resources exist. We make the tool available online and distribute the code. Because new language models will be released, which are expected to surpass the results presented herein, we created a space at the Hugging Face Hub to share these and future models. We also release in a companion repository the annotated corpus for training the models.
Data Availability Statement
All the resources supporting this article are available at the companion repository: https://github.com/lcampillos/medspaner. A demonstration system of the web interface is available at: https://claramed.csic.es/medspaner. The annotated corpus to train the models can be downloaded at: https://zenodo.org/records/13880599. The fine-tuned models are available at: https://huggingface.co/medspaner.
Notes
The Andalusian Society of Family and Community Medicine (SAMFyC) authorized us to use the cases they published. We also used clinical cases intended for educational purposes from https://hipocampo.org/ and https://urgenciasbidasoa.wordpress.com/casos-clinicos-3/ [Accessed: 2024-02-23].
Abbreviations
- BERT: Bidirectional Encoder Representations from Transformers
- Bi-LSTM: Bidirectional Long Short-Term Memory
- BIO: Begin, Inside, Out
- CRF: Conditional Random Fields
- CT: Clinical trials
- CUI: Concept Unique Identifier
- EBM: Evidence-Based Medicine
- EHR: Electronic health record
- EudraCT: European Clinical Trials Register
- FN: False negative
- FP: False positive
- GPU: Graphical processing unit
- H: Hidden state
- I: Input
- IAA: Inter-annotator agreement
- K: Key
- LLM: Large language model
- M: Mean
- NER: Named entity recognition
- NLP: Natural language processing
- OOV: Out-of-vocabulary
- P: Precision
- Q: Query
- R: Recall
- REEC: Repositorio Español de Estudios Clínicos
- SciELO: Scientific Electronic Library Online
- SD: Standard deviation
- SG: Semantic group
- SNOMED CT: Systematized Nomenclature of Medicine - Clinical Terms
- TP: True positive
- V: Value
References
MIT Critical Data. Secondary Analysis of Electronic Health Records. Springer Nature; 2016.
Chen X, Xie H, Cheng G, Poon LK, Leng M, Wang FL. Trends and features of the applications of natural language processing techniques for clinical trials text analysis. Appl Sci. 2020;10(6):2157.
Xu H, Demner-Fushman D. Natural Language Processing in Biomedicine. Cham: Springer; 2024.
Savova GK, Masanz JJ, Ogren PV, Zheng J, Sohn S, Kipper-Schuler KC, et al. Mayo clinical Text Analysis and Knowledge Extraction System (cTAKES): architecture, component evaluation and applications. JAMIA. 2010;17(5):507–13.
Yuan C, Ryan PB, Ta C, Guo Y, Li Z, Hardin J, et al. Criteria2Query: a natural language interface to clinical databases for cohort definition. JAMIA. 2019;26(4):294–305.
Rastegar-Mojarad M, Ye Z, Kolesar JM, Hebbring SJ, Lin SM. Opportunities for drug repositioning from phenome-wide association studies. Nat Biotechnol. 2015;33(4):342–5.
Álvarez-Pérez A, Prieto-Santamaría L, Ugarte-Carro E, Otero-Carrasco B, Ayuso-Muñoz A, Rodríguez-González A. Exploring disease-drug pairs in Clinical Trials information for personalized drug repurposing. Proc of 26th IEEE Computer-based Med Syst Conf. 2023;p. 2023–05.
Hossain E, Rana R, Higgins N, Soar J, Barua PD, Pisani AR, et al. Natural language processing in electronic health records in relation to healthcare decision-making: a systematic review. Comput Biol Med. 2023;155: 106649.
Robinson PN, Mungall CJ, Haendel M. Capturing phenotypes for precision medicine. Mol Case Stud. 2015;1(1): a000372.
Gupta NS, Kumar P. Perspective of artificial intelligence in healthcare data management: A journey towards precision medicine. Comput Biol Med. 2023;162: 107051.
Finlayson SG, LePendu P, Shah NH. Building the graph of medicine from millions of clinical narratives. Sci Data. 2014;1(1):1–9.
Zhou M, Zheng C, Xu R. Combining phenome-driven drug-target interaction prediction with patients’ electronic health records-based clinical corroboration toward drug discovery. Bioinform. 2020;36:i436–44.
Noori A, Magdamo C, Liu X, Tyagi T, Li Z, Kondepudi A, et al. Development and evaluation of a natural language processing annotation tool to facilitate phenotyping of cognitive status in electronic health records: diagnostic study. JMIR Med Inform. 2022;24(8): e40384.
Ashburner JM, Chang Y, Wang X, Khurshid S, Anderson CD, Dahal K, et al. Natural language processing to improve prediction of incident atrial fibrillation using electronic health records. J Am Heart Assoc. 2022;11(15): e026014.
Zheng C, Lee MS, Bansal N, Go AS, Chen C, Harrison TN, et al. Identification of recurrent atrial fibrillation using natural language processing applied to electronic health records. Eur Heart J. 2023;10(1):77–88.
Irving J, Patel R, Oliver D, Colling C, Pritchard M, Broadbent M, et al. Using natural language processing on electronic health records to enhance detection and prediction of psychosis risk. Schizophr Bull. 2021;47(2):405–14.
Fu S, Lopes GS, Pagali SR, Thorsteinsdottir B, LeBrasseur NK, Wen A, et al. Ascertainment of delirium status using natural language processing from electronic health records. J Gerontol. 2022;77(3):524–30.
Luo Y, Thompson WK, Herr TM, Zeng Z, Berendsen MA, Jonnalagadda SR, et al. Natural language processing for EHR-based pharmacovigilance: a structured review. Drug Saf. 2017;40:1075–89.
Ni Y, Wright J, Perentesis J, Lingren T, Deleger L, Kaiser M, et al. Increasing the efficiency of trial-patient matching: automated clinical trial eligibility pre-screening for pediatric oncology patients. BMC Med Inform Decis Mak. 2015;15:1–10.
Wu S, Miller T, Masanz J, Coarr M, Halgrim S, Carrell D, et al. Negation's not solved: generalizability versus optimizability in clinical natural language processing. PLoS ONE. 2014;9(11): e112774.
Xie F, Yuan H, Ning Y, Ong MEH, Feng M, Hsu W, et al. Deep learning for temporal data representation in electronic health records: A systematic review of challenges and methodologies. J Biomed Inform. 2022;126: 103980.
Soysal E, Wang J, Jiang M, Wu Y, Pakhomov S, Liu H, et al. CLAMP-A toolkit for efficiently building customized clinical natural language processing pipelines. JAMIA. 2018;25(3):331–6.
Espinosa L, Tello J, Pardo A, Hernández-Medrano I, Ureña A, Salcedo I, et al. SAVANA: a global information extraction and terminology expansion framework in the medical domain. Proces Leng Nat. 2016;57:23–30.
Bodenreider O. The unified medical language system (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32:D267–70.
Donnelly K. SNOMED-CT: the advanced terminology and coding system for eHealth. Stud Health Technol Inform. 2006;121:279.
Campillos-Llanos L. MedLexSp - A medical lexicon for Spanish medical natural language processing. J Biomed Semant. 2023;14(1):2.
Stenetorp P, Pyysalo S, Topić G, Ohta T, Ananiadou S, Tsujii J. BRAT: a web-based tool for NLP-assisted text annotation. In: Proc. of 13th EACL Conf. - Demonstrations; 2012. p. 102–107.
Campillos-Llanos L, Valverde-Mateos A, Capllonch-Carrión A, Moreno-Sandoval A. A clinical trials corpus annotated with UMLS entities to enhance the access to evidence-based medicine. BMC Med Inform Decis Mak. 2021;21(1):1–19.
Qi P, Zhang Y, Zhang Y, Bolton J, Manning CD. Stanza: A Python natural language processing toolkit for many human languages. Proc of 58th ACL. 2020;p. 101–108.
Honnibal M, Montani I. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing; 2017. To appear. Available from: https://spacy.io/.
Gardner M, Grus J, Neumann M, Tafjord O, Dasigi P, Liu NF, et al. AllenNLP: A Deep Semantic Natural Language Processing Platform. In: Proc. of Workshop for NLP Open Source Software. Melbourne, Australia; 2018. p. 1–6.
Wolf T, Debut L, Sanh V, Chaumond J, Delangue C, Moi A, et al. Transformers: State-of-the-art natural language processing. In: Proc. of 2020 EMNLP; 2020. p. 38–45.
Devlin J, Chang M, Lee K, Toutanova K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In: Proc. of 2019 NAACL Conf., vol. 1. Minneapolis, MN, USA, 2-7 June; 2019. p. 4171–86.
Akbik A, Blythe D, Vollgraf R. Contextual string embeddings for sequence labeling. Proc of 27th Int Conf Comp Ling Santa Fe, NM, USA, 20-26 Aug. 2018;p. 1638–49.
Soldaini L, Goharian N. Quickumls: a fast, unsupervised approach for medical concept extraction. In: MedIR workshop, SIGIR; 2016. p. 1–4.
Eyre H, Chapman AB, Peterson KS, Shi J, Alba PR, Jones MM, et al. Launching into clinical space with medspaCy: a new clinical text processing toolkit in Python. In: Proc. AMIA Symp.. vol. 2021. American Medical Informatics Association; 2021. p. 438.
Kraljevic Z, Searle T, Shek A, Roguski L, Noor K, Bean D, et al. Multi-domain clinical natural language processing with MedCAT: the medical concept annotation toolkit. Artif Intell Med. 2021;117: 102083.
Zhang Y, Zhang Y, Qi P, Manning CD, Langlotz CP. Biomedical and clinical English model packages for the Stanza Python NLP library. JAMIA. 2021;28(9):1892–9.
Li I, You K, Tang X, Qiao Y, Huang L, Hsieh CC, et al. EHRKit: A Python Natural Language Processing Toolkit for Electronic Health Record Texts. arXiv preprint arXiv:2204.06604. 2022;.
Krishnamoorthy S, Jiang Y, Buchanan W, Singh A, Ortega J. CLPT: A Universal annotation scheme and toolkit for clinical language processing. In: Proc. 4th Clinical NLP Workshop; 2022. p. 1–9.
Neuraz A, Vaillant G, Arias C, Birot O, Huynh KT, Fabacher T, et al. Facilitating phenotyping from clinical texts: the medkit library. arXiv preprint arXiv:2409.00164. 2024;.
Friedman C. A broad-coverage natural language processing system. In: Proc. AMIA Symp. American Medical Informatics Association; 2000. p. 270.
Aronson AR, Lang FM. An overview of MetaMap: historical perspective and recent advances. JAMIA. 2010;17(3):229–36.
Demner-Fushman D, Rogers WJ, Aronson AR. MetaMap Lite: an evaluation of a new Java implementation of MetaMap. JAMIA. 2017;24(4):841–4.
Cook HV, Pafilis E, Jensen LJ. A dictionary-and rule-based system for identification of bacteria and habitats in text. In: Proc. of 15th BioNLP; 2016. p. 50–55.
Cook HV, Bērziņš R, Leal C, Cejuela JM, Jensen LJ. Creation and evaluation of a dictionary-based tagger for virus species and proteins. In: Proc. of. 16th BioNLP; 2017. p. 91–98.
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinform. 2020;36(4):1234–40.
Rojas M, Dunstan J, Villena F. Clinical Flair: A Pre-Trained Language Model for Spanish Clinical Natural Language Processing. In: Proc. of 4th Clinical NLP Workshop; 2022. p. 87–92.
Carriño CP, Llop J, Pàmies M, Gutiérrez A, Armengol J, Silveira J, et al. Pretrained Biomedical Language Models for Clinical NLP in Spanish. In: Proc. of 21st BioNLP; 2022. p. 193–199.
Lange L, Adel H, Strötgen J, Klakow D. CLIN-X: pre-trained language models and a study on cross-task transfer for concept extraction in the clinical domain. Bioinform. 2022;38(12):3267–74.
Chen Y, Zhou C, Li T, Wu H, Zhao X, Ye K, et al. Named entity recognition from Chinese adverse drug event reports with lexical feature based BiLSTM-CRF and tri-training. J Biomed Inform. 2019;96: 103252.
Lerner I, Paris N, Tannier X. Terminologies augmented recurrent neural network model for clinical named entity recognition. J Biomed Inform. 2020;102: 103356.
Madkour M, Benhaddou D, Tao C. Temporal data representation, normalization, extraction, and reasoning: a review from clinical domain. Comput Methods Prog Biomed. 2016;128:52–68.
Strötgen J, Gertz M. HeidelTime: High quality rule-based extraction and normalization of temporal expressions. In: Proc. of 5th Int. Workshop Semant. Eval.; 2010. p. 321–324.
Miller T, Bethard S, Dligach D, Lin C, Savova G. Extracting time expressions from clinical text. In: Proc. of 15th BioNLP; 2015. p. 81–91.
Najafabadipour M, Zanin M, Rodríguez A, Torrente M, Nuñez B, Bermudez JL, et al. Reconstructing the patient's natural history from electronic health records. Artif Intell Med. 2020;105: 101860.
Luo Z, Johnson SB, Lai AM, Weng C. Extracting temporal constraints from clinical research eligibility criteria using conditional random fields. In: Proc. AMIA Symp.. vol. 2011. American Medical Informatics Association; 2011. p. 843.
Boland MR, Tu SW, Carini S, Sim I, Weng C. EliXR-TIME: a temporal knowledge representation for clinical research eligibility criteria. Proc AMIA summits translat sci. 2012;2012:71.
Chapman W, Bridewell W, Hanbury P, Cooper GF, Buchanan BG. A simple algorithm for identifying negated findings and diseases in discharge summaries. J Biomed Inform. 2001;34(5):301–10.
Harkema H, Dowling JN, Thornblade T, Chapman WW. ConText: an algorithm for determining negation, experiencer, and temporal status from clinical reports. J Biomedical Inform. 2009;42(5):839–51.
Lima S, Perez N, Cuadros M, Rigau G. NUBes: A Corpus of Negation and Uncertainty in Spanish Clinical Texts. In: Proc. 12th LREC. Marseille, France; 2020. p. 5772–5781.
Rivera R, Martinez P. The impact of pretrained language models on negation and speculation detection in cross-lingual medical text: comparative study. JMIR Med Inform. 2020;8(12): e18953.
Dalloux C, Claveau V, Grabar N, Silva LE, Cabral C, Bonescki Y, et al. Supervised learning for the detection of negation and of its scope in French and Brazilian Portuguese biomedical corpora. Nat Lang Eng. 2021;27(2):181–201.
Solarte-Pabón O, Montenegro O, Torrente M, González AR, Provencio M, Menasalvas E. Negation and uncertainty detection in clinical texts written in Spanish: a deep learning-based approach. PeerJ Comp Sci. 2022;8: e913.
Fabregat H, Duque A, Martinez-Romo J, Araujo L. Negation-based transfer learning for improving biomedical named entity recognition and relation extraction. J Biomed Inform. 2023;138: 104279.
Jouffroy J, Feldman SF, Lerner I, Rance B, Burgun A, Neuraz A, et al. Hybrid deep learning for medication-related information extraction from clinical texts in French: MedExt algorithm development study. JMIR Med Inform. 2021;9(3): e17934.
Alfattni G, Belousov M, Peek N, Nenadic G, et al. Extracting drug names and associated attributes from discharge summaries: Text mining study. JMIR Med Inform. 2021;9(5): e24678.
Park J, Fang Y, Ta C, Zhang G, Idnay B, Chen F, et al. Criteria2Query 3.0: leveraging generative large language models for clinical trial eligibility query generation. Available at SSRN 4637800.
Peikos G, Symeonidis S, Kasela P, Pasi G. Utilizing ChatGPT to Enhance Clinical Trial Enrollment. arXiv preprint arXiv:2306.02077. 2023.
Hu Y, Ameer I, Zuo X, Peng X, Zhou Y, Li Z, et al. Zero-shot clinical entity recognition using ChatGPT. arXiv preprint arXiv:2303.16416. 2023.
Hamer DMd, Schoor P, Polak TB, Kapitan D. Improving Patient Pre-screening for Clinical Trials: Assisting Physicians with Large Language Models. arXiv preprint arXiv:2304.07396. 2023.
Segura-Bedmar I, Martínez P, Revert R, Moreno-Schneider J. Exploring Spanish health social media for detecting drug effects. In: BMC Med. Inform. Decis. Mak.. vol. 15. BioMed Central; 2015. p. 1–9.
Gonzalez-Agirre A, Marimon M, Intxaurrondo A, Rabal O, Villegas M, Krallinger M. PharmaCoNER: Pharmacological Substances, Compounds and Proteins Named Entity Recognition Track. In: Proc. of 5th Workshop on BioNLP Open Shared Tasks; 2019. p. 1–10.
Oronoz M, Gojenola K, Pérez A, de Ilarraza AD, Casillas A. On the creation of a clinical gold standard corpus in Spanish: mining adverse drug reactions. J Biomed Inform. 2015;56:318–32.
Atutxa A, Pérez A, Casillas A. Machine learning approaches on diagnostic term encoding with the ICD for clinical documentation. IEEE J Biomed & Health Inform. 2017;22(4):1323–9.
Chizhikova M, López-Úbeda P, Collado-Montañez J, Martín-Noguerol T, Díaz-Galiano MC, Luna A, et al. CARES: a Corpus for classification of Spanish Radiological reports. Comput Biol Med. 2023;154: 106581.
Cotik V, Filippo D, Roller R, Uszkoreit H, Xu F. Annotation of entities and relations in Spanish radiology reports. In: Proc. of RANLP; 2017. p. 177–184.
Fabregat H, Araujo L, Martinez J. Deep neural models for extracting entities and relationships in the new RDD corpus relating disabilities and rare diseases. Comput Methods Prog Biomed. 2018;164:121–9.
Goenaga I, Andres E, Gojenola K, Atutxa A. Advances in monolingual and crosslingual automatic disability annotation in Spanish. BMC Bioinform. 2023;24(1):1–18.
Báez P, Bravo-Marquez F, Dunstan J, Rojas M, Villena F. Automatic extraction of nested entities in clinical referrals in Spanish. ACM Trans Comput Healthc. 2022;3(3):1–22.
Dellanzo A, Cotik V, Lozano Barriga DY, Mollapaza Apaza JJ, Palomino D, Schiaffino F, et al. Digital surveillance in Latin American diseases outbreaks: information extraction from a novel Spanish corpus. BMC Bioinform. 2022;23(1):1–22.
Ruiz A, López JL, González I, Ruiz B. Sequence Time Expression Recognition in the Spanish Clinical Narrative. In: 2019 IEEE 32nd Int. Symp. Computer-Based Med. Syst. (CBMS). IEEE; 2019. p. 331–336.
Sánchez-de Castro A, Araujo L, Martínez J. RoBERTime: a novel model for the detection of temporal expressions in Spanish. Proces Leng Nat. 2023;70(1):39–51.
Moreno I, Boldrini E, Moreda P, Romá-Ferri MT. Drugsemantics: a corpus for named entity recognition in Spanish summaries of product characteristics. J Biomed Inform. 2017;72:8–22.
Cotik V, Stricker V, Vivaldi J, Rodríguez Hontoria H. Syntactic methods for negation detection in radiology reports in Spanish. In: Proc. of 15th BioNLP. Association for Computational Linguistics; 2016. p. 156–165.
Campillos-Llanos L, Martínez P, Segura-Bedmar I. A preliminary analysis of negation in a Spanish clinical records dataset. In: Proc. 1st Workshop Spanish Negation (NEGES); 2017.
AEMPS. Nomenclátor de Prescripción. [Retrieved on 2024-02-23]. 2023;Available from: www.aemps.gob.es.
Manlhiot C, van den Eynde J, Kutty S, Ross HJ. A primer on the present state and future prospects for machine learning and artificial intelligence applications in cardiology. Can J Cardiol. 2022;38(2):169–84.
Pustejovsky J, Castano JM, Ingria R, Sauri R, Gaizauskas RJ, Setzer A, et al. TimeML: Robust specification of event and temporal expressions in text. New Dir Quest Answ. 2003;3:28–34.
Styler WF, Bethard S, Finan S, Palmer M, Pradhan S, De Groen PC, et al. Temporal annotation in the clinical domain. TACL. 2014;2:143–54.
Uzuner Ö, Solti I, Cadag E. Extracting medication information from clinical text. JAMIA. 2010;17(5):514–8.
Henry S, Buchan K, Filannino M, Stubbs A, Uzuner O. 2018 n2c2 shared task on adverse drug events and medication extraction in electronic health records. JAMIA. 2020;27(1):3–12.
Uzuner Ö, South BR, Shen S, DuVall SL. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. JAMIA. 2011;18(5):552–6.
Thompson P, Daikou S, Ueno K, Batista-Navarro R, Tsujii J, Ananiadou S. Annotation and detection of drug effects in text for pharmacovigilance. J Cheminform. 2018;10(1):1–33.
Báez P, Villena F, Rojas M, Durán M, Dunstan J. The Chilean Waiting List Corpus: a new resource for clinical named entity recognition in Spanish. In: Proc. of 3rd Clinical NLP workshop; 2020. p. 291–300.
Kittner M, Lamping M, Rieke DT, Götze J, Bajwa B, Jelas I, et al. Annotation and initial evaluation of a large annotated German oncological corpus. JAMIA open. 2021;4(2):ooab025.
Vincze V, Szarvas G, Farkas R, Móra G, Csirik J. The BioScope corpus: biomedical texts annotated for uncertainty, negation and their scopes. BMC Bioinform. 2008;9(11):1–9.
Campillos L, Deléger L, Grouin C, Hamon T, Ligozat AL, Névéol A. A French clinical corpus with comprehensive semantic annotations: development of the Medical Entity and Relation LIMSI annOtated Text corpus (MERLOT). Lang Resour Eval. 2018;52:571–601.
Hripcsak G, Rothschild AS. Agreement, the f-measure, and reliability in information retrieval. JAMIA. 2005;12(3):296–8.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, et al. Attention is all you need. Proc of Advanc Neural Inf Process Syst. 2017;p. 5998–6008.
He P, Gao J, Chen W. DeBERTaV3: Improving DeBERTa using ELECTRA-Style Pre-Training with Gradient-Disentangled Embedding Sharing. In: Proc. of ICLR; 2023. .
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, et al. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692. 2019;.
Schuster M, Nakajima K. Wordpiece Tokenization. In: IEEE Int. Conf. Acoustics, Speech & Signal Process.. vol. 1; 2012. p. 5149–5152.
de la Iglesia I, Atutxa A, Gojenola K, Barrena A. EriBERTa: A Bilingual Pre-Trained Language Model for Clinical Natural Language Processing. arXiv preprint arXiv:2306.07373. 2023.
Bahdanau D. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473. 2014.
Luong T, Pham H, Manning CD. Effective Approaches to Attention-based Neural Machine Translation. In: Màrquez L, Callison-Burch C, Su J, editors. Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon, Portugal: Association for Computational Linguistics; 2015. p. 1412–1421.
WHO. WHO model list of essential medicines: 18th list, April 2013. World Health Organization; 2021.
Phan U, Nguyen N. Simple Semantic-based Data Augmentation for Named Entity Recognition in Biomedical Texts. In: Proc. of 21st BioNLP; 2022. p. 123–129.
Nakayama H. seqeval: A Python framework for sequence labeling evaluation; 2018.
Lannelongue L, Grealey J, Inouye M. Green algorithms: quantifying the carbon footprint of computation. Adv Sci. 2021;8(12):2100707.
Nejadgholi I, Fraser KC, de Bruijn B. Extensive error analysis and a learning-based evaluation of medical entity recognition systems to approximate user experience. In: Proc. of 19th BioNLP. Online; 2020. p. 177–186.
Mehrabi S, Krishnan A, Sohn S, Roch AM, Schmidt H, Kesterson J, et al. DEEPEN: a negation detection system for clinical text incorporating dependency relation into NegEx. J Biomed Inform. 2015;54:213–9.
Hosseini A, Reddy S, Bahdanau D, Hjelm RD, Sordoni A, Courville A. Understanding by Understanding Not: Modeling Negation in Language Models. In: Proc. of 2021 Conf. NAACL; 2021. p. 1301–1312.
Lossio-Ventura JA, Sun R, Boussard S, Hernandez-Boussard T. Clinical concept recognition: evaluation of existing systems on EHRs. Front Artif Intell. 2022;5:1051724.
García Subies G, Barbero Jiménez Á, Martínez Fernández P. A comparative analysis of Spanish Clinical encoder-based models on NER and classification tasks. J Am Med Inf Assoc. 2024. https://doi.org/10.1093/jamia/ocae054.
Tian S, Erdengasileng A, Yang X, Guo Y, Wu Y, Zhang J, et al. Transformer-based named entity recognition for parsing clinical trial eligibility criteria. In: Proc. 12th ACM Conf. Bioinform., Comp. Biol. & Health Inform.; 2021. p. 1–6.
Kury F, Butler A, Yuan C, Fu Lh, Sun Y, Liu H, et al. Chia, a large annotated corpus of clinical trial eligibility criteria. Sci data. 2020;7(1):1–11.
Tseo Y, Salkola M, Mohamed A, Kumar A, Abnousi F. Information extraction of clinical trial eligibility criteria. arXiv preprint arXiv:2006.07296. 2020;.
Hripcsak G, Duke JD, Shah NH, Reich CG, Huser V, Schuemie MJ, et al. Observational Health Data Sciences and Informatics (OHDSI): opportunities for observational researchers. Stud Health Technol Inform. 2015;216:574.
Grouin C, Grabar N, Hamon T, Rosset S, Tannier X, Zweigenbaum P. Eventual situations for timeline extraction from clinical reports. JAMIA. 2013;20(5):820–7.
Navas M, Rodríguez V. Annotador: a temporal tagger for Spanish. J Intell Fuzzy Syst. 2020;39(2):1979–91.
Xu H, Stenner SP, Doan S, Johnson KB, Waitman LR, Denny JC. MedEx: a medication information extraction system for clinical narratives. JAMIA. 2010;17(1):19–24.
Dobbins NJ, Mullen T, Uzuner Ö, Yetisgen M. The Leaf Clinical Trials Corpus: a new resource for query generation from clinical trial eligibility criteria. Sci Data. 2022;9(1):490.
Bhatia P, Celikkaya B, Khalilia M, Senthivel S. Comprehend Medical: a named entity recognition and relationship extraction web service. In: 18th IEEE Int. Conf. Mach. Learn. & Applications (ICMLA). IEEE; 2019. p. 1844–51.
Zanoli R, Lavelli A, do Amarante DV, Toti D. Assessment of the E3C corpus for the recognition of disorders in clinical texts. Nat Lang Eng. 2023;30(4):851–69.
Solarte-Pabón O, Montenegro O, García-Barragán A, Torrente M, Provencio M, Menasalvas E, et al. Transformers for extracting breast cancer information from Spanish clinical narratives. Artif Intell Med. 2023;143: 102625.
Acknowledgements
We are grateful to Dr. Antonio Moreno-Sandoval for giving us access to the computer facilities of the Computational Linguistics Laboratory (Autonomous University of Madrid) and to Dr. Jordi Porta for his help with the GPU. We also thank the Spanish Medicines Agency (AEMPS) and Dr. Jesús Santamaría for their support with the first stage of the human evaluation; Dr. Jonathan Heras for his valuable comments; and SAMFyC, which allowed us to use its published clinical cases. We warmly thank the medical professionals who evaluated the texts and provided us with feedback to improve the system: Beatriz Arribas, Jorge Collado, Roberto Crespo, Daniel Duménigo, Hermenegildo Martínez, Rosa López, Noelia Martín and Javier Pumares. We also thank the creators of the NUBes corpus and of the pre-trained models integrated in the tool. Lastly, we thank the anonymous reviewers for their valuable comments, which improved this work and the final version of the manuscript.
Availability and requirements
Project name: MEDSPANER. Code page: https://github.com/lcampillos/medspaner. Demo page: https://claramed.csic.es/medspaner. Trained models available at: https://huggingface.co/medspaner. Operating system(s): UNIX-based (including macOS). Programming language: Python 3.7 or higher. Other requirements: spaCy (v. 3.3.1) and Transformers (v. 4.16.2); the full list of requirements is provided in the companion repository. License: GNU Affero General Public License (AGPL-3.0). Any restrictions to use: MEDSPANER uses the MedLexSp lexicon, which is distributed upon signing the license available at: https://digital.csic.es/handle/10261/270429.
Funding
Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. The first version of MEDSPANER was developed under the Marie Curie COFUND grant agreement no. 713366 (InterTalentum UAM). The training of the Transformer-based models and their integration into the new version was conducted under the CLARA-MeD project (PID2020-116001RA-C33), funded by MICIU/AEI/10.13039/501100011033, in the project call "Proyectos I+D+i Retos de Investigación". The funding bodies did not take part in the design of the study; the collection, analysis, or interpretation of data; or the writing of the manuscript.
Author information
Authors and Affiliations
Contributions
LCL conceptualized the annotation scheme, collected the texts, annotated data, analyzed the results, conducted the experiments, programmed the software and prepared the manuscript. AVM contributed to the creation of the annotation scheme, annotation criteria and guidelines, doubly annotated some sets and reviewed the manuscript. ACC helped to create the annotation scheme and guidelines according to his medical knowledge, helped to engage the health professionals who participated in the external evaluation, doubly annotated some sets and reviewed the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Supplementary Material 1.
Appendix A. Training and annotation details
Appendix B. Detailed results and error analysis
Errors (%) with RoBERTa EHR models classified according to [111]. Type 1: false positives; Type 2: false negatives; Type 3: same span, label mismatch; Type 4: overlapping span, wrong label; Type 5: overlapping span, right label
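To make the five error types concrete, the sketch below (assumed for illustration, not the authors' evaluation code) classifies predicted entities against gold entities represented as (start, end, label) character spans, following the taxonomy of [111].

```python
# Illustrative sketch of the five error types used in this appendix, assuming
# entities are (start, end, label) tuples over character offsets.

def overlaps(a, b):
    """True if the character spans of two entities overlap."""
    return a[0] < b[1] and b[0] < a[1]

def classify_errors(gold, predicted):
    errors = []
    for p in predicted:
        matches = [g for g in gold if overlaps(g, p)]
        if not matches:
            errors.append(("type 1 (false positive)", p))
            continue
        g = matches[0]
        if (g[0], g[1]) == (p[0], p[1]) and g[2] != p[2]:
            errors.append(("type 3 (same span, label mismatch)", p))
        elif (g[0], g[1]) != (p[0], p[1]) and g[2] != p[2]:
            errors.append(("type 4 (overlapping span, wrong label)", p))
        elif (g[0], g[1]) != (p[0], p[1]) and g[2] == p[2]:
            errors.append(("type 5 (overlapping span, right label)", p))
    for g in gold:
        if not any(overlaps(g, p) for p in predicted):
            errors.append(("type 2 (false negative)", g))
    return errors

# Example: the gold standard marks one compound entity ("hemorragia digestiva alta"),
# but the model predicts only a shorter span ("hemorragia") with the right label.
gold = [(0, 25, "DISO")]
pred = [(0, 10, "DISO")]
print(classify_errors(gold, pred))  # -> type 5 (overlapping span, right label)
```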
UMLS entities
Scope errors (type 5) were the most frequent. These especially affected adjectives that were not annotated in the gold standard, or vice versa (e.g., hemorragia digestiva alta, 'upper gastrointestinal bleeding'; Fig. 10). They also occurred in proper names. Scope errors often appeared in compound entities, when the model predicted two separate entities where one entity was marked in the gold standard, or vice versa (e.g., 'recurrent VTE in pregnancy'). Entity scope was also a frequent source of ambiguity for the human annotators.
False positives (FPs, type 1 errors) abounded. These were related to polysemous entities (e.g., brazo, 'arm', was labeled as ANAT when it referred to a trial cohort). Some proper names and abbreviations were mislabeled; e.g., CHEM in AN69ST®, or FCI ('informed consent document') misinterpreted as DEVI. Most false negatives (FNs, type 2 errors) were due to an insufficient number of training instances, mainly for anatomical entities, physiology terms and medical devices. Some infrequent abbreviations were not recognized (e.g., GE, 'experimental group', LIVB). Wrong labels (type 3 and 4 errors) were due to ambiguous entities: e.g., muestras can refer to anatomical samples (ANAT) or to the sampling procedure (PROC).
Medication information entities
Many errors were FPs (type 1). In particular, the model incorrectly labeled as Dose some numerical expressions and laboratory result values (e.g., blood concentrations of tacrolimus of 5-15 μg/L). This mostly affected the eligibility criteria of trial announcements and was the main cause of the model's decreased performance. Route FPs involved words expressing spatial concepts related to procedures. Semantically related terms expressing the site where a drug acts also caused FPs (e.g., oral corticosteroids with topical action).
However, the most frequent errors were FNs (type 2). These mostly affected Form entities, since some dosage form terms were infrequent in the training data (e.g., film, gel, mouthwashes). Likewise, the model's performance was not high on the Contraindicated class, given the scarcity of such entities in the corpus. Errors in this category caused both FNs and FPs (e.g., the model also annotated Contraindicated on some diseases, whereas our annotation scheme only targeted medications or procedures). In most contexts, this was not a serious error. Scope errors (type 5) were very abundant and mostly affected quantifiers in Dose entities (e.g., superior a 10 mg, 'above 10 mg'). Mismatch errors (types 3 and 4) were sporadic (Fig. 11).
Temporal entities
Most errors were FNs (type 2), mainly in entities with low frequency in our training data (e.g., pre-inclusion, QT > 500 ms; Fig. 12). Also, the model did not annotate Age in some LIVB entities expressing age groups (e.g., neonate). The second most common error type affected the scope (type 5), namely when the model predicted a shorter span than the full entity (e.g., la semana previa, 'the previous week').
Numerous errors were due to label mismatch (type 3), especially in Date and Time entities with pre-/post- affixes. Their meaning depends on the context: e.g., postoperatorio tends to be Date (as in dolor postoperatorio, 'postoperative pain'), but it is Time when it refers to a period shorter than 24 h. Another confusion affected Date and Duration, which can be interpreted as a point in time, an interval or a temporal container (e.g., the fifth week, over the past few years). These entity types were also ambiguous for the human annotators (the IAA score for Time was low).
FPs (type 1) occurred in LIVB entities mislabeled as Age (e.g., student). Some numbers in protocol identifiers were mistaken for Date (e.g., NOPHO-DBH AML 2012 protocol). Other errors were due to tokenization (e.g., min-1 in Frequency) or to inconsistencies in the gold standard.
Miscellaneous entities
FNs (type 2) were the most recurrent errors and tended to appear in Food_or_Drink, Observation or Quantifier_or_Qualifier entities (e.g., refractory shock; Fig. 13). These entities are highly variable, and some instances in the test set might not occur in the training data. FPs (type 1) were the second most prevalent errors and frequently affected Result_or_Value entities, especially when an expression contained numbers or typographical characters used in measurements (e.g., DSM-IV-TR Axis I diagnosis). Finally, scope errors (type 5) were common in Observation entities. Often, the model did not label some parts of the entity, or tagged discontinuous expressions where one entity was tagged in the gold standard (e.g., radiological, clinical and analytical findings). This can be related to the high variability of word forms and also to annotation inconsistencies.
Negation and speculation
The main source of errors was related to scope (type 5). These errors occurred when the model did not label modifiers included in the gold standard, or vice versa: e.g., riesgo de sangrado agudo ('risk of acute bleeding'). At times, the model did not label the negation or speculation cue when it appeared after the focus event (e.g., post-puncture headache did not happen).
FPs (type 1) also affected the scope. In some contexts, the model annotated the full scope, but only the focus of negation or speculation was marked in the reference (e.g., unwilling to practice acceptable methods of birth control). Other FPs were due to the modifier negative, in cases such as Serological tests negative against syphilis. According to our criteria, we only labeled the negated entity (syphilis), since the focus of the negation is the test result (Fig. 14).
Another cause of FPs was negated entities within a wider negated context. In the gold standard, we did not annotate them, since negating a negated entity makes a positive statement: e.g., in patients without any hearing impairment except age-related pathology, only hearing impairment was marked as negated in the gold standard. These cases were ambiguous to the human annotators, but we tried to preserve the semantic coherence of the eligibility criteria.
Contexts requiring medical knowledge also caused errors. In Forms of psoriasis other than chronic plaque-type (e.g., pustular, erythrodermic and guttate psoriasis), the model labeled the entities between brackets as negated. However, since these are neither hyponyms nor co-referents of plaque psoriasis, they should not be negated. Stylistic uses of negation were also problematic; the model mislabeled contexts where no event was in fact negated: e.g., antes de que se realice ninguna evaluación ('before any evaluation is done').
FNs (type 2) were the third main cause of error, mainly due to the lack of training instances. Lexical negation caused many errors, since the model did not recognize verbs or nouns negating an event. Compared to syntactic negation cues, lexical cues vary to a greater degree: e.g., cesar ('to quit'), descartar ('to rule out'), retirada de ('withdrawal of'), supresión ('elimination'). Indeed, the human annotators hesitated about whether to mark them. The larger number of errors in speculation cues can be explained by the greater variety of markers (compared to negation cues; e.g., dudosamente, 'doubtfully'; potencial, 'potential') and of completive clauses with speculation verbs (e.g., creemos que, 'we believe that'; evaluar si, 'to evaluate whether'). This variability also affected verbs in the conditional mood, which can also express uncertainty (e.g., habría que descartar, 'would have to be ruled out').
Event temporality and experiencer attributes
FNs (type 2) represented almost half of the errors and tended to occur in attributes with few instances in the test set: e.g., Future is missing in COVID-19 with pending PCR. FNs also occurred in entities with higher variability. For example, the Other person attribute might refer to a medical professional, a legal representative or a researcher: e.g., it is missing in according to experts' recommendations. The History_of attribute was missed in contexts without the standard cue words, or when these appeared far from the focus word in the sentence.
FPs (type 1) were the second most common error type. They often involved temporality attributes, when the model predicted Future or History_of for a present condition. Lastly, the experiencer attribute was also mismatched in contexts where a medical professional was a participant in a clinical trial (Fig. 15).
Appendix C. Functioning and usage of the tool
Configuration file with annotation options*. * att: annotate experiencer/temporality attributes; drg: annotate drug information; exc: use list of excluded words; lex: use lexicon; misc: annotate miscellaneous entities; neg: detect negation/speculation; nest: output nested entities; neu: use UMLS neural models; norm: umls (normalization to UMLS CUIs) or snomed (normalization to SNOMED CT codes); out: output format (ann for BRAT, or json); temp: annotate temporal entities.
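For illustration only, the options listed in the caption can be summarized as follows, expressed here as a Python dictionary; the exact syntax and defaults of the configuration file may differ (see Figure C.20).

```python
# Illustrative only: annotation options from the configuration file, with the key
# names taken from the caption above. Values shown here are example settings, not
# the tool's defaults.
config = {
    "att": True,     # annotate experiencer/temporality attributes
    "drg": True,     # annotate drug (medication) information
    "exc": True,     # use list of excluded words
    "lex": True,     # use the lexicon
    "misc": False,   # annotate miscellaneous entities
    "neg": True,     # detect negation/speculation
    "nest": False,   # output nested entities
    "neu": True,     # use UMLS neural models
    "norm": "umls",  # "umls" for UMLS CUIs, "snomed" for SNOMED CT codes
    "out": "json",   # output format: "json" or "ann" (BRAT)
    "temp": True,    # annotate temporal entities
}
```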
Appendix D: Comparison of system results with similar works
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
About this article
Cite this article
Campillos-Llanos, L., Valverde-Mateos, A. & Capllonch-Carrión, A. Hybrid natural language processing tool for semantic annotation of medical texts in Spanish. BMC Bioinformatics 26, 7 (2025). https://doi.org/10.1186/s12859-024-05949-6
DOI: https://doi.org/10.1186/s12859-024-05949-6