Hybrid natural language processing tool for semantic annotation of medical texts in Spanish

BMC Bioinformatics

Table 1 Characteristics of the Transformer-based models and pre-training details (medical-domain models are italicized in rows 2-4)

Model	PT corpus size	#A	#H	#L	#P	#V
RoBERTa EHR (bsc-bio-ehr-es)	>1B tok	12	768	12	125M	52K
EriBERTa (EriBERTa-base)	900M tok	12	768	12	125M	50K
CLIN-X-ES (xlm-roberta-large-spanish-clinical)	790MB	16	1024	24	550M	250K
mBERT (bert-base-multilingual-cased)	2.5T	12	768	12	110M	110K
mDeBERTa (mdeberta-v3-base)	2.5T	12	768	12	190M	250K

#A: number of attention heads; #H: hidden size; #L: number of layers; #P: number of parameters; #V: vocabulary size;
B: billion; K: thousand; M: million; MB: megabytes; PT: pre-training; T: terabytes; tok: tokens
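
As a rough illustration of how the architecture columns of Table 1 relate to each other, the sketch below encodes the #A, #H and #L values from the table and derives the per-head dimension, assuming the standard multi-head attention layout in which the hidden size is split evenly across the attention heads. The names `ModelSpec`, `TABLE_1` and `head_dims` are hypothetical and are not taken from the article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelSpec:
    """One row of Table 1 (hypothetical container, not from the article)."""

    name: str
    heads: int   # column #A
    hidden: int  # column #H
    layers: int  # column #L

    @property
    def head_dim(self) -> int:
        # Assumption: the hidden size is divided evenly across attention heads.
        if self.hidden % self.heads != 0:
            raise ValueError(f"{self.name}: hidden size not divisible by heads")
        return self.hidden // self.heads


# Values copied from the #A, #H and #L columns of Table 1.
TABLE_1 = [
    ModelSpec("RoBERTa EHR", heads=12, hidden=768, layers=12),
    ModelSpec("EriBERTa", heads=12, hidden=768, layers=12),
    ModelSpec("CLIN-X-ES", heads=16, hidden=1024, layers=24),
    ModelSpec("mBERT", heads=12, hidden=768, layers=12),
    ModelSpec("mDeBERTa", heads=12, hidden=768, layers=12),
]


def head_dims(specs):
    """Map each model name to its per-head dimension (#H divided by #A)."""
    return {spec.name: spec.head_dim for spec in specs}


if __name__ == "__main__":
    for name, dim in head_dims(TABLE_1).items():
        print(f"{name}: {dim} dimensions per attention head")
```

Under this assumption, the larger CLIN-X-ES model differs from the base-sized models in the number of heads and layers rather than in the width of each head.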

ISSN: 1471-2105