Can large language models understand molecules?

BMC Bioinformatics

Table 3 Results on classification tasks

Dataset	BBBP		BACE		HIV
# Compounds	2039		1513		41127
Negative:Positive	≈1:3		≈1:1		≈28:1

Models	F1-Score	AUROC	F1-Score	AUROC	F1-Score	AUROC
Morgan FP	0.921 ± 0.003	0.896 ± 0.014	0.778 ± 0.027	0.880 ± 0.020	0.373 ± 0.028	0.797 ± 0.019
BERT	0.935 ± 0.005	0.947 ± 0.007	0.744 ± 0.023	0.845 ± 0.016	0.182 ± 0.032	0.780 ± 0.011
ChemBERTa	0.926 ± 0.011	0.944 ± 0.012	0.767 ± 0.020	0.862 ± 0.011	0.294 ± 0.033	0.767 ± 0.019
MolFormer-XL	0.927 ± 0.006	0.934 ± 0.007	0.762 ± 0.012	0.860 ± 0.010	0.317 ± 0.032	0.804 ± 0.010
GPT	0.908 ± 0.007	0.921 ± 0.015	0.648 ± 0.025	0.743 ± 0.030	0.039 ± 0.010	0.746 ± 0.009
LLaMA	0.933 ± 0.006	0.953 ± 0.009	0.766 ± 0.024	0.859 ± 0.017	0.391 ± 0.013	0.802 ± 0.010
LLaMA2	0.930 ± 0.006	0.945 ± 0.004	0.772 ± 0.023	0.863 ± 0.018	0.378 ± 0.017	0.799 ± 0.008

The reported performance metrics are the mean and standard deviation of the F1-score and AUROC, calculated across the five folds. The best performance is highlighted in bold.
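To make the "mean ± standard deviation across the five folds" entries concrete, the following is a minimal, standard-library-only sketch of how such a cell could be produced. It is not the authors' evaluation code. The helper names (`f1_score`, `auroc`, `summarize`), the 0.5 decision threshold, the use of the sample standard deviation, and the toy per-fold labels and scores are all assumptions made for illustration; none of the numbers below come from the paper.

```python
from statistics import mean, stdev


def f1_score(y_true, y_pred):
    """F1 for the positive class (label 1): 2*TP / (2*TP + FP + FN)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def auroc(y_true, y_score):
    """AUROC as the probability that a random positive is scored above
    a random negative, counting ties as one half."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def summarize(values):
    """Format per-fold values as 'mean ± std'.
    Assumption: sample standard deviation (n - 1 denominator);
    use statistics.pstdev for the population version."""
    return f"{mean(values):.3f} ± {stdev(values):.3f}"


# Hypothetical toy data: (true labels, predicted scores) for five folds.
toy_folds = [
    ([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2]),
    ([1, 0, 1, 0], [0.8, 0.7, 0.3, 0.1]),
    ([1, 1, 1, 0], [0.7, 0.4, 0.9, 0.5]),
    ([0, 0, 1, 1], [0.2, 0.6, 0.8, 0.7]),
    ([1, 0, 0, 1], [0.6, 0.3, 0.5, 0.9]),
]

THRESHOLD = 0.5  # assumed cut-off for turning scores into class labels

f1_per_fold = []
auroc_per_fold = []
for y_true, y_score in toy_folds:
    y_pred = [1 if s >= THRESHOLD else 0 for s in y_score]
    f1_per_fold.append(f1_score(y_true, y_pred))
    auroc_per_fold.append(auroc(y_true, y_score))

print("F1-Score:", summarize(f1_per_fold))
print("AUROC:   ", summarize(auroc_per_fold))
```

F1 depends on the chosen threshold while AUROC depends only on the ranking of the scores, which is one reason the two metrics can diverge sharply on a heavily imbalanced dataset such as HIV (≈28:1 negative to positive).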

ISSN: 1471-2105