Table 3 Results on classification tasks

From: Can large language models understand molecules?

| Dataset | BBBP | | BACE | | HIV | |
|---|---|---|---|---|---|---|
| # Compounds | 2039 | | 1513 | | 41127 | |
| Negative:Positive | \(\approx\)1:3 | | \(\approx\)1:1 | | \(\approx\)28:1 | |
| Models | F1-Score | AUROC | F1-Score | AUROC | F1-Score | AUROC |
| Morgan FP | 0.921 ± 0.003 | 0.896 ± 0.014 | **0.778 ± 0.027** | **0.880 ± 0.020** | 0.373 ± 0.028 | 0.797 ± 0.019 |
| BERT | **0.935 ± 0.005** | 0.947 ± 0.007 | 0.744 ± 0.023 | 0.845 ± 0.016 | 0.182 ± 0.032 | 0.780 ± 0.011 |
| ChemBERTa | 0.926 ± 0.011 | 0.944 ± 0.012 | 0.767 ± 0.020 | 0.862 ± 0.011 | 0.294 ± 0.033 | 0.767 ± 0.019 |
| MolFormer-XL | 0.927 ± 0.006 | 0.934 ± 0.007 | 0.762 ± 0.012 | 0.860 ± 0.010 | 0.317 ± 0.032 | **0.804 ± 0.010** |
| GPT | 0.908 ± 0.007 | 0.921 ± 0.015 | 0.648 ± 0.025 | 0.743 ± 0.030 | 0.039 ± 0.010 | 0.746 ± 0.009 |
| LLaMA | 0.933 ± 0.006 | **0.953 ± 0.009** | 0.766 ± 0.024 | 0.859 ± 0.017 | **0.391 ± 0.013** | 0.802 ± 0.010 |
| LLaMA2 | 0.930 ± 0.006 | 0.945 ± 0.004 | 0.772 ± 0.023 | 0.863 ± 0.018 | 0.378 ± 0.017 | 0.799 ± 0.008 |

  1. The reported metrics are the mean and standard deviation of the F1-score and AUROC across the five folds. The best performance per column is highlighted in bold.
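The evaluation protocol behind these numbers (per-fold F1 and AUROC, then mean ± standard deviation over five folds) can be sketched as follows. This is a minimal illustration on synthetic labels and scores with a stand-in scoring model, not the fingerprint or embedding classifiers evaluated in the table; the metric definitions themselves are standard.

```python
import random
import statistics

def f1_score(y_true, y_pred):
    """F1 = 2*TP / (2*TP + FP + FN) for binary labels."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auroc(y_true, y_score):
    """Probability a random positive is scored above a random negative (ties count half)."""
    pos = [s for t, s in zip(y_true, y_score) if t]
    neg = [s for t, s in zip(y_true, y_score) if not t]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy surrogate for five cross-validation folds: imbalanced labels
# and noisy scores that are only partially informative.
rng = random.Random(0)
f1s, aucs = [], []
for _ in range(5):
    y = [rng.random() < 0.25 for _ in range(400)]      # ~1:3 positive:negative
    s = [0.35 * t + 0.65 * rng.random() for t in y]    # overlapping score distributions
    f1s.append(f1_score(y, [v >= 0.5 for v in s]))     # F1 at a 0.5 threshold
    aucs.append(auroc(y, s))                           # threshold-free AUROC

print(f"F1:    {statistics.mean(f1s):.3f} ± {statistics.stdev(f1s):.3f}")
print(f"AUROC: {statistics.mean(aucs):.3f} ± {statistics.stdev(aucs):.3f}")
```

On the heavily imbalanced HIV set (\(\approx\)28:1), this distinction between the two metrics matters: AUROC is insensitive to the class ratio, while F1 collapses when a model rarely predicts the minority class, which is consistent with the very low F1 yet moderate AUROC reported for GPT.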