
Table 4 Results on multi-task classification tasks

From: Can large language models understand molecules?

Datasets: ClinTox (1478 compounds, 2 tasks); SIDER (1427 compounds, 27 tasks); Tox21 (7831 compounds, 12 tasks)

| Models | ClinTox F1-Score | ClinTox AUROC | SIDER F1-Score | SIDER AUROC | Tox21 F1-Score | Tox21 AUROC |
|---|---|---|---|---|---|---|
| Morgan FP | 0.647 ± 0.065 | 0.799 ± 0.063 | **0.634 ± 0.008** | **0.629 ± 0.010** | 0.314 ± 0.019 | 0.761 ± 0.010 |
| BERT | 0.919 ± 0.035 | **0.983 ± 0.017** | 0.617 ± 0.008 | 0.625 ± 0.014 | 0.192 ± 0.019 | **0.786 ± 0.011** |
| ChemBERTa | 0.896 ± 0.019 | 0.965 ± 0.010 | 0.628 ± 0.014 | 0.628 ± 0.012 | 0.236 ± 0.013 | 0.781 ± 0.008 |
| MolFormer-XL | **0.929 ± 0.038** | 0.982 ± 0.013 | 0.624 ± 0.012 | 0.605 ± 0.009 | 0.315 ± 0.008 | 0.775 ± 0.012 |
| GPT | 0.520 ± 0.035 | 0.963 ± 0.019 | 0.601 ± 0.005 | 0.612 ± 0.013 | 0.032 ± 0.008 | 0.757 ± 0.015 |
| LLaMA | 0.881 ± 0.053 | 0.980 ± 0.008 | 0.627 ± 0.007 | 0.605 ± 0.008 | **0.339 ± 0.015** | 0.774 ± 0.010 |
| LLaMA2 | 0.905 ± 0.036 | 0.978 ± 0.014 | 0.627 ± 0.004 | 0.599 ± 0.009 | 0.332 ± 0.012 | 0.773 ± 0.009 |

  1. The reported performance metrics are the mean and standard deviation of the F1-score and AUROC, calculated across the five folds. The best performance is highlighted in bold.
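The mean ± std entries above come from evaluating each model on five cross-validation folds. A minimal sketch of that aggregation is below, with hand-rolled F1 and AUROC (the latter via the Mann–Whitney rank formulation) so it runs without extra dependencies; the synthetic fold data and the 0.5 decision threshold are assumptions for illustration, not the paper's setup.

```python
import numpy as np

def f1_score(y_true, y_pred):
    """Binary F1 = 2*TP / (2*TP + FP + FN)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0

def auroc(y_true, scores):
    """AUROC via ranks: probability a random positive outscores a
    random negative (exact when scores have no ties)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = int(np.sum(y_true == 1))
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Aggregate over five folds, as in the table's footnote;
# the per-fold labels/scores here are synthetic placeholders.
rng = np.random.default_rng(0)
fold_f1, fold_auc = [], []
for _ in range(5):
    y_true = rng.integers(0, 2, 200)
    scores = y_true * 0.3 + rng.normal(0.4, 0.25, 200)  # noisy predictor
    fold_f1.append(f1_score(y_true, (scores >= 0.5).astype(int)))
    fold_auc.append(auroc(y_true, scores))

print(f"F1:    {np.mean(fold_f1):.3f} ± {np.std(fold_f1):.3f}")
print(f"AUROC: {np.mean(fold_auc):.3f} ± {np.std(fold_auc):.3f}")
```

For the multi-task datasets (e.g. SIDER's 27 tasks), the same two metrics would be computed per task and averaged before the across-fold mean and standard deviation are taken.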