ChemBERTa Statistics 2026

ChemBERTa-77M-MLM recorded 49,475 monthly downloads on HuggingFace as of December 2025, ranking it among the most downloaded transformer models for chemistry. ChemBERTa-3, released in July 2025, scaled pretraining to 1.4 billion compounds from the ZINC20 dataset. This article covers download volumes, benchmark scores, model variants, and the broader AI drug discovery market driving adoption.

Key ChemBERTa Statistics 2026

ChemBERTa-77M-MLM recorded 49,475 monthly HuggingFace downloads in December 2025.
ChemBERTa-3, released in July 2025, was pretrained on 1.4 billion compounds from ZINC20.
ChemBERTa-2 outperformed the D-MPNN baseline on 6 of 8 MoleculeNet tasks.
The 77M-MLM variant drew more than 10 times the downloads of the 10M-MTR variant.
The AI drug discovery market is projected to reach $8.6 billion in 2026.

How Many People Download ChemBERTa?

ChemBERTa-77M-MLM is the most downloaded variant in the family, with 49,475 monthly downloads on HuggingFace as of December 2025. The MTR variants trail far behind: 77M-MLM draws roughly four times the volume of 77M-MTR and more than 10 times the volume of 10M-MTR.

The DeepChem organization, which maintains ChemBERTa, has 91 followers on HuggingFace. Seven derived fine-tuned models and three active HuggingFace Spaces use ChemBERTa as their foundation.

Variant	Pretraining Data	Monthly Downloads
ChemBERTa-77M-MLM	77M PubChem	49,475
ChemBERTa-77M-MTR	77M PubChem	~12,000
ChemBERTa-10M-MLM	10M PubChem	~8,500
ChemBERTa-10M-MTR	10M PubChem	~4,800

Source: HuggingFace Model Hub, December 2025
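
The ratios quoted in this section can be checked directly against the table. The snippet below is a minimal sketch: the dictionary only restates the table's figures (three of which are approximate), and the names monthly_downloads and download_ratio are chosen here for illustration.

```python
# Monthly download figures restated from the table above.
# Entries marked "~" in the table are approximate, so the ratios are too.
monthly_downloads = {
    "ChemBERTa-77M-MLM": 49_475,
    "ChemBERTa-77M-MTR": 12_000,
    "ChemBERTa-10M-MLM": 8_500,
    "ChemBERTa-10M-MTR": 4_800,
}


def download_ratio(numerator: str, denominator: str) -> float:
    """Return how many times more downloads `numerator` has than `denominator`."""
    return monthly_downloads[numerator] / monthly_downloads[denominator]


if __name__ == "__main__":
    top = "ChemBERTa-77M-MLM"
    for name in monthly_downloads:
        if name != top:
            print(f"{top} vs {name}: {download_ratio(top, name):.1f}x")
```

With these inputs, the 77M-MLM model comes out slightly above 10x the 10M-MTR model, which is consistent with the "more than 10 times" claim.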

ChemBERTa Statistics 2026 Across Model Generations

The ChemBERTa family scaled rapidly across three generations. The original release used 10 million SMILES from PubChem. ChemBERTa-2 expanded to 77 million compounds. ChemBERTa-3 jumped to 1.4 billion compounds from ZINC20 in July 2025, with infrastructure designed to handle billion-molecule datasets on AWS Ray and on-premise HPC clusters.

Version	Release	Dataset	Compounds
ChemBERTa	2020	PubChem	10 million
ChemBERTa-2	2022	PubChem	77 million
ChemBERTa-3	July 2025	ZINC20	1.4 billion

Source: arXiv, ChemRxiv (DOI 10.26434/chemrxiv-2025-4glrl-v2)

How Does ChemBERTa Perform on MoleculeNet?

ChemBERTa models are evaluated on the MoleculeNet benchmark suite. ChemBERTa-2 beat the D-MPNN (chemprop) baseline on 6 out of 8 MoleculeNet tasks. Pretraining loss improved by 25 to 35% when the dataset scaled from 5 million to 77 million compounds.

The MLM-100M variant outperformed the much larger MoLFormer 1.1B model on blood-brain barrier penetration and clinical toxicity classification. Scaling from 100K to 10M compounds delivered an average ROC-AUC gain of +0.110 across BBBP, ClinTox, and Tox21.

Benchmark Task	ChemBERTa ROC-AUC	D-MPNN ROC-AUC
BBBP	0.733	0.708
ClinTox	0.952	0.905
Tox21 (SR-p53)	0.728	0.688
HIV (MLM)	0.793	0.771
BACE	0.799	0.812

Source: ChemBERTa-2 paper (arXiv:2209.01712)
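
To make the table easier to read at a glance, the sketch below computes the per-task margin between the two models and lists the tasks where ChemBERTa leads. It uses only the five rows shown above, not the full eight-task comparison, and the function names are illustrative.

```python
# ROC-AUC values restated from the table above: (ChemBERTa, D-MPNN).
roc_auc_scores = {
    "BBBP": (0.733, 0.708),
    "ClinTox": (0.952, 0.905),
    "Tox21 (SR-p53)": (0.728, 0.688),
    "HIV (MLM)": (0.793, 0.771),
    "BACE": (0.799, 0.812),
}


def margins(scores):
    """Map each task to ChemBERTa's ROC-AUC minus D-MPNN's, rounded to 3 places."""
    return {task: round(cb - dm, 3) for task, (cb, dm) in scores.items()}


def chemberta_wins(scores):
    """List the tasks where ChemBERTa's ROC-AUC is strictly higher."""
    return [task for task, (cb, dm) in scores.items() if cb > dm]


if __name__ == "__main__":
    for task, delta in margins(roc_auc_scores).items():
        print(f"{task}: {delta:+.3f}")
    print("ChemBERTa leads on:", ", ".join(chemberta_wins(roc_auc_scores)))
```

On the rows listed here, ChemBERTa leads on four tasks and trails D-MPNN on BACE.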

ChemBERTa Architecture Specifications

ChemBERTa builds on the RoBERTa transformer, adapted for SMILES strings. The architecture uses 12 attention heads in each of 6 transformer layers, for 72 attention heads in total. The model accepts a maximum sequence length of 512 tokens.

Pretraining Objectives Compared

ChemBERTa supports two pretraining tasks: Masked Language Modeling (MLM) and Multi-Task Regression (MTR). MLM masks 15% of SMILES tokens and trains the model to predict them. MTR predicts 200 RDKit-calculated physicochemical properties.

Attribute	MLM	MTR
Training Speed	Faster	Slower
Output Targets	Masked tokens	200 RDKit properties
Best HIV Score (AUROC)	0.793	0.733
Monthly Downloads (77M)	49,475	~12,000

Source: ChemBERTa-2 paper, HuggingFace Model Hub
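
The MLM objective described above can be illustrated with a small, self-contained sketch. Everything here is a simplification and should be read as an assumption rather than ChemBERTa's actual pipeline: the regex tokenizer is a stand-in for the model's real tokenizer, and the masking step replaces a fixed 15% of positions, whereas production implementations typically select tokens probabilistically and do not always substitute the mask token.

```python
import random
import re

# Simplified SMILES tokenizer for illustration only: bracket atoms, the
# two-letter halogens Br and Cl, two-digit ring closures (%NN), then any
# single character. This is an assumption, not ChemBERTa's tokenizer.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

MASK_TOKEN = "<mask>"   # RoBERTa-style mask token string
MASK_FRACTION = 0.15    # masking rate stated in the text


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens using the simplified regex above."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(tokens, fraction=MASK_FRACTION, seed=0):
    """Replace a fixed fraction of tokens with MASK_TOKEN.

    Returns (masked_tokens, labels), where labels maps each masked
    position to the original token the model would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * fraction))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = MASK_TOKEN
    return masked, labels


if __name__ == "__main__":
    aspirin = "CC(=O)Oc1ccccc1C(=O)O"
    masked, labels = mask_tokens(tokenize(aspirin))
    print("".join(masked))
    print(labels)
```

The labels dictionary holds the training targets: the model sees the masked sequence and is scored on how well it recovers the original tokens at those positions.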

How ChemBERTa Compares to Other Chemical Foundation Models

ChemBERTa-2 was trained on 77 million compounds. MolBERT achieved comparable performance with just 1.6 million compounds through chemistry-aware tokenization. MoLFormer scaled to 1.1 billion molecules but did not produce proportional gains over smaller models.

Model	Parameters	Training Data	Architecture
ChemBERTa-2	~83M	77M (PubChem)	RoBERTa
ChemBERTa-3	Variable	1.4B (ZINC20)	RoBERTa / MoLFormer
MolBERT	85M	1.6M	BERT
MoLFormer	1.1B	1.1B	Linear attention
D-MPNN	~1M	Task-specific	Graph NN

Source: Digital Discovery (RSC), DOI 10.1039/D5DD00348B

ChemBERTa Statistics 2026 in the AI Drug Discovery Market

The AI drug discovery market hit $6.0 billion in 2025 and is projected to reach $8.6 billion in 2026, growing at a 12.6% CAGR through 2035. North America accounts for more than 50% of the market. Lead optimization holds the largest share at roughly 50% of total spend.

Year	Market Size (USD)
2024	$5.4 billion
2025	$6.0 billion
2026	$8.6 billion
2030	$13.5 billion
2035	$25.0 billion

Source: Roots Analysis, AI in Drug Discovery Market Report 2026-2035
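
The growth rate can be sanity-checked against the table with the standard compound annual growth rate formula. The sketch below restates the table's figures. Treating 2026 as the base year for the 12.6% figure is an assumption inferred from the numbers, not something the source states.

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    return (end_value / start_value) ** (1 / years) - 1


# Market sizes in USD billions, restated from the table above.
market_size = {2024: 5.4, 2025: 6.0, 2026: 8.6, 2030: 13.5, 2035: 25.0}

# Assumed base year for the quoted CAGR: 2026 (inferred, not stated).
rate_2026_2035 = cagr(market_size[2026], market_size[2035], 2035 - 2026)

# Single-year growth from 2025 to 2026, for comparison.
rate_2025_2026 = cagr(market_size[2025], market_size[2026], 1)

if __name__ == "__main__":
    print(f"2026-2035 CAGR: {rate_2026_2035:.1%}")
    print(f"2025-2026 growth: {rate_2025_2026:.1%}")
```

Under that assumption the 2026 to 2035 figures reproduce a rate of about 12.6%, while the single-year step from 2025 to 2026 is much steeper, at roughly 43%.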

What Are ChemBERTa’s Main Use Cases?

Pharmaceutical and academic teams use ChemBERTa for molecular property prediction, toxicity screening, pharmacokinetic prediction, and drug-drug interaction classification. The model also supports fine-tuning for ADMET prediction and binding affinity tasks. A January 2026 ScienceDirect paper showed that a hybrid architecture combining ChemBERTa with a graph neural network outperformed standalone networks on solubility, toxicity, and binding affinity.

Where ChemBERTa Falls Short

ChemBERTa-10M scored 0.207 PRC-AUC on Tox21, well below D-MPNN at 0.429. The gap shows that current transformer iterations lack the inductive biases of graph algorithms and struggle with severe class imbalances in chemical datasets. Training large foundation models is also expensive, with billion-compound runs requiring AWS spot infrastructure or HPC clusters.
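
Part of the reason PRC-AUC exposes this gap is that it is far more sensitive to class imbalance than ROC-AUC. The sketch below illustrates the effect on a made-up toy ranking with 2 positives among 10 samples. The data is invented purely for illustration and has nothing to do with Tox21, and average precision is used here as a common summary of the precision-recall curve, which is an assumption about how PRC-AUC is computed.

```python
def pairwise_roc_auc(labels, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative count as half a correct pair.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    correct = sum(
        1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg
    )
    return correct / (len(pos) * len(neg))


def average_precision(labels, scores):
    """Mean of the precision values measured at the rank of each positive."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for rank, (_, y) in enumerate(ranked, start=1):
        if y == 1:
            hits += 1
            total += hits / rank
    if hits == 0:
        raise ValueError("need at least one positive")
    return total / hits


# Hypothetical toy data: 2 positives among 10 samples, ranked 2nd and 5th.
toy_labels = [0, 1, 0, 0, 1, 0, 0, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

if __name__ == "__main__":
    print(f"ROC-AUC: {pairwise_roc_auc(toy_labels, toy_scores):.2f}")
    print(f"Average precision: {average_precision(toy_labels, toy_scores):.2f}")
```

On this toy ranking the same predictions score 0.75 on ROC-AUC but only 0.45 on average precision, because a few high-scoring negatives cost little in ROC terms yet sharply reduce precision when positives are rare.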

Adoption Across Research Communities

ChemBERTa is integrated into the DeepChem library, which can be installed with pip (pip install deepchem). The model appears in over 200 academic papers indexed on Google Scholar. Researchers analyzing molecular datasets often pair ChemBERTa embeddings with downstream classifiers such as random forests or logistic regression.

The seven derived fine-tuned models on HuggingFace cover specialized tasks like reaction prediction, retrosynthesis, and protein-ligand binding. Three HuggingFace Spaces let users test the model in a browser without local setup.

Resource	Count
Primary model variants	4
Derived fine-tuned models	7
Active HuggingFace Spaces	3
DeepChem org followers	91

Source: HuggingFace Model Hub, December 2025

FAQs

What is ChemBERTa used for?

ChemBERTa predicts molecular properties from SMILES strings. Researchers use it for toxicity screening, pharmacokinetics prediction, drug-drug interaction classification, and ADMET property forecasting in early-stage drug discovery pipelines.

How many compounds did ChemBERTa-3 train on?

ChemBERTa-3, released in July 2025, pretrained on 1.4 billion compounds from the ZINC20 dataset. Training used both AWS Ray spot instances and on-premise high-performance computing clusters to handle the scale.

Is ChemBERTa free to use?

Yes. ChemBERTa is open-source and available through DeepChem and HuggingFace. Pre-trained weights, training configs, and deployment workflows are released under permissive licenses for academic and commercial fine-tuning.

How does ChemBERTa compare to graph neural networks?

ChemBERTa-2 beat the D-MPNN baseline on 6 of 8 MoleculeNet tasks. On Tox21, however, D-MPNN still leads on PRC-AUC (0.429 vs 0.207), showing graph models retain advantages on class-imbalanced datasets.

Which ChemBERTa variant should I use?

ChemBERTa-77M-MLM is the most downloaded and best-supported variant, which makes it a reasonable default for most tasks. The MTR variant performs better on regression tasks but trains more slowly because of its 200-element label vector.

Citations

https://arxiv.org/abs/2209.01712
https://chemrxiv.org/engage/chemrxiv/article-details/6871305143bc52e4ec7bf0db
https://pubs.rsc.org/en/content/articlehtml/2026/dd/d5dd00348b
https://www.rootsanalysis.com/reports/ai-based-drug-discovery-market.html
