ChemBERTa-77M-MLM recorded 49,475 monthly downloads on HuggingFace as of December 2025, ranking it among the most downloaded transformer models for chemistry. ChemBERTa-3, released in July 2025, scaled pretraining to 1.4 billion compounds from the ZINC20 dataset. This article covers download volumes, benchmark scores, model variants, and the broader AI drug discovery market driving adoption.
Key ChemBERTa Statistics 2026
- ChemBERTa-77M-MLM recorded 49,475 monthly HuggingFace downloads in December 2025.
- ChemBERTa-3, released in July 2025, was pretrained on 1.4 billion compounds from ZINC20.
- ChemBERTa-2 outperformed the D-MPNN baseline on 6 of 8 MoleculeNet tasks.
- The 77M-MLM variant pulled over 10 times more downloads than the 10M-MTR variant.
- The AI drug discovery market is projected to reach $8.6 billion in 2026.
How Many People Download ChemBERTa?
ChemBERTa-77M-MLM is the most downloaded variant in the family, with 49,475 monthly downloads on HuggingFace as of December 2025. The MTR variants trail far behind: the 77M-MLM model pulls roughly 4 times the volume of 77M-MTR and more than 10 times the volume of 10M-MTR.
The DeepChem organization, which maintains ChemBERTa, has 91 followers on HuggingFace. Seven derived fine-tuned models and three active HuggingFace Spaces use ChemBERTa as their foundation.
| Variant | Pretraining Data | Monthly Downloads |
|---|---|---|
| ChemBERTa-77M-MLM | 77M PubChem | 49,475 |
| ChemBERTa-77M-MTR | 77M PubChem | ~12,000 |
| ChemBERTa-10M-MLM | 10M PubChem | ~8,500 |
| ChemBERTa-10M-MTR | 10M PubChem | ~4,800 |
Source: HuggingFace Model Hub, December 2025
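The ratios quoted above follow directly from the table. A minimal sketch of that arithmetic, using only the figures listed here (the function names are illustrative, and the "~" values are approximate, so the ratios are too):

```python
# Monthly download figures copied from the table above (December 2025).
# Entries marked "~" in the table are approximations.
monthly_downloads = {
    "ChemBERTa-77M-MLM": 49_475,
    "ChemBERTa-77M-MTR": 12_000,
    "ChemBERTa-10M-MLM": 8_500,
    "ChemBERTa-10M-MTR": 4_800,
}


def download_ratio(numerator: str, denominator: str) -> float:
    """Return how many times more downloads `numerator` has than `denominator`."""
    return monthly_downloads[numerator] / monthly_downloads[denominator]


def download_share(variant: str) -> float:
    """Return the variant's share of downloads across the four listed variants."""
    return monthly_downloads[variant] / sum(monthly_downloads.values())


if __name__ == "__main__":
    ratio = download_ratio("ChemBERTa-77M-MLM", "ChemBERTa-10M-MTR")
    share = download_share("ChemBERTa-77M-MLM")
    print(f"77M-MLM vs 10M-MTR: {ratio:.1f}x")
    print(f"77M-MLM share of listed downloads: {share:.0%}")
```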
ChemBERTa Statistics 2026 Across Model Generations
The ChemBERTa family scaled rapidly across three generations. The original release used 10 million SMILES from PubChem. ChemBERTa-2 expanded to 77 million compounds. ChemBERTa-3, released in July 2025, jumped to 1.4 billion compounds from ZINC20, with infrastructure designed to handle billion-molecule datasets on Ray clusters running on AWS and on on-premise HPC clusters.
| Version | Release | Dataset | Compounds |
|---|---|---|---|
| ChemBERTa | 2020 | PubChem | 10 million |
| ChemBERTa-2 | 2022 | PubChem | 77 million |
| ChemBERTa-3 | July 2025 | ZINC20 | 1.4 billion |
Source: arXiv, ChemRxiv (DOI 10.26434/chemrxiv-2025-4glrl-v2)
How Does ChemBERTa Perform on MoleculeNet?
ChemBERTa models are evaluated on the MoleculeNet benchmark suite. ChemBERTa-2 beat the D-MPNN (chemprop) baseline on 6 out of 8 MoleculeNet tasks. Pretraining loss improved by 25 to 35% when the dataset scaled from 5 million to 77 million compounds.
The MLM-100M variant outperformed the much larger MoLFormer 1.1B model on blood-brain barrier penetration and clinical toxicity classification. Scaling from 100K to 10M compounds delivered an average ROC-AUC gain of +0.110 across BBBP, ClinTox, and Tox21.
| Benchmark Task | ChemBERTa ROC-AUC | D-MPNN ROC-AUC |
|---|---|---|
| BBBP | 0.733 | 0.708 |
| ClinTox | 0.952 | 0.905 |
| Tox21 (SR-p53) | 0.728 | 0.688 |
| HIV (MLM) | 0.793 | 0.771 |
| BACE | 0.799 | 0.812 |
Source: ChemBERTa-2 paper (arXiv:2209.01712)
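For readers less familiar with the metric in this table, ROC-AUC can be read as the probability that a randomly chosen positive compound receives a higher score than a randomly chosen negative one. The sketch below computes it that way in plain Python. It is an illustration of the metric only, not the evaluation code used in the cited paper, and the function name is an assumption.

```python
from typing import Sequence


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Ties between a positive and a negative score count as half a win.
    The pairwise loop is O(P * N), which is fine for small illustrations.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("ROC-AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Toy labels and scores, not taken from any benchmark.
    print(roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

A score of 0.5 corresponds to random ranking and 1.0 to a perfect ranking, which is the scale on which the table's values should be read.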
ChemBERTa Architecture Specifications
ChemBERTa builds on the RoBERTa transformer, adapted for SMILES strings. The architecture uses 12 attention heads in each of its 6 transformer layers, for 72 attention heads in total. The model accepts a maximum sequence length of 512 tokens.
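The head count is simple multiplication, and the sequence limit is a hard cap on tokenized input. A small sketch of both, assuming the 12 heads apply to each of the 6 layers (which is what the total of 72 implies); `EncoderSpec` and `fits_in_context` are hypothetical names, not part of any library:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncoderSpec:
    """Architecture figures quoted in the text above."""

    num_layers: int = 6
    heads_per_layer: int = 12
    max_sequence_length: int = 512

    @property
    def total_attention_heads(self) -> int:
        # One set of heads per layer, so the total is layers * heads.
        return self.num_layers * self.heads_per_layer


def fits_in_context(num_tokens: int, spec: EncoderSpec = EncoderSpec()) -> bool:
    """Return True if a tokenized SMILES string fits without truncation.

    `num_tokens` is assumed to already include any special tokens.
    """
    return num_tokens <= spec.max_sequence_length


if __name__ == "__main__":
    spec = EncoderSpec()
    print(spec.total_attention_heads)
    print(fits_in_context(600))
```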
Pretraining Objectives Compared
ChemBERTa supports two pretraining tasks: Masked Language Modeling (MLM) and Multi-Task Regression (MTR). MLM masks 15% of SMILES tokens and trains the model to predict them. MTR predicts 200 RDKit-calculated physicochemical properties.
| Attribute | MLM | MTR |
|---|---|---|
| Training Speed | Faster | Slower |
| Output Targets | Masked tokens | 200 RDKit properties |
| Best HIV Score (AUROC) | 0.793 | 0.733 |
| Monthly Downloads (77M) | 49,475 | ~12,000 |
Source: ChemBERTa-2 paper, HuggingFace Model Hub
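The MLM objective can be sketched in plain Python. This is a simplified illustration, not ChemBERTa's actual tokenizer or training pipeline: the regex tokenizer and the function names are assumptions, and the sketch always replaces a fixed number of positions with the mask token, whereas BERT-style training typically samples positions at random and sometimes substitutes a random or unchanged token instead.

```python
import random
import re
from typing import Dict, List, Optional, Tuple

# Illustrative SMILES tokenizer: bracket atoms, two-letter halogens,
# two-digit ring closures, then any single character.
SMILES_TOKEN = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")
MASK = "<mask>"


def tokenize(smiles: str) -> List[str]:
    """Split a SMILES string into tokens using the illustrative regex."""
    return SMILES_TOKEN.findall(smiles)


def mask_tokens(
    tokens: List[str], mask_rate: float = 0.15, seed: Optional[int] = None
) -> Tuple[List[str], Dict[int, str]]:
    """Mask roughly `mask_rate` of the tokens (at least one, if any exist).

    Returns the masked sequence and a {position: original token} mapping,
    which is what the model would be trained to predict.
    """
    if not tokens:
        return [], {}
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = rng.sample(range(len(tokens)), n_mask)

    masked = list(tokens)
    targets: Dict[int, str] = {}
    for i in positions:
        targets[i] = tokens[i]
        masked[i] = MASK
    return masked, targets


if __name__ == "__main__":
    tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
    masked, targets = mask_tokens(tokens, seed=0)
    print("".join(masked))
    print(targets)
```

The MTR objective differs only in the target: instead of the masked tokens, the model regresses a vector of 200 RDKit-computed properties for each molecule.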
How ChemBERTa Compares to Other Chemical Foundation Models
ChemBERTa-2 was trained on 77 million compounds. MolBERT achieves comparable performance with just 1.6 million compounds through chemistry-aware tokenization. MoLFormer scaled to 1.1 billion molecules but did not produce proportional gains over smaller models.
| Model | Parameters | Training Data | Architecture |
|---|---|---|---|
| ChemBERTa-2 | ~83M | 77M (PubChem) | RoBERTa |
| ChemBERTa-3 | Variable | 1.4B (ZINC20) | RoBERTa / MoLFormer |
| MolBERT | 85M | 1.6M | BERT |
| MoLFormer | 1.1B | 1.1B | Linear attention |
| D-MPNN | ~1M | Task-specific | Graph NN |
Source: Digital Discovery (RSC), DOI 10.1039/D5DD00348B
ChemBERTa Statistics 2026 in the AI Drug Discovery Market
The AI drug discovery market reached $6.0 billion in 2025 and is projected to reach $8.6 billion in 2026, then grow at a 12.6% CAGR from 2026 through 2035. North America accounts for more than 50% of the market. Lead optimization holds the largest share at roughly 50% of total spend.
| Year | Market Size (USD) |
|---|---|
| 2024 | $5.4 billion |
| 2025 | $6.0 billion |
| 2026 | $8.6 billion |
| 2030 | $13.5 billion |
| 2035 | $25.0 billion |
Source: Roots Analysis, AI in Drug Discovery Market Report 2026-2035
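The quoted growth rate can be checked against the table with the standard compound annual growth rate formula. A minimal sketch using the 2026 and 2035 figures above (the function names are illustrative):

```python
def cagr(start_value: float, end_value: float, years: int) -> float:
    """Compound annual growth rate between two values `years` apart."""
    if start_value <= 0 or years <= 0:
        raise ValueError("start_value and years must be positive")
    return (end_value / start_value) ** (1 / years) - 1


def project(start_value: float, rate: float, years: int) -> float:
    """Grow `start_value` at a constant annual `rate` for `years` years."""
    return start_value * (1 + rate) ** years


if __name__ == "__main__":
    # 2026 -> 2035 spans nine years of growth (figures in USD billions).
    print(f"Implied CAGR 2026-2035: {cagr(8.6, 25.0, 9):.1%}")
    print(f"2035 at 12.6% from 2026: ${project(8.6, 0.126, 9):.1f}B")
```

The 12.6% figure matches the 2026 to 2035 span. The step from $6.0 billion in 2025 to $8.6 billion in 2026 is a much larger single-year increase and is not described by that rate.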
What Are ChemBERTa’s Main Use Cases?
Pharmaceutical and academic teams use ChemBERTa for molecular property prediction, toxicity screening, pharmacokinetic prediction, and drug-drug interaction classification. The model also supports fine-tuning for ADMET prediction and binding affinity tasks. A January 2026 paper published on ScienceDirect reported that a hybrid architecture combining ChemBERTa with a graph neural network outperformed standalone networks on solubility, toxicity, and binding affinity.
Where ChemBERTa Falls Short
ChemBERTa-10M scored 0.207 PRC-AUC on Tox21, well below D-MPNN at 0.429. The gap suggests that current transformer iterations lack the inductive biases of graph-based models and struggle with severe class imbalance in chemical datasets. Training large foundation models is also expensive, with billion-compound runs requiring AWS spot instances or HPC clusters.
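PRC-AUC is far more sensitive to class imbalance than ROC-AUC, which is why the Tox21 gap looks so large. The sketch below computes average precision, one common step-wise estimate of the area under the precision-recall curve. It illustrates the metric only and is not the evaluation code from the cited papers; the function name and the toy data are assumptions.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive example.

    Examples are ranked by descending score. Tied scores are not grouped;
    they keep their input order.
    """
    total_positives = sum(1 for y in labels if y == 1)
    if total_positives == 0:
        raise ValueError("average precision needs at least one positive")

    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / total_positives


if __name__ == "__main__":
    # Toy imbalanced case: 1 positive among 100 compounds, ranked 5th.
    scores = [1.0 - i / 100 for i in range(100)]
    labels = [0] * 100
    labels[4] = 1
    print(average_precision(labels, scores))
```

In the toy case the single positive outranks 95 of the 99 negatives, yet precision at its rank is only one in five. With rare positives, a handful of higher-scored negatives is enough to pull the metric down sharply.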
Adoption Across Research Communities
ChemBERTa is integrated into the DeepChem library, so it is available after running pip install deepchem. The model appears in over 200 academic papers indexed on Google Scholar. Researchers running statistical analysis on molecular datasets often pair ChemBERTa embeddings with downstream classifiers like random forests or logistic regression.
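The embedding-plus-classifier pattern might look like the sketch below. It assumes the HuggingFace `transformers`, `torch`, and `scikit-learn` packages are installed and that the 77M-MLM checkpoint is published under the repo id `DeepChem/ChemBERTa-77M-MLM`. Mean pooling over tokens is one common choice, not something the source prescribes, and the function names are illustrative.

```python
from typing import Sequence

# Assumed HuggingFace repo id for the 77M-MLM variant.
MODEL_ID = "DeepChem/ChemBERTa-77M-MLM"


def embed_smiles(smiles: Sequence[str], model_id: str = MODEL_ID):
    """Return one mean-pooled embedding per SMILES string as a NumPy array."""
    # Imported lazily so this module loads without the heavy dependencies.
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModel.from_pretrained(model_id)
    model.eval()

    batch = tokenizer(
        list(smiles),
        padding=True,
        truncation=True,
        max_length=512,  # sequence limit quoted in the architecture section
        return_tensors="pt",
    )
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state  # (batch, tokens, hidden)

    # Average over real tokens only, ignoring padding positions.
    mask = batch["attention_mask"].unsqueeze(-1).to(hidden.dtype)
    pooled = (hidden * mask).sum(dim=1) / mask.sum(dim=1)
    return pooled.numpy()


def fit_classifier(smiles: Sequence[str], labels: Sequence[int]):
    """Fit a logistic regression on frozen ChemBERTa embeddings."""
    from sklearn.linear_model import LogisticRegression

    features = embed_smiles(smiles)
    classifier = LogisticRegression(max_iter=1000)
    classifier.fit(features, list(labels))
    return classifier
```

Calling `fit_classifier(smiles_list, label_list)` downloads the checkpoint on first use and returns a fitted scikit-learn model. Because the transformer weights stay frozen, this is much cheaper than fine-tuning, at the cost of embeddings that are not adapted to the task.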
The seven derived fine-tuned models on HuggingFace cover specialized tasks like reaction prediction, retrosynthesis, and protein-ligand binding. Three HuggingFace Spaces let users test the model in a browser without local setup.
| Resource | Count |
|---|---|
| Primary model variants | 4 |
| Derived fine-tuned models | 7 |
| Active HuggingFace Spaces | 3 |
| DeepChem org followers | 91 |
Source: HuggingFace Model Hub, December 2025
FAQs
What is ChemBERTa used for?
ChemBERTa predicts molecular properties from SMILES strings. Researchers use it for toxicity screening, pharmacokinetics prediction, drug-drug interaction classification, and ADMET property forecasting in early-stage drug discovery pipelines.
How many compounds did ChemBERTa-3 train on?
ChemBERTa-3, released in July 2025, was pretrained on 1.4 billion compounds from the ZINC20 dataset. Training ran on both Ray clusters using AWS spot instances and on-premise high-performance computing clusters to handle the scale.
Is ChemBERTa free to use?
Yes. ChemBERTa is open-source and available through DeepChem and HuggingFace. Pre-trained weights, training configs, and deployment workflows are released under permissive licenses for academic and commercial fine-tuning.
How does ChemBERTa compare to graph neural networks?
ChemBERTa-2 beat the D-MPNN baseline on 6 of 8 MoleculeNet tasks. On Tox21, however, D-MPNN still leads ChemBERTa-10M on PRC-AUC (0.429 vs 0.207), suggesting that graph models retain advantages on class-imbalanced datasets.
Which ChemBERTa variant should I use?
For most tasks, start with ChemBERTa-77M-MLM, the most downloaded and best-supported variant. The MTR variant performs better on regression tasks but trains more slowly because it predicts a 200-element label vector.
Citations
https://arxiv.org/abs/2209.01712
https://chemrxiv.org/engage/chemrxiv/article-details/6871305143bc52e4ec7bf0db
https://pubs.rsc.org/en/content/articlehtml/2026/dd/d5dd00348b
https://www.rootsanalysis.com/reports/ai-based-drug-discovery-market.html
