MolBERT packs 85 million parameters into a model pre-trained on just 1.6 million SMILES molecular strings, yet matches or beats competitors trained on 77 million to 1.1 billion compounds. Developed by BenevolentAI and first presented at the NeurIPS 2020 Machine Learning for Molecules Workshop, MolBERT applies the BERT architecture to chemical language. This page covers MolBERT statistics for 2026, including model specifications, benchmark comparisons, derivative models, and the AI adoption trends driving molecular transformer research.
MolBERT Statistics 2026 — TL;DR
MolBERT uses 12 transformer layers, 12 attention heads, and 768-dimensional hidden representations to encode molecular structures, according to a 2023 Nature Communications study.
The model pre-trains on 1.6 million SMILES strings from the GuacaMol benchmark (derived from ChEMBL), predicting 200 physicochemical properties during training.
MolBERT achieves prediction accuracy comparable to ChemBERTa-2, which is pre-trained on up to 77 million compounds, roughly 48× more training data.
VitroBERT, a 2025 derivative built on MolBERT’s architecture, posted a 29% improvement in biochemistry-related prediction tasks and a 9% overall mean gain, based on research published in the Journal of Cheminformatics.
The AI drug discovery market reached an estimated $3.1 billion in 2025 and is projected at $4 billion in 2026, per Global Market Insights.
How Many Parameters Does MolBERT Have?
MolBERT contains approximately 85 million parameters. It follows the BERT-Base configuration: 12 encoder layers, 12 attention heads, and a hidden size of 768 dimensions with a feedforward size of 3,072.
The model operates on a fixed vocabulary of 42 tokens built specifically for SMILES notation. Maximum input sequence length is 128 characters, with relative positional encoding to handle longer SMILES strings at inference time. These architecture details come from the original BenevolentAI paper and were confirmed in a 2023 Nature Communications analysis of molecular property prediction methods.
| Specification | Value |
|---|---|
| Total Parameters | ~85 million |
| Transformer Layers | 12 |
| Attention Heads | 12 |
| Hidden Dimension | 768 |
| Feedforward Size | 3,072 |
| Vocabulary Size | 42 tokens |
| Max Sequence Length | 128 characters |
| Training Compounds | ~1.6 million SMILES |
Source: Nature Communications (2023); BenevolentAI / NeurIPS 2020
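The ~85 million figure can be sanity-checked from the specifications in the table. The sketch below assumes the standard BERT encoder layout (four attention projections with biases, a two-layer feedforward block, and two layer norms per layer) and counts only the encoder stack plus the token embedding table. Positional parameters and the pre-training task heads are left out because their exact shapes are not given here, so treat the result as an approximation rather than the official count.

```python
def encoder_layer_params(hidden: int, ffn: int) -> int:
    """Parameter count for one standard BERT-style encoder layer."""
    # Query, key, value and output projections, each with a bias vector.
    attention = 4 * (hidden * hidden + hidden)
    # Two dense layers: hidden -> ffn and ffn -> hidden, each with a bias.
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    # Two layer norms per layer, each with a scale and a shift vector.
    layer_norms = 2 * (2 * hidden)
    return attention + feed_forward + layer_norms


# Values taken from the specification table.
LAYERS, HIDDEN, FFN, VOCAB = 12, 768, 3072, 42

per_layer = encoder_layer_params(HIDDEN, FFN)
encoder_total = LAYERS * per_layer
token_embeddings = VOCAB * HIDDEN  # tiny, because the vocabulary has 42 tokens
approx_total = encoder_total + token_embeddings

print(f"per layer:        {per_layer:,}")
print(f"encoder stack:    {encoder_total:,}")
print(f"token embeddings: {token_embeddings:,}")
print(f"approx. total:    {approx_total / 1e6:.1f} million")
```

Almost all of the parameters sit in the 12 encoder layers. With a 42-token vocabulary the embedding table is negligible, unlike text BERT models, where a vocabulary of tens of thousands of tokens adds over 20 million parameters.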
MolBERT Pre-Training Tasks
MolBERT uses three self-supervised objectives during pre-training. Masked Language Modeling (MLM) works the same as in standard BERT — random tokens are masked and the model predicts them from context. PhysChemPred predicts 200 physicochemical properties calculated via RDKit, including molecular weight, hydrogen bond donors, and aromatic ring counts. SMILES-Eq teaches the model to recognize whether two different SMILES strings encode the same molecule.
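The MLM objective described above can be illustrated with a short sketch. The regex tokenizer here is a hypothetical stand-in, not MolBERT's actual 42-token vocabulary, and the masking is simplified: every selected position becomes `[MASK]`, whereas standard BERT also swaps some selected tokens for random ones or leaves them unchanged. The 15% masking rate is the usual BERT default and is an assumption here.

```python
import random
import re

# Hypothetical SMILES tokenizer: bracket atoms, two-letter halogens,
# single-letter atoms, bonds, branches and ring-closure digits.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|[BCNOSPFI]|[bcnosp]|\(|\)|\.|=|#|-|\+|\\|/|:|~|@|\*|%[0-9]{2}|[0-9])"
)


def tokenize(smiles: str) -> list:
    """Split a SMILES string into tokens, rejecting anything unrecognized."""
    tokens = SMILES_TOKEN.findall(smiles)
    if "".join(tokens) != smiles:
        raise ValueError(f"untokenizable SMILES: {smiles!r}")
    return tokens


def mask_tokens(tokens, mask_rate=0.15, seed=0):
    """Replace a random subset of tokens with [MASK].

    Returns the masked sequence and a dict mapping each masked
    position to the original token the model would have to predict.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_rate))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    labels = {}
    for pos in positions:
        labels[pos] = tokens[pos]
        masked[pos] = "[MASK]"
    return masked, labels


aspirin_tokens = tokenize("CC(=O)Oc1ccccc1C(=O)O")  # aspirin
masked_tokens, mlm_labels = mask_tokens(aspirin_tokens)
print(masked_tokens)
print(mlm_labels)
```

During pre-training the model sees only the masked sequence and is scored on how well it recovers the tokens stored in the labels.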
This triple-objective approach is what separates MolBERT from models like MolFormer, which relies on MLM alone. A March 2025 arXiv preprint found that models incorporating physicochemical objectives consistently outperform structure-only approaches on ADME prediction datasets. The researchers noted that MolBERT and similar chemistry-informed models beat MolFormer on permeability prediction by a statistically significant margin (p-value = 0.048).
MolBERT Statistics: Training Data Efficiency Compared
MolBERT trains on 1.6 million compounds from the GuacaMol benchmark dataset. ChemBERTa-2 uses up to 77 million compounds. MolFormer uses 100 million molecules from ZINC and PubChem. Despite the gap in training set size, MolBERT matches or exceeds both on most molecular property prediction benchmarks.
A 2025 arXiv study reported that a domain-adapted transformer pre-trained on just 400,000 molecules performed comparably to MolBERT across seven ADME datasets. MolFormer, trained on 100 million molecules, was outperformed on six of those seven tasks. The takeaway: how you train matters more than how much data you train on.
| Model | Training Compounds | Parameters | Pre-Training Objectives |
|---|---|---|---|
| MolBERT | 1.6 million | ~85 million | MLM + PhysChem + SMILES-Eq |
| ChemBERTa-2 | 77 million | ~3.4 million | MLM or MTR |
| MolFormer | 100 million | ~45.5 million | MLM only |
| Chemformer | 100 million | <20 million | MLM (BART-based) |
| DeLiCaTe | 1.6 million | ~8.5 million | Distilled from MolBERT |
Source: arXiv (2025); Nature Communications (2023); Journal of Cheminformatics
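The headline ratios quoted on this page follow directly from the table above. The snippet below is simple arithmetic on the table's rounded figures, so the outputs are approximate. The dictionary names are illustrative only.

```python
# Training-set sizes in millions of compounds, from the comparison table.
training_millions = {"MolBERT": 1.6, "ChemBERTa-2": 77.0, "MolFormer": 100.0}
# Parameter counts in millions, from the comparison table.
parameter_millions = {"MolBERT": 85.0, "DeLiCaTe": 8.5}

# How many times more training data each competitor uses than MolBERT.
chemberta_data_ratio = training_millions["ChemBERTa-2"] / training_millions["MolBERT"]
molformer_data_ratio = training_millions["MolFormer"] / training_millions["MolBERT"]

# How much smaller the distilled DeLiCaTe model is than MolBERT.
delicate_compression = parameter_millions["MolBERT"] / parameter_millions["DeLiCaTe"]

print(f"ChemBERTa-2 vs MolBERT training data: {chemberta_data_ratio:.1f}x")
print(f"MolFormer vs MolBERT training data:   {molformer_data_ratio:.1f}x")
print(f"MolBERT vs DeLiCaTe parameters:       {delicate_compression:.1f}x")
```

The first ratio is the source of the "roughly 48×" figure, and the last matches the "about 10× smaller" description of DeLiCaTe.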
MolBERT Derivative Models And Performance
Several models build directly on MolBERT’s architecture or weights. VitroBERT, published in August 2025 in the Journal of Cheminformatics by researchers at Johnson & Johnson, extends MolBERT by adding pre-training on large-scale in vitro biological assay profiles. Compared to standard MolBERT embeddings, VitroBERT delivered a 29% gain on preclinical biochemistry tasks, a 16% gain on histopathology endpoints, and a 9% mean improvement across all evaluated tasks.
DeLiCaTe, a compressed version of MolBERT, uses a combination of cross-layer parameter sharing and knowledge distillation to reduce the model to roughly 8.5 million parameters — about 10× smaller. It achieves more than 96% of MolBERT’s performance on QSAR and virtual screening tasks while running 4× faster in both training and inference.
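Knowledge distillation in general trains a small student model to reproduce the outputs of a larger teacher. The sketch below shows the generic soft-target form of the loss: a KL divergence between temperature-softened output distributions. It is an illustration of the general technique, not DeLiCaTe's actual training objective, and the temperature value and example logits are arbitrary assumptions.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL(teacher || student) on softened distributions, scaled by T^2.

    The T^2 factor keeps gradient magnitudes comparable when the
    temperature changes, as is conventional for soft-target distillation.
    """
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    kl = sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0
    )
    return kl * temperature ** 2


# Arbitrary example logits for a three-class output.
teacher = [2.0, 0.5, -1.0]
perfect_student = [2.0, 0.5, -1.0]
poor_student = [-1.0, 0.5, 2.0]

print(distillation_loss(perfect_student, teacher))
print(distillation_loss(poor_student, teacher))
```

The loss is zero when the student reproduces the teacher exactly and grows as their output distributions diverge, which is the signal the smaller model is trained against.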
MolRoPE-BERT replaced the original positional encoding with Rotary Position Embeddings to better handle variable-length molecular inputs. These derivatives reflect how generative AI adoption patterns are filtering into specialized scientific domains.
| Derivative Model | Change vs. MolBERT | Key Feature | Year Published |
|---|---|---|---|
| VitroBERT | +29% (biochem), +16% (pathology) | Biological assay pre-training | 2025 |
| DeLiCaTe | 96%+ retained, 4× faster | Knowledge distillation + CLPS | 2022 |
| MolRoPE-BERT | Improved on variable-length SMILES | Rotary position embedding | 2022 |
Source: Journal of Cheminformatics (2025); arXiv
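Rotary position embeddings, the change MolRoPE-BERT makes, encode position by rotating pairs of vector components through a position-dependent angle. The sketch below is a minimal pure-Python version of the general technique, using the commonly cited base of 10,000 as an assumption. It is not taken from the MolRoPE-BERT code.

```python
import math


def rope(vec, position, base=10000.0):
    """Apply rotary position embedding to one vector of even length."""
    dim = len(vec)
    if dim % 2:
        raise ValueError("vector length must be even")
    rotated = []
    for i in range(0, dim, 2):
        # Each component pair gets its own frequency; later pairs rotate more slowly.
        angle = position * base ** (-i / dim)
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x, y = vec[i], vec[i + 1]
        rotated.extend([x * cos_a - y * sin_a, x * sin_a + y * cos_a])
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


query = [0.3, -1.2, 0.7, 0.1]
key = [1.0, 0.4, -0.5, 0.9]

# Same relative offset (2) at two different absolute positions.
score_near_start = dot(rope(query, 5), rope(key, 3))
score_further_on = dot(rope(query, 12), rope(key, 10))
print(score_near_start, score_further_on)
```

The two scores agree up to floating-point error because the attention score between a rotated query and key depends only on the distance between their positions. That property is why the approach suits inputs whose length varies, such as SMILES strings.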
AI Drug Discovery Market Size And MolBERT’s Context
MolBERT operates in a fast-growing market. Global Market Insights estimated the AI drug discovery market at $3.1 billion in 2025, projecting it to reach $4 billion in 2026 and $43.9 billion by 2035 at a 30.5% CAGR. Fortune Business Insights placed the 2025 figure higher, at $4.46 billion (about 44% above the Global Market Insights estimate), with a 2026 estimate of $5 billion.
Over 3,000 drug candidates in the global pharmaceutical pipeline were developed or repurposed with AI assistance as of 2024, based on GlobalData’s Drugs Database. The World Economic Forum estimated that 30% of all new drug discoveries would incorporate AI by 2025. According to a study published on ScienceDirect, AI-discovered molecules have posted 80–90% success rates in Phase I clinical trials, well above the traditional average.
| Year | Estimated Market Size (USD) |
|---|---|
| 2022 | $0.9 billion |
| 2023 | $1.2 billion |
| 2024 | $1.7–1.9 billion |
| 2025 | $3.1 billion |
| 2026 (projected) | $4.0 billion |
| 2030 (projected) | $8.5–12.6 billion |
Source: Global Market Insights (2026); Fortune Business Insights; Arizton
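The Global Market Insights projection can be checked with the compound annual growth rate formula. The sketch assumes the 30.5% rate compounds once a year, starting from the 2026 figure of $4 billion and running nine years to 2035. The function names are illustrative.

```python
def project(value, rate, years):
    """Grow a starting value at a fixed annual rate for a number of years."""
    return value * (1 + rate) ** years


def implied_cagr(start, end, years):
    """Annual growth rate that turns `start` into `end` over `years` years."""
    return (end / start) ** (1 / years) - 1


YEARS = 2035 - 2026  # nine compounding periods

projected_2035 = project(4.0, 0.305, YEARS)
rate_from_endpoints = implied_cagr(4.0, 43.9, YEARS)

print(f"2035 projection at 30.5% CAGR: ${projected_2035:.1f} billion")
print(f"CAGR implied by $4.0B -> $43.9B: {rate_from_endpoints:.1%}")
```

Under that assumption the quoted start value, end value, and growth rate are mutually consistent.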
Who Uses Molecular Transformer Models Like MolBERT?
Pharmaceutical and biotechnology companies accounted for 59.19% of the AI drug discovery market’s revenue in 2025, according to Grand View Research. North America held the largest regional share at approximately 52–66%, depending on the analyst, with the United States leading in both venture funding and published research. Oncology applications made up about 22% of market revenue in 2023, followed by infectious disease segments.
The top five companies in the AI drug discovery space — Isomorphic Labs (Alphabet), Insitro, Insilico Medicine, Recursion Pharma, and Schrödinger — collectively held 11.8% market share in 2025, per Global Market Insights. In August 2024, Recursion and Exscientia merged in a deal valued at $850 million. Xaira Therapeutics raised $1 billion in April 2024. Insilico Medicine closed a $110 million round in March 2025. These deals track with AI-related job creation trends across the technology sector.
BenevolentAI, the company that developed MolBERT, operates within this ecosystem. Its open-source MolBERT repository on GitHub remains an active reference for researchers working on biomedical language models and SMILES-based molecular featurization.
MolBERT Statistics: Benchmark Performance
MolBERT has been tested across the MoleculeNet benchmark suite, which covers classification and regression tasks in physical chemistry, biophysics, and physiology. On the standard six-dataset evaluation — BACE, BBBP, HIV (classification) and ESOL, FreeSolv, Lipophilicity (regression) — MolBERT either matches or outperforms comparable models.
The 2025 arXiv study by Sultan et al. found MolBERT outperformed MolFormer on all but one dataset (rPPB) and tied with their smaller domain-adapted model on five of seven ADME benchmarks. One consistent finding across studies: models that integrate physicochemical features during pre-training — like MolBERT — outperform those relying purely on structural representations. This aligns with broader trends in how AI tools are applied in specialized professional workflows.
MolBERT Statistics: Key Industry Investments (2024–2025)
Funding in AI drug discovery has accelerated rapidly. The Recursion-Exscientia merger closed at $850 million in August 2024, creating one of the largest AI-native pharma companies. Xaira Therapeutics secured $1 billion in a single funding round in April 2024 — one of the biggest raises in the sector’s history. Insilico Medicine raised $110 million in March 2025.
In September 2025, Eli Lilly launched an AI platform giving biotech firms access to its proprietary drug discovery models. In early April 2026, Anthropic acquired Coefficient Bio for $400 million, signaling that frontier AI labs are now making direct bets on drug discovery infrastructure.
| Deal / Investment | Value | Date |
|---|---|---|
| Xaira Therapeutics (funding) | $1 billion | April 2024 |
| Recursion-Exscientia merger | $850 million | August 2024 |
| Anthropic → Coefficient Bio | $400 million | April 2026 |
| Insilico Medicine (funding) | $110 million | March 2025 |
Source: GlobalData; Bessemer Venture Partners; Company Announcements
MolBERT vs. Other Molecular Language Models
The molecular transformer field has split into two camps. One approach — represented by MolBERT — injects chemistry-specific knowledge during pre-training through physicochemical property prediction and SMILES equivalence tasks. The other approach, used by ChemBERTa and MolFormer, relies on scale and generic masked language modeling.
A 2025 review published in Digital Discovery by the Royal Society of Chemistry classified MolBERT as a sequence-based model that tokenizes SMILES strings and applies masked language modeling to learn chemical embeddings. The review noted that the graph-based transformer Graphormer (AUC: 80.51% on MolHIV) slightly outperforms GROVER-LARGE (AUC: 80.32%), which is itself a graph-based pre-trained model rather than a sequence model, on certain tasks. MolBERT’s strength lies in its data efficiency: getting competitive results from a fraction of the training data.
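The AUC figures quoted for MolHIV are areas under the ROC curve. As a reminder of what that number measures, here is a minimal sketch that computes it as the fraction of (active, inactive) pairs in which the active compound receives the higher score. The labels and scores are made-up toy values, not benchmark data.

```python
def roc_auc(labels, scores):
    """ROC-AUC via pairwise comparison of positive and negative scores."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for pos_score in positives:
        for neg_score in negatives:
            if pos_score > neg_score:
                wins += 1.0   # active ranked above inactive
            elif pos_score == neg_score:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: 1 = active compound, 0 = inactive.
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
toy_auc = roc_auc(toy_labels, toy_scores)
print(f"AUC = {toy_auc:.2f}")
```

Because AUC is a ranking measure, a gap as small as 80.51% versus 80.32% corresponds to only a slightly better ordering of actives over inactives.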
FAQ
What is MolBERT used for?
MolBERT generates molecular embeddings from SMILES strings for property prediction, virtual screening, and QSAR modeling in drug discovery workflows.
How many parameters does MolBERT have?
MolBERT has approximately 85 million parameters across 12 transformer layers, 12 attention heads, and 768-dimensional hidden representations.
How does MolBERT compare to ChemBERTa-2?
MolBERT trains on 1.6 million compounds versus ChemBERTa-2’s 77 million, achieving comparable accuracy with roughly 48× less training data.
What is VitroBERT?
VitroBERT is a 2025 derivative of MolBERT pre-trained on in vitro biological assays. It improved biochemistry task performance by 29% over standard MolBERT.
How large is the AI drug discovery market in 2026?
Global Market Insights projects the AI drug discovery market at $4 billion in 2026, up from an estimated $3.1 billion in 2025.
Sources:
https://www.nature.com/articles/s41467-023-41948-6
https://jcheminf.biomedcentral.com/articles/10.1186/s13321-025-01048-7
https://www.gminsights.com/industry-analysis/ai-in-drug-discovery-market
