CodeT5-base recorded 22,172 monthly downloads on Hugging Face as of 2026, with 761+ related models deployed across the platform. Developed by Salesforce Research and released under the Apache 2.0 license, the CodeT5 family spans eight model variants ranging from 60 million to 16 billion parameters. This article covers the latest CodeT5 statistics for 2026, including code generation accuracy on benchmarks, adoption metrics, training data composition, and how the model family compares to other open-source code LLMs.
CodeT5 Statistics in 2026 — TL;DR
CodeT5-base pulls 22,172 monthly downloads on Hugging Face, and 761+ related models are hosted on the platform as of 2026.
InstructCodeT5+ 16B scored 35.0% pass@1 on HumanEval in zero-shot settings, beating OpenAI’s code-cushman-001 at 33.5%.
CodeT5+ was trained on 51.5 billion tokens, roughly 50 times the size of the original CodeSearchNet corpus of 8.35 million instances.
The CodeT5+ 770M model matched the performance of models 8 to 80 times its size, reaching 15.5% pass@1 on HumanEval.
Salesforce archived the official GitHub repository in May 2025. Model weights remain on Hugging Face and community forks stay active.
How Many Parameters Does Each CodeT5 Model Have?
The CodeT5 family includes eight variants. Salesforce released the first three in 2021 as part of the original CodeT5 line. Five more arrived with CodeT5+ in 2023, introducing flexible encoder-decoder architectures and shallow-encoder/deep-decoder configurations for the larger models.
| Model Variant | Parameters | Architecture | Year |
|---|---|---|---|
| CodeT5-small | 60M | Encoder-Decoder | 2021 |
| CodeT5-base | 220M | Encoder-Decoder | 2021 |
| CodeT5-large | 770M | Encoder-Decoder | 2021 |
| CodeT5+ 220M | 220M | Flexible Encoder-Decoder | 2023 |
| CodeT5+ 770M | 770M | Flexible Encoder-Decoder | 2023 |
| CodeT5+ 2B | 2B | Shallow Encoder, Deep Decoder | 2023 |
| CodeT5+ 6B | 6B | Shallow Encoder, Deep Decoder | 2023 |
| CodeT5+ 16B | 16B | Shallow Encoder, Deep Decoder | 2023 |
Source: Salesforce Research / arXiv
CodeT5+ can operate in three modes — encoder-only, decoder-only, or full encoder-decoder — allowing teams to pick the right setup for tasks like code completion, embedding, or translation without maintaining separate model instances.
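The mode choice can be pictured as a simple lookup. The sketch below is illustrative only: the `Mode` enum, the `TASK_TO_MODE` table, and `pick_mode` are hypothetical names, and the pairing of each task with a mode is an assumption based on how such architectures are typically used, not a mapping stated in the source.

```python
from enum import Enum


class Mode(Enum):
    """The three operating modes named in the article."""
    ENCODER_ONLY = "encoder-only"
    DECODER_ONLY = "decoder-only"
    ENCODER_DECODER = "encoder-decoder"


# Assumed task-to-mode pairing (hypothetical, for illustration only).
TASK_TO_MODE = {
    "embedding": Mode.ENCODER_ONLY,
    "code completion": Mode.DECODER_ONLY,
    "translation": Mode.ENCODER_DECODER,
}


def pick_mode(task: str) -> Mode:
    """Return the assumed mode for a task name, ignoring case and padding."""
    key = task.strip().lower()
    if key not in TASK_TO_MODE:
        raise ValueError(f"no mode mapping defined for task: {task!r}")
    return TASK_TO_MODE[key]


print(pick_mode("Code Completion").value)
```

In practice the mode would be set by how the checkpoint is loaded and called, which this sketch does not cover.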
CodeT5 Code Generation Accuracy on HumanEval
HumanEval is a benchmark of 164 hand-authored Python programming problems. Each problem requires generating a function body that passes hidden unit tests, and pass@k reports the share of problems solved when k samples are drawn per problem. InstructCodeT5+ 16B hit 35.0% pass@1 and 54.5% pass@10 in zero-shot evaluation, according to the original CodeT5+ paper. When paired with CodeT test generation, accuracy climbed to 42.9% pass@1 and 67.8% pass@10.
| Model | Pass@1 | Pass@10 | Setting |
|---|---|---|---|
| InstructCodeT5+ 16B | 35.0% | 54.5% | Zero-shot |
| InstructCodeT5+ 16B + CodeT | 42.9% | 67.8% | Zero-shot + Test Gen |
| CodeT5+ 770M-py | 15.5% | — | Zero-shot |
| OpenAI code-cushman-001 | 33.5% | — | Zero-shot |
| GPT-NeoX 20B | 15.4% | — | Zero-shot |
| PaLM 62B | 15.9% | — | Zero-shot |
Source: Wang et al., arXiv 2305.07922
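For readers unfamiliar with the pass@k columns above, here is a minimal sketch of the unbiased estimator commonly used with HumanEval: generate n samples per problem, count the c that pass, and compute 1 - C(n-c, k) / C(n, k). The source does not describe the exact evaluation script, so treat this as the standard formulation rather than the authors' code. The function names are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k failing samples: every draw of k contains a pass.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average the per-problem estimates; results holds (n, c) pairs."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Two toy problems: 3 of 10 samples pass on the first, none on the second.
print(benchmark_pass_at_k([(10, 3), (10, 0)], k=1))
```

With k=1 the estimator reduces to the plain fraction of passing samples, which is why pass@1 is often read as single-attempt accuracy.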
The 770M variant’s 15.5% pass@1 came within 0.4 percentage points of PaLM 62B (15.9%) and edged past GPT-NeoX 20B (15.4%), despite having roughly 80 and 26 times fewer parameters, respectively. That mismatch between parameter count and pass@1 is one of the more notable findings from the CodeT5+ evaluation.
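The size comparison is simple division, and a few lines make it easy to check. This sketch uses only the parameter counts and pass@1 figures from the table above. The dictionary and function names are illustrative.

```python
# Parameter counts in billions and pass@1 scores, copied from the table above.
PARAMS_B = {"CodeT5+ 770M": 0.77, "GPT-NeoX 20B": 20.0, "PaLM 62B": 62.0}
PASS_AT_1 = {"CodeT5+ 770M": 15.5, "GPT-NeoX 20B": 15.4, "PaLM 62B": 15.9}


def size_ratio(larger: str, smaller: str) -> float:
    """How many times more parameters `larger` has than `smaller`."""
    return PARAMS_B[larger] / PARAMS_B[smaller]


for name in ("GPT-NeoX 20B", "PaLM 62B"):
    ratio = size_ratio(name, "CodeT5+ 770M")
    gap = PASS_AT_1[name] - PASS_AT_1["CodeT5+ 770M"]
    print(f"{name}: {ratio:.1f}x the parameters, {gap:+.1f} points pass@1")
```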
CodeT5 Training Data and Programming Language Support
CodeT5+ was trained on 51.5 billion tokens from permissively licensed GitHub repositories. That’s roughly 50 times the size of the original CodeSearchNet corpus, which contained 8.35 million function-level instances (a count of instances, not tokens). The training data only includes code under MIT, Apache-2.0, BSD-3-Clause, BSD-2-Clause, CC0-1.0, Unlicense, and ISC licenses, a choice intended to reduce licensing risk for commercial deployment.
CodeT5+ supports nine programming languages. The original CodeT5 covered eight, and CodeT5+ added C++ to the roster.
| Language | CodeT5 | CodeT5+ |
|---|---|---|
| Python | Yes | Yes |
| Java | Yes | Yes |
| JavaScript | Yes | Yes |
| Go | Yes | Yes |
| Ruby | Yes | Yes |
| PHP | Yes | Yes |
| C | Yes | Yes |
| C# | Yes | Yes |
| C++ | No | Yes |
Source: Hugging Face / Salesforce Research
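The same support matrix can be kept as two sets, which makes the difference between the two generations a one-line check. The variable names here are illustrative.

```python
# Language coverage as listed in the table above.
CODET5_LANGUAGES = frozenset({
    "Python", "Java", "JavaScript", "Go", "Ruby", "PHP", "C", "C#",
})
CODET5_PLUS_LANGUAGES = CODET5_LANGUAGES | {"C++"}

# Languages supported by CodeT5+ but not by the original CodeT5.
added = sorted(CODET5_PLUS_LANGUAGES - CODET5_LANGUAGES)
print(len(CODET5_LANGUAGES), len(CODET5_PLUS_LANGUAGES), added)
```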
CodeT5 Hugging Face Adoption Metrics in 2026
CodeT5-base generated 22,172 monthly downloads on Hugging Face as of 2026. The model accumulated 132 community likes and powers 36 dependent spaces on the platform. Developers built 86 finetuned derivative models and 17 adapter models from the base checkpoint. Across all variants and finetuned offshoots, the CodeT5 family totals 761+ models on Hugging Face Hub.
| Metric | Value |
|---|---|
| Monthly Downloads (CodeT5-base) | 22,172 |
| Community Likes | 132 |
| Dependent Spaces | 36 |
| Finetuned Derivative Models | 86 |
| Adapter Models | 17 |
| Total Models on Hub | 761+ |
Source: Hugging Face
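Figures like these change over time, so it can help to pull them programmatically. The sketch below reads the public Hugging Face model endpoint with the standard library. The URL pattern and the `downloads` and `likes` field names reflect the API as I understand it and should be treated as assumptions. The function names are illustrative, and counts such as dependent spaces or finetuned derivatives are not covered here.

```python
import json
from urllib.request import urlopen

# Assumed public endpoint pattern for model metadata.
API_URL = "https://huggingface.co/api/models/{repo_id}"


def extract_stats(payload: dict) -> dict:
    """Pull the adoption fields of interest out of a model metadata payload."""
    return {
        "repo_id": payload.get("id") or payload.get("modelId"),
        "downloads": payload.get("downloads"),
        "likes": payload.get("likes"),
    }


def fetch_model_stats(repo_id: str, timeout: float = 10.0) -> dict:
    """Fetch metadata for one model over the network and summarize it."""
    with urlopen(API_URL.format(repo_id=repo_id), timeout=timeout) as response:
        return extract_stats(json.load(response))


# Offline demonstration using the figures quoted in this article.
print(extract_stats({"id": "Salesforce/codet5-base", "downloads": 22172, "likes": 132}))
```

Calling `fetch_model_stats("Salesforce/codet5-base")` would return live values, which will likely differ from the snapshot in the table.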
The GitHub repository at salesforce/CodeT5 has 3,100+ stars and 486+ forks. Salesforce archived it in May 2025, but model weights stay available through Hugging Face and community-maintained forks continue development under the Apache 2.0 license.
CodeT5 Downstream Task Performance
CodeT5+ was evaluated on over 20 code-related benchmarks across zero-shot, finetuning, and instruction-tuning settings. The strongest gains showed up in retrieval-augmented code generation, where scores jumped by an average of +5.8 BLEU-4 points over prior baselines. Text-to-code retrieval improved by +3.2 average MRR across eight tasks. Line-level code completion added +2.1 average exact match across two benchmarks.
| Task Category | Benchmarks | Improvement |
|---|---|---|
| Text-to-Code Retrieval | 8 tasks | +3.2 avg. MRR |
| Line-Level Code Completion | 2 tasks | +2.1 avg. Exact Match |
| Retrieval-Augmented Code Gen | 2 tasks | +5.8 avg. BLEU-4 |
Source: Wang et al., arXiv 2305.07922
On mathematical programming, CodeT5+ 770M reached 87.4% pass@80 on MathQA-Python and 73.8% pass@100 on GSM8K-Python after finetuning — results that outperformed models with up to 137 billion parameters.
How Does CodeT5 Compare to Other Open Code LLMs?
When CodeT5+ 16B launched, its 35.0% pass@1 on HumanEval was the top score among open-source code models. Since then, newer models like StarCoder2 (15B parameters, trained on more than 4 trillion tokens) and Code Llama (up to 70B parameters) have arrived. StarCoder2-15B matched or outperformed Code Llama 34B, a model more than twice its size, on several code completion tasks, according to the BigCode project’s evaluation. CodeT5 retains an edge in encoder-decoder tasks like code summarization and code search, where separate research found it outperformed PLBART across all generation benchmarks.
The broader AI code assistant market reached an estimated $8.5 billion by 2026, with 84% of developers reporting they use or plan to use AI tools. Within that space, CodeT5 serves as a popular base model for teams building custom code intelligence pipelines rather than competing directly with end-user products like GitHub Copilot or Cursor.
CodeT5 Licensing and Open-Source Status
All CodeT5 and CodeT5+ models are released under the Apache 2.0 license. Organizations can modify, distribute, and deploy without paying fees or requesting additional permissions. The training code uses a BSD-3-Clause license. Salesforce archived the official GitHub repo in May 2025, but the archive status doesn’t affect the license terms or access to model weights on Hugging Face.
This open licensing approach has contributed to CodeT5’s adoption in both academic and enterprise settings. Researchers at multiple universities have finetuned CodeT5 for tasks including automated program repair, code review automation, and vulnerability detection.
CodeT5 Environmental Footprint
Training CodeT5-base produced 49.25 kg of CO2 emissions on Google Cloud Platform, according to Salesforce’s own reporting. Google Cloud offset those emissions through its carbon credit programs. By releasing pretrained weights openly, the model eliminates the need for other teams to repeat the pretraining step from scratch, reducing cumulative compute and energy costs across the community.
FAQ
How many monthly downloads does CodeT5-base get on Hugging Face?
CodeT5-base recorded 22,172 monthly downloads on Hugging Face as of 2026, with 761+ total models in the CodeT5 family deployed on the platform.
What accuracy did CodeT5+ achieve on HumanEval?
InstructCodeT5+ 16B scored 35.0% pass@1 on HumanEval in zero-shot settings. With CodeT test generation, that number rose to 42.9% pass@1.
How many programming languages does CodeT5 support?
CodeT5+ supports nine languages: Python, Java, JavaScript, Go, Ruby, PHP, C, C#, and C++. The original CodeT5 covered eight (no C++).
Is CodeT5 free for commercial use?
Yes. All CodeT5 models use the Apache 2.0 license, which allows commercial deployment, modification, and distribution without fees, provided the license’s attribution and notice conditions are met.
Is the CodeT5 GitHub repository still active?
Salesforce archived the official repository in May 2025. Model weights remain on Hugging Face, and community forks continue active development.
Sources:
https://arxiv.org/pdf/2305.07922
https://huggingface.co/Salesforce/codet5-base
https://github.com/salesforce/CodeT5
https://www.salesforce.com/blog/codet5-open-code-large-language-models/
