CodeT5 Statistics 2026: Code Generation Accuracy, Adoption and AI Capabilities

CodeT5-base recorded 22,172 monthly downloads on Hugging Face as of 2026, with 761+ related models deployed across the platform. Developed by Salesforce Research and released under the Apache 2.0 license, the CodeT5 family spans eight model variants ranging from 60 million to 16 billion parameters. This article covers the latest CodeT5 statistics for 2026, including code generation accuracy on benchmarks, adoption metrics, training data composition, and how the model family compares to other open-source code LLMs.

CodeT5 Statistics in 2026 — TL;DR

CodeT5-base pulls 22,172 monthly downloads on Hugging Face and supports 761+ models on the platform as of 2026.

InstructCodeT5+ 16B scored 35.0% pass@1 on HumanEval in zero-shot settings, beating OpenAI’s code-cushman-001 at 33.5%.

CodeT5+ was pretrained on 51.5 billion tokens, a corpus its authors describe as roughly 50x larger than the CodeSearchNet-based data behind the original CodeT5 (8.35 million function-level instances).

The CodeT5+ 770M model reached 15.5% pass@1 on HumanEval, on par with models roughly 8 to 80 times its size.

Salesforce archived the official GitHub repository in May 2025. Model weights remain on Hugging Face and community forks stay active.

How Many Parameters Does Each CodeT5 Model Have?

The CodeT5 family includes eight variants. The first three belong to the original CodeT5 line: CodeT5-small and CodeT5-base arrived in 2021, and CodeT5-large followed in 2022. Five more arrived with CodeT5+ in 2023, introducing flexible encoder-decoder architectures and shallow-encoder/deep-decoder configurations for the larger models.

Model Variant	Parameters	Architecture	Year
CodeT5-small	60M	Encoder-Decoder	2021
CodeT5-base	220M	Encoder-Decoder	2021
CodeT5-large	770M	Encoder-Decoder	2022
CodeT5+ 220M	220M	Flexible Encoder-Decoder	2023
CodeT5+ 770M	770M	Flexible Encoder-Decoder	2023
CodeT5+ 2B	2B	Shallow Encoder, Deep Decoder	2023
CodeT5+ 6B	6B	Shallow Encoder, Deep Decoder	2023
CodeT5+ 16B	16B	Shallow Encoder, Deep Decoder	2023

Source: Salesforce Research / arXiv

CodeT5+ can operate in three modes: encoder-only, decoder-only, or full encoder-decoder. Teams can pick the setup that fits tasks like code completion, embedding, or translation without maintaining separate model instances.
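As a rough illustration of the full encoder-decoder mode, the sketch below loads a small CodeT5+ checkpoint through the Hugging Face `transformers` library and fills in a masked span. The checkpoint name `Salesforce/codet5p-220m`, the `complete` helper, and the `max_length` default are assumptions made for this example. It requires `transformers` and PyTorch plus a network connection on first run, and it does not cover the encoder-only or decoder-only modes.

```python
# Minimal sketch, assuming the checkpoint name below exists on the Hugging Face Hub.
CHECKPOINT = "Salesforce/codet5p-220m"  # assumed Hub id for the 220M CodeT5+ model


def complete(prompt: str, max_length: int = 10) -> str:
    """Generate a short continuation for `prompt` in encoder-decoder mode.

    `complete` is a hypothetical helper, not part of any CodeT5 release.
    Imports are kept inside the function so the module loads without the
    heavy dependencies installed.
    """
    from transformers import AutoTokenizer, T5ForConditionalGeneration

    tokenizer = AutoTokenizer.from_pretrained(CHECKPOINT)
    model = T5ForConditionalGeneration.from_pretrained(CHECKPOINT)

    # Encode the prompt, generate token ids, and decode them back to text.
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    output_ids = model.generate(input_ids, max_length=max_length)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example call (downloads the weights on first use):
# print(complete("def print_hello_world():<extra_id_0>"))
```

The larger shallow-encoder/deep-decoder checkpoints may need different loading options, so treat this as a starting point for the small models only.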

CodeT5 Code Generation Accuracy on HumanEval

HumanEval is a benchmark of 164 hand-written Python programming problems. Each problem requires generating a function body that passes a set of unit tests the model does not see. Pass@k is the share of problems for which at least one of k sampled solutions passes every test. InstructCodeT5+ 16B hit 35.0% pass@1 and 54.5% pass@10 in zero-shot evaluation, according to the original CodeT5+ paper. When paired with CodeT test generation, accuracy climbed to 42.9% pass@1 and 67.8% pass@10.

Model	Pass@1	Pass@10	Setting
InstructCodeT5+ 16B	35.0%	54.5%	Zero-shot
InstructCodeT5+ 16B + CodeT	42.9%	67.8%	Zero-shot + Test Gen
CodeT5+ 770M-py	15.5%	—	Zero-shot
OpenAI code-cushman-001	33.5%	—	Zero-shot
GPT-NeoX 20B	15.4%	—	Zero-shot
PaLM 62B	15.9%	—	Zero-shot

Source: Wang et al., arXiv 2305.07922
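For readers unfamiliar with the pass@k columns above, the sketch below shows the unbiased estimator commonly used for HumanEval, 1 - C(n - c, k) / C(n, k), where n samples are drawn per problem and c of them pass. It is an assumption that the CodeT5+ evaluation used exactly this estimator; the function names and the sample counts in the usage note are illustrative only.

```python
from math import comb
from typing import Iterable, Tuple


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem from n samples, c of which pass."""
    if not 0 <= c <= n:
        raise ValueError("c must be between 0 and n")
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and n")
    # If fewer than k samples fail, every draw of k contains a passing one.
    if n - c < k:
        return 1.0
    # Probability that a random draw of k samples contains at least one pass.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: Iterable[Tuple[int, int]], k: int) -> float:
    """Average the per-problem estimate over (n, c) pairs for a benchmark."""
    scores = [pass_at_k(n, c, k) for n, c in results]
    if not scores:
        raise ValueError("results must not be empty")
    return sum(scores) / len(scores)
```

With made-up counts, `benchmark_pass_at_k([(10, 3), (10, 0)], k=1)` averages a 30% and a 0% problem to about 0.15. A reported score such as 35.0% pass@1 is this kind of average taken over all 164 problems.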

The 770M variant’s 15.5% pass@1 came within 0.4 points of PaLM 62B (15.9%) and edged past GPT-NeoX 20B (15.4%), despite having roughly 80 and 26 times fewer parameters, respectively. That efficiency at small scale is one of the more notable findings from the CodeT5+ evaluation.
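The size comparison is plain division, sketched below with the pass@1 scores from the table. The parameter counts are read off the model names (770M treated as 0.77 billion), which is an approximation, and the dictionary and function names are illustrative.

```python
# Parameter counts in billions, approximated from the model names.
PARAMS_BILLIONS = {
    "CodeT5+ 770M-py": 0.77,
    "GPT-NeoX 20B": 20.0,
    "PaLM 62B": 62.0,
}

# Zero-shot HumanEval pass@1 scores (percent) from the table above.
PASS_AT_1 = {
    "CodeT5+ 770M-py": 15.5,
    "GPT-NeoX 20B": 15.4,
    "PaLM 62B": 15.9,
}


def size_ratio(model: str, baseline: str = "CodeT5+ 770M-py") -> float:
    """How many times larger `model` is than `baseline`, by parameter count."""
    return PARAMS_BILLIONS[model] / PARAMS_BILLIONS[baseline]


def score_gap(model: str, baseline: str = "CodeT5+ 770M-py") -> float:
    """Pass@1 difference in percentage points (positive: `model` scores higher)."""
    return PASS_AT_1[model] - PASS_AT_1[baseline]


if __name__ == "__main__":
    for name in ("GPT-NeoX 20B", "PaLM 62B"):
        print(f"{name}: {size_ratio(name):.1f}x larger, {score_gap(name):+.1f} points")
```

Under these approximations PaLM 62B comes out a little over 80 times larger and GPT-NeoX 20B about 26 times larger, while the score gaps stay within half a percentage point.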

CodeT5 Training Data and Programming Language Support

CodeT5+ was pretrained on 51.5 billion tokens from permissively licensed GitHub repositories. Its authors describe that corpus as roughly 50x larger than the CodeSearchNet-based data behind the original CodeT5, which contained 8.35 million function-level instances. The training data only includes code under MIT, Apache-2.0, BSD-3-Clause, BSD-2-Clause, CC0-1.0, Unlicense, and ISC licenses, which lowers licensing risk for commercial deployment.
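A license allow-list of this kind can be sketched in a few lines. The code below is not Salesforce's data pipeline: the record layout (a dict with a `license` key), the lowercase identifiers, and both function names are assumptions made for illustration.

```python
from typing import Dict, Iterable, List, Optional

# The seven licenses named above, normalized to lowercase identifiers.
PERMISSIVE_LICENSES = frozenset({
    "mit",
    "apache-2.0",
    "bsd-3-clause",
    "bsd-2-clause",
    "cc0-1.0",
    "unlicense",
    "isc",
})


def is_permissive(license_id: Optional[str]) -> bool:
    """Return True if the license identifier is on the allow-list."""
    if license_id is None:
        return False  # Unknown license: exclude to stay on the safe side.
    return license_id.strip().lower() in PERMISSIVE_LICENSES


def filter_repos(repos: Iterable[Dict[str, Optional[str]]]) -> List[Dict[str, Optional[str]]]:
    """Keep only repository records whose `license` field is allow-listed."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]
```

For example, `filter_repos([{"name": "a", "license": "MIT"}, {"name": "b", "license": "GPL-3.0"}])` keeps only the first record. Excluding repositories with a missing license is a conservative design choice in this sketch, not a documented detail of the original corpus.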

CodeT5+ supports nine programming languages. The original CodeT5 covered eight, and CodeT5+ added C++ to the roster.

Language	CodeT5	CodeT5+
Python	Yes	Yes
Java	Yes	Yes
JavaScript	Yes	Yes
Go	Yes	Yes
Ruby	Yes	Yes
PHP	Yes	Yes
C	Yes	Yes
C#	Yes	Yes
C++	No	Yes

Source: Hugging Face / Salesforce Research

CodeT5 Hugging Face Adoption Metrics in 2026

CodeT5-base generated 22,172 monthly downloads on Hugging Face as of 2026. The model accumulated 132 community likes and powers 36 dependent spaces on the platform. Developers built 86 finetuned derivative models and 17 adapter models from the base checkpoint. Across all variants and finetuned offshoots, the CodeT5 family totals 761+ models on Hugging Face Hub.

Metric	Value
Monthly Downloads (CodeT5-base)	22,172
Community Likes	132
Dependent Spaces	36
Finetuned Derivative Models	86
Adapter Models	17
Total Models on Hub	761+

Source: Hugging Face

The GitHub repository at salesforce/CodeT5 has 3,100+ stars and 486+ forks. Salesforce archived it in May 2025, but model weights stay available through Hugging Face and community-maintained forks continue development under the Apache 2.0 license.

CodeT5 Downstream Task Performance

CodeT5+ was evaluated on over 20 code-related benchmarks across zero-shot, finetuning, and instruction-tuning settings. The strongest gains showed up in retrieval-augmented code generation, where scores jumped by an average of +5.8 BLEU-4 points over prior baselines. Text-to-code retrieval improved by +3.2 average MRR across eight tasks. Line-level code completion added +2.1 average exact match across two benchmarks.

Task Category	Benchmarks	Improvement
Text-to-Code Retrieval	8 tasks	+3.2 avg. MRR
Line-Level Code Completion	2 tasks	+2.1 avg. Exact Match
Retrieval-Augmented Code Gen	2 tasks	+5.8 avg. BLEU-4

Source: Wang et al., arXiv 2305.07922
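Two of the metrics in the table are simple to compute. The sketch below shows mean reciprocal rank (MRR) for retrieval and exact match for line completion, using their usual definitions. The function names and the whitespace stripping are choices made for this example, and the paper's evaluation scripts may normalize inputs differently. BLEU-4 is left out because it needs n-gram counting and a brevity penalty.

```python
from typing import List, Sequence


def mean_reciprocal_rank(ranked_results: Sequence[Sequence[str]],
                         targets: Sequence[str]) -> float:
    """Average of 1/rank of the correct item per query (0 if it is missing)."""
    if not ranked_results or len(ranked_results) != len(targets):
        raise ValueError("need one target per non-empty list of queries")
    total = 0.0
    for candidates, target in zip(ranked_results, targets):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break  # Only the first hit counts toward the score.
    return total / len(ranked_results)


def exact_match(predictions: List[str], references: List[str]) -> float:
    """Fraction of predictions identical to their reference after stripping."""
    if not predictions or len(predictions) != len(references):
        raise ValueError("need one reference per non-empty list of predictions")
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(predictions)
```

Both functions return values between 0 and 1. Papers usually report them multiplied by 100, so a +3.2 MRR gain corresponds to 0.032 on this scale.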

On mathematical programming, CodeT5+ 770M reached 87.4% pass@80 on MathQA-Python and 73.8% pass@100 on GSM8K-Python after finetuning. According to the CodeT5+ paper, those results outperformed many models with up to 137 billion parameters.

How Does CodeT5 Compare to Other Open Code LLMs?

When CodeT5+ 16B launched, its 35.0% pass@1 on HumanEval was the top score among open-source code models. Since then, newer models like StarCoder2 (up to 15B parameters, with the 15B model trained on more than 4 trillion tokens) and Code Llama (up to 70B parameters) have arrived. StarCoder2-15B matched or outperformed Code Llama 34B, a model more than twice its size, on several code tasks, according to the BigCode project’s evaluation. CodeT5 retains an edge in encoder-decoder tasks like code summarization and code search, where separate research found it outperformed PLBART across all generation benchmarks.

The broader AI code assistant market reached an estimated $8.5 billion by 2026, with 84% of developers reporting they use or plan to use AI tools. Within that space, CodeT5 serves as a popular base model for teams building custom code intelligence pipelines rather than competing directly with end-user products like GitHub Copilot or Cursor.

CodeT5 Licensing and Open-Source Status

All CodeT5 and CodeT5+ models are released under the Apache 2.0 license. Organizations can modify, distribute, and deploy without paying fees or requesting additional permissions. The training code uses a BSD-3-Clause license. Salesforce archived the official GitHub repo in May 2025, but the archive status doesn’t affect the license terms or access to model weights on Hugging Face.

This open licensing approach has contributed to CodeT5’s adoption in both academic and enterprise settings. Researchers at multiple universities have finetuned CodeT5 for tasks including automated program repair, code review automation, and vulnerability detection.

CodeT5 Environmental Footprint

Training CodeT5-base produced an estimated 49.25 kg of CO2 emissions on Google Cloud Platform, according to Salesforce’s own reporting, and the provider offset those emissions. Because the pretrained weights are released openly, other teams do not have to repeat the pretraining step from scratch, which reduces cumulative compute and energy costs across the community.

FAQ

How many monthly downloads does CodeT5-base get on Hugging Face?

CodeT5-base recorded 22,172 monthly downloads on Hugging Face as of 2026, with 761+ total models in the CodeT5 family deployed on the platform.

What accuracy did CodeT5+ achieve on HumanEval?

InstructCodeT5+ 16B scored 35.0% pass@1 on HumanEval in zero-shot settings. With CodeT test generation, that number rose to 42.9% pass@1.

How many programming languages does CodeT5 support?

CodeT5+ supports nine languages: Python, Java, JavaScript, Go, Ruby, PHP, C, C#, and C++. The original CodeT5 covered eight (no C++).

Is CodeT5 free for commercial use?

Yes. All CodeT5 models use the Apache 2.0 license, which allows commercial deployment, modification, and distribution without fees, provided the license’s attribution and notice requirements are met.

Is the CodeT5 GitHub repository still active?

Salesforce archived the official repository in May 2025. Model weights remain on Hugging Face, and community forks continue active development.

Sources:

https://arxiv.org/pdf/2305.07922

https://huggingface.co/Salesforce/codet5-base

https://github.com/salesforce/CodeT5

https://www.salesforce.com/blog/codet5-open-code-large-language-models/
