Perceiver IO, developed by Google DeepMind, had recorded 708 academic citations on Semantic Scholar by May 2026, and its byte-level variant scored 81.0 on GLUE, within 0.1 point of BERT Base’s 81.1 average, while processing raw bytes instead of tokenized input. This article covers the latest Perceiver IO statistics for 2026, including model capabilities, benchmark performance across language and vision tasks, real-world use cases in medical imaging and game AI, and the architecture’s growing research footprint.
Perceiver IO Statistics 2026 – TL;DR
Perceiver IO uses 201 million parameters in its language variant and processes inputs with linear O(MN) computational scaling, where M is the input size and N is the fixed number of latents, instead of the quadratic O(M²) required by standard Transformers.
The model achieved 1.81 end-point error on Sintel Clean optical flow benchmarks, outperforming specialized architectures like RAFT and PWCNet without any flow-specific design features.
When integrated into DeepMind’s AlphaStar system, Perceiver IO reduced floating-point operations by 3.5x while maintaining an 87% win rate in StarCraft II.
A 2025 hybrid framework pairing Vision Transformers with Perceiver IO recorded 99% accuracy in neurological disorder classification and 98% accuracy in lung disease detection.
Hugging Face Transformers lists 36 Perceiver IO-related models, including five official pre-trained variants from DeepMind for language modeling, image classification, and optical flow.
How Does Perceiver IO Compare to Standard Transformers?
Standard Transformer architectures like BERT cap input sequences at 512 tokens because self-attention memory and compute scale quadratically with sequence length. Perceiver IO sidesteps this by running self-attention on a fixed set of latent variables (256 or 512) and using cross-attention to incorporate inputs. The result is linear scaling with input size, regardless of whether the input is text, pixels, audio, or video.
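The scaling difference can be illustrated by counting how many query-key pairs each design has to score. The sketch below is a simplified cost model rather than DeepMind's implementation: the function names are hypothetical, and it ignores projections, MLP blocks, and constant factors.

```python
def self_attention_pairs(m: int, layers: int = 1) -> int:
    """Query-key pairs scored by full self-attention over m inputs."""
    return layers * m * m


def perceiver_io_pairs(m: int, n_latents: int, layers: int, n_outputs: int) -> int:
    """Query-key pairs scored by an encode / process / decode stack."""
    encode = m * n_latents                    # latents attend to the inputs
    process = layers * n_latents * n_latents  # latents attend to each other
    decode = n_outputs * n_latents            # output queries attend to latents
    return encode + process + decode


if __name__ == "__main__":
    # Layer counts and the latent count are illustrative values only.
    for m in (512, 2048, 8192):
        dense = self_attention_pairs(m, layers=12)
        latent = perceiver_io_pairs(m, n_latents=256, layers=26, n_outputs=m)
        print(f"M={m:>5}  dense={dense:>13,}  latent={latent:>13,}")
```

Doubling `m` quadruples the dense count, while the latent-based count grows only by the extra encode and decode terms, which are proportional to `m`.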
On the GLUE language benchmark, Perceiver IO trained on SentencePiece tokens scored an 81.2 average, slightly above BERT Base’s 81.1. The byte-level variant, which skips tokenization entirely and operates on raw UTF-8 bytes with a vocabulary of just 262, scored 81.0.
| Model | Input Type | Vocab Size | GLUE Avg. Score |
|---|---|---|---|
| BERT Base | SentencePiece tokens | 30,522 | 81.1 |
| Perceiver IO (tokens) | SentencePiece tokens | 30,522 | 81.2 |
| Perceiver IO (bytes) | Raw UTF-8 bytes | 262 | 81.0 |
Source: Perceiver IO paper (ICLR 2022)
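A vocabulary of 262 is consistent with the 256 possible byte values plus six reserved ids. The sketch below shows what byte-level encoding could look like under that assumption. The split into six special ids, their placement at the start of the id range, and the function names are assumptions made for illustration, not details stated in this article.

```python
NUM_SPECIAL_IDS = 6    # assumed number of reserved ids (padding, mask, and so on)
NUM_BYTE_VALUES = 256  # every value a single byte can take
VOCAB_SIZE = NUM_SPECIAL_IDS + NUM_BYTE_VALUES  # 262


def encode_bytes(text: str) -> list[int]:
    """Map text to ids by shifting each UTF-8 byte past the reserved ids."""
    return [byte + NUM_SPECIAL_IDS for byte in text.encode("utf-8")]


def decode_bytes(ids: list[int]) -> str:
    """Invert encode_bytes, skipping any reserved ids."""
    raw = bytes(i - NUM_SPECIAL_IDS for i in ids if i >= NUM_SPECIAL_IDS)
    return raw.decode("utf-8", errors="replace")


if __name__ == "__main__":
    ids = encode_bytes("héllo")
    print(len(ids), ids)      # "é" takes two bytes, so six ids in total
    print(decode_bytes(ids))
```

Because every string maps to bytes, no input is ever out of vocabulary. The trade-off is longer sequences than a subword tokenizer would produce.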
Perceiver IO Statistics for Optical Flow Performance
Optical flow estimation measures how pixels shift between two frames of the same scene. End-point error (EPE) is the average distance between predicted and ground-truth flow vectors, so lower is better. Perceiver IO achieved a 1.81 average EPE on Sintel Clean and 2.42 on Sintel Final, ahead of both PWCNet and RAFT. On the KITTI benchmark it scored 4.98 EPE, ahead of PWCNet but behind RAFT. These results came without cost volumes, explicit warping, or hierarchical processing, features that flow-specific models depend on.
| Model | Sintel Clean (EPE) | Sintel Final (EPE) | KITTI (EPE) |
|---|---|---|---|
| PWCNet | 2.17 | 2.91 | 5.76 |
| RAFT | 1.95 | 2.57 | 4.23 |
| Perceiver IO | 1.81 | 2.42 | 4.98 |
Source: Perceiver IO paper (ICLR 2022), AutoFlow training protocol
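For readers unfamiliar with the metric, average end-point error is the mean Euclidean distance between predicted and ground-truth flow vectors. The sketch below is a minimal illustration on plain lists of `(u, v)` tuples. The function name is hypothetical, and real benchmarks apply it over dense per-pixel flow fields.

```python
import math


def average_epe(predicted, target):
    """Mean Euclidean distance between paired (u, v) flow vectors."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("flow lists must be non-empty and the same length")
    total = 0.0
    for (pu, pv), (tu, tv) in zip(predicted, target):
        total += math.hypot(pu - tu, pv - tv)
    return total / len(predicted)


if __name__ == "__main__":
    pred = [(3.0, 4.0), (0.0, 0.0)]
    truth = [(0.0, 0.0), (0.0, 0.0)]
    print(average_epe(pred, truth))  # one vector is off by 5 pixels, one is exact
```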
Perceiver IO ImageNet Classification Results
On ImageNet image classification, Perceiver IO offers multiple preprocessing strategies. The convolutional preprocessing variant reached 84.5% top-1 accuracy when pretrained on the JFT dataset and 82.1% when pretrained on ImageNet alone. A more minimal approach using only learned 1D position embeddings — with zero 2D structural information about the image — achieved 72.7% top-1 accuracy. All variants operate at 224×224 pixel resolution.
| Preprocessing Method | Pretraining Data | Top-1 Accuracy |
|---|---|---|
| 2D Conv + MaxPool | JFT | 84.5% |
| 2D Conv + MaxPool | ImageNet | 82.1% |
| 2D Fourier Features | ImageNet | 79.0% |
| Learned 1D Embeddings | ImageNet | 72.7% |
Source: Perceiver IO paper (ICLR 2022)
How Was Perceiver IO Used in AlphaStar?
DeepMind tested Perceiver IO as a drop-in replacement for the Transformer entity encoder inside AlphaStar, their StarCraft II AI agent. The swap reduced floating-point operations by approximately 3.5x while keeping the win rate at 87% and the parameter count roughly the same. This result came after only three experimental runs, suggesting the architecture can replace task-specific Transformers in complex decision-making systems with minimal tuning.
Perceiver IO Statistics in Medical Imaging (2025)
A 2025 study published in Computational Biology and Chemistry tested a hybrid framework combining Vision Transformers with Perceiver IO for multi-disease medical image classification. The system was evaluated across three medical domains: neurology, dermatology, and pulmonology. This was the first reported application of a ViT + Perceiver IO architecture for these disease categories.
| Medical Domain | Accuracy | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Neurological Disorders | 0.99 | 0.99 | 1.00 | 0.99 |
| Lung Diseases | 0.98 | 0.97 | 1.00 | 0.98 |
| Skin Diseases | 0.95 | 0.93 | 0.97 | 0.95 |
Source: Khaliq et al., Computational Biology and Chemistry, Vol. 119, December 2025
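The F1 column can be cross-checked against the precision and recall columns, since F1 is their harmonic mean. The short sketch below recomputes it from the rounded values in the table. The function and variable names are illustrative, and small discrepancies would be expected if the study rounded after computing F1 from unrounded inputs.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) copied from the table above.
REPORTED = {
    "Neurological Disorders": (0.99, 1.00, 0.99),
    "Lung Diseases": (0.97, 1.00, 0.98),
    "Skin Diseases": (0.93, 0.97, 0.95),
}

if __name__ == "__main__":
    for domain, (precision, recall, reported) in REPORTED.items():
        recomputed = round(f1_score(precision, recall), 2)
        print(f"{domain}: recomputed={recomputed:.2f} reported={reported:.2f}")
```

All three recomputed values round to the reported F1 scores, so the table is internally consistent.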
What Is Graph Perceiver IO?
Graph Perceiver IO (GPIO), published in Pattern Recognition (Volume 169, January 2026), extended the architecture to graph-structured datasets. Standard Perceiver IO handled images and text well but lacked support for topological data like social networks or molecular structures. GPIO added graph positional encoding and output query smoothing to address this gap.
The model was validated on RTX A6000, RTX 3090, and A100 GPUs using PyTorch 1.11.0 and PyTorch-Geometric 2.0.1. GPIO showed competitive performance against state-of-the-art graph neural networks on link prediction tasks while maintaining lower space complexity since its computations do not depend on adjacency matrix operations. An extended version, GPIO+, uses two separate decoders to process images and graphs simultaneously for few-shot classification.
Source: Bae et al., Pattern Recognition, Volume 169, January 2026
Perceiver IO Multimodal Autoencoding Statistics
The Kinetics-700-2020 multimodal autoencoding task demonstrated Perceiver IO’s ability to reconstruct video, audio, and class labels at the same time. The model processed 16 video frames at 224×224 resolution alongside 30,720 raw audio samples and a 700-dimensional one-hot class label. Grouping the audio into 16-sample chunks gives 1,920 16-dimensional vectors (30,720 / 16), which were serialized together with the video inputs and the label into one 2D input array.
The architecture achieved an 88x compression ratio in its latent bottleneck during this task. Modality-specific Fourier position embeddings and modality embeddings were used for decoding. When the class label was masked during evaluation, the autoencoder doubled as a video classifier.
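The input sizes quoted for this task can be reproduced with simple arithmetic. One way to arrive at 1,920 16-dimensional vectors is to split the 30,720 audio samples into chunks of 16, and the sketch below assumes that grouping. The helper names are hypothetical, and the video preprocessing is not modeled beyond counting pixels.

```python
VIDEO_FRAMES, HEIGHT, WIDTH = 16, 224, 224
AUDIO_SAMPLES = 30_720
AUDIO_CHUNK = 16      # assumed chunk size, chosen so 30,720 / 16 = 1,920
NUM_CLASSES = 700


def group_samples(samples, chunk_size):
    """Split a flat sample list into equal-length chunks."""
    if len(samples) % chunk_size != 0:
        raise ValueError("sample count must be divisible by chunk size")
    return [samples[i:i + chunk_size] for i in range(0, len(samples), chunk_size)]


def one_hot(index, num_classes):
    """Return a one-hot label vector of length num_classes."""
    if not 0 <= index < num_classes:
        raise ValueError("class index out of range")
    vector = [0.0] * num_classes
    vector[index] = 1.0
    return vector


if __name__ == "__main__":
    audio = group_samples([0.0] * AUDIO_SAMPLES, AUDIO_CHUNK)
    label = one_hot(0, NUM_CLASSES)
    print("video pixels per clip:", VIDEO_FRAMES * HEIGHT * WIDTH)
    print("audio vectors:", len(audio), "of width", len(audio[0]))
    print("label width:", len(label))
```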
How Many Pre-Trained Perceiver IO Models Are Available?
Hugging Face Transformers added Perceiver IO support on December 8, 2021, roughly four months after the model’s initial arXiv release on July 30, 2021. The library currently lists 36 related models. Five official pre-trained variants from DeepMind cover the main use cases.
| Variant | Task | Training Data |
|---|---|---|
| language-perceiver | Masked Language Modeling | Wikipedia + C4 |
| vision-perceiver-conv | Image Classification | ImageNet (14M images) |
| vision-perceiver-fourier | Image Classification | ImageNet (14M images) |
| vision-perceiver-learned | Image Classification | ImageNet (14M images) |
| optical-flow-perceiver | Optical Flow Estimation | AutoFlow (400K pairs) |
Source: Hugging Face Transformers documentation
Perceiver IO Statistics: Architecture Specifications
The core architecture processes inputs through cross-attention with a fixed latent array, then applies repeated self-attention blocks within that latent space. Outputs are decoded through a second cross-attention step using task-specific query arrays. The language model variant has 201 million parameters and 26 processing layers, compared to BERT Base’s 110 million parameters and 12 layers, while still fitting within a similar compute budget because its self-attention runs over 256 latents rather than the full input sequence.
| Specification | Perceiver IO (Language) | BERT Base |
|---|---|---|
| Parameters | 201M | 110M |
| Processing Layers | 26 | 12 |
| Latent Size | 256 | N/A (512 tokens) |
| Vocabulary Size | 262 (byte IDs) | 30,522 |
| Max Input Length | 2,048 bytes | 512 tokens |
| Scaling | Linear O(MN) | Quadratic O(M²) |
Source: Perceiver IO paper (ICLR 2022)
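The encode, process, decode flow described in this section can be sketched in a few lines of dependency-free Python. This is a heavily stripped-down illustration, not the published model: it uses single-head dot-product attention with keys doubling as values, and it omits learned projections, residual connections, layer normalization, and MLP blocks. All names are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, keys_values):
    """Single-head dot-product attention; keys are reused as values."""
    dim = len(keys_values[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
                  for k in keys_values]
        weights = softmax(scores)
        outputs.append([sum(w * kv[j] for w, kv in zip(weights, keys_values))
                        for j in range(dim)])
    return outputs


def perceiver_io_forward(inputs, latents, output_queries, num_layers):
    """Encode inputs into latents, refine the latents, then decode."""
    z = attend(latents, inputs)       # encode: cost grows with len(inputs)
    for _ in range(num_layers):
        z = attend(z, z)              # process: cost depends only on len(latents)
    return attend(output_queries, z)  # decode: one output row per query


def random_array(rows, dim, seed):
    """Reproducible random vectors standing in for real data or parameters."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


if __name__ == "__main__":
    out = perceiver_io_forward(random_array(50, 8, 0), random_array(4, 8, 1),
                               random_array(3, 8, 2), num_layers=2)
    print(len(out), len(out[0]))  # output shape follows the queries, not the inputs
```

The output has one row per query regardless of how many inputs were encoded, which is what lets the same core serve tasks with very different input and output sizes.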
Perceiver IO Research Impact and Citations
The original Perceiver IO paper accumulated 708 citations on Semantic Scholar by May 2026. Of those, 60 are classified as highly influential, 219 as background citations, and 208 as methods citations. The paper was published at ICLR 2022 after its initial arXiv release in July 2021.
Research extending the Perceiver IO framework has appeared in journals including Pattern Recognition (Graph Perceiver IO, January 2026), Computational Biology and Chemistry (medical imaging, December 2025), and Acta Astronautica (2025). The architecture has also been cited in systematic reviews of foundation models in mobile service robotics spanning over 7,500 papers.
FAQ
What is Perceiver IO?
Perceiver IO is a general-purpose architecture from Google DeepMind that processes text, images, audio, and video through a unified model. It scales linearly with input and output size using cross-attention with fixed latent variables.
How many parameters does Perceiver IO have?
The language modeling variant has 201 million parameters with 26 processing layers. Image classification and optical flow variants differ in size based on their preprocessing configurations.
What accuracy does Perceiver IO achieve on ImageNet?
With convolutional preprocessing and JFT pretraining, Perceiver IO reached 84.5% top-1 accuracy on ImageNet. Without any 2D structural assumptions, it scored 72.7% using only learned embeddings.
Can Perceiver IO process multiple data types at once?
Yes. The Kinetics-700 multimodal configuration processes 16 video frames, 30,720 audio samples, and classification labels simultaneously with an 88x compression ratio in its latent space.
Where can I access pre-trained Perceiver IO models?
Hugging Face Transformers provides five official pre-trained variants from DeepMind and lists 36 related models in total. The library has supported Perceiver IO since December 8, 2021.
Sources:
https://arxiv.org/abs/2107.14795
https://huggingface.co/docs/transformers/model_doc/perceiver
https://www.sciencedirect.com/science/article/pii/S1476927125002476
https://www.sciencedirect.com/science/article/abs/pii/S0031320325005497
