    SciBERT Statistics 2026: Academic Usage Statistics And NLP Benchmarks

    By Dominic Reigns | December 15, 2025 | Updated: June 23, 2026

    SciBERT, the domain-adapted BERT model for scientific text, records over 219,000 monthly downloads on Hugging Face and has crossed 3,400 academic citations since its 2019 release. Built by the Allen Institute for AI and trained on 1.14 million Semantic Scholar papers, SciBERT remains a standard baseline in scientific NLP even as larger models crowd the space. This post breaks down its current usage numbers, benchmark performance across NLP tasks, and how it compares to competing domain-specific models in 2026.

    SciBERT Statistics 2026 – TL;DR

    SciBERT has 219,161 monthly downloads on Hugging Face as of May 2026, based on the allenai/scibert_scivocab_uncased model page.

    The original SciBERT paper has accumulated 3,474 citations on Semantic Scholar, with 575 of those classified as highly influential.

    SciBERT was pretrained on 3.1 billion tokens from 1.14 million scientific papers, roughly 82% from biomedical domains and 18% from computer science.

    On NLP benchmarks, SciBERT outperforms BERT-base by an average of +2.11 F1 across scientific tasks when fine-tuned.

    The GitHub repository has 1,700 stars and 231 forks, with 94 community fine-tuned models listed on Hugging Face.

    How Many People Use SciBERT?

    The primary SciBERT model (scibert_scivocab_uncased) logged 219,161 downloads in the last month on Hugging Face. A cased variant is also available but used less frequently. The model has 170 community likes and 57 active Spaces on Hugging Face that depend on it.

    On GitHub, the allenai/scibert repository has attracted 1,700 stars and 231 forks since its release. The repo contains evaluation code for NER, text classification, relation extraction, dependency parsing, and PICO extraction tasks.

    | Metric | Value |
    |---|---|
    | Monthly Hugging Face Downloads | 219,161 |
    | Semantic Scholar Citations | 3,474 |
    | Highly Influential Citations | 575 |
    | GitHub Stars | 1,700 |
    | GitHub Forks | 231 |
    | Hugging Face Fine-tuned Models | 94 |
    | Hugging Face Spaces | 57 |
    | Hugging Face Community Likes | 170 |

    Source: Hugging Face, Semantic Scholar, GitHub (May 2026)

    [Chart: SciBERT Usage Overview]

    SciBERT Model Architecture and Training Data

    SciBERT uses the same 12-layer Transformer encoder as BERT-base, with 768-dimensional hidden states, 12 attention heads, and a feed-forward size of 3,072. Total parameter count is about 110 million. The key difference from BERT-base is the vocabulary: SciBERT uses a custom WordPiece vocabulary, SciVocab, built from scientific text, so technical terms are split into fewer subword pieces than under BERT's general-domain vocabulary.
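To make the vocabulary point concrete, here is a minimal sketch of greedy longest-match WordPiece splitting. The two vocabularies are tiny hypothetical sets invented for illustration, not the real BERT or SciVocab files, and the function is a simplified stand-in rather than the tokenizer SciBERT ships with.

```python
def wordpiece(word, vocab):
    """Split one lowercase word greedily, longest matching piece first."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return ["[UNK]"]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical toy vocabularies (assumptions for this sketch only).
general_vocab = {"ph", "##os", "##ph", "##ory", "##lation"}
science_vocab = general_vocab | {"phosphorylation"}

general_split = wordpiece("phosphorylation", general_vocab)
science_split = wordpiece("phosphorylation", science_vocab)
print(general_split)  # five fragments
print(science_split)  # one whole-word token
```

With the general-purpose toy vocabulary the term breaks into five fragments, while the domain vocabulary keeps it as a single token, which is the kind of difference a science-specific vocabulary is meant to deliver.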

    The training corpus was drawn from Semantic Scholar and consists of 1.14 million papers with full text, totaling 3.1 billion tokens. About 82% of the papers come from biomedical fields, and 18% from computer science. The model was released in both cased and uncased variants, with uncased performing better on most tasks.

    | Specification | Detail |
    |---|---|
    | Architecture | BERT-base (12-layer Transformer) |
    | Parameters | 110 million |
    | Hidden Dimensions | 768 |
    | Attention Heads | 12 |
    | Training Corpus Size | 1.14M papers / 3.1B tokens |
    | Corpus Source | Semantic Scholar |
    | Vocabulary | SciVocab (31K tokens) |
    | Biomedical Papers Share | ~82% |
    | Computer Science Papers Share | ~18% |

    Source: Beltagy et al. (2019), Allen Institute for AI
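The 110 million parameter figure can be roughly reproduced from the dimensions in the table. The sketch below assumes the standard BERT-base layout (512 position embeddings, 2 segment types, a pooler layer, biases and LayerNorm on every block) and rounds the vocabulary to the 31K quoted above. Those layout details are assumptions, not values stated in this article.

```python
def bert_param_count(vocab_size, hidden=768, layers=12, ffn=3072,
                     max_positions=512, type_vocab=2):
    """Approximate parameter count for a BERT-base style encoder."""
    layer_norm = 2 * hidden  # scale and bias vectors
    embeddings = (vocab_size + max_positions + type_vocab) * hidden + layer_norm
    attention = 4 * (hidden * hidden + hidden)  # query, key, value, output projections
    feed_forward = (hidden * ffn + ffn) + (ffn * hidden + hidden)
    per_layer = attention + feed_forward + 2 * layer_norm
    pooler = hidden * hidden + hidden
    return embeddings + layers * per_layer + pooler


# 31_000 is the rounded "31K" vocabulary size from the table above.
total = bert_param_count(vocab_size=31_000)
print(f"{total:,} parameters (~{total / 1e6:.0f}M)")
```

Under these assumptions the count lands just under 110 million, with the 12 encoder layers accounting for most of it and the embedding table for most of the rest.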

    SciBERT NLP Benchmark Performance

    SciBERT outperforms BERT-base on six of the seven tasks listed below, spanning biomedical and computer science NLP. The largest gains appear on domain-heavy datasets like ACL-ARC (citation intent classification), where SciBERT scores 70.98 F1 versus 63.91 for BERT-base, a gap of +7.07 points. On ChemProt relation extraction, it scores 83.64 versus 79.14 (+4.50). The one exception is SciCite, where BERT-base leads by 0.31 points.

    In biomedical tasks specifically, the fine-tuned improvement averages +1.92 F1. For computer science tasks, that average is +3.55 F1. With frozen embeddings (no fine-tuning), the pattern reverses: the biomedical gap widens to +3.59 F1, while the CS gap narrows to +1.13, according to the original evaluation.

    | Task / Dataset | SciBERT F1 | BERT-base F1 | Difference |
    |---|---|---|---|
    | ACL-ARC (Citation Intent) | 70.98 | 63.91 | +7.07 |
    | ChemProt (Relation Extraction) | 83.64 | 79.14 | +4.50 |
    | SciERC (NER) | 67.57 | 65.24 | +2.33 |
    | BC5CDR (NER) | 90.01 | 88.85 | +1.16 |
    | SciCite (Classification) | 84.00 | 84.31 | -0.31 |
    | JNLPBA (NER) | 77.28 | 76.99 | +0.29 |
    | NCBI-disease (NER) | 88.57 | 86.72 | +1.85 |

    Source: Beltagy et al. (2019), emergentmind.com
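As a quick sanity check, the Difference column can be recomputed from the two score columns. The sketch below uses only the seven rows in the table above. Its mean covers those rows alone, so it need not match the +2.11 overall average, which the article attributes to the original evaluation.

```python
# (SciBERT F1, BERT-base F1) pairs copied from the table above.
scores = {
    "ACL-ARC": (70.98, 63.91),
    "ChemProt": (83.64, 79.14),
    "SciERC": (67.57, 65.24),
    "BC5CDR": (90.01, 88.85),
    "SciCite": (84.00, 84.31),
    "JNLPBA": (77.28, 76.99),
    "NCBI-disease": (88.57, 86.72),
}

# Positive delta means SciBERT scored higher than BERT-base.
deltas = {task: round(sci - base, 2) for task, (sci, base) in scores.items()}
wins = sum(delta > 0 for delta in deltas.values())
mean_delta = round(sum(deltas.values()) / len(deltas), 2)

for task, delta in sorted(deltas.items(), key=lambda item: -item[1]):
    print(f"{task:<14}{delta:+.2f}")
print(f"SciBERT ahead on {wins} of {len(deltas)} tasks, mean delta {mean_delta:+.2f}")
```

Sorting by delta shows where the domain vocabulary and corpus help most (citation intent, relation extraction) and where the two models are effectively tied.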

    [Chart: SciBERT vs BERT-base F1 Scores by Task]

    SciBERT Statistics by Domain Improvement

    The F1 gains break down differently by domain. Biomedical tasks see a +1.92 average improvement with fine-tuning, while computer science tasks get +3.55 on average. When embeddings are frozen and not fine-tuned, SciBERT gains +3.59 F1 on biomedical tasks, suggesting the pretrained representations carry strong domain signal even without task-specific updates.

    | Domain | Fine-tuned F1 Gain | Frozen Embeddings F1 Gain |
    |---|---|---|
    | Biomedical | +1.92 | +3.59 |
    | Computer Science | +3.55 | +1.13 |
    | Multi-domain | +0.72 | +2.47 |
    | Overall Average | +2.11 | +2.43 |

    Source: Beltagy et al. (2019)

    [Chart: F1 Score Improvement Over BERT-base by Domain]

    How Does SciBERT Compare to Other Scientific NLP Models?

    PubMedBERT, developed by Microsoft Research and trained exclusively on PubMed abstracts, generally outperforms SciBERT on biomedical benchmarks. PubMedBERT achieved an 82.91 BLURB benchmark score and recorded roughly 522,000 monthly Hugging Face downloads as of late 2024. That is more than double SciBERT's May 2026 figure, though the two counts were taken at different times and are not directly comparable. BioBERT, which continues pretraining from BERT-base on PubMed text, falls between the two on most biomedical tasks.

    SciBERT’s advantage over PubMedBERT is its dual-domain coverage. Because the training data includes both biomedical and computer science papers, SciBERT performs better on CS-specific benchmarks like ACL-ARC and SciERC. For teams working across scientific disciplines rather than strictly in biomedicine, SciBERT remains a practical choice, especially since, at 110 million parameters, it is far cheaper to fine-tune than larger general-purpose models.

    | Model | Training Corpus | Parameters | Monthly Downloads (HF) | Citations |
    |---|---|---|---|---|
    | SciBERT | 1.14M Semantic Scholar papers | 110M | ~219K | 3,474 |
    | PubMedBERT | 14M PubMed abstracts | 110M | ~522K | 1,000+ |
    | BioBERT | PubMed + PMC articles | 110M | ~300K | 5,000+ |
    | MatSciBERT | Materials science literature | 110M | Moderate | 200+ |

    Source: Hugging Face, Semantic Scholar, respective model papers

    [Chart: Monthly Hugging Face Downloads: Domain-Specific BERT Models]

    SciBERT Academic Citations Over Time

    The original SciBERT paper was published at EMNLP 2019. It crossed 1,000 citations by mid-2021 and 2,000 by early 2023. As of May 2026, the count on Semantic Scholar stands at 3,474, with 575 classified as highly influential and 818 as methods citations, meaning those papers used SciBERT as part of their methodology.

    Citation growth has slowed compared to the 2021-2023 period, which is typical for a model that has become a standard baseline. Newer models trained on larger or more focused corpora now draw a share of new citations, but the estimates below still show SciBERT adding roughly 250-400 citations in each of 2024 and 2025.

    | Year | Estimated Cumulative Citations |
    |---|---|
    | 2019 | ~50 |
    | 2020 | ~450 |
    | 2021 | ~1,200 |
    | 2022 | ~2,000 |
    | 2023 | ~2,700 |
    | 2024 | ~3,100 |
    | 2025 | ~3,350 |
    | 2026 (May) | 3,474 |

    Source: Semantic Scholar
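The cumulative estimates in the table can be turned into per-year additions to see where growth peaked. This is a small sketch using the table's rounded figures as given. The 2026 entry covers only January through May, so its increment is a partial-year value.

```python
# Estimated cumulative citations by year, copied from the table above.
cumulative = {
    2019: 50, 2020: 450, 2021: 1200, 2022: 2000,
    2023: 2700, 2024: 3100, 2025: 3350, 2026: 3474,
}

years = sorted(cumulative)
# New citations in each year = this year's total minus the previous year's.
added = {year: cumulative[year] - cumulative[prev]
         for prev, year in zip(years, years[1:])}
peak_year = max(added, key=added.get)

for year, count in added.items():
    note = " (partial year, through May)" if year == 2026 else ""
    print(f"{year}: +{count}{note}")
print(f"Peak year for new citations: {peak_year}")
```

On these estimates the yearly additions rise through 2022 and fall afterwards, which matches the slowdown described in the text.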

    [Chart: SciBERT Cumulative Academic Citations (2019-2026)]

    SciBERT Derivative and Fine-tuned Models

    SciBERT has spawned a range of domain-specific derivatives. MatSciBERT was trained on materials science literature and outperforms SciBERT on materials NER and classification tasks. COVID-SciBERT was fine-tuned during the pandemic for biomedical text related to SARS-CoV-2. SsciBERT targets social science text, and NukeBERT covers the nuclear domain.

    On Hugging Face alone, 94 community fine-tuned models list SciBERT as their base. These cover tasks like drug adverse effect extraction, citation intent classification, scientific claim identification, and research paper topic tagging. An additional 3 adapter models and 3 quantized versions are available.

    | Derivative Model | Domain | Base |
    |---|---|---|
    | MatSciBERT | Materials Science | SciBERT |
    | COVID-SciBERT | COVID-19 / Biomedical | SciBERT |
    | SsciBERT | Social Sciences | SciBERT |
    | NukeBERT / NukeLM | Nuclear Domain | SciBERT / RoBERTa |
    | SciEdBERT | Science Education | SciBERT |
    | BatteryBERT | Battery / Energy | SciBERT |

    Source: Hugging Face, arxiv.org

    SciBERT Statistics in Clinical and Biomedical NLP

    A 2025 study by Rubio-Martín et al. applied SciBERT to clinical note classification using hospital electronic health records. The model reached 0.96 accuracy and 0.97 F1 on that task, outperforming most traditional methods after hyperparameter tuning. On biomedical NER specifically, SciBERT with a CRF head scored 0.82 F1 on the NCBI Disease Corpus after grid search optimization, per a 2025 DOAJ study.

    For medical abbreviation disambiguation on the MeDAL dataset, SciBERT hit 77.3% macro-F1 and 90.5% weighted F1. In citation intent classification within ensemble setups, SciBERT-based systems achieved macro-F1 above 89%. These numbers position SciBERT as a strong general-purpose scientific encoder, though PubMedBERT remains the top pick for purely biomedical pipelines where accuracy on medical terminology is the priority.
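The gap between macro-F1 and weighted F1 on MeDAL is easier to read with the two averages side by side. The sketch below uses hypothetical per-class counts invented for illustration, not MeDAL data. Macro-F1 gives every class equal weight, while weighted F1 weights each class by its number of true examples.

```python
def f1(tp, fp, fn):
    """F1 score from true positives, false positives and false negatives."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical (tp, fp, fn) counts: one frequent class, one rare class.
counts = {"common_sense": (90, 10, 10), "rare_sense": (5, 5, 5)}

per_class = {name: f1(*c) for name, c in counts.items()}
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}  # true examples per class

macro_f1 = sum(per_class.values()) / len(per_class)
weighted_f1 = sum(per_class[n] * support[n] for n in counts) / sum(support.values())

print(f"macro-F1:    {macro_f1:.3f}")
print(f"weighted F1: {weighted_f1:.3f}")
```

When rare classes score poorly, macro-F1 drops well below weighted F1, which is one plausible reading of a 77.3% macro versus 90.5% weighted result on a dataset with many infrequent abbreviation senses.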

    Scientific NLP Market Context for SciBERT

    The global NLP market is projected at $70.11 billion in 2026, growing at 29% annually to reach $249.97 billion by 2031, according to MarketsandMarkets. Within that, the biomedical NLP segment reached $8.97 billion in 2025 and is expected to grow at a 34.74% CAGR through 2034.

    The domain-specific AI language models market, which covers models like SciBERT and its peers, is estimated at $6.62 billion in 2026, according to MarketIntelo. That figure is expected to reach $66.2 billion by 2034 at a reported 38% CAGR, although a tenfold rise over eight years works out to roughly 33% a year, so the 38% figure may be measured from a different base year. Demand is driven by regulatory requirements in healthcare, finance, and legal sectors that favor auditable domain-constrained models over opaque general-purpose systems.
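The growth figures quoted in this section can be checked with the standard compound annual growth rate formula. The sketch below assumes the start and end years are exactly as stated in the text (2026 to 2031 for the NLP market, 2026 to 2034 for domain-specific models). The underlying reports may use different base years.

```python
def project(start_value, rate, years):
    """Value after compounding at a fixed annual rate."""
    return start_value * (1 + rate) ** years


def cagr(start_value, end_value, years):
    """Compound annual growth rate implied by a start and end value."""
    return (end_value / start_value) ** (1 / years) - 1


# NLP market: $70.11B in 2026 growing 29% a year to 2031 (5 years).
nlp_2031 = project(70.11, 0.29, 5)

# Domain-specific models: $6.62B in 2026 to $66.2B in 2034 (8 years).
implied_rate = cagr(6.62, 66.2, 8)

print(f"Projected NLP market in 2031: ${nlp_2031:.2f}B")
print(f"Implied CAGR for domain-specific models: {implied_rate:.1%}")
```

The first projection lands close to the quoted $249.97 billion. The second implies a rate nearer 33% than the quoted 38%, so the published CAGR likely spans a different period than the two endpoints given here.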

    [Chart: Domain-Specific AI Language Models Market Size ($B)]

    SciBERT Key Use Cases in 2026

    SciBERT is used across a range of academic and industrial NLP tasks. Named entity recognition for genes, chemicals, and diseases is the most common application. Relation extraction, especially for drug-protein interactions (ChemProt), is another frequent use case. Citation intent classification, where the model identifies whether a paper cites another for background, method, or result, is a strong fit given SciBERT’s training on full paper text.

    Other uses include automated systematic review filtering in evidence-based medicine, bioassay semantic annotation, scientific document classification for library systems, and clinical note processing. Research teams working on knowledge graphs from academic literature also rely on SciBERT embeddings for entity linking and document clustering.
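For the embedding-based uses mentioned above, a common pattern is to average an encoder's token vectors into one document vector and compare documents by cosine similarity. The sketch below shows that pattern in plain Python. The three-dimensional vectors and document names are hypothetical stand-ins for the 768-dimensional outputs a model such as allenai/scibert_scivocab_uncased would produce, and mean pooling is one of several reasonable choices rather than a method prescribed by the SciBERT authors.

```python
import math


def mean_pool(token_vectors):
    """Average a list of equal-length token vectors into one document vector."""
    count = len(token_vectors)
    return [sum(column) / count for column in zip(*token_vectors)]


def cosine(a, b):
    """Cosine similarity between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


# Hypothetical token vectors for three papers (toy values, not model output).
docs = {
    "protein_paper": [[1.0, 0.0, 0.0], [0.8, 0.2, 0.0]],
    "kinase_paper": [[0.9, 0.1, 0.0], [0.7, 0.3, 0.0]],
    "parsing_paper": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}

pooled = {name: mean_pool(vectors) for name, vectors in docs.items()}
query = "protein_paper"
# Pick the other document whose pooled vector is closest to the query's.
nearest = max((name for name in pooled if name != query),
              key=lambda name: cosine(pooled[query], pooled[name]))
print(f"Nearest to {query}: {nearest}")
```

The same nearest-neighbour comparison underlies both document clustering and candidate retrieval for entity linking, with real encoder outputs substituted for the toy vectors.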

    FAQ

    What is SciBERT used for?

    SciBERT is used for scientific text processing tasks including named entity recognition, relation extraction, citation classification, and clinical note analysis in academic and biomedical research.

    How many downloads does SciBERT have?

    SciBERT’s primary model (scibert_scivocab_uncased) records 219,161 monthly downloads on Hugging Face as of May 2026.

    Is SciBERT better than BERT for scientific text?

    In most cases, yes. SciBERT outperforms BERT-base by an average of +2.11 F1 on scientific NLP benchmarks, with gains up to +7.07 F1 on citation intent tasks, though BERT-base is marginally ahead on SciCite (-0.31 F1) in the comparison above.

    How does SciBERT compare to PubMedBERT?

    PubMedBERT outperforms SciBERT on biomedical-only benchmarks. SciBERT covers both biomedical and computer science domains, making it more versatile across scientific disciplines.

    How many citations does the SciBERT paper have?

    The SciBERT paper has 3,474 citations on Semantic Scholar as of May 2026, with 575 classified as highly influential.

    Sources:

    https://huggingface.co/allenai/scibert_scivocab_uncased

    https://www.semanticscholar.org/paper/SciBERT:-A-Pretrained-Language-Model-for-Scientific-Beltagy-Lo/156d217b0a911af97fa1b5a71dc909ccef7a8028

    https://www.emergentmind.com/topics/scibert

    https://www.marketsandmarkets.com/Market-Reports/natural-language-processing-nlp-825.html

    Dominic Reigns

    As a senior analyst, I benchmark and review gadgets and PC components, including desktop processors, GPUs, monitors, and storage solutions on Aboutchromebooks.com. Outside of work, I enjoy skating and putting my culinary training to use by cooking for friends.
