Getting the Most Out of scGPT for Single-Cell Annotation

Introduction

In an earlier article, scGPT: The First AI Large Language Model in Single-Cell RNA Sequencing, we introduced scGPT and its transformer-based approach to labeling cells with remarkable accuracy across complex datasets like pancreas, multiple sclerosis, and tumor-infiltrating myeloid cells.

That post sparked a lot of interest (and traffic), which we took to mean the community is eager for practical guidance on how to actually apply these models in the lab.

This follow-up dives deeper into how to get the most out of scGPT for single-cell annotation, focusing on real-world scenarios where bioinformaticians and R&D teams can benefit most. We explore:

  • When to use zero-shot vs fine-tuned scGPT, and how to choose based on your dataset.
  • How LLM alternatives like GPT-4, scBERT, and CellTypist stack up in real workflows.
  • Why something as simple as choosing the “top 10 genes” over “top 20” can make or break your results.

Whether you’re sketching a quick atlas or designing a clinical-grade diagnostic panel, this guide offers practical tips for integrating scGPT and other LLM tools into your single-cell workflows.

A Two-Track Model: Zero-Shot Vs Task-Specific Training

To apply scGPT effectively, the first key decision is how to run it: do you use the model as-is, or tailor it to your specific data? That choice, zero-shot versus fine-tuned, shapes the rest of your workflow.[1]

scGPT offers two distinct operating modes:

Zero-shot (pre-trained only): You keep the foundation model exactly as it was trained on 33 million cells and apply it directly to your data. Think of it as asking a well-read expert for a first opinion without giving them any new background.

Task-specific fine-tuning: You start with the same foundation but let the model “study” a small, labelled subset of your own cells for a few epochs (an epoch is one full pass through your training set—every labelled cell is seen exactly once before the next epoch begins). This is like giving that expert a focused crash course on your exact tissue or disease context before they render a verdict.

With that framing in mind, here’s how the two tracks compare:

| Mode | What you do | Pros | Cons |
| --- | --- | --- | --- |
| Zero-shot (pre-trained only) | Feed your gene counts → get embeddings or labels | Instant; no GPU; reusable across projects | Can miss rare/novel states; lower macro-F1 on out-of-distribution data |
| Fine-tuned | Start from the 33-million-cell backbone, run 5-10 epochs (≈20 min on 1 × A100) | +10-25 pp accuracy jump on MS and TIM datasets; better subtype resolution | Needs GPUs; risk of overfitting small cohorts; extra MLOps complexity |

TL;DR:

  • For quick exploration (or when no reference labels exist) run scGPT in zero-shot mode and cluster the embeddings.
  • When you need publication-quality or clinical-grade labels, spend the extra hour to fine-tune the model on a few thousand well-annotated cells.
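To make the zero-shot route concrete, here is a minimal sketch of the embed-then-cluster loop. The scGPT encoder and the Leiden clustering step are replaced by simple stand-ins (a random projection and a tiny k-means) so the sketch runs end to end; in a real pipeline you would load the pre-trained scGPT checkpoint for the embedding step and use Leiden or Louvain on a neighbour graph.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed_cells(counts: np.ndarray, dim: int = 32) -> np.ndarray:
    """Stand-in for a foundation-model encoder (e.g. scGPT zero-shot).
    A real pipeline would encode log-normalised counts with the
    pre-trained checkpoint; here a fixed random projection suffices."""
    proj = rng.normal(size=(counts.shape[1], dim))
    return np.log1p(counts) @ proj

def kmeans(X: np.ndarray, k: int = 3, iters: int = 50) -> np.ndarray:
    """Tiny k-means as a stand-in for Leiden/Louvain clustering."""
    centroids = X[rng.choice(len(X), k, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - centroids) ** 2).sum(-1), axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(0)
    return labels

counts = rng.poisson(2.0, size=(300, 2000))  # 300 cells x 2,000 genes
emb = embed_cells(counts)                     # "atlas" embedding space
clusters = kmeans(emb, k=3)                   # provisional cluster labels
```

The output is an unlabeled clustering of the embedding space; the next sections cover how to attach biological labels to those clusters.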

LLM-Powered Shortcuts: GPT-4, GPTCelltype & More

While scGPT is the centerpiece, it’s not the only tool in the modern single-cell toolbox. Here’s how alternative LLMs and rule-based methods can complement or accelerate your workflow.[2] 

  • GPT-4 marker-prompting. Hou & Ji showed that GPT-4, queried with the top 10 differential genes per cluster, matched manual annotations across >400 cell types with median concordance >0.85.
    • Strengths: No training, zero code (just an API call), human-readable rationales.
    • Limitations: Requires good DE genes; struggles when marker lists are noisy.

  • scBERT / Geneformer. Text-style transformers trained on gene-token sentences. Comparable accuracy to scGPT when data are balanced, but fall off on extremely rare populations.

  • CellTypist / Azimuth. Classical reference-mapping; blazing fast but limited by reference quality.

Wondering when to mix & match? We’d suggest using GPT-4 to sanity-check scGPT predictions or to label clusters that scGPT flags as “unknown”. Ensemble voting often lifts borderline clusters by 3-5 percentage points.
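The mix-and-match idea above can be sketched as a simple per-cluster majority vote. The per-tool predictions below are made-up stand-ins for scGPT, GPT-4, and CellTypist outputs; ties fall back to "unknown" so a human (or a GPT-4 re-query) can arbitrate the borderline clusters.

```python
from collections import Counter

def ensemble_label(votes: dict, fallback: str = "unknown") -> str:
    """Majority vote across annotators; a tie returns the fallback
    label so the cluster is flagged for manual review."""
    ranked = Counter(votes.values()).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return fallback
    return ranked[0][0]

# Hypothetical per-cluster predictions from three tools.
cluster_votes = {
    "c0": {"scGPT": "B cell", "GPT-4": "B cell", "CellTypist": "B cell"},
    "c1": {"scGPT": "unknown", "GPT-4": "NK cell", "CellTypist": "NK cell"},
    "c2": {"scGPT": "T cell", "GPT-4": "NK cell", "CellTypist": "unknown"},
}
consensus = {c: ensemble_label(v) for c, v in cluster_votes.items()}
```

Here c0 and c1 resolve by majority while c2, a three-way tie, stays "unknown" and gets escalated; this is the pattern behind the 3-5 percentage-point lift on borderline clusters.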

Why “Top 10” Beats “Top 20” (most of the time)

No matter which model you choose – scGPT, GPT-4, or CellTypist – your input gene set can make or break your results. One subtle yet crucial design choice is how many DE genes to use when prompting.[3] Hou & Ji systematically varied the number of DE genes fed to GPT-4: accuracy peaked at 10 genes and declined as the list grew to 20 or 50. The intuition: a concise marker panel focuses the LLM on signature genes instead of drowning it in noise.

For scGPT fine-tuning, the story differs slightly:

  • Top HVGs (≈2,000 genes) are still extracted to build the token sequence, but
  • The CLS classifier inside scGPT implicitly down-weights low-information genes during training.

Practical tip: When prompting an LLM, start with the top 10 DEGs. For transformer-based models, keep using your usual HVG pipeline and let the model learn the weights.
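The practical tip above reduces to one slicing decision. This sketch ranks genes by a differential-expression score, keeps only the top 10, and assembles a marker-gene prompt; the gene names, scores, and prompt wording are illustrative (the spirit of Hou & Ji's setup, not their exact template).

```python
def top_markers(de_scores: dict, n: int = 10) -> list:
    """Keep only the n strongest DE hits: a short, high-signal
    panel rather than a long noisy one."""
    ranked = sorted(de_scores, key=de_scores.get, reverse=True)
    return ranked[:n]

def build_prompt(cluster: str, markers: list) -> str:
    """Assemble a marker-gene prompt for an LLM annotator."""
    return (f"Identify the cell type of cluster {cluster} "
            f"given these marker genes: {', '.join(markers)}.")

# Illustrative scores (e.g. -log10 adjusted p-values) for 20 genes.
scores = {f"GENE{i}": 20.0 - i for i in range(20)}
markers10 = top_markers(scores, n=10)   # top 10, not top 20
prompt = build_prompt("c3", markers10)
```

Switching `n=10` to `n=20` is the only change needed to reproduce the comparison Hou & Ji ran; per their results, the shorter list is the safer default.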

Choosing the Right Workflow

Depending on your timeline and goals, there are three common levels of investment for applying scGPT.[4] 

Rapid exploration
Time-scale: a few hours
If you only need a quick sense of what cell populations are present, start with the off-the-shelf, zero-shot version of scGPT. Generate embeddings for all cells, run a lightweight clustering algorithm such as Leiden or Louvain, and visualise the result in UMAP. At this stage you have an unlabeled “atlas” that already captures transcriptional neighbourhoods. To add biological meaning without slowing yourself down, pass the top ten marker genes for each cluster into GPT-4 or a rules-based tool like CellTypist. The combination gives you provisional labels that are “good enough” for brainstorming, figure drafts, or deciding whether a new dataset is worth deeper investment.

Detailed atlas construction or clinical assay development
Time-scale: a few days
When accuracy requirements rise—because you are building a tissue atlas for publication, validating a diagnostic panel, or integrating multiple donors—you will benefit from task-specific fine-tuning. Select a representative subset of cells with reliable labels (even a few thousand is enough), fine-tune scGPT for five to ten epochs on a single A100 or comparable GPU, and evaluate on held-out donors to confirm that the model generalises. The fine-tuned classifier markedly improves recall of rare or subtle subtypes (e.g., exhausted T cells or disease-specific macrophages) while preserving the speed of the backbone. For additional quality assurance, run GPT-4 on any clusters that remain ambiguous; discrepancies often highlight mislabeled training examples or batch-specific artefacts.
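The loop structure described above (train for a few epochs on labelled cells from one donor, evaluate on a held-out donor) can be sketched without the actual scGPT API. As a hedged stand-in for fine-tuning the backbone, this trains a small softmax head on synthetic "embeddings"; the donor split and epoch loop mirror the real workflow.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(z: np.ndarray) -> np.ndarray:
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fine_tune(X, y, n_classes: int, epochs: int = 10, lr: float = 0.1):
    """Train a linear softmax head: one full pass over the labelled
    cells per epoch, exactly as the epoch definition above describes."""
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(epochs):
        p = softmax(X @ W)
        onehot = np.eye(n_classes)[y]
        W -= lr * X.T @ (p - onehot) / len(X)
    return W

# Synthetic embeddings for two donors with 3 separable cell types.
centers = rng.normal(scale=3.0, size=(3, 16))
def make_donor(n):
    y = rng.integers(0, 3, n)
    return centers[y] + rng.normal(size=(n, 16)), y

X_train, y_train = make_donor(600)  # donor A: labelled training cells
X_test, y_test = make_donor(200)    # donor B: held-out evaluation
W = fine_tune(X_train, y_train, n_classes=3, epochs=10)
acc = (softmax(X_test @ W).argmax(1) == y_test).mean()
```

Evaluating on a donor the model never saw is the key discipline here: high training accuracy with poor held-out accuracy is the overfitting risk flagged in the comparison table earlier.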

High-stakes discovery projects
Time-scale: several weeks
Projects that aim to uncover novel biology—such as identifying drug-responsive cell states, linking single-cell profiles to clinical endpoints, or integrating multi-omics layers—demand a more robust, ensemble-driven pipeline. Begin by fine-tuning scGPT as above, but also train or evaluate complementary models like scBERT or scVI to capture different feature representations. Harmonise embeddings across donors and modalities with tools such as Harmony or Scanorama, then iteratively refine cluster definitions through expert review. Incorporate orthogonal data types (spatial transcriptomics, TCR/BCR sequences, chromatin accessibility) and rerun the models to test stability. Throughout the cycle, schedule regular cross-validation checkpoints and document each decision so the workflow remains auditable—essential for clinical or regulated environments. The result is a consensus annotation that balances algorithmic power, technical reproducibility, and domain expertise, giving you confidence to push findings toward translational or therapeutic applications.

Conclusion

Whichever path you choose – quick prototype or rigorous clinical pipeline – the key is balancing model power with biological context. That’s where expert support can make all the difference. Foundation models like scGPT are redefining what “baseline” means in single-cell analysis. Yet, as the 2025 zero-shot study reminds us, bigger isn’t always better.[4] Careful benchmarking still beats blind trust. Pairing a well-chosen gene list with the right level of model adaptation is the fastest path to reliable annotations and, ultimately, biological insight.

At Bridge Informatics, we live at the intersection of bioinformatics expertise, clinical data fluency, and cutting-edge AI engineering. That makes us uniquely equipped to help R&D teams in pharma and biotech unlock the full potential of models like scGPT in their own pipelines.

Want some help applying scGPT to your dataset? Let’s build models that actually model biology. Click here to schedule a free introductory call with a member of our team. We understand that deploying a foundation model isn’t just about running code. It’s about fitting the tool to the biological question, the data characteristics, and the regulatory context. Whether you’re working with rare disease cohorts, spatial transcriptomics, or high-throughput drug screens, our team knows how to adapt these models to maximize biological insight while minimizing operational friction.

Specifically, we bring:

  • Deep domain knowledge in immunology, oncology, and neuroscience, which lets us guide model fine-tuning and gene set design with biological intuition—not just metrics.
  • End-to-end pipeline engineering, from preprocessing FASTQs to GPU-accelerated model training and scalable deployment in cloud environments.
  • Experience harmonizing data from multiple modalities and sources, ensuring that embeddings from scGPT and similar models are robust across donors, sites, and assays.
  • Regulatory-aware workflows, with documented QC, audit trails, and interpretability baked in from the start—especially important for clinical assay development.

In short, we help you move from “this looks promising” to “this changes our decision-making.” If you’re considering applying scGPT or other LLMs to your single-cell datasets and want to accelerate the journey, schedule a free consultation with one of our experts.

References

  1. Cui H. et al. “scGPT: toward building a foundation model for single-cell multi-omics using generative AI.” Nat. Methods (2024).
  2. scGPT documentation – “Fine-tuning on Pre-trained Model for Cell-type Annotation” tutorial (accessed May 2025).
  3. Hou W. & Ji Z. “Assessing GPT-4 for cell type annotation in single-cell RNA-seq.” Nat. Methods (2024).
  4. Kedzierska K.Z. et al. “Zero-shot evaluation reveals limitations of single-cell foundation models.” Genome Biology (2025).

Originally published by Bridge Informatics. Reuse with attribution only.