Part 2: Modern Strategies for scRNA-seq Integration

Part 2: Modern Strategies for scRNA-seq Integration – Plug-and-Map and the Future of Scalable Reference Mapping

This article is the second in our series on the future of single-cell integration. In Part 1, we explored the current toolkit and how to choose between classical integration methods. Here, we look ahead, examining plug-and-map workflows that allow teams to build stable references once and map new data into them continuously.

Introduction

When you start working with single-cell data, the question of integration comes up fast. In your first integration project, you may run Seurat’s anchor workflow or Harmony, align multiple public datasets, and celebrate your clean UMAP.

But as the datasets pile up (internal projects, public atlases, collaborators’ data), a new problem emerges: every time you add data, you have to redo everything. There is an easier way. In this article you will learn about reference mapping, or “plug-and-map,” a new paradigm for scalable single-cell integration.

Why reference mapping matters

Traditional integration is a full re-optimization problem. You take all datasets, compute anchors or mutual nearest neighbors, and learn a shared embedding. That’s powerful, but computationally expensive and brittle to change. Add one new dataset and the alignment shifts.

Reference mapping flips that logic. Instead of re-integrating, you:

  1. Build a reference atlas once, harmonizing all your trusted datasets.
  2. Compress or model that reference embedding.
  3. Map new query datasets into that fixed space quickly and reproducibly, without touching the reference.
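
To make the three steps concrete, here is a minimal sketch that uses plain PCA as a stand-in for the reference model. The function names (build_reference, map_query) and the simulated count matrices are hypothetical, chosen only for illustration; real tools store richer models, but the build-once, map-many pattern is the same.

```python
import numpy as np

def build_reference(ref_counts, n_components=2):
    """Fit the reference once: log-transform, center, and keep PCA loadings."""
    logged = np.log1p(ref_counts)
    gene_means = logged.mean(axis=0)
    # SVD of the centered matrix gives the principal axes (rows of vt).
    _, _, vt = np.linalg.svd(logged - gene_means, full_matrices=False)
    loadings = vt[:n_components].T                 # genes x components
    embedding = (logged - gene_means) @ loadings   # cells x components
    # The compressed reference is just these stored parameters.
    return {"gene_means": gene_means, "loadings": loadings, "embedding": embedding}

def map_query(reference, query_counts):
    """Project query cells with the stored parameters; nothing is refit."""
    logged = np.log1p(query_counts)
    return (logged - reference["gene_means"]) @ reference["loadings"]

# Simulated counts (cells x genes), for illustration only.
rng = np.random.default_rng(0)
ref_counts = rng.poisson(5.0, size=(200, 30)).astype(float)
query_counts = rng.poisson(5.0, size=(50, 30)).astype(float)

reference = build_reference(ref_counts)
before = reference["embedding"].copy()
query_embedding = map_query(reference, query_counts)

# The reference embedding is untouched by mapping.
print(query_embedding.shape, np.array_equal(before, reference["embedding"]))
```

Because map_query only reads the stored means and loadings, mapping a second or third query later gives coordinates in exactly the same space as the first.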

This simple idea unlocks scalability, standardization, and reproducibility for scRNA-seq at scale. It’s why modern atlasing efforts (Human Cell Atlas, Tabula Sapiens, etc.) are converging on reference-based approaches.

The leading players

1. Symphony (linear mixture-model mapping)

Developed by the Raychaudhuri Lab, Symphony takes a Harmony-integrated reference, learns mixture model parameters for each cluster, and stores a compressed reference object.

When you map a new query:

  • The query is projected into the reference PCA space.
  • Each query cell’s batch effect is estimated relative to stored reference cluster centroids.
  • The cell is then adjusted and placed into the shared embedding, without re-integrating the reference.
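
The mapping steps above can be sketched in a few lines. This is a deliberately simplified stand-in, not Symphony's actual algorithm: it soft-assigns query cells to stored centroids and subtracts one estimated shift per cluster, whereas the published method fits a regularized mixture-of-experts regression. The function names, the sigma parameter, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def soft_assign(z, centroids, sigma=1.0):
    """Soft cluster membership from squared distances to the stored centroids."""
    d2 = ((z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    logits = -d2 / sigma
    logits -= logits.max(axis=1, keepdims=True)    # numerical stability
    weights = np.exp(logits)
    return weights / weights.sum(axis=1, keepdims=True)

def map_with_correction(z_query, centroids, sigma=1.0):
    """Estimate one shift per cluster for the query batch and remove it."""
    r = soft_assign(z_query, centroids, sigma)     # cells x clusters
    mass = r.sum(axis=0)[:, None] + 1e-8
    # Membership-weighted mean of the query cells, minus the reference centroid.
    shifts = (r.T @ z_query) / mass - centroids    # clusters x dims
    # Each cell is corrected by its membership-weighted mix of cluster shifts.
    return z_query - r @ shifts, shifts

rng = np.random.default_rng(0)
centroids = np.array([[0.0, 0.0], [10.0, 0.0]])    # stored with the reference
batch_shift = np.array([1.5, -1.0])                # hypothetical query batch effect
labels = np.repeat([0, 1], 100)
z_query = centroids[labels] + batch_shift + rng.normal(scale=0.2, size=(200, 2))

corrected, shifts = map_with_correction(z_query, centroids)
print(np.round(shifts, 2))
```

Note that only the query coordinates change; the centroids, and therefore the reference, stay fixed.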

Symphony is fast (can map 10^5–10^6 cells in minutes) and accurate for well-matched data. For PBMCs, the authors reported >97% label transfer accuracy across 10x chemistries.

Symphony performs best for standardized datasets (e.g. 10x PBMCs) where reference and query share cell types and technologies, but the linear model may underperform on nonlinear batch effects or very divergent disease states.

2. scArches (deep generative model mapping)

scArches generalizes mapping through transfer learning. It extends scVI, a deep variational autoencoder trained on the reference, and fine-tunes model layers to adapt to each query dataset.

Key innovations:

  • Reference structure is preserved (frozen layers).
  • Query data are adapted via a few epochs of fine-tuning.
  • Supports multimodal extensions (totalVI, MultiVI).
  • Allows “out-of-reference” detection: flagging cells that don’t match known types.

The method scales to millions of cells and naturally handles nonlinear batch effects.
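
The idea of frozen reference weights plus a small set of query-specific parameters can be illustrated with a toy linear encoder. Everything here is an assumption made for the sketch: the encoder is linear rather than a neural network, and the loss is a surrogate (match the query's mean latent position to the reference's) rather than the model's actual training objective. The point is only to show which parameters move during fine-tuning and which do not.

```python
import numpy as np

rng = np.random.default_rng(1)
n_genes, n_latent = 20, 3

# Stand-in for a pre-trained reference encoder: shared weights W
# (the reference batch offset is taken as zero).
W = rng.normal(size=(n_latent, n_genes)) / np.sqrt(n_genes)
ref_x = rng.normal(size=(300, n_genes))
ref_z = ref_x @ W.T                       # reference latent space
target_mean = ref_z.mean(axis=0)

# Query batch: same distribution plus a constant technical shift in gene space.
query_x = rng.normal(size=(200, n_genes)) + 2.0

W_frozen = W.copy()                       # kept only to verify W is never updated
b_query = np.zeros(n_latent)              # the only trainable parameter

learning_rate = 0.5
for epoch in range(20):
    query_z = query_x @ W.T + b_query
    # Gradient of 0.5 * ||mean(query_z) - target_mean||^2 with respect to b_query.
    grad = query_z.mean(axis=0) - target_mean
    b_query -= learning_rate * grad

query_z = query_x @ W.T + b_query
print(np.round(query_z.mean(axis=0) - target_mean, 4))
```

Since W never changes, the reference latent space computed before fine-tuning is still valid afterwards, which is what lets many queries be mapped independently.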

Best for: large or heterogeneous datasets (multi-platform, multi-omic, disease vs. healthy).
Limitations: more complex setup; GPUs are strongly recommended for training; hyperparameter tuning matters.

3. Azimuth and other hosted references

The Satija Lab’s Azimuth platform is a practical application of reference mapping.
It packages pre-built Seurat references for common tissues (PBMCs, lung, skin, brain) and exposes a web interface. Upload a Seurat object and get back standardized annotations, reference-aligned embeddings, and harmonized metadata.

Azimuth is effectively reference mapping as a service, ideal for common tissues or quick baselines. You can also download the same reference models and use them locally via Seurat.

Choosing between them

Goal | Recommended approach | Notes
Fast, reproducible mapping of common tissues | Symphony or Azimuth | Best when your queries are similar to the reference (e.g. PBMCs, healthy tissues).
Cross-platform or disease-state mapping | scArches (scVI-based) | Better at handling nonlinear effects and detecting new states.
Continually updating reference | treeArches | Allows dynamic extension of the reference hierarchy.
Cross-species mapping | scVI/scArches or scGPT-based methods | Handles large expression-space shifts.

PBMCs: the canonical case study

PBMCs are the perfect sandbox for plug-and-map. Because their cell types are stable and abundant in public data, almost every reference mapping paper tests on them. Several ready-made PBMC references exist:

  • Azimuth PBMC 20k Reference – built from 10x 3′ v1–v3 datasets; downloadable from satijalab.org/azimuth.
  • Symphony PBMC Reference – same base data; R package on CRAN.
  • scArches tutorials – include PBMC mapping examples using scVI latent spaces.

If your goal is to integrate internal PBMC samples with public data (e.g. healthy vs. autoimmune), these references let you skip re-integration and directly compare across cohorts.

When mapping fails, and what to watch

  • Coverage gaps: reference must contain your cell types. If not, new cells will be forced into the nearest known cluster.
  • Batch complexity: cross-species or platform gaps may break linear mappers; use model-based approaches.
  • Annotation bias: label transfer can propagate errors from the reference; inspect marker genes in mapped queries.
  • Version drift: always freeze your reference build and record version metadata to keep analyses reproducible.
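
One way to guard against coverage gaps is to flag query cells that land far from any reference cell instead of trusting their transferred label. The sketch below is an illustrative heuristic, not the method used by any specific tool: the helper name, the choice of k = 10, the 99th-percentile cutoff, and the simulated embeddings are all assumptions.

```python
import numpy as np

def kth_neighbor_distance(points, reference, k, skip_self=False):
    """Distance from each point to its k-th nearest reference cell."""
    d = np.sqrt(((points[:, None, :] - reference[None, :, :]) ** 2).sum(axis=2))
    d.sort(axis=1)
    # When points are the reference itself, column 0 is the zero self-distance.
    return d[:, k] if skip_self else d[:, k - 1]

# Simulated 2-D embeddings, for illustration only.
rng = np.random.default_rng(2)
reference = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(150, 2)),
    rng.normal(loc=[6.0, 0.0], scale=0.5, size=(150, 2)),
])
query = np.vstack([
    rng.normal(loc=[0.0, 0.0], scale=0.5, size=(40, 2)),   # covered by the reference
    rng.normal(loc=[3.0, 8.0], scale=0.5, size=(20, 2)),   # absent from the reference
])

k = 10
# Calibrate the cutoff on the reference's own neighbor distances.
ref_dist = kth_neighbor_distance(reference, reference, k, skip_self=True)
threshold = np.quantile(ref_dist, 0.99)
flagged = kth_neighbor_distance(query, reference, k) > threshold

print("flagged among covered cells:", int(flagged[:40].sum()))
print("flagged among novel cells:  ", int(flagged[40:].sum()))
```

Flagged cells are candidates for manual review with marker genes rather than automatic relabeling.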

Looking ahead

Reference mapping is evolving toward foundation models: pre-trained universal embeddings (e.g. scGPT, scFoundation, SCimilarity) trained on tens of millions of cells across tissues. These aim to replace custom references with general models where “mapping” becomes a simple embedding query.

But for now, Symphony and scArches remain the most mature, open, and practical plug-and-map frameworks. If you’re integrating PBMCs or similar tissues, start there; they’ll likely become the foundation of how we build and reuse single-cell atlases in the next decade.

Originally published by Bridge Informatics. Reuse with attribution only.