Part 1: Modern Strategies for scRNA-seq Integration – Foundations and Decision Framework

This article is the first in a two-part series exploring the evolution of single-cell integration. Here, we lay the groundwork, reviewing the major integration strategies, when to use them, and how they fit into today’s experimental workflows. In Part 2, we build on this foundation to discuss the emerging plug-and-map paradigm that is reshaping scalable atlas construction.

Introduction

Public single-cell RNA-seq data are now abundant, and the question isn’t whether to integrate them, but how. As reference atlases like the Human Cell Atlas and Tabula Sapiens expand, integrating external data with in-house experiments has become central to reproducible biology. Yet, choosing the right integration method depends on your goal, data scale, and how often you expect to add new samples.

This blog article summarizes the current toolkit for scRNA-seq data integration and practical decision points for researchers handling multi-source datasets.

Why Integration Matters

When datasets originate from different labs, instruments, or chemistries, they carry batch effects that obscure true biological signals. Integration aims to align cells into a shared representation while preserving meaningful biological variation. Too little correction and you see batch-driven clusters; too much, and you risk erasing cell-type differences.

The challenge is achieving the right balance, and doing so at scale.

Core Integration Strategies

1. Anchor-based alignment (Seurat v4/v5)

Seurat’s canonical correlation analysis (CCA) and reciprocal PCA (RPCA) workflows find anchors: pairs of cells across datasets that share similar expression patterns. The latest versions improve speed and scalability but remain sensitive to dataset imbalance and require substantial memory for >500k cells.
When to use: You want interpretability, integration within the Seurat ecosystem, and downstream label transfer.

2. Latent-space correction (Harmony)

Harmony operates on PCA embeddings, iteratively adjusting them to remove batch variation while retaining biological structure. It is computationally light and works well even when datasets differ in cell-type composition.
When to use: You have moderate-to-large datasets (10k–1M cells) and need robust batch correction without deep learning overhead.
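A minimal sketch of this workflow in Python, using Scanpy’s wrapper around the harmonypy package (assuming an AnnData object named adata with a “batch” column in .obs; variable and key names are illustrative):

    import scanpy as sc

    # Standard preprocessing ahead of Harmony: normalize, log-transform,
    # select highly variable genes per batch, then run PCA
    sc.pp.normalize_total(adata, target_sum=1e4)
    sc.pp.log1p(adata)
    sc.pp.highly_variable_genes(adata, n_top_genes=2000, batch_key="batch")
    sc.pp.pca(adata, n_comps=50)

    # Harmony iteratively adjusts the PCA embedding to remove batch variation;
    # the corrected embedding lands in adata.obsm["X_pca_harmony"]
    sc.external.pp.harmony_integrate(adata, key="batch")

    # Downstream steps use the corrected embedding
    sc.pp.neighbors(adata, use_rep="X_pca_harmony")
    sc.tl.umap(adata)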

3. Mutual Nearest Neighbors (fastMNN, BBKNN)

fastMNN (R) and BBKNN (Python) identify nearest neighbors across datasets in a lower-dimensional space and correct local batch bias. These methods are fast, simple, and effective, though they struggle when there’s minimal cell-type overlap.
When to use: You want transparent, graph-based correction integrated into Scanpy or Bioconductor workflows.
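As a rough sketch, BBKNN drops into a standard Scanpy workflow by replacing the neighbor-graph step (again assuming a normalized, log-transformed adata with a “batch” column in .obs; requires the bbknn package):

    import scanpy as sc

    # PCA on the preprocessed expression matrix
    sc.pp.pca(adata, n_comps=50)

    # BBKNN replaces sc.pp.neighbors: each cell's nearest neighbors are drawn
    # from every batch, which corrects local batch bias directly in the graph
    sc.external.pp.bbknn(adata, batch_key="batch")

    # The batch-balanced graph feeds straight into UMAP and clustering
    sc.tl.umap(adata)
    sc.tl.leiden(adata)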

4. Matrix factorization (LIGER)

LIGER decomposes datasets into shared and dataset-specific factors using integrative non-negative matrix factorization (iNMF). It is powerful when you expect both shared and dataset-unique signals, for instance when integrating healthy and diseased tissues.
When to use: You need interpretable latent factors capturing both shared and condition-specific variation.
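To make “shared and dataset-specific” concrete: for expression matrices E_i (one per dataset), iNMF seeks non-negative factors that approximately minimize

    \sum_i \| E_i - H_i (W + V_i) \|_F^2 + \lambda \sum_i \| H_i V_i \|_F^2

where W holds gene factors shared across datasets, V_i holds dataset-specific gene factors, H_i gives each cell’s loadings on the factors, and λ tunes how strongly dataset-specific signal is penalized. This is a simplified sketch of the objective described in the LIGER iNMF literature, not a full specification.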

5. Deep generative modeling (scVI, scANVI, scArches)

Variational autoencoders (VAEs) model count data directly, learning a low-dimensional latent representation that separates technical and biological variation.

  • scVI integrates unlabeled data.
  • scANVI adds label supervision to refine embeddings and enable label transfer.
  • scArches extends these models for incremental integration, mapping new datasets into an existing latent space without retraining from scratch.

When to use: You handle large or heterogeneous public datasets, or expect to integrate continuously (e.g., new patient cohorts, new public atlases).
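A minimal sketch of building an scVI reference with scvi-tools (assuming adata_ref contains raw counts with a “batch” column in .obs; the latent dimensionality and file path are illustrative assumptions):

    import scvi

    # scVI models raw counts directly; register the batch covariate
    scvi.model.SCVI.setup_anndata(adata_ref, batch_key="batch")
    model = scvi.model.SCVI(adata_ref, n_latent=30)
    model.train()

    # The learned latent space is batch-corrected and feeds neighbors, UMAP, and clustering
    adata_ref.obsm["X_scVI"] = model.get_latent_representation()

    # Save the trained model so new datasets can later be mapped onto it
    model.save("scvi_reference", overwrite=True)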

6. Reference mapping (Symphony, Azimuth)

Instead of re-integrating everything, Symphony and Azimuth map new cells onto a pre-built reference atlas. This approach is fast, consistent, and ideal when a suitable reference (such as a PBMC atlas) already exists.
When to use: You need to standardize annotations across experiments or labs.
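Symphony and Azimuth themselves are R-based, but the reference-mapping idea can be sketched in Python with scvi-tools’ scArches-style query mapping, reusing the saved reference from the previous example (paths, keys, and training settings are assumptions for illustration, not the Symphony or Azimuth APIs):

    import scvi

    # Align the query dataset to the reference model's genes and covariates
    scvi.model.SCVI.prepare_query_anndata(adata_query, "scvi_reference")
    query_model = scvi.model.SCVI.load_query_data(adata_query, "scvi_reference")

    # Only the new query-specific weights are trained ("architecture surgery");
    # the reference latent space itself stays fixed
    query_model.train(max_epochs=100, plan_kwargs={"weight_decay": 0.0})

    # Query cells now share the reference's latent space, so annotations can be transferred
    adata_query.obsm["X_scVI"] = query_model.get_latent_representation()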

Decision Framework

Large-scale benchmarking efforts, such as the Open Problems Batch Integration challenge, have shown that no single integration method is best for every scenario. Performance depends heavily on data type, batch structure, and downstream goals. Methods like Seurat RPCA, Harmony, and scVI consistently perform well for PBMC-like datasets, but differ in scalability, runtime, and preservation of biological variance. Here, we provide some guidelines for a decision framework:

1. Integration goal

One-off co-embedding: use Seurat RPCA or Harmony.
Continuous addition of new data: choose scArches or Symphony.

2. Data complexity

Similar chemistries, balanced composition: Seurat or Harmony suffice.
Heterogeneous, multi-lab data: scVI/scANVI perform best.

3. Computational scale

For millions of cells, prefer Harmony or scVI (GPU-accelerated).

4. Biological heterogeneity

Preserve condition-specific effects using LIGER or by tuning Seurat’s anchor selection.

Conclusion

Integration methods are converging toward model-based approaches that treat reference atlases as reusable latent spaces. Tools like scvi-hub, Azimuth, and scArches now support “plug-and-map” workflows, integrating data as easily as adding new samples to a sequencing run. For more information, see our blog post on plug-and-map workflows[1], covered in Part 2 of this series.

Looking Ahead

In 2025, the question is less “Which algorithm removes batch effects best?” and more “How do we maintain a consistent reference framework as data keeps growing?” Want help determining the optimal integration strategy for your dataset? Click here to schedule a free introductory call with a member of our team.

Originally published by Bridge Informatics. Reuse with attribution only.
