New Generative AI Model Improves Sample & Cell Level Representations for scRNA-seq Analysis

by Josh Stolz and Haider Hassan


This article explores the capabilities of scPoli, a semi-supervised conditional generative model, in addressing challenges in single-cell data integration. From robust annotation propagation to multi-scale classification of cells and samples, read on to discover how scPoli reshapes the landscape of single-cell research, demonstrating its prowess in handling diverse datasets and supporting experimental design in integration workflows.

Challenges Associated with the Integration of Single-Cell RNA-seq Datasets

Single-cell RNA sequencing (scRNA-seq) is a technique that allows researchers to examine transcriptional changes of individual cells within a population. In contrast to bulk RNA sequencing, scRNA-seq offers insights into cellular heterogeneity, identification of rare cell types, and an understanding of the dynamics of gene expression at the single-cell level.

The applications of scRNA-seq are broad, spanning fields such as developmental biology, immunology, neurobiology, and cancer research. The technology has significantly advanced our understanding of cellular heterogeneity in diseases such as cancers, enabling the discovery of intricate gene expression patterns that would otherwise be masked in traditional bulk RNA sequencing analyses. A major drawback of scRNA-seq, however, is that studies often rely on a relatively small number of samples and cells. These limitations stem from the high cost of sample preparation and sequencing, and can be compounded by the loss of cells during isolation and barcoding. As a result, it is tempting to merge studies to increase data resolution and statistical power. Unfortunately, batch effects arising from technical variation often make merging datasets impractical. In addition, cell type annotations are often specific to each experiment or dataset, so transferring cell labels during data integration can be challenging.

In a recent study published in Nature Methods, De Donno et al. (2023) describe a new technique called single-cell population level integration (scPoli). scPoli uses a generative artificial intelligence model to learn sample- and cell-level representations during scRNA-seq data integration, adjusting for batch effects and making cell annotation transfer possible.

Single-cell population level integration (scPoli) is a one-step solution for population-level integration of scRNA-seq datasets

Batch effects in scRNA-seq datasets are variations that arise from technical or experimental factors, such as sample handling and lot-to-lot variation in reagents, rather than from true biological signal. During scRNA-seq data integration, scPoli uses non-linear models to remove batch effects while using conditional variables to preserve true biological signals. It achieves this in two ways. First, scPoli replaces the one-hot encodings of the conditional variables with continuous embeddings of limited dimensionality, which yields a compact representation of each sample. Second, scPoli uses its generative model to create cell prototypes that can later be used for label transfer. Together, these accomplish the difficult task of integrating datasets while maintaining the biological signals necessary for research.
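To make the difference between one-hot encodings and low-dimensional embeddings concrete, here is a minimal sketch in plain Python. It is not scPoli's implementation: the function names, the sample count of 1000, and the embedding size of 5 are assumptions chosen for illustration, and the embedding values are random placeholders standing in for values a model would learn during training.

```python
import random


def one_hot(index, n_conditions):
    """Encode a condition (e.g. a sample ID) as a one-hot vector.

    The vector length grows with the number of conditions.
    """
    return [1.0 if i == index else 0.0 for i in range(n_conditions)]


def make_embedding_table(n_conditions, dim, seed=0):
    """Build a lookup table with one short vector per condition.

    In a trained model these values would be learned; here they are
    random placeholders (an assumption for illustration only).
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)]
            for _ in range(n_conditions)]


n_samples = 1000      # hypothetical number of samples in an atlas
embedding_dim = 5     # hypothetical embedding size
table = make_embedding_table(n_samples, embedding_dim)

sample_index = 42
one_hot_vec = one_hot(sample_index, n_samples)
embedded_vec = table[sample_index]  # embedding lookup is just row selection

print(len(one_hot_vec))   # 1000
print(len(embedded_vec))  # 5
```

The one-hot vector needs one dimension per sample, while the embedding stays the same size however many samples are added, and two samples can end up close together or far apart in the embedding space.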

Several benchmarking analyses compared scPoli against other single-cell integration methods and found it superior or comparable to standards in the field. In data integration, scPoli outperformed its leading competitor (scANVI) by 5.06% on standard metrics. On the macro-averaged F1 score, which combines the precision and recall of a classification model and averages them across classes, scPoli performed similarly to Seurat (v3) and again outperformed scANVI.
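For readers unfamiliar with the macro-averaged F1 score, the sketch below computes it from scratch in plain Python. The cell type labels and predictions are invented for illustration and are not taken from the study.

```python
def macro_f1(y_true, y_pred):
    """Compute the unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Hypothetical annotations: the single rare NK cell is mislabeled as a T cell.
y_true = ["T cell", "T cell", "T cell", "T cell", "B cell", "NK cell"]
y_pred = ["T cell", "T cell", "T cell", "T cell", "B cell", "T cell"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(round(accuracy, 3))                  # 0.833
print(round(macro_f1(y_true, y_pred), 3))  # 0.63
```

Because every class contributes equally to the average, missing a rare cell type lowers the macro F1 score far more than it lowers plain accuracy, which is why the metric suits cell type classification.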

Label Transfer and Mapping: scPoli’s Adaptive Approach to Query-to-Reference Integration

For annotation transfer during scRNA-seq data integration, scPoli uses cell prototypes. A cell prototype is a condensed representation summarizing the key features of a specific cell type, and is computed by averaging the characteristics of all data points associated with that cell type. scPoli uses these prototypes to transfer labels and enhance the integration of data. To preserve important biological information, scPoli adds a “prototype loss” term during training, which encourages the model to place the representation of each cell closer to its prototype. To classify cells without annotations, scPoli compares their characteristics to the cell prototypes and assigns the label of the closest prototype. Importantly, scPoli’s use of prototypes allows it to expand an initial reference atlas with new cell types from a labeled query without having to retrain the reference model, which sets it apart from existing methods.
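The prototype idea described above can be sketched in a few lines of plain Python. This is an illustration under stated assumptions rather than scPoli's code: the two-dimensional vectors stand in for a cell's learned representation, the function names are hypothetical, and Euclidean distance is assumed as the measure of closeness.

```python
import math


def compute_prototypes(embeddings, labels):
    """Average the vectors of all cells sharing a label."""
    sums, counts = {}, {}
    for vec, label in zip(embeddings, labels):
        if label not in sums:
            sums[label] = [0.0] * len(vec)
            counts[label] = 0
        sums[label] = [s + v for s, v in zip(sums[label], vec)]
        counts[label] += 1
    return {label: [s / counts[label] for s in sums[label]]
            for label in sums}


def nearest_prototype(vec, prototypes):
    """Return the label whose prototype is closest (Euclidean distance)."""
    return min(prototypes, key=lambda label: math.dist(vec, prototypes[label]))


# Toy reference data (invented values).
ref_embeddings = [[0.0, 0.0], [0.0, 2.0], [4.0, 4.0], [6.0, 4.0]]
ref_labels = ["T cell", "T cell", "B cell", "B cell"]
prototypes = compute_prototypes(ref_embeddings, ref_labels)

print(nearest_prototype([1.0, 1.0], prototypes))  # T cell
print(nearest_prototype([5.0, 5.0], prototypes))  # B cell

# A labeled query brings a new cell type: add its prototype,
# leaving the existing reference prototypes untouched.
query_embeddings = [[10.0, 0.0], [10.0, 2.0]]
query_labels = ["NK cell", "NK cell"]
prototypes.update(compute_prototypes(query_embeddings, query_labels))

print(nearest_prototype([9.0, 1.0], prototypes))  # NK cell
```

Adding a cell type only requires computing one more average, which mirrors why a reference can be extended from a labeled query without retraining.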


scPoli has been used to integrate atlas-level single-cell datasets consisting of millions of cells across thousands of samples. During integration, scPoli accounts for technical variation so that true biological signals and cell type annotations are preserved. Importantly, by using cell type prototypes, scPoli assigns labels to cells of unknown annotation, thereby increasing the number of annotated cells.

The accurate integration of single-cell datasets using scPoli promotes biomarker discovery by increasing the robustness and statistical power of gene expression analysis, and supports a more comprehensive understanding of cellular heterogeneity and the subtle variations associated with specific conditions and treatments. By leveraging the power of scPoli and scRNA-seq, researchers can uncover novel biomarkers that advance our understanding of diseases and inform the next stage of personalized medicine strategies.

Outsourcing Bioinformatics Analysis: How Bridge Informatics Can Help

Data scientists at Bridge Informatics (BI) prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. The generation, storage, and analysis of biological data are faster and more accessible than ever before. From pipeline development and software engineering to deploying your existing bioinformatic tools, Bridge Informatics can help you at every step of your research journey.

As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis. Click here to schedule a free introductory call with a member of our team.

Josh Stolz, Data Scientist

Josh, Data Scientist & Content Writer for Bridge Informatics, most enjoys work that sits at the intersection of complex molecular biology and genomics. He is an expert at identifying biomarkers, developing statistical methods, and next generation sequencing.

Before joining BI, Josh worked at the Johns Hopkins Lieber Institute for Brain Development, where he used RNA-seq data to illuminate the underlying causes of schizophrenia. He also worked at AbbVie, where he used genomic technologies to bolster clinical trial portfolios surrounding eye-related treatments.

Josh received a BS in Biology from Indiana State University and an M.Sc in Bioinformatics. If he’s away from his desk, you will likely find Josh running along the Baltimore harbor.

Haider M. Hassan, Data Scientist, Bridge Informatics

Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.

Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks. If you’re interested in reaching out, please email [email protected] or [email protected]
