Introduction to Differential Expression Analysis
Understanding how gene expression changes between different states, such as disease versus healthy conditions, is a fundamental aspect of biomedical research. Traditional targeted methods, like real-time PCR or ELISA, focus on specific genes or proteins and often rely on straightforward statistical tests to measure fold changes or calculate p-values. However, these approaches only provide a narrow view of the molecular landscape.
Over the past decade, high-throughput technologies like RNA sequencing (RNA-seq) have revolutionized gene expression analysis, enabling an unbiased examination of the entire transcriptome. Unlike targeted methods focusing on a subset of genes, RNA-seq quantifies mRNA levels across all genes, offering a comprehensive view of gene expression changes in response to various conditions, such as diseases or environmental factors. The statistical methods for RNA-seq require complex models to handle multiple tests and variability across thousands of genes and biological replicates.
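To make the multiple-testing problem concrete, here is a minimal sketch of the Benjamini-Hochberg adjustment, one widely used way to control the false discovery rate when thousands of genes are tested at once. The choice of this particular correction and the p-values below are illustrative assumptions, not something prescribed by any specific study discussed here.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original gene order."""
    n = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(n), key=lambda i: p_values[i])
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(n, 0, -1):
        idx = order[rank - 1]
        candidate = p_values[idx] * n / rank
        running_min = min(running_min, candidate)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical raw p-values for five genes.
raw_p = [0.001, 0.008, 0.039, 0.041, 0.60]
adjusted_p = benjamini_hochberg(raw_p)
```

In this toy case the third and fourth genes have raw p-values below 0.05 but adjusted values just above it, which is exactly the kind of borderline call that the correction is meant to temper.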
With the advent of single-cell technologies, such as single-cell RNA sequencing (scRNA-seq), the analysis becomes even more complex. These technologies provide whole transcriptome data at a single-cell resolution, revealing cellular heterogeneity and enabling a deeper understanding of how different cell types respond to specific conditions. However, this complexity also introduces challenges in selecting the most appropriate methods for differential expression (DE) analysis.
This blog explores differential gene expression analysis methods for bulk RNA sequencing and then surveys the main approaches for scRNA-seq. By the end, you’ll have the insights you need to choose the most suitable method for your future experiments.
Navigating Differential Expression in Bulk RNA-Seq: Insights on DESeq2 and edgeR
Bulk RNA sequencing (RNA-seq) measures the combined RNA from an entire tissue or cell population, providing an average gene expression profile. Differential gene expression (DGE) analysis in bulk RNA-seq identifies genes with significant expression differences between conditions, such as treated versus untreated or disease versus healthy states. The typical workflow for bulk RNA-seq analysis involves aligning reads to a reference genome and quantifying the reads mapped to each gene, which reflects its expression level. Normalization is applied to correct for differences in sequencing depth and RNA composition across samples, ensuring that observed changes are biologically relevant rather than technical artifacts. Statistical models, often based on a negative binomial distribution, are used to test for differential expression, with biological replicates enhancing the reliability of results.
DESeq2 and edgeR are two of the most widely used tools for DGE analysis in bulk RNA-seq. Both utilize negative binomial models but differ in their approaches to normalization and handling variability. edgeR uses the Trimmed Mean of M-values (TMM) for normalization and empirical Bayes methods to stabilize variability estimates, particularly for low-count genes. DESeq2 performs normalization using median-of-ratios size factors, fits a smooth trend of dispersion against mean expression, and shrinks gene-wise dispersion estimates toward that trend. For detecting differentially expressed genes, DESeq2 uses Wald tests or likelihood ratio tests, depending on the study design. DESeq2 is often preferred for its handling of low-count genes and its stabilized variability estimates, although the better choice depends on the dataset and design. However, while bulk RNA-seq is powerful for identifying general trends in gene expression, it lacks the resolution to detect cell-type-specific variations within multicellular tissues, as it averages signals from all cells.
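The idea behind size-factor normalization can be sketched in a few lines. The snippet below follows the median-of-ratios approach in plain Python; it is a simplified illustration rather than DESeq2's actual code, and the sample names and counts are made up.

```python
import math
from statistics import median


def size_factors(counts):
    """counts: dict of sample name -> list of raw counts (same gene order).

    Returns one size factor per sample using the median-of-ratios idea.
    """
    samples = list(counts)
    n_genes = len(counts[samples[0]])
    # Keep only genes with non-zero counts in every sample so logs are defined.
    usable = [g for g in range(n_genes)
              if all(counts[s][g] > 0 for s in samples)]
    # Log geometric mean of each usable gene across samples.
    log_geo_mean = {
        g: sum(math.log(counts[s][g]) for s in samples) / len(samples)
        for g in usable
    }
    # Size factor = median ratio of a sample's counts to the geometric means.
    return {
        s: math.exp(median([math.log(counts[s][g]) - log_geo_mean[g]
                            for g in usable]))
        for s in samples
    }


# Hypothetical counts: sample_b was sequenced about twice as deeply.
raw_counts = {
    "sample_a": [100, 200, 50, 0],
    "sample_b": [200, 400, 100, 10],
}
factors = size_factors(raw_counts)
normalized = {s: [c / factors[s] for c in raw_counts[s]] for s in raw_counts}
```

Dividing each sample's counts by its size factor puts the two samples on a common scale, so the first three genes end up with identical normalized values while the fourth gene still stands out.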
Differential Expression in Single-Cell RNA-Seq: Methods and Challenges
Single-cell RNA sequencing (scRNA-seq) represents a significant advancement by measuring RNA abundance at the individual cell level. In scRNA-seq differential expression analysis, the focus is on identifying gene expression changes within individual cell types, thereby offering a nuanced view of biological processes and capturing the heterogeneity within tissues. Different cell types may respond distinctly to the same perturbation, making this detailed approach essential. The computational workflow for scRNA-seq begins with preprocessing raw sequencing data, aligning reads to a reference genome or transcriptome, and quantifying gene expression levels. Quality control filters out low-quality cells based on criteria such as gene count, mitochondrial gene content, and read counts. Normalization is performed to adjust for differences in sequencing depth across cells, often using log-normalization or SCTransform. Linear and non-linear dimensionality reduction techniques like PCA, t-SNE, or UMAP project high-dimensional data into lower dimensions, while clustering algorithms like Louvain or Leiden group cells based on their gene expression profiles to identify distinct cell populations (if you are interested in automatic cell type annotation, read our previous blog).
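The quality control and log-normalization steps described above can be sketched as follows. This is a toy illustration in plain Python, not the code of any particular toolkit: the thresholds, the "MT-" prefix used to spot mitochondrial genes, and the example cells are all assumptions chosen for readability, and the scale factor of 10,000 is simply a common convention.

```python
import math


def passes_qc(cell, min_genes=200, max_mito_fraction=0.1):
    """cell: dict of gene name -> raw count for one cell.

    Thresholds are hypothetical; real values depend on tissue and protocol.
    """
    total = sum(cell.values())
    if total == 0:
        return False
    n_genes_detected = sum(1 for count in cell.values() if count > 0)
    # Assumes human-style mitochondrial gene names starting with "MT-".
    mito = sum(count for gene, count in cell.items() if gene.startswith("MT-"))
    return n_genes_detected >= min_genes and mito / total <= max_mito_fraction


def log_normalize(cell, scale_factor=10_000):
    """Scale counts to a common depth per cell, then apply log(1 + x)."""
    total = sum(cell.values())
    return {gene: math.log1p(count / total * scale_factor)
            for gene, count in cell.items()}


# Two made-up cells; min_genes is lowered so the toy example is meaningful.
healthy_cell = {"CD3E": 5, "ACTB": 14, "MT-CO1": 1}
stressed_cell = {"CD3E": 1, "ACTB": 2, "MT-CO1": 7}

kept = [c for c in (healthy_cell, stressed_cell) if passes_qc(c, min_genes=2)]
normalized_cells = [log_normalize(c) for c in kept]
```

Only the first cell survives the filter here, because 70% of the second cell's counts come from the mitochondrial gene.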
Differential expression (DE) analysis in scRNA-seq is crucial for revealing unique characteristics of cell types, or marker genes, and understanding the effects of different conditions at a single-cell level.
DE analysis in scRNA-seq involves two main approaches: identifying cell type markers and comparing conditions with single-cell resolution. After clustering cells based on gene expression, tools such as Seurat’s FindMarkers function detect differentially expressed genes between clusters. The function compares the expression levels of genes in one cluster against all others, highlighting genes that are specifically expressed in that cluster. Depending on the test selected, the method is either non-parametric (for example, the Wilcoxon rank-sum test) or assumes that gene expression follows a specific statistical distribution, such as the negative binomial, which accounts for the variability and overdispersion typically observed in RNA-seq data. It also assumes that clusters have sufficiently distinct gene expression patterns to yield unique markers and that observations (cells) are independent, so that each cell is treated as a replicate; we will refer to methods that make this assumption as single-cell methods. The resulting marker genes distinguish different cell types and subtypes, and researchers can use them to annotate clusters with likely cell types based on known gene functions from reference datasets or the literature (read our previous blog on methods to annotate cells). This process not only defines cell identities but also reveals insights into cellular heterogeneity, tissue composition, and lineage relationships within complex tissues, providing a deeper understanding of their biological roles.
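As a rough illustration of a one-cluster-versus-the-rest comparison, the sketch below runs a Wilcoxon rank-sum (Mann-Whitney U) test for a single gene in plain Python. It is not Seurat's implementation: it uses a normal approximation with a tie correction and no continuity correction, and the expression values are invented. Note that it treats every cell as an independent observation, which is the assumption discussed above.

```python
import math


def rank_sum_test(cluster_values, other_values):
    """Return (U, AUC, two-sided p-value) for cluster vs. all other cells."""
    n1, n2 = len(cluster_values), len(other_values)
    # Label 0 marks cells from the cluster of interest, 1 marks the rest.
    pooled = sorted([(v, 0) for v in cluster_values] +
                    [(v, 1) for v in other_values])
    rank_sum_cluster = 0.0
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        n_tied = j - i
        midrank = (i + 1 + j) / 2  # average of ranks i+1 .. j
        n_cluster_in_tie = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_cluster += midrank * n_cluster_in_tie
        tie_term += n_tied ** 3 - n_tied
        i = j
    u = rank_sum_cluster - n1 * (n1 + 1) / 2
    n = n1 + n2
    variance = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    z = (u - n1 * n2 / 2) / math.sqrt(variance) if variance > 0 else 0.0
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return u, u / (n1 * n2), p_value


# Hypothetical normalized expression of one candidate marker gene.
in_cluster = [5, 7, 8, 9]
all_others = [0, 0, 1, 2, 3]
u_stat, auc, p_value = rank_sum_test(in_cluster, all_others)
```

An AUC of 1.0 means every cell in the cluster expresses the gene more highly than every cell outside it, which is the pattern one hopes to see for a clean marker.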
Beyond identifying cell type markers, DE analysis in scRNA-seq also compares gene expression between conditions at the single-cell level. By comparing treated and control cells, researchers can detect subtle but biologically significant changes in gene expression, particularly in rare cell populations or specific cellular states.
DE analysis in single-cell transcriptomics can be approached using three main classes of methods: single-cell, pseudobulk, and linear mixed models, each with unique advantages and limitations (see Table 1). Single-cell methods analyze gene expression at the individual cell level, capturing cell-to-cell variability, but are sensitive to technical noise. Pseudobulk methods aggregate gene expression across cells within each biological replicate, reducing noise and improving reliability, but may miss subtle subpopulation differences. Linear mixed models include both fixed and random effects, offering a more nuanced analysis, but can be computationally intensive. Despite these advances, choosing a methodology that leads to reproducible results remains challenging. In the next section, we compare DE methods for scRNA-seq using a benchmarking experiment by Butler et al. (2021), published in Nature Biotechnology.
Single-Cell RNA-Seq: Assessing Differential Expression Methods with Ground Truth Data
The goal was to evaluate statistical methods for DE analysis based on their ability to produce biologically accurate results. Butler et al. (2021) argued that using real datasets with known experimental ground truths provides a more accurate assessment of method performance than simulations. They determined that the best ground truth comes from datasets where both bulk RNA-seq and scRNA-seq were conducted on the same purified cell populations, under identical conditions, and sequenced in the same laboratories. Through a thorough literature review, they identified eighteen “gold standard” datasets that met these criteria, thereby enabling a large-scale benchmarking of DE methods in controlled experimental settings.
The team then selected fourteen commonly used differential expression (DE) analysis methods in single-cell transcriptomics for comparison, covering nearly 90% of recent studies. These methods included seven statistical tests analyzing gene expression at the individual cell level (“single-cell methods”), six tests that aggregated cells within biological replicates to create pseudobulks before analysis (“pseudobulk methods”), and one method employing a linear mixed model.
Minimizing Bias in Gene Expression: Pseudobulk Methods Outperform Single-Cell Methods in scRNA-Seq
The analysis showed that methods aggregating cells within biological replicates to form “pseudobulks” before applying statistical tests consistently outperformed those comparing individual cells (single-cell methods). Pseudobulk methods demonstrated higher concordance with bulk RNA-seq results, better predicted changes in protein abundance, and more accurately reflected biological pathways in functional enrichment analyses. The study also found that single-cell DE methods tend to be biased toward highly expressed genes, whereas pseudobulk methods reduce data sparsity by aggregating cells, enhancing accuracy, particularly for lowly expressed genes.
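The aggregation step at the heart of pseudobulk analysis is simple to sketch. The snippet below sums raw counts over all cells that share a biological replicate and cell type; the replicate names, cell types, genes, and counts are hypothetical, and the data layout is just one convenient assumption.

```python
from collections import defaultdict


def pseudobulk(cells):
    """cells: iterable of (replicate_id, cell_type, {gene: raw count}).

    Returns {(replicate_id, cell_type): {gene: summed count}}.
    """
    totals = defaultdict(lambda: defaultdict(int))
    for replicate_id, cell_type, counts in cells:
        for gene, count in counts.items():
            totals[(replicate_id, cell_type)][gene] += count
    return {key: dict(gene_totals) for key, gene_totals in totals.items()}


# Four made-up cells from two mice.
example_cells = [
    ("mouse1", "endothelial", {"Pecam1": 3, "Actb": 10}),
    ("mouse1", "endothelial", {"Pecam1": 0, "Actb": 7}),
    ("mouse1", "microglia", {"Actb": 4}),
    ("mouse2", "endothelial", {"Pecam1": 5, "Actb": 12}),
]
profiles = pseudobulk(example_cells)
```

Each resulting profile behaves like one bulk RNA-seq sample per replicate and cell type, so it can be handed to a bulk tool such as edgeR, DESeq2, or limma, with the biological replicate as the unit of analysis.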
To investigate the bias in scRNA-seq towards highly expressed genes further, the researchers categorized genes into three groups based on expression levels: low, moderate, and high. They found the most significant differences between pseudobulk and single-cell methods were among highly expressed genes, contrary to the initial hypothesis that pseudobulk methods would mainly excel with lowly expressed genes. Additional analysis revealed that single-cell DE methods were more prone to false positives with highly expressed genes and false negatives with lowly expressed ones, indicating a systematic bias toward identifying highly expressed genes as differentially expressed, even when their expression did not change. Experimental validation using synthetic mRNA spike-ins confirmed that single-cell methods incorrectly flagged many abundant mRNAs as differentially expressed, a bias not seen with pseudobulk methods. This bias was consistent across 46 diverse scRNA-seq datasets, suggesting that the inferior performance of single-cell methods is largely due to their preference for highly expressed genes.
The study further demonstrated that pseudobulk methods outperform single-cell methods in DE analysis because they account for biological replicates, which avoids the bias toward highly expressed genes. The researchers showed that when pseudobulk methods (such as edgeR, DESeq2, and limma) were applied without cell aggregation, their performance advantage vanished, and a bias towards highly expressed genes appeared. Randomly grouping cells into pseudo-replicates led to similar performance declines and biases, underscoring that aggregating gene expression across true biological replicates is crucial for the superior performance of pseudobulk methods. Ignoring biological replicates reduces the observed variance in gene expression, so that differences arising from replicate-to-replicate variation or technical noise are mistaken for true biological differences. This loss of variance, particularly affecting highly expressed genes, was consistently observed across multiple datasets, emphasizing the importance of considering biological replicates in DE analysis to avoid biases and accurately detect genuine gene expression changes. The outperformance of pseudobulk methods compared to single-cell methods is concerning, as most of the literature relies on single-cell methods, which often produce artificially small p-values and false positive differentially expressed genes.
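The loss of variance caused by random grouping is easy to reproduce on toy data. The sketch below is an illustrative simulation with made-up parameters, not a reanalysis of the benchmark: three replicates differ in their baseline expression of one gene, and we compare the variance between true replicate means with the variance between means of randomly formed pseudo-replicates.

```python
import random
from statistics import mean, variance

rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Hypothetical baseline shifts for three biological replicates.
replicate_effects = [-1.0, 0.0, 1.0]
cells_per_replicate = 300

# Each cell is (replicate label, expression value with cell-level noise).
cells = [
    (label, effect + rng.gauss(0.0, 0.5))
    for label, effect in enumerate(replicate_effects)
    for _ in range(cells_per_replicate)
]


def variance_of_group_means(labelled_values):
    """Sample variance of the per-group means."""
    groups = {}
    for label, value in labelled_values:
        groups.setdefault(label, []).append(value)
    return variance([mean(values) for values in groups.values()])


true_replicate_variance = variance_of_group_means(cells)

# Shuffle cells and deal them into three pseudo-replicates at random.
values = [value for _, value in cells]
rng.shuffle(values)
pseudo_replicates = [(i % 3, value) for i, value in enumerate(values)]
pseudo_replicate_variance = variance_of_group_means(pseudo_replicates)
```

The pseudo-replicate means land almost on top of each other because each one averages cells from all three animals, so the between-group variance that a statistical test relies on all but disappears.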
False Positives in scRNA-Seq: The Need for Robust Methods Accounting for Biological Variability
The study by Butler et al. (2021) highlighted a critical issue in differential expression (DE) analysis of single-cell RNA sequencing (scRNA-seq) data: the risk of false discoveries when variability between biological replicates is not properly accounted for. Through simulations, the researchers demonstrated that single-cell DE methods could incorrectly identify hundreds of DE genes even when no real biological differences were present between groups, especially for genes with high variability across replicates. In contrast, pseudobulk methods, which aggregate gene expression data across replicates, effectively eliminated these false positives. Additional experiments showed that increasing the number of biological replicates reduced false discoveries, whereas simply increasing the number of cells exacerbated the problem, emphasizing the importance of considering replicate variability. Real data analysis of human peripheral blood mononuclear cells (PBMCs) confirmed this issue; randomly splitting unperturbed control samples into synthetic groups resulted in numerous false DE genes, a pattern observed across multiple datasets. The problem also extended to spatial transcriptomics data, where neglecting biological replicate variability led to thousands of spurious DE genes. These findings highlight the need for DE analysis methods that carefully account for replicate variability, particularly in human studies, where such variability is harder to control due to genetic and environmental differences.
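A small deterministic example shows why adding cells makes the problem worse while the replicate-level view stays stable. The numbers below are invented: two groups of three replicates differ by a mean shift of 0.3, which is small compared with the spread between replicates. The sketch compares a Welch t statistic computed on individual cells with one computed on replicate means; it stops at the statistic and does not compute p-values.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch t statistic for the difference in means (b minus a)."""
    standard_error = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(b) - mean(a)) / standard_error


def compare(cells_per_replicate):
    """Return (cell-level t, replicate-level t) for a fixed toy design."""
    cell_noise = [-0.5, 0.0, 0.5]          # repeating cell-level deviations
    control_effects = [-1.0, 0.0, 1.0]     # hypothetical replicate baselines
    treated_effects = [-0.7, 0.3, 1.3]     # same spread, shifted by 0.3

    def make_group(effects):
        return [[effect + cell_noise[i % 3] for i in range(cells_per_replicate)]
                for effect in effects]

    control, treated = make_group(control_effects), make_group(treated_effects)
    # Single-cell style: every cell counts as an independent observation.
    cell_t = welch_t([x for rep in control for x in rep],
                     [x for rep in treated for x in rep])
    # Pseudobulk style: one value per biological replicate.
    replicate_t = welch_t([mean(rep) for rep in control],
                          [mean(rep) for rep in treated])
    return cell_t, replicate_t


cell_t_small, replicate_t_small = compare(30)
cell_t_large, replicate_t_large = compare(300)
```

With 30 cells per replicate the cell-level statistic is already above 2, and with 300 cells it climbs to roughly 7, while the replicate-level statistic stays near 0.37 in both cases. More cells shrink the apparent standard error without adding any new information about animal-to-animal variability.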
The study investigated gene expression changes in the lumbar spinal cord of mice following spinal cord injury using single-nucleus RNA sequencing (snRNA-seq). DE analysis revealed significant transcriptional changes in endothelial cells, but results differed greatly between the Wilcoxon rank-sum test (a single-cell method) and edgeR-LRT (a pseudobulk method), with the latter proving more reliable upon validation. These findings highlight that pseudobulk methods, which account for biological replicate variability, offer more accurate results compared to single-cell methods, which may produce false positives.
The study found that DE methods for single-cell data are most effective when they account for variability between biological replicates. Surprisingly, however, generalized linear mixed models (GLMMs), which are designed to model this variability, did not outperform pseudobulk methods. GLMMs were accurate in small datasets under specific conditions, but on larger datasets their performance was similar to that of pseudobulk methods while requiring significantly more computational time. These results suggest that although GLMMs have theoretical advantages, pseudobulk methods offer a more practical combination of speed and accuracy for single-cell DE analysis in real-world applications.
Summary
Accurate differential expression (DE) analysis in single-cell transcriptomics is essential for understanding transcriptional responses to different conditions. Choosing methods that properly account for variability between biological replicates is crucial, as failing to do so can lead to false discoveries and misleading interpretations. Single-cell DE methods that analyze individual cells often mistake biological variability for true effects, resulting in spurious results, especially in human studies with high variability. In contrast, pseudobulk methods, which aggregate data across replicates, provide more reliable outcomes. Selecting the right methodology is vital to ensure the accuracy and reproducibility of findings, guiding better scientific conclusions and effective use of resources.
Table 1. Comparison of DE analysis methods for scRNA-seq
Feature | Pseudobulk Methods | Single-Cell Methods | Generalized Linear Mixed Models (GLMMs) |
Definition | Aggregates gene expression data across cells within biological replicates to form “pseudobulks” before applying statistical tests. | Performs DE analysis directly on individual cells without aggregation. | Uses Poisson or negative binomial generalized linear mixed models (GLMMs) to model gene expression variability within and between biological replicates. |
Statistical Basis | Negative binomial models (e.g., edgeR, DESeq2) and linear models (e.g., limma) | Non-parametric (e.g., Wilcoxon rank-sum test) or parametric tests designed for single-cell data. | Poisson or negative binomial GLMMs that model multiple layers of variability. |
Handling of Biological Replicates | Explicitly accounts for variability between biological replicates by aggregating data. | Typically does not account for biological replicate variability, leading to potential biases. | Models both within- and between-replicate variability, but performance can vary. |
Performance with Low Cell Counts | Good performance; aggregation stabilizes variability even with fewer cells. | Performance can be variable; more prone to false positives with low cell counts. | GLMMs can perform well under specific conditions but are computationally expensive. |
Performance with Large Datasets | Maintains high performance and scalability; quick computation time. | Can become computationally intensive and prone to false discoveries as the dataset size increases. | Similar performance to pseudobulk methods, but computational requirements increase significantly. |
Computational Efficiency | High efficiency; methods run in minutes even on large datasets. | Computationally demanding with large datasets; can take significantly longer than pseudobulk methods. | Very high computational cost; fitting models can take hours even on downsampled datasets. |
Bias Toward Highly Expressed Genes | Minimizes bias due to aggregation of cells, reducing zero inflation and data sparsity. | Prone to bias toward highly expressed genes due to data sparsity and dropout events. | Can reduce bias with correct parameter settings, but computational costs may outweigh benefits. |
Are you interested in reading more about single-cell studies? Check out our other single cell-related articles:
● Targeting Senescent Cells Using CAR T Cells: A New Approach to Combat Aging
● scFoundation: A Powerful AI Large-Scale Foundation Model for Single-Cell Research
● Breakthrough High-Resolution Spatial Multi-Omics: Slide-Tags Unlock Single Cell Analysis
● Discovering the Unseen in Less Time: Light-Seq’s Role in Identifying Rare Retinal Biomarkers
● scGPT: The First AI Large Language Model in Single-Cell RNA Sequencing
● How GPT-4 Provides High Accuracy Cell-Type Annotations in Single Cell RNA Sequencing Experiments
● sc-SHC: A Framework for Statistical Testing During Single Cell Clustering
Outsourcing Bioinformatics Analysis: How Bridge Informatics (BI) Can Help
We are passionate about empowering life science companies with cutting-edge technologies. BI’s data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. Our bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.
From pipeline development and software engineering to deploying your existing bioinformatic tools, BI can help you on every step of your research journey. As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Click here to schedule a free introductory call with a member of our team.
Shahrzad Ghazisaeidi, PhD, Data Scientist, Bridge Informatics
Shahrzad specializes in high-throughput sequencing, data pre-processing, and downstream analysis of transcriptomic and epigenetic landscapes. She is particularly passionate about developing innovative tools for drug repurposing.
Prior to joining Bridge Informatics, Shahrzad served as a Postdoctoral Associate at the Hospital for Sick Children in Toronto, Canada where she researched the epigenetics of peripheral nerve injury models.
Shahrzad earned her Ph.D. in Physiology and Neuroscience from the University of Toronto. Her studies focused on the sex-dependent and independent gene regulation of peripheral nerve injury. Currently based in Toronto, Shahrzad balances her professional pursuits with personal interests by making time for yoga, martial arts, and poetry.
Haider M. Hassan, Data Scientist, Bridge Informatics
Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.
Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks. If you’re interested in reaching out, please email [email protected] or [email protected]