Summary
This blog discusses the challenges of read alignment in single-cell RNA-seq (scRNA-seq) data analysis that lead to missed gene expression and cell type misidentification. We outline the publication by Pool et al. (2023), which demonstrates that optimizing the reference transcriptome significantly improves gene detection and read alignment and uncovers previously missed cell types in scRNA-seq datasets.
Introduction
Single-cell RNA sequencing (scRNA-seq) is an advanced molecular biology technique for the comprehensive analysis of gene expression at the single-cell level within a heterogeneous population of cells. In contrast to traditional bulk RNA sequencing techniques that provide aggregated gene expression data, scRNA-seq offers a granular analysis of cellular diversity and heterogeneity. A key step in scRNA-seq analysis workflows is dimensionality reduction, which transforms high-dimensional gene expression data into lower-dimensional representations and enables subsequent unsupervised clustering. K-means clustering algorithms, in combination with community detection methods such as the Louvain or Leiden algorithms, permit the identification of distinct cell populations or clusters. Each cluster may represent a unique population of cells, so the clusters indicate cellular heterogeneity and illuminate both established cell types and potentially novel subpopulations for further investigation. Determining the gene expression signatures of these clusters provides insights into cellular functionality, disease mechanisms, and developmental processes.
In scRNA-seq analysis workflows, a frequently overlooked process is alignment and quantification, which determines which reads are included in the dataset. This involves mapping reads to the reference transcriptome, associating them with specific genes, and removing duplicates. A significant portion of sequencing reads is often excluded for various reasons, such as failed mapping to the transcriptome, duplication, multi-mapping to more than one genomic site, association with multiple genes (multi-gene reads), or mapping to intronic or intergenic regions. Some of these discarded reads represent genuine gene expression, so discarding them leads to missing data for expressed genes, mechanisms, and even potentially novel cell types.
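The filtering described above can be made concrete with a minimal sketch. This is not how Cell Ranger or STARsolo are implemented; the read fields (`n_loci`, `genes`, `region`, `cell`, `umi`), the order of the checks, and the toy reads are all assumptions chosen for illustration. It simply shows how few reads may survive a conventional exon-only pipeline.

```python
from collections import Counter


def classify_read(read):
    """Return the fate of one read under a simplified exon-only pipeline.

    `read` is a dict with hypothetical keys (assumptions for this sketch):
      n_loci - number of genomic locations the read aligned to (0 = unmapped)
      genes  - set of gene names the alignment overlaps
      region - "exonic", "intronic", or "intergenic"
    """
    if read["n_loci"] == 0:
        return "unmapped"
    if read["n_loci"] > 1:
        return "multi-mapped"
    if read["region"] == "intergenic":
        return "intergenic"
    if len(read["genes"]) > 1:
        return "multi-gene"
    if read["region"] == "intronic":
        return "intronic"
    return "counted"


def count_molecules(reads):
    """Tally read fates and collect unique (cell, gene, UMI) molecules."""
    fates = Counter()
    molecules = set()
    for read in reads:
        fate = classify_read(read)
        if fate == "counted":
            gene = next(iter(read["genes"]))
            key = (read["cell"], gene, read["umi"])
            if key in molecules:
                fate = "duplicate"  # same molecule seen again
            else:
                molecules.add(key)
        fates[fate] += 1
    return fates, molecules


# Toy reads, invented for illustration only.
reads = [
    {"cell": "c1", "umi": "AAA", "n_loci": 1, "genes": {"GeneA"}, "region": "exonic"},
    {"cell": "c1", "umi": "AAA", "n_loci": 1, "genes": {"GeneA"}, "region": "exonic"},
    {"cell": "c1", "umi": "CCC", "n_loci": 1, "genes": {"GeneA", "GeneB"}, "region": "exonic"},
    {"cell": "c1", "umi": "GGG", "n_loci": 1, "genes": {"GeneC"}, "region": "intronic"},
    {"cell": "c1", "umi": "TTT", "n_loci": 3, "genes": set(), "region": "exonic"},
    {"cell": "c1", "umi": "ACG", "n_loci": 0, "genes": set(), "region": "intergenic"},
    {"cell": "c1", "umi": "TGC", "n_loci": 1, "genes": set(), "region": "intergenic"},
]

fates, molecules = count_molecules(reads)
for fate, n in sorted(fates.items()):
    print(f"{fate}: {n}")
```

In this toy input only one of seven reads contributes a molecule to the count matrix. The intronic read for GeneC is discarded even though it may reflect real expression.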
In a recent paper published in Nature Methods, Pool et al. (2023) demonstrate that the loss of useful reads during the alignment stage, many of which represent true gene expression, arises not from low sensitivity but from shortcomings in the existing transcriptomic references. This finding holds true even for extensively annotated genomes like those of mice and humans. Importantly, optimizing the reference transcriptome prior to alignment captures missed gene expression and enables the identification of potentially misidentified cell types.
Optimizing the reference transcriptome for the detection of novel cell types
In an scRNA-seq workflow, computational tools are used to identify gene expression patterns in individual cells and detect distinct cell types. However, a notable issue with scRNA-seq is the exclusion of certain sequencing reads during the alignment step, even though these reads often capture true gene expression. Including these reads is essential for differential gene expression analysis and for the identification of distinct cell types. For instance, consider a scenario where two cells both express genes A, B, C, and D, but only one of the two cells expresses gene E. Failure to detect the expression of gene E, due to the use of a non-optimized reference transcriptome, would inaccurately imply that both cells are of the same type, despite their underlying differences.
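The two-cell scenario above can be written out as a small sketch. The counts are invented, and the "reference" is reduced to the set of genes whose reads the pipeline is able to assign, which is a deliberate simplification.

```python
# Hypothetical true expression in two cells (counts are made up).
cell_1 = {"A": 5, "B": 3, "C": 7, "D": 2, "E": 4}
cell_2 = {"A": 5, "B": 3, "C": 7, "D": 2, "E": 0}


def detected_genes(counts, visible_genes):
    """Genes seen as expressed, given the genes the reference can capture."""
    return {gene for gene in visible_genes if counts.get(gene, 0) > 0}


# A reference that loses reads from gene E versus one that captures them.
non_optimized_reference = {"A", "B", "C", "D"}
optimized_reference = {"A", "B", "C", "D", "E"}

same_without_e = (
    detected_genes(cell_1, non_optimized_reference)
    == detected_genes(cell_2, non_optimized_reference)
)
same_with_e = (
    detected_genes(cell_1, optimized_reference)
    == detected_genes(cell_2, optimized_reference)
)

print("Profiles look identical with non-optimized reference:", same_without_e)
print("Profiles look identical with optimized reference:", same_with_e)
```

With gene E invisible, the two cells are indistinguishable. Once its reads are captured, the difference between them appears.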
Pool et al. (2023) demonstrate that read loss in scRNA-seq datasets stems from three sources: poor annotation of 3′ gene ends, gene overlap stemming from the annotation of rare read-through or prematurely starting transcripts, and the exclusion of intronic reads. To address these challenges, Pool et al. (2023) proposed a structured three-step strategy for recovering informative reads. This approach includes incorporating reads mapping to unannotated 3′ untranslated regions (UTRs), refining the integration of intronic reads to capture gene expression profiles, and resolving gene overlaps by excluding rare transcript isoforms. Using this approach, Pool et al. (2023) produced enhanced genome annotations for both mouse and human transcriptomes and developed an R package called ReferenceEnhancer. This package automates critical reference optimization steps, thereby facilitating the creation of optimized genome annotations. These annotations can then be used to generate the corresponding transcriptomic reference with the preferred read-alignment software, such as Cell Ranger or STARsolo. In essence, this approach offers a universal and scalable method for optimizing genome annotations, and it is particularly beneficial for 3′ scRNA-seq analysis.
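To give a feel for one of these steps, the following is a minimal sketch of extending a gene's annotated 3′ end so that reads falling just past it can be assigned. It is not the ReferenceEnhancer API, and it does not reproduce the authors' procedure. The `Gene` class, the `extend_three_prime` function, the rule of stopping before the next same-strand gene, and all coordinates are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Gene:
    """Hypothetical gene record; coordinates are 1-based, start <= end."""
    name: str
    strand: str  # "+" or "-"
    start: int
    end: int


def extend_three_prime(gene, extension, neighbours, gap=1):
    """Extend the 3' end by up to `extension` bases.

    Assumption of this sketch: the extension stops `gap` bases short of the
    nearest downstream gene on the same strand, so no new overlap is created.
    """
    if gene.strand == "+":
        # The 3' end of a plus-strand gene is its `end` coordinate.
        new_end = gene.end + extension
        for other in neighbours:
            if other.strand == gene.strand and other.start > gene.end:
                new_end = min(new_end, other.start - gap)
        return replace(gene, end=max(new_end, gene.end))
    # The 3' end of a minus-strand gene is its `start` coordinate.
    new_start = gene.start - extension
    for other in neighbours:
        if other.strand == gene.strand and other.end < gene.start:
            new_start = max(new_start, other.end + gap)
    return replace(gene, start=max(1, min(new_start, gene.start)))


# Toy annotation, invented for illustration only.
gene_a = Gene("GeneA", "+", 100, 500)
gene_b = Gene("GeneB", "+", 2000, 3000)

extended = extend_three_prime(gene_a, extension=5000, neighbours=[gene_b])
print(extended)
```

Here the requested 5,000-base extension is capped just before GeneB begins. A real workflow would choose the extension from observed read pile-ups rather than a fixed number.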
Implementing the reference optimization strategy recovered obscured gene expression data for a diverse array of genes, overcoming the limitations of conventional methods that rely on exonic or combined exonic-intronic reads. Pool et al. (2023) assessed the performance of optimized reference transcriptomes across various tissues and species and quantified enhancements in gene detection and read alignment. The optimized reference transcriptomes detected thousands of new genes compared to exonic references and several hundred new genes compared to regular intronic read-based references. Additionally, thousands of genes exhibited a 50% or higher increase in read detection compared to regular intronic read-based references. Detailed analysis of mouse brain and human PBMC datasets revealed significant improvements in gene detection and read alignment using the optimized mouse transcriptome, including over 3,000 newly detected genes and 14.8% more reads for downstream analysis. This optimized reference also led to a substantial increase in cellular profiling resolution, with nearly 600 additional genes detected per median preoptic nucleus neuron, representing a more than 20% increase in genes detected per neuron. Furthermore, this improved resolution resulted in the identification of additional neuron cell types.
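Metrics of this kind (newly detected genes, genes with a 50% or higher gain in reads, and the overall percentage of additional reads) can be computed from per-gene read totals under two references. The sketch below shows one way to do so. The function name `compare_references` and all counts are hypothetical and are not taken from the paper.

```python
def compare_references(baseline, optimized, min_gain=0.5):
    """Compare per-gene read totals obtained with two references.

    baseline, optimized: dicts mapping gene name -> total reads.
    Returns (newly detected genes, genes with >= min_gain relative
    increase, percentage gain in total reads).
    """
    newly_detected = sorted(
        gene for gene, n in optimized.items()
        if n > 0 and baseline.get(gene, 0) == 0
    )
    boosted = sorted(
        gene for gene, n in optimized.items()
        if baseline.get(gene, 0) > 0
        and (n - baseline[gene]) / baseline[gene] >= min_gain
    )
    total_before = sum(baseline.values())
    total_after = sum(optimized.values())
    read_gain_pct = 100 * (total_after - total_before) / total_before
    return newly_detected, boosted, read_gain_pct


# Toy read totals, invented for illustration only.
baseline_counts = {"GeneA": 100, "GeneB": 40, "GeneC": 0, "GeneD": 60}
optimized_counts = {"GeneA": 105, "GeneB": 70, "GeneC": 25, "GeneD": 60}

new_genes, boosted_genes, gain = compare_references(baseline_counts, optimized_counts)
print("Newly detected:", new_genes)
print("Reads up by 50% or more:", boosted_genes)
print(f"Total read gain: {gain:.1f}%")
```

In the toy data, GeneC is newly detected, GeneB gains 75% more reads, and the total read count rises from 200 to 260.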
Conclusion
The study by Pool et al. (2023) demonstrates that optimizing the reference transcriptome for scRNA-seq analysis has the potential to recover lost gene expression data and unveil novel cell types. Optimized references hold promise for personalized medicine and immunotherapy: a better understanding of cellular diversity and functionality at the single-cell level can facilitate the development of tailored and effective treatments based on an individual’s genetic makeup. Moreover, optimized reference mapping may aid in the development of more potent immunotherapies by identifying previously unrecognized cell types, shedding light on mechanisms by which cancer cells evade immune surveillance, and offering avenues for intervention. Future research could focus on further refining these optimization strategies and investigating their implications across various biological contexts. Such efforts could yield more comprehensive and precise insights into cellular behaviors, advancing our understanding of biological systems at the single-cell level and improving clinical practice and patient outcomes.
Advancing Research with Bridge Informatics
By harnessing the power of bioinformatics and drawing on the latest scientific advancements such as this study, Bridge Informatics empowers its clients to navigate the complexities of single-cell transcriptomics and translate research into life-saving advancements in healthcare.
BI’s data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. The generation, storage and analysis of biological data is faster and more accessible than ever before. From pipeline development and software engineering to deploying your existing bioinformatic tools, Bridge Informatics can help you on every step of your research journey.
As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.