Summary
Cell clustering is a critical component of single-cell analysis workflows and is essential for identifying both known and novel cell types. A common issue in single-cell RNA sequencing (scRNA-seq) workflows is that widely used clustering methods, which often rely on a tuning parameter called resolution, do not incorporate statistical inference during cell clustering. This can lead to over-clustering, the spurious sub-clustering of cell populations, which undermines the discovery of novel cell types. This blog outlines an algorithm called single-cell significance of hierarchical clustering (sc-SHC), which uses model-based hypothesis testing to evaluate clusters statistically within single-cell analysis workflows.
Challenges of Heuristic Clustering
Single-cell RNA sequencing (scRNA-seq) is a technique in molecular biology that allows the analysis of gene expression profiles of single cells in a population. Unlike traditional bulk RNA sequencing, which measures gene expression averaged across a population of cells, scRNA-seq permits an analysis of cellular heterogeneity. An important component of single-cell analysis workflows is dimensionality reduction of the expression profiles followed by unsupervised clustering, in order to identify distinct cell populations, or clusters, that can be characterized either as known cell types or as novel entities. A common issue with current clustering algorithms is over-clustering.
Over-clustering refers to the partitioning of a known cell population into many smaller sub-populations, or smaller clusters, due to bias or random variation in the data rather than true differential expression. In single-cell analysis workflows, over-clustering undermines confidence in the discovery of novel cell types, since sub-populations can arise from random variation alone. Additionally, because the predominant clustering algorithms are heuristic and do not rely on an underlying generative model, they are not inherently designed to incorporate statistical inference. Lastly, data snooping bias, also referred to as double-dipping, arises when the same data are used both to define clusters and to test for differences between them: cells can be inaccurately split into two groups, and genes then appear differentially expressed with artificially small p-values.
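The double-dipping effect can be illustrated with a small simulation. The sketch below is a toy example and not part of the published method: it assumes a single gene drawn from one homogeneous population, uses a median split as a stand-in for clustering, and approximates the p-value with a normal distribution. All function names are hypothetical.

```python
import math
import numpy as np

def two_sided_p_from_z(z):
    """Two-sided p-value for a z statistic under the standard normal."""
    return math.erfc(abs(z) / math.sqrt(2.0))

def welch_z(a, b):
    """Welch statistic for a difference in means (normal approximation)."""
    se = math.sqrt(a.var(ddof=1) / len(a) + b.var(ddof=1) / len(b))
    return (a.mean() - b.mean()) / se

def double_dip_p_value(rng, n_cells=400):
    """Draw ONE homogeneous population, split it using the data, then test it."""
    expr = rng.normal(loc=0.0, scale=1.0, size=n_cells)  # one gene, no true groups
    labels = expr > np.median(expr)  # toy "clustering": split cells at the median
    return two_sided_p_from_z(welch_z(expr[labels], expr[~labels]))

def honest_p_value(rng, n_cells=400):
    """Same test, but group labels are assigned independently of the data."""
    expr = rng.normal(loc=0.0, scale=1.0, size=n_cells)
    labels = rng.permutation(n_cells) < n_cells // 2
    return two_sided_p_from_z(welch_z(expr[labels], expr[~labels]))

rng = np.random.default_rng(0)
dipped = np.array([double_dip_p_value(rng) for _ in range(200)])
honest = np.array([honest_p_value(rng) for _ in range(200)])
print("share of p < 0.05, clustered then tested:", (dipped < 0.05).mean())
print("share of p < 0.05, independent labels:   ", (honest < 0.05).mean())
```

Because the groups are defined by the very values being tested, the "clustered then tested" runs report a significant difference even though no true groups exist, while the independently labeled runs reject at roughly the nominal rate.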
In a publication by Grabski et al. (2023), the authors introduce an algorithm termed single-cell significance of hierarchical clustering (sc-SHC). sc-SHC integrates significance analysis into the clustering process, thereby enabling the statistical assessment of identified clusters as distinct cell populations.
Model-Based Hypothesis Testing with SHC
Existing statistical frameworks for clustering often assume a Gaussian (normal) distribution. Their applicability to scRNA-seq analysis is also limited because they only test one cluster against two, rather than assessing any number of clusters, so cell clustering cannot be performed in a hierarchical manner. Significance of Hierarchical Clustering (SHC) addresses the latter issue by incorporating hypothesis testing into the hierarchical procedure. However, SHC is not suitable for scRNA-seq because it relies on a Gaussian distribution, which is incompatible with sparse count data.
To account for over-clustering, sc-SHC builds on SHC by incorporating model-based hypothesis testing into hierarchical clustering in scRNA-seq workflows. This method combines the conventional hierarchical clustering used in scRNA-seq workflows with built-in hypothesis testing to provide statistical inference.
sc-SHC was built with two important functions. First, it enables hierarchical clustering with built-in hypothesis testing, and it also allows significance analysis on pre-clustered datasets. Second, sc-SHC addresses multiple, sequential hypothesis testing by controlling the family-wise error rate (FWER), the probability of making at least one false discovery (type I error) across a set of tests. sc-SHC can therefore be used both to cluster cells within a statistical framework and to re-analyze existing datasets and single-cell atlases with pre-clustered cell populations, attaching statistical significance to those populations.
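To see why the FWER needs controlling when many tests are run, consider the simplified case of independent tests that each use the same significance level. The sketch below is a generic illustration under that assumption; it uses the textbook Bonferroni correction as an example and is not the sequential correction implemented in sc-SHC. The function names are hypothetical.

```python
def fwer_independent(alpha, m):
    """P(at least one type I error) across m independent tests, each at level alpha."""
    return 1.0 - (1.0 - alpha) ** m

def bonferroni_level(alpha, m):
    """Per-test level that keeps the FWER at or below alpha (Bonferroni)."""
    return alpha / m

for m in (1, 5, 20, 50):
    uncorrected = fwer_independent(0.05, m)
    corrected = fwer_independent(bonferroni_level(0.05, m), m)
    print(f"m={m:>2}  FWER uncorrected={uncorrected:.3f}  with Bonferroni={corrected:.3f}")
```

Without a correction, the chance of at least one false discovery grows quickly with the number of tests, which is the situation a clustering procedure faces when it tests many candidate splits.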
Benchmark Performance
To evaluate the performance of sc-SHC in cell clustering, the authors benchmarked it on publicly available and simulated data. Using simulated data, the authors demonstrated that the traditional Seurat method results in over-clustering even when only one cell type is present. A simple initial application to a case with two known groups illustrates the utility of the statistical analysis. The authors use a parametric bootstrap to generate a null distribution for each cluster and then compute a p-value by comparing the observed data against it.
sc-SHC was applied to the Human Lung Cell Atlas, a dataset of roughly 29,000 cells. The original study found 39 unique clusters, some of which had been classified as novel cell types. Applying sc-SHC reduced this number to 26, of which 17 corresponded to clusters in the original results. The remaining 9 clusters resulted from merging two to four clusters from the original paper. For example, the clusters Capillary and Capillary Intermediate 1 were not found to be significantly different and were therefore merged.
In conclusion, sc-SHC is an important methodological development in the field of scRNA-seq that lends statistical support to claims about clustering and the discovery of novel cell types. The difficulty of reproducing clusters can now be addressed by applying rigorous, benchmarked methods. Additionally, the availability of an R package implementing this pipeline makes it readily usable by the genomics and bioinformatics community.
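The general idea of a parametric bootstrap test for a two-way split can be sketched as follows. This is a minimal illustration under stated assumptions, not the model or test statistic of the published method: it assumes a simple per-gene Poisson null for a single population, and it uses a hypothetical statistic, `split_quality`, defined as the share of variance explained by splitting cells on the sign of the first principal component.

```python
import numpy as np

def split_quality(counts):
    """Share of total variance explained by a two-way split of the cells.

    Cells are split by the sign of their score on the first principal
    component of the log-transformed, gene-centred matrix.
    """
    x = np.log1p(counts)
    x = x - x.mean(axis=0)
    total = (x ** 2).sum()
    if total == 0:
        return 0.0
    u, _, _ = np.linalg.svd(x, full_matrices=False)
    in_a = u[:, 0] > 0
    if in_a.all() or not in_a.any():
        return 0.0
    mean_a, mean_b = x[in_a].mean(axis=0), x[~in_a].mean(axis=0)
    # x is centred, so the between-group sum of squares needs only the group means
    between = in_a.sum() * (mean_a ** 2).sum() + (~in_a).sum() * (mean_b ** 2).sum()
    return between / total

def bootstrap_p_value(counts, n_boot=200, seed=0):
    """Compare the observed split quality with a simulated single-population null."""
    rng = np.random.default_rng(seed)
    observed = split_quality(counts)
    gene_rates = counts.mean(axis=0)  # assumed null: one Poisson rate per gene
    null = np.array([
        split_quality(rng.poisson(gene_rates, size=counts.shape))
        for _ in range(n_boot)
    ])
    # add-one correction keeps the estimated p-value strictly above zero
    return (1 + (null >= observed).sum()) / (1 + n_boot)

rng = np.random.default_rng(1)
n_cells, n_genes = 100, 50
one_type = rng.poisson(2.0, size=(n_cells, n_genes))
rates_b = np.full(n_genes, 2.0)
rates_b[:10] = 8.0  # hypothetical second cell type: 10 genes up-regulated
two_types = np.vstack([
    rng.poisson(2.0, size=(n_cells // 2, n_genes)),
    rng.poisson(rates_b, size=(n_cells // 2, n_genes)),
])
p_one = bootstrap_p_value(one_type)
p_two = bootstrap_p_value(two_types)
print("one simulated cell type:  p =", p_one)
print("two simulated cell types: p =", p_two)
```

The matrix with two simulated cell types should yield a small p-value because its split explains far more variance than splits of data simulated from a single population, whereas the single-population matrix will typically not.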
Outsourcing Bioinformatics Analysis: How Bridge Informatics Can Help
Bridge Informatics’ (BI) data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. The generation, storage, and analysis of biological data are faster and more accessible than ever before. From pipeline development and software engineering to deploying your existing bioinformatic tools, Bridge Informatics can help you at every step of your research journey.
As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.
Josh Stolz, Data Scientist
Josh, Data Scientist & Content Writer for Bridge Informatics, most enjoys work that sits at the intersection of complex molecular biology and genomics. He is an expert in biomarker identification, statistical method development, and next-generation sequencing.
Before joining BI, Josh worked at the Johns Hopkins Lieber Institute for Brain Development, where he used RNA-seq data to illuminate the underlying causes of schizophrenia. He also worked at AbbVie, where he used genomic technologies to bolster clinical trial portfolios surrounding eye-related treatments.
Josh received a BS in Biology from Indiana State University and an M.Sc in Bioinformatics. If he’s away from his desk, you will likely find Josh running along the Baltimore harbor.
Haider M. Hassan, Data Scientist, Bridge Informatics
Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.
Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks.