scPerturb: A Breakthrough Resource for Single-Cell Multi-Omic Perturbation Data Analysis and Integration

Introduction:

Single-cell genomics has significantly improved our ability to dissect cellular heterogeneity, enabling a better understanding of the mechanisms behind drug resistance, the clonal structure of tissues, and the evolutionary trajectory from normal cells to a diseased state. However, disease pathology involves alterations at the genomic, transcriptional, and proteomic levels, which calls for combining multiple high-throughput technologies, an approach known as single-cell “multi-omics”. Multi-omics, or “panomics”, addresses this challenge by combining two or more omics datasets. In systems biology, single-cell perturbation studies are likewise changing our understanding of cellular behavior: by examining how individual cells respond to various stimuli, researchers can gain deeper insight into complex biological processes.

Single-cell perturbation studies use targeted manipulations to interrogate cellular response mechanisms. CRISPR-based tools act at different levels of gene expression: CRISPR-Cas9 disrupts the target gene at the DNA level, CRISPRi represses its transcription, and CRISPR-Cas13 degrades the RNA transcript further downstream. Small-molecule libraries probe the functional landscape by directly modulating the activity of proteins such as enzymes and receptors through agonistic or antagonistic effects. Each cell receives an individual CRISPR guide, and unique barcodes identify the perturbation it carries, so the perturbation state can be mapped onto single-cell RNA sequencing (scRNA-seq), Cellular Indexing of Transcriptomes and Epitopes by sequencing (CITE-seq), or single-cell Assay for Transposase-Accessible Chromatin using sequencing (scATAC-seq) data. Readouts span chromatin accessibility, protein abundance, transcriptome profiles, genotype, and phenotypic characterization; CITE-seq, for example, quantifies surface protein expression alongside the transcriptome.

In a recent study published in Nature Methods, Peidli et al. (2024) introduce scPerturb, a resource for researchers working with single-cell perturbation data. scPerturb offers a collection of 44 publicly available datasets encompassing transcriptomic, proteomic, and epigenomic readouts from single-cell perturbation experiments. These datasets have been processed with consistent quality control procedures and standardized annotations so that they can be compared and integrated directly.

scPerturb: A Resource to Streamline Single-Cell Perturbation Data Analysis

Researchers in systems biology require standardized datasets to develop and compare computational methods. scPerturb, a resource containing 44 datasets from various studies, addresses this need. The datasets in scPerturb describe targeted perturbations with single-cell readouts, allowing researchers to compare experiment-specific variables and quantify perturbation strength. In addition, scPerturb goes beyond data collection and integration by introducing the energy distance, or E-distance, and the E-test as tools for statistically comparing sets of single cells. The E-distance compares the average distance between cells from two groups with the average distance among cells within each group, so it reflects the signal-to-noise ratio within a dataset. A large E-distance indicates distinct cellular distributions and a potent perturbation effect. Conversely, a low E-distance suggests minimal changes in expression patterns, potentially due to technical artifacts, inefficient perturbation, or cellular resistance.
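To make the idea concrete, here is a minimal sketch of an E-distance calculation in Python with NumPy. It assumes the classical energy distance with plain Euclidean distances between cells in some low-dimensional embedding; the function names, the toy data, and the choice of distance are illustrative assumptions, not the scPerturb implementation, which may use a different distance function or estimator.

```python
import numpy as np


def mean_pairwise_distance(a, b):
    """Mean Euclidean distance over all pairs (one row of a, one row of b)."""
    diff = a[:, None, :] - b[None, :, :]  # shape (n_a, n_b, n_features)
    return float(np.sqrt((diff ** 2).sum(axis=-1)).mean())


def e_distance(x, y):
    """Energy distance between two cell sets (rows = cells, columns = features).

    Simple estimator 2*delta - sigma_x - sigma_y. The within-group terms
    average over all pairs, including each cell paired with itself.
    """
    delta = mean_pairwise_distance(x, y)    # between-group distance
    sigma_x = mean_pairwise_distance(x, x)  # within-group distance of x
    sigma_y = mean_pairwise_distance(y, y)  # within-group distance of y
    return 2.0 * delta - sigma_x - sigma_y


# Toy data (hypothetical): 100 cells per group in a 10-dimensional embedding.
rng = np.random.default_rng(0)
control = rng.normal(loc=0.0, size=(100, 10))
weak = rng.normal(loc=0.1, size=(100, 10))    # barely shifted "perturbation"
strong = rng.normal(loc=2.0, size=(100, 10))  # clearly shifted "perturbation"

print("control vs weak:  ", e_distance(control, weak))
print("control vs strong:", e_distance(control, strong))
```

In this toy setting the strongly shifted group should yield a much larger value than the weakly shifted one, mirroring the reading of large and small E-distances described above.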

Analyzing large single-cell perturbation datasets is challenging because of high dimensionality, cell-to-cell variation, and data sparsity, all of which undermine traditional ways of measuring distances between perturbations. There is currently no single standard for statistical comparison in perturbation studies. Some existing methods lose information by aggregating cells into pseudo-bulk profiles, while studies involving diverse cell types require complex techniques for measuring similarity. Ideally, a distance measure should identify similar perturbations and rank their strength by how much the groups of treated cells differ, which in turn can shed light on the underlying mechanisms or targets of the perturbations. The single-cell research community has explored various scRNA-seq distance measures, with E-distance emerging as a promising tool. It offers statistical reliability and can guide experiment design, data selection for model training, and diagnostics of the information content within a specific perturbation.
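One way to attach a significance level to an E-distance is a Monte Carlo permutation test: shuffle the group labels many times and ask how often the shuffled E-distance is at least as large as the observed one. The sketch below assumes this is the general idea behind an E-test; the function names, the Euclidean distance, the number of permutations, and the toy data are all illustrative assumptions rather than the scPerturb code.

```python
import numpy as np


def pairwise_distances(z):
    """All-pairs Euclidean distance matrix for the rows of z."""
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def e_distance_from_matrix(d, idx_x, idx_y):
    """E-distance between two index sets, read off a precomputed matrix d."""
    delta = d[np.ix_(idx_x, idx_y)].mean()
    sigma_x = d[np.ix_(idx_x, idx_x)].mean()
    sigma_y = d[np.ix_(idx_y, idx_y)].mean()
    return 2.0 * delta - sigma_x - sigma_y


def e_test(x, y, n_permutations=500, seed=0):
    """Permutation test: is the E-distance between x and y larger than chance?"""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([x, y])
    d = pairwise_distances(pooled)  # computed once, reused for every shuffle
    n_x, n = len(x), len(pooled)
    observed = e_distance_from_matrix(d, np.arange(n_x), np.arange(n_x, n))
    hits = 0
    for _ in range(n_permutations):
        perm = rng.permutation(n)  # random reassignment of cells to groups
        if e_distance_from_matrix(d, perm[:n_x], perm[n_x:]) >= observed:
            hits += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    p_value = (hits + 1) / (n_permutations + 1)
    return float(observed), p_value


# Toy data (hypothetical): 40 control cells and 40 shifted cells, 5 features.
rng = np.random.default_rng(1)
control = rng.normal(loc=0.0, size=(40, 5))
perturbed = rng.normal(loc=1.5, size=(40, 5))

observed, p_value = e_test(control, perturbed, n_permutations=200)
print("E-distance:", observed, "p-value:", p_value)
```

Because the test works on whole sets of cells rather than on averaged profiles, it avoids the pseudo-bulk aggregation step mentioned above. When many perturbations are tested against the same control, the resulting p-values would normally also need a multiple-testing correction.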

Components of scPerturb

In scPerturb, Peidli et al. (2024) curated 44 publicly available single-cell perturbation response datasets encompassing multi-omic readouts (transcriptome, proteome, and epigenome). This collection includes 32 datasets with CRISPR perturbations and 9 with drug perturbations. Pharmacological datasets are fewer because of the inherent challenges of applying diverse perturbations to cells, whereas the CRISPR datasets, with their heterogeneous sets of single guides, make the resource scalable. Each dataset comes with standardized quality control (QC) measures, gene counts, count matrices, and mitochondrial read percentages. Additionally, scPerturb.org provides access to QC plots for further data exploration. Notably, three CITE-seq datasets offer separate downloads for protein and RNA counts, enabling in-depth analysis of co-expressed molecules.

In scPerturb, each dataset undergoes quantification of genes per cell and Unique Molecular Identifier (UMI) counts. Sequencing depth directly influences the detection of lowly expressed genes. Deeper sequencing increases UMI counts for these genes, thereby mitigating the technical dropout events reflected by zero counts. Consequently, variations in sequencing depth can affect the ability to distinguish between perturbations and may introduce bias in downstream analyses. This necessitates a trade-off between the number of perturbations and the average number of cells analyzed per perturbation within a dataset for robust data analysis.
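The per-cell quantities mentioned here (UMI counts, genes detected, and mitochondrial read percentage) can be sketched on a small count matrix as follows. This is an illustrative calculation only: the function name, the toy counts, and the assumption that mitochondrial genes are recognized by the human "MT-" symbol prefix are not taken from the scPerturb pipeline.

```python
import numpy as np


def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Per-cell QC metrics from a cells x genes UMI count matrix.

    Assumes mitochondrial genes can be identified by a symbol prefix
    (hypothetical default: the human "MT-" convention).
    """
    counts = np.asarray(counts)
    is_mito = np.array([g.upper().startswith(mito_prefix) for g in gene_names])

    total_umis = counts.sum(axis=1)            # sequencing depth per cell
    genes_detected = (counts > 0).sum(axis=1)  # genes with at least one UMI
    mito_umis = counts[:, is_mito].sum(axis=1)

    # Percentage of UMIs from mitochondrial genes; 0 for cells with no counts.
    pct_mito = np.zeros(counts.shape[0])
    nonzero = total_umis > 0
    pct_mito[nonzero] = 100.0 * mito_umis[nonzero] / total_umis[nonzero]

    return {
        "total_umis": total_umis,
        "genes_detected": genes_detected,
        "pct_mito": pct_mito,
    }


# Toy data (hypothetical): 4 cells x 4 genes, the last gene is mitochondrial.
genes = ["GAPDH", "ACTB", "CD3E", "MT-CO1"]
counts = np.array([
    [5, 0, 3, 2],
    [0, 0, 0, 10],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
])

metrics = qc_metrics(counts, genes)
for name, values in metrics.items():
    print(name, values)
```

Cells with few detected genes or a high mitochondrial percentage are commonly flagged for closer inspection, and comparing total UMI counts across datasets gives a quick view of the sequencing-depth differences discussed above.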

The scPerturb study also investigated the concordance between RNA and protein expression profiles in cell-type association, using E-distance values on a human PBMC CITE-seq dataset. The authors calculated all pairwise E-distances between cell types based on principal component analysis (PCA) and compared them to perturbation E-distance values. Subsequently, established cell-type relationships were used to compute cell-type hierarchies based on the resulting pairwise distance matrices. The analysis revealed a distinct clustering of B cell subtypes, with platelets forming a separate group. E-distance-based hierarchies segregated lymphoid and myeloid cells into distinct lineages. Interestingly, NK cell clusters and innate lymphoid cells (ILCs) formed a unique group due to their functional similarity to cytotoxic T cells. These findings suggest that protein markers offer superior resolution for classifying immune cell types. The study concluded that protein representations more faithfully capture cell-type specificities, while RNA data primarily reflect functional programs such as cytotoxicity or proliferation. Notably, E-distance effectively captured the known characteristics associated with each measurement modality in this study.
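The pairwise comparison described above can be sketched as a small routine that builds a symmetric matrix of E-distances between labeled groups of cells in an embedding such as PCA space. The function names, the Euclidean distance, and the toy groups below are assumptions made for illustration; they do not reproduce the PBMC analysis.

```python
import numpy as np


def e_distance(x, y):
    """Energy distance between two cell sets using Euclidean distances."""
    def mean_dist(a, b):
        diff = a[:, None, :] - b[None, :, :]
        return np.sqrt((diff ** 2).sum(axis=-1)).mean()
    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)


def pairwise_e_distances(embedding, labels):
    """Symmetric matrix of E-distances between all labeled groups."""
    labels = np.asarray(labels)
    groups = sorted(set(labels.tolist()))
    matrix = np.zeros((len(groups), len(groups)))
    for i in range(len(groups)):
        for j in range(i + 1, len(groups)):
            cells_i = embedding[labels == groups[i]]
            cells_j = embedding[labels == groups[j]]
            matrix[i, j] = matrix[j, i] = e_distance(cells_i, cells_j)
    return groups, matrix


# Toy data (hypothetical): three groups of 30 cells in a 5-dimensional
# embedding. type_a and type_b are close together, type_c is far away.
rng = np.random.default_rng(2)
embedding = np.vstack([
    rng.normal(loc=0.0, size=(30, 5)),
    rng.normal(loc=0.5, size=(30, 5)),
    rng.normal(loc=4.0, size=(30, 5)),
])
labels = ["type_a"] * 30 + ["type_b"] * 30 + ["type_c"] * 30

groups, matrix = pairwise_e_distances(embedding, labels)
print(groups)
print(np.round(matrix, 2))
```

A matrix like this could then be handed to a standard hierarchical clustering routine (for example `scipy.cluster.hierarchy.linkage`, after converting it to condensed form) to obtain a hierarchy of groups; running the same procedure separately on RNA-based and protein-based embeddings would allow the two modalities to be compared.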

scPerturb: A Major Step Forward in Personalized Medicine

The scPerturb resource offers a comprehensive approach to analyzing and exploring single-cell perturbation datasets. Consistent annotations within the resource facilitate data integration, benchmarking, and the identification of common perturbations across studies. E-distance, the metric included in the resource, enables quantitative comparison of perturbations within each dataset, revealing which are functionally similar and which are dissimilar.

The scPerturb resource and the E-statistics analytical framework hold immense potential for personalized medicine endeavors. Machine learning can leverage the perturbation significance testing and standardized annotations within scPerturb to train robust models. This paves the way for the development of next-generation computational tools for quantitatively predicting cellular processes. Armed with these models, researchers can design targeted therapies specifically tailored to individual patients, thereby facilitating a new era of personalized medicine.

Outsourcing Bioinformatics Analysis: How Bridge Informatics (BI) Can Help

We are passionate about empowering life science companies with cutting-edge technologies. BI’s data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. Our bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.

From pipeline development and software engineering to deploying your existing bioinformatic tools, BI can help you on every step of your research journey. As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Click here to schedule a free introductory call with a member of our team.
