Introduction
Advances in single-cell genomics have enabled the creation of cell atlases, which serve as comprehensive maps for biological research. These atlases allow new datasets to be quickly integrated through “reference mapping,” which helps annotate data, impute missing information, and discover new cell types. Reference mapping involves aligning new data to these atlases to extract biological meaning and build on existing knowledge. However, existing reference mapping techniques often lack interpretability, limiting the ability to understand biological mechanisms.
Traditional methods such as differential gene expression analysis are commonly used to identify biological differences between new data and existing atlases. These methods struggle, however, with complex datasets that mix biological and technical variation, which reduces their accuracy. For this reason, differential expression analysis is generally recommended on the unaltered data rather than on data that has already been corrected for such differences.
The “expiMap” (explainable programmable mapper) model proposed by Lotfollahi et al. addresses these issues by providing a more interpretable way to integrate new data into reference atlases. It uses machine learning to ensure that each part of its model represents meaningful biological information, such as known gene programs (GPs: groups of genes that are co-regulated to drive specific cellular processes), rather than just mathematically aligning the datasets. This allows for clearer insights into biological changes, such as understanding which gene programs are affected in disease compared to healthy references.
To reduce redundancy and highlight only the most relevant biological insights, expiMap employs an attention-like mechanism. By deactivating overlapping or redundant gene programs, the model focuses on the core biological processes in the data, which strengthens interpretability. The same mechanism prioritizes key gene sets, and the model can also discover new ones that were not part of the existing annotations. This makes expiMap particularly useful for identifying and exploring both well-characterized and novel biological processes, ultimately offering a more robust way to understand changes between healthy and disease conditions.
Leveraging Gene Programs for Meaningful Single-Cell Data Integration
The “expiMap” approach provides a new, interpretable method for integrating single-cell data into reference atlases by incorporating biological knowledge into the analysis. Unlike traditional methods, which either lack interpretability or fail to capture complex relationships, expiMap uses a balanced model that ensures flexibility while maintaining biological context.
expiMap integrates multiple single-cell datasets along with gene programs (GPs) to create a meaningful representation. These GPs can be sourced from databases, literature, or custom lists. The model is structured to ensure that each latent variable or internal feature relates to a particular GP, making the results biologically interpretable. When existing GPs are incomplete, expiMap can add new genes to refine these programs and enhance biological accuracy.
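As a rough illustration of how a latent variable can be tied to a single GP, the sketch below builds a binary membership mask and applies it to a linear decoder, so that each latent node can only influence the genes in its own program. The program names, gene lists, and function names are hypothetical, and this is a toy sketch under those assumptions rather than the authors' implementation.

```python
# Hypothetical toy gene programs; names and gene lists are illustrative only.
gene_programs = {
    "IFN_SIGNALING": ["ISG15", "IFIT1", "MX1"],
    "BCR_SIGNALING": ["CD79A", "CD79B", "MS4A1"],
}
genes = ["ISG15", "IFIT1", "MX1", "CD79A", "CD79B", "MS4A1"]


def build_mask(programs, gene_order):
    """Return an (n_programs x n_genes) 0/1 matrix: 1 where the gene belongs to the GP."""
    mask = []
    for members in programs.values():
        member_set = set(members)
        mask.append([1 if gene in member_set else 0 for gene in gene_order])
    return mask


def masked_decode(latent, weights, mask):
    """Linear decoder in which latent node k only reaches genes inside program k."""
    n_genes = len(mask[0])
    return [
        sum(latent[k] * weights[k][j] * mask[k][j] for k in range(len(latent)))
        for j in range(n_genes)
    ]


mask = build_mask(gene_programs, genes)
# Assumed decoder weights (all ones) so the effect of the mask is easy to see.
weights = [[1.0] * len(genes) for _ in gene_programs]
# Only the interferon node is active in this toy cell.
latent = [2.0, 0.0]
reconstruction = masked_decode(latent, weights, mask)
print(dict(zip(genes, reconstruction)))
```

Because the mask zeroes every weight outside a program, a high value of the first latent node can only be explained by the interferon genes, which is what makes the node readable as a GP activity score.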
To focus on relevant biological insights, expiMap employs an attention-like mechanism that selects only the most informative GPs, avoiding redundancy. Additionally, the model supports transfer learning, enabling the integration of new datasets by adding new latent nodes while keeping the core structure fixed. This allows the model to learn new, meaningful GPs specific to new data, including entirely novel gene patterns. Using a Bayesian framework, expiMap enables hypothesis testing to identify gene programs that differ between groups, such as in disease versus healthy conditions. This capability helps uncover important biological insights and is quantified using the Bayes factor.
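To make the Bayes factor idea concrete, here is a minimal sketch that assumes the test compares sampled GP scores between two groups and reports the log odds that one group scores higher than the other. The sample values are simulated with Python's standard library, and the function name and the clipping constant are assumptions for illustration, not part of expiMap's API.

```python
import math
import random


def log_bayes_factor(samples_a, samples_b, eps=1e-6):
    """Log odds that group A's GP score exceeds group B's, from paired samples."""
    wins = sum(a > b for a, b in zip(samples_a, samples_b))
    p = wins / len(samples_a)
    # Clip to avoid infinite values when one group always wins.
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


rng = random.Random(0)  # fixed seed so the sketch is reproducible

# Simulated latent scores for one GP: treated cells are shifted upward.
treated = [rng.gauss(2.0, 1.0) for _ in range(5000)]
control = [rng.gauss(0.0, 1.0) for _ in range(5000)]

# A second GP with no real difference between the groups.
null_a = [rng.gauss(0.0, 1.0) for _ in range(5000)]
null_b = [rng.gauss(0.0, 1.0) for _ in range(5000)]

bf_shifted = log_bayes_factor(treated, control)
bf_null = log_bayes_factor(null_a, null_b)
print(f"shifted GP: {bf_shifted:.2f}, unchanged GP: {bf_null:.2f}")
```

A large absolute value indicates that the GP differs between the groups, while a value near zero means the two groups are indistinguishable for that program. Any cutoff for calling a GP "differential" is a choice left to the analyst.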
Understanding Disease and Treatment Effects Using expiMap
One key goal of building large single-cell atlases is to understand how diseases or treatments affect cells compared to healthy states. To demonstrate this, expiMap was used to analyze the effects of interferon (IFN)-β on immune cells by mapping a dataset of treated and untreated cells from patients with systemic lupus erythematosus to a reference atlas of healthy immune cells. The untreated cells were successfully aligned with their corresponding healthy reference, while the treated cells formed distinct clusters, capturing the effects of IFN-β.
Differential analysis showed that gene programs (GPs) related to interferon pathways were consistently activated by IFN-β treatment, which separated the treated cells from untreated ones. Further analysis revealed cell-type-specific GPs, highlighting differences between cell populations and the specific effects of IFN-β on individual cell types, such as distinct metabolic activities in myeloid cells and monocytes.
In comparison, traditional methods like gene set enrichment analysis (GSEA) often produce broad, non-specific results, such as simply identifying “immune response” as a key function. expiMap, however, provided more precise insights, such as pinpointing B-cell receptor signaling in B cells. This precision is important for understanding the unique responses of different cell types.
The model also introduced a gene importance score to quantify the role of individual genes within each GP, allowing a deeper analysis of which genes are driving specific biological responses. Overall, expiMap effectively mapped the transcriptional response of immune cells to IFN-β, accounting for technical variations and capturing biologically relevant patterns that are difficult to detect with other methods.
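One simple way to picture a gene importance score is to rank the genes in a GP by the magnitude of their decoder weight. The sketch below assumes that definition; the weights and the helper name are made up for illustration and are not values from the study.

```python
def rank_genes(program_weights, top_n=3):
    """Rank genes in one GP by absolute decoder weight (assumed importance score)."""
    ranked = sorted(program_weights.items(), key=lambda item: abs(item[1]), reverse=True)
    return [(gene, abs(weight)) for gene, weight in ranked[:top_n]]


# Hypothetical decoder weights for an interferon-related GP.
ifn_weights = {"ISG15": 0.9, "IFIT1": -0.7, "MX1": 0.4, "ACTB": 0.05}

top_genes = rank_genes(ifn_weights, top_n=2)
print(top_genes)
```

Using the absolute value treats strongly negative and strongly positive weights as equally influential, which is a design choice in this sketch rather than a confirmed detail of the method.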
Adapting to Novel Data: expiMap’s Flexible Approach to Gene Programs
Incorporating domain knowledge is essential for the effective analysis of new datasets, but existing knowledge often lacks information about newly emerging phenomena. To address this, expiMap is designed to learn novel gene programs (GPs) from query data that aren’t present in existing references, as well as de novo GPs that have never been defined before.
To test this capability, the reference model was trained without any information about interferon (IFN) signaling or B cells. During query training, new nodes were introduced to learn these missing programs, and the model successfully identified specific variations in B cells and IFN signaling that were not part of the reference. For example, the model correctly identified B-cell receptor signaling, even though the related GPs had been withheld from reference training. Additionally, expiMap added extra B-cell markers that weren’t initially included, enriching the existing knowledge.
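The idea of adding new nodes while leaving the reference untouched can be sketched as follows: the decoder gains extra rows for the new nodes, and a gradient step updates only those rows. All names, shapes, and the plain gradient-descent update are assumptions made for this toy example, not the training procedure used by the authors.

```python
def extend_decoder(reference_weights, n_new_nodes):
    """Append zero-initialised rows for new nodes and mark which rows may be trained."""
    n_genes = len(reference_weights[0])
    weights = [row[:] for row in reference_weights]
    weights += [[0.0] * n_genes for _ in range(n_new_nodes)]
    trainable = [False] * len(reference_weights) + [True] * n_new_nodes
    return weights, trainable


def gradient_step(weights, grads, trainable, learning_rate=0.1):
    """Update only the trainable (new) rows; frozen reference rows are copied as-is."""
    updated = []
    for row, grad_row, is_trainable in zip(weights, grads, trainable):
        if is_trainable:
            updated.append([w - learning_rate * g for w, g in zip(row, grad_row)])
        else:
            updated.append(row[:])
    return updated


# Two reference GP nodes over three genes (hypothetical values).
reference = [[0.5, 0.0, 0.2], [0.0, 0.8, 0.0]]
weights, trainable = extend_decoder(reference, n_new_nodes=1)

# Pretend gradients from one query batch; every row receives a gradient.
grads = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [-2.0, 0.0, 4.0]]
weights = gradient_step(weights, grads, trainable)
print(weights)
```

Freezing the reference rows is what keeps previously mapped cells comparable after new data arrive, while the extra row is free to absorb variation the reference never saw.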
The model also learned new GPs linked to myeloid cells and IFN-β treatment effects, effectively capturing specific variations that were not part of the original reference atlas. These new nodes differentiated between cell types and treatment conditions, showcasing expiMap’s ability to learn new patterns and variations in the data. This ability to discover entirely new gene programs is especially important when exploring novel treatments or diseases where biology is not yet fully understood.
Another key feature of expiMap is its ability to deactivate redundant nodes, which keeps the focus on biologically relevant information. During training, some nodes that initially seemed less important began capturing signals from dendritic cells, demonstrating how expiMap can adapt and refine its analysis. The model’s robustness was further validated through testing under different data conditions and model parameters, showing that it can consistently learn both predefined and new GPs across varied scenarios.
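Node deactivation can be pictured as a group-wise shrinkage step: if the combined weight of a node falls below a threshold, the whole node is set to zero, otherwise it is shrunk slightly. The sketch below assumes a group-lasso style soft-threshold for this; the threshold and weights are illustrative, and the function name is hypothetical.

```python
import math


def group_soft_threshold(node_weights, threshold):
    """Shrink one node's weight vector; switch the node off if its norm is too small."""
    norm = math.sqrt(sum(w * w for w in node_weights))
    if norm <= threshold:
        return [0.0] * len(node_weights)
    scale = 1.0 - threshold / norm
    return [w * scale for w in node_weights]


# A node with strong weights survives, only slightly shrunk.
strong_node = group_soft_threshold([3.0, 4.0], threshold=1.0)
# A node with weak weights is deactivated entirely.
weak_node = group_soft_threshold([0.3, 0.4], threshold=1.0)

print(strong_node, weak_node)
```

Operating on the whole node at once, rather than on individual weights, is what lets a redundant program disappear as a unit while informative programs remain intact.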
Overall, expiMap not only incorporates pre-existing gene programs but also enriches them and learns new programs independently, making it a powerful tool for understanding novel cellular responses.
Understanding Disease Complexity: expiMap’s Role in COVID-19 and Diabetes
The expiMap model was used to study immune cell responses in patients with COVID-19, focusing on changes at different stages of infection and during treatment with tocilizumab, an immunosuppressive drug. By mapping datasets from two patients, including samples from severe COVID-19 and remission stages, expiMap successfully identified immune cell states affected by the disease. Key cell types, like CD8+ T cells and CD14+ monocytes, exhibited distinct responses across stages. These cell types showed differences in antiviral gene programs, such as RIG-I/MDA5 and GPCR signaling pathways, which are crucial for initiating immune responses and coordinating inflammation during viral infections.
Further analysis revealed shifts in cellular communication pathways involving annexins, structural proteins that regulate inflammation. Using CellChat, it was found that annexin-related pathways differed between severe and remission stages for the two patients. One patient showed interactions primarily involving CD14+ monocytes, while the other exhibited shifting signaling networks that included CD16+ monocytes, NK cells, and T cells. These findings suggest that differences in annexin-driven signaling circuits may be linked to individual patient responses to tocilizumab, potentially influencing treatment outcomes.
In the context of pancreatic diseases, expiMap was applied to integrate datasets containing both healthy and type 2 diabetes (T2D) cells. The model helped differentiate cell types and resolve heterogeneity among pancreatic cells, accounting for differences like age, sex, and stress levels. Notably, expiMap assisted in accurately annotating complex cell populations, such as immune–endocrine doublets and rare cell types like acinar cells. By using cell type scores based on PanglaoDB markers and Reactome pathways, expiMap provided insights that were more reliable than traditional annotation methods, especially for challenging populations that were either misclassified or left ambiguous.
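As a hedged illustration of score-based annotation, the sketch below assigns a label from per-cell-type scores and flags a cell as a possible doublet when two cell types both score highly. The cutoff, the score values, and the labelling rule are assumptions for this example and do not reproduce the procedure used in the study.

```python
def annotate_cell(scores, min_score=1.0):
    """Label one cell from its cell-type scores (cell type -> score)."""
    active = [cell_type for cell_type, score in scores.items() if score >= min_score]
    active.sort(key=lambda cell_type: scores[cell_type], reverse=True)
    if not active:
        return "unassigned"
    if len(active) == 1:
        return active[0]
    # Two or more marker sets are active: report the top two as a possible doublet.
    return "possible doublet: " + " + ".join(active[:2])


# Hypothetical scores for three cells.
beta_like = {"beta": 2.5, "alpha": 0.1, "immune": 0.0}
mixed = {"beta": 1.8, "alpha": 0.2, "immune": 2.1}
unclear = {"beta": 0.3, "alpha": 0.2, "immune": 0.1}

labels = [annotate_cell(cell) for cell in (beta_like, mixed, unclear)]
print(labels)
```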
The analysis also highlighted key dysfunctions in T2D beta cells, such as the unfolded protein response (UPR), which results from an imbalance between protein synthesis and processing capacity. This stress can lead to beta cell dysfunction and death, a hallmark of diabetes progression. Cross-study comparisons further revealed a strong correlation between UPR and N-linked glycosylation, supporting its potential role in T2D pathology. By identifying these key pathways, expiMap allows researchers to better understand the intricate mechanisms that contribute to disease onset and progression, facilitating the development of targeted therapeutic interventions.
Conclusion
The implications of expiMap extend beyond just improving data integration. By offering a more biologically meaningful mapping, expiMap provides researchers with deeper insights into disease mechanisms, which is crucial for advancing personalized medicine. For example, understanding cell-type-specific responses, as demonstrated in the IFN-β treatment of lupus patients, could guide the development of more targeted therapies, reducing side effects and enhancing efficacy. Furthermore, its ability to learn novel gene programs enables the model to adapt to emerging biological challenges, making it a valuable tool for studying rapidly evolving fields such as immunotherapy or pandemics. By addressing interpretability and incorporating domain knowledge, expiMap ensures that the resulting insights are not only accurate but also actionable, paving the way for more precise interventions in complex diseases.
Outsourcing Bioinformatics Analysis: How Bridge Informatics (BI) Can Help
The expiMap model represents a significant advancement in single-cell data analysis, offering a more interpretable and insightful approach to understanding complex biological processes. Its ability to integrate domain knowledge, learn novel gene programs, and adapt to diverse datasets makes it a powerful tool for researchers seeking to unravel the intricacies of disease mechanisms and develop more targeted therapies.
At Bridge Informatics (BI), we are committed to empowering life science companies with cutting-edge bioinformatics tools and expertise. Our team of experienced bioinformaticians can help you leverage advanced methods like expiMap to analyze your single-cell data, extract meaningful insights, and accelerate your research. We understand the importance of clear, biologically relevant results, and we work closely with our clients to ensure their research goals are met.
From pipeline development and data integration to advanced analysis and interpretation, BI can assist you at every stage of your research journey. Schedule a free introductory call to discover how we can help you unlock the full potential of your single-cell data.
Are you interested in reading more about single-cell studies?
Check out our other single-cell-related articles:
- Decoding Differential Expression: Are Your Findings Really True?
- CellMarkerPipe: A Unified Platform for Accurate Cell Type Annotation in Single Cell RNA Sequencing Datasets
- Targeting Senescent Cells Using CAR T Cells: A New Approach to Combat Aging
- Single-Cell and Spatial Transcriptomics Analysis on Non-Small Cell Lung Cancer (NSCLC) Reveals A Population of Tumor Macrophage Hybrid Cell
- scFoundation: A Powerful AI Large-Scale Foundation Model for Single-Cell Research
- scPerturb: A Breakthrough Resource for Single Cell Multi-Omic Perturbation Data Analysis and Integration
- Breakthrough High-Resolution Spatial Multi-Omics: Slide-Tags Unlock Single Cell Analysis
- Discovering the Unseen in Less Time: Light-Seq’s Role in Identifying Rare Retinal Biomarkers
- scGPT: The First AI Large Language Model in Single-Cell RNA Sequencing
- How GPT-4 Provides High Accuracy Cell-Type Annotations in Single Cell RNA Sequencing Experiments
- Optimizing Single Cell Reference Transcriptomes: Improved Illumina Sequencing Analysis Sheds Light on Previously Undetected Cell Types and Gene Expression
- sc-SHC: A Framework for Statistical Testing During Single Cell Clustering
- Revolutionizing Cancer Treatment with AI: How PERCEPTION Uses Single-Cell Sequencing Data to Predict Patient Outcomes
Shahrzad Ghazisaeidi, PhD, Data Scientist, Bridge Informatics
Shahrzad specializes in high-throughput sequencing, data pre-processing, and downstream analysis of transcriptomic and epigenetic landscapes. She is particularly passionate about developing innovative tools for drug repurposing.
Prior to joining Bridge Informatics, Shahrzad served as a Postdoctoral Associate at the Hospital for Sick Children in Toronto, Canada, where she researched the epigenetics of peripheral nerve injury models.
Shahrzad earned her Ph.D. in Physiology and Neuroscience from the University of Toronto. Her studies focused on the sex-dependent and independent gene regulation of peripheral nerve injury. Currently based in Toronto, Shahrzad balances her professional pursuits with personal interests by making time for yoga, martial arts, and poetry.