scFoundation: A Powerful AI Large-Scale Foundation Model for Single-Cell Research

Introducing scFoundation

Large-scale foundation models are revolutionizing natural language processing (NLP) and advancing artificial intelligence (AI) by identifying patterns and relationships within vast datasets. By analogy, a cell can be treated as a sentence whose words are its DNA, RNA, proteins, and gene expression values. Single-cell RNA sequencing (scRNA-seq) generates extensive cellular transcriptomics data, which is well suited to developing these models. Gene expression profiles reveal intricate gene interactions within cells, comparable in complexity to the texts used for training language models.

In recent years, the rapid accumulation of scRNA-seq data has met the data demands of pretraining these models. However, pretraining large-scale models on single-cell data presents significant challenges. These include the need for comprehensive, well-organized gene expression data across many cell types and states, the sheer volume of gene expression data to be handled, and discrepancies in sequencing read depth between techniques, which complicate uniform model training.

scFoundation, a new large pretrained model with 100 million parameters, was developed to address these challenges using data from over 50 million gene expression profiles. Its innovations include an asymmetric architecture for efficient training and scalability, and a read-depth-aware pretraining task designed to model gene co-expression and link cells with different read depths.

The scFoundation AI Pre-Training Framework

The scFoundation pretraining framework models 19,264 genes with approximately 100 million parameters and was pretrained on over 50 million single-cell gene expression profiles, making it a substantial model in the single-cell field. Key components of the framework include a scalable transformer-based design with an asymmetric encoder-decoder architecture and an embedding module that retains raw gene expression values. The Read Depth-Aware (RDA) modeling task extends masked language modeling: the model predicts masked gene expression values from the rest of the cell's profile while accommodating varying read depths. Data were sourced from numerous public single-cell repositories, resulting in a comprehensive dataset spanning over 100 tissue types and various states. The scFoundation model supports downstream tasks such as cell clustering, drug response prediction, and gene module inference.
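To make the read-depth-aware masking idea more concrete, the following is a minimal sketch of how one training example might be built from a raw count vector. It is not the scFoundation implementation: the function name make_rda_sample, the binomial downsampling, the log-normalization, the mask placeholder value, and the two depth indicators are all assumptions chosen for illustration.

```python
import numpy as np

def make_rda_sample(raw_counts, keep_prob=0.5, mask_frac=0.3, seed=0):
    """Build one read-depth-aware training example (hypothetical helper).

    All names, defaults, and normalization choices here are illustrative
    assumptions, not taken from the scFoundation code base.
    """
    rng = np.random.default_rng(seed)
    raw_counts = np.asarray(raw_counts, dtype=np.int64)

    # Binomial thinning imitates sequencing the same cell at a lower depth.
    low_counts = rng.binomial(raw_counts, keep_prob)

    def log_norm(counts):
        # Scale to 10,000 counts per cell, then log1p (a common convention).
        total = max(int(counts.sum()), 1)
        return np.log1p(counts / total * 1e4)

    inputs = log_norm(low_counts)   # what the model sees
    targets = log_norm(raw_counts)  # what the model should predict

    # Hide a random subset of genes; they must be recovered from context.
    n_mask = max(1, int(round(mask_frac * raw_counts.size)))
    mask = np.zeros(raw_counts.size, dtype=bool)
    mask[rng.choice(raw_counts.size, size=n_mask, replace=False)] = True
    inputs_masked = np.where(mask, -1.0, inputs)  # -1.0 is an assumed mask token

    # Assumed depth indicators: depth of the input and depth to predict at.
    source_depth = np.log10(max(int(low_counts.sum()), 1))
    target_depth = np.log10(max(int(raw_counts.sum()), 1))
    return inputs_masked, targets, mask, source_depth, target_depth

# Toy usage with made-up counts for 10 genes.
toy_counts = [10, 0, 5, 200, 3, 0, 40, 7, 1, 12]
x, y, mask, src, tgt = make_rda_sample(toy_counts)
```

Pairing a low-depth input with a higher-depth target is what would let a model trained this way be asked, at inference time, to output a profile at a greater depth than the one it was given.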

One of the standout features of scFoundation is its scalable read-depth enhancement without the need for fine-tuning. Compared to other imputation methods like MAGIC and SAVER, scFoundation excelled in clustering accuracy and alignment with reference data. It was also able to enhance cell embeddings in challenging datasets, demonstrating superior separation of cell types. Additionally, scFoundation showcased its capability to facilitate read-depth enhanced clustering across different batches.

Advancing Cancer Drug Response Predictions

Cancer drug responses (CDRs) describe how tumor cells react to drug interventions. Computational prediction of CDRs is crucial for guiding anticancer drug design and enhancing our understanding of cancer biology. To improve cancer drug response prediction, scFoundation was combined with the DeepCDR method to predict the half-maximal inhibitory concentration (IC50) values of drugs across bulk cell line data. This validated scFoundation’s ability to provide informative embeddings for bulk-level gene expression data, despite being trained on single cells. scFoundation achieved higher predictive accuracy for most drugs and cancer types, and consistently outperformed the original DeepCDR model in drug-blind tests, in which the evaluated drugs are withheld from training. Notably, chemotherapy drugs showed higher prediction accuracy than targeted therapy drugs. Further, gene set enrichment analysis (GSEA) supported the predictions, linking sensitive cell lines to specific signaling pathways. These findings demonstrate scFoundation’s potential to enhance the understanding of drug responses in cancer biology and guide the design of effective anticancer treatments.
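A drug-blind test means the drugs used for evaluation never appear during training. The sketch below shows one simple way to build such a split. The function name drug_blind_split, the record layout, and the toy records are hypothetical placeholders, not the protocol or data used with DeepCDR.

```python
import random

def drug_blind_split(records, test_frac=0.2, seed=0):
    """Split (cell_line, drug, ic50) records so that test drugs are unseen.

    Hypothetical helper for illustration; the record layout is an assumption.
    """
    drugs = sorted({drug for _, drug, _ in records})
    rng = random.Random(seed)
    rng.shuffle(drugs)

    # Hold out whole drugs, not individual measurements.
    n_test = max(1, int(round(test_frac * len(drugs))))
    test_drugs = set(drugs[:n_test])

    train = [r for r in records if r[1] not in test_drugs]
    test = [r for r in records if r[1] in test_drugs]
    return train, test

# Toy records with placeholder names and made-up IC50 values.
records = [
    (f"line_{c}", f"drug_{d}", 0.1 * (c + d))
    for c in range(4)
    for d in range(5)
]
train, test = drug_blind_split(records)
train_drugs = {r[1] for r in train}
held_out_drugs = {r[1] for r in test}
```

Splitting by drug rather than by row matters because a random row-level split would let the model see every drug during training, which overstates how well it generalizes to new compounds.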

Inferring drug sensitivities at the single-cell level can help identify specific cell subtypes with different drug resistance characteristics, providing valuable insights into underlying mechanisms and potential new therapies. However, the majority of drug-response data doesn’t have single-cell resolution. scFoundation was therefore applied to transfer drug-response information learned from bulk-sequencing data to single-cell data. The scFoundation-based model outperformed the existing SCAD model, which took all genes’ expression values as input. Moreover, scFoundation was able to better group cells or bulk cell lines with the same drug response. This performance points to potential gains for precision medicine and targeted cancer therapy.

Advancing Biomedical Applications

Understanding cellular responses to perturbations is critical for biomedical applications and drug design, as it identifies gene interactions and potential drug targets. Using Perturb-seq data, scFoundation was combined with the GEARS model to predict cellular responses. This integration creates cell-specific gene co-expression graphs, improving prediction accuracy. The combined model, trained on three datasets, outperformed the original GEARS model, particularly on challenging two-gene perturbations. It achieved lower error values and predicted gene expression distributions more accurately. Additionally, it excelled in classifying genetic interaction types, identifying more true synergy and suppressor interactions. These findings underscore the value of cell-specific gene context embeddings from scFoundation for precise perturbation prediction, enhancing the ability to model and understand complex cellular behaviors.
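As a rough illustration of what classifying a two-gene perturbation as synergy or suppressor can mean, the sketch below compares the observed combined effect with a simple additive expectation built from the two single-gene effects. This is a deliberately simplified stand-in, not the scoring used by GEARS or scFoundation; the function name classify_interaction, the tolerance, and the toy vectors are assumptions.

```python
import numpy as np

def classify_interaction(delta_a, delta_b, delta_ab, tol=0.2):
    """Label a two-gene perturbation against an additive expectation.

    Each delta_* is a vector of expression changes relative to control cells.
    Simplified, hypothetical scoring for illustration only.
    """
    delta_a, delta_b, delta_ab = (
        np.asarray(v, dtype=float) for v in (delta_a, delta_b, delta_ab)
    )
    expected = delta_a + delta_b  # additive model: effects simply add up
    denom = float(expected @ expected)
    if denom == 0.0:
        return "undefined"

    # Least-squares scale c such that delta_ab is approximately c * expected.
    magnitude = float(delta_ab @ expected) / denom

    if magnitude > 1 + tol:
        return "synergy"      # combined effect larger than additive
    if magnitude < 1 - tol:
        return "suppressor"   # combined effect smaller than additive
    return "additive"

# Toy effect vectors over two genes (made-up numbers).
label_big = classify_interaction([1, 0], [0, 1], [2, 2])
label_small = classify_interaction([1, 0], [0, 1], [0.3, 0.3])
label_same = classify_interaction([1, 0], [0, 1], [1, 1])
```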

These results suggest that scFoundation could improve drug target identification and therapeutic intervention strategies by providing a more accurate picture of gene interactions and cellular responses.

High-Accuracy Cell Type Annotation

Cell type annotation is crucial in single-cell studies, and scFoundation has proven to be highly effective in this task. With only a single layer of its encoder fine-tuned and a prediction layer added, scFoundation was tested against several existing methods on challenging datasets. It consistently achieved the highest accuracy, particularly excelling in identifying rare cell types such as CD4+ T helper 2 and CD34+ cells. scFoundation could clearly separate different cell types, highlighting its ability to use the entire gene set for more precise annotations than methods that rely on a subset of genes. These results demonstrate the strong performance and robustness of scFoundation in cell type annotation tasks, making it a valuable tool in single-cell research.
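The idea of fine-tuning a single encoder layer and adding a prediction layer can be sketched with a toy two-layer network in NumPy. Everything here is a stand-in: the tiny tanh "encoder", the function name fine_tune_last_layer, the random weights, and the synthetic labels are assumptions for illustration and bear no relation to the actual scFoundation architecture or training code.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)  # subtract row max for stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def fine_tune_last_layer(x, y, w_frozen, w_last, n_classes,
                         lr=0.2, steps=300, seed=0):
    """Train only the last encoder layer plus a new prediction head.

    Hypothetical toy setup: w_frozen stands in for the pretrained layers
    that stay fixed, w_last for the one layer that is fine-tuned.
    """
    rng = np.random.default_rng(seed)
    w_last = w_last.copy()
    w_head = rng.normal(scale=0.1, size=(w_last.shape[1], n_classes))
    onehot = np.eye(n_classes)[y]

    h1 = np.tanh(x @ w_frozen)  # frozen layer: computed once, never updated
    losses = []
    for _ in range(steps):
        h2 = np.tanh(h1 @ w_last)
        probs = softmax(h2 @ w_head)
        picked = probs[np.arange(len(y)), y]
        losses.append(float(-np.log(picked + 1e-12).mean()))

        # Backpropagate cross-entropy into the head and the last layer only.
        grad_logits = (probs - onehot) / len(y)
        grad_head = h2.T @ grad_logits
        grad_h2 = grad_logits @ w_head.T
        grad_last = h1.T @ (grad_h2 * (1.0 - h2 ** 2))

        w_head -= lr * grad_head
        w_last -= lr * grad_last
    return w_last, w_head, losses

# Toy usage: synthetic "cells" with 6 features and 2 made-up classes.
rng = np.random.default_rng(1)
x = rng.normal(size=(60, 6))
y = (x[:, 0] > 0).astype(int)
w_frozen = rng.normal(scale=0.5, size=(6, 8))
w_last_init = rng.normal(scale=0.5, size=(8, 8))
w_last, w_head, losses = fine_tune_last_layer(
    x, y, w_frozen, w_last_init, n_classes=2
)
```

Keeping most weights frozen limits the number of trainable parameters, which is one reason this kind of partial fine-tuning is attractive when labeled cells are scarce.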

One advantage of scFoundation is its ability to extend gene expression values into context embeddings, unlike other architectures such as the vanilla MLP model. These embeddings facilitate graph-based downstream methods like GEARS and can infer gene-gene networks. Using gene embeddings from three immune cell types—monocytes, cytotoxic CD8+ T cells, and B cells—researchers validated this usage by clustering genes into modules based on embedding similarity. The results showed that scFoundation effectively identified differentially expressed gene modules for each cell type. Gene enrichment analysis confirmed that these modules were enriched in their respective cell types, indicating that the gene embeddings captured functional relationships among genes. These findings highlight the potential of scFoundation gene embeddings for inferring gene regulatory networks (GRNs) and understanding gene regulation.
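Clustering genes into modules based on embedding similarity can be sketched as follows. The grouping rule used here (cosine similarity above a threshold, then connected components) is a simple stand-in chosen for illustration and may differ from the clustering used in the study; the function name gene_modules, the threshold, and the placeholder gene names and vectors are assumptions.

```python
import numpy as np

def gene_modules(embeddings, gene_names, threshold=0.8):
    """Group genes whose embeddings are similar (hypothetical helper).

    Genes are linked when their cosine similarity is >= threshold, and each
    connected group of linked genes forms one module. Assumes no embedding
    is the zero vector.
    """
    emb = np.asarray(embeddings, dtype=float)
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    sim = unit @ unit.T  # pairwise cosine similarity

    n = len(gene_names)
    labels = [-1] * n
    modules = []
    for start in range(n):
        if labels[start] != -1:
            continue
        # Depth-first search over the similarity graph.
        labels[start] = len(modules)
        stack, members = [start], []
        while stack:
            i = stack.pop()
            members.append(gene_names[i])
            for j in np.flatnonzero(sim[i] >= threshold):
                if labels[j] == -1:
                    labels[j] = len(modules)
                    stack.append(int(j))
        modules.append(sorted(members))
    return modules

# Toy embeddings with placeholder gene names (made-up vectors).
names = ["gene_1", "gene_2", "gene_3", "gene_4"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
modules = gene_modules(vectors, names)
```

In practice the resulting modules would then be checked against known biology, for example with the gene enrichment analysis described above.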

Conclusion

The scFoundation model represents a significant advancement in applying large-scale foundation models to single-cell biology. It performs well across read-depth enhancement, which requires no fine-tuning, as well as drug response prediction, single-cell drug sensitivity prediction, perturbation prediction, and cell type annotation, which needs only a single encoder layer to be fine-tuned. However, scFoundation has limitations, including potential gaps in capturing the full complexity of human biology in the context of disease, substantial computational demands, and an exclusive focus on transcriptomic data without incorporating genomic or epigenomic information. Future improvements could integrate single-cell multi-omic data to better link molecular features with phenotypes and enable more comprehensive modeling. The versatility of scFoundation across these tasks highlights its capability to learn gene expression relationships in different cell types, paving the way for new methods to decode complex molecular systems and supporting a wide range of downstream research.

Outsourcing Bioinformatics Analysis: How Bridge Informatics (BI) Can Help

At Bridge Informatics, we are passionate about empowering life science companies with the latest and most advanced technologies, including tools inspired by large language models (LLMs), such as GPTs, to ensure they stay at the forefront of their fields. BI’s data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. Our bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.

From pipeline development and software engineering to deploying your existing bioinformatic tools, BI can help you at every step of your research journey. As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing, and interpreting genomic and transcriptomic data. Schedule a free introductory call with a member of our team.



Shahrzad Ghazisaeidi, PhD, Data Scientist, Bridge Informatics

Shahrzad specializes in high-throughput sequencing, data pre-processing, and downstream analysis of transcriptomic and epigenetic landscapes. She is particularly passionate about developing innovative tools for drug repurposing.

Prior to joining Bridge Informatics, Shahrzad served as a Postdoctoral Associate at the Hospital for Sick Children in Toronto, Canada where she researched the epigenetics of peripheral nerve injury models.

Shahrzad earned her Ph.D. in Physiology and Neuroscience from the University of Toronto. Her studies focused on the sex-dependent and independent gene regulation of peripheral nerve injury. Currently based in Toronto, Shahrzad balances her professional pursuits with personal interests by making time for yoga, martial arts, and poetry.
