scGPT: The First AI Large Language Model in Single-Cell RNA Sequencing

Introduction

Single-cell sequencing tools are revolutionizing our understanding of cell types and of the cellular heterogeneity that drives disease-specific mechanisms in biological systems, enabling detailed analysis and accelerating personalized therapy development. Recent breakthroughs include the creation of new reference atlases and the development of technologies to integrate data from diverse biological modalities, including epigenomics, transcriptomics, and proteomics. However, these advancements also introduce new challenges, such as the need for innovative tools to manage, integrate, and interpret large multi-modal datasets. One promising solution to this challenge is the family of generative pre-trained transformer (GPT) models. A notable example is GPT-4, which excels at text generation. The model predicts the next word in a sequence based on the preceding words, allowing it to generate coherent and contextually relevant responses across various topics.

Similar to how texts are made up of words, cells are identified by the genes and protein products they express. This concept inspired Cui et al. (2024) to introduce the single-cell GPT (scGPT) foundation model adapted from natural language generation (NLG). This model, pre-trained on 33 million human cells from 51 organs and 441 studies obtained from the CELLxGENE collection, provides deep biological insights and outperforms traditional models in downstream tasks like cell type annotation and multi-omics integration.

scGPT: AI Large Language Model

The training of scGPT consists of two main phases: pre-training and fine-tuning. scGPT’s pre-training strategy is a self-supervised approach tailored to non-sequential gene expression data. In the initial pre-training phase, the model learns to generate the complete gene expression profile of a cell from either its cellular state or existing gene expression cues, which gives it general knowledge about cellular gene expression patterns. Notably, scGPT addresses the non-sequential nature of gene expression data through a specially designed attention mask and a generative training pipeline. This technique allows scGPT to adapt the sequential prediction framework commonly used in natural language generation (NLG) tasks to the unordered structure of gene expression data.
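To make the attention-mask idea more concrete, here is a minimal Python sketch. It assumes a simplified masking rule: each gene token may attend to genes whose expression is already known and to itself, but not to other genes that are still waiting to be predicted. The function name, the gene names, and the known/unknown split are illustrative assumptions, not scGPT's actual code.

```python
def build_attention_mask(is_known):
    """Return mask[i][j] == True when gene token i may attend to gene token j.

    Assumed rule (a simplification): attention is allowed toward known genes
    and toward the query gene itself, never toward other unknown genes.
    """
    n = len(is_known)
    return [[is_known[j] or i == j for j in range(n)] for i in range(n)]


# Hypothetical example: two genes with observed expression, two to be generated.
genes = ["CD3E", "CD19", "MS4A1", "LYZ"]
is_known = [True, True, False, False]

mask = build_attention_mask(is_known)
for gene, row in zip(genes, mask):
    # "x" marks an allowed attention link, "." marks a blocked one.
    print(f"{gene:6s}", " ".join("x" if allowed else "." for allowed in row))
```

Because the rule depends only on which genes are known, not on their position, the genes can be listed in any order, which is what lets a sequence-style model work on unordered expression data.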

In the fine-tuning stage, the pre-trained model is further trained on smaller, task-specific datasets, adapting it to applications such as batch correction, cell type annotation, multi-omic integration, perturbation prediction, and gene regulatory network inference.

scGPT: Mastering Single Cell Annotation

The scGPT model enhances the precision of cell type annotation by leveraging a neural network classifier that transforms the output embeddings from the scGPT transformer into categorical cell type predictions.
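As a rough illustration of that last step, the sketch below maps a cell embedding to a cell type using a single linear layer followed by a softmax. The embedding, weights, and labels are made-up toy values, and the real classifier head in scGPT may be deeper; this only shows the general shape of "embedding in, categorical prediction out".

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    highest = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - highest) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(embedding, weights, biases, labels):
    """Apply one linear layer plus softmax and return the most likely label."""
    logits = [
        sum(w * x for w, x in zip(row, embedding)) + b
        for row, b in zip(weights, biases)
    ]
    probs = softmax(logits)
    best = max(range(len(labels)), key=lambda k: probs[k])
    return labels[best], probs


# Hypothetical values: a 3-dimensional cell embedding and three cell types.
labels = ["alpha cell", "beta cell", "delta cell"]
embedding = [0.9, 0.1, -0.3]
weights = [[2.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 2.0]]
biases = [0.0, 0.0, 0.0]

label, probs = classify(embedding, weights, biases, labels)
print(label, [round(p, 3) for p in probs])
```

During fine-tuning, weights like these would be learned from labeled reference cells rather than set by hand.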

In practice, the scGPT model has demonstrated impressive performance across various datasets. It was applied to a human pancreas dataset and a multiple sclerosis dataset, where it achieved high overall accuracy and high precision for most cell types. In a more complex setting, a tumor-infiltrating myeloid dataset spanning different cancer types, scGPT also displayed strong precision in distinguishing immune cell subtypes. These outcomes underscore the model’s capability to handle diverse biological datasets and its potential to advance single-cell research.
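For readers less familiar with these metrics, the short Python sketch below shows how overall accuracy and per-cell-type precision are typically computed from true and predicted labels. The labels are invented toy data for illustration only and are not results from the scGPT study.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of cells whose predicted type matches the true type."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def per_class_precision(y_true, y_pred):
    """For each predicted cell type, the fraction of those calls that are correct."""
    predicted = Counter(y_pred)
    correct = Counter(p for t, p in zip(y_true, y_pred) if t == p)
    return {label: correct[label] / count for label, count in predicted.items()}


# Toy annotations (hypothetical): one beta cell is mislabeled as an alpha cell.
y_true = ["alpha", "alpha", "beta", "beta", "delta", "delta"]
y_pred = ["alpha", "alpha", "beta", "alpha", "delta", "delta"]

overall = accuracy(y_true, y_pred)
precision = per_class_precision(y_true, y_pred)
print(f"accuracy = {overall:.3f}")
for label, value in sorted(precision.items()):
    print(f"precision[{label}] = {value:.3f}")
```

Reporting precision per cell type matters because a model can score well overall while still mislabeling a rare population.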

scGPT Integrates Datasets and Reveals Gene Regulatory Networks

Beyond assisting with cell type annotation, scGPT can dive deeper by predicting how genes will respond to unseen perturbations, based on its understanding of gene interactions learned from existing experiments. The performance of scGPT was assessed using three Perturb-seq datasets from a leukemia cell line. scGPT excelled in predicting post-perturbation changes, consistently outperforming other models. Additionally, scGPT accurately predicted the trend of expression changes for all differentially expressed genes. scGPT could also predict CRISPR target genes that influence cells to recover from a cell state.
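One way to check whether a model gets the "trend" of expression changes right is to compare predicted and observed changes relative to control cells. The sketch below does this with a Pearson correlation on the changes and a simple direction-agreement score. The expression values are invented, and these two metrics are an assumption about how such an evaluation could be set up, not a reproduction of the benchmark used for scGPT.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length lists of numbers."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def direction_agreement(pred_delta, obs_delta):
    """Fraction of genes whose predicted change has the same sign as observed."""
    same = sum((p > 0) == (o > 0) for p, o in zip(pred_delta, obs_delta))
    return same / len(obs_delta)


# Hypothetical mean expression of four differentially expressed genes.
control = [5.0, 2.0, 8.0, 1.0]
observed_post = [7.0, 1.0, 8.5, 3.0]
predicted_post = [6.5, 1.5, 9.0, 2.0]

# Compare changes relative to control rather than raw expression levels.
obs_delta = [o - c for o, c in zip(observed_post, control)]
pred_delta = [p - c for p, c in zip(predicted_post, control)]

r = pearson(pred_delta, obs_delta)
agreement = direction_agreement(pred_delta, obs_delta)
print(f"Pearson on deltas = {r:.3f}, direction agreement = {agreement:.2f}")
```

Working with changes instead of raw expression keeps highly expressed but unaffected genes from inflating the score.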

scGPT has been rigorously tested and validated across diverse and complex datasets, including multi-batch scRNA-seq integration and multimodal datasets, where it effectively integrates transcriptomic data with epigenomic data. scGPT has also been shown to accurately capture gene-gene interactions at the single-cell level, providing insights into context-specific gene regulatory interactions within individual cells.
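A common way to explore gene-gene relationships in a model like this is to compare the embedding vectors it learns for each gene. The sketch below links genes whose embeddings have high cosine similarity. The three-dimensional vectors and the 0.9 threshold are toy assumptions chosen for illustration, and this is only one possible approach, not necessarily the exact procedure used with scGPT.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def similarity_edges(embeddings, threshold):
    """Return (gene_a, gene_b, score) for every pair at or above the threshold."""
    edges = []
    for gene_a, gene_b in combinations(sorted(embeddings), 2):
        score = cosine(embeddings[gene_a], embeddings[gene_b])
        if score >= threshold:
            edges.append((gene_a, gene_b, round(score, 3)))
    return edges


# Hypothetical gene embeddings: two related HLA genes and one T-cell marker.
embeddings = {
    "HLA-DRA": [0.9, 0.1, 0.0],
    "HLA-DRB1": [0.8, 0.2, 0.1],
    "CD3E": [0.0, 0.1, 0.9],
}

edges = similarity_edges(embeddings, threshold=0.9)
for gene_a, gene_b, score in edges:
    print(f"{gene_a} -- {gene_b} (cosine = {score})")
```

With real embeddings, the resulting edges would be treated as candidate interactions to check against known pathways, not as confirmed regulatory links.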

This robust performance highlights scGPT’s ability to provide deeper insights into the intricate biology of cells, aiding in advanced scientific discoveries and applications. Despite its successes, scGPT still faces challenges, such as sensitivity of its performance to batch effects and the difficulty of integrating spatial omics data.

Outsourcing Bioinformatics Analysis: How Bridge Informatics (BI) Can Help

At Bridge Informatics, we are passionate about empowering life science companies with the latest and most advanced technologies, including large language model (LLM)-inspired tools such as GPTs, to ensure they stay at the forefront of their fields. BI’s data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. Our bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.

From pipeline development and software engineering to deploying your existing bioinformatic tools, BI can help you on every step of your research journey. As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Click here to schedule a free introductory call with a member of our team.



Shahrzad Ghazisaeidi, PhD, Data Scientist, Bridge Informatics

Shahrzad specializes in high-throughput sequencing, data pre-processing, and downstream analysis of transcriptomic and epigenetic landscapes. She is particularly passionate about developing innovative tools for drug repurposing.

Prior to joining Bridge Informatics, Shahrzad served as a Postdoctoral Associate at the Hospital for Sick Children in Toronto, Canada where she researched the epigenetics of peripheral nerve injury models.

Shahrzad earned her Ph.D. in Physiology and Neuroscience from the University of Toronto. Her studies focused on the sex-dependent and independent gene regulation of peripheral nerve injury. Currently based in Toronto, Shahrzad balances her professional pursuits with personal interests by making time for yoga, martial arts, and poetry.


Haider M. Hassan, Data Scientist, Bridge Informatics

Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.

Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks. If you’re interested in reaching out, please email [email protected] or [email protected]
