scGPT: The First AI Large Language Model in Single-Cell RNA Sequencing

Introduction

Single-cell sequencing tools are revolutionizing our understanding of cell types and of the cellular heterogeneity that drives disease-specific mechanisms in biological systems, enabling detailed analysis and accelerating personalized therapy development. Recent breakthroughs include new reference atlases and technologies that integrate data from diverse biological modalities, including epigenomics, transcriptomics, and proteomics. However, these advances also introduce new challenges, such as the need for innovative tools to manage, integrate, and interpret large multi-modal datasets. One promising solution is the family of generative pre-trained transformer (GPT) models. A notable example is GPT-4, which excels at text generation. The model predicts the next word in a sequence based on the preceding words, allowing it to generate coherent and contextually relevant responses across various topics.

Similar to how texts are made up of words, cells are identified by the genes and protein products they express. This concept inspired Cui et al. (2024) to introduce the single-cell GPT (scGPT) foundation model adapted from natural language generation (NLG). This model, pre-trained on 33 million human cells from 51 organs and 441 studies obtained from the CELLxGENE collection, provides deep biological insights and outperforms traditional models in downstream tasks like cell type annotation and multi-omics integration.

scGPT: AI Large Language Model

The training of scGPT consists of two main phases: pre-training and fine-tuning. scGPT's pre-training strategy is a self-supervised approach tailored to non-sequential gene expression data. In this initial phase, the model learns to generate a cell's gene expression from cell states and gene expression cues, which gives it general knowledge of cellular gene expression patterns. Notably, scGPT addresses the non-sequential nature of gene expression data through a specially designed attention mask and a generative training pipeline. This technique allows scGPT to adapt the sequential-prediction framework commonly used in natural language generation (NLG) tasks to the unique structure of gene expression data. In effect, during pre-training scGPT learns to predict the complete gene expression profile of a cell from either its cellular state or existing gene expression cues.
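To make the attention-mask idea concrete, here is a minimal sketch, not scGPT's actual implementation. It assumes a simplified rule in which every token may attend to genes whose expression is already known, plus itself, so genes still to be predicted never see one another. The function name `build_generative_mask` and the toy `known` vector are hypothetical.

```python
import numpy as np

def build_generative_mask(is_known):
    """Build a boolean attention mask for order-free generation.

    mask[i, j] is True when query token i may attend to key token j.
    Assumed rule (a simplification for illustration): every token can
    attend to known genes and to itself, but never to other unknown genes.
    """
    is_known = np.asarray(is_known, dtype=bool)
    n = is_known.size
    # Each row repeats the "known" flags: known genes are visible to all queries.
    mask = np.tile(is_known, (n, 1))
    # Every token may also attend to itself.
    mask |= np.eye(n, dtype=bool)
    return mask

# Toy example: genes 0 and 1 are known, genes 2 and 3 are to be predicted.
known = [True, True, False, False]
mask = build_generative_mask(known)
print(mask.astype(int))
```

Because visibility depends on which genes are known rather than on position, the mask imposes no left-to-right order, which is the property that matters for non-sequential expression data.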

In the fine-tuning stage, the pre-trained model is adapted to smaller, task-specific datasets, which prepares it for applications such as batch correction, cell type annotation, multi-omic integration, perturbation prediction, and gene regulatory network inference.

scGPT: Mastering Single Cell Annotation

The scGPT model enhances the precision of cell type annotation by leveraging a neural network classifier that transforms the output embeddings from the scGPT transformer into categorical cell type predictions.
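The step from embeddings to cell type labels can be sketched as follows. This is an illustrative toy, not the scGPT code: the embeddings are random stand-ins for transformer outputs, the classifier is a single untrained linear layer with softmax, and the label names and the function `classify_cells` are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Row-wise softmax, shifted by the row maximum for numerical stability."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)

def classify_cells(embeddings, weights, bias, labels):
    """Map cell embeddings to categorical cell type predictions."""
    logits = embeddings @ weights + bias      # shape: (cells, classes)
    probs = softmax(logits)                   # class probabilities per cell
    predicted = [labels[i] for i in probs.argmax(axis=1)]
    return predicted, probs

rng = np.random.default_rng(0)
cell_embeddings = rng.normal(size=(5, 8))     # assumed: 5 cells, 8-dim embeddings
cell_type_labels = ["alpha", "beta", "delta"] # hypothetical label set
W = rng.normal(size=(8, 3))                   # untrained weights, for shape only
b = np.zeros(3)

predicted, probs = classify_cells(cell_embeddings, W, b, cell_type_labels)
print(predicted)
```

In a real workflow the weights would be learned during fine-tuning on labeled cells; here they only show how the shapes fit together.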

In practice, scGPT has performed well across several datasets. Applied to a human pancreas dataset and a multiple sclerosis dataset, it achieved high overall accuracy and high precision for most cell types. In a more complex setting, a tumor-infiltrating myeloid dataset spanning different cancer types, scGPT showed strong precision in distinguishing immune cell subtypes. These outcomes underscore the model's capability to handle diverse biological datasets and its potential to advance single-cell research.

scGPT Integrates Datasets and Reveals Gene Regulatory Networks

Beyond cell type annotation, scGPT can predict how genes will respond to unseen perturbations, based on the gene interactions it has learned from existing experiments. Its performance was assessed on three Perturb-seq datasets from a leukemia cell line, where it predicted post-perturbation changes more accurately than the other models tested. scGPT also correctly predicted the direction of expression change for all differentially expressed genes. In addition, it could predict the CRISPR target genes that influence cells to recover from a given cell state.
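One simple way to check whether a model captures the direction of expression change is to compare the sign of predicted and observed changes relative to control. The sketch below is an assumption for illustration, not the evaluation used for scGPT; the function `direction_agreement` is hypothetical and the numbers are made up.

```python
import numpy as np

def direction_agreement(control, observed, predicted):
    """Fraction of changed genes whose predicted change has the observed sign."""
    observed_change = observed - control
    predicted_change = predicted - control
    # Only score genes that actually changed after the perturbation.
    changed = observed_change != 0
    if not changed.any():
        return float("nan")
    agree = np.sign(observed_change[changed]) == np.sign(predicted_change[changed])
    return float(agree.mean())

# Toy expression values for four genes (made-up numbers).
control = np.array([1.0, 2.0, 3.0, 4.0])
observed = np.array([2.0, 1.0, 3.5, 4.5])    # measured after perturbation
predicted = np.array([1.5, 1.2, 2.8, 4.2])   # model output after perturbation

score = direction_agreement(control, observed, predicted)
print(score)
```

Here three of the four genes move in the predicted direction, so the score is 0.75. A sign-based score ignores the size of the change, so it is usually reported alongside a magnitude-aware metric.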

scGPT has been tested and validated across diverse and complex datasets, including multi-batch scRNA-seq integration and multimodal datasets in which it integrates transcriptomic and epigenomic data. It has also been shown to capture gene-gene interactions at the single-cell level, providing insights into context-specific gene regulatory interactions within individual cells.
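A common way to explore gene-gene relationships from learned representations is to compare gene embeddings by cosine similarity. The sketch below assumes that approach purely for illustration; it is not presented as scGPT's procedure, and the gene names, vectors, and helper functions are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(embeddings):
    """Pairwise cosine similarity between the rows of an embedding matrix."""
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / norms
    return unit @ unit.T

def top_partner(similarity, genes):
    """For each gene, return the most similar other gene."""
    sim = similarity.copy()
    np.fill_diagonal(sim, -np.inf)   # exclude self-matches
    return {g: genes[j] for g, j in zip(genes, sim.argmax(axis=1))}

genes = ["GENE_A", "GENE_B", "GENE_C"]       # placeholder gene names
embeddings = np.array([[1.0, 0.0],           # toy 2-dim gene embeddings
                       [0.9, 0.1],
                       [0.0, 1.0]])

similarity = cosine_similarity_matrix(embeddings)
partners = top_partner(similarity, genes)
print(partners)
```

Thresholding such a similarity matrix gives a candidate gene network, which would still need to be checked against known pathways or experimental evidence.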

This robust performance highlights scGPT's ability to provide deeper insights into the intricate biology of cells, aiding advanced scientific discoveries and applications. Despite its successes, scGPT still faces challenges, such as batch effects that affect performance and integration with spatial omics data.

Outsourcing Bioinformatics Analysis: How Bridge Informatics (BI) Can Help

At Bridge Informatics, we are passionate about empowering life science companies with the latest and most advanced technologies, including tools inspired by large language models (LLMs), such as GPTs, to ensure they stay at the forefront of their fields. BI's data scientists prioritize studying, understanding, and reporting on the latest developments so we can advise our clients confidently. Our bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis.

From pipeline development and software engineering to deploying your existing bioinformatic tools, BI can help you on every step of your research journey. As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Click here to schedule a free introductory call with a member of our team.
