The ‘Meat and Potatoes’ of Bulk RNA-seq Data Analysis
For more than a decade, RNA extraction followed by massively parallel sequencing (aka RNA-seq) has provided unbiased snapshots of cellular gene expression to scientists worldwide. With this information, researchers have identified key genes responsible for a number of conditions, from cancer to autoimmune diseases.
In this blog post, I’ll cover the basics of how RNA-seq data are analyzed so that you can better understand and interpret the results in the context of bench-based biological experiments.
Understanding Sequencing Data
Before starting any sort of data analysis, it’s important to first understand what sequencing data are and where they come from. Sequencing is the process of reading and reporting the order of nucleotides in the molecules being analyzed (DNA or RNA).
When studying gene expression, RNA is first reverse-transcribed into complementary DNA (cDNA), which is more stable and easier to work with than RNA. In a typical workflow, adapters are then ligated to the ends of the cDNA fragments, the resulting library is amplified, and its quality is checked before sequencing. The adapters are used to “label” the samples and to allow the actual enzymatic sequencing reactions to occur.
Aligning Your Data To a Reference Genome
Once you have an understanding of where raw sequencing data comes from, you can move on to processing the output of the sequencing reactions. The sequencer will produce (very) large files containing the raw sequence reads, essentially a series of carbon copies of the content of the cDNA libraries you prepared.
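To make "raw sequence reads" a little more concrete, here is a minimal sketch of reading such a file. It assumes the reads are stored in the widely used FASTQ format (four lines per read) with Phred+33 quality encoding. The post itself does not name a file format, so treat that as an assumption. The two reads and the function names are made up for illustration.

```python
import io

def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().strip()
        if not header:
            break  # end of file (or a blank line) ends parsing
        sequence = handle.readline().strip()
        handle.readline()  # the '+' separator line carries no information here
        quality = handle.readline().strip()
        yield header[1:], sequence, quality  # drop the leading '@'

def mean_phred(quality, offset=33):
    """Average base quality, assuming Phred+33 ASCII encoding (an assumption)."""
    return sum(ord(char) - offset for char in quality) / len(quality)

# Two invented reads, used only to show the record layout
example = io.StringIO("@read1\nACGTACGT\n+\nIIIIIIII\n@read2\nTTGCA\n+\n!!!!!\n")
records = list(parse_fastq(example))
for read_id, sequence, quality in records:
    print(read_id, len(sequence), mean_phred(quality))
```

Real files hold millions of such records and are usually compressed, so in practice dedicated quality-control tools are used rather than hand-written parsers.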
The first step is to align your reads to a reference genome or transcriptome. Alignment places each short read at the position in the reference sequence that it matches best, turning a disorganized pile of reads into a map of where each read originated. After alignment, you can assign reads to annotated genes, count the reads per gene, and calculate expression values.
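The idea of matching reads to a reference and then turning the matches into expression values can be sketched with a deliberately tiny example. The snippet below is a toy, not how production aligners work: it uses exact substring matching against two invented "transcripts", whereas real aligners use indexed references and tolerate mismatches and splicing. Counts per million (CPM) is used here as one simple expression measure, chosen for illustration.

```python
from collections import Counter

# Hypothetical toy transcriptome: gene name -> made-up sequence
transcripts = {
    "geneA": "ATGGCGTACGTTAGCCTAGGA",
    "geneB": "ATGCCCTTTGGGAAACCCTTT",
}

# Invented reads; "GGGGGGG" matches nothing and stays unassigned
reads = ["GCGTACG", "TAGCCTA", "CCTTTGG", "GGGGGGG", "CGTTAGC"]

def assign_read(read, transcripts):
    """Return the one gene whose sequence contains the read exactly, else None."""
    hits = [gene for gene, sequence in transcripts.items() if read in sequence]
    return hits[0] if len(hits) == 1 else None  # skip unmatched or ambiguous reads

counts = Counter()
for read in reads:
    gene = assign_read(read, transcripts)
    if gene is not None:
        counts[gene] += 1

def counts_per_million(counts):
    """Scale raw counts so that samples of different sequencing depth are comparable."""
    total = sum(counts.values())
    return {gene: count / total * 1_000_000 for gene, count in counts.items()}

cpm = counts_per_million(counts)
print(dict(counts))
print(cpm)
```

Scaling by the total number of assigned reads matters because two samples are rarely sequenced to exactly the same depth, so raw counts alone are not directly comparable.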
Using Statistical Methods
Once all reads are aligned and annotated, it’s time to use statistical methods for further analysis. These methods allow researchers to compare different samples or datasets and draw conclusions about the differences between them. The core of comparing different samples is differential gene expression analysis (DEA). Multiple DEA methods exist, each using statistical models to determine which gene transcripts are significantly more or less abundant in one group of samples than in another.
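As a rough illustration of what a differential expression comparison computes, the sketch below calculates a log2 fold change and a Welch t-statistic per gene. This is a simplification chosen for readability: established DEA tools fit count-based statistical models and correct for testing many genes at once, neither of which is shown here. The expression values, group names, and pseudocount are all invented.

```python
import math
from statistics import mean, variance

def log2_fold_change(control, treated, pseudocount=1.0):
    """log2 ratio of group means; the pseudocount avoids division by zero."""
    return math.log2((mean(treated) + pseudocount) / (mean(control) + pseudocount))

def welch_t(control, treated):
    """Welch's t-statistic: difference in means scaled by its standard error."""
    standard_error = math.sqrt(
        variance(control) / len(control) + variance(treated) / len(treated)
    )
    return (mean(treated) - mean(control)) / standard_error

# Made-up normalized expression values, three replicates per group
expression = {
    "geneA": {"control": [6.0, 7.0, 8.0], "treated": [29.0, 31.0, 33.0]},
    "geneB": {"control": [100.0, 98.0, 105.0], "treated": [101.0, 99.0, 103.0]},
}

results = {}
for gene, groups in expression.items():
    results[gene] = (
        log2_fold_change(groups["control"], groups["treated"]),
        welch_t(groups["control"], groups["treated"]),
    )
    print(gene, results[gene])
```

In this toy data, geneA shows a large fold change that is also large relative to the variation between replicates, while geneB does not change. Both pieces of information, effect size and replicate variability, are what real DEA methods weigh when calling a gene differentially expressed.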
To better understand the differences among samples, other analytic approaches are also used. These include principal component analysis (PCA), which summarizes the main sources of variance across samples in a few dimensions, and multiple flavors of hierarchical clustering, which visually highlight the similarities among genes and samples. Using these methods, scientists can gain a deeper understanding of the sequence data and identify key genes associated with the specific traits or conditions they are researching.
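To show what PCA does with expression data, here is a minimal sketch that finds the first principal component by power iteration and projects each sample onto it. The log-expression matrix is invented, and power iteration is used only to keep the example dependency-free. In practice a numerical library would normally do this step.

```python
import math

# Made-up log-expression values: rows are samples, columns are three genes
samples = ["control1", "control2", "treated1", "treated2"]
matrix = [
    [3.0, 8.0, 5.0],
    [3.2, 8.1, 5.1],
    [6.0, 8.0, 2.0],
    [6.1, 7.9, 2.2],
]

def center_columns(rows):
    """Subtract each gene's mean so PCA describes variation around the average."""
    means = [sum(column) / len(rows) for column in zip(*rows)]
    return [[value - m for value, m in zip(row, means)] for row in rows]

def covariance(centered):
    """Gene-by-gene sample covariance matrix of already centered data."""
    columns = list(zip(*centered))
    n = len(centered)
    return [[sum(a * b for a, b in zip(ci, cj)) / (n - 1) for cj in columns]
            for ci in columns]

def first_component(cov, iterations=200):
    """Power iteration: repeated multiplication converges to the top eigenvector.
    Assumes the start vector is not orthogonal to that eigenvector."""
    vector = [1.0] + [0.0] * (len(cov) - 1)
    for _ in range(iterations):
        vector = [sum(c * v for c, v in zip(row, vector)) for row in cov]
        norm = math.sqrt(sum(v * v for v in vector))
        vector = [v / norm for v in vector]
    return vector

centered = center_columns(matrix)
pc1 = first_component(covariance(centered))
scores = [sum(v * w for v, w in zip(row, pc1)) for row in centered]
for name, score in zip(samples, scores):
    print(name, round(score, 2))
```

The two control samples end up on one side of the first component and the two treated samples on the other, which is the kind of grouping a PCA plot is meant to reveal. The sign of a principal component is arbitrary, so only the separation matters, not which group is positive.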
RNA-seq: A Powerful Bioinformatic Tool
RNA-seq data analysis is essential to gain detailed insights into the levels at which genes are expressed within a cell. The tightly regulated differential expression of genes is intimately related to cellular health and survival. When this process goes smoothly, cells function correctly and survive, while disruptions in the process can cause cells to become defective and pathogenic.
With all of these components working in unison, researchers can better understand the molecular mechanisms behind biological processes, design more informative studies, and develop treatments that could improve patient outcomes.
Outsourcing Bioinformatics Analysis: How We Can Help
The applications of RNA-seq technologies are innumerable, and our clients are at the forefront of tackling these research questions with sophisticated bioinformatics approaches. However, transforming raw sequence data of any kind into actionable biological insights is no small feat.
As experts across data types from cutting-edge sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing, and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis. Click here to schedule a free introductory call with a member of our team.
Dan Ryder, MPH, PhD
Dan is the founder and CEO of Bridge Informatics, a professional services firm helping pharmaceutical companies translate genomic data into medicine. Unlike other data analytics firms, Bridge builds lasting improvements in communication between its clients’ biological and computational scientists. Dan is particularly passionate about improving communication between people of different scientific backgrounds, enabling bioinformaticians and software engineers to succeed together.
Prior to forming Bridge Informatics, Dan served in a variety of roles helping pharmaceutical clients solve early-phase drug discovery and development challenges.
Dan received both a Ph.D. in Biochemistry and Molecular Biology and an MPH in Disease Control from the University of Texas Health Science Center at Houston (UTHealth Houston). He completed his postdoctoral studies in Molecular Pathways of Energy Metabolism at the University of Florida College of Medicine. Dan received his undergraduate degree in Microbiology from the University of Texas at Austin.