February 7, 2022
The cost of sequencing an entire human genome has gone down exponentially in the last two decades. Although the low cost of the technology facilitates much more sequencing than was previously possible, it has led to such high volumes of data that organizing and analyzing genomic data has now become the rate limiting step in genomics.
Limitations of Current Genomic Data Warehousing
A lot of the time, the energy and cost of making sense of genomic data come from the limitations of existing tools for genomics analysis. Many existing tools are single nodes, which makes them very challenging to scale up, and many works as command-line tools, making it difficult to link them together into a complex workflow.
Fortunately, some companies are creating data warehouse architecture platforms to address these challenges. Interestingly, they’ve taken existing tools in big data and applied them to the life sciences. Some companies have created what they call a “Unified Analytics Platform for Genomics” using Apache Spark, a unified analytics engine that is part of the largest open-source project in data processing.
Open-Source Code for Maximum Speed, Ease and Support
Apache Spark is a general-purpose, open-source, multi-language big data engine that can process up to petabytes of information on clusters of thousands of nodes. Apache Spark can also be leveraged for machine learning. Apache Spark is extremely fast and has many existing APIs and standard libraries that provide a lot of ease and support for its users.
Companies like Snowflake and Databricks have taken the capacity, scalability, and speed of the Apache Spark platform and used it to create optimized genomic analysis workflows. A great example is the Databricks variant called DNASeq pipeline, which can run a whole genome sample with 30X coverage in under 30 minutes. This is an essential step in transforming raw sequence data into a usable format by assembling the genome and identifying the variants present, and Databricks leverages Apache Spark to match best practices from single-node pipelines across the whole cluster.
What’s Unique About How Databricks is Using Apache Spark?
Following data ingestion, further analyses require additional data processing steps. Databricks has created a proprietary extension for overlap joins in the sequence data that has 100X more capacity than the best existing method, and their team is developing many more tools on the Apache Spark platform to make genomic analysis faster, among other things. Other projects include a slightly different version of their DNASeq pipeline with a different variant calling pipeline specific to analyzing sequences of cancer DNA and creating an optimized pipeline for RNASeq data as well.
Save Time, Money, and Blaze New Trails in Bioinformatics
Leveraging open-source tools and cloud computing to create better tools for genomics is essential for realizing the promise that big data holds in the life sciences. It saves time and money via fewer expenditures on cloud computing and storage costs. The speed of these tools will be vital going forward: clinicians and researchers will need raw data to go from the sequencer to actionable information as quickly as possible as genomics is brought into clinical settings. Additionally, these general-purpose big data platforms are not too specialized to one data subtype so they can be used to integrate different data types, like single-cell data, gene expression data, and even genomic information with clinical observations (aka genotype to phenotype connections).
Accessibility is another important aspect of using mostly open-source tools. These platforms have notebooks and examples for users to work through and learn how to use the platform or apply it immediately to their own datasets. Projects like the Unified Analytics Platform for Genomics and Apache Spark will continue to help researchers realize the promise of genomic data for advancing medicine and biology.
Special Note: this article discusses the application of Databricks, one of several wonderful data warehouse architecture platforms. Each one helps scientists to quickly manage, process and analyze genomic data to save time and cloud costs. It’s important to note that almost all data warehouse architecture platforms use the open-source code Apache Spark. This was the focus of this article. Point being , Bridge Informatics is vendor agnostic and will discuss similar data warehouse architecture platforms like Snowflake, Cloudera, or AWS Redshift, in later blogs.
Dan Ryder, CEO & Managing Director, Bridge Informatics
Dan is the founder of Bridge Informatics, a greater Boston-based consulting firm that focuses on bioinformatics and software development. Bridge Informatics builds tools for life science with a concentration on data mining, machine learning, and various bioinformatic techniques to discover biomarkers and companion diagnostics. If you’re interested in reaching out, he can be contacted at [email protected].
Jane Cook, Journalist & Content Writer, Bridge Informatics
Jane is a Content Writer at Bridge Informatics, a professional services firm that helps biotech customers implement advanced techniques in the management and analysis of genomic data. Bridge Informatics focuses on data mining, machine learning, and various bioinformatic techniques to discover biomarkers and companion diagnostics. If you’re interested in reaching out, please email [email protected] or [email protected].