What is metagenomics?
Metagenomics is a high-throughput sequencing and analysis technique widely used for complex microbial DNA samples derived from environmental or clinical sources. The growing threat posed by multi-drug resistant strains of bacteria and the increased understanding of the role of microbes in human health have driven the expansion of metagenomic research. This is further fueled by a growing understanding of the role of microbes in complex human diseases, such as cancer and diabetes, and in responses to immunotherapies.
An important aspect of metagenomic research involves the taxonomic classification and abundance estimation of bacterial genomes, or metagenomes, at the genus and species levels. Additionally, the stratification of bacteria down to the strain level is often necessary to understand bacterial population dynamics and epidemiology, and to develop therapeutic options.
What are the current approaches to the taxonomic classification of bacteria?
Currently, there are two approaches for bacterial taxonomic classification: alignment-based and alignment-free. Each approach has its own inherent benefits and drawbacks. The alignment-based tools, such as GATK PathSeq, blastn, MetaPhlAn, MEGAN, and PathoScope, offer the benefit of increased sensitivity and accuracy at the cost of being computationally intensive and time-consuming. Among the alignment-based tools, MetaPhlAn is an exception since it uses a compressed reference database of bacterial species- or strain-specific marker genes for classification. Although it is more efficient, MetaPhlAn suffers from lower sensitivity and accuracy since only a subset of the total metagenomic reads is utilized for taxonomic quantification.
On the other hand, alignment-free tools, such as Kraken2, Bracken, CLARK, and Centrifuge, require relatively less computational power at the cost of lower sensitivity and an increased rate of false positives. Additionally, alignment-free methods are often insufficient for strain-level bacterial classification.
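To make the strain-level limitation concrete, here is a minimal sketch of alignment-free classification by exact k-mer lookup. It is a toy illustration only and not the algorithm of any tool named above: the genome sequences, the names `build_index` and `classify`, and the choice of k = 8 are all assumptions made for the example, and real tools rely on far more compact indexes and taxonomy-aware scoring.

```python
from collections import Counter, defaultdict

def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def build_index(genomes, k):
    """Map each k-mer to the set of genome names that contain it."""
    index = defaultdict(set)
    for name, seq in genomes.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index

def classify(read, index, k):
    """Return the genome with the most unambiguous k-mer hits, or None."""
    votes = Counter()
    for kmer in kmers(read, k):
        owners = index.get(kmer, ())
        if len(owners) == 1:  # only k-mers specific to one genome vote
            votes[next(iter(owners))] += 1
    if not votes:
        return None
    ranked = votes.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the top two candidates
    return ranked[0][0]

# Hypothetical toy genomes: the two strains differ only in their last base.
genomes = {
    "strain_A": "ACGTACGTTAGCCGATAGGCTA",
    "strain_B": "ACGTACGTTAGCCGATAGGCTT",
    "species_C": "TTGACCGGTAACCTGGATCCAA",
}
index = build_index(genomes, k=8)

print(classify("GGTAACCTGGAT", index, k=8))  # species_C
print(classify("ACGTACGTTAGC", index, k=8))  # None
```

The second read comes from a region shared by both strains, so none of its k-mers is specific to either one and the read stays unclassified. Lookups are fast, but closely related strains leave few discriminating k-mers.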
CAMMiQ for Strain Classification of Bacteria
In a recent paper published in Nature Communications, Zhu et al. introduce CAMMiQ (Combinatorial Algorithms for metagenomic Microbial Quantification), a tool for strain-level taxonomic stratification of bacteria with high accuracy and sensitivity.
CAMMiQ provides a two-step solution to bacterial classification and quantitation from metagenomic sequences. In the first step, it generates a database of indexed genomes from a list of reference genomes or assemblies. In the second step, it uses doubly unique, variable-length metagenomic sequences to query the reference database for taxonomic classification and abundance measurement. In short, CAMMiQ uses a data structure composed of the shortest possible sequences, or strings, that occur in exactly two bacterial genomes in the indexed database.
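The idea of shortest substrings that occur in a fixed number of genomes can be sketched in a few lines of Python. This is a brute-force illustration under stated assumptions, not CAMMiQ's actual index or its abundance estimation: the three toy sequences and the function names `substring_owners` and `shortest_with_n_owners` are hypothetical, reverse complements are ignored, and the single-genome case (n = 1) is included only for comparison.

```python
from collections import defaultdict

def substring_owners(genomes, max_len):
    """Map every substring up to max_len to the set of genomes containing it."""
    owners = defaultdict(set)
    for name, seq in genomes.items():
        for length in range(1, max_len + 1):
            for i in range(len(seq) - length + 1):
                owners[seq[i:i + length]].add(name)
    return owners

def shortest_with_n_owners(owners, n):
    """Keep substrings found in exactly n genomes that contain no shorter
    substring with the same property."""
    hits = {s for s, names in owners.items() if len(names) == n}
    minimal = {}
    for s in hits:
        # Every proper substring lies inside the prefix or the suffix of
        # length len(s) - 1, so checking those two is sufficient.
        if s[:-1] in hits or s[1:] in hits:
            continue
        minimal[s] = frozenset(owners[s])
    return minimal

# Hypothetical toy references.
genomes = {
    "strain_1": "ACGTA",
    "strain_2": "ACGTC",
    "strain_3": "GGTCA",
}
owners = substring_owners(genomes, max_len=4)

unique = shortest_with_n_owners(owners, 1)  # found in exactly one genome
doubly = shortest_with_n_owners(owners, 2)  # found in exactly two genomes

for label, table in (("unique", unique), ("doubly unique", doubly)):
    for s in sorted(table, key=lambda x: (len(x), x)):
        print(label, s, sorted(table[s]))
```

In this toy data, the shortest string specific to `strain_2` alone is `CGTC`, four bases long, while no 2-mer or 3-mer singles it out. Shorter strings such as `TC` still narrow a read down to a pair of genomes, which is the kind of extra signal that variable-length, doubly unique strings are meant to capture.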
Because it uses variable-length strings rather than the fixed-length k-mers that many other tools rely on, CAMMiQ samples a larger selection of metagenomic sequences for better coverage. This makes the mapping of metagenomic sequences efficient and yields higher accuracy than competitors such as GATK PathSeq and blastn. In their comparative benchmarking study, the authors show that CAMMiQ reduces the rate of false positives and correctly classifies more bacterial genomes than its competitors. CAMMiQ was also applied to a single-cell dataset for strain-level detection, where its accuracy and precision were comparable to those of GATK PathSeq and blastn.
Overall, CAMMiQ paves the way for more accurate, efficient, and sensitive strain-level classification and quantitation of bacteria in metagenomic research. Its application in health research, as well as in single-cell studies, opens a path to a better understanding of how bacterial population dynamics change in human health and disease.
Outsourcing Bioinformatics Analytics: How We Can Help
Our clients are using cutting-edge bioinformatics tools, including metagenomics and RNA-Seq, to answer pressing biological research questions. From pipeline development and software engineering to deploying existing bioinformatics tools, Bridge Informatics can help you at every step of your research journey.
As experts across data types from cutting-edge sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing, and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis. Click here to schedule a free introductory call with a member of our team.
Haider M. Hassan, Data Scientist, Bridge Informatics
Haider is one of our premier data scientists. He provides bioinformatic services to clients, including high-throughput sequencing, data pre-processing, analysis, and custom pipeline development. Drawing on his rich experience with a variety of high-throughput sequencing technologies, Haider analyzes transcriptional (spatial and single-cell), epigenetic, and genetic landscapes.
Before joining Bridge Informatics, Haider was a Postdoctoral Associate at the London Regional Cancer Centre in Ontario, Canada. During his postdoc, he investigated the epigenetics of late-onset liver cancer using murine and human models. Haider holds a Ph.D. in biochemistry from Western University, where he studied the molecular mechanisms behind oncogenesis. Haider still lives in Ontario and enjoys spending his spare time visiting local parks.