Advances in Microbial Genome Quality Prediction through Machine Learning

How do you determine which microbial species are present in an environmental sample? Several tools have been developed to help extract high-quality microbial genomes from metagenomic data, with varying strengths and weaknesses. Researchers have developed a new machine-learning-based tool, CheckM2, that significantly improves the accuracy and speed of predicting the quality of microbial genomes extracted from metagenomic data.

The Challenge of Assessing Microbial Genome Quality

The rapidly advancing field of metagenomics allows researchers to sequence all of the genomes found in a complex environmental sample. This is particularly useful for examining the composition of microbial communities, exploring the diversity and relative abundance of microbiota and their functions. However, precisely identifying individual species’ genomes from mixed metagenomic samples presents its own challenges.

In addition, accurately evaluating the quality of these genomes can be difficult because many lineages lack comprehensive reference databases, and novel lineages often lack adequate marker genes for quality assessment. Extracting high-quality genomes from complex microbiome samples is critical for conducting reliable downstream analyses and for increasing the number of available reference genomes for microbial species.

Introducing CheckM2: A Machine Learning Solution

In a report published in Nature Methods, researchers introduced CheckM2, a machine-learning-based approach that overcomes the limitations of previous quality assessment tools when applied to novel genomes. CheckM2 does not rely on lineage-specific marker sets, enabling the analysis of genomes from underrepresented or novel lineages. CheckM2 also surpasses existing methods by incorporating a broader range of genomic inputs, such as multi-copy genes, metabolic pathways, amino acid counts, and other features.
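To make the idea of "genomic inputs" more concrete, the sketch below turns a genome's predicted proteins and their annotations into a flat feature table of amino acid counts and per-annotation copy numbers. This is only an illustration of the kind of input described above, not CheckM2's actual feature pipeline: the function name `genome_features`, the feature naming scheme, and the annotation identifiers are all hypothetical.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def genome_features(proteins, annotations):
    """Build a simple feature dictionary for one genome (illustrative only).

    proteins    -- list of amino acid sequences (strings)
    annotations -- list of annotation IDs, one per annotated protein;
                   repeated IDs indicate multi-copy genes
    """
    # Total count of each amino acid across all proteins.
    aa_counts = Counter()
    for seq in proteins:
        aa_counts.update(seq.upper())

    features = {f"aa_{aa}": aa_counts.get(aa, 0) for aa in AMINO_ACIDS}

    # Copy number of each annotation, so multi-copy genes stay visible
    # instead of being collapsed to presence/absence.
    for annotation, copies in Counter(annotations).items():
        features[f"copies_{annotation}"] = copies

    features["n_proteins"] = len(proteins)
    return features

# Toy example with placeholder annotation IDs (assumed, not real data).
example = genome_features(["MKV", "MAA"], ["ANN_1", "ANN_1"])
print(example["aa_M"], example["copies_ANN_1"])
```

A table like this, built for many genomes, is the sort of input a general-purpose machine learning model could be trained on without needing a lineage-specific marker set.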

The research team rigorously trained the model on simulated genomes with known levels of incompleteness and contamination. When applied to novel metagenomic data and benchmarked against existing tools, CheckM2 demonstrated remarkable accuracy, particularly for medium and low-quality genomes, as well as for lineages with limited genomic representation.
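As a rough illustration of what training on simulated genomes with known incompleteness and contamination can look like, the sketch below subsamples the genes of a complete genome and mixes in genes from a second genome, returning the simulated genome together with its true labels. It is a simplified assumption, not the procedure used by the CheckM2 authors: the function name `simulate_genome`, the gene-level representation, and the definition of contamination as a fraction of the original gene count are all chosen for illustration.

```python
import random

def simulate_genome(genes, foreign_genes, completeness, contamination, seed=0):
    """Simulate an incomplete, contaminated genome (illustrative only).

    genes         -- gene IDs of a complete reference genome
    foreign_genes -- gene IDs from a different genome, used as contaminants
    completeness  -- fraction of the reference genes to keep (0 to 1)
    contamination -- foreign genes to add, as a fraction of the
                     reference gene count (assumed definition)
    """
    if not genes:
        raise ValueError("reference genome must contain at least one gene")
    if not 0.0 <= completeness <= 1.0:
        raise ValueError("completeness must be between 0 and 1")
    if contamination < 0.0:
        raise ValueError("contamination must be non-negative")

    rng = random.Random(seed)  # seeded for reproducible simulations
    n_keep = round(completeness * len(genes))
    n_foreign = round(contamination * len(genes))
    if n_foreign > len(foreign_genes):
        raise ValueError("not enough foreign genes for requested contamination")

    kept = rng.sample(genes, n_keep)             # simulate missing genes
    added = rng.sample(foreign_genes, n_foreign)  # simulate contaminants

    # Labels reflect what was actually sampled, after rounding.
    labels = {
        "completeness": n_keep / len(genes),
        "contamination": n_foreign / len(genes),
    }
    return kept + added, labels

reference = [f"g{i}" for i in range(100)]
contaminant = [f"f{i}" for i in range(50)]
simulated, truth = simulate_genome(reference, contaminant, 0.7, 0.1)
print(len(simulated), truth)
```

Because the true completeness and contamination are known for every simulated genome, a model's predictions can be scored directly against them, which is not possible with real metagenome-assembled genomes.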

Going Beyond Genome Prediction

The significance of CheckM2 goes beyond genome quality prediction. It enhances existing databases, refines biological interpretations, and has the potential to identify genomes previously overlooked or misjudged by other tools. Additionally, CheckM2’s increased speed relative to existing tools positions it as an ideal candidate for analyzing large metagenomic datasets, further facilitating research in the field of microbial genomics.

The research team’s future plans involve refining CheckM2 by incorporating an increasing number of high-quality genomes, exploring alternative annotation systems, and enhancing the detection of contamination from diverse taxonomic sources. Ultimately, they aim for CheckM2 to become a versatile and powerful tool for predicting genome quality across bacterial and archaeal genomes, amplifying our understanding of microbial diversity and function in various environments.

Outsourcing Bioinformatics Analysis: How Bridge Informatics Can Help

The generation, storage, and analysis of biological data are faster and more accessible than ever before. From pipeline development and software engineering to deploying your existing bioinformatic tools, Bridge Informatics can help you at every step of your research journey.

As experts across data types from leading sequencing platforms, we can help you tackle the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis. Click here to schedule a free introductory call with a member of our team.

Lauren Dembeck, Ph.D., Geneticist & Science Writer, Bridge Informatics

Lauren Dembeck, Ph.D., is an experienced science and medical writer. During her doctoral research at North Carolina State University, she conducted genome-wide association studies to identify genetic variants contributing to natural variation in complex traits and used a combination of classical and molecular genetics approaches in validation studies. Lauren was a postdoctoral fellow at the Okinawa Institute of Science and Technology in Japan. During her postdoc, she used fluorescence-activated cell sorting paired with high-throughput sequencing approaches to study the formation and regulation of neuronal circuits. 

She is part of our team of expert content writers at Bridge Informatics, bringing our readers and customers everything they need to know at the cutting edge of bioinformatics research. If you’re interested in reaching out, please email [email protected] or [email protected].
