January 6, 2022
Machine Learning for Modern Biology
Given the opposite natures of computer science and in vivo biological systems, it’s not surprising that challenges arise when applying machine learning to complex biological problems.
However, machine learning is becoming an essential tool in modern biology, particularly for analyzing genomic and proteomic data and making predictions, and thus any associated challenges must be addressed.
A recent article in Nature Reviews Genetics by Whalen, Schreiber, Nobel, and Pollard highlights the main pitfalls in applying machine learning to genomics, and what causes them.
Challenge 1: Varying Distributions
Most genomic datasets contain some distributional differences. These can arise from anything from sampling different cell types, to the ascertainment bias that comes from studying only one type of ancestry, all the way to variation in how the samples were collected.
Unfortunately, many machine learning models assume that all examples are drawn from the same distribution. When that assumption fails, especially when the training and test sets are distributed differently from the data the model will later make predictions on, the model’s efficacy suffers severely.
It is thus extremely important to identify any distributional differences in the data before training a machine learning model and to account for them, although how best to do so is still an area of ongoing research.
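As a rough illustration of checking for a distributional difference before training, the sketch below computes a two-sample Kolmogorov-Smirnov statistic, which is the largest gap between the empirical distributions of two samples. This is one possible check chosen for illustration, not a method taken from the review, and the function name and the expression values are hypothetical.

```python
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov statistic.

    Returns the largest gap between the empirical CDFs of the two samples:
    0.0 means the samples look identical, 1.0 means they do not overlap.
    """
    a = sorted(sample_a)
    b = sorted(sample_b)
    if not a or not b:
        raise ValueError("both samples must be non-empty")
    max_gap = 0.0
    # The gap between two step functions can only change at observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)
        cdf_b = bisect_right(b, x) / len(b)
        max_gap = max(max_gap, abs(cdf_a - cdf_b))
    return max_gap


# Hypothetical expression values for one gene in two cohorts (made-up numbers).
training_cohort = [2.1, 2.4, 2.2, 2.8, 2.5, 2.3]
prediction_cohort = [3.4, 3.1, 3.6, 2.9, 3.3, 3.8]

shift = ks_statistic(training_cohort, prediction_cohort)
print(f"KS statistic: {shift:.2f}")
```

A value close to 1 for a feature suggests that the cohort the model will be applied to looks quite different from the training cohort for that feature, which is worth investigating before trusting any predictions.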
Challenge 2: Dependent Examples
Common machine learning models assume in their underlying mathematics that the examples used are completely independent of one another, meaning that no example is affected by or dependent on another.
Dependence is central to biology, however, from gene and regulatory element interactions to protein-protein interactions, and identifying that dependence is not as straightforward as it seems.
To address this pitfall, the review authors suggest using a tool to visualize the interactions within your data to identify any dependencies, and checking whether the math underlying your machine learning model assumes independence.
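Once a dependency has been identified, one common way to keep it from inflating test performance is to split the data by group, so that related examples never land on both sides of the train/test split. The sketch below is an assumption added for illustration rather than a procedure from the review. The function name, the choice of chromosome as the grouping key, and the element names are all hypothetical.

```python
import random
from collections import defaultdict


def group_split(examples, group_of, test_fraction=0.25, seed=0):
    """Split examples so that no group appears in both train and test.

    test_fraction is applied to the number of groups, not the number
    of examples, so the resulting set sizes are only approximate.
    """
    groups = defaultdict(list)
    for example in examples:
        groups[group_of(example)].append(example)
    if len(groups) < 2:
        raise ValueError("need at least two groups to split")
    group_ids = sorted(groups)
    random.Random(seed).shuffle(group_ids)
    n_test = max(1, round(len(group_ids) * test_fraction))
    test_ids = set(group_ids[:n_test])
    train = [e for g in group_ids if g not in test_ids for e in groups[g]]
    test = [e for g in group_ids if g in test_ids for e in groups[g]]
    return train, test


# Hypothetical regulatory elements tagged with the chromosome they sit on.
elements = [
    ("enh1", "chr1"), ("enh2", "chr1"), ("enh3", "chr2"), ("enh4", "chr2"),
    ("enh5", "chr3"), ("enh6", "chr3"), ("enh7", "chr4"), ("enh8", "chr4"),
]
train, test = group_split(elements, group_of=lambda e: e[1])
print("train:", train)
print("test:", test)
```

The grouping key should reflect whatever dependency was found in the data, for example the same gene, the same patient, or the same protein family.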
Challenge 3: Confounding Variables
When an unknown confounding variable is present in a dataset, it can create artificial associations with an outcome or mask associations that are really there. Because the ‘confounder’ is not known or measured, the model can perform poorly on real-world data that lacks that confounding variable, and that failure is often what leads to the confounder being identified.
Solutions to this challenge include randomizing the experimental batches and, once a confounder has been identified, including it in the machine learning model to reduce its effect.
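To make the batch randomization idea concrete, here is a minimal sketch that shuffles samples before assigning them to batches, so that processing order is not tied to the condition being studied. The function name and the sample identifiers are hypothetical. It uses simple randomization, which spreads conditions across batches on average but does not guarantee a perfectly even mix in every batch.

```python
import random
from collections import Counter


def assign_batches(sample_ids, n_batches, seed=0):
    """Randomly assign samples to batches of near-equal size."""
    shuffled = list(sample_ids)
    random.Random(seed).shuffle(shuffled)
    # Deal the shuffled samples out to the batches in turn.
    return {sample: i % n_batches for i, sample in enumerate(shuffled)}


# Hypothetical study: six cases and six controls, processed in three batches.
samples = [f"case_{i}" for i in range(6)] + [f"control_{i}" for i in range(6)]
batches = assign_batches(samples, n_batches=3)

for batch_id in range(3):
    members = sorted(s for s, b in batches.items() if b == batch_id)
    print(batch_id, members)

batch_sizes = Counter(batches.values())
```

Without the shuffle, all cases would be processed first and all controls last, so any batch effect would be indistinguishable from the case/control difference.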
Challenge 4: Information Leakage
In genomics, a pervasive problem is the inadvertent leakage of information from the test set into the training set during data pre-processing, which limits the usefulness of the test set.
Many times, transforming genomic data requires some sort of standardization step. If the parameters for that step are computed using the test set as well as the training set, dependence is introduced between the samples.
Fortunately, there are machine learning tools that avoid this by learning the transformation parameters on the training set alone and then applying those same parameters to the training and test sets separately.
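The idea of learning the transformation on the training set only can be sketched in a few lines. This is a minimal illustration using the Python standard library, not the specific tooling the review refers to. The function names and the numbers are hypothetical.

```python
from statistics import mean, stdev


def fit_standardizer(train_values):
    """Learn the mean and sample standard deviation from the training set only."""
    mu = mean(train_values)
    sigma = stdev(train_values)
    if sigma == 0:
        raise ValueError("training values have zero variance")
    return mu, sigma


def standardize(values, mu, sigma):
    """Apply previously learned parameters; nothing is re-estimated here."""
    return [(v - mu) / sigma for v in values]


# Hypothetical read counts for one feature (made-up numbers).
train_counts = [10.0, 12.0, 14.0]
test_counts = [16.0, 8.0]

mu, sigma = fit_standardizer(train_counts)        # the test set is never seen here
train_scaled = standardize(train_counts, mu, sigma)
test_scaled = standardize(test_counts, mu, sigma)  # reuses the training parameters
print(train_scaled, test_scaled)
```

The leaky version would call fit_standardizer on the training and test values combined, which lets information about the test set influence how the training data is transformed.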
Challenge 5: Imbalance
A machine learning task is considered unbalanced if the examples are not evenly distributed across values of the outcome, and many genomic problems are extremely imbalanced in which classes of examples are available.
For example, any rare type of outcome, like the identification of a rare disease variant or a new enhancer, occurs infrequently in the data, so it may be predicted to belong to a more common, majority class of outcome, like a transcription factor.
To fix this, the review authors highlight strategies that emphasize the importance of this ‘minority class’ to the model, whatever it may be, to help the model learn to identify it even when it is far less common in an extremely unbalanced dataset.
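One simple way to emphasize a minority class is to weight each class inversely to how often it appears, so that mistakes on rare examples cost more during training. The sketch below shows that weighting as one possible strategy, not necessarily the one the review authors recommend. The function name and the label counts are hypothetical. The formula is the same heuristic that scikit-learn uses for its "balanced" class weights.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {
        label: n_samples / (n_classes * count)
        for label, count in counts.items()
    }


# Hypothetical dataset: 95 background regions and 5 enhancers.
labels = ["background"] * 95 + ["enhancer"] * 5
weights = balanced_class_weights(labels)
print(weights)
```

With these weights, each class contributes the same total weight to the training loss, even though one class has 19 times as many examples as the other.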
Outsourcing Bioinformatic Tasks
Working with service providers like Bridge Informatics for your data storage, analysis, and pipeline development needs helps eliminate these kinds of common challenges by having experts help you at every step of your analysis.
Jane Cook, Journalist & Content Writer, Bridge Informatics
Jane is a Content Writer at Bridge Informatics, a professional services firm that helps biotech customers implement advanced techniques in the management and analysis of genomic data. Bridge Informatics focuses on data mining, machine learning, and various bioinformatic techniques to discover biomarkers and companion diagnostics. If you’re interested in reaching out, please email [email protected] or [email protected].