Introduction
Machine learning is a branch of artificial intelligence focused on developing algorithms and statistical models that enable computers to make predictions or decisions based on input data. It encompasses various techniques, such as supervised learning, unsupervised learning, and reinforcement learning, that allow a model to capture quantitative relationships between input and output data. Two classes of neural networks are widely used today: convolutional neural networks (CNNs) and fully connected artificial neural networks (ANNs). CNNs are used mainly for vision AI, such as the analysis of histological images. ANNs, by contrast, are often employed in tasks such as sentiment analysis and language translation to analyze textual data with categorical values. ANNs mimic the behavior of neurons to process information and can be trained to make precise predictions across a wide range of applications.
ANNs have gained prominence in machine learning due to their ability to learn from data and make predictions or decisions without being explicitly programmed. The architecture of an ANN typically includes three kinds of layers: an input layer, which receives data inputs such as genetic markers or sequences; hidden layers, where data undergoes transformations through weighted connections between neurons; and an output layer, which produces predictions or classifications based on the processed data. Each neuron in the hidden layers applies an activation function to the weighted sum of its inputs, enabling complex and, more importantly, non-linear mappings between the input and output data. During training, ANNs optimize internal parameters, such as the weights and biases of the nodes, to minimize the difference between predicted outputs and ground truth labels. This optimization is achieved through backpropagation combined with stochastic gradient descent (SGD): gradients of the loss function are computed with respect to the model parameters, and the parameters are then iteratively updated to reduce the loss. ANNs offer a powerful framework for life science and other machine learning applications by leveraging their ability to learn complex patterns from data to improve accuracy and efficiency in decision making.
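The training loop described above can be sketched in a few lines of NumPy. This is a minimal, illustrative example with one hidden layer, a ReLU activation, and mean-squared-error loss on toy data; it is not the architecture used in the paper, and all sizes and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data: 4 samples, 3 input features, 1 continuous output (hypothetical).
X = rng.normal(size=(4, 3))
y = rng.normal(size=(4, 1))

# Parameters: weights and biases for the hidden and output layers.
W1, b1 = rng.normal(size=(3, 5)) * 0.1, np.zeros(5)
W2, b2 = rng.normal(size=(5, 1)) * 0.1, np.zeros(1)

def forward(X):
    h = np.maximum(0, X @ W1 + b1)   # ReLU activation on the weighted sums
    return h, h @ W2 + b2            # linear output layer

lr = 0.1
losses = []
for step in range(200):
    h, pred = forward(X)
    losses.append(np.mean((pred - y) ** 2))
    # Backpropagation: gradients of the loss w.r.t. each parameter.
    g_pred = 2 * (pred - y) / len(X)
    g_W2, g_b2 = h.T @ g_pred, g_pred.sum(0)
    g_h = (g_pred @ W2.T) * (h > 0)  # ReLU gradient mask
    g_W1, g_b1 = X.T @ g_h, g_h.sum(0)
    # Gradient-descent update: step each parameter against its gradient.
    W2 -= lr * g_W2; b2 -= lr * g_b2
    W1 -= lr * g_W1; b1 -= lr * g_b1

print(f"loss: {losses[0]:.3f} -> {losses[-1]:.3f}")
```

Even on random toy data, the loss shrinks over iterations, which is the essence of how an ANN fits its weights and biases to training examples.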
In recent years, neural networks have been applied toward understanding the regulatory mechanisms governing RNA splicing. RNA splicing is a fundamental process in gene expression that involves the removal of introns and the joining of exons, sometimes in alternative combinations, to produce mature messenger RNA (mRNA). Although our current understanding of RNA splicing has elucidated certain sequence motifs that are crucial for splicing, the precise role of exon sequence in dictating inclusion or skipping remains elusive. Furthermore, despite advancements in the use of neural networks to predict RNA splicing outcomes, the intricate sensitivity of splicing logic to single-nucleotide changes within exons remains unclear.
An additional challenge to the use of neural networks for predicting RNA splicing logic is interpretability. To facilitate scientific advancement, machine learning models must not only provide accurate predictions but also offer insights into their decision-making processes. In a recent publication in PNAS, Liao et al. (2023) showcase an “interpretable-by-design” machine learning model that maintains predictive accuracy while preserving interpretability. This model not only sheds light on the underlying decision-making logic governing splicing outcomes but also uncovers previously unrecognized splicing features, thereby advancing our understanding of RNA splicing regulation in mammalian systems.
Construction of an “Interpretable-by-design” Machine Learning Model to Accurately Predict RNA Splicing
A key aspect of training an ANN is the preparation of input data. To train their model, Liao et al. (2023) generated a large-scale synthetic splicing dataset consisting of hundreds of thousands of “input-output” data points derived from massively parallel reporter assays in the HeLa cell line. Each reporter in the assay comprised a three-exon design, with the middle exon consisting of a random 70-nucleotide sequence subject to alternative splicing. After high-throughput sequencing, each data point in the synthetic training dataset consisted of a random exon sequence paired with a measured percent spliced in (PSI) output. PSI values for each reporter were computed by dividing the number of inclusion reads by the total reads encompassing both inclusion and skipping events.
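The PSI calculation described above is simple to express in code. The sketch below is illustrative; the function name and read counts are hypothetical, not taken from the paper's pipeline.

```python
def percent_spliced_in(inclusion_reads: int, skipping_reads: int) -> float:
    """PSI = inclusion reads / (inclusion + skipping reads), as a percentage."""
    total = inclusion_reads + skipping_reads
    if total == 0:
        raise ValueError("no reads span this splicing event")
    return 100.0 * inclusion_reads / total

# Example: 80 reads support inclusion, 20 support skipping -> PSI = 80.0
print(percent_spliced_in(80, 20))
```

A PSI of 100 means the random exon is always included in the mature transcript, while a PSI of 0 means it is always skipped.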
Synthetic datasets circumvent the limitations posed by genomic data by offering a significant increase in data points, simplifying interpretability by fixing the variable regions, and eliminating overlapping RNA codes. Interestingly, a comparison of off-the-shelf machine learning algorithms for predicting alternative splicing from the synthetic dataset indicated that complex neural networks capture features affecting splicing outcomes better than k-mer scoring algorithms. However, due to the “black box” nature of these ANNs, it was impossible to identify the features that could be fine-tuned to improve predictability.
The interpretable-by-design ANN accurately predicted outcomes not only on the synthetic dataset but also on other diverse splicing datasets, suggesting that the model captures critical aspects of splicing regulatory logic. For instance, the ANN incorporates short six-nucleotide sequence filters to capture known motifs and a one-dimensional convolutional filter to capture RNA secondary structure in predicting splicing outcomes. Importantly, the interpretable-by-design model successfully identified previously uncharacterized splicing features, notably two long skipping filters with substantial influence on splicing predictions. These filters were robustly identified across various initialization seeds and training/testing splits, indicating their significance. One filter was found to identify stem-loop structures with short, GC-rich, double-stranded regions, contributing to exon skipping, and was experimentally validated through mutational analysis. Conversely, the other filter exhibited a preference for long guanine-depleted sequences, with mutational disruption leading to increased exon inclusion. This discovery of previously unrecognized splicing features highlights the model’s capacity to uncover novel biological insights beyond known splicing motifs.
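To illustrate what a short sequence filter does mechanically, the sketch below one-hot encodes an RNA sequence and slides a six-nucleotide position-weight filter along it, scoring every window (a 1D convolution). The filter values here are a toy GC-rich detector chosen for illustration, not the learned filters from the paper.

```python
import numpy as np

BASES = "ACGU"

def one_hot(seq: str) -> np.ndarray:
    """Encode an RNA sequence as a (length x 4) one-hot matrix."""
    return np.array([[float(b == base) for base in BASES] for b in seq])

def scan(seq: str, filt: np.ndarray) -> np.ndarray:
    """Dot the filter against every window of its length (valid 1D convolution)."""
    x = one_hot(seq)
    k = filt.shape[0]
    return np.array([(x[i:i + k] * filt).sum() for i in range(len(seq) - k + 1)])

# A toy 6-nt filter that simply rewards G and C bases in the window.
filt = np.zeros((6, 4))
filt[:, BASES.index("G")] = 1.0
filt[:, BASES.index("C")] = 1.0

scores = scan("AAGCGCGCAA", filt)
print(scores)  # the GC-richest window scores highest
```

In a real model such as the one described above, many filters like this are learned from the data, and their window scores are aggregated by downstream layers into a final PSI prediction; because each filter is a small position-weight matrix, it can be inspected directly, which is what makes the design interpretable.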
Use of “Interpretable-by-design” Machine Learning Model in Personalized Medicine
The interpretable-by-design machine learning model represents a significant advancement in accurately predicting alternative RNA splicing events, thereby offering promising avenues for drug target identification, therapeutic development, and personalized medicine. By leveraging extensive datasets and advanced computational techniques, the interpretable-by-design model identified novel splicing signatures with the potential to serve as viable targets for drug development and therapeutic intervention. This insight holds the promise of reshaping the pharmaceutical landscape, enabling the creation of precision medications customized to the distinct splicing profiles of patients for the treatment of complex diseases, such as cancer.
Outsourcing Bioinformatics Analysis: How Bridge Informatics Can Help
BI’s data scientists prioritize studying, understanding, and reporting on the latest pipeline developments. We do this so we can develop tools for our clients and advise them confidently. The generation, storage, and analysis of biological data are faster and more accessible than ever before. From pipeline development and software engineering to deploying your existing bioinformatic tools, Bridge Informatics can help you on every step of your research journey.
As experts across data types from leading sequencing platforms, we can help you circumvent the challenging computational tasks of storing, analyzing and interpreting genomic and transcriptomic data. Bridge Informatics’ bioinformaticians are trained bench biologists, so they understand the biological questions driving your computational analysis. Click here to schedule a free introductory call with a member of our team.
Dan Ryder, MPH, PhD
Dan is the founder and CEO of Bridge Informatics, a professional services firm helping pharmaceutical companies translate genomic data into medicine. Unlike any other data analytics firm, Bridge forges sustainable communication channels between its clients’ biological and computational scientists. Dan is particularly passionate about improving communication between people of different scientific backgrounds, enabling bioinformaticians and software engineers to collectively succeed.
Prior to forming Bridge Informatics, Dan served in a variety of roles helping pharmaceutical clients solve early-phase drug discovery and development challenges.
Dan received both a Ph.D. in Biochemistry and Molecular Biology and an MPH in Disease Control from the University of Texas Health Science Center at Houston (UTHealth Houston). He completed his postdoctoral studies in Molecular Pathways of Energy Metabolism at the University of Florida College of Medicine. Dan received his undergraduate degree in Microbiology from the University of Texas at Austin.