Building a Database and AI/ML Pipeline for Publicly Available Multimodal Data

Situation

A large, global pharmaceutical company (“Client”) focused on oncology drug development recognized the potential of several publicly available data repositories to accelerate its R&D efforts.

Client sought to integrate and harmonize these data sets so that their diverse data types, which are usually challenging to analyze together, could be examined side by side. The data types included unstructured text from scientific publications, structured information from several genomic databases, and high-content imaging data from cell lines, model organisms, and even patient samples (where available).

Bridge Informatics (BI) was brought in to acquire, aggregate, and curate the data from these public repositories. In a second phase, BI designed and implemented a system to harness this multimodal data, with a particular focus on analyzing the genotype-backed image data.

Strategy

After establishing Client’s objectives, BI drafted and executed a statement of work that included the following:

  • Data Acquisition and Curation: We identified and accessed relevant publicly available data sources, including the Genotype-Tissue Expression (GTEx) repository, The Cancer Genome Atlas (TCGA), the Genomic Data Commons (GDC), NCBI databases such as Gene Expression Omnibus (GEO) and GenBank for genomic data, and image repositories such as the Broad Bioimage Benchmark Collection. We then curated this data by extracting key information from publications using a combination of manual review and natural language processing (NLP), standardizing genomic data formats, and organizing image data with appropriate metadata.
  • Database Design and Development: We designed a robust and scalable database architecture to accommodate the diverse data types and volumes. This involved selecting appropriate database technologies (e.g., relational databases for structured genomic data, NoSQL databases for unstructured text data, and specialized image repositories) and integrating them seamlessly.
  • AI/ML Pipeline Implementation: We developed a custom AI/ML pipeline tailored to Client’s drug development needs. This included NLP techniques to extract insights from scientific literature on drug mechanisms and clinical trials, analysis of genomic data to identify potential drug targets and biomarkers, and application of computer vision techniques to analyze image data for phenotypic characteristics related to drug response.
  • Model Training and Evaluation: We trained and evaluated various machine learning models, including deep learning models for image analysis, to predict drug efficacy and toxicity, identify patient subgroups most likely to benefit from specific therapies, and discover new drug targets and therapeutic strategies.
  • User Interface Development: We created a user-friendly interface for researchers and scientists to access, query, and visualize the integrated data, facilitating data exploration, analysis, and decision-making in the drug development process.
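To make the curation and integration steps above more concrete, the following is a minimal sketch of how records from two modalities might be linked through a shared sample identifier so that "genotype-backed" images can be selected. It is illustrative only and does not describe BI's actual implementation: the `SampleRecord` class, the field names (`sample`, `variant`, `path`), and the identifier normalization rule are all hypothetical assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SampleRecord:
    """One harmonized entry per sample (hypothetical schema)."""
    sample_id: str
    variants: list = field(default_factory=list)
    image_paths: list = field(default_factory=list)


def normalize_sample_id(raw_id: str) -> str:
    # Assumed normalization rule: trim whitespace, upper-case,
    # and use a single separator style. Real repositories each
    # have their own identifier conventions.
    return raw_id.strip().upper().replace("_", "-")


def link_modalities(genomic_rows, image_rows):
    """Merge genomic and image metadata rows into per-sample records."""
    records = {}
    for row in genomic_rows:
        sid = normalize_sample_id(row["sample"])
        records.setdefault(sid, SampleRecord(sid)).variants.append(row["variant"])
    for row in image_rows:
        sid = normalize_sample_id(row["sample"])
        records.setdefault(sid, SampleRecord(sid)).image_paths.append(row["path"])
    return records


def genotype_backed(records):
    """Keep only samples that have both genomic and image data."""
    return [r for r in records.values() if r.variants and r.image_paths]


if __name__ == "__main__":
    # Made-up example rows, not real repository data.
    genomic = [
        {"sample": "s_001", "variant": "TP53 R175H"},
        {"sample": "S-002", "variant": "KRAS G12D"},
    ]
    images = [
        {"sample": " S-001 ", "path": "img/a.tif"},
        {"sample": "s-003", "path": "img/b.tif"},
    ]
    for record in genotype_backed(link_modalities(genomic, images)):
        print(record.sample_id, record.variants, record.image_paths)
```

In practice the merged records would be persisted in the database layer rather than held in memory, and identifier matching across repositories typically needs repository-specific mapping tables rather than a single string rule.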

Results

By partnering with BI, Client successfully leveraged the power of publicly available data to drive innovation in oncology drug development.

  • Client gained a custom-built, centralized, and organized repository for diverse publicly available data, enabling efficient access and analysis.
  • The integrated database and AI/ML pipeline significantly accelerated research and development efforts by providing powerful exploratory tools to analyze and interpret complex multimodal data, leading to faster identification of promising drug candidates and more efficient clinical trial design.
  • The AI/ML models generated valuable insights into disease mechanisms, potential drug targets, and patient stratification strategies, leading to the discovery of novel therapeutic approaches.
  • The user-friendly interface facilitated collaboration among researchers, scientists, and clinicians, promoting a data-driven approach to drug development and ultimately leading to more effective and targeted therapies.

This case study demonstrates the value of integrating multimodal data and applying AI/ML techniques to accelerate the development of new and improved cancer treatments.

Interested in partnering with Bridge Informatics? Contact us to learn more about our team of bioinformaticians with experience at the bench whose core specialty is understanding and analyzing biological data.


Jessica Corrado, Head of Business Development & Commercial Operations, Bridge Informatics

As the Head of Business Development & Commercial Operations, Jessica is responsible for driving strategic growth initiatives and overseeing the company’s commercial activities. She has both a keen understanding of the life sciences industry and a strong track record in building successful partnerships.

Prior to joining Bridge, Jessica held a number of leadership roles across sales, marketing, and communications. Outside of work, Jessica is responsible for the majority of marketing and event planning for Shore Saves, a non-profit animal rescue. She enjoys reading and usually has at least two books of different genres in progress at a time. If you’re interested in reaching out, please email [email protected].
