Unveiling the Molecular Code: Antibody Sequence Mining and Target Affinity Analysis

Overview

The challenge in AI Drug Discovery is often not the algorithm itself but the quality and structure of the training data. This case study describes how we partnered with a leading APAC-based AI/ML client to provide Data Curation Services, extracting clone, sequence, and target affinity information from patents. This process, centered on Antibody Sequence Mining, delivered a structured dataset of Therapeutic Monoclonal Antibodies (mAbs) that the client used to train its AI models. The dataset helped the client accelerate its work in precision medicine and target identification, particularly in immune oncology. The engagement shows how curated, well-structured data can help AI models make fuller use of what is already known about antibodies.

Our client

The client is a cutting-edge AI Drug Discovery company, committed to revolutionizing the drug discovery journey. Leveraging their proprietary workflow AI platform, they generate critical insights from customized target identification to lead generation, enabling the development of commercially valuable drugs from in-house and partnership projects. They are based in the APAC region and operate within the AI/ML Industry.

Client’s challenge

The client needed a comprehensive, high-quality training set focused on Therapeutic Monoclonal Antibodies (mAbs) and their associated Structure-Activity Relationship (SAR) data. This dataset was essential for improving their AI/ML algorithm's ability to identify new targets in the highly competitive field of immune oncology. No such specialized dataset was available, which limited target identification, compromised AI/ML performance, and led to missed opportunities and delayed innovation. (Read more about data preparation for predictive modeling in our whitepaper on selecting and preparing data for AI/ML predictive modeling.)

Client’s goals

The primary goal was to obtain a robust, comprehensive, and meticulously structured dataset of therapeutic monoclonal antibodies and their target binding data. This dataset was intended to:

  • Enhance their proprietary AI/ML algorithm’s capability.
  • Enable the precise identification of new targets within the field of immune oncology.
  • Accelerate the overall drug discovery process towards developing novel, personalized treatments.

Our Approach

The client partnered with us because of our track record in data curation and in turning large volumes of data into actionable insights. Our approach followed six steps:

Defining the project scope

Clearly outlining specific objectives and deliverables to ensure complete alignment with the client’s unique requirements.

Identifying data sources

Conducting thorough research to pinpoint relevant and reliable data sources, primarily patents, containing the necessary information on Therapeutic Monoclonal Antibodies (mAbs) and their binding targets.

Creation of a data extraction template

Developing a structured template that included all mandatory data variables to guarantee consistency and completeness during the curation process.
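
As an illustration only, a template of this kind could be sketched as a typed record with a check for mandatory variables. The field names below (patent_id, clone_name, vh_sequence, and so on) are assumptions chosen to mirror the data sections described in this case study; they are not the actual template used in the project.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AntibodyRecord:
    """One curated row. All field names here are hypothetical."""
    patent_id: str
    target: str
    clone_name: str
    vh_sequence: str
    vl_sequence: str
    affinity_value: Optional[float] = None
    affinity_unit: Optional[str] = None
    assay_method: Optional[str] = None


# Variables assumed to be mandatory for this sketch.
MANDATORY = ("patent_id", "target", "clone_name", "vh_sequence", "vl_sequence")


def missing_mandatory(record: AntibodyRecord) -> list:
    """Return the names of mandatory fields that are empty or None."""
    return [name for name in MANDATORY if not getattr(record, name)]
```

A curator or a script could call missing_mandatory on each row before delivery and send any row with a non-empty result back for review.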

Data variable identification

Pinpointing the data variables essential to the client’s AI/ML model, so that the curated dataset matched the intended outcomes. (Explore our Data Curation services.)

Protocol documentation

Creating detailed documentation to maintain high-quality standards throughout the data curation process.

Data delivery

Delivering the curated data in Excel format as each target was completed, so that it could be integrated easily with the client’s existing systems.

Our Solution

Our solution was centered on building a proprietary repository of Therapeutic Monoclonal Antibodies (mAbs) against their binding targets, meticulously curated from patent data. This repository contained key data sections vital for advanced Antibody Sequence Mining and Target Affinity Analysis:

Clone details

Information on specific clones, including specificity, orientation, and binding targets.

Sequence details

Amino acid sequences of the monoclonal antibodies, with a focus on the variable heavy (VH) and variable light (VL) regions. These regions are critical for understanding molecular characteristics and functional properties.
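
Sequences copied out of patent text often carry line breaks, spacing, or stray characters. The sketch below shows one possible sanity check for a VH or VL sequence, assuming sequences are stored as one-letter amino acid strings. The function names are hypothetical and do not describe the project's actual quality-control protocol.

```python
# The 20 standard amino acids in one-letter code.
STANDARD_AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")


def normalize_sequence(raw: str) -> str:
    """Remove all whitespace and convert to upper case."""
    return "".join(raw.split()).upper()


def invalid_residues(seq: str) -> list:
    """Return the sorted characters in seq that are not standard amino acids."""
    return sorted(set(seq) - STANDARD_AMINO_ACIDS)
```

For example, invalid_residues(normalize_sequence(text)) returns an empty list for a clean sequence, and any other result flags the row for manual checking against the patent.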

Binding affinity information

Quantitative results and methodologies related to the strength and specificity of the antibody-target interactions.
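
Affinity values in patents are reported in different molar units, so a training set usually needs them on one scale. The following is a minimal sketch of such a conversion, assuming values are dissociation constants expressed in M, mM, uM, nM, or pM. The function name and the choice of nanomolar as the common unit are assumptions made for illustration.

```python
# Multipliers that convert a value in the given unit to nanomolar (nM).
TO_NANOMOLAR = {"M": 1e9, "mM": 1e6, "uM": 1e3, "nM": 1.0, "pM": 1e-3}


def to_nanomolar(value: float, unit: str) -> float:
    """Convert a concentration to nM. Unit matching is case-sensitive."""
    # Accept both the micro sign and the Greek letter mu as "u".
    key = unit.strip().replace("\u00b5", "u").replace("\u03bc", "u")
    if key not in TO_NANOMOLAR:
        raise ValueError(f"Unsupported unit: {unit!r}")
    return value * TO_NANOMOLAR[key]
```

Matching is kept case-sensitive on purpose, because "M" and "mM" differ only by the prefix. Unknown units raise an error rather than being guessed.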

Stability parameters

Insights into thermostability and pharmacokinetics, essential for drug development considerations.

We curated 538 patents, yielding 24,222 data rows and 9,705 antibody sequences. The targets included key immune oncology markers (PD1, PDL1, CTLA4, TIGIT, CD3, CD52) as well as PCSK9, TNF-alpha, BCMA, BAFF, VEGFA, CD202, EGFR, HER2, and C5.

Figure: Key statistics from the Therapeutic Monoclonal Antibodies dataset curation (9,705 antibody sequences and 24,222 data rows).

Figure: Antibody Sequence Mining and Target Affinity Analysis process diagram for AI Drug Discovery.

Conclusion

This project’s success was driven by our systematic approach and deep expertise in Data Curation Services. Extracting antibody sequences and target affinity data from patents under a well-defined scope and documented protocol supported a high level of accuracy and completeness. The resulting dataset of Therapeutic Monoclonal Antibodies (mAbs) provides insight into amino acid sequences and binding affinities and feeds directly into the client’s AI Drug Discovery platform. High-quality, structured data of this kind is crucial for the advancement of precision and personalized medicine. (To see another example of how structured data enables AI/ML, view our case study: Structured and analysis-ready data for AI/ML-based drug discovery.)