Data Distillation for extrapolative protein design through exact preference optimization

Authors: Mostafa Karimi, Sharmi Banerjee, Tommi Jaakkola, Bella Dubrov, Shang Shang, Ron Benson

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluated our model's performance in designing AAV and GFP proteins and demonstrated that the proposed framework significantly improves effectiveness in extrapolation tasks. Our benchmark shows that our approach can drastically improve performance over prior methods (Section 5). Through ablation studies, we show the importance of training on hard triplewise rankings in comparison to other methods for preference dataset creation (Section 6.1).
Researcher Affiliation | Collaboration | Mostafa Karimi, Sharmi Banerjee (Amazon, EMAIL); Tommi Jaakkola (Massachusetts Institute of Technology, EMAIL); Bella Dubrov, Shang Shang, Ron Benson (Amazon, EMAIL)
Pseudocode | No | The paper describes the methodology using prose and mathematical equations, and includes a 'Schematic overview' in Figure 1, but does not contain explicitly labeled pseudocode or algorithm blocks for its own method. References to 'algorithms' in Section 6.3 refer to external, state-of-the-art preference learning algorithms being benchmarked against.
Open Source Code | No | The paper does not contain an unambiguous statement of code release, a direct link to a code repository, or mention of code in supplementary materials for the methodology described.
Open Datasets | Yes | We evaluate our method on the well-studied Green Fluorescent Proteins (GFP) by Sarkisyan et al. (2016) and Adeno-Associated Virus (AAV) by Bryant et al. (2021). We utilize the carefully created medium and hard difficulty splits provided by Kirjner et al. (2024).
Dataset Splits | Yes | We used the medium and hard difficulty splits of the datasets, where the mutational gaps are 6 and 7 mutations respectively. In total, we created 500K (50K) training (validation) samples, half of them based on "Don't go backward" and the other half based on "Don't get stuck at the same fitness". We chose the top 100K (10K) hardest triplets as training (validation) samples for offline preference learning.
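The "top hardest triplets" selection step can be sketched as a simple ranking over candidate triplets. As an illustrative assumption only (the paper's exact hardness criterion is not reproduced here), hardness is taken to be the fitness margin between the preferred and rejected sequence: the smaller the margin, the harder the triplet.

```python
def select_hardest(triplets, n):
    """Keep the n hardest triplets from a candidate pool.

    Each triplet is (anchor_seq, preferred_seq, rejected_seq,
    preferred_fitness, rejected_fitness). 'Hardness' is illustratively
    defined as the fitness margin preferred - rejected; smaller margins
    are harder to rank. This is a sketch of the selection step under
    that assumption, not the authors' implementation.
    """
    ranked = sorted(triplets, key=lambda t: t[3] - t[4])
    return ranked[:n]


# Example pool: the second triplet has the smaller margin, so it is harder.
pool = [
    ("MSK...", "MSA...", "MSG...", 1.0, 0.1),   # margin 0.9 (easy)
    ("MSK...", "MST...", "MSV...", 0.5, 0.4),   # margin 0.1 (hard)
]
hardest = select_hardest(pool, 1)
```

With the reported numbers, one would call `select_hardest(pool, 100_000)` on the training pool and `select_hardest(pool, 10_000)` on the validation pool.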
Hardware Specification | Yes | Table 7: Comparison of computational costs of generating triplets with a P3 (V100) GPU machine.
Software Dependencies | No | The paper mentions models like Prot-T5-XL and optimizers like AdamW, but does not provide specific version numbers for any software libraries or frameworks used in their implementation (e.g., Python, PyTorch, TensorFlow versions).
Experiment Setup | Yes | We trained the local editor model on Dpairs for 10 epochs with the AdamW optimizer (Loshchilov & Hutter, 2017), a learning rate of 1e-4, and a batch size of 384. We further fine-tuned the local editor model with triplet-based preference learning through the EXO loss function defined in Equation 3 for 1 epoch, with a batch size of 32, a learning rate of 5e-7, β = 0.1, and the AdamW optimizer (Loshchilov & Hutter, 2017). Inspired by Padmakumar et al. (2023), for each initial seed sequence we sample N (10 for AAV, 2 for GFP) sequences using a combination of top-k and top-p sampling with k = 10, p = 0.95, and a temperature of 0.7 (1.0) without (with) the scorer at inference.
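The combined top-k/top-p sampling described above can be sketched as follows. The hyperparameters mirror those reported (k = 10, p = 0.95, temperature 0.7 without the scorer, 1.0 with it); the function itself is an illustrative PyTorch sketch of standard top-k plus nucleus filtering, not the authors' code.

```python
import torch

def sample_top_k_top_p(logits, k=10, p=0.95, temperature=0.7):
    """Draw one token id using combined top-k and top-p sampling.

    logits: 1-D tensor of unnormalised scores over the vocabulary.
    Sketch only; hyperparameter defaults follow the reported setup.
    """
    logits = logits / temperature

    # Top-k: keep only the k highest-scoring tokens, mask the rest.
    topk_vals, topk_idx = torch.topk(logits, k)
    filtered = torch.full_like(logits, float("-inf"))
    filtered[topk_idx] = topk_vals

    # Top-p (nucleus): among the survivors, keep the smallest prefix of
    # the sorted distribution whose cumulative probability reaches p.
    probs = torch.softmax(filtered, dim=-1)
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside_nucleus = cumulative - sorted_probs >= p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs = sorted_probs / sorted_probs.sum()

    # Sample one token from the renormalised nucleus distribution.
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice].item()
```

For the reported procedure, this sampler would be invoked N times per seed sequence (N = 10 for AAV, N = 2 for GFP), with temperature 1.0 when the scorer is used at inference.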