Multi-Attribute Constraint Satisfaction via Language Model Rewriting
Authors: Ashutosh Baheti, Debanjana Chakraborty, Faeze Brahman, Ronan Le Bras, Ximing Lu, Nouha Dziri, Yejin Choi, Mark Riedl, Maarten Sap
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our approach, we present a new Fine-grained Constraint Satisfaction (FineCS) benchmark, featuring two challenging tasks: (1) Text Style Transfer, where the goal is to simultaneously modify the sentiment and complexity of reviews, and (2) Protein Design, focusing on modulating fluorescence and stability of Green Fluorescent Proteins (GFP). Our empirical results show that MACS achieves the highest threshold satisfaction in both FineCS tasks, outperforming strong domain-specific baselines. |
| Researcher Affiliation | Academia | Georgia Institute of Technology, University of Washington, The Ohio State University, Carnegie Mellon University, Allen Institute for Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: Multi-Attribute Constraint Satisfaction Training pseudo code |
| Open Source Code | Yes | We release the code at https://github.com/abaheti95/MACS. |
| Open Datasets | Yes | We train the Sentiment regressor on Yelp reviews (Zhang et al., 2015). The original data contained 650K train and 50K test reviews divided evenly across five labels (1 very negative, 2 negative, 3 neutral, 4 positive, and 5 very positive). To obtain the Complexity regressor we train a ranking model on top of the SWiPE Wikipedia simplification dataset (Laban et al., 2023). We obtain the dataset of 51.7K mutants of the GFP wild-type (i.e., the protein sequence that occurs in nature) (Sarkisyan et al., 2016; Gonzalez Somermeyer et al., 2022). |
| Dataset Splits | Yes | We divide the dataset into 50% train, 15% validation, and 35% test set for both fluorescence and ddG attributes. After filtering, we obtain 464K train reviews and 36K test reviews. We randomly sample 36K reviews from the train set for validation and train a RoBERTa-large (Liu et al., 2020) regressor on the remaining instances for 4 epochs using mean squared error loss. After filtering these instances, we are left with 79K train, 1K validation, and 1.8K test simple-to-complex pairs. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory specifications) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions several models and tools such as RoBERTa-large, Llama2-7B, Llama3-8B, TinyLlama, the ProtGPT2 LM, ESM2-based regressors, sentence-transformers, and the FoldX software. However, it does not provide specific version numbers for the underlying software frameworks (e.g., PyTorch, TensorFlow) or the exact version of the FoldX software itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We train both control-token and text-prompted models with supervised fine-tuning (SFT) for 200K steps and a batch size of 16. For the weighted behavior cloning (wBC) objective, we continue training the supervised fine-tuned models for an additional 50% of the steps (100K steps). We finetune the ProtGPT2 LM (Ferruz et al., 2022), a 738M-parameter protein language model, as the rewriter for this task... we train ProtGPT2 with SFT for 20K steps with batch size 16 and learning rate 10^-4. We then further continue finetuning with the wBC objective for an additional 10K steps and learning rate 10^-5. For the Protein Language Model editor models trained with our MACS framework, we use nucleus sampling (Holtzman et al., 2019) with (top-p = 0.95). We also had to increase the generation temperature to 1.2 to encourage more diverse sequences. |
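The decoding configuration quoted above combines temperature scaling with nucleus (top-p) sampling. As a minimal sketch of what that procedure computes (this is an illustrative pure-Python reimplementation, not the authors' code, and the function name `nucleus_sample` is our own), the logits are first flattened by the temperature, then sampling is restricted to the smallest set of tokens whose cumulative probability exceeds top-p:

```python
import math
import random

def nucleus_sample(logits, top_p=0.95, temperature=1.2, rng=None):
    """Sample one token index from raw logits using temperature-scaled
    nucleus (top-p) sampling, per Holtzman et al. (2019)."""
    rng = rng or random.Random()
    # Temperature > 1 flattens the distribution, encouraging diversity.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Sort token indices by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cum, nucleus = 0.0, []
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw a sample.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a sharply peaked distribution and a small top-p, the nucleus collapses to the single most likely token, which is why raising the temperature (here to 1.2) was needed to obtain diverse protein sequences.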