Multi-Attribute Constraint Satisfaction via Language Model Rewriting
Authors: Ashutosh Baheti, Debanjana Chakraborty, Faeze Brahman, Ronan Le Bras, Ximing Lu, Nouha Dziri, Yejin Choi, Mark Riedl, Maarten Sap
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To evaluate our approach, we present a new Fine-grained Constraint Satisfaction (FineCS) benchmark, featuring two challenging tasks: (1) Text Style Transfer, where the goal is to simultaneously modify the sentiment and complexity of reviews, and (2) Protein Design, focusing on modulating fluorescence and stability of Green Fluorescent Proteins (GFP). Our empirical results show that MACS achieves the highest threshold satisfaction in both FineCS tasks, outperforming strong domain-specific baselines. |
| Researcher Affiliation | Academia | Georgia Institute of Technology, University of Washington, The Ohio State University, Carnegie Mellon University, Allen Institute for Artificial Intelligence |
| Pseudocode | Yes | Algorithm 1: Multi-Attribute Constraint Satisfaction Training pseudo code |
| Open Source Code | Yes | We release the code at https://github.com/abaheti95/MACS. |
| Open Datasets | Yes | We train the Sentiment regressor on Yelp reviews (Zhang et al., 2015). The original data contained 650K train and 50K test reviews divided evenly across five labels (1 very negative, 2 negative, 3 neutral, 4 positive, and 5 very positive). To obtain the Complexity regressor we train a ranking model on top of the SWiPE Wikipedia simplification dataset (Laban et al., 2023). We obtain the dataset of 51.7K mutants of the GFP wild-type (i.e., the protein sequence that occurs in nature) (Sarkisyan et al., 2016; Gonzalez Somermeyer et al., 2022). |
| Dataset Splits | Yes | We divide the dataset into 50% train, 15% validation, and 35% test set for both fluorescence and ddG attributes. After filtering, we obtain 464K train reviews and 36K test reviews. We randomly sample 36K reviews from the train set for validation and train a RoBERTa-large (Liu et al., 2020) regressor on the remaining instances for 4 epochs using mean squared error loss. After filtering these instances, we are left with 79K train, 1K validation, and 1.8K test simple-to-complex pairs. |
| Hardware Specification | No | No specific hardware details (like GPU models, CPU types, or memory specifications) used for running experiments are mentioned in the paper. |
| Software Dependencies | No | The paper mentions several models and tools such as RoBERTa-large, Llama2-7B, Llama3-8B, TinyLlama, the ProtGPT2 LM, ESM2-based regressors, sentence-transformers, and the FoldX software. However, it does not provide specific version numbers for the underlying software frameworks (e.g., PyTorch, TensorFlow) or the exact version of the FoldX software itself, which are necessary for full reproducibility. |
| Experiment Setup | Yes | We train both control-token and text-prompted models with supervised fine-tuning (SFT) for 200K steps and a batch size of 16. For the weighted behavior cloning (wBC) objective, we continue training the supervised fine-tuned models for an additional 50% of the steps (100K steps). We finetune the ProtGPT2 LM (Ferruz et al., 2022), a 738M-parameter protein language model, as the rewriter for this task... we train ProtGPT2 with SFT for 20K steps with batch size 16 and learning rate 10^-4. We then further continue finetuning with the wBC objective for an additional 10K steps and learning rate 10^-5. For the Protein Language Model editor models trained with our MACS framework, we use nucleus sampling (Holtzman et al., 2019) with (top-p = 0.95). We also had to increase the generation temperature to 1.2 to encourage more diverse sequences. |
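The decoding configuration quoted above combines temperature scaling with nucleus (top-p) sampling. As a minimal sketch of what that procedure computes (this is an illustrative pure-Python reimplementation, not the authors' code, and the function name `nucleus_sample` is our own), the logits are first flattened by the temperature, then sampling is restricted to the smallest set of tokens whose cumulative probability exceeds top-p:

```python
import math
import random

def nucleus_sample(logits, top_p=0.95, temperature=1.2, rng=None):
    """Sample one token index from raw logits using temperature-scaled
    nucleus (top-p) sampling, per Holtzman et al. (2019)."""
    rng = rng or random.Random()
    # Temperature > 1 flattens the distribution, encouraging diversity.
    scaled = [l / temperature for l in logits]
    # Numerically stable softmax.
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Sort token indices by probability, descending.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    # Keep the smallest prefix whose cumulative mass reaches top_p.
    cum, nucleus = 0.0, []
    for i in order:
        nucleus.append(i)
        cum += probs[i]
        if cum >= top_p:
            break
    # Renormalize over the nucleus and draw a sample.
    total = sum(probs[i] for i in nucleus)
    r = rng.random() * total
    for i in nucleus:
        r -= probs[i]
        if r <= 0:
            return i
    return nucleus[-1]
```

With a sharply peaked distribution and a small top-p, the nucleus collapses to the single most likely token, which is why raising the temperature (here to 1.2) was needed to obtain diverse protein sequences.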