Learning Extrapolative Sequence Transformations from Markov Chains
Authors: Sophia Hager, Aleem Khan, Andrew Wang, Nicholas Andrews
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well as or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency. |
| Researcher Affiliation | Academia | Department of Computer Science, Johns Hopkins University. Correspondence to: Sophia Hager <EMAIL>, Nicholas Andrews <EMAIL>. |
| Pseudocode | No | The paper describes methods in prose, including a 'Toy example' in Section 2, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code made available at https://github.com/sophia-hager/learning-MCMC-extrapolation |
| Open Datasets | Yes | We use the ACE2 dataset from Chan et al. (2021), restricting the training data to only examples with ΔΔG between -4 and 10. The objective is to generalize to sequences with ΔΔG beyond the training range (i.e. below -4). We describe our experimental procedure in detail in B.1. and Given a training dataset of Yelp reviews (Zhang et al., 2015) with sentiment ranging from 2-stars to 4-stars and We sample training and evaluation data from the Reddit IUR dataset proposed by Andrews & Bishop (2019). |
| Dataset Splits | Yes | As the Yelp review dataset does not have a premade validation split (Zhang et al., 2015), we use the first thousand examples of the test set as a validation set. Padmakumar et al. (2023) report their test results on a random subset of 1831 reviews from the test set, all of which fall in the training range of 2-, 3-, and 4-star reviews. For MCMC and qθ, we create three 2000-sentence subsets of the test set and report the average of each of these three runs in our results, finding that there is little variation regardless of the test set. and We select 16 posts from 1600 unique users (25600 total posts) to generate training episodes, 16 posts for 50 unique users (800 total posts) for an anonymization validation and test split. |
| Hardware Specification | Yes | We finetune using LoRA (Hu et al., 2022), with a rank of 16 and scaling factor of 32. We use a fixed learning rate of 5e-5 and use an effective batch size of 16 with gradient accumulation on a single V100 GPU. |
| Software Dependencies | No | The paper mentions several models and libraries such as 'Prot-T5-XL', 'T5-3B', 'T5-base', 'GPT-2 large', 'RoBERTa', 'PEGASUS', 'sentence transformers library', and 'Llama3.1-8B', but does not provide specific version numbers for these software components or any other key software dependencies. |
| Experiment Setup | Yes | Table 12 shows the hyperparameters used in our framework. MCMC sampling epochs refers to the number of iterations: we consider that MCMC has run for one epoch when it has run for as many iterations as there are tokens in the sentence. Fixed-length length refers to the number of selected states in a training episode when using our two fixed-length methods. Energy threshold (variable-length) and thinning factor (variable-length) refer to the hyperparameters used to determine sequence length for the variable-length training episodes, as described in Section 2.3. LoRA rank and learning rate are the hyperparameters used while training qθ; as sentiment did not use LoRA, we do not report LoRA rank. Decoding temperature and Decoding top-k refer to the hyperparameters used while generating with qθ. |
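The finetuning setup quoted above reports a LoRA rank of 16 and a scaling factor of 32. As a minimal sketch of what those two hyperparameters mean (this is illustrative NumPy, not the authors' code; the layer dimensions are hypothetical), the low-rank update scales a rank-16 product by alpha/rank:

```python
import numpy as np

# Illustrative LoRA update with the rank (16) and scaling factor (32)
# reported in the paper. Layer dimensions are made up for the example.
d_out, d_in = 768, 768
rank, alpha = 16, 32

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # zero-initialized, so the update starts at zero

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A; only A and B are trained.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Before any training, B is zero, so the LoRA branch contributes nothing.
```

With these values the scaling multiplier alpha/rank is 2, and only the roughly `2 * rank * d` adapter parameters per layer are updated rather than the full `d_out * d_in` weight matrix.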
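The setup description mentions decoding temperature and top-k as the generation hyperparameters for qθ. A minimal sketch of that decoding rule (generic temperature + top-k sampling, not the authors' implementation; the default values below are placeholders, since the quoted text does not give the paper's settings):

```python
import numpy as np

def sample_top_k(logits, temperature=1.0, k=50, rng=None):
    """Sample one token id using temperature scaling followed by top-k filtering.

    Hyperparameter defaults are placeholders, not the paper's reported values.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(scaled)[-k:]                  # indices of the k highest-scoring tokens
    probs = np.exp(scaled[top] - scaled[top].max())  # stable softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

Lower temperatures sharpen the distribution toward the argmax, while smaller k values truncate the tail of the vocabulary before sampling; with k=1 this reduces to greedy decoding.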