Learning Extrapolative Sequence Transformations from Markov Chains
Authors: Sophia Hager, Aleem Khan, Andrew Wang, Nicholas Andrews
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The proposed approach is validated on three problems: protein sequence design, text sentiment control, and text anonymization. We find that the autoregressive model can extrapolate as well as or better than MCMC, but with the additional benefits of scalability and significantly higher sample efficiency. |
| Researcher Affiliation | Academia | Department of Computer Science, Johns Hopkins University. Correspondence to: Sophia Hager <EMAIL>, Nicholas Andrews <EMAIL>. |
| Pseudocode | No | The paper describes methods in prose, including a 'Toy example' in Section 2, but does not present any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code made available at https://github.com/sophia-hager/learning-MCMC-extrapolation |
| Open Datasets | Yes | We use the ACE2 dataset from Chan et al. (2021), restricting the training data to only examples with ΔΔG between -4 and 10. The objective is to generalize to sequences with ΔΔG beyond the training range (i.e. below -4). We describe our experimental procedure in detail in B.1. and Given a training dataset of Yelp reviews (Zhang et al., 2015) with sentiment ranging from 2-stars to 4-stars and We sample training and evaluation data from the Reddit IUR dataset proposed by Andrews & Bishop (2019). |
| Dataset Splits | Yes | As the Yelp review dataset does not have a premade validation split (Zhang et al., 2015), we use the first thousand examples of the test set as a validation set. Padmakumar et al. (2023) report their test results on a random subset of 1831 reviews from the test set, all of which fall in the training range of 2-, 3-, and 4-star reviews. For MCMC and qθ, we create three 2000-sentence subsets of the test set and report the average of each of these three runs in our results, finding that there is little variation regardless of the test set. and We select 16 posts from 1600 unique users (25600 total posts) to generate training episodes, 16 posts for 50 unique users (800 total posts) for an anonymization validation and test split. |
| Hardware Specification | Yes | We finetune using LoRA (Hu et al., 2022), with a rank of 16 and scaling factor of 32. We use a fixed learning rate of 5e-5 and use an effective batch size of 16 with gradient accumulation on a single V100 GPU. |
| Software Dependencies | No | The paper mentions several models and libraries such as 'Prot-T5-XL', 'T5-3B', 'T5-base', 'GPT-2 large', 'RoBERTa', 'PEGASUS', 'sentence transformers library', and 'Llama3.1-8B', but does not provide specific version numbers for these software components or any other key software dependencies. |
| Experiment Setup | Yes | Table 12 shows the hyperparameters used in our framework. MCMC sampling epochs refers to the number of iterations: we consider that MCMC has run for one epoch when it has run for as many iterations as there are tokens in the sentence. Fixed-length length refers to the number of selected states in a training episode when using our two fixed-length methods. Energy threshold (variable-length) and thinning factor (variable-length) refer to the hyperparameters used to determine sequence length for the variable-length training episodes, as described in Section 2.3. LoRA rank and learning rate are the hyperparameters used while training qθ; as sentiment did not use LoRA, we do not report LoRA rank. Decoding temperature and Decoding top-k refer to the hyperparameters used while generating with qθ. |
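The finetuning setup quoted above reports a LoRA rank of 16 and a scaling factor of 32. As a minimal sketch of what those two hyperparameters mean (this is illustrative NumPy, not the authors' code; the layer dimensions are hypothetical), the low-rank update scales a rank-16 product by alpha/rank:

```python
import numpy as np

# Illustrative LoRA update with the rank (16) and scaling factor (32)
# reported in the paper. Layer dimensions are made up for the example.
d_out, d_in = 768, 768
rank, alpha = 16, 32

rng = np.random.default_rng(0)
W = rng.normal(size=(d_out, d_in))        # frozen pretrained weight
A = rng.normal(size=(rank, d_in)) * 0.01  # trainable low-rank factor
B = np.zeros((d_out, rank))               # zero-initialized, so the update starts at zero

def lora_forward(x):
    # Effective weight is W + (alpha / rank) * B @ A; only A and B are trained.
    return W @ x + (alpha / rank) * (B @ (A @ x))

x = rng.normal(size=d_in)
# Before any training, B is zero, so the LoRA branch contributes nothing.
```

With these values the scaling multiplier alpha/rank is 2, and only the roughly `2 * rank * d` adapter parameters per layer are updated rather than the full `d_out * d_in` weight matrix.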
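The setup description mentions decoding temperature and top-k as the generation hyperparameters for qθ. A minimal sketch of that decoding rule (generic temperature + top-k sampling, not the authors' implementation; the default values below are placeholders, since the quoted text does not give the paper's settings):

```python
import numpy as np

def sample_top_k(logits, temperature=1.0, k=50, rng=None):
    """Sample one token id using temperature scaling followed by top-k filtering.

    Hyperparameter defaults are placeholders, not the paper's reported values.
    """
    if rng is None:
        rng = np.random.default_rng()
    scaled = np.asarray(logits, dtype=float) / temperature
    top = np.argsort(scaled)[-k:]                  # indices of the k highest-scoring tokens
    probs = np.exp(scaled[top] - scaled[top].max())  # stable softmax over the survivors
    probs /= probs.sum()
    return int(rng.choice(top, p=probs))
```

Lower temperatures sharpen the distribution toward the argmax, while smaller k values truncate the tail of the vocabulary before sampling; with k=1 this reduces to greedy decoding.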