An approximate sampler for energy-based models with divergence diagnostics

Authors: Bryan Eikema, Germán Kruszewski, Christopher R Dance, Hady Elsahar, Marc Dymetman

TMLR 2022

Reproducibility Variable Result LLM Response
Research Type Experimental We demonstrate the effectiveness of QRS on controlled text generation. Large pre-trained language models are becoming increasingly useful general purpose tools and generation is typically accomplished by sampling (Nadeem et al., 2020). Controlling the distribution of these language models to accommodate human preferences can be difficult, but EBMs are a promising way to achieve this (Khalifa et al., 2021). However, sampling from EBMs defined over a discrete sequence space is non-trivial, making it a challenging task to benchmark QRS. In this paper, we experiment with EBMs resulting from restricting a GPT-2 language model (Radford et al., 2019) in some way: either the model is restricted to only generate sequences containing a specific term; or the model is restricted to have certain moments, for example debiasing a distribution over biographies to consist of 50% female biographies. We explore a variety of ways to construct proposal distributions for QRS. In particular, we explore prompting a pre-trained language model, as well as training an autoregressive model to approximate the EBM (Khalifa et al., 2021). In App. B, we also experiment with a paraphrase generation task in which we use off-the-shelf machine translation models as conditional proposal distributions. Results show that we are able to approximate the target distributions to any desired level in exchange for sampling efficiency.
Researcher Affiliation Collaboration Bryan Eikema EMAIL University of Amsterdam Germán Kruszewski EMAIL NAVER Labs Europe Christopher Dance EMAIL NAVER Labs Europe Hady Elsahar EMAIL Meta AI Marc Dymetman EMAIL Independent researcher
Pseudocode Yes Algorithm 1 QRS
1: Require: target P, proposal q, parameter β, number of required samples N {0 < β < ∞}
2: n ← 0
3: while n < N do
4:   x ∼ q
5:   r_x ← min(1, P(x)/(β q(x))) {acceptance prob.}
6:   u ∼ U[0,1] {U[0,1]: uniform dist. over [0, 1]}
7:   if u ≤ r_x then
8:     output x
9:     n ← n + 1
10:  end if
11: end while
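The accept/reject loop above can be sketched in a few lines of Python. This is our own illustration, not the authors' released code: the function and argument names are ours, and densities are assumed to be supplied in log space for numerical stability.

```python
import math
import random

def qrs_sample(log_p, log_q, sample_q, beta, n_samples):
    """Quasi-rejection sampling sketch (our illustration).

    log_p: (possibly unnormalized) log-density of the target P
    log_q: log-density of the proposal q
    sample_q: callable drawing one sample from q
    beta > 0: trades fidelity to P against acceptance rate
    """
    out = []
    while len(out) < n_samples:
        x = sample_q()
        # log of the acceptance probability min(1, P(x) / (beta * q(x)))
        log_r = min(0.0, log_p(x) - math.log(beta) - log_q(x))
        # accept x with probability exp(log_r)
        if math.log(random.random() or 1e-300) < log_r:
            out.append(x)
    return out
```

With `beta = 1` and `log_p == log_q`, every draw is accepted, matching the intuition that the acceptance rate degrades only as P diverges from βq.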
Open Source Code No 1To access the code for this work (scheduled for release in Jan. 2023) and for a short introductory video, please go to https://disco.europe.naverlabs.com/QRS. 2We will release the code for our experiments upon publication.
Open Datasets Yes We now turn to the task, also introduced by Khalifa et al. (2021), of generating biographies of scientists while debiasing the gender distribution to contain female scientists 50% of the time. For this we make use of GPT-2 Biographies (a(x)), a language model fine-tuned on Wikipedia biographies, and follow the same setup as the authors to define the binary classifiers identifying sequences talking about scientists or females and infer an EBM that matches the distributional constraints with minimal deviation from the original model.
Dataset Splits No The paper mentions generating "1M samples" for evaluation and "10^4 elements" for toy examples, but it does not specify how the datasets used for training the underlying models (like GPT-2 Biographies) were split into training, validation, or test sets. It primarily focuses on evaluating samples generated by the QRS method rather than training new models from scratch.
Hardware Specification No The paper does not explicitly mention any specific hardware (e.g., GPU models, CPU types, or cloud computing instances) used for running the experiments or training the models.
Software Dependencies No The paper mentions using "GPT-2 language model (Radford et al., 2019)", "sentence-BERT (Reimers and Gurevych, 2019)", and "BERT (Devlin et al., 2019)". However, it does not specify any version numbers for these software components or any other libraries/frameworks used.
Experiment Setup Yes In all cases, we use β values in the interval [0.5, 3.5]. Furthermore, we compute the sampler's efficiency by estimating the acceptance rate (AR) for each value of β (Eq. 4). We set a burn-in period of 1,000 steps and only keep every 1,000th sample to attain an acceptance rate of 10^-3. Note that we chose not to include the burn-in period when computing the acceptance rate of MCMC samplers, as this period is constant and does not grow with sample size. We also experiment with a reset variant (-R) of the MH samplers that does away with autocorrelations among samples altogether (i.e. produces i.i.d. samples like QRS) by, instead of using thinning, resetting the chain after 1,000 steps and only retaining the last sample of the chain (see Robert and Casella 2004, Theorem 7.4 (ii)). This variant does not make use of a burn-in period.
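The reset (-R) strategy quoted above can be sketched as follows. This is a minimal illustration in our own notation (function names and the toy symmetric proposal are our assumptions, not the authors' code): each sample comes from an independent Metropolis-Hastings chain whose final state alone is kept, so the draws are i.i.d. at the cost of one full chain per sample.

```python
import math
import random

def mh_step(x, log_p, propose):
    """One Metropolis-Hastings step, assuming a symmetric proposal."""
    y = propose(x)
    if random.random() < min(1.0, math.exp(log_p(y) - log_p(x))):
        return y
    return x

def mh_reset_samples(log_p, propose, x0, chain_len, n_samples):
    """Reset variant (-R): restart the chain from x0 for every sample and
    keep only its final state, removing autocorrelation entirely
    (cf. Robert and Casella 2004, Theorem 7.4 (ii))."""
    out = []
    for _ in range(n_samples):
        x = x0
        for _ in range(chain_len):
            x = mh_step(x, log_p, propose)
        out.append(x)
    return out
```

By contrast, the thinned variant would run one long chain and keep every `chain_len`-th state, which reuses computation but only approximately decorrelates successive samples.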