Diversity-Rewarded CFG Distillation
Authors: Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin, Alexandre Ramé
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on the MusicLM text-to-music generative model, where our approach surpasses CFG in terms of quality-diversity Pareto optimality. According to human evaluators, our finetuned-then-merged model generates samples with higher quality-diversity than the base model augmented with CFG. |
| Researcher Affiliation | Industry | Geoffrey Cideron, Andrea Agostinelli, Johan Ferret, Sertan Girgin, Romuald Elie, Olivier Bachem, Sarah Perrin*, Alexandre Ramé* — Google DeepMind; * Equal advisory contribution. Correspondence to: Geoffrey Cideron <EMAIL> |
| Pseudocode | No | The paper includes mathematical equations (1-6) and derivations in Appendix A but does not feature any clearly labeled pseudocode blocks or algorithms formatted with structured steps. |
| Open Source Code | No | Explore our generations at google-research.github.io/seanet/musiclm/diverse_music. This link directs to a webpage for exploring generated music samples, not to source code for the methodology. |
| Open Datasets | Yes | We use the prompt dataset described in Section 4.1 from Cideron et al. (2024)... and prompts derived from MusicCaps (Agostinelli et al., 2023). ...using 16 kHz audio excerpts sourced from the same training dataset as Agostinelli et al. (2023). |
| Dataset Splits | No | The paper mentions using a batch size of 128 and provides details about human evaluation prompts (101 for quality, 50 for diversity), but it does not specify training, validation, or test splits for the datasets used to train the models. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running experiments, such as GPU models, CPU types, or cloud computing specifications. |
| Software Dependencies | No | The paper mentions using a 'LLM transformer-based architecture', 'RL algorithm is a variant of REINFORCE (Williams, 1992)', and 'semi-hard triplet loss (Schroff et al., 2015)' but does not specify any software names with version numbers (e.g., programming languages, libraries, frameworks). |
| Experiment Setup | Yes | For CFG, we set γ = 3 and use the negative prompt "Bad audio quality". ...with temperature T = 0.99. ...We use a batch size of 128 and a learning rate of 0.00015 for all our finetunings. |
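The CFG hyperparameters quoted in the Experiment Setup row (γ = 3, a negative prompt, temperature T = 0.99) can be sketched as a guided next-token sampler. This is a minimal illustration of standard classifier-free guidance over logits, not the paper's actual implementation; the function name, shapes, and toy vocabulary are assumptions.

```python
import numpy as np

def cfg_sample(logits_cond, logits_neg, gamma=3.0, temperature=0.99, rng=None):
    """Sample one token with classifier-free guidance.

    logits_cond: next-token logits conditioned on the text prompt.
    logits_neg:  logits conditioned on the negative prompt
                 (e.g. "Bad audio quality").
    gamma: guidance scale; gamma = 1 recovers plain conditional sampling.
    """
    rng = rng or np.random.default_rng(0)
    # Extrapolate away from the negative-prompt distribution.
    guided = logits_neg + gamma * (logits_cond - logits_neg)
    # Temperature-scaled softmax, computed stably.
    z = guided / temperature
    z -= z.max()
    probs = np.exp(z) / np.exp(z).sum()
    return int(rng.choice(len(probs), p=probs))

# Toy usage with a 4-token vocabulary.
cond = np.array([2.0, 0.5, -1.0, 0.0])
neg = np.array([0.0, 1.5, -0.5, 0.0])
token = cfg_sample(cond, neg, gamma=3.0, temperature=0.99)
```

With γ > 1 the guided logits push probability mass toward tokens the positive prompt favors over the negative prompt, which is why a quality-describing negative prompt raises perceived quality at some cost to diversity.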
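The "finetuned-then-merged model" mentioned in the Research Type row refers to combining finetuned checkpoints by interpolating their weights. The sketch below shows plain linear interpolation (LERP) between two same-architecture parameter dictionaries; the parameter names and toy values are hypothetical, and this is an illustration of the general technique rather than the paper's exact merging recipe.

```python
def merge_weights(params_a, params_b, alpha=0.5):
    """Linearly interpolate two checkpoints that share an architecture.

    alpha = 0 returns params_a; alpha = 1 returns params_b.
    """
    assert params_a.keys() == params_b.keys()
    return {k: (1.0 - alpha) * params_a[k] + alpha * params_b[k]
            for k in params_a}

# Toy usage: blend a quality-focused and a diversity-focused checkpoint.
quality_model = {"w": 1.0, "b": 0.0}
diversity_model = {"w": 0.0, "b": 2.0}
merged = merge_weights(quality_model, diversity_model, alpha=0.5)
```

Sweeping alpha at deployment time traces a quality-diversity trade-off without retraining, which is the practical appeal of merging over picking a single finetuned checkpoint.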