Advancing Audio-Based Text Generation with Imbalance Preference Optimization

Authors: Zhenghao Zhou, Yongjie Liu, Chen Cao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct both subjective and objective evaluations to demonstrate the alignment benefits of IPO and its enhancement of model perception and generation capacities. On both AAC and AST, a few hundred annotations significantly enhance the weak model, and the strong model can also be encouraged to achieve new state-of-the-art results in terms of objective metrics. Additionally, we show the extensibility of IPO by applying it to the reverse task of text-to-speech generation, improving system robustness on unseen reference speakers.
Researcher Affiliation | Academia | 1 National Supercomputing Center in Wuxi, China; 2 University of Sheffield, United Kingdom
Pseudocode | No | The paper describes the methodology in prose and mathematical formulas, but does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block, nor a structured, code-like formatted procedure.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | For experiments, we use the FLEURS (Conneau et al. 2023) benchmark to evaluate our proposed approach, which is one of the most popular AST benchmarks. In this work, we select two large audio language models that can perform audio captioning as our starting point, i.e., Pengi (Deshmukh et al. 2023) and Qwen-Audio (Chu et al. 2023), both of which attempt to incorporate the audio modality into a language model for comprehensive understanding. For experiments, we use the Clotho (Drossos, Lipping, and Virtanen 2020) benchmark to evaluate our proposed approach, which is among the most popular AAC benchmarks. We select two sizes of the zero-shot TTS system proposed in VoiceCraft (Peng et al. 2024), i.e., VC-330M and VC-830M, as our baseline models, and we select the popular LibriTTS dataset (Zen et al. 2019) as the benchmark.
Dataset Splits | Yes | We utilize the training set of FLEURS (Conneau et al. 2023) to show the BLEU diversity in {y_i}_{i=1}^{n} with a beam size of 5. As shown in Table 1, each level samples 50 data points according to their average BLEU (1), e.g., level 1 consists of 50 sets of results with an average BLEU from 40 to 41. Specifically, we select 15 common X→En language directions in this study. For subjective evaluation, for each language, 100 examples are randomly selected from the test set. 100 examples (6 to 16 seconds) are sampled from the LibriTTS test set.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions several models and frameworks used (e.g., Whisper-large-v2, SeamlessM4T-large-v2, Pengi, Qwen-Audio, VoiceCraft, NISQA, WavLM-TDCNN) and their corresponding citation years, but it does not specify any programming language versions or library versions (e.g., Python 3.x, PyTorch 1.x) used for the implementation.
Experiment Setup | Yes | For sampling, we use the two large models introduced above with a beam size of 5. Additionally, we utilize the hyper-parameter β in Eq. (6) to control the model's updates: when the preference for a given data point comes from human annotators, we set β to 0.1 (the same as (Ethayarajh et al. 2024)); however, when it comes from beam search or the adversarial model, β is decreased to 0.05 to regulate the update magnitude.
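The BLEU-level sampling quoted in the Dataset Splits row (50 hypothesis sets per one-point-wide BLEU level) can be sketched as below. Since the paper releases no code, this is a hypothetical reconstruction: the function name `bucket_by_avg_bleu`, its parameters, and the (id, score) input shape are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of BLEU-level bucketing: group beam-search
# hypothesis sets by their average sentence-level BLEU into levels of
# width 1 BLEU point, then sample 50 sets per sufficiently full level.
from collections import defaultdict
import random

def bucket_by_avg_bleu(hypothesis_sets, level_width=1.0, per_level=50, seed=0):
    """hypothesis_sets: iterable of (set_id, avg_bleu) pairs."""
    levels = defaultdict(list)
    for set_id, avg_bleu in hypothesis_sets:
        # e.g. avg BLEU in [40, 41) falls into level index 40
        levels[int(avg_bleu // level_width)].append(set_id)
    rng = random.Random(seed)
    sampled = {}
    for level, ids in sorted(levels.items()):
        if len(ids) >= per_level:  # skip levels too sparse to sample from
            sampled[level] = rng.sample(ids, per_level)
    return sampled
```

With this grouping, "level 1" in the paper's Table 1 would correspond to one such bucket (e.g., average BLEU from 40 to 41).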
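The β schedule in the Experiment Setup row reduces to a switch on where the preference label came from. A minimal sketch, assuming a `PreferenceSource` enum (an illustrative name, not from the paper); the two values 0.1 and 0.05 are the ones the paper reports.

```python
# Sketch of the β schedule for the preference loss (Eq. 6 of the paper):
# human-annotated preferences use β = 0.1 (as in Ethayarajh et al. 2024),
# while noisier preferences from beam search or the adversarial model use
# a smaller β = 0.05 to regulate the update magnitude.
from enum import Enum

class PreferenceSource(Enum):
    HUMAN = "human"
    BEAM_SEARCH = "beam_search"
    ADVERSARIAL = "adversarial"

def beta_for(source: PreferenceSource) -> float:
    """Return the β used in the preference loss for a given label source."""
    if source is PreferenceSource.HUMAN:
        return 0.1
    return 0.05
```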