Advancing Audio-Based Text Generation with Imbalance Preference Optimization

Authors: Zhenghao Zhou, Yongjie Liu, Chen Cao

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We conduct both subjective and objective evaluations to demonstrate the alignment benefits of IPO and its enhancement of model perception and generation capacities. On both AAC and AST, a few hundred annotations significantly enhance the weak model, and the strong model can also be encouraged to achieve new state-of-the-art results in terms of objective metrics. Additionally, we show the extensibility of IPO by applying it to the reverse task of text-to-speech generation, improving system robustness on unseen reference speakers.
Researcher Affiliation | Academia | 1 National Supercomputing Center in Wuxi, China; 2 University of Sheffield, United Kingdom
Pseudocode | No | The paper describes the methodology in prose and mathematical formulas, but does not include an explicitly labeled 'Pseudocode' or 'Algorithm' block, nor a structured, code-like formatted procedure.
Open Source Code | No | The paper does not contain an explicit statement about releasing source code for the described methodology, nor does it provide a link to a code repository.
Open Datasets | Yes | For experiments, we use the FLEURS (Conneau et al. 2023) benchmark to evaluate our proposed approach, which is one of the most popular AST benchmarks. In this work, we select two large audio language models that can perform audio captioning as our starting point, i.e., Pengi (Deshmukh et al. 2023) and Qwen-Audio (Chu et al. 2023), both of which attempt to incorporate the audio modality into a language model for comprehensive understanding. For experiments, we use the Clotho (Drossos, Lipping, and Virtanen 2020) benchmark to evaluate our proposed approach, which is among the most popular AAC benchmarks. We select two sizes of the zero-shot TTS system proposed in VoiceCraft (Peng et al. 2024), i.e., VC-330M and VC-830M, as our baseline models, and we select the popular LibriTTS dataset (Zen et al. 2019) as the benchmark.
Dataset Splits | Yes | We utilize the training set of FLEURS (Conneau et al. 2023) to show the BLEU diversity in {y_i}_{i=1}^{n} with a beam size of 5. As shown in Table 1, each level samples 50 data points according to their average BLEU (1), e.g., level 1 consists of 50 sets of results with an average BLEU from 40 to 41. Specifically, we select 15 common X→En language directions in this study. For subjective evaluation, for each language, 100 examples are randomly selected from the test set. 100 examples (6 to 16 seconds) are sampled from the LibriTTS test set.
Hardware Specification | No | The paper does not provide specific details about the hardware used, such as GPU models, CPU types, or memory specifications.
Software Dependencies | No | The paper mentions several models and frameworks used (e.g., Whisper-large-v2, SeamlessM4T-large-v2, Pengi, Qwen-Audio, VoiceCraft, NISQA, WavLM-TDCNN) and their corresponding citation years, but it does not specify any programming language versions or library versions (e.g., Python 3.x, PyTorch 1.x) used for the implementation.
Experiment Setup | Yes | For sampling, we use the two large models introduced above with a beam size of 5. Additionally, we utilize the hyper-parameter β in Eq. (6) to control the model's updates: when the preference for a given data point comes from human annotators, we set β to 0.1 (the same as (Ethayarajh et al. 2024)); however, when it comes from beam search or the adversarial model, β is decreased to 0.05 to regulate the update magnitude.
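The BLEU-level sampling quoted in the Dataset Splits row (50 hypothesis sets per one-point-wide BLEU level) can be sketched as below. Since the paper releases no code, this is a hypothetical reconstruction: the function name `bucket_by_avg_bleu`, its parameters, and the (id, score) input shape are all illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of BLEU-level bucketing: group beam-search
# hypothesis sets by their average sentence-level BLEU into levels of
# width 1 BLEU point, then sample 50 sets per sufficiently full level.
from collections import defaultdict
import random

def bucket_by_avg_bleu(hypothesis_sets, level_width=1.0, per_level=50, seed=0):
    """hypothesis_sets: iterable of (set_id, avg_bleu) pairs."""
    levels = defaultdict(list)
    for set_id, avg_bleu in hypothesis_sets:
        # e.g. avg BLEU in [40, 41) falls into level index 40
        levels[int(avg_bleu // level_width)].append(set_id)
    rng = random.Random(seed)
    sampled = {}
    for level, ids in sorted(levels.items()):
        if len(ids) >= per_level:  # skip levels too sparse to sample from
            sampled[level] = rng.sample(ids, per_level)
    return sampled
```

With this grouping, "level 1" in the paper's Table 1 would correspond to one such bucket (e.g., average BLEU from 40 to 41).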
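The β schedule in the Experiment Setup row reduces to a switch on where the preference label came from. A minimal sketch, assuming a `PreferenceSource` enum (an illustrative name, not from the paper); the two values 0.1 and 0.05 are the ones the paper reports.

```python
# Sketch of the β schedule for the preference loss (Eq. 6 of the paper):
# human-annotated preferences use β = 0.1 (as in Ethayarajh et al. 2024),
# while noisier preferences from beam search or the adversarial model use
# a smaller β = 0.05 to regulate the update magnitude.
from enum import Enum

class PreferenceSource(Enum):
    HUMAN = "human"
    BEAM_SEARCH = "beam_search"
    ADVERSARIAL = "adversarial"

def beta_for(source: PreferenceSource) -> float:
    """Return the β used in the preference loss for a given label source."""
    if source is PreferenceSource.HUMAN:
        return 0.1
    return 0.05
```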