ReFF: Reinforcing Format Faithfulness in Language Models Across Varied Tasks

Authors: Jiashu Yao, Heyan Huang, Zeming Liu, Haoyu Wen, Wei Su, Boao Qian, Yuhang Guo

AAAI 2025

Reproducibility Variable Result LLM Response
Research Type Experimental Extensive experiments on the benchmark reveal that state-of-the-art open- and closed-source LLMs still suffer from severe deficiency in format faithfulness. By virtue of the decidable nature of formats, we propose to Reinforce Format Faithfulness (REFF) to help LLMs generate formatted output as instructed without compromising general quality. Extensive experiments of REFF on FORMATBENCH yield highly favorable results.
Researcher Affiliation Academia 1. School of Computer Science and Technology, Beijing Institute of Technology; 2. School of Computer Science and Engineering, Beihang University
Pseudocode Yes Algorithm 1: REFF
Input: query set Q, format checker F, LLM M, # epochs n
Output: adapted LLM M'
1: Let M' <- M
2: for epoch in [1, 2, ..., n] do
3:   for q in Q do
4:     r <- M'(q)             // response generation
5:     s <- F(q, r)           // format checking, s in {-1, 1}
6:     M' <- step(M', q, r, s)  // PPO stepping
7:   end for
8: end for
9: return M'
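Algorithm 1 can be sketched as a plain loop. The snippet below is a minimal, dependency-free illustration; the model, format checker, and PPO step are hypothetical stubs standing in for the paper's actual trl-based implementation.

```python
# Minimal sketch of the REFF loop (Algorithm 1).
# generate, format_checker, and ppo_step are hypothetical stand-ins:
# in the paper, generation comes from the LLM and ppo_step is an
# RLHF-style PPO update implemented with the trl library.

def reff(queries, format_checker, generate, ppo_step, n_epochs):
    """For each query: generate a response, score its format, reinforce."""
    for _ in range(n_epochs):
        for q in queries:
            r = generate(q)                          # response generation
            s = 1 if format_checker(q, r) else -1    # format check, s in {-1, 1}
            ppo_step(q, r, s)                        # PPO step updates the model

# Toy usage: reward any response containing a "key: value" colon.
rewards = []

def toy_checker(q, r):
    return ":" in r

def toy_generate(q):
    return f"answer: {q}"

def toy_step(q, r, s):
    rewards.append(s)

reff(["q1", "q2"], toy_checker, toy_generate, toy_step, n_epochs=2)
# rewards is now [1, 1, 1, 1]: 2 epochs x 2 queries, all format-valid
```

The key property the loop exploits is that format validity is decidable, so the reward signal F(q, r) needs no human labels.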
Open Source Code Yes Code & Datasets https://github.com/BITHLP/ReFF
Open Datasets Yes Code & Datasets https://github.com/BITHLP/ReFF. To address the gap in comprehensive benchmarks, we combine adaptation of existing datasets, online data collection, and manual data annotation, presenting FORMATBENCH.
Dataset Splits Yes Table 3: Data used for RL in the three settings of REFF.

Settings      Test Queries  Train Queries  Train Labels
REFF-tst      yes           no             no
REFF-trn      no            yes            no
REFF-trn-ft   no            yes            yes

Test-Only REFF: When no extra training data exists, the LLM can use the queries in the test set as the query set Q. Notably, no test-set labels are available to the model in this setting. Train-Only REFF w./wo. Finetuning: The train-only setting can be applied in an online scenario, where queries are processed and responded to one by one, since the adaptation of the LLM involves only training queries as the query set Q. Additionally, considering that a training set often includes both queries and labels, we further study a train-only-with-finetuning setting, where the reinforcement process is run after finetuning on the training set.
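The choice of query set Q across the three settings of Table 3 can be expressed as a small dispatch. This helper is a hypothetical illustration (the function and setting names mirror the table, not the paper's code):

```python
# Hypothetical helper selecting the RL query set Q for each REFF setting
# from Table 3. Labels are only consumed by the separate finetuning stage
# in REFF-trn-ft; the RL loop itself never sees them.

def build_query_set(setting, test_queries, train_queries):
    if setting == "REFF-tst":
        # Test-only: RL runs on test queries, with no labels available.
        return test_queries
    if setting in ("REFF-trn", "REFF-trn-ft"):
        # Train-only (with or without prior finetuning): RL uses train queries.
        return train_queries
    raise ValueError(f"unknown setting: {setting}")
```

In all three settings the reward still comes solely from the format checker, which is why even the label-free REFF-tst setting is viable.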
Hardware Specification No No specific hardware details (like GPU/CPU models or types) were mentioned for running the experiments. The paper lists LLMs used but not the computing infrastructure.
Software Dependencies No The paper states: "We use trl (von Werra et al. 2020) library to implement the finetuning and the RLHF-style PPO of REFF." While a library is named, a specific version number for the 'trl' library is not provided, which is required for reproducibility.
Experiment Setup Yes Hyper-Parameters To ensure the robustness and reliability of the results, we try to use default and commonly-used hyper-parameters, and keep them consistent across experiments. Here we list several key points; the detailed hyper-parameters are outlined in Appendix D. In generation, we adopt greedy decoding in all experiments for a fair and efficient comparison. We use LoRA (Hu et al. 2021) in all LLM adaptation experiments with a consistent configuration r = 16. In finetuning, we use a constant learning rate of 2e-5 and train for 3 epochs with 256 instances per batch. In reinforcement learning, we set the target KL divergence to 6, use a constant learning rate of 1.41e-5, and train for 3 epochs with 32 instances per batch.
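The reported RL hyper-parameters map naturally onto the configuration objects of the libraries the paper names. The fragment below is a sketch only: it assumes peft's LoraConfig and trl's PPOConfig with argument names from the trl 0.x API, which may differ across versions, and it is not the paper's actual configuration file.

```python
# Hypothetical configuration mirroring the reported hyper-parameters.
# Argument names assume peft's LoraConfig and an older trl PPOConfig API;
# verify against the installed library versions before use.
from peft import LoraConfig
from trl import PPOConfig

lora_config = LoraConfig(r=16)  # same LoRA rank in all adaptation runs

ppo_config = PPOConfig(
    learning_rate=1.41e-5,  # constant RL learning rate
    batch_size=32,          # 32 instances per RL batch
    target=6,               # target KL divergence for the adaptive KL controller
)
```

Note that greedy decoding for generation and the finetuning schedule (lr 2e-5, 3 epochs, batch 256) are configured separately from this PPO fragment.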