DRESSing Up LLM: Efficient Stylized Question-Answering via Style Subspace Editing

Authors: Xinyu Ma, Yifeng Xu, Yang Lin, Tianlong Wang, Xu Chu, Xin Gao, Junfeng Zhao, Yasha Wang

ICLR 2025

Reproducibility (Variable: Result, followed by the supporting LLM response quoted from the paper)
Research Type: Experimental. "We develop two stylized QA benchmark datasets to validate the effectiveness of DRESS, and the results demonstrate significant improvements compared to baseline methods such as prompting and ITI. In short, DRESS is a lightweight, train-free solution for enhancing LLMs with flexible and effective style control, making it particularly useful for developing stylized conversational agents. ... To validate the effectiveness of our approach, we construct an evaluation benchmark comprising two specific stylized question-answering datasets of different languages ... The objective evaluation metrics include style intensity, semantic preservation, and fluency ... Quantitative Analysis: Table 1 presents the performance of various methods on two stylistic QA evaluation benchmarks."
Researcher Affiliation: Academia. "Xinyu Ma1, Yifeng Xu1, Yang Lin1, Tianlong Wang3, Xu Chu1,2,3, Xin Gao1, Junfeng Zhao1, Yasha Wang1,3. 1 School of Computer Science, Peking University; 2 Center on Frontiers of Computing Studies, Peking University; 3 National Research and Engineering Center of Software Engineering, Peking University"
Pseudocode: Yes. "We also present the algorithmic pseudo-code of DRESS in Appendix A. Alg. 1 shows the detailed procedure of how DRESS solves steering vectors and conducts adaptive representation editing."
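The core idea referenced here, editing attention-head representations along a learned style subspace, can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' Alg. 1: the helper `edit_head_activation`, the basis `V_style`, and the toy dimensions are all hypothetical.

```python
import numpy as np

def edit_head_activation(h, V_style, lam):
    """Hypothetical sketch of subspace-based editing: project the head
    activation h onto a style subspace spanned by the orthonormal columns
    of V_style, then strengthen that component with overall strength lam."""
    coeffs = V_style.T @ h        # coordinates of h in the style subspace
    delta = V_style @ coeffs      # component of h inside the subspace
    return h + lam * delta        # shift the activation along the subspace

# Toy example: an 8-dimensional head output, a 2-dimensional style subspace.
rng = np.random.default_rng(0)
V, _ = np.linalg.qr(rng.normal(size=(8, 2)))  # orthonormal basis columns
h = rng.normal(size=8)
h_edit = edit_head_activation(h, V, lam=3.0)
```

By construction, the edit `h_edit - h` lies entirely inside the style subspace, which matches the paper's framing of steering vectors confined to a style subspace; the adaptive per-query weighting of basis directions is omitted here.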
Open Source Code: Yes. "Codes and benchmark datasets are available at https://github.com/ArthurLeoM/DRESS-LLM."
Open Datasets: Yes. "We develop two stylized QA benchmark datasets to validate the effectiveness of DRESS, and the results demonstrate significant improvements compared to baseline methods such as prompting and ITI. ... Codes and benchmark datasets are available at https://github.com/ArthurLeoM/DRESS-LLM."
Dataset Splits: Yes. "Additionally, as mentioned in Section 4.1, for each dataset, we incorporated the general question-answer dataset (i.e., MOSS (Sun et al., 2024) in the corresponding language) to address the bias in question style distribution. We then randomly divided each of them into training and testing sets at a ratio of 10:1. The training set is used to solve the stylized QA model, while the testing set only utilizes the questions as the test queries to evaluate the model performances. The detailed statistics and the examples of the datasets are introduced in Appendix E."
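The 10:1 random split described above can be reproduced with a few lines of standard code. This is a hypothetical re-implementation, not the authors' script; the function name and the fixed seed are our own choices.

```python
import random

def split_10_to_1(examples, seed=42):
    """Shuffle the QA pairs and divide them into training and testing
    sets at a ratio of 10:1 (i.e. 10/11 train, 1/11 test), as the paper
    describes. Hypothetical helper; seed choice is ours."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n_test = max(1, len(data) // 11)
    return data[n_test:], data[:n_test]

# Usage: 1100 examples split into 1000 train and 100 test.
train, test = split_10_to_1(range(1100))
```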
Hardware Specification: Yes. "We apply Qwen-1.5-14B-Chat (Bai et al., 2023) as our base LLM to experiment on. The experiments are conducted on a machine equipped with 8 NVIDIA RTX 3090 24GB GPUs. All the hyperparameters (e.g., the number of selected attention heads H, editing strength λ, etc.) are tuned via grid search. See Appendix D for more details."
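The grid search mentioned above can be sketched generically. The search ranges below are hypothetical (the paper only names the tuned hyperparameters H and λ, not the grids), and the toy scoring function simply favours the reported setting for illustration.

```python
from itertools import product

def grid_search(score_fn, grid):
    """Exhaustively score every configuration in the grid and return the
    best one. Generic sketch, not the authors' tuning code."""
    keys = list(grid)
    best_cfg, best_score = None, float("-inf")
    for values in product(*(grid[k] for k in keys)):
        cfg = dict(zip(keys, values))
        s = score_fn(cfg)
        if s > best_score:
            best_cfg, best_score = cfg, s
    return best_cfg, best_score

# Hypothetical grids over the two hyperparameters the paper names.
grid = {"H": [32, 48, 64], "lam": [1, 2, 3, 4]}
# Toy score that peaks at the paper's reported setting H=64, λ=3.
best, _ = grid_search(lambda c: -abs(c["H"] - 64) - abs(c["lam"] - 3), grid)
```

In practice the score would be a held-out metric such as the paper's style intensity, semantic preservation, or fluency.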
Software Dependencies: Yes. "We apply Qwen-1.5-14B-Chat (Bai et al., 2023) as our base LLM to experiment on."
Experiment Setup: Yes. "For SFT, the rank of LoRA is set to 8, and the number of training epochs is set to 3. We apply a cosine learning rate scheduler with a warm-up stage of 10% of total steps, and the maximum learning rate is set to 5e-5. The batch size is set to 32, and only Wq, Wk, Wv, Wo are fine-tuned. For DRESS, the number of selected attention heads H = 64, the number of style subspace basis vectors K = 16, and the overall editing strength λ = 3. For ITI, the number of selected attention heads H = 64, and the editing strength α = 3. For TrFr, the number of selected attention heads H = 48, the orthogonal regularization coefficient λ = 5e-2, and the editing strength α = 40. For Mean-Centring, the editing strength α = 3, and the edited layers l ∈ {17, 18, ..., 22}. For RepE, the editing strength α = 4, and the edited layers l ∈ {15, 16, ..., 25}."
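For quick reference, the hyperparameters reported in that excerpt can be collected as plain configuration dicts. The key names (e.g. "edited_layers") are our own labels, not the paper's; the values are exactly those quoted above.

```python
# Hyperparameters as reported in the paper's experiment setup.
SFT = {"lora_rank": 8, "epochs": 3, "scheduler": "cosine",
       "warmup_ratio": 0.10, "max_lr": 5e-5, "batch_size": 32,
       "tuned_weights": ["Wq", "Wk", "Wv", "Wo"]}
DRESS = {"num_heads_H": 64, "subspace_basis_K": 16, "edit_strength_lambda": 3}
ITI = {"num_heads_H": 64, "edit_strength_alpha": 3}
TRFR = {"num_heads_H": 48, "ortho_reg_lambda": 5e-2, "edit_strength_alpha": 40}
MEAN_CENTRING = {"edit_strength_alpha": 3, "edited_layers": list(range(17, 23))}
REPE = {"edit_strength_alpha": 4, "edited_layers": list(range(15, 26))}
```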