Constrain Alignment with Sparse Autoencoders

Authors: Qingyu Yin, Chak Tou Leong, Hongbo Zhang, Minjun Zhu, Hanqi Yan, Qiang Zhang, Yulan He, Wenjie Li, Jun Wang, Yue Zhang, Linyi Yang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experimental results on benchmark datasets demonstrate that FPO achieves an above 5% absolute improvement in win rate with much lower computational cost compared to state-of-the-art baselines, making it a promising solution for efficient and controllable LLM alignments. Code is available at Feature Alignment."
Researcher Affiliation | Academia | 1 Zhejiang University, 2 Westlake University, 3 Hong Kong Polytechnic University, 4 King's College London, 5 University College London, 6 Southern University of Science and Technology.
Pseudocode | No | The paper describes its methods mathematically and textually with equations (e.g., Equations 1, 5, 6, 8, and 10) and figures, but contains no formal pseudocode block or algorithm.
Open Source Code | No | "Code is available at Feature Alignment." This statement is ambiguous and does not provide a direct link or a specific repository name for the code.
Open Datasets | Yes | "…dataset for initial instruction tuning. This establishes a baseline conversational capability and ensures that all our methods are compared on a consistent SFT model. Subsequently, we employ the UltraFeedback (Cui et al., 2024) dataset to align the SFT models using various methods. We evaluate our models on three widely-used open-ended instruction-following benchmarks: MT-Bench (Zheng et al., 2023a), AlpacaEval 2 (Li et al., 2023; Dubois et al., 2024), and Arena-Hard (Li et al., 2024; Chiang et al., 2024)."
Dataset Splits | No | The paper uses the "Ultrachat-200K" and "UltraFeedback" datasets for training and alignment, but it does not specify explicit training/validation/test splits (e.g., percentages or sample counts) for its experimental setup. While it references standard protocols for the evaluation benchmarks, it does not provide the details needed to reproduce the training-phase data partitioning.
Hardware Specification | Yes | Table 5 ("Hyperparameters for Gemma-2-2b and Gemma-2-9b") lists GPU(s): 4 × H100.
Software Dependencies | No | The paper mentions the Adam (Kingma, 2014) and RMSProp (Graves, 2013) optimizers but does not specify version numbers for these or for other software dependencies (e.g., Python, PyTorch/TensorFlow, CUDA).
Experiment Setup | Yes | "For the hyperparameters related to alignment methods, such as α and β, we initially refer to the hyperparameter settings from the corresponding papers. If these settings are explicitly provided, we directly adopt their configurations. For configurations that are not given, we perform a hyperparameter search to determine the optimal values. Regarding the training hyperparameters, we standardize the batch size to 32, set the learning rate to 5 × 10⁻⁷, use a warm-up period of 150 steps (after which the learning rate remains constant), and set the number of epochs to 1. We employ the Adam (Kingma, 2014) and RMSProp (Graves, 2013) optimizers for Gemma-2-2B and Gemma-2-9B, respectively." Table 5 lists the hyperparameters for Gemma-2-2b and Gemma-2-9b, including α, β, γ, learning rate, optimizer, warmup steps, activation checkpointing, and SAE width.
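The training hyperparameters reported above can be collected into a single configuration sketch. This is a hypothetical illustration for a reproduction attempt: the paper publishes no config file, and the dictionary keys and helper function below are assumptions, not the authors' code.

```python
# Hypothetical config capturing the training hyperparameters reported in the paper.
# Key names and structure are illustrative assumptions; only the values come from the text.
TRAIN_CONFIG = {
    "batch_size": 32,
    "learning_rate": 5e-7,   # constant after warm-up
    "warmup_steps": 150,
    "epochs": 1,
    "optimizer": {
        "gemma-2-2b": "adam",     # Adam (Kingma, 2014)
        "gemma-2-9b": "rmsprop",  # RMSProp (Graves, 2013)
    },
    "gpus": "4 x H100",
}


def optimizer_for(model_name: str) -> str:
    """Return the optimizer name the paper reports for a given base model."""
    return TRAIN_CONFIG["optimizer"][model_name.lower()]
```

A reproduction script could read this dict to build its optimizer and scheduler, keeping the per-model optimizer choice (Adam for the 2B model, RMSProp for the 9B model) in one place.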