Learning Distribution-wise Control in Representation Space for Language Models
Authors: Chunyuan Deng, Ruidi Chang, Hanjie Chen
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct comprehensive experiments, spanning eight commonsense reasoning benchmarks and seven mathematical reasoning benchmarks. We test performance on Llama-family models (Touvron et al., 2023b; Dubey et al., 2024) under both layer-wise and all-layer configurations. In our layer-wise experiments, we observed an intriguing performance gain: replacing deterministic nodes with stochastic counterparts in early layers significantly improved model performance, yielding gains of +4% to +6%. |
| Researcher Affiliation | Academia | 1Department of Computer Science, Rice University. Correspondence to: Chunyuan Deng <EMAIL>, Hanjie Chen <EMAIL>. |
| Pseudocode | No | The paper defines mathematical equations for intervention methods but does not provide a clearly labeled pseudocode block or algorithm. |
| Open Source Code | Yes | The code is at: https://github.com/chili-lab/D-Intervention. |
| Open Datasets | Yes | For commonsense reasoning, we have BoolQ (Clark et al., 2019), PIQA (Bisk et al., 2020), SIQA (Sap et al., 2019), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2020), ARC-e, ARC-c (Clark et al., 2018) and OBQA (Mihaylov et al., 2018). For arithmetic reasoning, we have AddSub (Hosseini et al., 2014), SingleEq (Koncel-Kedziorski et al., 2015), MultiArith (Roy & Roth, 2015), AQuA (Ling et al., 2017), GSM8K (Cobbe et al., 2021), MAWPS (Koncel-Kedziorski et al., 2016), and SVAMP (Patel et al., 2021). |
| Dataset Splits | No | For the commonsense reasoning benchmark, we train the model using the Commonsense170K dataset. For arithmetic reasoning benchmarks, we use the Math10K dataset. These datasets are combined training sets from their original benchmarks. We use a portion of the training set from GSM8K as a development set to tune the best hyperparameters and apply this set of hyperparameters to report the test scores. |
| Hardware Specification | Yes | We conducted all experiments using a single NVIDIA RTX A6000 GPU with mixed precision (bfloat16) enabled. |
| Software Dependencies | No | Generally, we follow the standard setup of previous SOTA methods like ReFT (Wu et al., 2024b), and our codebase is built on pyvene (Wu et al., 2024c). |
| Experiment Setup | Yes | Key parameters include the intervention layer (l), noise scale (ϵ), subspace rank (r), intervention position (p), batch size (bs), training epochs (e), and learning rate (lr). These parameters are tuned on the development set, but an ablation study is not included in the main text. Detailed values are provided in Appendix B. |
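To make the parameters in the setup row concrete, the sketch below shows what a low-rank representation intervention with a stochastic (noise-injecting) component might look like. This is an illustrative assumption, not the paper's released implementation: the function name, the ReFT-style edit form `h + R^T((Wh + b) - Rh)`, and the way the noise scale `eps` (the paper's ε) and subspace rank `r` enter are all hypothetical, reconstructed only from the hyperparameters the paper lists.

```python
import numpy as np

def stochastic_lowrank_intervention(h, R, W, b, eps, rng):
    """Hypothetical ReFT-style low-rank edit with Gaussian noise.

    h   : (d,) hidden representation at the intervened layer/position
    R   : (r, d) projection with orthonormal rows spanning the rank-r subspace
    W, b: (r, d), (r,) learned map defining the target value in the subspace
    eps : noise scale (the paper's ε); eps = 0 recovers a deterministic edit
    rng : numpy Generator used to sample the subspace noise
    """
    z = rng.standard_normal(R.shape[0])       # noise sampled in the r-dim subspace
    delta = (W @ h + b) - (R @ h) + eps * z   # desired subspace shift, perturbed
    return h + R.T @ delta                    # write the edit back along R

# Toy usage with illustrative dimensions (d = hidden size, r = subspace rank).
rng = np.random.default_rng(0)
d, r = 16, 2
h = rng.standard_normal(d)
R = np.linalg.qr(rng.standard_normal((d, r)))[0].T  # orthonormal rows
W = 0.1 * rng.standard_normal((r, d))
b = np.zeros(r)

h_det = stochastic_lowrank_intervention(h, R, W, b, eps=0.0, rng=np.random.default_rng(1))
h_sto = stochastic_lowrank_intervention(h, R, W, b, eps=0.5, rng=np.random.default_rng(1))
```

With `eps = 0` the edit is deterministic; a nonzero `eps` turns the node stochastic, which is the kind of replacement the layer-wise experiments in the table credit with the +4% to +6% gains.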