Controlling Large Language Model with Latent Action
Authors: Chengxing Jia, Ziniu Li, Pengyuan Wang, Yi-Chen Li, Zhenyu Hou, Yuxiao Dong, Yang Yu
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that, compared to RL with token-level actions, CoLA's latent actions enable greater semantic diversity. For enhancing downstream tasks, we show that CoLA with RL achieves a score of 42.4 on the math500 benchmark, surpassing the baseline score of 38.2, and reaches 68.2 when augmented with a Monte Carlo Tree Search variant. Furthermore, CoLA with RL consistently improves performance on agent-based tasks without degrading the pre-trained LLM's capabilities, unlike the baseline. Finally, CoLA reduces computation time by half in tasks involving enhanced thinking prompts for LLMs via RL. These results highlight CoLA's potential to advance RL-based adaptation of LLMs for downstream applications. |
| Researcher Affiliation | Academia | 1National Key Laboratory for Novel Software Technology, School of Artificial Intelligence, Nanjing University, Nanjing, China 2The Chinese University of Hong Kong, Shenzhen, China 3Tsinghua University, Beijing, China 4Pazhou Laboratory (Huangpu), Guangzhou, China. Correspondence to: Yang Yu <EMAIL>. |
| Pseudocode | Yes | We summarize the training algorithm of pre-training in Algorithm 1, the post-training in Algorithm 2, and the RLHF process in Algorithm 4. |
| Open Source Code | Yes | The CoLA model is available at https://huggingface.co/LAMDA-RL/Llama-3.1-CoLA-10B. Code for training CoLA will be available at https://github.com/LAMDA-RL/CoLA. |
| Open Datasets | Yes | We select several open-source datasets, including SlimPajama (Cerebras, 2023), StarCoder (Li et al., 2023a), Proof-Pile-2 (Azerbayev et al., 2024), and WuDao (Yuan et al., 2021), covering general knowledge, code, mathematics, and Chinese-English bilingual content, totaling 1.1T tokens. We tune the model on the NuminaMath dataset. We compare training the language world model with policy (FTA-P) against the baseline (Llama3.1-8B SFT on the same dataset) on several benchmarks, including math500 (Hendrycks et al., 2021), gsm8k (Cobbe et al., 2021), AIME, and DROP (Dua et al., 2019), where the first three are mathematical reasoning tasks and the fourth is a general reasoning task. The prompts are related to MATH and collected from PRM800k (Lightman et al., 2024). |
| Dataset Splits | No | Due to resource constraints, we train only the inverse dynamics model and the world model on 200G randomly selected tokens from this dataset, with 100G of these tokens used for training the policy model to validate the effectiveness of CoLA. For evaluation, we utilize Nd = 100 sequences with a length of 2048 to compute the prediction loss, semantic diversity, and KL divergence. When computing generation semantic diversity, we take a prefix of each sequence for generation; the prefix length is set to 256. Given a prompt set {x1:p} from the math training dataset, we utilize the CoLA model after FTA-P to generate Nr responses {xp+1:T} by sampling an action sequence a_{p:T-1} for each prompt, and label the responses with reward {r} by the Qwen-2.5-Math-72B model. For agentic RL, we chose the validation task set of each environment to perform RL since the training set is too large. |
| Hardware Specification | No | The authors would like to thank Zhipu AI for sponsoring the computation resources used in this work. This work was done while C. Jia interned at Zhipu AI. Due to resource constraints, we train only the inverse dynamics model and the world model on 200G randomly selected tokens from this dataset, with 100G of these tokens used for training the policy model to validate the effectiveness of CoLA. |
| Software Dependencies | No | The paper mentions using Llama-3.1-8B as the base model and techniques like Double-DQN and various RL algorithms, but does not specify software versions for libraries or frameworks like PyTorch, TensorFlow, etc. |
| Experiment Setup | Yes | For the pre-training hyperparameters, we adopt a learning rate of 1e-4, a global batch size of 512, a micro batch size of 4, a maximum sequence length of 2048, and a maximum gradient norm of 1.0 for the inverse dynamics model, the language world model, and policy pre-training. For inverse dynamics model and language world model training, we adopt a regularization loss whose coefficient β is set to 0.001. For the post-training hyperparameters, for SFT and FTA-I in preference tasks we use a learning rate of 5e-6, 1 training epoch, a global batch size of 256, and a micro batch size of 4. For MCTS, the multi-token search length k is 64, the max repeating number Nmc is 64, and the UCT coefficient c is 0.7. For MCTS-Q, the threshold is set to 0.01. For Q-function learning, the learning rate is 5e-6, the number of learning epochs is 100, the number of generated responses Nr is 8, the update interval for the target Q is 100, τ is 1.0, the global batch size is 256, and the micro batch size is 2. |
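The hyperparameters listed in the Experiment Setup row can be collected into a single configuration sketch. This is a minimal illustration, not the paper's released training code; the dictionary keys and the `accum_steps` helper are hypothetical names chosen here for readability.

```python
# Hyperparameters as reported in the paper's experiment setup.
# Key names are illustrative; the released CoLA code may differ.

PRETRAIN = {
    "learning_rate": 1e-4,
    "global_batch_size": 512,
    "micro_batch_size": 4,
    "max_seq_len": 2048,
    "max_grad_norm": 1.0,
    "reg_beta": 0.001,  # regularization coefficient for IDM / world model
}

SFT_FTA_I = {
    "learning_rate": 5e-6,
    "epochs": 1,
    "global_batch_size": 256,
    "micro_batch_size": 4,
}

MCTS = {
    "multi_token_search_len_k": 64,
    "max_repeat_Nmc": 64,
    "uct_c": 0.7,
    "mcts_q_threshold": 0.01,  # MCTS-Q variant
}

Q_LEARNING = {
    "learning_rate": 5e-6,
    "epochs": 100,
    "num_responses_Nr": 8,
    "target_update_interval": 100,
    "tau": 1.0,
    "global_batch_size": 256,
    "micro_batch_size": 2,
}

# Gradient-accumulation steps implied by global vs. micro batch size
# (assuming a single device; multi-GPU setups divide by world size).
accum_steps = PRETRAIN["global_batch_size"] // PRETRAIN["micro_batch_size"]
print(accum_steps)  # 128
```

Laying the values out this way makes the global/micro batch-size relationship explicit, which the prose states only implicitly.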