Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
Authors: Joey Hong, Anca Dragan, Sergey Levine
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate our method on both pretrained LLMs and VLMs, on a variety of tasks including both natural language dialogue and robotic manipulation and navigation from images. |
| Researcher Affiliation | Academia | Joey Hong, Anca Dragan, Sergey Levine — University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1 Q-learning via Supervised Fine-tuning (Q-SFT). Require: Dataset D = {(s_i, a_i, r_i, s'_i)}, i ∈ [N], hyperparameter β > 0. 1: Initialize ϕ, θ, θ̄ from pretrained model. 2: Optimize behavior policy: 3: for each gradient step do 4: Update ϕ ← ϕ − λ_ϕ ∇_ϕ L_CE(ϕ) 5: end for 6: Optimize likelihood model: 7: for each gradient step do 8: Update θ ← θ − λ_θ ∇_θ L_WCE(θ) 9: Update target parameters: θ̄ ← (1 − α) θ̄ + α θ 10: end for 11: At inference time, policy probabilities become: π̂(a \| s) ∝ π_ϕ(a \| s) exp(β p_θ(a \| s)) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, a link to a repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | The first set of tasks include language games from the LMRL benchmark (Abdulhai et al., 2023), which is one of the first benchmarks tailored at evaluating offline RL for language generation. ... The dataset consists of 20K trajectories by a suboptimal heuristic policy that achieves an average return of -4.12, originally collected by Snell et al. (2022). ... ALFWorld. This is a popular text-based environment grounded in image observations (Shridhar et al., 2021). ... Robotic manipulation. We consider the large-scale robotic manipulation control tasks from Singh et al. (2020). |
| Dataset Splits | Yes | The benchmark consists of 12k initial user instructions, of which we randomly held out 100 for evaluation. With the remaining instructions, we generate a dataset of trajectories where we simulate a suboptimal agent by prompting GPT3.5 with few-shot examples, following the prompts used by Yao et al. (2022). |
| Hardware Specification | Yes | All algorithms were trained on a single TPUv3 on Google Cloud until convergence. |
| Software Dependencies | No | The paper mentions models like GPT2-medium, LLaVA-1.6, RT-1, GPT3.5, GPT4-V, and Stockfish 15.1, but does not provide specific version numbers for software libraries or development environments (e.g., Python, PyTorch, CUDA versions) used for their implementation. |
| Experiment Setup | Yes | We use the hyperparameters reported in Table 4. All algorithms were trained on a single TPUv3 on Google Cloud until convergence. Table 4 (hyperparameters used during training Q-SFT, listed for Chess / Wordle / 20Q / Web Shop / ALFWorld): β = 8.0 / 4.0 / 1.0 / 1.0 / 1.0; discount factor γ = 0.99 / 0.99 / 0.95 / 0.9 / 0.95; batch size = 128 for all tasks; target network update α = 0.005 / 0.005 / 0.005 / 0.01 / 0.01; updates per iteration = 60 / 60 / 60 / 50 / 50; iterations = 100 / 100 / 100 / 200 / 200; learning rate λϕ = 1e-4 / 1e-4 / 1e-4 / 2e-4 / 3e-4; learning rate λθ = 1e-4 / 1e-4 / 1e-4 / 2e-4 / 1e-4. |
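The pseudocode quoted in the Pseudocode row combines two learned models at inference time and maintains a target network via an exponential moving average. A minimal sketch of those two pieces is below; since the paper releases no code, all function and variable names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def qsft_policy(pi_phi: np.ndarray, p_theta: np.ndarray, beta: float) -> np.ndarray:
    """Q-SFT inference-time policy (Algorithm 1, line 11).

    Combines behavior-policy probabilities pi_phi(a|s) with learned
    token likelihoods p_theta(a|s), both arrays over the action
    vocabulary: pi_hat(a|s) proportional to pi_phi(a|s) * exp(beta * p_theta(a|s)).
    """
    unnormalized = pi_phi * np.exp(beta * p_theta)
    return unnormalized / unnormalized.sum()

def ema_target_update(theta_bar: np.ndarray, theta: np.ndarray, alpha: float) -> np.ndarray:
    """Target-parameter update (Algorithm 1, line 9):
    theta_bar <- (1 - alpha) * theta_bar + alpha * theta."""
    return (1.0 - alpha) * theta_bar + alpha * theta

# Illustrative usage over a 3-action vocabulary:
pi_phi = np.array([0.5, 0.3, 0.2])    # behavior policy probabilities
p_theta = np.array([0.1, 0.8, 0.1])   # learned likelihoods (Q-value proxies)
pi_hat = qsft_policy(pi_phi, p_theta, beta=4.0)  # beta as used for Wordle in Table 4
```

With β > 0, actions whose likelihood head assigns high p_θ are upweighted relative to the behavior policy, while β → 0 recovers plain behavior cloning; the α = 0.005–0.01 values in Table 4 correspond to a slowly tracking target network.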