Q-SFT: Q-Learning for Language Models via Supervised Fine-Tuning
Authors: Joey Hong, Anca Dragan, Sergey Levine
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirically, we evaluate our method on both pretrained LLMs and VLMs, on a variety of tasks including both natural language dialogue and robotic manipulation and navigation from images. |
| Researcher Affiliation | Academia | Joey Hong, Anca Dragan, Sergey Levine — University of California, Berkeley |
| Pseudocode | Yes | Algorithm 1 Q-learning via Supervised Fine-tuning (Q-SFT). Require: Dataset D = {(s_i, a_i, r_i, s'_i)}, i ∈ [N], hyperparameter β > 0. 1: Initialize ϕ, θ, θ̄ from pretrained model. 2: Optimize behavior policy: 3: for each gradient step do 4: Update ϕ ← ϕ − λ_ϕ ∇_ϕ L_CE(ϕ) 5: end for 6: Optimize likelihood model: 7: for each gradient step do 8: Update θ ← θ − λ_θ ∇_θ L_WCE(θ) 9: Update target parameters: θ̄ ← (1 − α) θ̄ + α θ 10: end for 11: At inference time, policy probabilities become: π̂(a \| s) ∝ π_ϕ(a \| s) exp(β p_θ(a \| s)) |
| Open Source Code | No | The paper does not provide an explicit statement about releasing source code, a link to a repository, or mention of code in supplementary materials. |
| Open Datasets | Yes | The first set of tasks include language games from the LMRL benchmark (Abdulhai et al., 2023), which is one of the first benchmarks tailored at evaluating offline RL for language generation. ... The dataset consists of 20K trajectories by a suboptimal heuristic policy that achieves an average return of -4.12, originally collected by Snell et al. (2022). ... ALFWorld. This is a popular text-based environment grounded in image observations (Shridhar et al., 2021). ... Robotic manipulation. We consider the large-scale robotic manipulation control tasks from Singh et al. (2020). |
| Dataset Splits | Yes | The benchmark consists of 12k initial user instructions, of which we randomly held out 100 for evaluation. With the remaining instructions, we generate a dataset of trajectories where we simulate a suboptimal agent by prompting GPT3.5 with few-shot examples, following the prompts used by Yao et al. (2022). |
| Hardware Specification | Yes | All algorithms were trained on a single TPUv3 on Google Cloud until convergence. |
| Software Dependencies | No | The paper mentions models like GPT2-medium, LLaVA-1.6, RT-1, GPT3.5, GPT4-V, and Stockfish 15.1, but does not provide specific version numbers for software libraries or development environments (e.g., Python, PyTorch, CUDA versions) used for their implementation. |
| Experiment Setup | Yes | We use the hyperparameters reported in Table 4. All algorithms were trained on a single TPUv3 on Google Cloud until convergence. Table 4 (hyperparameters used during training Q-SFT, listed for Chess / Wordle / 20Q / Web Shop / ALFWorld): β = 8.0 / 4.0 / 1.0 / 1.0 / 1.0; discount factor γ = 0.99 / 0.99 / 0.95 / 0.9 / 0.95; batch size = 128 for all tasks; target network update α = 0.005 / 0.005 / 0.005 / 0.01 / 0.01; updates per iteration = 60 / 60 / 60 / 50 / 50; iterations = 100 / 100 / 100 / 200 / 200; learning rate λϕ = 1e-4 / 1e-4 / 1e-4 / 2e-4 / 3e-4; learning rate λθ = 1e-4 / 1e-4 / 1e-4 / 2e-4 / 1e-4. |
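The pseudocode quoted in the Pseudocode row combines two learned models at inference time and maintains a target network via an exponential moving average. A minimal sketch of those two pieces is below; since the paper releases no code, all function and variable names here are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def qsft_policy(pi_phi: np.ndarray, p_theta: np.ndarray, beta: float) -> np.ndarray:
    """Q-SFT inference-time policy (Algorithm 1, line 11).

    Combines behavior-policy probabilities pi_phi(a|s) with learned
    token likelihoods p_theta(a|s), both arrays over the action
    vocabulary: pi_hat(a|s) proportional to pi_phi(a|s) * exp(beta * p_theta(a|s)).
    """
    unnormalized = pi_phi * np.exp(beta * p_theta)
    return unnormalized / unnormalized.sum()

def ema_target_update(theta_bar: np.ndarray, theta: np.ndarray, alpha: float) -> np.ndarray:
    """Target-parameter update (Algorithm 1, line 9):
    theta_bar <- (1 - alpha) * theta_bar + alpha * theta."""
    return (1.0 - alpha) * theta_bar + alpha * theta

# Illustrative usage over a 3-action vocabulary:
pi_phi = np.array([0.5, 0.3, 0.2])    # behavior policy probabilities
p_theta = np.array([0.1, 0.8, 0.1])   # learned likelihoods (Q-value proxies)
pi_hat = qsft_policy(pi_phi, p_theta, beta=4.0)  # beta as used for Wordle in Table 4
```

With β > 0, actions whose likelihood head assigns high p_θ are upweighted relative to the behavior policy, while β → 0 recovers plain behavior cloning; the α = 0.005–0.01 values in Table 4 correspond to a slowly tracking target network.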