SFT Memorizes, RL Generalizes: A Comparative Study of Foundation Model Post-training
Authors: Tianzhe Chu, Yuexiang Zhai, Jihan Yang, Shengbang Tong, Saining Xie, Dale Schuurmans, Quoc V Le, Sergey Levine, Yi Ma
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | This paper studies the comparative effect of SFT and RL on generalization and memorization, focusing on text-based and visual reasoning tasks. We introduce General Points, an arithmetic reasoning card game, and also consider V-IRL, a real-world navigation environment, to assess how models trained with SFT and RL generalize to unseen variants in both novel textual rules and visual domains. We show that RL, especially when trained with an outcome-based reward, generalizes in both the rule-based textual and visual environments. Figure 5: Success rate (%) vs. GFLOPs trendlines for RL and SFT on General Points and V-IRL. The top row shows in-distribution performance, while the bottom row shows out-of-distribution performance. Results are presented for both pure language (-L) and vision-language (-VL) variants of each task. |
| Researcher Affiliation | Collaboration | Equal contribution. HKU, UC Berkeley, Google DeepMind, NYU, University of Alberta. All experiments are conducted outside of Google. Project page: https://tianzhechu.com/SFTvsRL. Correspondence to: Tianzhe Chu <EMAIL>, Yuexiang Zhai <EMAIL>. |
| Pseudocode | No | The paper describes the methodologies and processes in prose and uses figures to illustrate concepts (e.g., Figure 2, 3), but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | Project page: https://tianzhechu.com/SFTvsRL. The text provides a project page URL, which is a general domain/project overview page, but it does not explicitly state that the source code for the methodology described in the paper is available there, nor does it provide a direct link to a code repository. |
| Open Datasets | Yes | Our first task is General Points, an original card game task similar to Points24 of RL4VLM (Zhai et al., 2024a), which is designed to evaluate a model's arithmetic reasoning capabilities. Second, we adopt V-IRL (Yang et al., 2024a), a real-world navigation task that focuses on the model's spatial reasoning capabilities. For visual out-of-distribution experiments, we directly adopt the VLN mini benchmark from Yang et al. (2024a). |
| Dataset Splits | Yes | Leveraging the data collection pipeline of Yang et al. (2024a), we construct a training database with 1000 unique routes from New York City. We evaluate all rule-variant experiments and visual in-distribution experiments using randomly sampled routes from this database. For visual out-of-distribution experiments, we directly adopt the VLN mini benchmark from Yang et al. (2024a). This benchmark consists of 18 distinct routes across nine cities: Milan, New Delhi, Buenos Aires, London, Hong Kong, New York, Melbourne, Lagos, and San Francisco, with two routes per city. |
| Hardware Specification | Yes | All training experiments are conducted on an 8× H800 machine (80GB). |
| Software Dependencies | No | The paper mentions Llama-3.2-Vision-11B as the backbone model and PPO as the backbone RL algorithm, but it does not specify any software libraries or frameworks with explicit version numbers (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For experiments fine-tuning all parameters, we search learning rates from {1×10⁻⁴, 1×10⁻⁵, 1×10⁻⁶, 5×10⁻⁷, 1×10⁻⁷}. Freezing the vision encoder, we search learning rates {1×10⁻⁶, 1×10⁻⁷}. Freezing the vision encoder and adapter, we search learning rates {1×10⁻⁶, 5×10⁻⁷, 1×10⁻⁷}. We conduct a search over learning rates {2×10⁻⁶, 1×10⁻⁶}, with the in-distribution success rate curves shown in Figure 17. All parameters are tunable in our RL experiments. |
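The learning-rate sweeps quoted above can be summarized as a small grid search per freezing configuration. The following is a minimal sketch: the grids and configuration names mirror the paper's description, but `best_lr`, the `evaluate` callback, and the toy scoring function are hypothetical stand-ins, not the authors' actual training code.

```python
# Learning-rate grids per fine-tuning configuration, as described in the paper.
SWEEPS = {
    "full_finetune": [1e-4, 1e-5, 1e-6, 5e-7, 1e-7],
    "freeze_vision_encoder": [1e-6, 1e-7],
    "freeze_encoder_and_adapter": [1e-6, 5e-7, 1e-7],
}


def best_lr(sweep_name, evaluate):
    """Return the learning rate with the highest score for a configuration.

    `evaluate` stands in for a full SFT/RL training run followed by an
    in-distribution success-rate evaluation.
    """
    return max(SWEEPS[sweep_name], key=evaluate)


# Toy usage: a stand-in score that peaks at 1e-6 (illustrative only).
choice = best_lr("freeze_vision_encoder", evaluate=lambda lr: -abs(lr - 1e-6))
```

In practice each `evaluate` call would launch a separate fine-tuning run, so the grid sizes above directly bound the number of training jobs per configuration.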