Teaching Language Models to Critique via Reinforcement Learning
Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluations on diverse benchmarks including CodeContests (Li et al., 2022), LiveCodeBench (Jain et al., 2024), MBPP+ (Liu et al., 2024a), and JudgeBench (Tan et al., 2024), we demonstrate that training with CTRL significantly outperforms both self-critique approaches and methods using stronger critic models. Our empirical analysis (Figure 4) demonstrates that this efficiency stems from reduced error compounding: the critic effectively identifies and corrects mistakes early, guiding the model toward more direct solution paths without compromising solution quality. |
| Researcher Affiliation | Collaboration | *Equal contribution. ¹The University of Hong Kong, ²ByteDance Seed. Correspondence to: Zhihui Xie <EMAIL>. |
| Pseudocode | Yes | `def resultsArray(self, queries: List[List[int]], k: int) -> List[int]:`<br>`    min_heap = []`<br>`    results = []`<br>`    for x, y in queries:`<br>`        distance = abs(x) + abs(y)`<br>`        heapq.heappush(min_heap, distance)`<br>`        if len(min_heap) >= k:`<br>`            results.append(min_heap[k-1])`<br>`        else:`<br>`            results.append(-1)`<br>`    return results` |
| Open Source Code | Yes | https://critic-rl.github.io |
| Open Datasets | Yes | Training Data. We use TACO (Li et al., 2023), a dataset containing 26,443 programming problems collected from competitive programming platforms like Codeforces and LeetCode. Benchmarks. We evaluate our approach on three programming benchmarks and one general-domain benchmark: (1) CodeContests (Li et al., 2022)... (2) LiveCodeBench (24.08-24.11) (Jain et al., 2024)... (3) MBPP+ (Liu et al., 2024a)... and (4) JudgeBench (Tan et al., 2024) |
| Dataset Splits | No | The paper mentions filtering the TACO dataset to 18,820 problems for training and evaluating on various benchmarks, but it does not specify the exact training/validation/test splits (e.g., percentages or absolute counts) used for its own experimental setup from these datasets. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using "veRL (Sheng et al., 2024) as the codebase" but does not specify its version or any other software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 8 (SFT hyperparameters): learning rate 2e-5, cosine learning-rate schedule, training batch size 256, maximum token length 2,048, 1 training epoch, bfloat16 mixed precision. Table 9 (RL hyperparameters): training batch size 1,024, mini-batch size 256, group size 8, learning rate 1e-5, KL coefficient 0.001, maximum prompt length 1,536, maximum response length 768, temperature 1.0, 2 training epochs. |
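The snippet quoted under "Pseudocode" is not runnable as extracted, and its use of `min_heap[k-1]` does not return the k-th smallest distance: a binary heap only orders elements along parent-child paths, so index `k-1` of the backing list is not the k-th smallest value. Below is a minimal corrected sketch (the standalone function name `results_array` and the use of `heapq.nsmallest` are our choices, not from the paper):

```python
import heapq
from typing import List

def results_array(queries: List[List[int]], k: int) -> List[int]:
    """For each query point (x, y), record its Manhattan distance from the
    origin and report the k-th smallest distance seen so far, or -1 if
    fewer than k points have been seen."""
    min_heap: List[int] = []
    results: List[int] = []
    for x, y in queries:
        heapq.heappush(min_heap, abs(x) + abs(y))
        if len(min_heap) >= k:
            # heapq.nsmallest returns the k smallest elements in sorted
            # order, so its last entry is the true k-th smallest; the
            # quoted snippet's min_heap[k-1] does not guarantee this.
            results.append(heapq.nsmallest(k, min_heap)[-1])
        else:
            results.append(-1)
    return results

print(results_array([[1, 2], [3, 4], [2, 3], [-3, 0]], 2))  # → [-1, 7, 5, 3]
```

For large query streams, a max-heap capped at size k would avoid the O(n log n) cost of `nsmallest` per query; the version above stays closest to the structure of the quoted code.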