Teaching Language Models to Critique via Reinforcement Learning
Authors: Zhihui Xie, Jie Chen, Liyu Chen, Weichao Mao, Jingjing Xu, Lingpeng Kong
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through extensive evaluations on diverse benchmarks including CodeContests (Li et al., 2022), LiveCodeBench (Jain et al., 2024), MBPP+ (Liu et al., 2024a), and JudgeBench (Tan et al., 2024), we demonstrate that training with CTRL significantly outperforms both self-critique approaches and methods using stronger critic models. Our empirical analysis (Figure 4) demonstrates that this efficiency stems from reduced error compounding: the critic effectively identifies and corrects mistakes early, guiding the model toward more direct solution paths without compromising solution quality. |
| Researcher Affiliation | Collaboration | *Equal contribution. ¹The University of Hong Kong, ²ByteDance Seed. Correspondence to: Zhihui Xie <EMAIL>. |
| Pseudocode | Yes | `def resultsArray(self, queries: List[List[int]], k: int) -> List[int]:`<br>`    min_heap = []`<br>`    results = []`<br>`    for x, y in queries:`<br>`        distance = abs(x) + abs(y)`<br>`        heapq.heappush(min_heap, distance)`<br>`        if len(min_heap) >= k:`<br>`            results.append(min_heap[k-1])`<br>`        else:`<br>`            results.append(-1)`<br>`    return results` |
| Open Source Code | Yes | https://critic-rl.github.io |
| Open Datasets | Yes | Training Data. We use TACO (Li et al., 2023), a dataset containing 26,443 programming problems collected from competitive programming platforms like Codeforces and LeetCode. Benchmarks. We evaluate our approach on three programming benchmarks and one general-domain benchmark: (1) CodeContests (Li et al., 2022)... (2) LiveCodeBench (24.08-24.11) (Jain et al., 2024)... (3) MBPP+ (Liu et al., 2024a)... and (4) JudgeBench (Tan et al., 2024) |
| Dataset Splits | No | The paper mentions filtering the TACO dataset to 18,820 problems for training and evaluating on various benchmarks, but it does not specify the exact training/validation/test splits (e.g., percentages or absolute counts) used for its own experimental setup from these datasets. |
| Hardware Specification | No | The paper does not provide any specific hardware details such as GPU models, CPU types, or memory specifications used for running experiments. |
| Software Dependencies | No | The paper mentions using "veRL (Sheng et al., 2024) as the codebase" but does not specify its version or any other software dependencies with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | Table 8 (SFT hyperparameters): learning rate 2e-5, cosine learning-rate schedule, training batch size 256, maximum token length 2,048, 1 training epoch, bfloat16 mixed precision. Table 9 (RL hyperparameters): training batch size 1,024, mini-batch size 256, group size 8, learning rate 1e-5, KL coefficient 0.001, maximum prompt length 1,536, maximum response length 768, temperature 1.0, 2 training epochs. |
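The snippet quoted under "Pseudocode" is not runnable as extracted, and its use of `min_heap[k-1]` does not return the k-th smallest distance: a binary heap only orders elements along parent-child paths, so index `k-1` of the backing list is not the k-th smallest value. Below is a minimal corrected sketch (the standalone function name `results_array` and the use of `heapq.nsmallest` are our choices, not from the paper):

```python
import heapq
from typing import List

def results_array(queries: List[List[int]], k: int) -> List[int]:
    """For each query point (x, y), record its Manhattan distance from the
    origin and report the k-th smallest distance seen so far, or -1 if
    fewer than k points have been seen."""
    min_heap: List[int] = []
    results: List[int] = []
    for x, y in queries:
        heapq.heappush(min_heap, abs(x) + abs(y))
        if len(min_heap) >= k:
            # heapq.nsmallest returns the k smallest elements in sorted
            # order, so its last entry is the true k-th smallest; the
            # quoted snippet's min_heap[k-1] does not guarantee this.
            results.append(heapq.nsmallest(k, min_heap)[-1])
        else:
            results.append(-1)
    return results

print(results_array([[1, 2], [3, 4], [2, 3], [-3, 0]], 2))  # → [-1, 7, 5, 3]
```

For large query streams, a max-heap capped at size k would avoid the O(n log n) cost of `nsmallest` per query; the version above stays closest to the structure of the quoted code.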