C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

Authors: Guoxin Chen, Minpeng Liao, Peiying Yu, Dingmin Wang, Zile Qiao, Chao Yang, Xin Zhao, Kai Fan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities. Code is available at https://github.com/Chen-GX/C-3PO.
Researcher Affiliation | Collaboration | 1Gaoling School of Artificial Intelligence, Renmin University of China; 2Tongyi Lab; 3Soochow University; 4University of Oxford; 5Tsinghua University. Correspondence to: Guoxin Chen <EMAIL>, Minpeng Liao <EMAIL>, Wayne Xin Zhao <EMAIL>, Kai Fan <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Inference Process of C-3PO). Input: question q, the retrieval server (Retriever), the LLM server (LLM), the proxy model π, and instructions for the different agents (Reasoning Router, Information Filter, and Decision Maker). Output: the answer. Step 1: a1 ← π(q, instruction1) {Reasoning Router agent}
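The quoted Algorithm 1 excerpt can be sketched as a plain Python loop body. This is an illustrative reconstruction only: the proxy, retriever, and LLM are passed in as callables, and every prompt string and helper name here is an assumption, not the authors' actual API.

```python
# Hypothetical sketch of C-3PO's multi-agent inference, following the
# Algorithm 1 excerpt quoted above. All prompt templates and function
# names are illustrative assumptions.

ROUTER_PROMPT = "Decide whether retrieval is needed for: {q}"
FILTER_PROMPT = "Keep only the passages relevant to: {q}\n{docs}"
ANSWER_PROMPT = "Answer the question: {q}\nContext: {ctx}"
DECIDE_PROMPT = "Is this answer to '{q}' sufficient? Answer: {a}"

def c3po_inference(q, proxy, retriever, llm):
    """Route -> (optionally) retrieve + filter -> generate -> decide."""
    # Step 1: the Reasoning Router agent decides whether to retrieve.
    route = proxy(ROUTER_PROMPT.format(q=q))
    if route == "no_retrieval":
        # Easy questions go straight to the LLM without context.
        return llm(ANSWER_PROMPT.format(q=q, ctx=""))
    # Step 2: retrieve, then the Information Filter agent (the same
    # proxy under a different instruction) prunes the passages.
    docs = retriever(q)
    ctx = proxy(FILTER_PROMPT.format(q=q, docs="\n".join(docs)))
    # Step 3: the fixed LLM server answers from the filtered context.
    ans = llm(ANSWER_PROMPT.format(q=q, ctx=ctx))
    # Step 4: the Decision Maker agent checks the answer; a full
    # implementation would loop back to retrieval if it says to continue.
    proxy(DECIDE_PROMPT.format(q=q, a=ans))
    return ans

# Toy stand-ins so the sketch runs end to end.
def toy_proxy(prompt):
    if prompt.startswith("Decide"):
        return "retrieve"
    if prompt.startswith("Is this answer"):
        return "finish"
    return "Paris is the capital of France."

def toy_retriever(q):
    return ["Paris is the capital of France.", "Berlin is in Germany."]

def toy_llm(prompt):
    return "Paris"

print(c3po_inference("What is the capital of France?",
                     toy_proxy, toy_retriever, toy_llm))
```

The key design point the pseudocode reflects is that a single lightweight proxy plays all three agent roles, switching behavior via the instruction it is given, while the large LLM stays fixed.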
Open Source Code | Yes | Code is available at https://github.com/Chen-GX/C-3PO.
Open Datasets | Yes | To comprehensively evaluate our C-3PO, we experiment on both single-hop datasets including Natural Questions (NQ) (Kwiatkowski et al., 2019), PopQA (Mallen et al., 2023), and TriviaQA (TQA) (Joshi et al., 2017), as well as multi-hop datasets including 2WikiMultiHopQA (2Wiki) (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and HotpotQA (HQA) (Yang et al., 2018).
Dataset Splits | Yes | For each dataset, we only use 6000 randomly sampled questions instead of the full training set. ... For the in-domain test sets, we randomly sampled 1,000 instances as the test set.
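The split protocol above (6,000 sampled training questions, 1,000 sampled in-domain test instances per dataset) can be sketched in a few lines. The sizes come from the quoted text; the helper name, seed, and disjoint-sampling choice are our assumptions.

```python
import random

def make_splits(questions, n_train=6000, n_test=1000, seed=0):
    """Sample disjoint train/test subsets, per the split sizes quoted
    above. Helper name, seed, and disjointness are illustrative choices."""
    rng = random.Random(seed)
    pool = list(questions)
    rng.shuffle(pool)  # shuffle once, then carve off the two subsets
    return pool[:n_train], pool[n_train:n_train + n_test]

train, test = make_splits(range(10_000))
print(len(train), len(test))
```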
Hardware Specification | No | Our implementation supports two high-performance inference engines: SGLang and vLLM, allowing users to optimize for different deployment scenarios and hardware configurations. ... We utilize Qwen2-72B-Instruct (Yang et al., 2024a) as the fixed LLM server, while Qwen2-0.5B or Qwen2-1.5B is trained as the candidate lightweight proxy for efficient edge deployment. The paper does not specify concrete hardware details such as GPU or CPU models.
Software Dependencies | No | We utilize Llama-Factory (Zheng et al., 2024) as our training framework for the initial supervised fine-tuning phase. ... For the RL training phase, we adopt OpenRLHF (Hu et al., 2024) as our primary training framework, coupled with the vLLM (Kwon et al., 2023) inference engine. ... We integrate SGLang as our LLM server, which provides compatibility with various state-of-the-art language models, including Qwen2-72B-Instruct (Yang et al., 2024a) and Llama3.3-70B-Instruct (Dubey et al., 2024). ... We employ contriever-msmarco (Izacard et al., 2022) as our dense retriever. While specific software components are named, no version numbers for these components or any other underlying software (e.g., Python, PyTorch) are provided.
Experiment Setup | Yes | Table 6 (key hyperparameters in the supervised warm-up phase): Learning Rate 4e-5; Batch size 512; #Epochs 3; Optimizer type AdamW (Loshchilov & Hutter, 2019); ... Table 7 (key hyperparameters in the RL phase): Learning Rate of Policy model 5e-7; Learning Rate of Value model 5e-6; Batch size 1024; KL Coefficient 0.005; Optimizer type Adam; ...
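For quick reference, the hyperparameters quoted from Tables 6 and 7 can be collected into plain config dictionaries. The values are the ones reported above; the key names are our illustrative choices, not the authors' config schema.

```python
# Hyperparameters from Tables 6 and 7 as quoted above; key names are
# illustrative assumptions, values are as reported.

SFT_WARMUP = {
    "learning_rate": 4e-5,
    "batch_size": 512,
    "epochs": 3,
    "optimizer": "AdamW",  # Loshchilov & Hutter, 2019
}

RL_PHASE = {
    "policy_learning_rate": 5e-7,
    "value_learning_rate": 5e-6,
    "batch_size": 1024,
    "kl_coefficient": 0.005,
    "optimizer": "Adam",
}

# Note the much smaller RL learning rates: the policy LR is 80x below
# the SFT LR, a common choice to keep RL updates close to the SFT model.
ratio = SFT_WARMUP["learning_rate"] / RL_PHASE["policy_learning_rate"]
print(ratio)
```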