C-3PO: Compact Plug-and-Play Proxy Optimization to Achieve Human-like Retrieval-Augmented Generation

Authors: Guoxin Chen, Minpeng Liao, Peiying Yu, Dingmin Wang, Zile Qiao, Chao Yang, Xin Zhao, Kai Fan

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments in both in-domain and out-of-distribution scenarios demonstrate that C-3PO significantly enhances RAG performance while maintaining plug-and-play flexibility and superior generalization capabilities. Code is available at https://github.com/Chen-GX/C-3PO.
Researcher Affiliation | Collaboration | 1Gaoling School of Artificial Intelligence, Renmin University of China; 2Tongyi Lab; 3Soochow University; 4University of Oxford; 5Tsinghua University. Correspondence to: Guoxin Chen <EMAIL>, Minpeng Liao <EMAIL>, Wayne Xin Zhao <EMAIL>, Kai Fan <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Inference Process of C-3PO). Input: question q, the retrieval server (Retriever), the LLM server (LLM), the proxy model π, and instructions for the different agents (Reasoning Router, Information Filter, and Decision Maker). Output: the answer. Step 1: a1 ← π(q, instruction1) {Reasoning Router agent}
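The quoted Algorithm 1 excerpt can be sketched as a plain Python loop body. This is an illustrative reconstruction only: the proxy, retriever, and LLM are passed in as callables, and every prompt string and helper name here is an assumption, not the authors' actual API.

```python
# Hypothetical sketch of C-3PO's multi-agent inference, following the
# Algorithm 1 excerpt quoted above. All prompt templates and function
# names are illustrative assumptions.

ROUTER_PROMPT = "Decide whether retrieval is needed for: {q}"
FILTER_PROMPT = "Keep only the passages relevant to: {q}\n{docs}"
ANSWER_PROMPT = "Answer the question: {q}\nContext: {ctx}"
DECIDE_PROMPT = "Is this answer to '{q}' sufficient? Answer: {a}"

def c3po_inference(q, proxy, retriever, llm):
    """Route -> (optionally) retrieve + filter -> generate -> decide."""
    # Step 1: the Reasoning Router agent decides whether to retrieve.
    route = proxy(ROUTER_PROMPT.format(q=q))
    if route == "no_retrieval":
        # Easy questions go straight to the LLM without context.
        return llm(ANSWER_PROMPT.format(q=q, ctx=""))
    # Step 2: retrieve, then the Information Filter agent (the same
    # proxy under a different instruction) prunes the passages.
    docs = retriever(q)
    ctx = proxy(FILTER_PROMPT.format(q=q, docs="\n".join(docs)))
    # Step 3: the fixed LLM server answers from the filtered context.
    ans = llm(ANSWER_PROMPT.format(q=q, ctx=ctx))
    # Step 4: the Decision Maker agent checks the answer; a full
    # implementation would loop back to retrieval if it says to continue.
    proxy(DECIDE_PROMPT.format(q=q, a=ans))
    return ans

# Toy stand-ins so the sketch runs end to end.
def toy_proxy(prompt):
    if prompt.startswith("Decide"):
        return "retrieve"
    if prompt.startswith("Is this answer"):
        return "finish"
    return "Paris is the capital of France."

def toy_retriever(q):
    return ["Paris is the capital of France.", "Berlin is in Germany."]

def toy_llm(prompt):
    return "Paris"

print(c3po_inference("What is the capital of France?",
                     toy_proxy, toy_retriever, toy_llm))
```

The key design point the pseudocode reflects is that a single lightweight proxy plays all three agent roles, switching behavior via the instruction it is given, while the large LLM stays fixed.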
Open Source Code | Yes | Code is available at https://github.com/Chen-GX/C-3PO.
Open Datasets | Yes | To comprehensively evaluate our C-3PO, we experiment on both single-hop datasets including Natural Questions (NQ) (Kwiatkowski et al., 2019), PopQA (Mallen et al., 2023), and TriviaQA (TQA) (Joshi et al., 2017), as well as multi-hop datasets including 2WikiMultiHopQA (2Wiki) (Ho et al., 2020), MuSiQue (Trivedi et al., 2022), and HotpotQA (HQA) (Yang et al., 2018).
Dataset Splits | Yes | For each dataset, we only use 6000 randomly sampled questions instead of the full training set. ... For the in-domain test sets, we randomly sampled 1,000 instances as the test set.
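The split protocol above (6,000 sampled training questions, 1,000 sampled in-domain test instances per dataset) can be sketched in a few lines. The sizes come from the quoted text; the helper name, seed, and disjoint-sampling choice are our assumptions.

```python
import random

def make_splits(questions, n_train=6000, n_test=1000, seed=0):
    """Sample disjoint train/test subsets, per the split sizes quoted
    above. Helper name, seed, and disjointness are illustrative choices."""
    rng = random.Random(seed)
    pool = list(questions)
    rng.shuffle(pool)  # shuffle once, then carve off the two subsets
    return pool[:n_train], pool[n_train:n_train + n_test]

train, test = make_splits(range(10_000))
print(len(train), len(test))
```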
Hardware Specification | No | Our implementation supports two high-performance inference engines: SGLang and vLLM, allowing users to optimize for different deployment scenarios and hardware configurations. ... We utilize Qwen2-72B-Instruct (Yang et al., 2024a) as the fixed LLM server, while Qwen2-0.5B or Qwen2-1.5B is trained as the candidate lightweight proxy for efficient edge deployment. The paper does not specify concrete hardware details such as GPU or CPU models.
Software Dependencies | No | We utilize Llama-Factory (Zheng et al., 2024) as our training framework for the initial supervised fine-tuning phase. ... For the RL training phase, we adopt OpenRLHF (Hu et al., 2024) as our primary training framework, coupled with the vLLM (Kwon et al., 2023) inference engine. ... We integrate SGLang as our LLM server, which provides compatibility with various state-of-the-art language models, including Qwen2-72B-Instruct (Yang et al., 2024a) and Llama3.3-70B-Instruct (Dubey et al., 2024). ... We employ contriever-msmarco (Izacard et al., 2022) as our dense retriever. While specific software components are named, no version numbers for these components or any other underlying software (e.g., Python, PyTorch) are provided.
Experiment Setup | Yes | Table 6 (key hyperparameters in the supervised warm-up phase): Learning Rate 4e-5; Batch size 512; #Epochs 3; Optimizer type AdamW (Loshchilov & Hutter, 2019); ... Table 7 (key hyperparameters in the RL phase): Learning Rate of Policy model 5e-7; Learning Rate of Value model 5e-6; Batch size 1024; KL Coefficient 0.005; Optimizer type Adam; ...
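For quick reference, the hyperparameters quoted from Tables 6 and 7 can be collected into plain config dictionaries. The values are the ones reported above; the key names are our illustrative choices, not the authors' config schema.

```python
# Hyperparameters from Tables 6 and 7 as quoted above; key names are
# illustrative assumptions, values are as reported.

SFT_WARMUP = {
    "learning_rate": 4e-5,
    "batch_size": 512,
    "epochs": 3,
    "optimizer": "AdamW",  # Loshchilov & Hutter, 2019
}

RL_PHASE = {
    "policy_learning_rate": 5e-7,
    "value_learning_rate": 5e-6,
    "batch_size": 1024,
    "kl_coefficient": 0.005,
    "optimizer": "Adam",
}

# Note the much smaller RL learning rates: the policy LR is 80x below
# the SFT LR, a common choice to keep RL updates close to the SFT model.
ratio = SFT_WARMUP["learning_rate"] / RL_PHASE["policy_learning_rate"]
print(ratio)
```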