ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models

Authors: Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We rigorously evaluate ReFeR on four diverse evaluation benchmarks, where it surpasses prior methods in accuracy while also generating constructive feedback useful for downstream distillation and self-improvement via finetuning. Interestingly, ReFeR is also applicable to reasoning tasks: experiments on four reasoning benchmarks show ReFeR's superior collective reasoning abilities.
Researcher Affiliation | Collaboration | Yaswanth Narsupalli (IIT Kharagpur), Abhranil Chandra (University of Waterloo), Sreevatsa Muppirala (IIT Kharagpur), Manish Gupta (Microsoft, India), Pawan Goyal (IIT Kharagpur)
Pseudocode | Yes | Refer to Fig. 5 (in the appendix) for an illustration of ReFeR for multimodality, and Algorithm 1 for the framework's working.
Open Source Code | Yes | We make a PIP package, code and data publicly available1,2. 1https://github.com/yaswanth-iitkgp/ReFeR 2https://pypi.org/project/refer-agents/
Open Datasets | Yes | For NLG evaluation, we test our framework on SummEval (Fabbri et al., 2021) for summarization evaluation, and Topical-Chat (Mehri & Eskenazi, 2020) for dialogue generation evaluation. For multimodal evaluation, we compare our framework on two types of tasks: image-to-text using ICQD (Image Caption Quality Dataset) (Levinboim et al., 2019) and text-to-image generation using AGIQA-1k by Zhang et al. (2023). ... We also test our framework on 4 reasoning datasets: AQuA (Ling et al., 2017), BBH-DU (Srivastava et al., 2023), CSQA (Aggarwal et al., 2021) and GSM8k (Cobbe et al., 2021).
Dataset Splits | Yes | Table 1 (Dataset Statistics) lists all tasks tackled in the paper, the datasets used, and the number of samples: Topical-Chat (360, Rating), SummEval (1600, Rating), ICQD (864, Caption Score), AGIQA (500, Generation Score), AQuA (100, Option), CSQA (100, Option), BBH-DU (100, Option), GSM8k (100, Number). For reasoning, a random subset of 100 was sampled from each original dataset, following Chen et al. (2024a). 500 random samples were selected from the original AGIQA-1k to get a well-distributed dataset. 864 samples with usable image URLs from the ICQD test set are used. The full test sets are used for the NLG evaluation datasets.
Hardware Specification | No | The paper mentions 'local GPU deployment of the peers' but does not specify any particular GPU models or other hardware details. It also discusses FLOPs calculations for various models in Appendix L but does not state the specific hardware on which the experiments were run.
Software Dependencies | No | The paper mentions specific models such as 'Llama-3.1-8B-Instruct (Meta-AI, 2024)', 'Mistral-Nemo-12B (Mistral-AI, 2024)', 'Gemma-2-9B (Google-Research, 2024)', and 'GPT-4o-mini (OpenAI, 2024b)', and mentions using Together-AI's (2023) API. However, it does not provide version numbers for core software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) that would be needed for replication.
Experiment Setup | Yes | For the ReFeR NLG evaluation setup, following Analyze-Rate (Chiang & Lee, 2023), we set the hyperparameters as follows. For the Area Chair GPT-4o-mini model: temperature=1, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0, stop=None, n=20 (varies for ReFeR Lite and Pro). For the peer models, we use the default hyperparameters except for max_tokens=128. For multimodal evaluation, we use the same setup for the AC, but for the peers we increase max_tokens from 128 to 192 tokens. For reasoning tasks, we follow the NLG evaluation setup for the Area Chair, but we don't set any limit on the max_tokens hyperparameter. For the peer models, we increase max_tokens to 256 and set temperature=1, top_p=1.
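The decoding setup above can be sketched as OpenAI-style chat-completion keyword arguments. This is a minimal illustration only: the dict and function names are hypothetical and not part of the refer-agents package; only the hyperparameter values are taken from the quoted setup.

```python
# Hypothetical sketch of the reported decoding configurations, assuming
# OpenAI-style keyword arguments. Only the values come from the paper.

AC_NLG_CONFIG = {          # Area Chair (GPT-4o-mini), NLG evaluation
    "temperature": 1,
    "max_tokens": 256,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "stop": None,
    "n": 20,               # varies for ReFeR Lite and Pro
}

PEER_NLG_CONFIG = {"max_tokens": 128}  # peers otherwise use API defaults


def ac_config(task: str) -> dict:
    """Area Chair config; reasoning tasks drop the max_tokens cap."""
    cfg = dict(AC_NLG_CONFIG)
    if task == "reasoning":
        cfg.pop("max_tokens")  # no limit on max_tokens for reasoning
    return cfg


def peer_config(task: str) -> dict:
    """Peer-model config per task family, as described in the setup."""
    if task == "nlg":
        return dict(PEER_NLG_CONFIG)
    if task == "multimodal":   # token budget raised from 128 to 192
        return {"max_tokens": 192}
    if task == "reasoning":    # longer outputs, explicit sampling params
        return {"max_tokens": 256, "temperature": 1, "top_p": 1}
    raise ValueError(f"unknown task: {task}")
```

These dicts could then be passed as `**kwargs` to whatever chat-completion client is used for the AC and peer calls.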