ReFeR: Improving Evaluation and Reasoning through Hierarchy of Models

Authors: Yaswanth Narsupalli, Abhranil Chandra, Sreevatsa Muppirala, Manish Gupta, Pawan Goyal

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We rigorously evaluate ReFeR on four diverse evaluation benchmarks, where it surpasses prior methods in accuracy while also generating constructive feedback useful for downstream distillation and self-improvement via finetuning. Interestingly, ReFeR is also applicable to reasoning tasks: experiments on four reasoning benchmarks show ReFeR's superior collective reasoning abilities.
Researcher Affiliation | Collaboration | Yaswanth Narsupalli (IIT Kharagpur), Abhranil Chandra (University of Waterloo), Sreevatsa Muppirala (IIT Kharagpur), Manish Gupta (Microsoft, India), Pawan Goyal (IIT Kharagpur)
Pseudocode | Yes | Refer to Fig. 5 (in the appendix) for an illustration of ReFeR for multimodality, and Algorithm 1 for the framework's working.
Open Source Code | Yes | We make a PIP package, code and data publicly available1,2. 1https://github.com/yaswanth-iitkgp/ReFeR 2https://pypi.org/project/refer-agents/
Open Datasets | Yes | For NLG evaluation, we test our framework on SummEval (Fabbri et al., 2021) for summarization evaluation, and Topical-Chat (Mehri & Eskenazi, 2020) for dialogue generation evaluation. For multimodal evaluation, we compare our framework on two types of tasks: image-to-text using ICQD (Image Caption Quality Dataset) (Levinboim et al., 2019) and text-to-image generation using AGIQA-1k by Zhang et al. (2023). ... We also test our framework on 4 reasoning datasets: AQuA (Ling et al., 2017), BBH-DU (Srivastava et al., 2023), CSQA (Aggarwal et al., 2021) and GSM8k (Cobbe et al., 2021).
Dataset Splits | Yes | Table 1 (Dataset Statistics) lists all tasks tackled in the paper, the datasets used, and the number of samples: Topical-Chat (360, Rating), SummEval (1600, Rating), ICQD (864, Caption Score), AGIQA (500, Generation Score), AQuA (100, Option), CSQA (100, Option), BBH-DU (100, Option), GSM8k (100, Number). For reasoning, a random subset of 100 was sampled from each original dataset, following Chen et al. (2024a). 500 random samples were selected from the original AGIQA-1k to get a well-distributed dataset. 864 samples with usable image URLs from the ICQD test set are used. The full test sets are used for the NLG evaluation datasets.
Hardware Specification | No | The paper mentions 'local GPU deployment of the peers' but does not specify any particular GPU models or other hardware details. It also discusses FLOPs calculations for various models in Appendix L but does not state the specific hardware on which the experiments were run.
Software Dependencies | No | The paper mentions specific models such as 'Llama-3.1-8B-Instruct (Meta-AI, 2024)', 'Mistral-Nemo-12B (Mistral-AI, 2024)', 'Gemma-2-9B (Google-Research, 2024)', and 'GPT-4o-mini (OpenAI, 2024b)', and mentions using Together-AI's (2023) API. However, it does not provide version numbers for core software libraries or frameworks (e.g., PyTorch, TensorFlow, CUDA) that would be needed for replication.
Experiment Setup | Yes | For the ReFeR NLG evaluation setup, following Analyze-Rate (Chiang & Lee, 2023), we set the hyperparameters as follows. For the Area Chair GPT-4o-mini model: temperature=1, max_tokens=256, top_p=1, frequency_penalty=0, presence_penalty=0, stop=None, n=20 (varies for ReFeR Lite and Pro). For the peer models, we use the default hyperparameters except for max_tokens=128. For multimodal evaluation, we use the same setup for the AC, but for the peers we increase max_tokens from 128 to 192 tokens. For reasoning tasks, we follow the NLG evaluation setup for the Area Chair, but we don't set any limit on the max_tokens hyperparameter. For the peer models, we increase max_tokens to 256 and set temperature=1, top_p=1.
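The decoding setup above can be sketched as OpenAI-style chat-completion keyword arguments. This is a minimal illustration only: the dict and function names are hypothetical and not part of the refer-agents package; only the hyperparameter values are taken from the quoted setup.

```python
# Hypothetical sketch of the reported decoding configurations, assuming
# OpenAI-style keyword arguments. Only the values come from the paper.

AC_NLG_CONFIG = {          # Area Chair (GPT-4o-mini), NLG evaluation
    "temperature": 1,
    "max_tokens": 256,
    "top_p": 1,
    "frequency_penalty": 0,
    "presence_penalty": 0,
    "stop": None,
    "n": 20,               # varies for ReFeR Lite and Pro
}

PEER_NLG_CONFIG = {"max_tokens": 128}  # peers otherwise use API defaults


def ac_config(task: str) -> dict:
    """Area Chair config; reasoning tasks drop the max_tokens cap."""
    cfg = dict(AC_NLG_CONFIG)
    if task == "reasoning":
        cfg.pop("max_tokens")  # no limit on max_tokens for reasoning
    return cfg


def peer_config(task: str) -> dict:
    """Peer-model config per task family, as described in the setup."""
    if task == "nlg":
        return dict(PEER_NLG_CONFIG)
    if task == "multimodal":   # token budget raised from 128 to 192
        return {"max_tokens": 192}
    if task == "reasoning":    # longer outputs, explicit sampling params
        return {"max_tokens": 256, "temperature": 1, "top_p": 1}
    raise ValueError(f"unknown task: {task}")
```

These dicts could then be passed as `**kwargs` to whatever chat-completion client is used for the AC and peer calls.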