MetaMetrics: Calibrating Metrics for Generation Tasks Using Human Preferences

Authors: Genta Winata, David Anugraha, Lucky Susanto, Garry Kuwanto, Derry Wijaya

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type Experimental We introduce METAMETRICS, a calibrated meta-metric designed to evaluate generation tasks across different modalities in a supervised manner. METAMETRICS optimizes the combination of existing metrics to enhance their alignment with human preferences. Our metric demonstrates flexibility and effectiveness in both language and vision downstream tasks, showing significant benefits across various multilingual and multi-domain scenarios. ... 4 EXPERIMENTS SETUP In this work, we explore two optimization methodologies: BO and Boosting.
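The response above describes the paper's core idea: optimize a combination of existing metrics so that the combined score aligns better with human preferences. A minimal, hypothetical sketch of that idea (not the authors' implementation, which uses Bayesian optimization and boosting) combines per-example metric scores linearly and searches for weights that maximize Kendall correlation with human judgments:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import kendalltau

def neg_kendall(weights, metric_scores, human_scores):
    # Objective: negative Kendall tau between the weighted metric
    # combination and human judgments (we minimize, so negate).
    combined = metric_scores @ weights
    tau, _ = kendalltau(combined, human_scores)
    return -tau

# Toy data: 3 candidate metrics scored on 50 examples (hypothetical,
# not from the paper): each metric is a noisy view of the human score.
rng = np.random.default_rng(0)
human = rng.normal(size=50)
metrics = np.column_stack(
    [human + rng.normal(scale=s, size=50) for s in (0.5, 1.0, 2.0)]
)

# Derivative-free search over the metric weights.
result = minimize(neg_kendall, x0=np.ones(3) / 3,
                  args=(metrics, human), method="Nelder-Mead")
weights = result.x
tau_combined = -result.fun  # Kendall tau of the calibrated combination
```

The derivative-free Nelder-Mead search here stands in for the paper's GP-based Bayesian optimization; rank correlation is non-smooth in the weights, which is why gradient-free methods are a natural fit.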
Researcher Affiliation Collaboration Genta Indra Winata (Capital One), David Anugraha (University of Toronto), Lucky Susanto (Monash University Indonesia), Garry Kuwanto (Boston University), Derry Tanti Wijaya (Monash University Indonesia; Boston University). EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode Yes Algorithm 1 Iterative-based Pruning with XGBoost
1: procedure ITERATIVEXGBOOST(X, y, k)
2:   F ← {f1, f2, . . . , fp}  ▷ Initial set of features
3:   P ← [ ]  ▷ Performance history
4:   Fleast ← [ ]  ▷ Feature pruning history
5:   for i ← 1 to k do
6:     Train Φ(i)XGB on XF with CV
7:     Ii ← Importance(ΦXGB)  ▷ Compute feature importance from ΦXGB
8:     P[i] ← ρ(ΦXGB(XF), y)  ▷ Store performance score
9:     fleast[i] ← argmin(Ii)  ▷ Identify least important feature
10:    Fleast ← Fleast ∪ {fleast}
11:    F ← F \ {fleast}  ▷ Remove least important feature
12:   end for
13:   i ← argmax(P)  ▷ Find the best iteration index
14:   Fbest ← F \ Fleast[i :]  ▷ Best features from highest performance
15:   Train final f̂XGB on XFbest
16:   return f̂XGB
17: end procedure
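The pruning loop above can be sketched in Python. This is a loose interpretation, not the authors' code: it uses scikit-learn's GradientBoostingRegressor as a stand-in for XGBoost, in-sample Spearman correlation ρ in place of cross-validated scoring, and reads Fbest as the feature set that was live at the best-scoring iteration:

```python
import numpy as np
from scipy.stats import spearmanr
from sklearn.ensemble import GradientBoostingRegressor

def iterative_pruning(X, y, k):
    """Drop the least important feature for k rounds, then keep the
    feature set from the best-performing round (Algorithm 1 sketch)."""
    features = list(range(X.shape[1]))  # F: current feature indices
    history, pruned = [], []            # P and Fleast
    for _ in range(k):
        model = GradientBoostingRegressor(random_state=0).fit(X[:, features], y)
        score, _ = spearmanr(model.predict(X[:, features]), y)
        history.append(score)           # P[i]: performance this round
        least = features[int(np.argmin(model.feature_importances_))]
        pruned.append(least)            # Fleast: record pruned feature
        features.remove(least)          # F <- F \ {f_least}
    best_iter = int(np.argmax(history))
    # Features still present at the best iteration.
    kept = [f for f in range(X.shape[1]) if f not in pruned[:best_iter]]
    final = GradientBoostingRegressor(random_state=0).fit(X[:, kept], y)
    return final, kept

# Toy data (hypothetical): features 0 and 1 are informative, 2-5 are noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
y = 2 * X[:, 0] + X[:, 1] + rng.normal(scale=0.1, size=200)
model, kept = iterative_pruning(X, y, k=4)
```

With noise features present, they are pruned first, so the informative columns survive into the returned set.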
Open Source Code Yes We will release the code and models to facilitate reproducibility and empower researchers and practitioners to utilize and extend our metric for various applications. [2] We release METAMETRICS as an open-source library at https://github.com/meta-metrics/metametrics.
Open Datasets Yes Abstractive Text Summarization. For this task, we use the SummEval dataset (Fabbri et al., 2021) for text summarization evaluation. ... For additional benchmarking, we also evaluate on the Benchmark LLM (BLLM) (Zhang et al., 2024) dataset... Machine Translation. ...WMT shared task datasets from 2020 to 2022 that are annotated with MQM scores. We evaluate on the MQM datasets from the WMT23 and WMT24 shared tasks (Freitag et al., 2023; 2024). Question Answering. ...Evaluating Question Answering Evaluation (EQAE) (Chen et al., 2019) and EVOUNA (Wang et al., 2024a), and include Open-QA datasets: Natural Questions (NQ) (Kwiatkowski et al., 2019) and TriviaQA (TQ) (Joshi et al., 2017), an RCQA dataset: NarrativeQA (Kočiský et al., 2018), and a Reasoning QA dataset: SemEval-2018 Task 11 (SemEval) (Ostermann et al., 2018). Image Captioning. For this task, we use the Flickr8k-Expert (Hodosh et al., 2013) and THumB 1.0 (Kasai et al., 2022) human evaluation datasets. Reward Model Scoring. We use RewardBench (Lambert et al., 2024) as the test benchmark. We use the cleaned Skywork Reward Data Collection [3] for training our METAMETRICS, including HelpSteer2 (Wang et al., 2024c), OffsetBias (Park et al., 2024), WildGuard (Han et al., 2024), and Magpie (Xu et al., 2024). In addition, we add the Preference Test Data [4] to increase the size of the human preference dataset. ... [3] The cleaned Skywork Reward Data Collection can be accessed at https://huggingface.co/datasets/natolambert/skywork-preferences-80k-v0.1-cleaned. [4] The Preference Test Data from AI2 can be accessed at https://huggingface.co/datasets/allenai/preference-test-sets.
Dataset Splits Yes For each evaluation task in abstractive text summarization, question answering, and image captioning, we use a 30%-70% train-test split, as no predefined split was available in the datasets used for these tasks. Alternatively, we follow existing standardized benchmarks for machine translation and reward model scoring. Table 6: Dataset statistics used in the experiments (excerpt). Dataset Name | Type | Tuning Size | Testing Size ... SummEval | ... | 510 | 1,190 ... NQ | ... | 4,528 | 10,567
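A 30%-70% tuning/testing split of this kind is straightforward to reproduce with scikit-learn; the data below is a stand-in, not from the paper, and the `random_state` is an arbitrary choice for repeatability:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 hypothetical annotated examples (placeholder for e.g. SummEval rows).
examples = np.arange(100)

# 30% for tuning the metric combination, 70% held out for testing.
tune, test = train_test_split(examples, train_size=0.3, random_state=42)
```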
Hardware Specification Yes All experiments are run on the same machine with AMD EPYC 9354 32-Core Processor and NVIDIA RTX 6000 Ada GPU with 48GB memory.
Software Dependencies No The paper mentions various models and frameworks like XGBoost (Chen & Guestrin, 2016), Gaussian Process with Matern kernel (Williams & Rasmussen, 2006), and specific metrics like BLEU, BERTScore, G-Eval (GPT-4), etc., but it does not provide specific version numbers for the general software environment (e.g., Python version, PyTorch/TensorFlow versions, or XGBoost library version).
Experiment Setup Yes Table 7 describes the hyper-parameter settings that we use for our experiments. For the Bayesian optimization, we run a GP with a Matern kernel (Williams & Rasmussen, 2006), a generalization of the RBF kernel, using ν = 2.5. Table 8: Initial hyper-parameter values used for the XGBoost parameter search.
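A GP surrogate with a Matern ν = 2.5 kernel, as described, can be instantiated directly in scikit-learn; the observations below are toy values standing in for (metric weight, correlation) pairs, not data from the paper's search:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Matern kernel with nu=2.5, the setting quoted above (a generalization
# of the RBF kernel; RBF is recovered in the limit nu -> infinity).
kernel = Matern(nu=2.5)
gp = GaussianProcessRegressor(kernel=kernel, normalize_y=True, random_state=0)

# Toy observations: candidate weights and their measured objective values.
X_obs = np.array([[0.1], [0.4], [0.7], [0.9]])
y_obs = np.sin(3 * X_obs).ravel()
gp.fit(X_obs, y_obs)

# Posterior mean and uncertainty at an unseen point, as a BO
# acquisition function would consume them.
mean, std = gp.predict(np.array([[0.5]]), return_std=True)
```

In a full BO loop, an acquisition function (e.g. expected improvement) would trade off this posterior mean against the predictive standard deviation to pick the next candidate weights to evaluate.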