reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

The FIX Benchmark: Extracting Features Interpretable to eXperts

Authors: Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong

DMLR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We evaluate commonly used techniques for extracting higher-level features and find that existing methods score poorly on FIXScore, highlighting the need for developing new general-purpose methods designed to automatically extract expert features. ... We show results on the baselines in Table 2.
Researcher Affiliation	Academia	Department of Computer and Information Science, University of Pennsylvania, USA; Department of Physics and Astronomy, University of Pennsylvania, USA; Department of Surgery, Perelman School of Medicine, University of Pennsylvania, USA; Department of Surgery, University of Toronto, Canada
Pseudocode	No	The paper describes algorithms and metrics using mathematical formulas (e.g., Equation 1, 2, 3, 4, 5, 6, 7, 9, 10, 11, 12, 13, 14, 15, 16, 18, 19, 20, 21) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code	Yes	Packaged libraries of code, hugging face data loaders and updates are available at https://brachiolab.github.io/fix/
Open Datasets	Yes	All datasets and their respective Croissant metadata records and licenses are available on Hugging Face at the following links. Mass Maps: https://huggingface.co/datasets/BrachioLab/massmaps-cosmogrid-100k; Supernova: https://huggingface.co/datasets/BrachioLab/supernova-timeseries; Multilingual Politeness: https://huggingface.co/datasets/BrachioLab/multilingual_politeness; Emotion: https://huggingface.co/datasets/BrachioLab/emotion; Chest X-Ray: https://huggingface.co/datasets/BrachioLab/chestx; Laparoscopic Cholecystectomy Surgery: https://huggingface.co/datasets/BrachioLab/cholec
Dataset Splits	Yes	The dataset contains train/validation/test splits of sizes 90,000/10,000/10,000, respectively. (Mass Maps) ... The supernova dataset contains train/validation/test splits of sizes 6274/728/792, respectively. ... The dataset contains train/validation/test splits of sizes 43,400/5,430/5,430, respectively. (Emotion) ... We randomly partition the dataset into train/test splits of sizes 23,094/5,774, respectively. (Chest X-Ray) ... This dataset consists of 1015 annotated images that are randomly split by video sources, with train/test splits of sizes 785/230, respectively. (Cholecystectomy)
Hardware Specification	Yes	All experiments were conducted on two server machines, each with 8 NVIDIA A100 GPUs and 8 NVIDIA A6000 GPUs, respectively.
Software Dependencies	No	The paper does not provide specific software dependencies with version numbers. It mentions the use of 'Torch XRay Vision library' for Chest X-Ray and 'BERTopic' for text clustering, but without specific version information.
Experiment Setup	No	The paper describes the datasets, metrics, and baselines but does not provide specific hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed training configurations for the experiments conducted. It mentions using 'CNN-based model' for Mass Maps and 'fine-tuned multilingual LLM' for Multilingual Politeness and 'fine-tuned LLM' for Emotion, but without explicit setup details.
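
The split sizes quoted in the Dataset Splits row can be turned into totals and proportions with a few lines of arithmetic. The sketch below is illustrative only and is not part of the paper or of the FIX library. The dictionary `SPLITS` and the helpers `split_fractions` and `format_row` are hypothetical names introduced here. Sizes are copied from the quoted text; a split the text does not report is recorded as `None`, and Multilingual Politeness is left out because its split sizes are not quoted above.

```python
# Split sizes as quoted in the "Dataset Splits" row: (train, validation, test).
# None marks a split that the quoted text does not report.
SPLITS = {
    "Mass Maps": (90_000, 10_000, 10_000),
    "Supernova": (6_274, 728, 792),
    "Emotion": (43_400, 5_430, 5_430),
    "Chest X-Ray": (23_094, None, 5_774),
    "Cholecystectomy": (785, None, 230),
}


def split_fractions(sizes):
    """Return (total, fractions); fractions align with sizes, None stays None."""
    total = sum(s for s in sizes if s is not None)
    fractions = tuple(None if s is None else s / total for s in sizes)
    return total, fractions


def format_row(name, sizes):
    """Render one dataset as a single line of text with percentage shares."""
    total, fractions = split_fractions(sizes)
    cells = ["n/a" if f is None else f"{f:.1%}" for f in fractions]
    return f"{name:<16} total={total:>7,}  train/val/test = {' / '.join(cells)}"


if __name__ == "__main__":
    for name, sizes in SPLITS.items():
        print(format_row(name, sizes))
```

As a consistency check, the Cholecystectomy train and test sizes sum to 785 + 230 = 1015, which matches the 1015 annotated images stated in the quoted text.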