The FIX Benchmark: Extracting Features Interpretable to eXperts

Authors: Helen Jin, Shreya Havaldar, Chaehyeon Kim, Anton Xue, Weiqiu You, Helen Qu, Marco Gatti, Daniel A Hashimoto, Bhuvnesh Jain, Amin Madani, Masao Sako, Lyle Ungar, Eric Wong

DMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate commonly used techniques for extracting higher-level features and find that existing methods score poorly on FIXScore, highlighting the need for developing new general-purpose methods designed to automatically extract expert features. ... We show results on the baselines in Table 2.
Researcher Affiliation | Academia | Department of Computer and Information Science, University of Pennsylvania, USA; Department of Physics and Astronomy, University of Pennsylvania, USA; Department of Surgery, Perelman School of Medicine, University of Pennsylvania, USA; Department of Surgery, University of Toronto, Canada
Pseudocode | No | The paper describes algorithms and metrics using mathematical formulas (e.g., Equations 1-7, 9-16, and 18-21) but does not present any explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Packaged libraries of code, Hugging Face data loaders, and updates are available at https://brachiolab.github.io/fix/
Open Datasets | Yes | All datasets and their respective Croissant metadata records and licenses are available on Hugging Face at the following links. Mass Maps: https://huggingface.co/datasets/BrachioLab/massmaps-cosmogrid-100k; Supernova: https://huggingface.co/datasets/BrachioLab/supernova-timeseries; Multilingual Politeness: https://huggingface.co/datasets/BrachioLab/multilingual_politeness; Emotion: https://huggingface.co/datasets/BrachioLab/emotion; Chest X-Ray: https://huggingface.co/datasets/BrachioLab/chestx; Laparoscopic Cholecystectomy Surgery: https://huggingface.co/datasets/BrachioLab/cholec
Dataset Splits | Yes | The dataset contains train/validation/test splits of sizes 90,000/10,000/10,000, respectively. (Mass Maps) ... The supernova dataset contains train/validation/test splits of sizes 6,274/728/792, respectively. ... The dataset contains train/validation/test splits of sizes 43,400/5,430/5,430, respectively. (Emotion) ... We randomly partition the dataset into train/test splits of sizes 23,094/5,774, respectively. (Chest X-Ray) ... This dataset consists of 1,015 annotated images that are randomly split by video sources, with train/test splits of sizes 785/230, respectively. (Cholecystectomy)
Hardware Specification | Yes | All experiments were conducted on two server machines, with 8 NVIDIA A100 GPUs and 8 NVIDIA A6000 GPUs, respectively.
Software Dependencies | No | The paper does not provide specific software dependencies with version numbers. It mentions the use of 'Torch XRay Vision library' for Chest X-Ray and 'BERTopic' for text clustering, but without specific version information.
Experiment Setup | No | The paper describes the datasets, metrics, and baselines but does not provide specific hyperparameters (e.g., learning rate, batch size, number of epochs) or detailed training configurations for the experiments conducted. It mentions using a 'CNN-based model' for Mass Maps, a 'fine-tuned multilingual LLM' for Multilingual Politeness, and a 'fine-tuned LLM' for Emotion, but gives no explicit setup details.
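
As a rough aid for anyone checking the Open Datasets and Dataset Splits entries above, the reported numbers can be collected in one place and turned into split fractions. The following is a minimal sketch, not code from the paper or its released library: the short keys (such as "massmaps"), the helper names split_fractions and load_split, and the split names passed to the loader are all assumptions made for illustration. The repository IDs and sizes are copied from the entries above; Multilingual Politeness has no reported split sizes here, so it appears only in the repository map.

```python
# Split sizes as reported in the Dataset Splits entry above.
# The short keys ("massmaps", ...) are hypothetical labels for this sketch.
REPORTED_SPLITS = {
    "massmaps": {"train": 90_000, "validation": 10_000, "test": 10_000},
    "supernova": {"train": 6_274, "validation": 728, "test": 792},
    "emotion": {"train": 43_400, "validation": 5_430, "test": 5_430},
    "chestx": {"train": 23_094, "test": 5_774},
    "cholec": {"train": 785, "test": 230},
}

# Hugging Face repository IDs taken from the Open Datasets entry above.
HF_REPOS = {
    "massmaps": "BrachioLab/massmaps-cosmogrid-100k",
    "supernova": "BrachioLab/supernova-timeseries",
    "politeness": "BrachioLab/multilingual_politeness",
    "emotion": "BrachioLab/emotion",
    "chestx": "BrachioLab/chestx",
    "cholec": "BrachioLab/cholec",
}


def split_fractions(sizes):
    """Return each split's share of the total number of examples."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


def load_split(name, split="train"):
    """Hypothetical helper: fetch one split from the Hugging Face Hub.

    Needs the `datasets` package and network access. The split name is an
    assumption; the names actually exposed by each repository may differ.
    """
    from datasets import load_dataset  # imported lazily; optional dependency

    return load_dataset(HF_REPOS[name], split=split)


if __name__ == "__main__":
    for name, sizes in REPORTED_SPLITS.items():
        shares = ", ".join(
            f"{split}={share:.1%}" for split, share in split_fractions(sizes).items()
        )
        print(f"{name}: total={sum(sizes.values())}, {shares}")
```

Running the script only prints totals and percentages from the reported numbers; load_split is never called unless invoked explicitly, so the sketch works without the datasets package installed.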