LEMoN: Label Error Detection using Multimodal Neighbors

Authors: Haoran Zhang, Aparna Balagopalan, Nassim Oufattole, Hyewon Jeong, Yan Wu, Jiacheng Zhu, Marzyeh Ghassemi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training.
Researcher Affiliation | Academia | Massachusetts Institute of Technology. Correspondence to: Haoran Zhang <EMAIL>, Aparna Balagopalan <EMAIL>.
Pseudocode | Yes | An algorithm outline and high-level description of the method can be found in Appendix C. Algorithm 1: LEMoN: Label Error Detection Using Multimodal Neighbors
Open Source Code | Yes | Code: https://github.com/MLforHealth/LEMoN
Open Datasets | Yes | We evaluate our method using eight datasets, as shown in Table 1. Four datasets (CIFAR-10, CIFAR-100, Stanford Cars, miniImageNet) are label error detection datasets from the classification setting. The four remaining datasets are image captioning datasets. For MSCOCO and Flickr30k, we use the Karpathy split (Karpathy & Fei-Fei, 2015). The remaining datasets were randomly split into: training or reference set for the label detection method (80%), validation set for hyperparameter selection (10%), and test set for performance evaluation (10%).
Dataset Splits | Yes | We evaluate our method using eight datasets, as shown in Table 1. Four datasets (CIFAR-10, CIFAR-100, Stanford Cars, miniImageNet) are label error detection datasets from the classification setting. The four remaining datasets are image captioning datasets. For MSCOCO and Flickr30k, we use the Karpathy split (Karpathy & Fei-Fei, 2015). The remaining datasets were randomly split into: training or reference set for the label detection method (80%), validation set for hyperparameter selection (10%), and test set for performance evaluation (10%).
Hardware Specification | Yes | We run our experiments on a shared Slurm cluster. Most experiments used one RTX A6000 GPU with 48 GB VRAM, 10 CPU cores of an Intel Xeon Platinum 8368 (Ice Lake), and 50 GB RAM.
Software Dependencies | No | The paper mentions software such as CLIP, LLaVA, Llama-3.1-8B-Instruct, InstructBLIP-Vicuna-7b, and GIT (Wang et al., 2022a), but it does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For LEMoN-opt, we select the hyperparameter combination that maximizes F1 on a labeled validation set, and report the AUROC, macro-averaged AUPRC, and F1 for this model. For LEMoN-fix, we fix the hyperparameters at the following reasonable values: k = 30, β = γ = 5, τ1,n = τ1,m = 0.1, and τ2,n = τ2,m = 5. We report AUROC and AUPRC only, as F1 requires additional information to compute a threshold for the score.
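The Pseudocode row above points to the multimodal-neighbor idea behind Algorithm 1. A minimal NumPy sketch of that idea follows; it is an illustrative reconstruction, not the authors' algorithm. It assumes precomputed, L2-normalized image and text embeddings (CLIP-style), uses a hypothetical function name `label_error_scores`, and omits the paper's weighting hyperparameters (β, γ, τ).

```python
import numpy as np

def label_error_scores(img_emb, txt_emb, k=2):
    """Score each (image, caption) pair for likely label error.

    Simplified sketch: combine (1) the direct image-text distance with
    (2) the caption's distance to the captions of the image's k nearest
    neighbor images. Embeddings are assumed L2-normalized, so cosine
    similarity reduces to a dot product.
    """
    # Term 1: direct multimodal distance for each pair
    direct = 1.0 - np.sum(img_emb * txt_emb, axis=1)

    n = len(img_emb)
    neighbor = np.zeros(n)
    for i in range(n):
        d = 1.0 - img_emb @ img_emb[i]  # image-image distances
        d[i] = np.inf                   # exclude the point itself
        nn = np.argsort(d)[:k]          # indices of k nearest images
        # How far is this caption from its neighbors' captions?
        neighbor[i] = np.mean(1.0 - txt_emb[nn] @ txt_emb[i])

    return direct + neighbor  # higher = more likely mislabeled
```

On a toy example with two image clusters, a caption assigned to the wrong cluster receives both a large direct distance and a large neighbor disagreement, so it ranks first.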
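The 80/10/10 split described in the Open Datasets and Dataset Splits rows can be sketched as a simple shuffled partition. The function name `split_80_10_10` and the seed are illustrative assumptions, not the authors' code.

```python
import numpy as np

def split_80_10_10(n, seed=0):
    """Shuffle n example indices and partition them into a
    training/reference set (80%), a validation set for hyperparameter
    selection (10%), and a test set for evaluation (10%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Fixing the seed makes the split reproducible across runs, which is the property a reference split like this is meant to provide.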
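The Experiment Setup row notes that F1, unlike AUROC or AUPRC, needs a score threshold, which LEMoN-opt obtains from a labeled validation set. A generic sketch of that threshold search is below; the function names are hypothetical and this is not the authors' tuning code.

```python
import numpy as np

def f1_score(pred, labels):
    """F1 for binary 'is this label an error?' predictions."""
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_f1(scores, labels):
    """Scan the observed scores as candidate thresholds and return the
    best validation F1 -- the extra step a fixed-hyperparameter variant
    (which reports only AUROC/AUPRC) does without."""
    return max(f1_score(scores >= t, labels) for t in np.unique(scores))
```

The same scan also yields the threshold itself, which would then be frozen and applied to the test set.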