LEMoN: Label Error Detection using Multimodal Neighbors

Authors: Haoran Zhang, Aparna Balagopalan, Nassim Oufattole, Hyewon Jeong, Yan Wu, Jiacheng Zhu, Marzyeh Ghassemi

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training.
Researcher Affiliation | Academia | Massachusetts Institute of Technology. Correspondence to: Haoran Zhang <EMAIL>, Aparna Balagopalan <EMAIL>.
Pseudocode | Yes | An algorithm outline and high-level description of the method can be found in Appendix C. Algorithm 1: LEMoN: Label Error Detection Using Multimodal Neighbors
Open Source Code | Yes | Code: https://github.com/MLforHealth/LEMoN
Open Datasets | Yes | We evaluate our method using eight datasets, as shown in Table 1. Four datasets (CIFAR-10, CIFAR-100, Stanford Cars, miniImageNet) are label error detection datasets from the classification setting. The four remaining datasets are image captioning datasets. For MSCOCO and Flickr30k, we use the Karpathy split (Karpathy & Fei-Fei, 2015). The remaining datasets were randomly split into: training or reference set for the label detection method (80%), validation set for hyperparameter selection (10%), and test set for performance evaluation (10%).
Dataset Splits | Yes | We evaluate our method using eight datasets, as shown in Table 1. Four datasets (CIFAR-10, CIFAR-100, Stanford Cars, miniImageNet) are label error detection datasets from the classification setting. The four remaining datasets are image captioning datasets. For MSCOCO and Flickr30k, we use the Karpathy split (Karpathy & Fei-Fei, 2015). The remaining datasets were randomly split into: training or reference set for the label detection method (80%), validation set for hyperparameter selection (10%), and test set for performance evaluation (10%).
Hardware Specification | Yes | We run our experiments on a shared Slurm cluster. Most experiments used one RTX A6000 GPU with 48 GB VRAM, 10 CPU cores of an Intel Xeon Platinum 8368 (Ice Lake), and 50 GB RAM.
Software Dependencies | No | The paper mentions software such as CLIP, LLaVA, Llama-3.1-8B-Instruct, InstructBLIP-Vicuna-7b, and GIT (Wang et al., 2022a), but it does not provide version numbers for general software dependencies such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | For LEMoN-opt, we select the hyperparameter combination that maximizes F1 on a labeled validation set, and report the AUROC, macro-averaged AUPRC, and F1 for this model. For LEMoN-fix, we fix the hyperparameters at the following reasonable values: k = 30, β = γ = 5, τ1,n = τ1,m = 0.1, and τ2,n = τ2,m = 5. We report AUROC and AUPRC only, as F1 requires additional information to compute a threshold for the score.
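The Pseudocode row above points to the multimodal-neighbor idea behind Algorithm 1. A minimal NumPy sketch of that idea follows; it is an illustrative reconstruction, not the authors' algorithm. It assumes precomputed, L2-normalized image and text embeddings (CLIP-style), uses a hypothetical function name `label_error_scores`, and omits the paper's weighting hyperparameters (β, γ, τ).

```python
import numpy as np

def label_error_scores(img_emb, txt_emb, k=2):
    """Score each (image, caption) pair for likely label error.

    Simplified sketch: combine (1) the direct image-text distance with
    (2) the caption's distance to the captions of the image's k nearest
    neighbor images. Embeddings are assumed L2-normalized, so cosine
    similarity reduces to a dot product.
    """
    # Term 1: direct multimodal distance for each pair
    direct = 1.0 - np.sum(img_emb * txt_emb, axis=1)

    n = len(img_emb)
    neighbor = np.zeros(n)
    for i in range(n):
        d = 1.0 - img_emb @ img_emb[i]  # image-image distances
        d[i] = np.inf                   # exclude the point itself
        nn = np.argsort(d)[:k]          # indices of k nearest images
        # How far is this caption from its neighbors' captions?
        neighbor[i] = np.mean(1.0 - txt_emb[nn] @ txt_emb[i])

    return direct + neighbor  # higher = more likely mislabeled
```

On a toy example with two image clusters, a caption assigned to the wrong cluster receives both a large direct distance and a large neighbor disagreement, so it ranks first.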
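The 80/10/10 split described in the Open Datasets and Dataset Splits rows can be sketched as a simple shuffled partition. The function name `split_80_10_10` and the seed are illustrative assumptions, not the authors' code.

```python
import numpy as np

def split_80_10_10(n, seed=0):
    """Shuffle n example indices and partition them into a
    training/reference set (80%), a validation set for hyperparameter
    selection (10%), and a test set for evaluation (10%)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_train = int(0.8 * n)
    n_val = int(0.1 * n)
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]
```

Fixing the seed makes the split reproducible across runs, which is the property a reference split like this is meant to provide.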
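The Experiment Setup row notes that F1, unlike AUROC or AUPRC, needs a score threshold, which LEMoN-opt obtains from a labeled validation set. A generic sketch of that threshold search is below; the function names are hypothetical and this is not the authors' tuning code.

```python
import numpy as np

def f1_score(pred, labels):
    """F1 for binary 'is this label an error?' predictions."""
    tp = np.sum(pred & (labels == 1))
    fp = np.sum(pred & (labels == 0))
    fn = np.sum(~pred & (labels == 1))
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def best_f1(scores, labels):
    """Scan the observed scores as candidate thresholds and return the
    best validation F1 -- the extra step a fixed-hyperparameter variant
    (which reports only AUROC/AUPRC) does without."""
    return max(f1_score(scores >= t, labels) for t in np.unique(scores))
```

The same scan also yields the threshold itself, which would then be frozen and applied to the test set.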