LEMoN: Label Error Detection using Multimodal Neighbors
Authors: Haoran Zhang, Aparna Balagopalan, Nassim Oufattole, Hyewon Jeong, Yan Wu, Jiacheng Zhu, Marzyeh Ghassemi
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through empirical evaluations across eight datasets and twelve baselines, we find that LEMoN outperforms the baselines by over 3% in label error detection, and that training on datasets filtered using our method improves downstream captioning performance by more than 2 BLEU points over noisy training. |
| Researcher Affiliation | Academia | 1Massachusetts Institute of Technology. Correspondence to: Haoran Zhang <EMAIL>, Aparna Balagopalan <EMAIL>. |
| Pseudocode | Yes | An algorithm outline and high-level description of the method can be found in Appendix C. Algorithm 1: LEMoN: Label Error Detection Using Multimodal Neighbors |
| Open Source Code | Yes | 1Code: https://github.com/MLforHealth/LEMoN |
| Open Datasets | Yes | We evaluate our method using eight datasets, as shown in Table 1. Four datasets (CIFAR-10, CIFAR-100, Stanford Cars, miniImageNet) are label error detection datasets from the classification setting. The four remaining datasets are image captioning datasets. For MSCOCO and Flickr30k, we use the Karpathy split (Karpathy & Fei-Fei, 2015). The remaining datasets were randomly split into: training or reference set for the label detection method (80%), validation set for hyperparameter selection (10%), and test set for performance evaluation (10%). |
| Dataset Splits | Yes | For MSCOCO and Flickr30k, we use the Karpathy split (Karpathy & Fei-Fei, 2015). The remaining datasets were randomly split into: training or reference set for the label detection method (80%), validation set for hyperparameter selection (10%), and test set for performance evaluation (10%). |
| Hardware Specification | Yes | We run our experiments on a shared Slurm cluster. Most experiments used one RTX A6000 with 48 GB VRAM, 10 CPU cores of Intel Xeon Ice Lake Platinum 8368, and 50 GB RAM. |
| Software Dependencies | No | The paper mentions software like CLIP, LLaVA, Llama-3.1-8B-Instruct, InstructBLIP-Vicuna-7B, and GIT (Wang et al., 2022a), but it does not provide specific version numbers for general software dependencies like Python, PyTorch, or CUDA. |
| Experiment Setup | Yes | For LEMoN-OPT, we select the hyperparameter combination that maximizes F1 on a labeled validation set. We report the AUROC, macro-averaged AUPRC, and F1 for this model. For LEMoN-FIX, we fix the hyperparameters at the following reasonable values: k = 30, β = γ = 5, τ_{1,n} = τ_{1,m} = 0.1, and τ_{2,n} = τ_{2,m} = 5. We report AUROC and AUPRC, as the F1 requires additional information to compute a threshold for the score. |
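The evaluation protocol quoted above reports AUROC and F1, where F1 requires a decision threshold chosen on a labeled validation set. A minimal sketch of that protocol in plain Python is below; the function names and the toy scores are illustrative assumptions, not code from the paper's repository.

```python
def auroc(scores, labels):
    """Rank-based AUROC (Mann-Whitney U statistic): probability that a
    randomly chosen error (label 1) scores higher than a non-error (label 0)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def best_f1_threshold(scores, labels):
    """Sweep candidate thresholds on a validation set and return the
    (best F1, threshold) pair, mirroring the labeled-validation-set
    selection described for the F1 metric."""
    best_f1, best_t = 0.0, None
    for t in sorted(set(scores)):
        pred = [s >= t for s in scores]
        tp = sum(p and y for p, y in zip(pred, labels))
        fp = sum(p and not y for p, y in zip(pred, labels))
        fn = sum((not p) and y for p, y in zip(pred, labels))
        f1 = 2 * tp / (2 * tp + fp + fn) if tp else 0.0
        if f1 > best_f1:
            best_f1, best_t = f1, t
    return best_f1, best_t

# Toy example: higher score = more likely a label error.
scores = [0.9, 0.8, 0.3, 0.2]
labels = [1, 1, 0, 0]
print(auroc(scores, labels))             # → 1.0 (perfect ranking)
print(best_f1_threshold(scores, labels)) # → (1.0, 0.8)
```

This matches why the report distinguishes the two variants: AUROC and AUPRC are threshold-free, so they can be reported for LEMoN-FIX, while F1 needs the extra validation-set threshold search.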