What Makes ImageNet Look Unlike LAION
Authors: Ali Shirali, Moritz Hardt
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this work, we carry out this counterfactual investigation. We find that the resulting ImageNet recreation, which we call LAIONet, looks distinctly unlike the original. Specifically, the intra-class similarity of images in the original ImageNet is dramatically higher than it is for LAIONet. Consequently, models trained on ImageNet perform significantly worse on LAIONet. We propose a rigorous explanation for the discrepancy in terms of a subtle yet important difference between two plausible causal data-generating processes for the respective datasets, which we support with systematic experimentation. |
| Researcher Affiliation | Academia | Ali Shirali, University of California, Berkeley; Moritz Hardt, Max Planck Institute for Intelligent Systems, Tübingen, and Tübingen AI Center |
| Pseudocode | No | The paper describes its methodology through textual explanations and figures such as causal graphs and distributions, but it does not contain any explicit pseudocode or algorithm blocks. |
| Open Source Code | Yes | All code is available at: https://github.com/alishiraliGit/eval-on-laion |
| Open Datasets | Yes | For nearly a decade, ImageNet (Deng et al., 2009) was the focal benchmark for much of computer vision and deep learning. Available to the academic public is the massive-scale LAION dataset, in two versions, featuring 400 million (Schuhmann et al., 2021) and 5 billion (Schuhmann et al., 2022) crawled image-text pairs. We also use ImageNetV2 (Recht et al., 2019) and the ImageNet-Captions dataset (Fang et al., 2022). |
| Dataset Splits | Yes | Unless otherwise stated, by ImageNet we mean the ImageNet ILSVRC-2012 dataset. The final dataset, which we call LAIONet, consists of 822k samples from 915 ImageNet classes. We can also create a more conservative version of LAIONet mimicking the ImageNet validation set by retaining only the top 50 most similar instances for each class. This uniform weighting is consistent with the setup of the ImageNet validation set, with 50 images per class. |
| Hardware Specification | No | The paper does not provide specific hardware details (e.g., GPU/CPU models, memory amounts) used for running its experiments. It mentions using models from Hugging Face checkpoints but not the underlying hardware. |
| Software Dependencies | No | The paper mentions using specific models such as the OpenAI CLIP model and MPNet (Song et al., 2020) and tools such as EAST (Zhou et al., 2017) and TrOCR (Li et al., 2023a), and states that models come from Hugging Face checkpoints. However, it does not provide version numbers for software dependencies such as the programming language (e.g., Python) or libraries (e.g., PyTorch, TensorFlow). |
| Experiment Setup | Yes | We use various versions of each model in terms of size (small, base, large, etc.), image resolution (224x224 or 384x384), patch resolution (16x16 or 32x32), and whether models are pre-trained on the complete ImageNet with 22k classes or not. All models are trained on ImageNet without extra training data. We use the cosine similarity of CLIP text embeddings to calculate this similarity; however, we make consistent observations using MPNet (Song et al., 2020) as the text encoder. We found the threshold of 0.82 the highest reasonable choice, as it allows for covering most classes... We select the similarity threshold of 0.58. We drop samples with more than one label to simplify evaluation on the dataset. Second, we drop images tagged as not-safe-for-work in LAION. Finally, we exclude images that contain text matching the name of their synset. |
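The text-similarity filtering quoted in the Experiment Setup row can be sketched as follows. This is a minimal illustration, not the authors' released code: the vectors below are toy NumPy embeddings standing in for real CLIP text embeddings, the function names are hypothetical, and the 0.82 cutoff is the caption-to-class-name threshold the paper reports.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def filter_by_similarity(caption_embs, class_embs, threshold=0.82):
    """Keep captions whose best-matching class similarity clears the threshold.

    caption_embs: embeddings of LAION captions (toy stand-ins for CLIP text embeddings).
    class_embs: embeddings of ImageNet class (synset) names.
    Returns (caption_index, best_class_index, similarity) triples.
    """
    kept = []
    for i, cap in enumerate(caption_embs):
        sims = [cosine_similarity(cap, cls) for cls in class_embs]
        best = int(np.argmax(sims))
        if sims[best] >= threshold:
            kept.append((i, best, sims[best]))
    return kept
```

In practice the embeddings would come from a CLIP (or MPNet) text encoder; the thresholding and best-class assignment logic is the part this sketch illustrates.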
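The conservative LAIONet variant described in the Dataset Splits row (keeping only the top 50 most similar instances per class, mirroring the 50-images-per-class ImageNet validation layout) amounts to a per-class top-k selection. A minimal sketch, assuming a hypothetical record schema of (class_id, similarity) pairs:

```python
from collections import defaultdict

def top_k_per_class(samples, k=50):
    """Group (class_id, similarity) records by class and keep the k
    highest-similarity entries in each group."""
    by_class = defaultdict(list)
    for class_id, sim in samples:
        by_class[class_id].append(sim)
    return {c: sorted(sims, reverse=True)[:k] for c, sims in by_class.items()}
```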