A Decade's Battle on Dataset Bias: Are We There Yet?
Authors: Zhuang Liu, Kaiming He
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We revisit the dataset classification experiment suggested by Torralba & Efros (2011) a decade ago, in the new era with large-scale, diverse, and hopefully less biased datasets as well as more capable neural network architectures. Surprisingly, we observe that modern neural networks can achieve excellent accuracy in classifying which dataset an image is from: e.g., we report 84.7% accuracy on held-out validation data for the three-way classification problem consisting of the YFCC, CC, and DataComp datasets. |
| Researcher Affiliation | Industry | Zhuang Liu and Kaiming He, Meta AI Research (FAIR) |
| Pseudocode | No | The paper describes methods and processes in paragraph form and tables, but does not include any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code: github.com/liuzhuang13/bias |
| Open Datasets | Yes | Examples include YFCC100M (Thomee et al., 2016), CC12M (Changpinyo et al., 2021), and DataComp-1B (Gadre et al., 2023), the main datasets we study in this paper, among many others (Sun et al., 2017; Desai et al., 2021; Srinivasan et al., 2021; Schuhmann et al., 2022). |
| Dataset Splits | Yes | By default, we randomly sample 1M and 10K images from each dataset as training and validation sets, respectively. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory amounts used for the experiments. |
| Software Dependencies | No | The paper mentions software components like 'AdamW', 'randaug', 'mixup', and 'cutmix' along with their respective citations, and also 'ViT-B' and 'MAE (He et al., 2022)', but does not provide specific version numbers for any software libraries, frameworks, or programming languages used (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | The complete training recipe is shown in Table 10: optimizer AdamW; learning rate 1e-3; weight decay 0.3; optimizer momentum β1, β2 = 0.9, 0.95; batch size 4096; learning rate schedule cosine decay; warmup epochs 20 (ImageNet-1K); training epochs 300 (ImageNet-1K); randaug (Cubuk et al., 2020) (9, 0.5); label smoothing 0.1; mixup (Zhang et al., 2018b) 0.8; cutmix (Yun et al., 2019) 1.0 |
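The Table 10 recipe quoted above can be sketched as a plain configuration plus a warmup-then-cosine learning-rate schedule. This is a minimal reconstruction: the hyperparameter values come from the paper's table, but the exact schedule formula and the `lr_at_epoch` helper are common-convention assumptions, not code released by the authors.

```python
import math

# Hyperparameters quoted from Table 10 of the paper (ImageNet-1K settings).
CONFIG = {
    "optimizer": "AdamW",
    "learning_rate": 1e-3,
    "weight_decay": 0.3,
    "betas": (0.9, 0.95),       # optimizer momentum β1, β2
    "batch_size": 4096,
    "warmup_epochs": 20,
    "training_epochs": 300,
    "randaug": (9, 0.5),        # magnitude, probability
    "label_smoothing": 0.1,
    "mixup": 0.8,
    "cutmix": 1.0,
}


def lr_at_epoch(epoch: int,
                base_lr: float = CONFIG["learning_rate"],
                warmup: int = CONFIG["warmup_epochs"],
                total: int = CONFIG["training_epochs"]) -> float:
    """Linear warmup followed by cosine decay to zero.

    The paper states the schedule type ("cosine decay" with warmup) but not
    the closed form; this is the usual formulation, assumed here.
    """
    if epoch < warmup:
        # Ramp linearly from base_lr/warmup up to base_lr.
        return base_lr * (epoch + 1) / warmup
    # Cosine decay over the remaining epochs.
    progress = (epoch - warmup) / (total - warmup)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these settings the learning rate rises to 1e-3 over the first 20 epochs, then decays smoothly toward zero by epoch 300.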