Common Sense Bias Modeling for Classification Tasks

Authors: Miao Zhang, Zee Fryer, Ben Colman, Ali Shahriyari, Gaurav Bharaj

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Downstream experiments show that our method uncovers novel model biases in multiple image benchmark datasets. Furthermore, the discovered bias can be mitigated by simple data re-weighting to de-correlate the features, outperforming state-of-the-art unsupervised bias mitigation methods.
Researcher Affiliation | Collaboration | 1. New York University; 2. Reality Defender Inc.
Pseudocode | No | The paper states: "Detailed algorithm steps are in the appendix (Zhang et al. 2024a)." However, no pseudocode or algorithm blocks are provided within the main body of the paper.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | CelebA-Dialog (Jiang et al. 2021) is a visual-language dataset including captions for images in CelebA (Liu et al. 2015), a popular face recognition benchmark dataset containing 40 facial attributes. MS-COCO 2014 (Lin et al. 2014) is a large-scale scene image dataset including annotations for 80 objects, and each image has 5 descriptive captions generated independently (Chen et al. 2015).
Dataset Splits | Yes | For the downstream training, we use the same training, validation, and testing split as in (Zhang et al. 2022b) for the CelebA dataset, and as in (Misra et al. 2016) for the MS-COCO dataset. However, because of the label scarcity of individual objects relative to the entire dataset (for example, only 3% of images in MS-COCO contain Cat and 2% of images contain Kite), to avoid the class imbalance problem introducing confounding bias to the classifier, we randomly take subsets of the dataset splits which include 50% of each target label.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using the "spaCy en_core_web_sm model" and the "Universal Sentence Encoder" but does not specify their version numbers or other software dependencies with version numbers.
Experiment Setup | Yes | For correlation reasoning, the noun chunks from the description corpus are encoded as vectors in 512 dimensions using the Universal Sentence Encoder. Dimension reduction is performed using PCA before clustering, ensuring that the sum of the PCA components accounts for a high amount of variance (>90%). The distance criterion for generating feature clusters is set to z = 1.0; see the ablation sections for sensitivity analysis. We use the Chi-square test (Pearson 1900) to verify the significance of derived correlation coefficients. For the downstream training, we use the same training, validation, and testing split as in (Zhang et al. 2022b) for the CelebA dataset, and as in (Misra et al. 2016) for the MS-COCO dataset. Experiments are based on three independent runs which provide the mean and standard deviation results.
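The correlation-reasoning steps quoted above (512-d sentence embeddings, PCA retaining >90% variance, distance-criterion clustering, chi-square significance testing) can be sketched with standard scientific-Python tools. This is a minimal illustration, not the paper's implementation: random vectors stand in for Universal Sentence Encoder outputs, the linkage method is an assumed choice, and the 2x2 co-occurrence table is toy data.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
# Stand-in for 512-d Universal Sentence Encoder embeddings of noun chunks
embeddings = rng.normal(size=(40, 512))

# PCA keeping the smallest number of components explaining >90% variance
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(embeddings)

# Hierarchical clustering cut with a distance criterion (z = 1.0 in the paper);
# average linkage is an assumption for this sketch
clusters = fcluster(linkage(reduced, method="average"), t=1.0, criterion="distance")

# Chi-square test of a toy 2x2 feature co-occurrence table
table = np.array([[30, 10], [5, 25]])
chi2, p, dof, expected = chi2_contingency(table)
```

With real USE embeddings, semantically similar noun chunks fall within the distance threshold and merge into one feature cluster; the chi-square test is then applied to cluster/label co-occurrence counts rather than this toy table.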
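The review also quotes the paper's claim that discovered biases "can be mitigated by simple data re-weighting to de-correlate the features." One common way to realize such re-weighting, sketched here on synthetic binary labels (the group definition and data are illustrative assumptions, not the paper's method), is to weight each sample by the inverse frequency of its (target, feature) group so the two variables become independent under the weighted distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.integers(0, 2, size=1000)
# A spurious feature that co-occurs with the target 80% of the time
feature = np.where(rng.random(1000) < 0.8, target, 1 - target)

# Count each (target, feature) group and weight samples by inverse group size
counts = np.zeros((2, 2))
np.add.at(counts, (target, feature), 1)
weights = 1.0 / counts[target, feature]

# Weighted covariance between target and feature under normalized weights
w = weights / weights.sum()
mt, mf = np.sum(w * target), np.sum(w * feature)
cov = np.sum(w * (target - mt) * (feature - mf))
```

Because each of the four groups receives equal total weight, the weighted joint distribution is uniform, so the weighted covariance between target and feature is (numerically) zero, i.e. the features are de-correlated.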