Common Sense Bias Modeling for Classification Tasks

Authors: Miao Zhang, Zee Fryer, Ben Colman, Ali Shahriyari, Gaurav Bharaj

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Downstream experiments show that our method uncovers novel model biases in multiple image benchmark datasets. Furthermore, the discovered bias can be mitigated by simple data re-weighting to de-correlate the features, outperforming state-of-the-art unsupervised bias mitigation methods.
Researcher Affiliation | Collaboration | 1. New York University; 2. Reality Defender Inc.
Pseudocode | No | The paper states: "Detailed algorithm steps are in the appendix (Zhang et al. 2024a)." However, no pseudocode or algorithm blocks are provided within the main body of the paper.
Open Source Code | No | The paper does not contain any explicit statements about releasing source code for the methodology, nor does it provide a direct link to a code repository.
Open Datasets | Yes | CelebA-Dialog (Jiang et al. 2021) is a visual-language dataset including captions for images in CelebA (Liu et al. 2015), a popular face recognition benchmark dataset containing 40 facial attributes. MS-COCO 2014 (Lin et al. 2014) is a large-scale scene image dataset including annotations for 80 objects, and each image has 5 descriptive captions generated independently (Chen et al. 2015).
Dataset Splits | Yes | For the downstream training, we use the same training, validation, and testing split as in (Zhang et al. 2022b) for the CelebA dataset, and as in (Misra et al. 2016) for the MS-COCO dataset. However, because of the label scarcity of individual objects relative to the entire dataset (for example, only 3% of images in MS-COCO contain Cat and 2% of images contain Kite), to avoid the class imbalance problem introducing confounding bias to the classifier, we randomly take subsets of the dataset splits which include 50% of each target label.
Hardware Specification | No | The paper does not provide specific details about the hardware used to run the experiments.
Software Dependencies | No | The paper mentions using the "spaCy en_core_web_sm model" and the "Universal Sentence Encoder" but does not specify their version numbers or other software dependencies with version numbers.
Experiment Setup | Yes | For correlation reasoning, the noun chunks from the description corpus are encoded as vectors in 512 dimensions using the Universal Sentence Encoder. Dimension reduction is performed using PCA before clustering, ensuring that the sum of the PCA components accounts for a high amount of variance (>90%). The distance criterion for generating feature clusters is set to z = 1.0; see the ablation sections for sensitivity analysis. We use the Chi-square test (Pearson 1900) to verify the significance of derived correlation coefficients. For the downstream training, we use the same training, validation, and testing split as in (Zhang et al. 2022b) for the CelebA dataset, and as in (Misra et al. 2016) for the MS-COCO dataset. Experiments are based on three independent runs which provide the mean and standard deviation results.
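The correlation-reasoning steps quoted above (512-d sentence embeddings, PCA retaining >90% variance, distance-criterion clustering, chi-square significance testing) can be sketched with standard scientific-Python tools. This is a minimal illustration, not the paper's implementation: random vectors stand in for Universal Sentence Encoder outputs, the linkage method is an assumed choice, and the 2x2 co-occurrence table is toy data.

```python
import numpy as np
from sklearn.decomposition import PCA
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
# Stand-in for 512-d Universal Sentence Encoder embeddings of noun chunks
embeddings = rng.normal(size=(40, 512))

# PCA keeping the smallest number of components explaining >90% variance
pca = PCA(n_components=0.90)
reduced = pca.fit_transform(embeddings)

# Hierarchical clustering cut with a distance criterion (z = 1.0 in the paper);
# average linkage is an assumption for this sketch
clusters = fcluster(linkage(reduced, method="average"), t=1.0, criterion="distance")

# Chi-square test of a toy 2x2 feature co-occurrence table
table = np.array([[30, 10], [5, 25]])
chi2, p, dof, expected = chi2_contingency(table)
```

With real USE embeddings, semantically similar noun chunks fall within the distance threshold and merge into one feature cluster; the chi-square test is then applied to cluster/label co-occurrence counts rather than this toy table.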
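The review also quotes the paper's claim that discovered biases "can be mitigated by simple data re-weighting to de-correlate the features." One common way to realize such re-weighting, sketched here on synthetic binary labels (the group definition and data are illustrative assumptions, not the paper's method), is to weight each sample by the inverse frequency of its (target, feature) group so the two variables become independent under the weighted distribution.

```python
import numpy as np

rng = np.random.default_rng(1)
target = rng.integers(0, 2, size=1000)
# A spurious feature that co-occurs with the target 80% of the time
feature = np.where(rng.random(1000) < 0.8, target, 1 - target)

# Count each (target, feature) group and weight samples by inverse group size
counts = np.zeros((2, 2))
np.add.at(counts, (target, feature), 1)
weights = 1.0 / counts[target, feature]

# Weighted covariance between target and feature under normalized weights
w = weights / weights.sum()
mt, mf = np.sum(w * target), np.sum(w * feature)
cov = np.sum(w * (target - mt) * (feature - mf))
```

Because each of the four groups receives equal total weight, the weighted joint distribution is uniform, so the weighted covariance between target and feature is (numerically) zero, i.e. the features are de-correlated.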