Robustness to Spurious Correlations via Dynamic Knowledge Transfer

Authors: Xiaoling Zhou, Wei Ye, Zhemg Lee, Shikun Zhang

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the effectiveness of the DKT framework in mitigating spurious correlations, achieving state-of-the-art performance across three typical learning scenarios susceptible to such correlations.
Researcher Affiliation | Academia | Xiaoling Zhou (Peking University), Wei Ye (Peking University), Zhemg Lee (Tianjin University), Shikun Zhang (Peking University). EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | Yes | Optimization Process: We introduce a reinforcement learning-based algorithm for the alternating optimization of parameters in both the target ($\theta$) and strategy ($\Omega$) networks. With $\Omega$ fixed, the optimization subproblem for the target network can be defined as $\min_\theta \mathbb{E}_{(x,y)\sim D_{tr}}\, \mathbb{E}_{A\sim p(A\mid\xi;\Omega)} \left[\mathcal{L}_{DKT}(\theta,\Omega)\right]$ (10) ... $\theta^{t+1} = \theta^t - \eta_1 \frac{1}{n}\sum_{i=1}^{n} \nabla_\theta \mathcal{L}_{DKT}^{i}$ (11) ... $\Omega^{t+1} = \Omega^t - \eta_2 \nabla_\Omega H(\Omega^t)$ (15)
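The alternating scheme quoted above can be sketched in PyTorch. This is a minimal illustration, not the paper's implementation: `dkt_loss` and `strategy_objective` are hypothetical stand-ins for $\mathcal{L}_{DKT}$ and $H$, whose exact forms (including the policy distribution $p(A\mid\xi;\Omega)$) are not given in this excerpt; here the strategy network simply produces per-sample weights.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

# Hypothetical stand-in for L_DKT (Eq. 10): Omega weights each sample's loss.
def dkt_loss(theta_net, omega_net, x, y):
    weights = torch.sigmoid(omega_net(x)).squeeze(-1)
    per_sample = F.cross_entropy(theta_net(x), y, reduction="none")
    return (weights * per_sample).mean()

# Hypothetical stand-in for H (Eq. 15): judge Omega by the target
# network's loss on held-out data.
def strategy_objective(theta_net, omega_net, x, y):
    return dkt_loss(theta_net, omega_net, x, y)

torch.manual_seed(0)
theta_net = nn.Linear(4, 2)   # target network (theta)
omega_net = nn.Linear(4, 1)   # strategy network (Omega)
opt_theta = torch.optim.SGD(theta_net.parameters(), lr=0.1)    # eta_1
opt_omega = torch.optim.SGD(omega_net.parameters(), lr=0.01)   # eta_2

x_tr, y_tr = torch.randn(32, 4), torch.randint(0, 2, (32,))    # D_tr batch
x_val, y_val = torch.randn(32, 4), torch.randint(0, 2, (32,))  # held-out batch

for _ in range(5):
    # Eq. (11): theta step on the averaged gradient of L_DKT, Omega fixed
    loss = dkt_loss(theta_net, omega_net, x_tr, y_tr)
    opt_theta.zero_grad()
    loss.backward()
    opt_theta.step()

    # Eq. (15): Omega step on the held-out objective H, theta fixed
    h = strategy_objective(theta_net, omega_net, x_val, y_val)
    opt_omega.zero_grad()
    h.backward()
    opt_omega.step()
    opt_theta.zero_grad()  # discard theta gradients produced by the Omega step
```

The two `backward()` calls share one computation graph per step, but only the optimizer belonging to the network being updated takes a step, which mirrors the "with the other network fixed" structure of the alternating subproblems.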
Open Source Code | No | The paper does not explicitly state that source code is provided or offer a link to a repository for the methodology described.
Open Datasets | Yes | We evaluate the performance of DKT under four subpopulation shift datasets: Colored MNIST (CMNIST) [Yao et al., 2022], Waterbirds [Sagawa et al., 2020], CelebA [Liu et al., 2016], and CivilComments [Borkan et al., 2019]. ... We employ two GLT benchmarks [Tang et al., 2022]: ImageNet-GLT and MSCOCO-GLT. ... We examine three domain shift benchmarks featuring out-of-distribution test data. These benchmarks (i.e., Camelyon17 [Bandi et al., 2018], FMoW [Christie et al., 2018], and RxRx1 [Taylor et al., 2019]) are sourced from WILDS [Koh et al., 2021].
Dataset Splits | Yes | We evaluate the performance of DKT under four subpopulation shift datasets: Colored MNIST (CMNIST) [Yao et al., 2022], Waterbirds [Sagawa et al., 2020], CelebA [Liu et al., 2016], and CivilComments [Borkan et al., 2019]. In these datasets, certain attributes are highly spuriously correlated with the labels. Following Yao et al. [2022], we adopt pre-trained ResNet-50 [He et al., 2016] and DistilBERT [Sanh et al., 2019] as the models for image (i.e., CMNIST, Waterbirds, CelebA) and text data (i.e., CivilComments), respectively. ... Each benchmark consists of three protocols: Class-wise Long Tail (CLT), Attribute-wise Long Tail (ALT), and GLT, showcasing variations in class distribution, attribute distribution, and combinations of both between training and testing datasets.
Hardware Specification | No | The paper does not provide specific details about the hardware used for experiments, such as GPU or CPU models.
Software Dependencies | No | The paper mentions using specific models like ResNet-50 and DistilBERT, but does not provide specific software dependencies (e.g., libraries with version numbers) used for implementation.
Experiment Setup | Yes | For all experiments, the parameter sets for $\alpha_i$, $\beta_i$, and $\gamma_i$ range from 0.1 to 1, with intervals of 0.1.
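The stated search space is small enough to enumerate exactly. This sketch only lists the grid implied by the quote; how the paper selects a combination (e.g., by validation accuracy) is not specified in this excerpt.

```python
from itertools import product

# Each of alpha_i, beta_i, gamma_i takes values in {0.1, 0.2, ..., 1.0}
# (0.1 to 1 in steps of 0.1), giving 10^3 candidate triples.
grid = [round(0.1 * k, 1) for k in range(1, 11)]
configs = list(product(grid, repeat=3))  # all (alpha, beta, gamma) triples

print(grid[0], grid[-1], len(configs))  # 0.1 1.0 1000
```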