Curriculum Abductive Learning for Mitigating Reasoning Shortcuts

Authors: Wen-Da Wei, Xiao-Wen Yang, Jie-Jing Shao, Lan-Zhe Guo

IJCAI 2025

Reproducibility Assessment (variable, result, and supporting response):
Research Type: Experimental. In this section, we conduct experiments to verify our claims and validate the superior performance of Cur ABL. In Subsection 5.1, we verify the clustering capability of the perception model of ABL after training on the MNIST-Addition dataset. In Subsection 5.2, we describe the experimental setup. In Subsection 5.3, we evaluate the effectiveness of Cur ABL on two datasets, MNIST-Addition and Handwritten Formula Recognition, by comparing it with different ABL methods.
Researcher Affiliation: Academia. (1) National Key Laboratory for Novel Software Technology, Nanjing University, China; (2) School of Artificial Intelligence, Nanjing University, China; (3) School of Intelligence Science and Technology, Nanjing University, China.
Pseudocode: Yes. The paper presents three algorithms:

Algorithm 1 (Cold Start Graph Construction). Input: training set S = {(x_i, y_i)}, i = 1..N; encoder E; threshold τ. Parameter: number of intermediate concepts k. Output: k undirected graphs {G_1, G_2, . . . , G_k}.

Algorithm 2 (Incorrect Pseudo-Label Removal). Input: k undirected graphs {G_1, G_2, . . . , G_k}; initial pseudo-label candidate sets {C_1, C_2, . . . , C_N} for N samples. Parameter: number of intermediate concepts k. Output: updated pseudo-label candidate sets {C_1, C_2, . . . , C_N}.

Algorithm 3 (Cur ABL Training Workflow). Input: training set S = {(x_i, y_i)}, i = 1..N; updated pseudo-label candidate sets {C_1, C_2, . . . , C_N}; batch size B; number of epochs E. Parameter: model f; learning rate η. Output: trained ABL model f.
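The paper's algorithm listings give only the interfaces (inputs, parameters, outputs), not the bodies. A minimal sketch of what Algorithm 1's cold-start graph construction might look like, assuming each of the k graphs groups samples sharing the same pseudo-label and connects pairs whose encoder embeddings exceed the cosine-similarity threshold τ (the function name, the similarity measure, and the per-concept grouping are all assumptions, not details from the paper):

```python
import itertools
import numpy as np

def cold_start_graphs(embeddings, pseudo_labels, k, tau=0.95):
    """Hypothetical sketch of Algorithm 1: build one undirected graph per
    intermediate concept, connecting sample pairs whose encoder embeddings
    have cosine similarity of at least tau."""
    graphs = {c: [] for c in range(k)}  # one edge list per concept
    # Normalize rows so a dot product equals cosine similarity.
    norms = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    for c in range(k):
        idx = [i for i, y in enumerate(pseudo_labels) if y == c]
        for i, j in itertools.combinations(idx, 2):
            if float(norms[i] @ norms[j]) >= tau:
                graphs[c].append((i, j))
    return graphs
```

The threshold τ = 0.95 matches the value quoted in the Experiment Setup row below; everything else is illustrative.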
Open Source Code: No. The paper does not contain any explicit statement about releasing source code or a link to a code repository for the methodology described.
Open Datasets: Yes. The MNIST-Addition task [Manhaeve et al., 2018] takes two images of handwritten digits as input and outputs their sum. Additionally, the paper conducts experiments on the Handwritten Formula Recognition (HWF) task [Li et al., 2020].
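For concreteness, the MNIST-Addition setup described above can be sketched as a small data-construction helper: each training sample is a pair of digit images whose only supervision is the sum of the two digits, with the individual digit labels hidden from the learner. The function name and random pairing below are illustrative assumptions, not details from the paper:

```python
import random

def make_mnist_addition_pairs(images, digit_labels, n_pairs, seed=0):
    """Sketch of MNIST-Addition: pair up digit images at random and label
    each pair only with the sum of the two underlying digits (0..18)."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_pairs):
        i = rng.randrange(len(images))
        j = rng.randrange(len(images))
        # Supervision is the sum; digit_labels[i] and digit_labels[j]
        # themselves are never exposed to the perception model.
        pairs.append(((images[i], images[j]), digit_labels[i] + digit_labels[j]))
    return pairs
```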
Dataset Splits: No. The paper reports total sample counts for the MNIST-Addition (30,000) and HWF (10,000) datasets, and describes how HWF-M and HWF-H were created by filtering on complexity (e.g., removing samples where the cardinality of the candidate set of all pseudo-labels was less than 10). However, it does not provide train/validation/test split percentages, per-split sample counts, or explicit references to standard splits needed to reproduce the data partitioning.
Hardware Specification: Yes. All experiments are implemented in PyTorch and are conducted on an NVIDIA RTX 3090 GPU.
Software Dependencies: No. The paper states that all experiments are implemented in PyTorch and conducted on an NVIDIA RTX 3090 GPU, but it does not provide a specific PyTorch version number, nor does it list any other software dependencies with version numbers.
Experiment Setup: Yes. In the cold start method, we first train the ABL model on the dataset for six epochs, and we set the threshold τ = 0.95. To ensure reliability of our results, all experiments are repeated five times with different random seeds.