Advancing Generalization Across a Variety of Abstract Visual Reasoning Tasks

Authors: Mikołaj Małkiński, Jacek Mańdziuk

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. The experiments demonstrate strong generalization capabilities of the proposed model, which in several settings outperforms existing methods from the literature.
Researcher Affiliation: Academia. 1) Warsaw University of Technology, Warsaw, Poland; 2) AGH University of Krakow, Krakow, Poland. EMAIL, EMAIL
Pseudocode: No. The paper describes the model architecture (Fig. 5) and its components in detail but does not present any structured pseudocode or algorithm blocks.
Open Source Code: Yes. The code for reproducing all experiments is publicly accessible at https://github.com/mikomel/raven
Open Datasets: Yes. We utilize four RPM datasets: PGM [Barrett et al., 2018], I-RAVEN [Hu et al., 2021], I-RAVEN-Mesh and A-I-RAVEN [Małkiński and Mańdziuk, 2025a]. We extend the evaluation of PoNG beyond RPMs to two benchmarks comprising visual analogies with both synthetic [Hill et al., 2019] and real-world [Bitton et al., 2023] images.
Dataset Splits: Yes.
- PGM: each regime contains 1.42M RPMs, with 1.2M / 20K / 200K in the train / validation / test splits, respectively.
- I-RAVEN: 10K matrices per configuration (70K in total), split into train / validation / test at a 60/20/20 ratio.
- VAP: each regime contains 710K matrices, with 600K / 10K / 100K devoted to the train / validation / test splits, respectively.
- VASR: Silver data is used for training, with 150K / 2.25K / 2.55K matrices in the train / validation / test splits, respectively.
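As a quick sanity check on the ratio-based split above, the 60/20/20 division of I-RAVEN's 70K matrices can be computed directly; the helper below is a minimal sketch (the function name is hypothetical, not from the released code):

```python
def split_sizes(total, ratios=(0.6, 0.2, 0.2)):
    """Return (train, val, test) sizes for a total count and ratio triple.

    Any rounding remainder is assigned to the training split so the
    three sizes always sum to `total`.
    """
    sizes = [int(total * r) for r in ratios]
    sizes[0] += total - sum(sizes)
    return tuple(sizes)

# I-RAVEN: 70K matrices at a 60/20/20 ratio
print(split_sizes(70_000))  # -> (42000, 14000, 14000)

# PGM splits are stated absolutely: 1.2M + 20K + 200K per 1.42M regime
assert 1_200_000 + 20_000 + 200_000 == 1_420_000
```

The remainder-to-train convention is one common choice; the datasets themselves ship with fixed splits, so this is only an arithmetic check.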
Hardware Specification: Yes. All experiments are performed on a single GPU (NVIDIA DGX A100).
Software Dependencies: No. The paper mentions the Adam optimizer, Vision Transformer (ViT), and TCN, but does not specify any software versions for these or other libraries.
Experiment Setup: Yes. PoNG is trained using a standard training strategy involving the Adam optimizer [Kingma and Ba, 2014] with default hyperparameters (λ = 0.001, β1 = 0.9, β2 = 0.999, ϵ = 10^-8). The learning rate λ is reduced by a factor of 10 after 5 epochs without improvement in validation loss. Early stopping is applied after 10 epochs without validation loss reduction. We use batch size B = 128 for experiments on RAVEN-like datasets and B = 256 in the remaining cases to reduce training time on large datasets.
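The plateau-based schedule described above (reduce the learning rate by 10x after 5 stagnant epochs, stop after 10) can be sketched in framework-agnostic Python; the class and method names here are hypothetical illustrations, not taken from the released code:

```python
class PlateauController:
    """Tracks validation loss across epochs.

    Divides the learning rate by 10 every `lr_patience` consecutive
    epochs without improvement, and signals early stopping once
    `stop_patience` stagnant epochs have accumulated, mirroring the
    schedule described in the experiment setup.
    """

    def __init__(self, lr=1e-3, lr_patience=5, stop_patience=10):
        self.lr = lr
        self.lr_patience = lr_patience
        self.stop_patience = stop_patience
        self.best = float("inf")
        self.stagnant = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best:
            self.best = val_loss
            self.stagnant = 0
            return False
        self.stagnant += 1
        if self.stagnant % self.lr_patience == 0:
            self.lr /= 10  # reduce learning rate by a factor of 10
        return self.stagnant >= self.stop_patience
```

In a framework such as PyTorch, the same behavior is typically obtained by combining a plateau-based LR scheduler (factor 0.1, patience 5) with a separate early-stopping check (patience 10) in the training loop.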