A-I-RAVEN and I-RAVEN-Mesh: Two New Benchmarks for Abstract Visual Reasoning
Authors: Mikołaj Małkiński, Jacek Mańdziuk
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate 13 strong models from the AVR literature on the introduced datasets, revealing their specific shortcomings in generalization and knowledge transfer. |
| Researcher Affiliation | Academia | 1 Warsaw University of Technology, Warsaw, Poland; 2 AGH University of Krakow, Krakow, Poland |
| Pseudocode | No | The paper describes methods and processes but does not contain a clearly labeled pseudocode or algorithm block. |
| Open Source Code | Yes | The code for reproducing all experiments is publicly accessible at: https://github.com/mikomel/raven |
| Open Datasets | Yes | First, we introduce Attributeless-I-RAVEN (A-I-RAVEN), comprising 10 generalization regimes. Next, we propose I-RAVEN-Mesh, a variant of I-RAVEN with a new grid-like structure overlaid on the matrices. The released code allows for generation of all datasets from scratch, eliminating the dependency on file-hosting services required to distribute the data. |
| Dataset Splits | Yes | In each experiment, we utilize 42 000 training, 14 000 validation, and 14 000 test matrices, following the standard data split protocol adopted in prior works [Zhang et al., 2019a; Hu et al., 2021]. |
| Hardware Specification | Yes | Experiments were run on a worker with a single NVIDIA DGX A100 GPU. |
| Software Dependencies | No | The paper mentions using the Adam optimizer with specific parameters and that the training job is packaged as a Docker image with fixed dependencies, but it does not explicitly list software dependencies with specific version numbers within the text. |
| Experiment Setup | Yes | In all experiments we use the Adam optimizer [Kingma and Ba, 2014] with β1 = 0.9, β2 = 0.999, ϵ = 10⁻⁸ and a batch size of 128. The learning rate is initialized to 0.001 and reduced 10-fold (at most 3 times) if the validation loss shows no improvement for 5 consecutive epochs; training stops early after 10 epochs without improvement. |
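The learning-rate schedule in the experiment setup (reduce 10-fold on a 5-epoch validation plateau, at most 3 reductions, early stop after 10 stale epochs) can be sketched as a small plain-Python simulation. This is an illustrative reconstruction of the described policy, not the authors' released code; the function name and the choice to reset the plateau counter after each reduction are assumptions.

```python
def simulate_schedule(val_losses, init_lr=1e-3, patience=5,
                      stop_patience=10, factor=0.1, max_reductions=3):
    """Replay a sequence of per-epoch validation losses under the paper's
    described policy and return (final_lr, stopped_early, epochs_run).

    Assumption: the plateau counter resets after an LR reduction (as in
    typical reduce-on-plateau schedulers), while the early-stopping counter
    resets only on an actual validation improvement.
    """
    lr = init_lr
    best = float("inf")
    plateau = 0   # epochs since improvement, reset when LR is reduced
    stale = 0     # epochs since improvement, reset only on improvement
    reductions = 0
    for epoch, loss in enumerate(val_losses, start=1):
        if loss < best:
            best, plateau, stale = loss, 0, 0
        else:
            plateau += 1
            stale += 1
            if stale >= stop_patience:      # 10 epochs without progress
                return lr, True, epoch
            if plateau >= patience and reductions < max_reductions:
                lr *= factor                # reduce LR 10-fold
                reductions += 1
                plateau = 0
    return lr, False, len(val_losses)
```

For example, one improvement followed by ten stale epochs triggers a single LR reduction at the fifth stale epoch and an early stop at the tenth.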