V-LoL: A Diagnostic Dataset for Visual Logical Learning

Authors: Lukas Helff, Wolfgang Stammer, Hikaru Shindo, Devendra Singh Dhami, Kristian Kersting

DMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate a variety of AI systems, including traditional symbolic AI, neural AI, and neuro-symbolic AI. Our evaluations demonstrate that even state-of-the-art AI faces difficulties with visual logical learning challenges, highlighting the unique advantages and limitations of each methodology. Quantitative results (unless noted otherwise) correspond to test-set classification accuracy. The hyperparameters for each model are the same over all challenges; details on these can be found in the supplement (cf. Sec. D). Unless specified otherwise, the evaluation datasets were sampled from the Michalski distribution. Finally, runs aborted due to code instabilities, memory overflows, or infinite loops are marked with .
Researcher Affiliation | Academia | 1 AI and ML Group, TU Darmstadt; 2 Hessian Center for AI (hessian.AI); 3 Eindhoven University of Technology; 4 Centre for Cognitive Science, TU Darmstadt; 5 German Center for AI (DFKI)
Pseudocode | Yes | Algorithm 1: V-LoL train generation for a single image.
1. distr ← SelectDist() // User selects attribute distribution
2. symb_train ← SampleSymbTrain(distr) // Sample symbolic train representation
3. logic_program ← DefineLogic() // User defines rule as logic program
4. class ← Eval(symb_train, logic_program) // Evaluate class affinity based on logic program
5. visuals ← SelectVisuals() // User selects background and object visuals
6. v_lol_train ← Generate(symb_train, class, visuals) // Create scene and render image
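The generation steps above can be sketched in plain Python. All function names, attribute values, and the toy rule (eastbound if the train contains a short, closed car, one of Michalski's classic rules) are illustrative assumptions, not the actual V-LoL generator; the visual steps 5–6 (scene setup and Blender rendering) are omitted.

```python
import random

def select_dist():
    # Step 1: the user selects an attribute distribution (toy attributes here).
    return {"length": ["short", "long"], "roof": ["open", "closed"]}

def sample_symb_train(distr, n_cars=4, seed=0):
    # Step 2: sample a symbolic train representation (a list of car attribute dicts).
    rng = random.Random(seed)
    return [{attr: rng.choice(vals) for attr, vals in distr.items()}
            for _ in range(n_cars)]

def define_logic():
    # Step 3: the classification rule as a (toy) logic program, here
    # eastbound(T) :- has_car(T, C), short(C), closed(C).
    return lambda train: any(car["length"] == "short" and car["roof"] == "closed"
                             for car in train)

def evaluate(symb_train, logic_program):
    # Step 4: evaluate the class label by querying the logic program.
    return "eastbound" if logic_program(symb_train) else "westbound"

# Steps 5-6 (visual selection, rendering) are stubbed out in this sketch.
symb_train = sample_symb_train(select_dist())
label = evaluate(symb_train, define_logic())
```

The key design point the pseudocode encodes is the separation of the symbolic representation (steps 1–4) from the visual rendering (steps 5–6), which is what lets the same logical rule be paired with different visual distributions.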
Open Source Code | Yes | All code and data are available at https://sites.google.com/view/v-lol; the dataset, data generator, and experimental code for reproducing the results are bundled on this website. All data is released under the Creative Commons CC BY 4.0 license; all code is released under the MIT license.
Open Datasets | Yes | All code and data are available at https://sites.google.com/view/v-lol; the dataset, data generator, and experimental code for reproducing the results are bundled on this website. All data is released under the Creative Commons CC BY 4.0 license.
Dataset Splits | Yes | Unless stated otherwise, the problem setup in our evaluations is classification with a training dataset and a held-out test set of images belonging to one of two classes (eastbound and westbound), and the training set includes 1k images. All models are trained on the specified training splits and evaluated with stratified 5-fold cross-validation on a held-out test set containing 2k images. We use an 80%/20% train/test split for the datasets, with the exception of the OOD variant, which is used for evaluation only.
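The stratified 5-fold protocol described above can be sketched in pure Python (a minimal stand-in for e.g. scikit-learn's `StratifiedKFold`; the class names are the paper's two labels, and the helper name is ours):

```python
from collections import defaultdict

def stratified_kfold(labels, k=5):
    """Yield k (train_idx, test_idx) pairs that preserve per-class proportions."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    folds = [[] for _ in range(k)]
    for idxs in by_class.values():
        for i, idx in enumerate(idxs):
            folds[i % k].append(idx)  # deal each class round-robin into the folds
    for f in range(k):
        test = sorted(folds[f])
        train = sorted(i for g in range(k) if g != f for i in folds[g])
        yield train, test

labels = ["eastbound"] * 50 + ["westbound"] * 50
splits = list(stratified_kfold(labels, k=5))
```

Because each class is dealt round-robin, every fold contains the same eastbound/westbound ratio as the full label set, which is what "stratified" guarantees.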
Hardware Specification | Yes | All code was run on multiple NVIDIA A100-SXM4-40GB GPUs.
Software Dependencies | Yes | To render the V-LoL images, we utilize Python 3.10.2 and the Blender Python module version 3.3. For the perception modules of the neuro-symbolic AI systems, we modify the improved Mask R-CNN (v2) of Li et al. (2021) to allow for multi-label instance segmentation.
Experiment Setup | Yes | The hyperparameters for each model are the same over all challenges; details on these can be found in the supplement (cf. Sec. D). For Popper, we set the hyperparameters to allow for a maximum of 10 rules, each with at most 6 variables and 6 literals in its body. Predicate invention and recursion are turned off, as we could not observe any performance improvement. For ALEPH we use the following hyperparameters: clauselength = 10, minacc = 0.6, minscore = 3, minpos = 3, nodes = 5000, explore = true, max_features = 10. Subsequently, the models are transfer-trained on the respective datasets for 25 epochs using a batch size of 50, starting with a learning rate of 0.001 (0.0001 for the Vision Transformer), which decreases by 20% every five epochs. The Adam optimizer is used for updating the models' weights, and the cross-entropy loss function for calculating the loss.
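The stated schedule (start at 1e-3, or 1e-4 for the Vision Transformer, multiplied by 0.8 every five epochs over 25 epochs) is a standard step decay; a small sketch of the resulting per-epoch learning rates, with the function name ours:

```python
def lr_at_epoch(epoch, base_lr=1e-3, drop=0.8, step=5):
    # The learning rate decreases by 20% (i.e., is multiplied by 0.8)
    # after every `step` epochs; `base_lr` is 1e-4 for the ViT.
    return base_lr * drop ** (epoch // step)

schedule = [lr_at_epoch(e) for e in range(25)]
```

In PyTorch, the same schedule would typically be expressed as `torch.optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.8)` on top of `torch.optim.Adam`.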