COLUMBUS: Evaluating COgnitive Lateral Understanding Through Multiple-Choice reBUSes

Authors: Koen Kraaijveld, Yifan Jiang, Kaixin Ma, Filip Ilievski

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | COLUMBUS: Evaluating COgnitive Lateral Understanding Through Multiple-Choice reBUSes... An experimental analysis with COLUMBUS, involving human participants and representative state-of-the-art (SotA) vision-language models evaluated in a zero-shot setting, reveals that models perform decently but lag behind humans.
Researcher Affiliation | Collaboration | Koen Kraaijveld (1), Yifan Jiang (2), Kaixin Ma (3), Filip Ilievski (1). (1) Department of Computer Science, Faculty of Science, Vrije Universiteit Amsterdam; (2) Information Sciences Institute, University of Southern California; (3) Tencent AI Lab, Bellevue, WA. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the 'Graph Generation Algorithm' in prose in Section 3.2 but does not present it as structured pseudocode or an algorithm block.
Open Source Code | Yes | Code: https://github.com/koen-47/COLUMBUS
Open Datasets | Yes | We start by scraping common English phrases from publicly available sources, namely Wiktionary and www.theidioms.com, yielding 9,745 instances. We use the Large Database of English Compounds (LaDEC) (Gagné, Spalding, and Schmidtke 2019) for compound words. This dataset has been feature-engineered and curated by humans, consisting of 8,957 compounds.
Dataset Splits | Yes | Benchmark Composition. We split the benchmark into two partitions: COLUMBUS-TEXT, which contains only text puzzles, and COLUMBUS-ICON, in which each puzzle contains at least one icon.
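The partition rule described above (icon-free puzzles vs. puzzles with at least one icon) can be sketched as a simple filter. This is an illustrative sketch only; the field names (`elements`, `type`, `id`) are assumptions, not the actual COLUMBUS dataset schema.

```python
def split_benchmark(puzzles):
    """Partition puzzles into text-only and icon-containing subsets.

    A puzzle goes to the icon partition if any of its elements is an
    icon; otherwise it goes to the text-only partition.
    """
    text_only, with_icons = [], []
    for puzzle in puzzles:
        if any(el["type"] == "icon" for el in puzzle["elements"]):
            with_icons.append(puzzle)
        else:
            text_only.append(puzzle)
    return text_only, with_icons

# Toy example with two hypothetical puzzles.
puzzles = [
    {"id": 1, "elements": [{"type": "text", "value": "HEAD"}]},
    {"id": 2, "elements": [{"type": "text", "value": "RAIN"},
                           {"type": "icon", "value": "cloud"}]},
]
columbus_text, columbus_icon = split_benchmark(puzzles)
print([p["id"] for p in columbus_text])  # → [1]
print([p["id"] for p in columbus_icon])  # → [2]
```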
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory used for running the experiments. It only lists the models evaluated.
Software Dependencies | No | The paper mentions various models and tools used (e.g., Sentence-BERT, ChatGPT-3.5, DALL-E 3, different VLMs) but does not provide specific version numbers for these or for common software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We evaluate all models in a zero-shot setting using standard hyperparameter values... We experiment with two structural variants of closed-source models: forward and backward chaining (Jurafsky and Martin 2009)... Model Inputs. We explore four human-curated input levels, each providing the model with increasing information about the puzzle, i.e., its description and details on the nodes or edges of a puzzle's graph.
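The zero-shot multiple-choice protocol quoted above amounts to asking a model to pick one of the candidate answers per puzzle and scoring by exact match. A minimal sketch, assuming a model is any callable from (puzzle, options) to a chosen option; the `first_option_model` stand-in and all field names are hypothetical, not the paper's harness.

```python
def evaluate_zero_shot(model, benchmark):
    """Return multiple-choice accuracy of a model over a benchmark.

    `model` is any callable mapping (puzzle, options) -> chosen option;
    an item is correct when the choice exactly matches the gold answer.
    """
    correct = 0
    for item in benchmark:
        choice = model(item["puzzle"], item["options"])
        correct += int(choice == item["answer"])
    return correct / len(benchmark)

# Toy run with a trivial stand-in model that always picks the first option.
benchmark = [
    {"puzzle": "img_001",
     "options": ["head start", "rain check", "cold feet", "long shot"],
     "answer": "head start"},
    {"puzzle": "img_002",
     "options": ["rain check", "head start", "cold feet", "long shot"],
     "answer": "long shot"},
]
first_option_model = lambda puzzle, options: options[0]
print(evaluate_zero_shot(first_option_model, benchmark))  # → 0.5
```

The same loop accommodates the paper's input levels by enriching what is passed as `puzzle` (e.g., adding a description or graph details) without changing the scoring.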