COLUMBUS: Evaluating COgnitive Lateral Understanding Through Multiple-Choice reBUSes

Authors: Koen Kraaijveld, Yifan Jiang, Kaixin Ma, Filip Ilievski

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | COLUMBUS: Evaluating COgnitive Lateral Understanding Through Multiple-Choice reBUSes... An experimental analysis with COLUMBUS, involving human participants and representative state-of-the-art (SotA) vision-language models evaluated in a zero-shot setting, reveals that models perform decently but lag behind humans.
Researcher Affiliation | Collaboration | Koen Kraaijveld (1), Yifan Jiang (2), Kaixin Ma (3), Filip Ilievski (1). (1) Department of Computer Science, Faculty of Science, Vrije Universiteit Amsterdam; (2) Information Sciences Institute, University of Southern California; (3) Tencent AI Lab, Bellevue, WA. EMAIL, EMAIL, EMAIL, EMAIL
Pseudocode | No | The paper describes the 'Graph Generation Algorithm' in prose in Section 3.2 but does not present it as structured pseudocode or an algorithm block.
Open Source Code | Yes | Code: https://github.com/koen-47/COLUMBUS
Open Datasets | Yes | We start by scraping common English phrases from publicly available sources, namely Wiktionary and www.theidioms.com, yielding 9,745 instances. We use the Large Database of English Compounds (LaDEC) (Gagné, Spalding, and Schmidtke 2019) for compound words. This dataset has been feature-engineered and curated by humans, consisting of 8,957 compounds.
Dataset Splits | Yes | Benchmark Composition. We split the benchmark into two partitions: COLUMBUS-TEXT, which contains only text puzzles, and COLUMBUS-ICON, in which each puzzle contains at least one icon.
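The partition rule described above (icon-free puzzles vs. puzzles with at least one icon) can be sketched as a simple filter. This is an illustrative sketch only; the field names (`elements`, `type`, `id`) are assumptions, not the actual COLUMBUS dataset schema.

```python
def split_benchmark(puzzles):
    """Partition puzzles into text-only and icon-containing subsets.

    A puzzle goes to the icon partition if any of its elements is an
    icon; otherwise it goes to the text-only partition.
    """
    text_only, with_icons = [], []
    for puzzle in puzzles:
        if any(el["type"] == "icon" for el in puzzle["elements"]):
            with_icons.append(puzzle)
        else:
            text_only.append(puzzle)
    return text_only, with_icons

# Toy example with two hypothetical puzzles.
puzzles = [
    {"id": 1, "elements": [{"type": "text", "value": "HEAD"}]},
    {"id": 2, "elements": [{"type": "text", "value": "RAIN"},
                           {"type": "icon", "value": "cloud"}]},
]
columbus_text, columbus_icon = split_benchmark(puzzles)
print([p["id"] for p in columbus_text])  # → [1]
print([p["id"] for p in columbus_icon])  # → [2]
```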
Hardware Specification | No | The paper does not provide specific hardware details such as GPU/CPU models or memory used for running the experiments. It only lists the models evaluated.
Software Dependencies | No | The paper mentions various models and tools used (e.g., Sentence-BERT, ChatGPT-3.5, DALL-E 3, different VLMs) but does not provide specific version numbers for these or for common software dependencies like Python, PyTorch, or CUDA.
Experiment Setup | Yes | We evaluate all models in a zero-shot setting using standard hyperparameter values... We experiment with two structural variants of closed-source models: forward and backward chaining (Jurafsky and Martin 2009)... Model Inputs. We explore four human-curated input levels, each providing the model with increasing information about the puzzle, i.e., its description and details on the nodes or edges of a puzzle's graph.
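The zero-shot multiple-choice protocol quoted above amounts to asking a model to pick one of the candidate answers per puzzle and scoring by exact match. A minimal sketch, assuming a model is any callable from (puzzle, options) to a chosen option; the `first_option_model` stand-in and all field names are hypothetical, not the paper's harness.

```python
def evaluate_zero_shot(model, benchmark):
    """Return multiple-choice accuracy of a model over a benchmark.

    `model` is any callable mapping (puzzle, options) -> chosen option;
    an item is correct when the choice exactly matches the gold answer.
    """
    correct = 0
    for item in benchmark:
        choice = model(item["puzzle"], item["options"])
        correct += int(choice == item["answer"])
    return correct / len(benchmark)

# Toy run with a trivial stand-in model that always picks the first option.
benchmark = [
    {"puzzle": "img_001",
     "options": ["head start", "rain check", "cold feet", "long shot"],
     "answer": "head start"},
    {"puzzle": "img_002",
     "options": ["rain check", "head start", "cold feet", "long shot"],
     "answer": "long shot"},
]
first_option_model = lambda puzzle, options: options[0]
print(evaluate_zero_shot(first_option_model, benchmark))  # → 0.5
```

The same loop accommodates the paper's input levels by enriching what is passed as `puzzle` (e.g., adding a description or graph details) without changing the scoring.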