ChroKnowledge: Unveiling Chronological Knowledge of Language Models in Multiple Domains
Authors: Yein Park, Chanwoong Yoon, Jungwoo Park, Donghyeon Lee, Minbyul Jeong, Jaewoo Kang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To overcome this, we introduce CHROKNOWBENCH, a benchmark dataset designed to evaluate chronologically accumulated knowledge across three key aspects: multiple domains, time dependency, and temporal state. Our evaluation led to the following observations: (1) The ability of eliciting temporal knowledge varies depending on the data format that the model was trained on. |
| Researcher Affiliation | Collaboration | Yein Park1, Chanwoong Yoon1, Jungwoo Park1,3, Donghyeon Lee1,3, Minbyul Jeong2, Jaewoo Kang1,3 — Korea University1, Upstage AI2, AIGEN Sciences3 |
| Pseudocode | Yes | Algorithm 1: Iterative Distractor Generation Algorithm Algorithm 2: Chronological Prompting Algorithm |
| Open Source Code | Yes | Our datasets and code are publicly available at https://github.com/dmis-lab/ChroKnowledge |
| Open Datasets | Yes | Our datasets and code are publicly available at https://github.com/dmis-lab/ChroKnowledge |
| Dataset Splits | Yes | The test set consists of 10% of the total dataset from each domain. |
| Hardware Specification | Yes | The precision is done with eight NVIDIA A100 GPUs (80GB). |
| Software Dependencies | No | We utilize the rapidfuzz library to compare the model's responses with the predefined labels. ... We utilize the spaCy en_core_web_lg model to detect named entities in the paragraphs... |
| Experiment Setup | Yes | We use a temperature set T ∈ {0, 0.7} to capture variations in prediction, where T includes both greedy decoding and temperature sampling. We set n as 5, meaning that we evaluate using five distinct combinations of few-shot exemplars to ensure robust assessment. |
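The fuzzy-matching step quoted under Software Dependencies (comparing model responses against predefined labels with rapidfuzz) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `fuzzy_match` helper, the 0.9 threshold, and the use of stdlib `difflib` in place of rapidfuzz are all assumptions for the sake of a self-contained example.

```python
from difflib import SequenceMatcher

def fuzzy_match(response: str, labels: list[str], threshold: float = 0.9) -> bool:
    """Return True if the response is similar enough to any predefined label.

    Hypothetical stand-in for the paper's rapidfuzz-based comparison;
    the 0.9 threshold is an assumption, not taken from the paper.
    """
    response = response.strip().lower()
    for label in labels:
        score = SequenceMatcher(None, response, label.strip().lower()).ratio()
        if score >= threshold:
            return True
    return False

# A near-exact answer (case/whitespace differences) matches; an unrelated one does not.
print(fuzzy_match("Barack Obama ", ["barack obama", "donald trump"]))  # True
print(fuzzy_match("Paris", ["barack obama", "donald trump"]))          # False
```

With rapidfuzz, the same comparison would typically use `rapidfuzz.fuzz.ratio`, which scores on a 0–100 scale and is substantially faster on large label sets.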