Evaluating Long Range Dependency Handling in Code Generation LLMs

Authors: Yannick Assogba, Donghao Ren

TMLR 2025

Reproducibility checklist (variable: result, with supporting evidence from the LLM response):
Research Type: Experimental. "Empirical evaluation of several open source and proprietary code generation models. We find that model performance varies greatly depending on the number of steps involved and the distinctiveness of the target fact compared to the rest of the context. We also find that the order of function declarations has a large effect on models' ability to complete these tasks, and we observe that sliding-window mechanisms degrade models' ability to resolve references beyond the size of the window."
Researcher Affiliation: Industry. Yannick Assogba (Apple) and Donghao Ren (Apple).
Pseudocode: Yes. Algorithm 1: Generate Long Context Retrieval Tasks.
Open Source Code: Yes. "We open source the code to generate these tasks at https://github.com/apple/ml-key-retrieval-code-tasks."
Open Datasets: Yes. "Then we sample standalone Python functions from the HumanEval dataset (Chen et al., 2021) to fill out the context window to our desired size."
Dataset Splits: No. The paper does not mention train/validation/test splits. Evaluation prompts are instead generated from combinations of the number of unique key functions (nk), distractor functions (nd), maximum tokens (nt), and key positions (np), rather than by splitting a fixed dataset.
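The prompt-generation scheme described above (key functions mixed with distractors under a token budget, with a controllable key position) can be sketched roughly as follows. This is a hypothetical illustration, not the paper's Algorithm 1: all function and parameter names are invented, and whitespace-split word count stands in for a real tokenizer.

```python
import random


def generate_prompt(key_funcs, distractor_pool, n_d, max_tokens, key_pos, rng):
    """Hypothetical sketch: place key functions at fractional position
    `key_pos` (0.0 = start, 1.0 = end) among `n_d` sampled distractors,
    then truncate the concatenation to a token budget."""
    n_tokens = lambda s: len(s.split())  # crude stand-in for a tokenizer

    distractors = rng.sample(distractor_pool, n_d)
    insert_at = int(round(key_pos * len(distractors)))
    parts = distractors[:insert_at] + list(key_funcs) + distractors[insert_at:]

    # Greedily keep functions until the token budget would be exceeded.
    out, used = [], 0
    for fn in parts:
        t = n_tokens(fn)
        if used + t > max_tokens:
            break
        out.append(fn)
        used += t
    return "\n\n".join(out)
```

Sweeping `n_d`, `max_tokens`, and `key_pos` over a grid would then yield one prompt per parameter combination, matching the cross-product style of evaluation described above.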
Hardware Specification: Yes. "We ran all experiments on machines with a single A100 GPU with 80GB of VRAM on a cloud provider."
Software Dependencies: No. "We use implementations from the Hugging Face transformers library (Wolf et al., 2020) for the open source models." The paper names the library but does not specify a version number.
Experiment Setup: Yes. Generation hyperparameters (Appendix D, Table 14):
    Temperature: 0.8
    Top-p: 0.95
    Top-k: 0
    Batch size: 1
    Output samples per input prompt: 10
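To make the hyperparameters above concrete, here is a minimal pure-Python sketch of temperature, top-k, and top-p (nucleus) sampling over a single logit vector. It follows the common convention (used by Hugging Face transformers, among others) that top_k=0 disables the top-k filter; the function itself is an illustration, not code from the paper.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, top_k=0, rng=random):
    """Sample one token index from raw logits using temperature scaling,
    an optional top-k cutoff (0 disables it), then a top-p nucleus cutoff."""
    # Temperature scaling and a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Top-k: keep only the k most probable tokens (k = 0 means no cutoff).
    probs.sort(key=lambda ip: ip[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break

    # Renormalize over the kept tokens and sample.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With the paper's settings (temperature 0.8, top-p 0.95, top-k 0), each of the 10 output samples per prompt would be drawn token-by-token this way from the model's logits.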