Evaluating Long Range Dependency Handling in Code Generation LLMs

Authors: Yannick Assogba, Donghao Ren

TMLR 2025

Reproducibility checklist (variable: result, with supporting evidence from the LLM response):
Research Type: Experimental. "Empirical evaluation of several open source and proprietary code generation models. We find that model performance varies greatly depending on the number of steps involved and the distinctiveness of the target fact compared to the rest of the context. We also find that the order of function declarations has a large effect on models' ability to complete these tasks, and we observe that sliding-window mechanisms degrade models' ability to resolve references beyond the size of the window."
Researcher Affiliation: Industry. Yannick Assogba (Apple) and Donghao Ren (Apple).
Pseudocode: Yes. Algorithm 1: Generate Long Context Retrieval Tasks.
Open Source Code: Yes. "We open source the code to generate these tasks at https://github.com/apple/ml-key-retrieval-code-tasks."
Open Datasets: Yes. "Then we sample standalone Python functions from the HumanEval dataset (Chen et al., 2021) to fill out the context window to our desired size."
Dataset Splits: No. The paper does not mention train/validation/test splits. Evaluation prompts are instead generated from combinations of the number of unique key functions (nk), distractor functions (nd), maximum tokens (nt), and key positions (np), rather than by splitting a fixed dataset.
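The prompt-generation scheme described above (key functions mixed with distractors under a token budget, with a controllable key position) can be sketched roughly as follows. This is a hypothetical illustration, not the paper's Algorithm 1: all function and parameter names are invented, and whitespace-split word count stands in for a real tokenizer.

```python
import random


def generate_prompt(key_funcs, distractor_pool, n_d, max_tokens, key_pos, rng):
    """Hypothetical sketch: place key functions at fractional position
    `key_pos` (0.0 = start, 1.0 = end) among `n_d` sampled distractors,
    then truncate the concatenation to a token budget."""
    n_tokens = lambda s: len(s.split())  # crude stand-in for a tokenizer

    distractors = rng.sample(distractor_pool, n_d)
    insert_at = int(round(key_pos * len(distractors)))
    parts = distractors[:insert_at] + list(key_funcs) + distractors[insert_at:]

    # Greedily keep functions until the token budget would be exceeded.
    out, used = [], 0
    for fn in parts:
        t = n_tokens(fn)
        if used + t > max_tokens:
            break
        out.append(fn)
        used += t
    return "\n\n".join(out)
```

Sweeping `n_d`, `max_tokens`, and `key_pos` over a grid would then yield one prompt per parameter combination, matching the cross-product style of evaluation described above.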
Hardware Specification: Yes. "We ran all experiments on machines with a single A100 GPU with 80GB of VRAM on a cloud provider."
Software Dependencies: No. "We use implementations from the Hugging Face transformers library (Wolf et al., 2020) for the open source models." The paper names the library but does not specify a version number.
Experiment Setup: Yes. Generation hyperparameters (Appendix D, Table 14):
    Temperature: 0.8
    Top-p: 0.95
    Top-k: 0
    Batch size: 1
    Output samples per input prompt: 10
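To make the hyperparameters above concrete, here is a minimal pure-Python sketch of temperature, top-k, and top-p (nucleus) sampling over a single logit vector. It follows the common convention (used by Hugging Face transformers, among others) that top_k=0 disables the top-k filter; the function itself is an illustration, not code from the paper.

```python
import math
import random


def sample_token(logits, temperature=0.8, top_p=0.95, top_k=0, rng=random):
    """Sample one token index from raw logits using temperature scaling,
    an optional top-k cutoff (0 disables it), then a top-p nucleus cutoff."""
    # Temperature scaling and a numerically stable softmax.
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [(i, e / total) for i, e in enumerate(exps)]

    # Top-k: keep only the k most probable tokens (k = 0 means no cutoff).
    probs.sort(key=lambda ip: ip[1], reverse=True)
    if top_k > 0:
        probs = probs[:top_k]

    # Top-p: keep the smallest prefix whose cumulative mass reaches top_p.
    kept, mass = [], 0.0
    for i, p in probs:
        kept.append((i, p))
        mass += p
        if mass >= top_p:
            break

    # Renormalize over the kept tokens and sample.
    z = sum(p for _, p in kept)
    r = rng.random() * z
    for i, p in kept:
        r -= p
        if r <= 0:
            return i
    return kept[-1][0]
```

With the paper's settings (temperature 0.8, top-p 0.95, top-k 0), each of the 10 output samples per prompt would be drawn token-by-token this way from the model's logits.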