Evaluating Long Range Dependency Handling in Code Generation LLMs
Authors: Yannick Assogba, Donghao Ren
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical evaluation of several open source and proprietary code generation models. We find that model performance varies greatly depending on the number of steps involved, and the distinctiveness of the target fact compared to the rest of the context. We also discover that the order of function declarations has a large effect on model ability to complete these tasks. We further observe that sliding window mechanisms degrade models' ability to resolve references beyond the size of the window. |
| Researcher Affiliation | Industry | Yannick Assogba (Apple); Donghao Ren (Apple) |
| Pseudocode | Yes | Algorithm 1: Generate Long Context Retrieval Tasks |
| Open Source Code | Yes | We open source the code to generate these tasks at https://github.com/apple/ml-key-retrieval-code-tasks. |
| Open Datasets | Yes | Then we sample standalone Python functions from the Human Eval dataset (Chen et al., 2021) to fill out the context window to our desired size. |
| Dataset Splits | No | The paper does not define train/validation/test splits. Instead, evaluation prompts are generated parametrically from the number of unique key functions (nk), distractor functions (nd), maximum tokens (nt), and position combinations (np), rather than by partitioning a fixed dataset. |
| Hardware Specification | Yes | We ran all experiments on machines with a single A100 GPU with 80GB of VRAM on a cloud provider. |
| Software Dependencies | No | We use implementations from the Hugging Face transformers library (Wolf et al., 2020) for the open source models. The paper mentions the library but does not specify a version number. |
| Experiment Setup | Yes | Generation hyperparameters are given in Appendix D, Table 14: Temperature = 0.8; Top p = 0.95; Top k = 0; Batch size = 1; Output samples per input prompt = 10. |
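The task-generation procedure referenced above (Algorithm 1, "Generate Long Context Retrieval Tasks") can be sketched from the parameters the paper describes: place key functions and distractors in a context, pad with standalone filler functions (the paper samples HumanEval) up to a token budget, and shuffle declaration order. All function and parameter names below are hypothetical, and the whitespace-based token count is a crude stand-in for a real tokenizer; this is a sketch of the described procedure, not the authors' released implementation.

```python
import random

def generate_retrieval_task(key_fns, distractor_fns, filler_fns,
                            n_k, n_d, n_t, seed=0):
    """Hypothetical sketch of the task-generation loop.

    key_fns / distractor_fns / filler_fns: lists of function source strings.
    n_k, n_d: number of key and distractor functions to include.
    n_t: approximate token budget for the whole prompt.
    """
    rng = random.Random(seed)
    context = rng.sample(key_fns, n_k) + rng.sample(distractor_fns, n_d)

    # Pad with filler functions until the prompt reaches ~n_t "tokens"
    # (word count used here as a rough proxy for a real tokenizer).
    budget = n_t - sum(len(f.split()) for f in context)
    for f in filler_fns:
        cost = len(f.split())
        if cost > budget:
            break
        context.append(f)
        budget -= cost

    # Declaration order is shuffled; the paper reports order has a large
    # effect on model performance.
    rng.shuffle(context)
    return "\n\n".join(context)
```

The shuffle at the end matters: since the paper finds declaration order strongly affects results, controlling the seed makes the position combinations (np) reproducible across runs.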
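For reference, the Table 14 generation settings map directly onto sampling keyword arguments as used by the Hugging Face transformers `generate` API (which the paper says it uses for the open source models). A minimal sketch, assuming the standard `model.generate(**kwargs)` interface; in transformers, `top_k=0` disables top-k filtering:

```python
# Generation hyperparameters from Appendix D, Table 14 of the paper,
# expressed as transformers-style generate() keyword arguments.
GENERATION_KWARGS = dict(
    do_sample=True,           # sampling decode (implied by temperature/top-p)
    temperature=0.8,          # Table 14
    top_p=0.95,               # Table 14 (nucleus sampling threshold)
    top_k=0,                  # Table 14; 0 disables top-k filtering
    num_return_sequences=10,  # 10 output samples per input prompt
)

# Usage (sketch): outputs = model.generate(**inputs, **GENERATION_KWARGS)
```

Batch size 1 is handled at the dataloader level rather than inside `generate`, so it is not part of the kwargs dict.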