Language Models Are Implicitly Continuous

Authors: Samuele Marro, Davide Evangelista, X. Huang, Emanuele La Malfa, Michele Lombardi, Michael Wooldridge

ICLR 2025

Reproducibility Assessment (variable, result, supporting evidence)
Research Type: Experimental. Evidence: By running experiments on state-of-the-art LLMs, we find that the language LLMs learn is implicitly continuous: they are able to handle, with minor modifications, inputs that are both time-continuous and space-continuous. In particular, we formally show that the results obtained by extending pretrained LLMs to handle time-continuous inputs depend strongly on a quantity, named duration, associated with each sentence. We also show in Section 4 that the semantics of this continuum deviate significantly from human intuition.
Researcher Affiliation: Academia. Evidence: (1) Department of Engineering Science, University of Oxford, Oxford, UK; (2) Department of Computer Science, University of Bologna, Bologna, Italy; (3) Department of Computer Science, ETH Zurich, Zurich, Switzerland; (4) Department of Computer Science, University of Oxford, Oxford, UK.
Pseudocode: No. The paper provides mathematical derivations and descriptions of modifications to the Transformer architecture (Appendices A and B.1), but does not include any clearly labeled pseudocode blocks or algorithms formatted as code.
Open Source Code: Yes. Evidence: Our code is available at https://github.com/samuelemarro/continuous-llm-experiments.
Open Datasets: Yes. Evidence: We quantitatively study this phenomenon by repeating this experiment on a dataset of 200 word-counting tasks. We consider the sequential dataset from Lin et al. (2024), which contains 200 curated how-to tutorials split by step.
Dataset Splits: No. The paper mentions a 'dataset of 200 word counting tasks' and 'the sequential dataset from Lin et al. (2024), which contains 200 curated how-to tutorials split by step', but it does not specify explicit training, validation, or test splits for these datasets in its own experiments. The experiments probe pre-trained LLMs rather than training a new model that would require such splits.
Hardware Specification: No. The paper does not state the hardware used to run its experiments (e.g., GPU models, CPU types, or cloud instance specifications). It evaluates state-of-the-art Large Language Models (LLMs), including Llama2, Llama3, Phi3, Gemma, Gemma2, and Mistral, but does not specify the hardware on which these models were run.
Software Dependencies: No. The paper states: 'In our experiments, we used Hugging Face, which natively supports 1. and 3. and can be easily adapted to support 2.' While Hugging Face is mentioned as a tool, no version numbers are provided for the Hugging Face libraries or for other critical software dependencies (e.g., PyTorch, Python).
Experiment Setup: Yes. Evidence: Experiment-specific parameters are reported in the respective subsections of Appendix C.4. CCTs can be implemented with little effort by starting from the implementation of a regular Transformer and applying three modifications: (1) accepting arbitrary embeddings, rather than only tokens; (2) allowing positional indices to be floating-point values, instead of only integers; (3) adding support for custom floating-point attention masks. For single-token continuity, we shrink the subset of considered tokens by a coefficient in the range [0.1, 1]. We then interpolate (with 40 steps) between the sentence containing one object or the other.
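Two of the ingredients above, floating-point positional indices and interpolation between sentence embeddings with 40 steps, can be sketched in isolation. The following is a minimal illustrative example, not the authors' implementation (their repository contains the actual Hugging Face-based code): a sinusoidal positional encoding evaluated at a float position, and a linear interpolation path between two embedding vectors. Function names and the embedding dimension are hypothetical.

```python
import math

def sinusoidal_embedding(position, dim):
    # Standard sinusoidal positional encoding, evaluated at a *float*
    # position rather than an integer index (modification 2 above).
    emb = []
    for i in range(0, dim, 2):
        freq = 1.0 / (10000 ** (i / dim))
        emb.append(math.sin(position * freq))
        emb.append(math.cos(position * freq))
    return emb[:dim]

def interpolate(emb_a, emb_b, steps):
    # Linear interpolation between two embeddings (e.g., the embeddings of
    # two sentences differing in one object), with `steps` points in total;
    # the paper's experiments use 40 steps.
    path = []
    for s in range(steps):
        t = s / (steps - 1)
        path.append([(1 - t) * a + t * b for a, b in zip(emb_a, emb_b)])
    return path

# Hypothetical usage: a 40-step path between two positions' encodings.
start = sinusoidal_embedding(0.0, 8)
end = sinusoidal_embedding(1.0, 8)
path = interpolate(start, end, steps=40)
```

The sketch uses linear interpolation in embedding space; how the interpolated embeddings are fed to the model (via the arbitrary-embedding input of modification 1) is left to the actual implementation.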