Interchangeable Token Embeddings for Extendable Vocabulary and Alpha-Equivalence

Authors: İlker Işık, Ramazan Gokberk Cinbis, Ebru Aydin Gol

ICML 2025

Reproducibility

Variable | Result | LLM Response
Research Type | Experimental | We use our embedding method in a Transformer encoder-decoder model and evaluate it on three tasks: copying with an extendable vocabulary, solving LTL formulae, and predicting assignments for propositional logic. As a baseline, we consider a simpler approach that uses alpha-renaming for data augmentation during training to expose the model to a larger vocabulary, which is also new in the literature to the best of our knowledge. Overall, our method demonstrates generalization capabilities to larger vocabulary sizes, and also combines well with positional encodings that exhibit length generalization. We also experiment with dataset perturbation to show that our method introduces a helpful inductive bias for alpha-equivalence. Finally, we present alpha-covariance, a metric for measuring robustness against alpha-conversions that is applicable to any domain where alpha-equivalence is relevant.
Researcher Affiliation | Collaboration | 1 Department of Computer Engineering, Middle East Technical University, Ankara, Turkey; 2 Microsoft, Istanbul, Turkey. Correspondence to: İlker Işık <EMAIL>, Ramazan Gokberk Cinbis <EMAIL>, Ebru Aydin Gol <EMAIL>.
Pseudocode | No | The paper describes methods verbally and with mathematical formulas, but it does not contain any clearly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code | Yes | Our code and project page are available at necrashter.github.io/interchangeabletoken-embeddings
Open Datasets | Yes | In this section, we train models on the LTLRandom35 dataset from Deep LTL (Hahn et al., 2021) and other synthetic datasets created with the same method.
Dataset Splits | Yes | To generate the evaluation datasets (validation and test splits), we create 100 samples for each possible combination of unique character count and string length, starting from a minimum of 3. Consequently, the total evaluation dataset is arranged in a matrix in which the rows represent unique character count in the string and the columns represent the string length. This matrix is upper triangular since the unique character count cannot exceed the string length. For random embeddings, we repeat the evaluation 10 times and report the average.
Hardware Specification | Yes | We measured training durations for models trained on NVIDIA H100 GPUs using identical hyperparameter settings. [...] We conducted a runtime analysis using our best-performing LTL model on NVIDIA A4000 hardware.
Software Dependencies | Yes | To evaluate the correctness of the generated formulae, we utilize spot framework version 2.11.6 (Duret-Lutz et al., 2022). [...] We use pyaiger (Vazquez-Chanlatte & Rabe) to generate datasets and evaluate predictions. [...] We use the 3B-parameter version of Llama 3.2 (Grattafiori et al., 2024), quantized with Q4_K_M, and run it using Ollama 0.4.7 as our LLM backend.
Experiment Setup | Yes | The hyperparameter settings are given in Table 4 in Appendix D. [...] Table 4. Hyperparameter choices:

Experiment | Embedding | Layers | Heads | FC size | Batch Size | Train Steps
Copy (Sections E.1 and E.2) | 64 | 2 | 4 | 64 | 512 | 20K
Copy (Section E.5) | 128 | 6 | 8 | 128 | 512 | 20K
LTL (Section 5.2) | 128 | 8 | 8 | 1024 | 768 | 52K
Propositional Logic (Section 5.3) | 132 | 6 | 6 | 512 | 1024 | 50K
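The alpha-renaming baseline mentioned in the Research Type row (renaming variables consistently to expose the model to a larger vocabulary) can be sketched as follows. This is an illustrative sketch, not the authors' implementation; the function name, the token representation, and the example formula are assumptions.

```python
import random

def alpha_rename(tokens, variables, vocabulary):
    """Consistently rename variable tokens in a token sequence.

    `variables` is the set of tokens treated as renameable variables;
    `vocabulary` is the pool of names to sample fresh variables from.
    Distinct variables receive distinct fresh names, and every
    occurrence of the same variable receives the same name, so the
    result is alpha-equivalent to the input.
    """
    # Variables in first-occurrence order (dict.fromkeys deduplicates).
    present = [t for t in dict.fromkeys(tokens) if t in variables]
    # Sample as many distinct fresh names as there are variables.
    fresh = random.sample(sorted(vocabulary), len(present))
    mapping = dict(zip(present, fresh))
    return [mapping.get(t, t) for t in tokens]

# Example: rename propositions a and b in an LTL-style formula G(a -> F b) & a.
formula = ["G", "(", "a", "->", "F", "b", ")", "&", "a"]
augmented = alpha_rename(formula, variables={"a", "b"},
                         vocabulary={"p", "q", "r", "s"})
```

Used as data augmentation, each training example would pass through such a renaming step so the model sees many alpha-variants of the same formula.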
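The upper-triangular evaluation matrix described in the Dataset Splits row (100 samples per valid combination of unique character count and string length, both starting from 3) can be sketched as below. This is a reconstruction of the sampling scheme under stated assumptions, not the paper's code; the function name, the lowercase alphabet, and the per-cell sample count parameter are illustrative.

```python
import random
import string

def build_eval_grid(max_length, samples_per_cell=100, min_value=3, seed=0):
    """Build an upper-triangular evaluation grid.

    Keys are (n_unique, length) pairs with min_value <= n_unique <= length,
    mirroring the paper's matrix whose rows are unique character counts and
    whose columns are string lengths. Each cell holds `samples_per_cell`
    random strings with exactly that many unique characters and that length.
    """
    rng = random.Random(seed)
    grid = {}
    for n_unique in range(min_value, max_length + 1):
        for length in range(n_unique, max_length + 1):  # upper triangular
            cell = []
            for _ in range(samples_per_cell):
                chars = rng.sample(string.ascii_lowercase, n_unique)
                # Include each chosen character once, then pad with repeats.
                s = chars + [rng.choice(chars) for _ in range(length - n_unique)]
                rng.shuffle(s)
                cell.append("".join(s))
            grid[(n_unique, length)] = cell
    return grid
```

Note the cell for (n_unique, length) exists only when n_unique <= length, which is exactly the upper-triangularity the row describes.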