Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Position: We Can’t Understand AI Using our Existing Vocabulary

Authors: John Hewitt, Robert Geirhos, Been Kim

ICML 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental As a proof of concept, we demonstrate how a length neologism enables controlling LLM response length, while a diversity neologism allows sampling more variable responses. Taken together, we argue that we cannot understand AI using our existing vocabulary, and expanding it through neologisms creates opportunities for both controlling and understanding machines better. (...) 5. A proof of concept: Neologism Embedding Learning (...) 5.3. Experiment: Length Neologism (H→M) (...) 5.4. Experiment: Diversity Neologism (H→M) (...) 5.5. Experiment: A Model's Preferences (M→H)
Researcher Affiliation Industry 1Google DeepMind. Correspondence to: John Hewitt <EMAIL>, Been Kim <EMAIL>.
Pseudocode No The paper describes methods in prose, such as in Section 5.1 'Method' and Appendix A.1 'Preference Loss', and includes mathematical formulas (e.g., Equation 1), but it does not present any clearly labeled pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code No The paper describes its proposed method of 'neologism embedding learning' and the experiments conducted using it, but it does not contain any explicit statements about releasing source code, provide links to code repositories, or mention code being available in supplementary materials.
Open Datasets Yes For our preference data, we used 700 instructions from the LIMA dataset (Zhou et al., 2023).
Dataset Splits No The paper mentions using '700 instructions from the LIMA dataset' and 'held out instructions' for testing in Section B.1, and '50 examples in the LIMA dataset' for another experiment in Section B.3. However, it does not provide specific percentages or counts for training, validation, and test splits for reproducibility.
Hardware Specification No The paper does not specify any hardware components such as GPU models, CPU types, or other computing infrastructure used for training or running the experiments.
Software Dependencies No The paper mentions specific models and optimizers like 'Gemma 2B model (Mesnard et al., 2024)', 'Adafactor optimizer (Shazeer & Stern, 2018)', and 'Gemini 1.5 Pro (Georgiev et al., 2024)', and a 'variant of DPO (Rafailov et al., 2024) called APO-up (D'Oosterlinck et al., 2024)'. However, it does not provide specific version numbers for general software dependencies such as programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA libraries.
Experiment Setup Yes Through early exploration, we determined a learning rate of 0.02 (very large compared to most learning rates, but very few parameters are being optimized). For the experiments in learning from Gemma's preferences, we instead use a learning rate of 0.001. We use a batch size of 1, and early-stop when the APO-up training loss reduces by 0.2. During all generation, we enforce that the new token is not generated by the model by replacing its logit with −∞. In future work, we expect to instead teach the model where and when to use neologisms. For the β hyperparameter in APO-up, we use 0.2. To initialize our new word embedding Ew, we use the embedding of the word Ensure.
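The logit-masking step quoted above (banning the neologism token during generation by replacing its logit with −∞) can be sketched as follows. This is an illustrative reconstruction, not the authors' code (which the report notes is not released); the function name, the toy vocabulary, and the token index are all hypothetical.

```python
import math

def mask_neologism_logits(logits, neologism_id):
    """Return a copy of the logits with the neologism token banned:
    its logit is set to -inf, so neither argmax decoding nor sampling
    can ever emit it. (Hypothetical helper, not from the paper.)"""
    masked = list(logits)
    masked[neologism_id] = -math.inf
    return masked

# Toy vocabulary of 5 tokens; index 3 stands in for the new neologism token.
logits = [1.2, 0.5, -0.3, 9.9, 0.1]
masked = mask_neologism_logits(logits, neologism_id=3)

# Greedy decoding over the masked logits never selects the neologism,
# even though it had the highest raw logit.
best = max(range(len(masked)), key=lambda i: masked[i])
print(best)  # → 0
```

In a real decoding loop this masking would be applied to the model's logits at every generation step; frameworks typically expose a hook for this (e.g. a logits-processor interface) rather than requiring manual list manipulation.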