Unnatural Languages Are Not Bugs but Features for LLMs

Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J Zico Kolter, Michael Qizhe Shieh

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The results show that, compared to natural context, all models can recover 82.0% of the original accuracy on our constructed SynContextQA dataset and 61.6% on SimGSM8K, demonstrating that unnatural languages contain latent features that enable comprehension across different scenarios. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on par with those trained on natural language, achieving an average win rate of 49.71 on Length-controlled AlpacaEval 2.0 across various base models. In Figure 3, we present the distribution of token importance and relative token importance within the unnatural version of the SimGSM8K dataset as processed by Llama3-8B-Instruct.
Researcher Affiliation Collaboration 1National University of Singapore 2Carnegie Mellon University 3Sea AI Lab 4Princeton University 5Mila, University of Montreal. Correspondence to: Keyu Duan <EMAIL>, Yiran Zhao <EMAIL>, Michael Qizhe Shieh <EMAIL>.
Pseudocode Yes Algorithm 1 in Appendix A1 provides a detailed illustration of the searching algorithm.

Algorithm 1: Unnatural Languages Searching
Input: Natural string S, searching models M and tasks T, batch size B, top-k size k, number of iterations T
 1: // Initialize x by shuffling the words of S and injecting special characters.
 2: x = random_inject(shuffle(S))
 3: repeat
 4:   for M in M do
 5:     // Tokenize x with model M.
 6:     x_{1:n} = Tokenize_M(x)
 7:     // Obtain the top-k alternative tokens at each position of x_{1:n}.
 8:     X_{1:n} = Top-k over x_{1:n} of sum_{t in T} log P_M(S | x_{1:n}, t)
 9:     for b = 1, ..., B do
10:       // Uniformly sample candidates.
11:       x~^(b)_{1:n} = x_{1:n}
12:       x~^(b)_{1:n}[i] = Uni(X_{1:n}[i]), with i = Uni([1:n])
13:       // Decode tokens back to a string.
14:       x~^(b) = Decode_M(x~^(b)_{1:n})
15:     end for
16:   end for
17:   // Select the best candidate.
18:   x_b = argmax_b sum_{t in T} log P_M(S | x~^(b), t)
19:   // Replace the current string with the best candidate.
20:   x = x_b
21: until repeated T times
Output: Equivalent unnatural string
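To make the structure of the search concrete, here is a minimal, runnable Python sketch of the loop in Algorithm 1. It is not the authors' implementation: the paper scores candidates with sum_{t in T} log P_M(S | x, t) under the searching LLMs, whereas here `score` and `top_k_alternatives` are caller-supplied stand-ins (a toy scorer in the usage below), and names like `shuffle_and_inject` are illustrative.

```python
import random

def shuffle_and_inject(words, specials=("@", "#", "~"), p=0.3, rng=None):
    """Initialization step: shuffle the words of S and inject special characters."""
    rng = rng or random.Random(0)
    tokens = list(words)
    rng.shuffle(tokens)
    return [t + rng.choice(specials) if rng.random() < p else t for t in tokens]

def search_unnatural(words, score, top_k_alternatives, batch_size=8, iters=10, seed=0):
    """Iterative search: mutate one sampled position per candidate, keep the best.

    `score(x)` stands in for the log-likelihood objective; `top_k_alternatives(x, i)`
    stands in for the model-derived top-k token substitutions at position i.
    """
    rng = random.Random(seed)
    x = shuffle_and_inject(words, rng=rng)
    for _ in range(iters):
        candidates = []
        for _ in range(batch_size):
            cand = x[:]
            i = rng.randrange(len(cand))                       # uniformly sample a position
            cand[i] = rng.choice(top_k_alternatives(cand, i))  # substitute from the top-k set
            candidates.append(cand)
        best = max(candidates, key=score)                      # argmax over the batch
        if score(best) >= score(x):                            # only accept improvements
            x = best
    return x

# Toy usage with a stand-in objective (prefer longer tokens):
words = ["apples", "and", "oranges", "are", "fruit"]
toy_score = lambda x: sum(len(t) for t in x)
toy_alts = lambda x, i: [x[i], x[i] + "!"]
result = search_unnatural(words, toy_score, toy_alts, batch_size=4, iters=6, seed=1)
```

The monotone acceptance check is a simplification; the paper's loop simply runs for T iterations over the model ensemble.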
Open Source Code Yes Our code is publicly available at https://github.com/John-AI-Lab/Unnatural_Language.
Open Datasets Yes Specifically, to prevent models from relying on common-sense memory when answering questions without context, we develop SynContextQA, a synthetic dataset generated by another LLM, containing contexts about non-existent entities paired with corresponding questions. Furthermore, to ensure models do not simply extract keywords from unnatural contexts in SynContextQA, we create SimGSM8K, a dataset of simple questions derived from GSM8K (Cobbe et al., 2021). We employ LIMA (Zhou et al., 2023), a high-quality instruction tuning dataset of 1000 carefully created (instruction, answer) pairs. We evaluate all variants on Length-controlled (LC) AlpacaEval 2.0 (Li et al., 2023) and MixEval (Ni et al., 2024).
Dataset Splits Yes Furthermore, to ensure models do not simply extract keywords from unnatural contexts in SynContextQA, we create SimGSM8K, a dataset of simple questions derived from GSM8K (Cobbe et al., 2021). We created an unnatural-language version of GSM8K for its training and test subsets. Due to the high cost of GCG and computation limitations, we searched 1333 training instances and 654 test instances.
Hardware Specification No We are also grateful to the Center for AI Safety (CAIS) for providing computational resources that supported this research. Explanation: The paper mentions "computational resources" but does not provide specific details such as GPU models, CPU types, or memory specifications.
Software Dependencies No Explanation: The paper does not explicitly state specific software dependencies with version numbers used for implementing their methodology or running experiments. It mentions LLM names but not the libraries or frameworks with their versions.
Experiment Setup Yes All models are fine-tuned for 10 epochs using identical hyperparameters. Under the in-context learning setting with 8 examples, the unnatural test accuracy of pre-trained base models before alignment reaches 38% and 42% on average, respectively. Algorithm 1 (Unnatural Languages Searching) takes as input: natural string S, searching models M and tasks T, batch size B, top-k size k, and number of iterations T.