Unnatural Languages Are Not Bugs but Features for LLMs

Authors: Keyu Duan, Yiran Zhao, Zhili Feng, Jinjie Ni, Tianyu Pang, Qian Liu, Tianle Cai, Longxu Dou, Kenji Kawaguchi, Anirudh Goyal, J Zico Kolter, Michael Qizhe Shieh

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental The results show that, compared to natural context, all models can recover 82.0% of the original accuracy on our constructed SynContextQA dataset and 61.6% on SimGSM8K, demonstrating that unnatural languages contain latent features that enable comprehension across different scenarios. Furthermore, models fine-tuned on unnatural versions of instruction datasets perform on par with those trained on natural language, achieving an average win rate of 49.71 on Length-controlled AlpacaEval 2.0 across various base models. In Figure 3, we present the distribution of token importance and relative token importance within the unnatural version of the SimGSM8K dataset as processed by Llama3-8B-Instruct.
Researcher Affiliation Collaboration 1National University of Singapore 2Carnegie Mellon University 3Sea AI Lab 4Princeton University 5Mila, University of Montreal. Correspondence to: Keyu Duan <EMAIL>, Yiran Zhao <EMAIL>, Michael Qizhe Shieh <EMAIL>.
Pseudocode Yes Algorithm 1 in Appendix A1 provides a detailed illustration of the searching algorithm.

Algorithm 1: Unnatural Languages Searching
Input: Natural string S, searching models M and tasks T, batch size B, top-k size k, number of iterations T
 1: // Initialize x by shuffling the words of S and injecting special characters.
 2: x = random_inject(shuffle(S))
 3: repeat
 4:   for M in M do
 5:     // Tokenize x with model M.
 6:     x_{1:n} = Tokenize_M(x)
 7:     // Obtain the top-k alternative tokens at each position of x_{1:n}.
 8:     X_{1:n} = Top-k over x_{1:n} of sum_{t in T} log P_M(S | x_{1:n}, t)
 9:     for b = 1, ..., B do
10:       // Uniformly sample candidates.
11:       x~^(b)_{1:n} = x_{1:n}
12:       x~^(b)_{1:n}[i] = Uni(X_{1:n}[i]), with i = Uni([1:n])
13:       // Decode tokens back to a string.
14:       x~^(b) = Decode_M(x~^(b)_{1:n})
15:     end for
16:   end for
17:   // Select the best candidate.
18:   x_b = argmax_b sum_{t in T} log P_M(S | x~^(b), t)
19:   // Replace the current string with the best candidate.
20:   x = x_b
21: until repeated T times
Output: Equivalent unnatural string
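To make the structure of the search concrete, here is a minimal, runnable Python sketch of the loop in Algorithm 1. It is not the authors' implementation: the paper scores candidates with sum_{t in T} log P_M(S | x, t) under the searching LLMs, whereas here `score` and `top_k_alternatives` are caller-supplied stand-ins (a toy scorer in the usage below), and names like `shuffle_and_inject` are illustrative.

```python
import random

def shuffle_and_inject(words, specials=("@", "#", "~"), p=0.3, rng=None):
    """Initialization step: shuffle the words of S and inject special characters."""
    rng = rng or random.Random(0)
    tokens = list(words)
    rng.shuffle(tokens)
    return [t + rng.choice(specials) if rng.random() < p else t for t in tokens]

def search_unnatural(words, score, top_k_alternatives, batch_size=8, iters=10, seed=0):
    """Iterative search: mutate one sampled position per candidate, keep the best.

    `score(x)` stands in for the log-likelihood objective; `top_k_alternatives(x, i)`
    stands in for the model-derived top-k token substitutions at position i.
    """
    rng = random.Random(seed)
    x = shuffle_and_inject(words, rng=rng)
    for _ in range(iters):
        candidates = []
        for _ in range(batch_size):
            cand = x[:]
            i = rng.randrange(len(cand))                       # uniformly sample a position
            cand[i] = rng.choice(top_k_alternatives(cand, i))  # substitute from the top-k set
            candidates.append(cand)
        best = max(candidates, key=score)                      # argmax over the batch
        if score(best) >= score(x):                            # only accept improvements
            x = best
    return x

# Toy usage with a stand-in objective (prefer longer tokens):
words = ["apples", "and", "oranges", "are", "fruit"]
toy_score = lambda x: sum(len(t) for t in x)
toy_alts = lambda x, i: [x[i], x[i] + "!"]
result = search_unnatural(words, toy_score, toy_alts, batch_size=4, iters=6, seed=1)
```

The monotone acceptance check is a simplification; the paper's loop simply runs for T iterations over the model ensemble.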
Open Source Code Yes Our code is publicly available at https://github.com/John-AI-Lab/Unnatural_Language.
Open Datasets Yes Specifically, to prevent models from relying on common-sense memory when answering questions without context, we develop SynContextQA, a synthetic dataset generated by another LLM, containing contexts about non-existent entities paired with corresponding questions. Furthermore, to ensure models do not simply extract keywords from unnatural contexts in SynContextQA, we create SimGSM8K, a dataset of simple questions derived from GSM8K (Cobbe et al., 2021). We employ LIMA (Zhou et al., 2023), a high-quality instruction tuning dataset of 1000 carefully created (instruction, answer) pairs. We evaluate all variants on Length-controlled (LC) AlpacaEval 2.0 (Li et al., 2023) and MixEval (Ni et al., 2024).
Dataset Splits Yes Furthermore, to ensure models do not simply extract keywords from unnatural contexts in SynContextQA, we create SimGSM8K, a dataset of simple questions derived from GSM8K (Cobbe et al., 2021). We created an unnatural-language version of GSM8K for its training and test subsets. Due to the high cost of GCG and computation limitations, we searched 1333 training instances and 654 test instances.
Hardware Specification No We are also grateful to the Center for AI Safety (CAIS) for providing computational resources that supported this research. Explanation: The paper mentions "computational resources" but does not provide specific details such as GPU models, CPU types, or memory specifications.
Software Dependencies No Explanation: The paper does not explicitly state specific software dependencies with version numbers used for implementing their methodology or running experiments. It mentions LLM names but not the libraries or frameworks with their versions.
Experiment Setup Yes All models are fine-tuned for 10 epochs using identical hyperparameters. Under the in-context learning setting with 8 examples, the unnatural test accuracy of pre-trained base models before alignment reaches 38% and 42% on average, respectively. Algorithm 1 (Unnatural Languages Searching) takes as input: natural string S, searching models M and tasks T, batch size B, top-k size k, and number of iterations T.