Logicbreaks: A Framework for Understanding Subversion of Rule-based Inference

Authors: Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, Eric Wong

ICLR 2025

Reproducibility assessment (each entry gives a variable, its result, and the supporting LLM response):
Research Type: Experimental. "Moreover, our empirical experiments show that LLMs exhibit reasoning behaviors consistent with our theoretical constructions. This suggests that our framework offers a preliminary working theory for studying how LLMs perform rule-following. LLM Jailbreaks Align with Our Theoretical Predictions (Section 4). We observe that automated jailbreak attacks like GCG (Zou et al., 2023) find suffixes similar to those predicted by our theory. Additionally, these attacks induce attention patterns that align with our predictions, providing evidence for the mechanisms underlying our theory-derived attack strategies. While our theory does not make definitive claims about LLM behavior, our experiments suggest a useful empirical connection for understanding the behavior of LLMs in rule-based contexts like logical reasoning and jailbreak attacks."
Researcher Affiliation: Academia. "Anton Xue, Avishree Khare, Rajeev Alur, Surbhi Goel, and Eric Wong. Department of Computer and Information Science, University of Pennsylvania."
Pseudocode: No. "The paper describes methods using mathematical formulations (e.g., equations for R(X), Attn(X), Ffwd(z)) and prose, but it does not include a distinct section or figure labeled 'Pseudocode' or 'Algorithm' with structured, step-by-step instructions formatted like code."
Open Source Code: Yes. "All code and experiments from this paper are available and open-sourced at https://github.com/AntonXue/tf_logic"
Open Datasets: Yes. "To study inference subversion in natural language, we consider the task of sabotaging item-crafting in Minecraft (Mojang Studios, 2011). Given a prompt on crafting items... We use Minecraft (Mojang Studios, 2011) crafting recipes gathered from GitHub to generate prompts such as the following:"
Dataset Splits: Yes. "We do this construction for depths of T = 1, 3, 5, each with a train/test split of 65536 and 16384 prompts, respectively."
Hardware Specification: Yes. "We had access to a server with three NVIDIA GeForce RTX 4090 GPUs (24GB RAM each). In addition, we had access to a shared cluster with the following GPUs: eight NVIDIA A100 PCIe (80GB RAM each) and eight NVIDIA RTX A6000 (48GB RAM each)."
Software Dependencies: No. "The paper mentions models like GPT-2, Llama-2-7B-chat-hf, and Meta-Llama-3-8B-Instruct, and optimizers like AdamW, but does not specify versions for underlying ancillary software such as Python, PyTorch/TensorFlow, or CUDA libraries."
Experiment Setup: Yes. "For training, we use AdamW (Loshchilov & Hutter, 2017) as our optimizer with default configurations. We train for 8192 steps with batch size 512, learning rate 5 × 10^-4, and a linear decay schedule with 10% warmup. Each model takes about one hour to train using a single NVIDIA GeForce RTX 4090 GPU. ... We fine-tuned a GPT-2 model for each of the Minecraft datasets. Each model is trained for 25 epochs using the standard causal language modeling objective. We use AdamW with default configurations, a learning rate of 5 × 10^-5, and linear decay with 10% warmup. We used a batch size of 32 with four gradient accumulation steps."
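The schedule quoted above (linear decay with 10% warmup, base learning rate 5 × 10^-4 over 8192 steps) can be sketched as a plain function; the function name and the exact piecewise form are assumptions for illustration, not taken from the paper, which only states the hyperparameters.

```python
def linear_warmup_decay(step: int,
                        total_steps: int = 8192,
                        warmup_frac: float = 0.10,
                        base_lr: float = 5e-4) -> float:
    """Learning rate at a given step: linear warmup over the first 10% of
    training, then linear decay to zero. Hyperparameter values come from the
    quoted experiment setup; the piecewise form is an assumed sketch."""
    warmup_steps = int(total_steps * warmup_frac)  # 819 steps for 10% of 8192
    if step < warmup_steps:
        return base_lr * step / warmup_steps
    return base_lr * (total_steps - step) / (total_steps - warmup_steps)

# The rate ramps from 0 to 5e-4 by the end of warmup, then decays back to 0.
print(linear_warmup_decay(0), linear_warmup_decay(819), linear_warmup_decay(8192))
```

In practice this corresponds to the standard "linear schedule with warmup" available in common training libraries; the paper does not say which implementation was used.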