ObscuraCoder: Powering Efficient Code LM Pre-Training Via Obfuscation Grounding
Authors: Indraneil Paul, Haoyi Yang, Goran Glavaš, Kristian Kersting, Iryna Gurevych
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | To this end, we compile ObscuraX, a dataset of approximately 55M source and obfuscated code pairs in seven languages. Subsequently, we pre-train ObscuraCoder models, ranging in size from 255M to 2.8B parameters, on a 272B-token corpus that includes ObscuraX and demonstrate that our obfuscation-based pre-training recipe leads to consistent improvements in Code-LMs' abilities compared to both vanilla autoregressive pre-training and existing de-obfuscation (DOBF) objectives. ObscuraCoder demonstrates sizeable gains across multiple tests of syntactic and semantic code understanding, along with improved capabilities in multilingual code completion, multilingual code commit summarization, and multi-purpose library-oriented code generation. Our evaluation comprises a mix of five zero-shot and fine-tuning tasks, selected to provide answers to the three research questions from Section 1. For each task, we compare the performance of ObscuraCoder against an equally-sized autoregressive LM. We further contextualize the sample-efficiency benefits of our obfuscation objective by including comparisons to seven frontier Code-LMs from the DeepSeek-Coder (Guo et al., 2024), CodeGemma (Zhao et al., 2024), Phi (Gunasekar et al., 2023) and StarCoder (Lozhkov et al., 2024; Li et al., 2023b) families that are under 3B parameters in size and pre-trained on corpora between 5x and 22x larger than the one used to train ObscuraCoder. |
| Researcher Affiliation | Academia | Indraneil Paul (UKP Lab, TU Darmstadt); Haoyi Yang (AIML Lab, TU Darmstadt); Goran Glavaš (CAIDAS, JMU Würzburg); Kristian Kersting (AIML Lab, TU Darmstadt); Iryna Gurevych (UKP Lab, TU Darmstadt). Corresponding author: EMAIL |
| Pseudocode | No | The paper describes the data sourcing and pre-training process in narrative text and diagrams (Figure 1), but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | All training runs are performed using an open-source implementation 7 of the Megatron-LM (Shoeybi et al., 2019) kernels and leverage DeepSpeed Stage-2 (Rajbhandari et al., 2020) sharding in BF16 precision. ... For inference, we resort to a PagedAttention (Kwon et al., 2023) enabled fork of an open-source evaluation harness.8 The paper mentions using open-source implementations for training (EleutherAI/gpt-neox) and inference (bigcode-project/bigcode-evaluation-harness), but does not explicitly state that the code for the custom obfuscator or the ObscuraCoder models is made publicly available. |
| Open Datasets | No | In this work, we: 1) create ObscuraX, a source-to-obfuscated-code translation pairs dataset containing approximately 55M pairs in seven programming languages; ... We initiate this effort by acquiring source code files containing fewer than 2000 lines of code from the Stack corpus (Kocetkov et al., 2023) in seven languages: C, C++, Go, Java, Python, Rust and TypeScript. ... Our resultant dataset, which we dub ObscuraX, comprises 55M samples and is the largest multilingual collection of source code to obfuscated code translation pairs yet. Figure 1 details a high-level view of how ObscuraX is a critical part of the ObscuraCoder pre-training pipeline and also lists some examples. Refer to Appendix B.3 for more detailed samples in all languages. The paper describes the creation of the ObscuraX dataset and provides examples in the appendix, but it does not provide a direct URL, DOI, or explicit statement for its public release. |
| Dataset Splits | Yes | Our evaluation comprises a mix of five zero-shot and fine-tuning tasks, selected to provide answers to the three research questions from Section 1. For fine-tuning tasks, all models are trained for three epochs using a cosine scheduler with a peak learning rate of 5e-5 using LoRA (Xu et al., 2024) modules coupled with trainable embeddings. ... CommitChronicle. As a further measure of multilingual code competence, we evaluate Code-LMs on CommitChronicle (Eliseeva et al., 2023), a fine-tuning code-change summarization benchmark in seven languages (C, C++, Go, Java, Python, Rust and TypeScript). ... We construct language-specific splits by filtering the original dataset and partitioning 75%, 15% and 10% of the data into train, validation and test splits, respectively. |
| Hardware Specification | Yes | All inference runs are conducted on Nvidia A100 80GB GPUs with 95% of the GPU VRAM explicitly reserved for vLLM's GPU pages. We further set aside 64GB of RAM as a CPU swap, allowing for offloading pages to the CPU during bursts of long sequences. |
| Software Dependencies | No | All training runs are performed using an open-source implementation 7 of the Megatron-LM (Shoeybi et al., 2019) kernels and leverage DeepSpeed Stage-2 (Rajbhandari et al., 2020) sharding in BF16 precision. ... We use a custom byte-level BPE tokenizer (Wang et al., 2020) with a vocabulary size of 49152 tokens in total that we train on a 5B token corpus comprising code and code-adjacent text data. ... For inference, we resort to a PagedAttention (Kwon et al., 2023) enabled fork of an open-source evaluation harness.8 The paper mentions using Megatron-LM, DeepSpeed Stage-2, a custom BPE tokenizer, and a PagedAttention-enabled evaluation harness, but does not provide specific version numbers for these software components. |
| Experiment Setup | Yes | We use the Adam optimizer (Kingma & Ba, 2015), with a learning rate of 5e-4 and a cosine annealed schedule that terminates at 5% of the peak learning rate. The models are trained using FlashAttention-2 (Dao, 2024) with a sequence length of 2048 tokens and a batch size of 256 for 520K steps. We use a custom byte-level BPE tokenizer (Wang et al., 2020) with a vocabulary size of 49152 tokens in total that we train on a 5B token corpus comprising code and code-adjacent text data. The tokenizer is augmented with special tokens pertaining to the outputs of our obfuscator: ID_{n}, CLASS_{n}, FUNC_{n}, VAR_{n} and IMPORT_{n}, with n ranging from 0 to 149. ... For fine-tuning tasks, all models are trained for three epochs using a cosine scheduler with a peak learning rate of 5e-5 using LoRA (Xu et al., 2024) modules coupled with trainable embeddings. Appendix A (Table 4) further provides detailed training and architectural attributes including Scheduler Type, Warmup Prop., Optimizer Type, Peak LR, Terminal LR, Beta, Epsilon, Gradient Clipping, Weight Decay, Deepspeed variant, Model Datatype, Softmax Datatype, Global Batch Size, Training Steps, Sequence Length, Tokenizer Variant, and Vocab. Size. |
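The Experiment Setup row describes a cosine-annealed schedule that decays from a peak learning rate of 5e-4 to 5% of that peak over 520K steps. A minimal sketch of such a schedule is below; the warmup handling and the exact functional form are assumptions (the paper only reports a warmup proportion in Table 4 without detailing its shape), so this illustrates the decay behavior rather than reproducing the authors' implementation.

```python
import math

def cosine_lr(step, total_steps, peak_lr=5e-4, terminal_frac=0.05, warmup_steps=0):
    """Cosine-annealed learning rate decaying from peak_lr down to
    terminal_frac * peak_lr, with an optional linear warmup (assumed form)."""
    if step < warmup_steps:
        # Linear warmup from 0 up to the peak learning rate.
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    floor = terminal_frac * peak_lr
    # Standard half-cosine decay from peak_lr to the terminal floor.
    return floor + 0.5 * (peak_lr - floor) * (1 + math.cos(math.pi * min(progress, 1.0)))
```

With no warmup, `cosine_lr(0, 520_000)` yields the peak of 5e-4 and `cosine_lr(520_000, 520_000)` yields the terminal value of 2.5e-5, i.e. 5% of the peak, matching the reported schedule endpoints.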
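The tokenizer described above is augmented with obfuscation placeholder tokens (ID, CLASS, FUNC, VAR, IMPORT, each indexed 0 to 149). A sketch of generating that special-token vocabulary follows; the exact surface form of the tokens (underscore separator, no brackets) is an assumption, since the paper writes them with subscript notation.

```python
def obfuscation_special_tokens(n_per_kind=150):
    """Enumerate obfuscation placeholder tokens of each kind described in
    the paper: ID, CLASS, FUNC, VAR and IMPORT, indexed 0..n_per_kind-1.
    The "PREFIX_n" surface form is an assumed rendering."""
    prefixes = ["ID", "CLASS", "FUNC", "VAR", "IMPORT"]
    return [f"{prefix}_{n}" for prefix in prefixes for n in range(n_per_kind)]
```

Five kinds at 150 indices each gives 750 special tokens, which would be appended to the 49152-entry BPE vocabulary via the tokenizer library's special-token mechanism.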
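The Dataset Splits row reports partitioning the filtered CommitChronicle data 75/15/10 into train, validation and test. A minimal sketch of such a deterministic partition is below; the shuffling, seeding, and rounding choices are assumptions, as the paper does not specify them.

```python
import random

def partition_split(samples, seed=0):
    """Shuffle deterministically, then partition into 75% train,
    15% validation and 10% test, mirroring the reported proportions."""
    rng = random.Random(seed)  # fixed seed for a reproducible split (assumed)
    shuffled = list(samples)
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(n * 0.75)
    n_val = int(n * 0.15)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # remaining ~10%
    }
```

Assigning the remainder to the test split ensures every sample lands in exactly one partition even when the percentages do not divide the dataset size evenly.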