Chunk-Distilled Language Modeling

Authors: Yanhong Li, Karen Livescu, Jiawei Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a diverse set of empirical studies, including language modeling perplexity, text generation, and domain adaptation, showing the ability of CD-LM to improve inference efficiency and modeling performance."
Researcher Affiliation | Academia | "Yanhong Li (University of Chicago & TTI-Chicago), Karen Livescu (TTI-Chicago), Jiawei Zhou (TTI-Chicago & Stony Brook University)"
Pseudocode | Yes | "The chunk-integrated generative process is as follows: (1) at step t = 1, the base LM M_θ generates the first token x_1; the current sequence length is l_1 = 1. (2) At step t ≥ 2, set the next-token position n = l_{t-1} + 1. (3) Chunk proposal: G(x_{<n}) → (c_n, q_n), where chunk c_n has length τ_n. (4) Sample z_n ∼ Bernoulli(q_n). (5) If z_n = 1: accept c_n, and l_t = l_{t-1} + τ_n. (6) Else (z_n = 0): reject c_n, generate x_n from the base LM M_θ, and l_t = l_{t-1} + 1. (7) Move to generation step t + 1."
Open Source Code | Yes | "Code and data are available at https://github.com/yanhong-lbh/cd-lm."
Open Datasets | Yes | "We evaluate on the WikiText-103 dataset and the Dockerfile subset of the GitHub Code dataset. Dockerfile is a low-resource code language and the base model has poor PPL on the Dockerfile data. This setting allows us to explore the effectiveness of KCD-LM in low-resource settings. For domain adaptation, we focus on adapting to medical and legal domains. We use the Medical Instruction Dataset, which contains conversations between an AI assistant and patients during medical consultations, and the Federal Register subset of the Pile-of-Law (Henderson et al., 2022)."
Dataset Splits | Yes | "We measure PPL computed from 512-token sequences on corresponding test sets. For datasets that do not come with a test split, we construct test sets of 500 sequences to match WikiText."
Hardware Specification | Yes | "Empirically, extracting chunks from WikiText-103 takes under an hour on four A4000 GPUs for a small base model like GPT-2. Building the WikiText datastore takes up to 1.5 hours on a single A4000 GPU, while other datastores are built within 30 minutes."
Software Dependencies | No | "The paper discusses various models, such as GPT-2, Llama-2-7b-chat, and Mistral-7B-Instruct-v0.2, and mentions using Hugging Face datasets. However, it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA."
Experiment Setup | Yes | "We formulate g_ϕ in Eq. (2) as a simple piecewise linear function, where the maximum context-matching similarity score maps to a non-zero chunk acceptance probability q_n only if the score is larger than η ≥ 0, which is a hyperparameter. Similarity scores in the range [η, 1] are then linearly mapped to [0, 1]. See Appendix D.4 for full details. We decode z_n greedily, which is equivalent to accepting z_n = 1 when the chunk context-matching similarity score passes the threshold η."