Chunk-Distilled Language Modeling

Authors: Yanhong Li, Karen Livescu, Jiawei Zhou

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "We conduct a diverse set of empirical studies, including language modeling perplexity, text generation, and domain adaptation, showing the ability of CD-LM to improve inference efficiency and modeling performance."
Researcher Affiliation | Academia | "Yanhong Li (University of Chicago & TTI-Chicago), Karen Livescu (TTI-Chicago), Jiawei Zhou (TTI-Chicago & Stony Brook University)"
Pseudocode | Yes | "The chunk-integrated generative process is as follows: (1) at step t = 1, the base LM M_θ generates the first token x_1; the current sequence length is l_1 = 1. (2) At step t ≥ 2, set the next-token position n = l_{t-1} + 1. (3) Chunk proposal: G(x_{<n}) → (c_n, q_n), where chunk c_n has length τ_n. (4) Sample z_n ∼ Bernoulli(q_n). (5) If z_n = 1: accept c_n, and l_t = l_{t-1} + τ_n. (6) Else (z_n = 0): reject c_n, generate x_n from the base LM M_θ, and l_t = l_{t-1} + 1. (7) Move to generation step t + 1."
Open Source Code | Yes | "Code and data are available at https://github.com/yanhong-lbh/cd-lm."
Open Datasets | Yes | "We evaluate on the WikiText-103 dataset and the Dockerfile subset of the GitHub Code dataset. Dockerfile is a low-resource code language and the base model has poor PPL on the Dockerfile data. This setting allows us to explore the effectiveness of KCD-LM in low-resource settings. For domain adaptation, we focus on adapting to medical and legal domains. We use the Medical Instruction Dataset, which contains conversations between an AI assistant and patients during medical consultations, and the Federal Register subset of the Pile-of-Law (Henderson et al., 2022)."
Dataset Splits | Yes | "We measure PPL computed from 512-token sequences on corresponding test sets. For datasets that do not come with a test split, we construct test sets of 500 sequences to match WikiText."
Hardware Specification | Yes | "Empirically, extracting chunks from WikiText-103 takes under an hour on four A4000 GPUs for a small base model like GPT-2. Building the WikiText datastore takes up to 1.5 hours on a single A4000 GPU, while other datastores are built within 30 minutes."
Software Dependencies | No | "The paper discusses various models, such as GPT-2, Llama-2-7b-chat, and Mistral-7B-Instruct-v0.2, and mentions using Hugging Face datasets. However, it does not specify version numbers for general software dependencies such as Python, PyTorch, or CUDA."
Experiment Setup | Yes | "We formulate g_ϕ in Eq. (2) as a simple piecewise linear function, where the maximum context-matching similarity score maps to a non-zero chunk acceptance probability q_n only if the score is larger than η ≥ 0, which is a hyperparameter. Similarity scores in the range [η, 1] are then linearly mapped to [0, 1]. See Appendix D.4 for full details. We decode z_n greedily, which is equivalent to accepting z_n = 1 when the chunk context-matching similarity score passes the threshold η."