Tractable Transformers for Flexible Conditional Generation
Authors: Anji Liu, Xuejie Liu, Dayuan Zhao, Mathias Niepert, Yitao Liang, Guy Van den Broeck
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Empirical results demonstrate that Tracformers achieve state-of-the-art conditional generation performance on text modeling compared to recent diffusion and AR model baselines. ... In this section, we aim to empirically evaluate Tracformer's effectiveness in both conditional and unconditional generation. Specifically, our experiments are designed to answer two key questions: (i) How does Tracformer compare to other NAR architectures in terms of conditional generation performance? (ii) Can Tracformer scale effectively and outperform existing SoTA generative models in both conditional and unconditional tasks? To this end, we conduct two sets of experiments: In Section 6.1, we compare Tracformer with a range of NAR architectures on WikiText (Merity et al., 2022), LAMBADA (Paperno et al., 2016), and One Billion Words (1BW) (Chelba et al., 2013) datasets to evaluate its performance across diverse conditional queries. In Section 6.2, we scale Tracformer to OpenWebText (Gokaslan & Cohen, 2019) and benchmark it against SoTA discrete diffusion models, focusing on zero-shot conditional and unconditional performance. These experiments comprehensively evaluate Tracformer's advantages and its potential to serve as a more effective backbone for NAR generation. |
| Researcher Affiliation | Academia | 1Department of Computer Science, University of California, Los Angeles 2Institute for Artificial Intelligence, University of Stuttgart 3Institute for Artificial Intelligence, Peking University 4School of Intelligence Science and Technology, Peking University 5Yuanpei College, Peking University. |
| Pseudocode | Yes | Algorithm 1: Span Masking Strategy; Algorithm 2: Mixed Masking Strategy for OpenWebText Training |
| Open Source Code | Yes | Code is available at https://github.com/liuanji/Tracformer. |
| Open Datasets | Yes | WikiText-103 (Merity et al., 2022), LAMBADA (Paperno et al., 2016), and One Billion Words (1BW) (Chelba et al., 2013) datasets... OpenWebText (Gokaslan & Cohen, 2019) |
| Dataset Splits | No | The paper mentions using |
| Hardware Specification | No | The paper does not provide specific details on the hardware used for the experiments, such as GPU or CPU models. It only states: "Due to resource limitations, we only train Tracformer at the GPT-2 (base) scale." |
| Software Dependencies | No | The paper does not explicitly state specific software dependencies with version numbers used for the experiments. |
| Experiment Setup | Yes | For both CAR and AC training tasks, the sequence length is set to 1024 tokens, with a batch size of 256. The models are optimized using AdamW with β1 = 0.9, β2 = 0.95, and a weight decay of 0.1. The initial learning rate is set to 6 × 10⁻⁴ and follows a cosine decay schedule, with 1,000 warmup steps to stabilize the early training phase. The final learning rate is 6 × 10⁻⁵. Training is conducted for 30,000 steps. ... Tracformer is implemented with a 10-layer encoder-decoder architecture, maintaining a block size (i.e., maximum sequence length) of 1024 tokens. It utilizes sparse multi-scope attention with a constraint of 16 attended tokens per step... The decoder operates with a maximum stride of 1024 tokens... The model is configured with 9 attention heads and an embedding dimension of 576. A dropout rate of 0.1 is applied... |
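The learning-rate schedule quoted above (warmup to 6 × 10⁻⁴, cosine decay to 6 × 10⁻⁵ over 30,000 steps with 1,000 warmup steps) can be sketched as a standalone function. This is a minimal sketch assuming linear warmup from zero and cosine decay over the remaining steps; the function name and the exact warmup/decay interaction are assumptions, as the paper only states the hyperparameter values.

```python
import math

# Hyperparameters as reported in the paper's experiment setup.
PEAK_LR = 6e-4       # initial (peak) learning rate
FINAL_LR = 6e-5      # final learning rate after decay
WARMUP_STEPS = 1_000
TOTAL_STEPS = 30_000

def lr_at_step(step: int) -> float:
    """Learning rate at a given 0-indexed optimizer step.

    Assumed behavior: linear warmup to PEAK_LR, then cosine decay
    to FINAL_LR over the remaining TOTAL_STEPS - WARMUP_STEPS steps.
    """
    if step < WARMUP_STEPS:
        # Linear warmup from 0 up to the peak learning rate.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay: progress runs from 0 at end of warmup to 1 at the end.
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return FINAL_LR + 0.5 * (PEAK_LR - FINAL_LR) * (1 + math.cos(math.pi * progress))
```

In a PyTorch training loop this would typically be wired up via `torch.optim.AdamW(params, lr=6e-4, betas=(0.9, 0.95), weight_decay=0.1)` together with a `LambdaLR` scheduler wrapping a function like the one above.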