Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the efficacy and generalizability of ResGen across two real-world generative tasks: conditional image generation on ImageNet 256×256 and zero-shot text-to-speech synthesis. Experimental results demonstrate superior performance over autoregressive counterparts in these tasks.
Researcher Affiliation | Collaboration | NVIDIA and KRAFTON. Correspondence to: Jaewoong Cho <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Training): "1: procedure BinaryMask(n, L, D) ..."; Algorithm 2 (Sampling): "1: procedure BinaryUnmask(n, L, D, m)".
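The report only quotes the procedure headers, not their bodies. A minimal sketch of what such routines plausibly do, assuming masking operates over the L×D grid of RVQ token slots (the function names come from the quoted headers; the bodies are illustrative guesses, not the paper's algorithms):

```python
import random

def binary_mask(n, L, D, rng=random):
    """Randomly mark n of the L*D RVQ token slots as masked.

    Returns an L x D list of booleans; True means masked.
    """
    mask = [[False] * D for _ in range(L)]
    for idx in rng.sample(range(L * D), n):
        mask[idx // D][idx % D] = True
    return mask

def binary_unmask(n, L, D, m, rng=random):
    """Reveal n currently-masked slots of mask m (returns a new mask)."""
    masked = [(i, j) for i in range(L) for j in range(D) if m[i][j]]
    out = [row[:] for row in m]
    for i, j in rng.sample(masked, n):
        out[i][j] = False
    return out
```

At sampling time such an unmasking step would be called repeatedly until no masked slots remain, mirroring iterative masked-token decoding.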
Open Source Code | No | Our work advances the field of generative modeling by introducing a memory-efficient approach for high-fidelity sample generation using Residual Vector Quantization (RVQ). The potential societal benefits of this research are substantial, particularly in areas where efficient and high-quality generation is critical, such as accessibility technologies, creative industries, and scientific simulations. ... To mitigate these risks, we encourage researchers and practitioners to adopt responsible use policies, including mechanisms to detect and authenticate synthetic content, and to promote transparency in model development and deployment. The paper mentions a project page for audio samples, but not for code release: "For qualitative comparison, we present our generated audio samples in the project page" (https://resgen-ai.github.io/).
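Since the row above names Residual Vector Quantization as the paper's core tokenization technique, a toy sketch of RVQ may help readers unfamiliar with it. The `rvq_encode`/`rvq_decode` names and the naive nearest-neighbor search are illustrative assumptions, not the paper's implementation:

```python
def rvq_encode(x, codebooks):
    """Residual VQ: at each depth, quantize the residual left over from the
    previous depth and record the chosen codebook index."""
    residual = list(x)
    tokens = []
    for cb in codebooks:  # one codebook per quantization depth
        # nearest code vector by squared Euclidean distance
        k = min(range(len(cb)),
                key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return tokens, residual  # residual shrinks as depth grows

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected code vectors across depths."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out
```

The key property is that each depth refines the previous one, which is why RVQ tokens carry a coarse-to-fine structure that a generative model can exploit.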
Open Datasets | Yes | For the vision domain, we focus on conditional image generation tasks on ImageNet (Krizhevsky et al., 2017) at a resolution of 256×256. This research used datasets from the Open AI Dataset Project (AI-Hub, S. Korea); all data information can be accessed through AI-Hub (www.aihub.or.kr).
Dataset Splits | No | The paper mentions ImageNet 256×256 for image generation and discusses audio tasks, but it does not specify explicit training, validation, or test splits with percentages or sample counts for any dataset used. It implicitly refers to standard splits by naming benchmarks but does not detail them.
Hardware Specification | Yes | Wall-clock time results reflect the time required to generate a single sample on an NVIDIA A100 GPU. Both wall-clock time and maximum batch size are measured on an NVIDIA A100 GPU.
Software Dependencies | No | The paper implicitly relies on Python and PyTorch for its deep-learning models and uses ByT5-Large as the text encoder, but it does not specify version numbers for any of these components. "We train our method using an architecture similar to DiT (Peebles & Xie, 2023)..." and "...ByT5-Large (Xue et al., 2022)."
Experiment Setup | Yes | We train our method using an architecture similar to DiT (Peebles & Xie, 2023), adopting the XLarge version while modifying the adaptive layer normalization layers for conditioning by replacing their linear layers with bias parameters. As shown in Table 1, all generative models are trained for 2.8M iterations under the same RVQ token setting. ... In Table 2, all variants of ResGen are trained with a batch size of 256 across 4 GPUs for 7M iterations. The masking scheduling function γ(·) is defined as γ(r) = (1 − r²)^(1/2) and is applied throughout all training. For the text-to-speech task, our model, based on the DiT-XLarge architecture as in the vision task, is trained using the same configuration as in prior work (Kim et al., 2024), utilizing 4 GPUs for 310M iterations.
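The masking schedule γ(r) = (1 − r²)^(1/2) quoted above can be written out directly. The `num_masked` helper and its ceiling-rounding rule are illustrative assumptions about how such a schedule is typically applied, not details taken from the paper:

```python
import math

def gamma(r):
    """Masking schedule quoted in the report: gamma(r) = sqrt(1 - r^2),
    for progress r in [0, 1]; starts at 1 (fully masked), ends at 0."""
    return math.sqrt(1.0 - r ** 2)

def num_masked(r, total_tokens):
    """Number of tokens still masked at progress r (rounding rule is an
    illustrative assumption)."""
    return math.ceil(gamma(r) * total_tokens)
```

This cosine-like schedule unmasks slowly at first and quickly near the end, a common choice in masked-token generative sampling.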