Efficient Generative Modeling with Residual Vector Quantization-Based Tokens

Authors: Jaehyeon Kim, Taehong Moon, Keon Lee, Jaewoong Cho

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We validate the efficacy and generalizability of ResGen across two real-world generative tasks: conditional image generation on ImageNet 256×256 and zero-shot text-to-speech synthesis. Experimental results demonstrate superior performance over autoregressive counterparts in these tasks.
Researcher Affiliation | Collaboration | NVIDIA and KRAFTON. Correspondence to: Jaewoong Cho <EMAIL>.
Pseudocode | Yes | Algorithm 1 (Training): "1: procedure BinaryMask(n, L, D) ..."; Algorithm 2 (Sampling): "1: procedure BinaryUnmask(n, L, D, m)".
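The report only quotes the procedure headers, not their bodies. A minimal sketch of what such routines plausibly do, assuming masking operates over the L×D grid of RVQ token slots (the function names come from the quoted headers; the bodies are illustrative guesses, not the paper's algorithms):

```python
import random

def binary_mask(n, L, D, rng=random):
    """Randomly mark n of the L*D RVQ token slots as masked.

    Returns an L x D list of booleans; True means masked.
    """
    mask = [[False] * D for _ in range(L)]
    for idx in rng.sample(range(L * D), n):
        mask[idx // D][idx % D] = True
    return mask

def binary_unmask(n, L, D, m, rng=random):
    """Reveal n currently-masked slots of mask m (returns a new mask)."""
    masked = [(i, j) for i in range(L) for j in range(D) if m[i][j]]
    out = [row[:] for row in m]
    for i, j in rng.sample(masked, n):
        out[i][j] = False
    return out
```

At sampling time such an unmasking step would be called repeatedly until no masked slots remain, mirroring iterative masked-token decoding.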
Open Source Code | No | Our work advances the field of generative modeling by introducing a memory-efficient approach for high-fidelity sample generation using Residual Vector Quantization (RVQ). The potential societal benefits of this research are substantial, particularly in areas where efficient and high-quality generation is critical, such as accessibility technologies, creative industries, and scientific simulations. ... To mitigate these risks, we encourage researchers and practitioners to adopt responsible use policies, including mechanisms to detect and authenticate synthetic content, and to promote transparency in model development and deployment. The paper mentions a project page for audio samples, but not for code release: "For qualitative comparison, we present our generated audio samples in the project page" (https://resgen-ai.github.io/).
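Since the row above names Residual Vector Quantization as the paper's core tokenization technique, a toy sketch of RVQ may help readers unfamiliar with it. The `rvq_encode`/`rvq_decode` names and the naive nearest-neighbor search are illustrative assumptions, not the paper's implementation:

```python
def rvq_encode(x, codebooks):
    """Residual VQ: at each depth, quantize the residual left over from the
    previous depth and record the chosen codebook index."""
    residual = list(x)
    tokens = []
    for cb in codebooks:  # one codebook per quantization depth
        # nearest code vector by squared Euclidean distance
        k = min(range(len(cb)),
                key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, cb[i])))
        tokens.append(k)
        residual = [r - c for r, c in zip(residual, cb[k])]
    return tokens, residual  # residual shrinks as depth grows

def rvq_decode(tokens, codebooks):
    """Reconstruct by summing the selected code vectors across depths."""
    out = [0.0] * len(codebooks[0][0])
    for k, cb in zip(tokens, codebooks):
        out = [o + c for o, c in zip(out, cb[k])]
    return out
```

The key property is that each depth refines the previous one, which is why RVQ tokens carry a coarse-to-fine structure that a generative model can exploit.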
Open Datasets | Yes | For the vision domain, we focus on conditional image generation tasks on ImageNet (Krizhevsky et al., 2017) at a resolution of 256×256. This research used datasets from the Open AI Dataset Project (AI-Hub, S. Korea); all data information can be accessed through AI-Hub (www.aihub.or.kr).
Dataset Splits | No | The paper mentions ImageNet 256×256 for image generation and discusses audio tasks, but it does not specify explicit training, validation, or test splits with percentages or sample counts for any dataset used. It implicitly refers to standard splits by naming benchmarks but does not detail them.
Hardware Specification | Yes | Wall-clock time results reflect the time required to generate a single sample on an NVIDIA A100 GPU. Both wall-clock time and maximum batch size are measured on an NVIDIA A100 GPU.
Software Dependencies | No | The paper implicitly relies on Python and PyTorch for its deep-learning models and uses ByT5-Large as the text encoder, but it does not specify version numbers for any of these components. "We train our method using an architecture similar to DiT (Peebles & Xie, 2023)..." and "...ByT5-Large (Xue et al., 2022)."
Experiment Setup | Yes | We train our method using an architecture similar to DiT (Peebles & Xie, 2023), adopting the XLarge version while modifying the adaptive layer normalization layers for conditioning by replacing their linear layers with bias parameters. As shown in Table 1, all generative models are trained for 2.8M iterations under the same RVQ token setting. ... In Table 2, all variants of ResGen are trained with a batch size of 256 across 4 GPUs for 7M iterations. The masking scheduling function γ(·) is defined as γ(r) = (1 − r²)^(1/2) and is applied throughout all training. For the text-to-speech task, our model, based on the DiT-XLarge architecture as in the vision task, is trained using the same configuration as in prior work (Kim et al., 2024), utilizing 4 GPUs for 310M iterations.
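The masking schedule γ(r) = (1 − r²)^(1/2) quoted above can be written out directly. The `num_masked` helper and its ceiling-rounding rule are illustrative assumptions about how such a schedule is typically applied, not details taken from the paper:

```python
import math

def gamma(r):
    """Masking schedule quoted in the report: gamma(r) = sqrt(1 - r^2),
    for progress r in [0, 1]; starts at 1 (fully masked), ends at 0."""
    return math.sqrt(1.0 - r ** 2)

def num_masked(r, total_tokens):
    """Number of tokens still masked at progress r (rounding rule is an
    illustrative assumption)."""
    return math.ceil(gamma(r) * total_tokens)
```

This cosine-like schedule unmasks slowly at first and quickly near the end, a common choice in masked-token generative sampling.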