HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han

ICLR 2025

Reproducibility Variable / Result / LLM Response:

Research Type: Experimental
"In this section, we evaluate HART's performance in tokenization and generation. For generation, we present both text-to-image and class-conditioned image generation results."

Researcher Affiliation: Collaboration
Haotian Tang (MIT), Yecheng Wu (MIT, Tsinghua University), Shang Yang (MIT), Enze Xie (NVIDIA), Junsong Chen (NVIDIA), Junyu Chen (MIT, Tsinghua University), Zhuoyang Zhang (MIT), Han Cai (NVIDIA), Yao Lu (NVIDIA), Song Han (MIT, NVIDIA)

Pseudocode: No
"The paper describes the methodology and architecture through text and figures (e.g., Figures 5 and 6), but it does not include any explicitly labeled pseudocode or algorithm blocks."

Open Source Code: Yes
"Our code is open sourced at https://github.com/mit-han-lab/hart."

Open Datasets: Yes
"We evaluate HART on ImageNet (Deng et al., 2009) for class-conditioned image generation, and on MJHQ-30K (Li et al., 2024a), GenEval (Ghosh et al., 2024), and DPG-Bench (Hu et al., 2024) for text-to-image generation. The HART tokenizer is trained on Open Images (Kuznetsova et al., 2020). For HART transformer training, we utilize ImageNet, JourneyDB (Pan et al., 2023), and internal MidJourney-style synthetic data."

Dataset Splits: No
"The paper lists the datasets used for evaluation and training but does not specify how they were split into training, validation, or test sets (e.g., percentages or exact counts) for reproduction."

Hardware Specification: Yes
"Latency and throughput (batch=8) measurements are conducted on NVIDIA A100. We thank NVIDIA for donating the DGX server."

Software Dependencies: No
"The paper mentions specific models and architectures such as Qwen2-1.5B (Yang et al., 2024) and Llama-style (Touvron et al., 2023) building blocks, but it does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA)."

Experiment Setup: Yes
"For class-conditioned image generation, we follow VAR (Tian et al., 2024) to construct HART models with AR transformers of varying parameter sizes: 600M, 1B, and 2B. The diffusion MLP contains an additional 37M parameters. We replace VAR's attention and FFN blocks with Llama-style (Touvron et al., 2023) building blocks. For text-conditioned image generation, we start with the 1B model and remove all AdaLN (Peebles & Xie, 2023) layers, resulting in a 30% reduction in parameters. ... We utilize sinusoidal PE for step embeddings, which naturally accommodates varying sampling steps in 256/512px (10 steps) and 1024px (14 steps) generation. For token index embeddings, we implement a hybrid approach: 1D rotary embeddings for text tokens and 2D rotary embeddings (Sun et al., 2024; Ma et al., 2024a; Wang et al., 2024) for visual tokens. ... We found that discarding 80% of the tokens (on average) in the final step and applying supervision only to the remaining tokens during training does not degrade performance. ... HART achieves optimal quality with just 8 sampling steps at inference, compared to MAR's 30-50 steps."
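The experiment-setup excerpt mentions two positional-embedding mechanisms: sinusoidal embeddings for diffusion step indices (which work for any step count, e.g. 10 or 14) and a hybrid rotary scheme (1D for text tokens, 2D for visual tokens). A minimal NumPy sketch of what such a scheme could look like; the function names, the split-half channel layout, and the shapes are illustrative assumptions, not HART's actual implementation:

```python
import numpy as np

def sinusoidal_step_embedding(steps, dim, base=10000.0):
    """Classic sinusoidal embedding for diffusion step indices.

    The frequencies are fixed rather than learned per step, so the
    same function serves any number of sampling steps (10, 14, ...).
    """
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)                  # (half,)
    angles = np.asarray(steps, dtype=float)[:, None] * freqs   # (n, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def rope_1d(x, pos, base=10000.0):
    """1D rotary embedding: rotate channel pairs of x (seq, dim)
    by an angle proportional to each token's position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(pos, dtype=float)[:, None] * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D rotary embedding for visual tokens: half of the channels
    are rotated by the row index, the other half by the column
    index, so attention scores depend on relative 2D offsets."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:, :half], rows),
                           rope_1d(x[:, half:], cols)], axis=-1)

# Text tokens use 1D positions; visual tokens in a 4x4 grid use (row, col).
q_text = rope_1d(np.random.randn(6, 16), np.arange(6))
idx = np.arange(16)
q_vis = rope_2d(np.random.randn(16, 16), idx // 4, idx % 4)
```

Each rotation preserves the per-token norm, so rotary embeddings inject only relative position information into query-key dot products rather than changing token magnitudes.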