HART: Efficient Visual Generation with Hybrid Autoregressive Transformer
Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we evaluate HART’s performance in tokenization and generation. For generation, we present both text-to-image and class-conditioned image generation results. |
| Researcher Affiliation | Collaboration | Haotian Tang (1), Yecheng Wu (1,3), Shang Yang (1), Enze Xie (2), Junsong Chen (2), Junyu Chen (1,3), Zhuoyang Zhang (1), Han Cai (2), Yao Lu (2), Song Han (1,2); (1) MIT, (2) NVIDIA, (3) Tsinghua University |
| Pseudocode | No | The paper describes the methodology and architecture through text and figures (e.g., Figure 5 and 6), but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is open sourced at https://github.com/mit-han-lab/hart. |
| Open Datasets | Yes | We evaluate HART on ImageNet (Deng et al., 2009) for class-conditioned image generation, and on MJHQ-30K (Li et al., 2024a), GenEval (Ghosh et al., 2024), and DPG-Bench (Hu et al., 2024) for text-to-image generation. The HART tokenizer is trained on Open Images (Kuznetsova et al., 2020). For HART transformer training, we utilize ImageNet, JourneyDB (Pan et al., 2023), and internal MidJourney-style synthetic data. |
| Dataset Splits | No | The paper lists various datasets used for evaluation and training but does not provide specific details on how these datasets were split into training, validation, or testing sets (e.g., percentages or exact counts) for reproduction. |
| Hardware Specification | Yes | Latency and throughput (batch=8) measurements are conducted on NVIDIA A100. We thank NVIDIA for donating the DGX server. |
| Software Dependencies | No | The paper mentions using specific models and architectures like "Qwen2-1.5B (Yang et al., 2024)" and "Llama-style (Touvron et al., 2023) building blocks," but it does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | For class-conditioned image generation models, we follow VAR (Tian et al., 2024) to construct HART models with varying parameter sizes in the AR transformer: 600M, 1B, and 2B. The diffusion MLP contains an additional 37M parameters. We replace VAR’s attention and FFN blocks with Llama-style (Touvron et al., 2023) building blocks. For text-conditioned image generation, we start with the 1B model and remove all AdaLN (Peebles & Xie, 2023) layers, resulting in a 30% reduction in parameters. ... We utilize sinusoidal PE for step embeddings, which naturally accommodates varying sampling steps in 256/512px (10 steps) and 1024px (14 steps) generation. For token index embeddings, we implement a hybrid approach: 1D rotary embeddings for text tokens and 2D rotary embeddings (Sun et al., 2024; Ma et al., 2024a; Wang et al., 2024) for visual tokens. ... we found that discarding 80% of the tokens (on average) in the final step and applying supervision only to the remaining tokens during training does not degrade performance. ... HART achieves optimal quality with just 8 sampling steps at inference, compared to MAR’s 30-50 steps. |
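The hybrid token-index embedding described in the Experiment Setup row (1D rotary for text tokens, 2D rotary for visual tokens) can be sketched as below. This is a minimal illustration, not HART's actual code: the pairing convention (first half/second half of channels), frequency base, and the row/column channel split for the 2D variant are assumptions for exposition.

```python
import numpy as np

def rotary_1d(x, pos, base=10000.0):
    """Rotate channel pairs of x (..., d) by angles proportional to integer positions pos."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)      # per-pair rotation frequencies
    angles = pos[..., None] * freqs                # (..., half)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., :half], x[..., half:]
    # standard 2D rotation applied independently to each (x1_j, x2_j) pair
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rotary_2d(x, rows, cols):
    """2D rotary: rotate the first half of channels by row index, the second by column index."""
    half = x.shape[-1] // 2
    return np.concatenate(
        [rotary_1d(x[..., :half], rows), rotary_1d(x[..., half:], cols)], axis=-1
    )

# Hybrid scheme: 1D positions for text tokens, (row, col) positions for visual tokens.
d = 8
text = np.random.randn(3, d)                       # 3 text tokens
text_out = rotary_1d(text, np.arange(3))

H = W = 4
visual = np.random.randn(H * W, d)                 # 16 visual tokens on a 4x4 grid
rows = np.repeat(np.arange(H), W)                  # row index of each token
cols = np.tile(np.arange(W), H)                    # column index of each token
visual_out = rotary_2d(visual, rows, cols)
```

Because rotary embeddings are pure rotations, they preserve each token's norm and leave position 0 unchanged, which makes them easy to sanity-check.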