HART: Efficient Visual Generation with Hybrid Autoregressive Transformer

Authors: Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, Song Han

ICLR 2025

Reproducibility Variable / Result / LLM Response:

Research Type: Experimental
"In this section, we evaluate HART's performance in tokenization and generation. For generation, we present both text-to-image and class-conditioned image generation results."

Researcher Affiliation: Collaboration
Haotian Tang (MIT), Yecheng Wu (MIT, Tsinghua University), Shang Yang (MIT), Enze Xie (NVIDIA), Junsong Chen (NVIDIA), Junyu Chen (MIT, Tsinghua University), Zhuoyang Zhang (MIT), Han Cai (NVIDIA), Yao Lu (NVIDIA), Song Han (MIT, NVIDIA)

Pseudocode: No
"The paper describes the methodology and architecture through text and figures (e.g., Figures 5 and 6), but it does not include any explicitly labeled pseudocode or algorithm blocks."

Open Source Code: Yes
"Our code is open sourced at https://github.com/mit-han-lab/hart."

Open Datasets: Yes
"We evaluate HART on ImageNet (Deng et al., 2009) for class-conditioned image generation, and on MJHQ-30K (Li et al., 2024a), GenEval (Ghosh et al., 2024), and DPG-Bench (Hu et al., 2024) for text-to-image generation. The HART tokenizer is trained on Open Images (Kuznetsova et al., 2020). For HART transformer training, we utilize ImageNet, JourneyDB (Pan et al., 2023), and internal MidJourney-style synthetic data."

Dataset Splits: No
"The paper lists the datasets used for evaluation and training but does not specify how they were split into training, validation, or test sets (e.g., percentages or exact counts) for reproduction."

Hardware Specification: Yes
"Latency and throughput (batch=8) measurements are conducted on NVIDIA A100. We thank NVIDIA for donating the DGX server."

Software Dependencies: No
"The paper mentions specific models and architectures such as Qwen2-1.5B (Yang et al., 2024) and Llama-style (Touvron et al., 2023) building blocks, but it does not specify version numbers for programming languages, libraries, or frameworks (e.g., Python, PyTorch, CUDA)."

Experiment Setup: Yes
"For class-conditioned image generation, we follow VAR (Tian et al., 2024) to construct HART models with AR transformers of varying parameter sizes: 600M, 1B, and 2B. The diffusion MLP contains an additional 37M parameters. We replace VAR's attention and FFN blocks with Llama-style (Touvron et al., 2023) building blocks. For text-conditioned image generation, we start with the 1B model and remove all AdaLN (Peebles & Xie, 2023) layers, resulting in a 30% reduction in parameters. ... We utilize sinusoidal PE for step embeddings, which naturally accommodates varying sampling steps in 256/512px (10 steps) and 1024px (14 steps) generation. For token index embeddings, we implement a hybrid approach: 1D rotary embeddings for text tokens and 2D rotary embeddings (Sun et al., 2024; Ma et al., 2024a; Wang et al., 2024) for visual tokens. ... We found that discarding 80% of the tokens (on average) in the final step and applying supervision only to the remaining tokens during training does not degrade performance. ... HART achieves optimal quality with just 8 sampling steps at inference, compared to MAR's 30-50 steps."
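The experiment-setup excerpt mentions two positional-embedding mechanisms: sinusoidal embeddings for diffusion step indices (which work for any step count, e.g. 10 or 14) and a hybrid rotary scheme (1D for text tokens, 2D for visual tokens). A minimal NumPy sketch of what such a scheme could look like; the function names, the split-half channel layout, and the shapes are illustrative assumptions, not HART's actual implementation:

```python
import numpy as np

def sinusoidal_step_embedding(steps, dim, base=10000.0):
    """Classic sinusoidal embedding for diffusion step indices.

    The frequencies are fixed rather than learned per step, so the
    same function serves any number of sampling steps (10, 14, ...).
    """
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)                  # (half,)
    angles = np.asarray(steps, dtype=float)[:, None] * freqs   # (n, half)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def rope_1d(x, pos, base=10000.0):
    """1D rotary embedding: rotate channel pairs of x (seq, dim)
    by an angle proportional to each token's position."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)
    angles = np.asarray(pos, dtype=float)[:, None] * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin,
                           x1 * sin + x2 * cos], axis=-1)

def rope_2d(x, rows, cols):
    """2D rotary embedding for visual tokens: half of the channels
    are rotated by the row index, the other half by the column
    index, so attention scores depend on relative 2D offsets."""
    half = x.shape[-1] // 2
    return np.concatenate([rope_1d(x[:, :half], rows),
                           rope_1d(x[:, half:], cols)], axis=-1)

# Text tokens use 1D positions; visual tokens in a 4x4 grid use (row, col).
q_text = rope_1d(np.random.randn(6, 16), np.arange(6))
idx = np.arange(16)
q_vis = rope_2d(np.random.randn(16, 16), idx // 4, idx % 4)
```

Each rotation preserves the per-token norm, so rotary embeddings inject only relative position information into query-key dot products rather than changing token magnitudes.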