Hierarchical Masked Autoregressive Models with Low-Resolution Token Pivots
Authors: Guangting Zheng, Yehao Li, Yingwei Pan, Jiajun Deng, Ting Yao, Yanyong Zhang, Tao Mei
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive evaluations on both class-conditional and text-to-image generation tasks demonstrate that Hi-MAR outperforms typical AR baselines, while requiring fewer computational costs. Code is available at https://github.com/HiDream-ai/himar. [...] 4. Experiments 4.1. Datasets [...] 4.3. Results on Class-Conditional Image Generation [...] 4.4. Results on Text-to-Image Generation [...] 4.5. Experimental Analysis Ablation Study. |
| Researcher Affiliation | Collaboration | 1University of Science and Technology of China, Anhui, China 2HiDream.ai Inc, Beijing, China 3The University of Adelaide, Adelaide, Australia. |
| Pseudocode | No | The paper describes the model architecture and methodology in detail within Section 3, and presents visual pipelines in Figure 2, but it does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Code is available at https://github.com/HiDream-ai/himar. |
| Open Datasets | Yes | We empirically verify the merit of hierarchical masked autoregressive models for image generation in comparison with state-of-the-art approaches on two datasets, i.e., ImageNet (Deng et al., 2009) and MS-COCO (Lin et al., 2014). |
| Dataset Splits | Yes | For class-conditional image generation, we validate Hi-MAR on ImageNet at 256×256 resolution, which consists of 1,281,167 training images from 1K different classes. For text-to-image generation, we evaluate Hi-MAR on MS-COCO at 256×256, which is composed of 82,783 training images and 40,504 validation images. |
| Hardware Specification | Yes | At training stage, we conduct all experiments on 80GB-H100 GPUs. For class-conditional image generation on ImageNet, we follow MAR (Li et al., 2024) and train the models using AdamW optimizer (β1 = 0.9, β2 = 0.95) with 0.02 weight decay for 800 epochs. We use the constant lr schedule with a 1e-4 learning rate and 100-epoch linear warmup. [...] We measure the speed on ImageNet 256×256 using one H100 GPU with batch size 128. |
| Software Dependencies | No | The paper mentions the use of the 'AdamW optimizer' but does not specify version numbers for any software libraries, frameworks (like PyTorch or TensorFlow), or programming languages used. |
| Experiment Setup | Yes | At training stage, we conduct all experiments on 80GB-H100 GPUs. For class-conditional image generation on ImageNet, we follow MAR (Li et al., 2024) and train the models using AdamW optimizer (β1 = 0.9, β2 = 0.95) with 0.02 weight decay for 800 epochs. We use the constant lr schedule with a 1e-4 learning rate and 100-epoch linear warmup. In the first phase, the masking ratio is randomly sampled in [0.7, 1.0] as MAR, while the second phase uses the cosine masking strategy following MaskGIT (Chang et al., 2022). For text-to-image generation on MS-COCO, we follow AutoNAT-L (Ni et al., 2024) and randomly sample the masking ratio by Beta distribution (α = 4, β = 1). The AdamW optimizer is adopted with an 8e-4 learning rate, 0.03 weight decay and 8K-step linear warmup. The exponential moving average is adopted with a momentum of 0.9999. At inference, we use 32 and 4 steps for the first and second phases with a cosine schedule. |
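The masking-ratio sampling and decoding schedule quoted in the Experiment Setup row can be summarized in a short sketch. The helper names below are hypothetical (the actual implementation lives in the linked repository); only the distributions and schedule are taken from the quoted setup: a uniform ratio in [0.7, 1.0] for class-conditional training (as in MAR), a Beta(α = 4, β = 1) ratio for text-to-image training (as in AutoNAT-L), and a cosine masking schedule at inference (as in MaskGIT).

```python
import math
import random

def sample_mask_ratio_imagenet() -> float:
    """Class-conditional training (phase 1): masking ratio
    drawn uniformly from [0.7, 1.0], following MAR."""
    return random.uniform(0.7, 1.0)

def sample_mask_ratio_coco() -> float:
    """Text-to-image training: masking ratio drawn from
    Beta(alpha=4, beta=1), following AutoNAT-L."""
    return random.betavariate(4, 1)

def cosine_mask_schedule(step: int, total_steps: int) -> float:
    """Fraction of tokens still masked after `step` of `total_steps`
    decoding steps, using the cosine schedule from MaskGIT:
    starts at 1.0 (all masked) and decays to 0.0."""
    return math.cos(math.pi / 2 * step / total_steps)

# The paper reports 32 decoding steps for the first (low-resolution)
# phase and 4 for the second phase at inference.
PHASE1_STEPS, PHASE2_STEPS = 32, 4
```

Note that the schedule is monotone decreasing, so the model predicts the easiest tokens first and progressively unmasks the rest over the 32 (or 4) steps.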