Fractal Generative Models
Authors: Tianhong Li, Qinyi Sun, Lijie Fan, Kaiming He
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments show strong performance in both likelihood estimation and generation quality. We conduct extensive experiments on the ImageNet dataset (Deng et al., 2009) with resolutions at 64×64 and 256×256. Our evaluation includes both unconditional and class-conditional image generation, covering various aspects of the model such as likelihood estimation, fidelity, diversity, and generation quality. Accordingly, we report the negative log-likelihood (NLL), Fréchet Inception Distance (FID) (Heusel et al., 2017), Inception Score (IS) (Salimans et al., 2016), Precision and Recall (Dhariwal & Nichol, 2021a), and visualization results for a comprehensive assessment of our fractal framework. |
| Researcher Affiliation | Collaboration | Tianhong Li (MIT), Qinyi Sun (MIT), Lijie Fan (Google DeepMind), Kaiming He (MIT) |
| Pseudocode | No | The paper describes implementation details and processes in prose within Appendix A, but it does not contain any clearly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We also provide our source code in the supplementary materials. All codes and models will be made publicly available. |
| Open Datasets | Yes | We conduct extensive experiments on the ImageNet dataset (Deng et al., 2009) with resolutions at 64×64 and 256×256. |
| Dataset Splits | Yes | We conduct extensive experiments on the ImageNet dataset (Deng et al., 2009) with resolutions at 64×64 and 256×256. More fractal levels achieve better likelihood estimation performance with lower computational costs, measured on the unconditional ImageNet 64×64 test set. |
| Hardware Specification | Yes | The training time is measured per training iteration on 1 H100 GPU with batch size 8. FractalMAR-H achieves an FID of 6.15 and an Inception Score of 348.9, with an average throughput of 1.29 seconds per image (evaluated at a batch size of 1,024 on a single Nvidia H100 PCIe GPU). The 64×64 model takes 3.5 days, and the 256×256 FractalMAR-L model takes 7.6 days on 32 H100 GPUs. |
| Software Dependencies | No | The paper mentions using the AdamW optimizer and various model architectures, but it does not specify any software names with version numbers (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | We train our fractal model end-to-end directly on raw image pixels following a breadth-first manner through the fractal architecture. ... The models are trained using the AdamW optimizer (Loshchilov & Hutter, 2019) for 800 epochs (the FractalMAR-H model is trained for 600 epochs). The weight decay and momenta for AdamW are 0.05 and (0.9, 0.95). We use a batch size of 2048 for ImageNet 64×64 and 1024 for ImageNet 256×256, and a base learning rate (lr) of 5e-5 (scaled by batch size divided by 256). The model is trained with 40 epochs linear lr warmup (Goyal et al., 2017), followed by a cosine lr schedule. |
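The learning-rate schedule quoted in the Experiment Setup row (batch-size-scaled base lr, 40-epoch linear warmup, then cosine decay) can be sketched as follows. This is a minimal illustration assuming a decay to zero at the final epoch; the function name `lr_at_epoch` and its defaults are ours, not taken from the authors' released code.

```python
import math

def lr_at_epoch(epoch, total_epochs=800, warmup_epochs=40,
                base_lr=5e-5, batch_size=2048):
    """Return the learning rate at a given (possibly fractional) epoch."""
    # Linear scaling rule: base lr is scaled by batch size / 256.
    scaled_lr = base_lr * batch_size / 256
    if epoch < warmup_epochs:
        # Linear warmup from 0 to the scaled lr over the first 40 epochs.
        return scaled_lr * epoch / warmup_epochs
    # Cosine decay from the scaled lr down to 0 over the remaining epochs.
    progress = (epoch - warmup_epochs) / (total_epochs - warmup_epochs)
    return scaled_lr * 0.5 * (1.0 + math.cos(math.pi * progress))
```

With the ImageNet 64×64 settings (batch size 2048), the peak lr is 5e-5 × 2048 / 256 = 4e-4, reached at the end of warmup.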