Orthus: Autoregressive Interleaved Image-Text Generation with Modality-Specific Heads

Authors: Siqi Kou, Jiachun Jin, Zhihong Liu, Chang Liu, Ye Ma, Jian Jia, Quan Chen, Peng Jiang, Zhijie Deng

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We have performed extensive studies to evaluate Orthus. For visual understanding and generation, Orthus achieves a GenEval score of 0.58 and an MME-P score of 1265.8 using 7B parameters, outperforming competing baselines including Show-o and Chameleon. Our code is available at https://github.com/zhijie-group/Orthus. Sections 5, 5.2, 5.3, and 5.4 are dedicated to Experiments, Interleaved Image-Text Generation, Visual Understanding and Generation, and Ablation Studies, respectively, all detailing empirical evaluations and results.
Researcher Affiliation | Collaboration | Siqi Kou 1*, Jiachun Jin 1, Zhihong Liu 1, Chang Liu 1*, Ye Ma 2, Jian Jia 2, Quan Chen 2, Peng Jiang 2, Zhijie Deng 1. *Work done during an internship at Kuaishou Technology. 1 Qing Yuan Research Institute, Shanghai Jiao Tong University; 2 Kuaishou Technology. Correspondence to: Zhijie Deng <EMAIL>.
Pseudocode | No | The paper describes the architecture and methodology of Orthus, including the components and how they interact (e.g., 'As shown in Figure 2, Orthus directly takes the continuous image features V and discrete text tokens U as input...'). However, it does not present any formal pseudocode or algorithm blocks with structured, code-like steps.
Open Source Code | Yes | Our code is available at https://github.com/zhijie-group/Orthus.
Open Datasets | Yes | The images for training Orthus-base are the first 10k from laion-coco-aesthetic. Orthus-base is further trained on the 400k InstructPix2Pix (Brooks et al., 2023) training dataset. We fine-tune Orthus-base with the unified learning objective on the StoryStream (Yang et al., 2024b) dataset. Visual understanding and generation: Orthus-base is post-trained with a mixture of LLaVA-v1.5-665K (Liu et al., 2024d) and high-quality text-to-image data (JourneyDB (Sun et al., 2024a) and LAION-COCO-aesthetic, recaptioned with ShareGPT-4V (Chen et al., 2023a)).
Dataset Splits | Yes | The evaluation is conducted on a subset of LAION-Aesthetic, consisting of 10,000 images that are excluded from the training dataset.
Hardware Specification | Yes | Initialized from the typical Chameleon-7B (Team, 2024), Orthus acquires image processing capabilities while preserving its text generation capacity after 9 hours of training on 10k high-quality images (laion-coco-aesthetic) using 8 A100 GPUs. Both training and evaluation are carried out on servers equipped with 8 NVIDIA A100 80GB GPUs.
Software Dependencies | No | The paper mentions using 'AdamW' as the optimizer but does not name any software libraries (e.g., PyTorch, TensorFlow) or their version numbers. No other software dependencies with specific version numbers are provided.
Experiment Setup | Yes | The diffusion noise schedule is linear following (Rombach et al., 2022), with 1000 steps at training time. λ is set to 100 to balance the orders of magnitude between Ldiff and Lar during post-training. During inference, we use greedy decoding to generate text. For image generation, we adopt the DDIM (Song et al., 2020a) sampler with 100 steps. We employ classifier-free guidance (CFG) (Ho & Salimans, 2022) with the scale set to 5 during sampling. All images are generated at a resolution of 512×512. Table 7 (two training stages): Optimizer AdamW (β1 = 0.9, β2 = 0.99); Learning Rate 1e-4 / 1e-5; Batch Size 32 / 16; Training Steps 15,000 / 35,000.
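The Experiment Setup row can be sketched as code: a minimal, self-contained illustration of the three numeric choices it reports (a linear 1000-step noise schedule, the λ = 100 weighting between the autoregressive and diffusion losses, and classifier-free guidance with scale 5). The schedule endpoints (1e-4, 2e-2) are common defaults and an assumption here, not values stated in the paper; this is not the Orthus implementation.

```python
import numpy as np

def linear_beta_schedule(T=1000, beta_start=1e-4, beta_end=2e-2):
    """Linear diffusion noise schedule with T=1000 training steps.
    The endpoint values are common defaults, assumed for illustration."""
    return np.linspace(beta_start, beta_end, T)

def unified_loss(l_ar, l_diff, lam=100.0):
    """Unified training objective: next-token (autoregressive) loss plus
    a lambda-weighted diffusion loss, with lambda = 100 as reported."""
    return l_ar + lam * l_diff

def cfg_noise(eps_uncond, eps_cond, scale=5.0):
    """Classifier-free guidance at sampling time: extrapolate the noise
    prediction from the unconditional toward the conditional branch,
    with guidance scale 5 as reported."""
    return eps_uncond + scale * (eps_cond - eps_uncond)
```

With scale = 1 the guidance reduces to the plain conditional prediction; larger scales trade diversity for prompt adherence, which is why the report calls out the specific value 5.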