A Dual Stream Visual Tokenizer for LLM Image Generation
Authors: Yongqian Li, Yong Luo, Xiantao Cai, Zheng He, Zhennan Meng, Nidong Wang, Yunlin Chen, Zhifei Li
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiment 4.1 Comparison with SOTA Methods We have presented a comparison of our model with several state-of-the-art (SOTA) diffusion-based models, with qualitative comparisons illustrated in Figure 2. It can be observed that SEED and LaVIT, though preserving semantic similarity with the original image during reconstruction, show significant structural differences (such as pose, orientation, and color). With structural guidance, our model shows notable improvement in structural reconstruction over those two. Table 2 displays comparisons across several pixel-level reconstruction metrics, including SSIM and PSNR. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University; (2) Mobvoi Innovation Technology Company Limited. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the method using text and mathematical equations, but it does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | We fine-tuned this branch on ImageNet for several epochs. During this process, we accounted for the difference in size between the reconstructed image and the original image. [...] We used 5,000 image-text pairs from the COCO dataset, with captions serving as prompts, and evaluated the restored images. |
| Dataset Splits | No | The paper mentions using 5,000 image-text pairs from the COCO dataset and 25 sets of images from ImageNet for evaluation, but it does not specify any training/validation/test splits, their percentages, or sample counts for model reproduction. |
| Hardware Specification | No | The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. (Only the computing facility is named; no GPU/CPU models, counts, or memory figures are given.) |
| Software Dependencies | No | The paper describes the model architecture and training process, but it does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper describes the overall architecture, loss functions (Lrecon, Lcontrastive, Lcosine, MSE), and training steps (e.g., fine-tuning on ImageNet for several epochs, training the Causal Q-Former on 5 million image-text pairs). However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs for specific training phases, or optimizer details (e.g., Adam with specific epsilon and beta values). |
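The pixel-level reconstruction metrics cited in the review (Table 2 of the paper) are standard. As a reference point for reproduction attempts, here is a minimal sketch of PSNR in plain Python; the function name and the flat-pixel-sequence interface are illustrative choices, not from the paper, and SSIM would typically come from a library such as scikit-image rather than being hand-rolled.

```python
import math

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences.

    PSNR = 10 * log10(max_val^2 / MSE); higher means a closer
    pixel-level reconstruction. Returns infinity for identical inputs.
    """
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")  # perfect reconstruction
    return 10 * math.log10(max_val ** 2 / mse)
```

For example, a reconstruction that is uniformly off by 16 gray levels on 8-bit images has MSE = 256 and therefore PSNR ≈ 24 dB, which gives a feel for the scale of the values reported in such comparisons.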