A Dual Stream Visual Tokenizer for LLM Image Generation
Authors: Yongqian Li, Yong Luo, Xiantao Cai, Zheng He, Zhennan Meng, Nidong Wang, Yunlin Chen, Zhifei Li
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 Experiment 4.1 Comparison with SOTA Methods We have presented a comparison of our model with several state-of-the-art (SOTA) diffusion-based models, with qualitative comparisons illustrated in Figure 2. It can be observed that SEED and LaVIT, though preserving semantic similarity with the original image during reconstruction, show significant structural differences (such as pose, orientation, and color). With structural guidance, our model shows notable improvement in structural reconstruction over those two. Table 2 displays comparisons across several pixel-level reconstruction metrics, including SSIM and PSNR. |
| Researcher Affiliation | Collaboration | (1) School of Computer Science, National Engineering Research Center for Multimedia Software and Hubei Key Laboratory of Multimedia and Network Communication Engineering, Wuhan University; (2) Mobvoi Innovation Technology Company Limited. EMAIL, EMAIL |
| Pseudocode | No | The paper describes the method using text and mathematical equations, but it does not include any explicit pseudocode blocks or algorithm listings. |
| Open Source Code | No | The paper does not contain any explicit statements about releasing source code, nor does it provide links to a code repository. |
| Open Datasets | Yes | We fine-tuned this branch on ImageNet for several epochs. During this process, we accounted for the difference in size between the reconstructed image and the original image. [...] We used 5,000 image-text pairs from the COCO dataset, with captions serving as prompts, and evaluated the restored images. |
| Dataset Splits | No | The paper mentions using 5,000 image-text pairs from the COCO dataset and 25 sets of images from ImageNet for evaluation, but it does not specify any training/validation/test splits, their percentages, or sample counts for model reproduction. |
| Hardware Specification | No | The numerical calculations in this paper have been done on the supercomputing system in the Supercomputing Center of Wuhan University. (Only the computing facility is named; no GPU/CPU models, counts, or memory figures are given.) |
| Software Dependencies | No | The paper describes the model architecture and training process, but it does not specify any software dependencies with version numbers. |
| Experiment Setup | No | The paper describes the overall architecture, loss functions (Lrecon, Lcontrastive, Lcosine, MSE), and training steps (e.g., fine-tuning on ImageNet for several epochs, training the Causal Q-Former on 5 million image-text pairs). However, it does not provide specific hyperparameters such as learning rate, batch size, number of epochs for specific training phases, or optimizer details (e.g., Adam with specific epsilon and beta values). |
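The pixel-level reconstruction metrics cited in the review (Table 2 of the paper) are standard. As a reference point for reproduction attempts, here is a minimal sketch of PSNR in plain Python; the function name and the flat-pixel-sequence interface are illustrative choices, not from the paper, and SSIM would typically come from a library such as scikit-image rather than being hand-rolled.

```python
import math

def psnr(original, reconstructed, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-length pixel sequences.

    PSNR = 10 * log10(max_val^2 / MSE); higher means a closer
    pixel-level reconstruction. Returns infinity for identical inputs.
    """
    if len(original) != len(reconstructed):
        raise ValueError("inputs must have the same number of pixels")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return float("inf")  # perfect reconstruction
    return 10 * math.log10(max_val ** 2 / mse)
```

For example, a reconstruction that is uniformly off by 16 gray levels on 8-bit images has MSE = 256 and therefore PSNR ≈ 24 dB, which gives a feel for the scale of the values reported in such comparisons.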