PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Authors: Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Peng Gao, Hongsheng Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits generalization capabilities on unseen tasks and human instructions. (Sec. 4, Experiments) |
| Researcher Affiliation | Academia | Weifeng Lin¹, Xinyu Wei², Renrui Zhang¹, Le Zhuo¹,³, Shitian Zhao³, Siyuan Huang³, Junlin Xie³, Peng Gao³, Hongsheng Li¹ (¹CUHK MMLab, ²Peking University, ³Shanghai AI Laboratory) |
| Pseudocode | Yes | In the implementation of the Multi-Hot Gumbel-Softmax (MHGS), the pseudocode is defined as follows: `def MHGS(logits, temp=1, dim=-1, sample_tokens=16):` add Gumbel noise and scale by temperature: `gumbels = (logits + gumbel_noise(shape=logits.shape)) / temp`; apply softmax to obtain soft outputs: `y_soft = softmax(gumbels, dim=dim)`; select top-k values for the discrete output: `indices = top_k(y_soft, k=sample_tokens, dim=dim)`; create a hard multi-hot tensor from the indices: `y_hard = multi_hot(indices, shape=logits.shape)`; combine hard and soft outputs via the straight-through estimator, preserving gradients through the soft path: `ret = y_hard - stop_gradient(y_soft) + y_soft` |
| Open Source Code | No | The paper does not provide explicit statements or links to its own open-source code. It mentions open-source datasets and third-party tools but not its own implementation. |
| Open Datasets | Yes | To equip our image-to-image visual assistant with comprehensive capabilities in image generation, manipulation, and translation, we compiled a multi-task training dataset for visual instruction tuning, consisting of 30 million instances across seven primary domains, as illustrated in Fig. 1. The result is a user-friendly image-instruction-image triplet dataset, built from both open-source and in-house data, filtered with the help of MLLMs and manual review. All open-source datasets we use are provided in Sec. B.1. Image Grounding. The data for this part is sourced from well-known datasets such as gRefCOCO (Liu et al., 2023a), RefCOCO (Yu et al., 2016), and Visual Genome (Krishna et al., 2017). |
| Dataset Splits | Yes | For image grounding, we evaluate referring segmentation tasks on the gRefCOCO (Liu et al., 2023a), RefCOCO, and RefCOCO+ validation and test sets. Few-shot Examples. Following Emu Edit's experimental setup (Sheynin et al., 2024), we further validated PixWizard's generalizability from a few-shot learning perspective. We prepared an object contour detection task as an illustration. We constructed 50 training samples for this task (all with red contours) and added them to our training dataset, then fine-tuned PixWizard. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks used (e.g., Gemma-2B, CLIP, SDXL, ControlNet) but does not specify software dependency versions like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In the first stage, we initialize the model by combining the weights of a pre-trained Lumina-Next-T2I (Zhuo et al., 2024) with randomly initialized weights for the newly added modules. We prioritize tasks with smaller datasets, assigning each a sampling weight that determines how many times the dataset is repeated during an epoch; using this method, each task achieves approximately 20k data points. We then randomly sample from other tasks to match this scale, creating our first-stage training dataset, with training spanning 4 epochs. During training, we randomly drop the image condition c_I or the text condition c_T for 5% of examples each, and drop both conditions for a further 5% of examples. Given computational limitations, we conducted the ablation on PixWizard, training it for 40k steps. All experiments were conducted at a resolution of 512 × 512. |
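The MHGS pseudocode reported in the table can be made concrete. Below is a NumPy sketch of the forward pass (function and variable names are ours, not from the paper's code); numerically, `y_hard - stop_gradient(y_soft) + y_soft` equals `y_hard`, so the forward output is simply the multi-hot tensor, and the straight-through subtraction only matters for gradient flow in an autodiff framework.

```python
import numpy as np

def multi_hot_gumbel_softmax(logits, temp=1.0, sample_tokens=16, rng=None):
    """Forward pass of a Multi-Hot Gumbel-Softmax (NumPy sketch).

    In an autodiff framework, the return would instead be
    y_hard - stop_gradient(y_soft) + y_soft, so that gradients
    flow through the soft probabilities while the forward value
    stays hard.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbels = -np.log(-np.log(u))
    scores = (logits + gumbels) / temp
    # numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    y_soft = np.exp(scores)
    y_soft /= y_soft.sum(axis=-1, keepdims=True)
    # top-k indices -> hard multi-hot tensor
    top = np.argpartition(-y_soft, sample_tokens - 1, axis=-1)[..., :sample_tokens]
    y_hard = np.zeros_like(y_soft)
    np.put_along_axis(y_hard, top, 1.0, axis=-1)
    return y_hard

out = multi_hot_gumbel_softmax(np.random.randn(2, 64), sample_tokens=16)
```

Each output row is a 0/1 vector with exactly `sample_tokens` ones, which is what makes the sample "multi-hot" rather than the single-hot output of the standard Gumbel-Softmax.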
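The first-stage balancing scheme quoted under "Experiment Setup" (repeat small tasks via a sampling weight until each contributes ~20k instances) can be sketched as follows; the function name, task names, and sizes are hypothetical illustrations, not from the paper.

```python
# Hypothetical sketch of the first-stage dataset balancing: each small
# task gets a repeat factor so it contributes roughly TARGET samples.
TARGET = 20_000

def sampling_weights(task_sizes):
    """Return per-task repeat factors so each task yields ~TARGET samples.

    Tasks already larger than TARGET keep weight 1; per the paper they
    are randomly subsampled down to the same scale instead.
    """
    return {task: max(1, round(TARGET / n)) for task, n in task_sizes.items()}

sizes = {"grounding": 5_000, "deraining": 2_500, "text2image": 1_000_000}
weights = sampling_weights(sizes)
```

Here the 5k-sample task would be repeated 4 times and the 2.5k-sample task 8 times per epoch, while the large text-to-image set keeps weight 1 and is subsampled.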