PixWizard: Versatile Image-to-Image Visual Assistant with Open-Language Instructions
Authors: Weifeng Lin, Xinyu Wei, Renrui Zhang, Le Zhuo, Shitian Zhao, Siyuan Huang, Junlin Xie, Peng Gao, Hongsheng Li
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our experiments demonstrate that PixWizard not only shows impressive generative and understanding abilities for images with diverse resolutions but also exhibits generalization capabilities on unseen tasks and human instructions. (Sec. 4, Experiments) |
| Researcher Affiliation | Academia | Weifeng Lin¹, Xinyu Wei², Renrui Zhang¹, Le Zhuo¹,³, Shitian Zhao³, Siyuan Huang³, Junlin Xie³, Peng Gao³, Hongsheng Li¹ (¹CUHK MMLab, ²Peking University, ³Shanghai AI Laboratory) |
| Pseudocode | Yes | In the implementation of the Multi-Hot Gumbel-Softmax (MHGS), the pseudocode is defined as follows: `def MHGS(logits, temp=1, dim=-1, sample_tokens=16):` add Gumbel noise and scale by temperature: `gumbels = (logits + gumbel_noise(shape=logits.shape)) / temp`; apply softmax to obtain soft outputs: `y_soft = softmax(gumbels, dim=dim)`; select top-k values for the discrete output: `indices = top_k(y_soft, k=sample_tokens, dim=dim)`; create a hard multi-hot tensor from the indices: `y_hard = multi_hot(indices, shape=logits.shape)`; combine hard and soft outputs via the straight-through estimator, preserving gradients through the soft path: `ret = y_hard - stop_gradient(y_soft) + y_soft` |
| Open Source Code | No | The paper does not provide explicit statements or links to its own open-source code. It mentions open-source datasets and third-party tools but not its own implementation. |
| Open Datasets | Yes | To equip our image-to-image visual assistant with comprehensive capabilities in image generation, manipulation, and translation, we compiled a multi-task training dataset for visual instruction tuning, consisting of 30 million instances across seven primary domains, as illustrated in Fig. 1. The result is a user-friendly image-instruction-image triplet dataset, built from both open-source and in-house data, filtered with the help of MLLMs and manual review. All open-source datasets we use are provided in Sec. B.1. Image Grounding. The data for this part is sourced from well-known datasets such as gRefCOCO (Liu et al., 2023a), RefCOCO (Yu et al., 2016), and Visual Genome (Krishna et al., 2017). |
| Dataset Splits | Yes | For image grounding, we evaluate referring segmentation tasks on the gRefCOCO (Liu et al., 2023a), RefCOCO, and RefCOCO+ validation and test sets. Few-shot Examples. Following Emu Edit's experimental setup (Sheynin et al., 2024), we further validated PixWizard's generalizability from a few-shot learning perspective. We prepared an object contour detection task as an illustration. We constructed 50 training samples for this task (all with red contours) and added them to our training dataset, then fine-tuned PixWizard. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU or CPU models used for running the experiments. |
| Software Dependencies | No | The paper mentions several models and frameworks used (e.g., Gemma-2B, CLIP, SDXL, ControlNet) but does not specify software dependency versions like Python, PyTorch, or CUDA versions. |
| Experiment Setup | Yes | In the first stage, we initialize the model by combining the weights of a pre-trained Lumina-Next-T2I (Zhuo et al., 2024) with randomly initialized weights for the newly added modules. We prioritize tasks with smaller datasets, assigning each a sampling weight that determines how many times the dataset is repeated during an epoch; using this method, each task achieves approximately 20k data points. We then randomly sample from other tasks to match this scale, creating our first-stage training dataset, with training spanning 4 epochs. During training, we randomly drop the image condition c_I or the text condition c_T for 5% of examples each, and drop both conditions for a further 5% of examples. Given computational limitations, we conducted the ablation on PixWizard, training it for 40k steps. All experiments were conducted at a resolution of 512 × 512. |
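The MHGS pseudocode reported in the table can be made concrete. Below is a NumPy sketch of the forward pass (function and variable names are ours, not from the paper's code); numerically, `y_hard - stop_gradient(y_soft) + y_soft` equals `y_hard`, so the forward output is simply the multi-hot tensor, and the straight-through subtraction only matters for gradient flow in an autodiff framework.

```python
import numpy as np

def multi_hot_gumbel_softmax(logits, temp=1.0, sample_tokens=16, rng=None):
    """Forward pass of a Multi-Hot Gumbel-Softmax (NumPy sketch).

    In an autodiff framework, the return would instead be
    y_hard - stop_gradient(y_soft) + y_soft, so that gradients
    flow through the soft probabilities while the forward value
    stays hard.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1)
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbels = -np.log(-np.log(u))
    scores = (logits + gumbels) / temp
    # numerically stable softmax over the last axis
    scores = scores - scores.max(axis=-1, keepdims=True)
    y_soft = np.exp(scores)
    y_soft /= y_soft.sum(axis=-1, keepdims=True)
    # top-k indices -> hard multi-hot tensor
    top = np.argpartition(-y_soft, sample_tokens - 1, axis=-1)[..., :sample_tokens]
    y_hard = np.zeros_like(y_soft)
    np.put_along_axis(y_hard, top, 1.0, axis=-1)
    return y_hard

out = multi_hot_gumbel_softmax(np.random.randn(2, 64), sample_tokens=16)
```

Each output row is a 0/1 vector with exactly `sample_tokens` ones, which is what makes the sample "multi-hot" rather than the single-hot output of the standard Gumbel-Softmax.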
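The first-stage balancing scheme quoted under "Experiment Setup" (repeat small tasks via a sampling weight until each contributes ~20k instances) can be sketched as follows; the function name, task names, and sizes are hypothetical illustrations, not from the paper.

```python
# Hypothetical sketch of the first-stage dataset balancing: each small
# task gets a repeat factor so it contributes roughly TARGET samples.
TARGET = 20_000

def sampling_weights(task_sizes):
    """Return per-task repeat factors so each task yields ~TARGET samples.

    Tasks already larger than TARGET keep weight 1; per the paper they
    are randomly subsampled down to the same scale instead.
    """
    return {task: max(1, round(TARGET / n)) for task, n in task_sizes.items()}

sizes = {"grounding": 5_000, "deraining": 2_500, "text2image": 1_000_000}
weights = sampling_weights(sizes)
```

Here the 5k-sample task would be repeated 4 times and the 2.5k-sample task 8 times per epoch, while the large text-to-image set keeps weight 1 and is subsampled.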