IMProv: Inpainting-based Multimodal Prompting for Computer Vision Tasks

Authors: Jiarui Xu, Yossi Gandelsman, Amir Bar, Jianwei Yang, Jianfeng Gao, Trevor Darrell, Xiaolong Wang

TMLR 2024

Reproducibility Variable Result LLM Response
Research Type Experimental We show that training our model with text conditioning and scaling the dataset size improves in-context learning for computer vision tasks by over +10% AP for Foreground Segmentation, over +5% gains in AP for Single Object Detection, and almost 20% lower LPIPS in Colorization. Our empirical results suggest that vision and language prompts are complementary and it is advantageous to use both to achieve better in-context learning performance.
Researcher Affiliation Collaboration Jiarui Xu, UC San Diego; Yossi Gandelsman, UC Berkeley; Amir Bar, UC Berkeley; Jianwei Yang, Microsoft Research; Jianfeng Gao, Microsoft Research; Trevor Darrell, UC Berkeley; Xiaolong Wang, UC San Diego
Pseudocode No The paper describes the model architecture and training process in text and a diagram (Figure 2), but does not include any explicit pseudocode or algorithm blocks.
Open Source Code No The paper does not provide an explicit statement about the release of source code for the described methodology, nor does it include a link to a code repository. The Open Review link is for peer review, not code.
Open Datasets Yes We train an IMProv with a ViT-L backbone on a combination of our CCVF, S2CV dataset and LAION-400M (Schuhmann et al., 2021)... We use the Pascal VOC 2012 dataset (Everingham et al., 2015)... We randomly sampled 1000 example pairs and image query from ImageNet (Russakovsky et al., 2015) validation set... We follow the evaluation protocol of Bar et al. (2022) and test IMProv on four splits of Pascal-5i dataset (Shaban et al., 2017).
Dataset Splits Yes We follow the evaluation protocol of Bar et al. (2022) and test IMProv on four splits of Pascal-5i dataset (Shaban et al., 2017)... We randomly sampled 1000 example pairs and image query from ImageNet (Russakovsky et al., 2015) validation set and converted them to gray-scale to obtain gray-scale and color version for each image.
Hardware Specification Yes We train our models on one machine with 8 A100 GPUs with a batch size of 2048 for 150k iterations.
Software Dependencies No The paper mentions using the AdamW optimizer and pre-trained models like CLIP and VQGAN, but it does not specify version numbers for any core software dependencies like programming languages (e.g., Python), deep learning frameworks (e.g., PyTorch, TensorFlow), or CUDA versions.
Experiment Setup Yes We use AdamW (Loshchilov & Hutter, 2017) optimizer with a learning rate of 2e-4 and weight decay of 0.05. We train our models on one machine with 8 A100 GPUs with a batch size of 2048 for 150k iterations. Our learning-rate schedule consists of 2k linear warm-up steps followed by a cosine learning rate decay. During training, we drop the text conditioning with a probability of 0.1. During training, the input image x is split into patches and randomly masked by dropping a fixed percent of the patches (75% in our experiments).
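The quoted setup fully pins down the learning-rate schedule (2k linear warm-up steps, cosine decay, base LR 2e-4, 150k total iterations). A minimal sketch of that schedule, assuming a standard cosine decay to zero after warm-up (the paper does not state the final LR, so decay-to-zero is an assumption):

```python
import math

# Values taken from the quoted experiment setup.
BASE_LR = 2e-4
WARMUP_STEPS = 2_000
TOTAL_STEPS = 150_000

def lr_at(step: int) -> float:
    """Learning rate at a given 0-indexed training step."""
    if step < WARMUP_STEPS:
        # Linear warm-up from ~0 up to BASE_LR.
        return BASE_LR * (step + 1) / WARMUP_STEPS
    # Cosine decay from BASE_LR toward 0 over the remaining steps
    # (final-LR-of-zero is an assumption, not stated in the paper).
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))
```

The LR rises linearly for the first 2k steps, peaks at 2e-4, then follows a half-cosine down toward zero at step 150k.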