Re-Thinking Inverse Graphics With Large Language Models
Authors: Peter Kulits, Haiwen Feng, Weiyang Liu, Victoria Fernandez Abrevaya, Michael J. Black
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through our investigation, we demonstrate the potential of LLMs to facilitate inverse graphics through next-token prediction, without the application of image-space supervision. Our analysis enables new possibilities for precise spatial reasoning about images that exploit the visual knowledge of LLMs. We release our code and data at https://ig-llm.is.tue.mpg.de/ to ensure the reproducibility of our investigation and to facilitate future research. ... To evaluate the ability of our proposed framework to generalize across distribution shifts, we design a number of focused evaluation settings. We conduct experiments on synthetic data in order to quantitatively analyze model capability under controlled shifts. |
| Researcher Affiliation | Academia | Peter Kulits* EMAIL Max Planck Institute for Intelligent Systems, Tübingen, Germany. Haiwen Feng* EMAIL Max Planck Institute for Intelligent Systems, Tübingen, Germany. Weiyang Liu EMAIL Max Planck Institute for Intelligent Systems, Tübingen, Germany, University of Cambridge. Victoria Abrevaya EMAIL Max Planck Institute for Intelligent Systems, Tübingen, Germany. Michael J. Black EMAIL Max Planck Institute for Intelligent Systems, Tübingen, Germany. |
| Pseudocode | No | The paper describes a framework and its components using figures (e.g., Figure 1 and 2) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | Yes | We release our code and data at https://ig-llm.is.tue.mpg.de/ to ensure the reproducibility of our investigation and to facilitate future research. |
| Open Datasets | Yes | CLEVR (Johnson et al., 2017) is a procedurally generated dataset of simple 3D objects on a plane. ...incorporating objects sourced from ShapeNet (Chang et al., 2015). |
| Dataset Splits | Yes | We train both our proposed framework and NS-VQA, our neural-scene de-rendering baseline, on 4k images from the ID condition and evaluate them on 1k images from both the ID and OOD conditions. ... we create a dataset comprising 10k images... render a training dataset of one-million images. ...render 100k training images and evaluate the framework on three conditions, each with 1k images |
| Hardware Specification | No | The paper mentions using "DeepSpeed ZeRO-2" as a memory optimization technique, but it does not specify any particular hardware components like GPU models, CPU types, or cloud computing instances used for the experiments. |
| Software Dependencies | Yes | We finetune the LLaMA 1-based Vicuna 1.3 model (footnote 2) with LoRA (Hu et al., 2022a). We use the Hugging Face Transformers and PEFT libraries, along with DeepSpeed ZeRO-2 (Rajbhandari et al., 2020). ... The frozen CLIP visual tokenizer from footnote 3. (Footnote 2 points to https://huggingface.co/lmsys/vicuna-7b-v1.3 and Footnote 3 points to https://huggingface.co/openai/clip-vit-large-patch14-336) |
| Experiment Setup | Yes | In all experiments, we use a lora_r of 128, a lora_alpha of 256, a LoRA learning rate of 2e-05, a linear projector learning rate of 2e-05, a numeric head learning rate of 2e-04, and a cosine learning-rate schedule. All models are trained with an effective batch size of 32 with bfloat16 mixed-precision training. Both the cross-entropy next-token-prediction and mean-square-error (MSE) losses are given a weight of 1. The models for the CLEVR and parameter-space generalization experiments are trained for 40k steps. The single-object 6-DoF pose-estimation model is trained for 200k and the scene-level ShapeNet model for 500k steps. |
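The hyperparameters quoted in the Experiment Setup row can be collected into a small configuration sketch. This is a minimal, standalone illustration, not the authors' code: the constant names are our own, and the cosine schedule below is a plain cosine decay to zero, assuming no warmup since the excerpt does not report one.

```python
import math

# Hyperparameters as reported in the paper's experiment setup
# (names are illustrative, not taken from the authors' repository).
LORA_R = 128
LORA_ALPHA = 256            # LoRA scaling factor alpha / r = 2.0
LORA_LR = 2e-05
PROJECTOR_LR = 2e-05        # linear projector learning rate
NUMERIC_HEAD_LR = 2e-04
EFFECTIVE_BATCH_SIZE = 32   # with bfloat16 mixed precision
TRAIN_STEPS_CLEVR = 40_000  # CLEVR / parameter-space experiments


def cosine_lr(step: int, base_lr: float, total_steps: int) -> float:
    """Cosine learning-rate schedule decaying from base_lr to 0.

    Warmup is omitted because the excerpt does not specify one.
    """
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With these values the schedule starts at 2e-05, passes through roughly 1e-05 at the midpoint (20k steps), and decays to 0 at 40k steps; in the Hugging Face Transformers stack this would typically be selected via `lr_scheduler_type="cosine"` rather than implemented by hand.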