Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Field-DiT: Diffusion Transformer on Unified Video, 3D, and Game Field Generation
Authors: Kangfu Mei, Mo Zhou, Vishal Patel
ICLR 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results across various modalities demonstrate the effectiveness of our model, with its 675M parameter size, and highlight its potential as a foundational framework for scalable, architecture-unified visual content generation for different modalities with different weights. Our project page can be found at https://kfmei.com/Field-DiT/. We empirically validate its superiority against previous domain-agnostic methods across three different tasks, including text-to-video generation, 3D novel-view generation, and game generation. Various experiments show that our method achieves compelling performance even when compared to the state-of-the-art domain-specific methods, underlining its potential as a scalable and architecture-unified visual content generation model across various modalities. |
| Researcher Affiliation | Academia | Kangfu Mei Johns Hopkins University EMAIL Mo Zhou Johns Hopkins University EMAIL Vishal M. Patel Johns Hopkins University EMAIL |
| Pseudocode | No | The paper describes the methodology in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper mentions a project page: "Our project page can be found at https://kfmei.com/Field-DiT/". While project pages can sometimes host code, this statement does not explicitly confirm that the source code for the methodology described in the paper is available there, nor does it provide a direct link to a code repository. The paper also states, "we release the benchmarks including both training and testing data for replication and comparisons," which refers to data, not code. |
| Open Datasets | Yes | For image generation, we use the standard benchmark dataset, i.e., CIFAR10 64×64 (Krizhevsky et al., 2009)... we conduct experiments on the recent text-to-video benchmark: CelebV-Text 256×256×128 (Yu et al., 2023b)... We also evaluate our method on 3D novel view generation with the ShapeNet dataset (Chang et al., 2015)... Game generation is an under-explored area and lacks data and benchmarks. We demonstrate the game generation capability of our method by showing the accuracy of predicted frames compared with the frame of the real game when using the same action. Specifically, we model the World 1-1 of Super Mario Bros (NES version) with a sliding window size of 16... we release the benchmarks including both training and testing data for replication and comparisons. |
| Dataset Splits | No | The paper mentions using specific datasets (CIFAR10, CelebV-Text, ShapeNet, Super Mario Bros World 1-1) and refers to "test data" (e.g., "We randomly select 2,048 videos from the test data"). It also mentions using "the last 16 frames as the context length for game generation, and the last 8 frames as the context length for text-to-video generation." However, it does not explicitly provide specific details about how the overall datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or references to predefined splits for reproduction). |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions general aspects like "memory bottleneck in existing GPU-accelerated computing systems" but not the specific hardware used for their own experiments. |
| Software Dependencies | No | The paper mentions using specific models like "T5XXL (Raffel et al., 2020)" for text embeddings and "openai/clip-vit-large-patch14 model" for CLIPSIM calculation. However, it does not provide version numbers for any key software libraries, frameworks, or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be needed to replicate the experimental environment. |
| Experiment Setup | Yes | In the interest of maintaining simplicity, we adhere to the methodology outlined by Dhariwal et al. (Dhariwal & Nichol, 2021) and utilize a 256-dimensional frequency embedding to encapsulate input denoising timesteps. This embedding is then refined through a two-layer Multilayer Perceptron (MLP) with Swish (SiLU) activation functions. Our model aligns with the size configuration of DiT-XL (Peebles & Xie, 2023), which includes retaining the number of transformer blocks (i.e., 28), the hidden dimension size of each transformer block (i.e., 1152), and the number of attention heads (i.e., 16). Our model derives text embeddings employing T5XXL (Raffel et al., 2020), culminating in a fixed-length token sequence (i.e., 256) which matches the length of the noisy tokens. To further process each text embedding token, our model compresses them via a single-layer MLP, which has a hidden dimension size identical to that of the transformer block. Our model uses classifier-free guidance in the backward process with a fixed scale of 8.5. To keep consistency with DiT-XL (Peebles & Xie, 2023), we only applied guidance to the first three channels of each denoised token. Empirically, we use the last 16 frames as the context length for game generation, and the last 8 frames as the context length for text-to-video generation. |
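The setup quoted in the Experiment Setup row can be sketched in code. The following is an illustrative NumPy reconstruction, not the authors' released implementation: the function names, weight shapes, and the channels-last token layout for the guidance slice are assumptions made here for clarity.

```python
import numpy as np

def frequency_embedding(t, dim=256, max_period=10000):
    """Sinusoidal frequency embedding of denoising timesteps, following
    Dhariwal & Nichol (2021). t: int array of shape (B,); returns (B, dim)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = t[:, None].astype(np.float64) * freqs[None, :]
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

def silu(x):
    # Swish / SiLU activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

def timestep_mlp(emb, w1, b1, w2, b2):
    """Two-layer MLP with SiLU that refines the frequency embedding
    (hidden size 1152 would match the quoted DiT-XL configuration)."""
    return silu(emb @ w1 + b1) @ w2 + b2

def guided_eps(eps_cond, eps_uncond, scale=8.5, n_guided=3):
    """Classifier-free guidance applied only to the first n_guided channels
    of each denoised token, as the paper describes (channels-last assumed)."""
    out = eps_cond.copy()
    out[..., :n_guided] = eps_uncond[..., :n_guided] + scale * (
        eps_cond[..., :n_guided] - eps_uncond[..., :n_guided]
    )
    return out
```

Restricting guidance to the first three channels mirrors the quoted DiT-XL convention; the remaining channels pass through the conditional prediction unchanged.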