Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Field-DiT: Diffusion Transformer on Unified Video, 3D, and Game Field Generation

Authors: Kangfu Mei, Mo Zhou, Vishal Patel

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental Experimental results across various modalities demonstrate the effectiveness of our model, with its 675M parameter size, and highlight its potential as a foundational framework for scalable, architecture-unified visual content generation for different modalities with different weights. Our project page can be found at https://kfmei.com/Field-DiT/. We empirically validate its superiority against previous domain-agnostic methods across three different tasks, including text-to-video generation, 3D novel-view generation, and game generation. Various experiments show that our method achieves compelling performance even when compared to the state-of-the-art domain-specific methods, underlining its potential as a scalable and architecture-unified visual content generation model across various modalities.
Researcher Affiliation Academia Kangfu Mei Johns Hopkins University EMAIL Mo Zhou Johns Hopkins University EMAIL Vishal M. Patel Johns Hopkins University EMAIL
Pseudocode No The paper describes the methodology in prose and figures, but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps.
Open Source Code No The paper mentions a project page: "Our project page can be found at https://kfmei.com/Field-DiT/". While project pages can sometimes host code, this statement does not explicitly confirm that the source code for the methodology described in the paper is available there, nor does it provide a direct link to a code repository. The paper also states, "we release the benchmarks including both training and testing data for replication and comparisons," which refers to data, not code.
Open Datasets Yes For image generation, we use the standard benchmark dataset, i.e., CIFAR10 64×64 (Krizhevsky et al., 2009)... we conduct experiments on the recent text-to-video benchmark: CelebV-Text 256×256×128 (Yu et al., 2023b)... We also evaluate our method on 3D novel view generation with the ShapeNet dataset (Chang et al., 2015)... Game generation is an under-explored area and lacks data and benchmarks. We demonstrate the game generation capability of our method by showing the accuracy of predicted frames compared with the frame of the real game when using the same action. Specially, we model the World 1-1 of Super Mario Bros (NES version) with a sliding window size of 16... we release the benchmarks including both training and testing data for replication and comparisons.
Dataset Splits No The paper mentions using specific datasets (CIFAR10, CelebV-Text, ShapeNet, Super Mario Bros World 1-1) and refers to "test data" (e.g., "We randomly select 2,048 videos from the test data"). It also mentions using "the last 16 frames as the context length for game generation, and the last 8 frames as the context length for text-to-video generation." However, it does not explicitly provide specific details about how the overall datasets were split into training, validation, and test sets (e.g., percentages, sample counts, or references to predefined splits for reproduction).
Hardware Specification No The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions general aspects like "memory bottleneck in existing GPU-accelerated computing systems" but not the specific hardware used for their own experiments.
Software Dependencies No The paper mentions using specific models like "T5XXL (Raffel et al., 2020)" for text embeddings and "openai/clip-vit-large-patch14 model" for CLIPSIM calculation. However, it does not provide version numbers for any key software libraries, frameworks, or programming languages (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be needed to replicate the experimental environment.
Experiment Setup Yes In the interest of maintaining simplicity, we adhere to the methodology outlined by Dhariwal et al. (Dhariwal & Nichol, 2021) and utilize a 256-dimensional frequency embedding to encapsulate input denoising timesteps. This embedding is then refined through a two-layer Multilayer Perceptron (MLP) with Swish (SiLU) activation functions. Our model aligns with the size configuration of DiT-XL (Peebles & Xie, 2023), which includes retaining the number of transformer blocks (i.e., 28), the hidden dimension size of each transformer block (i.e., 1152), and the number of attention heads (i.e., 16). Our model derives text embeddings employing T5XXL (Raffel et al., 2020), culminating in a fixed-length token sequence (i.e., 256) which matches the length of the noisy tokens. To further process each text embedding token, our model compresses them via a single-layer MLP, which has a hidden dimension size identical to that of the transformer block. Our model uses classifier-free guidance in the backward process with a fixed scale of 8.5. To keep consistency with DiT-XL (Peebles & Xie, 2023), we only applied guidance to the first three channels of each denoised token. Empirically, we use the last 16 frames as the context length for game generation, and the last 8 frames as the context length for text-to-video generation.
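The setup quoted above can be sketched in code. The following is a minimal NumPy illustration, not the authors' implementation: it assumes the standard sinusoidal timestep embedding from Dhariwal & Nichol (2021), a generic two-layer SiLU MLP projecting to the DiT-XL hidden size (1152), and a classifier-free-guidance helper restricted to the first three channels with scale 8.5, as the excerpt describes. The names `timestep_embedding`, `TimestepMLP`, and `cfg` are illustrative, and the weight initialization is arbitrary.

```python
import numpy as np

def timestep_embedding(t, dim=256, max_period=10000):
    """Sinusoidal frequency embedding of denoising timesteps
    (half cosine, half sine), as in Dhariwal & Nichol (2021)."""
    half = dim // 2
    freqs = np.exp(-np.log(max_period) * np.arange(half) / half)
    args = np.outer(np.asarray(t, dtype=np.float64), freqs)
    return np.concatenate([np.cos(args), np.sin(args)], axis=-1)

def silu(x):
    """Swish / SiLU activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

class TimestepMLP:
    """Two-layer MLP with SiLU refining the 256-dim embedding
    to the transformer hidden size (1152 for DiT-XL)."""
    def __init__(self, in_dim=256, hidden_dim=1152, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((in_dim, hidden_dim)) * 0.02
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.standard_normal((hidden_dim, hidden_dim)) * 0.02
        self.b2 = np.zeros(hidden_dim)

    def __call__(self, emb):
        return silu(emb @ self.w1 + self.b1) @ self.w2 + self.b2

def cfg(cond, uncond, scale=8.5):
    """Classifier-free guidance applied only to the first three
    channels of each denoised token, per the quoted setup; the
    remaining channels are passed through unguided."""
    guided = uncond[..., :3] + scale * (cond[..., :3] - uncond[..., :3])
    return np.concatenate([guided, cond[..., 3:]], axis=-1)

emb = timestep_embedding([0, 500, 999])  # shape (3, 256)
out = TimestepMLP()(emb)                 # shape (3, 1152)
```

This only mirrors the hyperparameters stated in the excerpt (256-dim embedding, 1152 hidden size, guidance scale 8.5); the actual model code, which the report notes is not released, may differ.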