VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Authors: Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct qualitative and quantitative experiments to demonstrate the effectiveness of VideoJAM. We benchmark our models against their base (pre-trained) versions, as well as leading proprietary and open-source video models, to highlight VideoJAM's enhanced motion coherence. |
| Researcher Affiliation | Collaboration | GenAI, Meta; Tel Aviv University. |
| Pseudocode | No | The paper describes the VideoJAM framework and its components using figures and mathematical equations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a URL (https://hila-chefer.github.io/videojam-paper.github.io/) that appears to be the project page for the paper, but it does not state that the source code for the described methodology is released, nor does it link to a code repository. |
| Open Datasets | Yes | Benchmarks We use two benchmarks for evaluation. First, we introduce VideoJAM-bench, constructed specifically to test motion coherence. Second, we consider the Movie Gen (MGen) benchmark (Polyak et al., 2024) to show the robustness of our results. |
| Dataset Splits | Yes | We then fine-tune the models with VideoJAM using 3 million random samples from the model's original training set, which constitute less than 3% of the training videos. ... To construct VideoJAM-bench, we consider prompts from four categories of natural motion that challenge video generators (see Fig. 2): basic motion, complex motion, rotational motion, and physics. We use a holdout set from our training data on which no model was trained and employ an LLM to select the top 128 prompts that best fit at least one of the four categories and describe a single, specific, and clear motion. ... In all our comparisons, each model runs once with the same random seed for all the benchmark prompts. |
| Hardware Specification | Yes | VideoJAM-4B was fine-tuned using 32 A100 GPUs with a batch size of 32 for 50,000 iterations on a spatial resolution of 256×256. ... VideoJAM-30B was fine-tuned using 256 A100 GPUs with a batch size of 256 for 35,000 iterations on a spatial resolution of 256×256. |
| Software Dependencies | No | During this fine-tuning, we employ RAFT (Teed & Deng, 2020) to obtain optical flow. ... The text prompt conditioning is processed by three different text encoders: UL2 (Tay et al., 2022), ByT5 (Xue et al., 2022), and MetaCLIP (Xu et al., 2023). The paper mentions software tools and models but does not provide specific version numbers for these or other key software components such as Python or PyTorch. |
| Experiment Setup | Yes | VideoJAM-4B was fine-tuned using 32 A100 GPUs with a batch size of 32 for 50,000 iterations on a spatial resolution of 256×256. ... VideoJAM-30B was fine-tuned using 256 A100 GPUs with a batch size of 256 for 35,000 iterations on a spatial resolution of 256×256. ... Both models were trained with a fixed learning rate of 5e-6, using the Flow Matching paradigm (Lipman et al., 2023). ... During inference, we perform 100 denoising steps with a linear quadratic t-schedule using a text guidance scale of w1 = 5 and a motion guidance scale of w2 = 3 (see Eq. 8), other than the ablations that test these components. Additionally, we only employ the motion guidance for the first half of the generation steps (50 steps) ... The models are trained to generate 128 frame videos at 24 frames per second. |
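The Experiment Setup row describes a two-scale guidance scheme at inference: a text guidance scale w1 = 5, a motion guidance scale w2 = 3, and motion guidance applied only during the first 50 of 100 denoising steps. The sketch below illustrates that schedule; the additive two-term classifier-free-guidance combination, the function name, and the argument names are assumptions for illustration only — the paper's exact combination is its Eq. 8.

```python
import numpy as np

def guided_velocity(v_uncond, v_text, v_motion, step, total_steps=100,
                    w_text=5.0, w_motion=3.0):
    """Combine unconditional, text-conditioned, and motion-conditioned
    model predictions with two guidance scales.

    NOTE: this additive two-term form is an illustrative assumption,
    not necessarily the paper's Eq. 8.
    """
    # Standard classifier-free guidance toward the text condition.
    v = v_uncond + w_text * (v_text - v_uncond)
    # Per the paper, motion guidance is applied only during the first
    # half of the denoising steps (steps 0-49 when total_steps=100).
    if step < total_steps // 2:
        v = v + w_motion * (v_motion - v_uncond)
    return v

# Toy predictions: motion guidance is active early, dropped late.
v_u, v_t, v_m = np.zeros(3), np.ones(3), np.full(3, 2.0)
early = guided_velocity(v_u, v_t, v_m, step=10)   # both terms active
late = guided_velocity(v_u, v_t, v_m, step=60)    # text guidance only
```

The early/late split mirrors the paper's statement that motion guidance is helpful mainly while the coarse structure of the video is being formed, after which text guidance alone drives refinement.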