VideoJAM: Joint Appearance-Motion Representations for Enhanced Motion Generation in Video Models
Authors: Hila Chefer, Uriel Singer, Amit Zohar, Yuval Kirstain, Adam Polyak, Yaniv Taigman, Lior Wolf, Shelly Sheynin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct qualitative and quantitative experiments to demonstrate the effectiveness of VideoJAM. We benchmark our models against their base (pre-trained) versions, as well as leading proprietary and open-source video models, to highlight VideoJAM's enhanced motion coherence. |
| Researcher Affiliation | Collaboration | GenAI, Meta; Tel Aviv University. |
| Pseudocode | No | The paper describes the VideoJAM framework and its components using figures and mathematical equations but does not present any structured pseudocode or algorithm blocks. |
| Open Source Code | No | The paper provides a URL (https://hila-chefer.github.io/videojam-paper.github.io/) that appears to be the project page for the paper, but it does not state that the source code for the described methodology is released, nor does it link to a code repository. |
| Open Datasets | Yes | Benchmarks We use two benchmarks for evaluation. First, we introduce VideoJAM-bench, constructed specifically to test motion coherence. Second, we consider the Movie Gen (MGen) benchmark (Polyak et al., 2024) to show the robustness of our results. |
| Dataset Splits | Yes | We then fine-tune the models with VideoJAM using 3 million random samples from the model's original training set, which constitute less than 3% of the training videos. ... To construct VideoJAM-bench, we consider prompts from four categories of natural motion that challenge video generators (see Fig. 2): basic motion, complex motion, rotational motion, and physics. We use a holdout set from our training data on which no model was trained and employ an LLM to select the top 128 prompts that best fit at least one of the four categories and describe a single, specific, and clear motion. ... In all our comparisons, each model runs once with the same random seed for all the benchmark prompts. |
| Hardware Specification | Yes | VideoJAM-4B was fine-tuned using 32 A100 GPUs with a batch size of 32 for 50,000 iterations on a spatial resolution of 256×256. ... VideoJAM-30B was fine-tuned using 256 A100 GPUs with a batch size of 256 for 35,000 iterations on a spatial resolution of 256×256. |
| Software Dependencies | No | During this fine-tuning, we employ RAFT (Teed & Deng, 2020) to obtain optical flow. ... The text prompt conditioning is processed by three different text encoders: UL2 (Tay et al., 2022), ByT5 (Xue et al., 2022), and MetaCLIP (Xu et al., 2023). The paper mentions software tools and models but does not provide specific version numbers for these or other key software components such as Python or PyTorch. |
| Experiment Setup | Yes | VideoJAM-4B was fine-tuned using 32 A100 GPUs with a batch size of 32 for 50,000 iterations on a spatial resolution of 256×256. ... VideoJAM-30B was fine-tuned using 256 A100 GPUs with a batch size of 256 for 35,000 iterations on a spatial resolution of 256×256. ... Both models were trained with a fixed learning rate of 5e-6, using the Flow Matching paradigm (Lipman et al., 2023). ... During inference, we perform 100 denoising steps with a linear quadratic t-schedule using a text guidance scale of w1 = 5 and a motion guidance scale of w2 = 3 (see Eq. 8), other than the ablations that test these components. Additionally, we only employ the motion guidance for the first half of the generation steps (50 steps) ... The models are trained to generate 128 frame videos at 24 frames per second. |
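The Experiment Setup row describes a two-scale guidance scheme at inference: a text guidance scale w1 = 5, a motion guidance scale w2 = 3, and motion guidance applied only during the first 50 of 100 denoising steps. The sketch below illustrates that schedule; the additive two-term classifier-free-guidance combination, the function name, and the argument names are assumptions for illustration only — the paper's exact combination is its Eq. 8.

```python
import numpy as np

def guided_velocity(v_uncond, v_text, v_motion, step, total_steps=100,
                    w_text=5.0, w_motion=3.0):
    """Combine unconditional, text-conditioned, and motion-conditioned
    model predictions with two guidance scales.

    NOTE: this additive two-term form is an illustrative assumption,
    not necessarily the paper's Eq. 8.
    """
    # Standard classifier-free guidance toward the text condition.
    v = v_uncond + w_text * (v_text - v_uncond)
    # Per the paper, motion guidance is applied only during the first
    # half of the denoising steps (steps 0-49 when total_steps=100).
    if step < total_steps // 2:
        v = v + w_motion * (v_motion - v_uncond)
    return v

# Toy predictions: motion guidance is active early, dropped late.
v_u, v_t, v_m = np.zeros(3), np.ones(3), np.full(3, 2.0)
early = guided_velocity(v_u, v_t, v_m, step=10)   # both terms active
late = guided_velocity(v_u, v_t, v_m, step=60)    # text guidance only
```

The early/late split mirrors the paper's statement that motion guidance is helpful mainly while the coarse structure of the video is being formed, after which text guidance alone drives refinement.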