Test-Time Training on Video Streams
Authors: Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2 and 1.5 for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant, which accesses strictly more information by training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on the bias-variance trade-off. Experiments in this paper are also of practical interest, besides conceptual ones. Online TTT significantly improves prediction quality on three real-world video datasets, for four tasks: semantic, instance and panoptic segmentation, and colorization. Figure 2 visualizes results for the first three tasks... Table 1 presents our main results. Table 4 contains ablations on our two forms of memory. |
| Researcher Affiliation | Collaboration | Renhao Wang, Yu Sun, Yossi Gandelsman, Alexei A. Efros are with UC Berkeley. Arnuv Tandon is with Stanford University. Xinlei Chen is with Meta AI. Xiaolong Wang is with UC San Diego. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, for example, Equations (1) and (2) for optimization problems and textual descriptions of the algorithm flow, but it does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Project website with videos, dataset and code: https://test-time-training.github.io/video |
| Open Datasets | Yes | Online TTT significantly improves prediction quality on three real-world video datasets, for four tasks: semantic, instance and panoptic segmentation, and colorization. Figure 2 visualizes results for the first three tasks (since metrics for colorization are less reliable): online TTT beats even the offline oracle. We also collect a new video dataset with dense annotations, COCO Videos. These videos are orders of magnitude longer than in other public datasets... semantic segmentation on KITTI-STEP, a public dataset of urban driving videos; 2) instance and panoptic segmentation on COCO Videos, a new dataset we annotated; 3) colorization on COCO Videos and a collection of black and white films. Please visit our project website at https://test-time-training.github.io/video to watch videos of our results. Joint training is performed on Cityscapes (Cordts et al., 2016), another driving dataset with exactly the same 19 categories as KITTI-STEP, but containing still images instead of videos. |
| Dataset Splits | Yes | KITTI-STEP (Weber et al., 2021) contains 9 validation videos and 12 test videos... Joint training is performed on Cityscapes (Cordts et al., 2016)... Our 3 videos are only used for evaluation. All hyper-parameters, even for COCO Videos, are selected on the KITTI-STEP validation set. |
| Hardware Specification | Yes | Time is in seconds per frame, using a single A100 GPU, averaged over the KITTI-STEP test set. |
| Software Dependencies | No | Our current implementation uses Mask2Former (Cheng et al., 2021), which has achieved state-of-the-art performance on many semantic, instance and panoptic segmentation benchmarks. Our Mask2Former uses a Swin-S (Liu et al., 2021c) backbone; in our case, this is also the shared encoder f. Everything following the backbone in the original architecture is taken as the main task head h, and our decoder g copies the architecture of h except the last layer that maps into pixel space for reconstruction. Joint training starts from their model checkpoint, which has already been trained for the main task. Only g is initialized from scratch. Following He et al. (2021), we split each input into patches, and mask out 80% of them. |
| Experiment Setup | Yes | It turns out that only one iteration is sufficient for our final algorithm... Following He et al. (2021), we split each input into patches, and mask out 80% of them. All hyper-parameters, even for COCO Videos, are selected on the KITTI-STEP validation set. We use exactly the same hyper-parameters as tuned on the KITTI-STEP validation set, for all algorithms considered. Let $k$ denote the window size. At each timestep $t$, our method solves the following optimization problem instead of Equation 1: $f_t, g_t = \arg\min_{f,g} \sum_{t'=t-k+1}^{t} \ell_s(g \circ f(x_{t'}), x_{t'})$ (Equation 2), before predicting $h_0 \circ f_t(x_t)$. Optimization is performed with stochastic gradient descent: at each iteration, we sample a batch with replacement, uniformly from the same window. |
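The experiment-setup row above describes a sliding-window optimization: at each timestep, sample batches with replacement from the last $k$ frames, take a gradient step on a masked-reconstruction loss, then predict on the current frame. The following is a minimal sketch of that loop, not the paper's implementation: a linear map `W` stands in for the Mask2Former encoder/decoder pair $(f, g)$, feature-level masking stands in for patch masking, and the function names and hyper-parameter defaults are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss_grad(W, x, mask):
    # Reconstruct the full frame from its unmasked entries only
    # (a toy stand-in for the MAE-style objective with 80% masking).
    x_masked = x * mask
    err = W @ x_masked - x          # reconstruction error, shape (d,)
    # Gradient of mean squared error w.r.t. W.
    return 2.0 * np.outer(err, x_masked) / x.size

def online_ttt(stream, k=4, iters=1, batch_size=2, lr=0.1, d=8):
    # W plays the role of the model being adapted at test time;
    # the real method updates f and g of a Mask2Former instead.
    W = np.eye(d)
    window, preds = [], []
    for x in stream:
        window.append(x)
        window = window[-k:]        # sliding window of the k most recent frames
        for _ in range(iters):      # the paper finds one iteration suffices
            # Sample a batch with replacement, uniformly from the window.
            batch = [window[i] for i in rng.integers(len(window), size=batch_size)]
            grad = np.zeros_like(W)
            for xb in batch:
                # Keep ~20% of entries, i.e. mask out 80%.
                mask = (rng.random(d) > 0.8).astype(float)
                grad += masked_reconstruction_loss_grad(W, xb, mask)
            W -= lr * grad / batch_size
        preds.append(W @ x)         # predict on the current frame with the adapted model
    return W, preds
```

The window discipline is the point of the sketch: because each SGD batch is drawn only from the $k$ most recent frames, the model tracks the local distribution of the stream, which is the locality property the paper credits for online TTT beating its offline variant.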