Test-Time Training on Video Streams
Authors: Renhao Wang, Yu Sun, Arnuv Tandon, Yossi Gandelsman, Xinlei Chen, Alexei A. Efros, Xiaolong Wang
JMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Online TTT significantly outperforms the fixed-model baseline for four tasks, on three real-world datasets. The improvements are more than 2.2 and 1.5 for instance and panoptic segmentation. Surprisingly, online TTT also outperforms its offline variant, which accesses strictly more information by training on all frames from the entire test video regardless of temporal order. This finding challenges those in prior work using synthetic videos. We formalize a notion of locality as the advantage of online over offline TTT, and analyze its role with ablations and a theory based on the bias-variance trade-off. Experiments in this paper are also of practical interest, besides conceptual ones. Online TTT significantly improves prediction quality on three real-world video datasets, for four tasks: semantic, instance and panoptic segmentation, and colorization. Figure 2 visualizes results for the first three tasks... Table 1 presents our main results. Table 4 contains ablations on our two forms of memory. |
| Researcher Affiliation | Collaboration | Renhao Wang, Yu Sun, Yossi Gandelsman, Alexei A. Efros are with UC Berkeley. Arnuv Tandon is with Stanford University. Xinlei Chen is with Meta AI. Xiaolong Wang is with UC San Diego. |
| Pseudocode | No | The paper describes methods using mathematical equations and textual explanations, for example, Equations (1) and (2) for optimization problems and textual descriptions of the algorithm flow, but it does not contain a dedicated pseudocode or algorithm block. |
| Open Source Code | Yes | Project website with videos, dataset and code: https://test-time-training.github.io/video |
| Open Datasets | Yes | Online TTT significantly improves prediction quality on three real-world video datasets, for four tasks: semantic, instance and panoptic segmentation, and colorization. Figure 2 visualizes results for the first three tasks (since metrics for colorization are less reliable): online TTT beats even the offline oracle. We also collect a new video dataset with dense annotations, COCO Videos. These videos are orders of magnitude longer than in other public datasets... semantic segmentation on KITTI-STEP, a public dataset of urban driving videos; 2) instance and panoptic segmentation on COCO Videos, a new dataset we annotated; 3) colorization on COCO Videos and a collection of black and white films. Please visit our project website at https://test-time-training.github.io/video to watch videos of our results. Joint training is performed on Cityscapes (Cordts et al., 2016), another driving dataset with exactly the same 19 categories as KITTI-STEP, but containing still images instead of videos. |
| Dataset Splits | Yes | KITTI-STEP (Weber et al., 2021) contains 9 validation videos and 12 test videos... Joint training is performed on Cityscapes (Cordts et al., 2016)... Our 3 videos are only used for evaluation. All hyper-parameters, even for COCO Videos, are selected on the KITTI-STEP validation set. |
| Hardware Specification | Yes | Time is in seconds per frame, using a single A100 GPU, averaged over the KITTI-STEP test set. |
| Software Dependencies | No | Our current implementation uses Mask2Former (Cheng et al., 2021), which has achieved state-of-the-art performance on many semantic, instance and panoptic segmentation benchmarks. Our Mask2Former uses a Swin-S (Liu et al., 2021c) backbone; in our case, this is also the shared encoder f. Everything following the backbone in the original architecture is taken as the main task head h, and our decoder g copies the architecture of h except the last layer that maps into pixel space for reconstruction. Joint training starts from their model checkpoint, which has already been trained for the main task. Only g is initialized from scratch. Following He et al. (2021), we split each input into patches, and mask out 80% of them. |
| Experiment Setup | Yes | It turns out that only one iteration is sufficient for our final algorithm... Following He et al. (2021), we split each input into patches, and mask out 80% of them. All hyper-parameters, even for COCO Videos, are selected on the KITTI-STEP validation set. We use exactly the same hyper-parameters as tuned on the KITTI-STEP validation set, for all algorithms considered. Let $k$ denote the window size. At each timestep $t$, our method solves the following optimization problem instead of Equation 1: $f_t, g_t = \arg\min_{f,g} \sum_{t'=t-k+1}^{t} \ell_s(g \circ f(x_{t'}), x_{t'})$ (Equation 2), before predicting $h_0 \circ f_t(x_t)$. Optimization is performed with stochastic gradient descent: at each iteration, we sample a batch with replacement, uniformly from the same window. |
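The experiment-setup row above describes a sliding-window optimization: at each timestep, sample batches with replacement from the last $k$ frames, take a gradient step on a masked-reconstruction loss, then predict on the current frame. The following is a minimal sketch of that loop, not the paper's implementation: a linear map `W` stands in for the Mask2Former encoder/decoder pair $(f, g)$, feature-level masking stands in for patch masking, and the function names and hyper-parameter defaults are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_reconstruction_loss_grad(W, x, mask):
    # Reconstruct the full frame from its unmasked entries only
    # (a toy stand-in for the MAE-style objective with 80% masking).
    x_masked = x * mask
    err = W @ x_masked - x          # reconstruction error, shape (d,)
    # Gradient of mean squared error w.r.t. W.
    return 2.0 * np.outer(err, x_masked) / x.size

def online_ttt(stream, k=4, iters=1, batch_size=2, lr=0.1, d=8):
    # W plays the role of the model being adapted at test time;
    # the real method updates f and g of a Mask2Former instead.
    W = np.eye(d)
    window, preds = [], []
    for x in stream:
        window.append(x)
        window = window[-k:]        # sliding window of the k most recent frames
        for _ in range(iters):      # the paper finds one iteration suffices
            # Sample a batch with replacement, uniformly from the window.
            batch = [window[i] for i in rng.integers(len(window), size=batch_size)]
            grad = np.zeros_like(W)
            for xb in batch:
                # Keep ~20% of entries, i.e. mask out 80%.
                mask = (rng.random(d) > 0.8).astype(float)
                grad += masked_reconstruction_loss_grad(W, xb, mask)
            W -= lr * grad / batch_size
        preds.append(W @ x)         # predict on the current frame with the adapted model
    return W, preds
```

The window discipline is the point of the sketch: because each SGD batch is drawn only from the $k$ most recent frames, the model tracks the local distribution of the stream, which is the locality property the paper credits for online TTT beating its offline variant.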