FlowBench: Benchmarking Optical Flow Estimation Methods for Reliability and Generalization
Authors: Shashank Agnihotri, Julian Yuya Caspary, Luca Schwarz, Xinyan Gao, Jenny Schmalfuss, Andrés Bruhn, Margret Keuper
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | FlowBench facilitates streamlined research into the reliability of optical flow methods by benchmarking their robustness to adversarial attacks and out-of-distribution samples. With FlowBench, we benchmark 57 checkpoints across 3 datasets under 9 diverse adversarial attacks and 23 established common corruptions, making it the most comprehensive robustness analysis of optical flow methods to date. |
| Researcher Affiliation | Academia | Shashank Agnihotri EMAIL Data and Web Science Group, University of Mannheim, Germany Julian Yuya Caspary EMAIL Data and Web Science Group, University of Mannheim, Germany Luca Schwarz EMAIL Data and Web Science Group, University of Mannheim, Germany Xinyan Gao EMAIL Data and Web Science Group, University of Mannheim, Germany Jenny Schmalfuss EMAIL Computer Vision Group, University of Stuttgart, Germany Andrés Bruhn EMAIL Computer Vision Group, University of Stuttgart, Germany Margret Keuper EMAIL Data and Web Science Group, University of Mannheim, Germany Max-Planck-Institute for Informatics, Saarland Informatics Campus, Germany |
| Pseudocode | No | The paper describes algorithms and methods using mathematical equations and textual explanations, but it does not contain a dedicated section or figure explicitly labeled as "Pseudocode" or "Algorithm" with structured, code-like steps. |
| Open Source Code | Yes | The open-source code and weights for FlowBench are available in this GitHub repository. ... FlowBench is completely open-source, allowing the community to generate pull requests to add new methods, attacks, checkpoints, benchmarking results, and metrics, and thus pursue these directions of work as well. ... The proposed FlowBench benchmarking tool is available as a library in the following codebase: https://github.com/shashankskagnihotri/FlowBench. |
| Open Datasets | Yes | FlowBench supports 37 unique architectures, for example, RAFT, FlowFormer, FlowFormer++, CCMR, and others (new architectures added to ptlflow over time are compatible with FlowBench), and four distinct datasets, namely the FlyingThings3D (Mayer et al., 2016), KITTI2015 (Menze & Geiger, 2015), MPI Sintel (Butler et al., 2012) (clean and final), and Spring (Mehl et al., 2023) datasets. |
| Dataset Splits | Yes | KITTI2015: Proposed by Menze & Geiger (2015), this dataset is focused on the real-world driving scenario. It contains a total of 400 pairs of image frames, split equally for training and testing. ... MPI Sintel: Proposed by Butler et al. (2012) and Wulff et al. (2012), this dataset ... consists of a total of 1064 synthetic frames for training and 564 synthetic frames for testing |
| Hardware Specification | Yes | Most experiments were done on a single 40 GB NVIDIA Tesla V100 GPU each; however, MS-RAFT+, FlowFormer, and FlowFormer++ are more compute-intensive, and thus 80 GB NVIDIA A100 or NVIDIA H100 GPUs were used for these models, with a single GPU for each experiment. |
| Software Dependencies | No | The paper mentions that FlowBench is built using ptlflow (Morimitsu, 2021) and refers to PyTorch (Paszke et al., 2019) for calculation approximations. However, specific version numbers for these software libraries used in the experiments are not provided (e.g., "PyTorch 1.9" or "ptlflow 1.2"). The years refer to publications, not explicit software versions. |
| Experiment Setup | Yes | For calculating TARE and NARE values we used the BIM, PGD, and CosPGD attacks with step size α = 0.01 and perturbation budget ϵ = 8/255 under the ℓ∞-norm bound... We use 20 attack iterations for calculating TARE and NARE... For finetuning, each mini-batch is adversarially attacked and used to finetune the model. We report results from a subset of methods adversarially finetuned (10k iterations with a starting learning rate = 10^-6 and the same learning rate scheduler as used by the method during training) on the KITTI2015 training dataset using an ℓ∞-norm constrained 3-iteration PGD attack with ϵ ∈ {4/255, 8/255} and α = 0.01... For PCFA: perturbation budget ϵ = 0.05 and step size α = 1e-7... Adversarial Weather: Snow (random snowflakes), number of particles: 3000, number of optimization steps: 750 |
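
The ℓ∞-bounded PGD configuration quoted in the Experiment Setup row (step size α = 0.01, budget ϵ = 8/255, 20 iterations) can be sketched as below. This is a generic illustration of the attack's update-and-project loop, not the paper's implementation; `grad_fn` is a hypothetical stand-in for the gradient of the flow-error loss with respect to the input frames.

```python
import numpy as np

def pgd_linf(x, grad_fn, alpha=0.01, eps=8 / 255, iters=20):
    """Generic ℓ∞-bounded PGD sketch: take signed-gradient ascent steps
    on the loss and project the perturbation back into the ε-ball.

    x        : clean input in [0, 1] (e.g. a flattened image pair)
    grad_fn  : returns the gradient of the attacked loss at a point
               (hypothetical placeholder for autograd on the flow loss)
    """
    delta = np.zeros_like(x)
    for _ in range(iters):
        g = grad_fn(x + delta)                    # loss gradient at the current adversarial point
        delta = delta + alpha * np.sign(g)        # signed gradient ascent step of size α
        delta = np.clip(delta, -eps, eps)         # project onto the ℓ∞ ε-ball around x
        delta = np.clip(x + delta, 0.0, 1.0) - x  # keep the perturbed input in valid range
    return x + delta
```

With a toy quadratic loss ½‖x − t‖² (gradient `x − t`), the attack pushes the input away from the target `t` while the perturbation stays within ϵ = 8/255 per pixel.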