Rethinking Visual Counterfactual Explanations Through Region Constraint
Authors: Bartlomiej Sobieski, Jakub Grzywaczewski, Bartłomiej Sadlej, Matthew Tivnan, Przemyslaw Biecek
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Through large-scale experiments, we demonstrate that, besides a fully automated way of synthesizing meaningful and highly interpretable RVCEs, our approach, Region-constrained Counterfactual Schrödinger Bridge (RCSB), allows to infer causally about the model's change in prediction and enables the user to actively interact with the explanatory process by manually defining the region of interest. (...) 4 EXPERIMENTS — Zebra → Sorrel task (FID / sFID / S3 / COUT / FR): ACE ℓ1: 84.5 / 122.7 / 0.92 / 0.45 / 47.0; ACE ℓ2: 67.7 / 98.4 / 0.90 / 0.25 / 81.0; LDCE-cls: 84.2 / 107.2 / 0.78 / 0.06 / 88.0; LDCE-txt: 82.4 / 107.2 / 0.71 / 0.21 / 81.0; DVCE: 33.1 / 43.9 / 0.62 / 0.21 / 57.8; RCSB-C: 13.0 / 20.4 / 0.82 / 0.70 / 99.7; RCSB-B: 9.51 / 17.4 / 0.86 / 0.72 / 97.4; RCSB-A: 8.0 / 16.2 / 0.88 / 0.74 / 94.7 |
| Researcher Affiliation | Academia | Bartlomiej Sobieski (University of Warsaw); Jakub Grzywaczewski (Warsaw University of Technology); Bartlomiej Sadlej (University of Warsaw); Matthew Tivnan (Harvard Medical School); Przemyslaw Biecek (University of Warsaw, Warsaw University of Technology) |
| Pseudocode | Yes | For the pseudocode of the entire procedure, see Appendix. We include our implementation at https://github.com/sobieskibj/rcsb. (...) Appendix A, Pseudocode. Algorithm 1 (Standard I2SB Generation): input x_N ∼ p₁(x_N) and trained sψ(·,·); for n = N down to 1: predict x̂₀(x_n) using sψ(x_n, t_n), then sample x_{n−1} ∼ p(x_{n−1} \| x̂₀, x_n) according to DDPM; return x₀. Algorithm 2 (OT-ODE I2SB Generation): same loop, but with the deterministic update x_{n−1} = μ_{n−1} x̂₀ + μ̄_{n−1} x_n. Algorithm 3 (RCSB): input number of steps N, binary region mask R, trajectory truncation τ, classifier scale s, input image x*, trained sψ(·,·), trained classifier f(y \| ·), target class y; set x₁ = (1−R) ⊙ x* + R ⊙ z, where z ∼ N(z; 0, I); discretize the truncated timeline 0 = t₀ < t₁ < … < t_N = τ; sample x_N ∼ q(x_N \| x₀, x₁) from the analytic posterior (Eq. 15); for n = N down to 1: predict x̂₀(x_n) using sψ(x_n, t_n); g_n = ∇_{x_n} log f(y \| x̂₀); g_n = ADAM(g_n); if n = N, register the norm of the first gradient, g = ‖g_N‖₂; x_n ← x_n + s · g_n / g; x_{n−1} = μ_{n−1} x̂₀ + μ̄_{n−1} x_n; return x₀. Algorithm 4 (ADAM Update Rule): input gradient g_n at step n, hyperparameters α, ϵ, β₁, β₂ (set to PyTorch (Paszke et al., 2019) defaults); m_n = β₁ m_{n−1} + (1−β₁) g_n (biased first-moment estimate); v_n = β₂ v_{n−1} + (1−β₂) g_n² (biased second-moment estimate); m̂_n = m_n / (1−β₁ⁿ), v̂_n = v_n / (1−β₂ⁿ) (bias correction); return the updated gradient g_n = α m̂_n / (√v̂_n + ϵ). |
| Open Source Code | Yes | We include our implementation at https://github.com/sobieskibj/rcsb. |
| Open Datasets | Yes | Specifically, we set a new quantitative state-of-the-art (SOTA) on ImageNet (Deng et al., 2009) with up to 4 times better scores in FID and 3 times better sFID (realism)... We extend the evaluation of RCSB with three additional datasets: CelebA-HQ (Karras et al., 2018) with 30,000 samples of 256×256 resolution face images, CelebA (Liu et al., 2015) with around 200,000 samples of 128×128 resolution face images, and MNIST (Deng, 2012) with 70,000 samples of 32×32 resolution images of handwritten digits. |
| Dataset Splits | Yes | Following previous works for VCEs on ImageNet, we base the quantitative evaluation on 3 challenging main VCE generation tasks: Zebra ↔ Sorrel, Cheetah ↔ Cougar, Egyptian Cat ↔ Persian Cat, where each task requires creating VCEs for images from both classes and flipping the decision to their counterparts. (...) For ResNet50, this results in around 2000 images per task. (...) For MNIST, we train LeNet (Lecun et al., 1998) from scratch using the default training and validation splits. |
| Hardware Specification | Yes | The computational resources were provided by the Laboratory of Bioinformatics and Computational Genomics and the High Performance Computing Center of the Faculty of Mathematics and Information Science, Warsaw University of Technology. (...) Each inpainting algorithm is given a time budget of 24 A100 GPU-hours |
| Software Dependencies | Yes | Algorithm 4 (ADAM Update Rule), input: gradient g_n at step n, hyperparameters α, ϵ, β₁, β₂ (set to PyTorch (Paszke et al., 2019) defaults) |
| Experiment Setup | Yes | The best results are obtained with A(a = 0.1, c = 4, s = 3, τ = 0.6), but the superiority is clear for various configurations, including B(a = 0.2, c = 4, s = 1.5, τ = 0.6), C(a = 0.3, c = 4, s = 1.5, τ = 0.6). (...) By default, we use NFE=100, which we explored the most, but lower NFE regimes provided promising initial results. |
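The ADAM update rule quoted in the pseudocode cell (Algorithm 4) can be sketched in a few lines of NumPy. This is our own minimal illustration, not the authors' code: the function name `adam_update` and the `(m, v, n)` state tuple are hypothetical, while the hyperparameter defaults match the PyTorch Adam defaults the paper cites.

```python
import numpy as np

def adam_update(g, state, alpha=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One step of the Adam-style gradient rescaling (Algorithm 4).

    `state` is the (first moment, second moment, step count) carried
    across reverse-diffusion steps; defaults follow PyTorch's Adam.
    """
    m, v, n = state
    n += 1
    m = beta1 * m + (1.0 - beta1) * g          # biased first-moment estimate
    v = beta2 * v + (1.0 - beta2) * g ** 2     # biased second-moment estimate
    m_hat = m / (1.0 - beta1 ** n)             # bias-corrected first moment
    v_hat = v / (1.0 - beta2 ** n)             # bias-corrected second moment
    g_new = alpha * m_hat / (np.sqrt(v_hat) + eps)  # rescaled gradient
    return g_new, (m, v, n)
```

On the very first step the bias correction cancels the moment decay, so the returned gradient is approximately `alpha * sign(g)`, which is what makes the subsequent per-step normalization in Algorithm 3 well behaved.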
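Two other steps of the quoted RCSB pseudocode (Algorithm 3) are simple enough to sketch directly: the region-constrained initialization x₁ = (1−R) ⊙ x* + R ⊙ z, which injects Gaussian noise only inside the user-defined region, and the guidance update x_n ← x_n + s · g_n / ‖g_N‖₂, which rescales every guidance gradient by the norm of the first one. The function names below are our own; the diffusion model and classifier that produce x̂₀ and g_n are omitted.

```python
import numpy as np

def region_constrained_init(x, mask, rng):
    """x1 = (1 - R) * x + R * z with z ~ N(0, I).

    `mask` is the binary region R; pixels outside it keep the
    original image values, so edits stay confined to the region.
    """
    z = rng.standard_normal(x.shape)
    return (1.0 - mask) * x + mask * z

def guided_step(x_n, g_n, g_first_norm, s):
    """x_n <- x_n + s * g_n / ||g_N||_2 (Algorithm 3, line 11)."""
    return x_n + s * g_n / g_first_norm
```

Because the mask multiplies the noise rather than the gradient, the generative trajectory itself is what enforces the region constraint, with no post-hoc blending needed.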