Categorical Schrödinger Bridge Matching
Authors: Grigoriy Ksenofontov, Alexander Korotin
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We show the performance of CSBM via a series of experiments with synthetic data and VQ representations of images. ... We evaluate our CSBM algorithm across several setups. First, we analyze the convergence of D-IMF on discrete data (§4.1). Then, we demonstrate how CSBM performs with different reference processes in 2D experiments (§4.2). Next, we test CSBM's ability to translate images using the colored MNIST dataset (§4.3), varying the number of steps N. We then present an experiment on the CelebA dataset (§4.4), showcasing CSBM's performance in a latent space. Finally, we explore the text domain by solving sentiment transfer on the Amazon Reviews dataset (Appendix C.4). |
| Researcher Affiliation | Academia | 1Skoltech, Moscow, Russia 2MIPT, Moscow, Russia 3AIRI, Moscow, Russia. Correspondence to: Grigoriy Ksenofontov <EMAIL>, Alexander Korotin <EMAIL>. |
| Pseudocode | Yes | Algorithm 1 Categorical SB matching (CSBM) |
| Open Source Code | Yes | The code of CSBM is available at this repository. |
| Open Datasets | Yes | We test CSBM's ability to translate images using the colored MNIST dataset (§4.3)...Here, we present an unpaired image-to-image translation experiment on the CelebA dataset (§4.4)...This section examines the text domain, focusing on style transfer in the Amazon Reviews corpus (Ni et al., 2019). |
| Dataset Splits | Yes | For the CelebA experiment (§4.4)...We train the model on 162 770 pre-quantized images of celebrities. For evaluation, we compute FID and CMMD using 11 816 hold-out images...For the Amazon experiment (Appendix C.4)...The model is trained on 104 000 pre-tokenized reviews and evaluated on 2 000 reviews from the held-out test set. |
| Hardware Specification | Yes | Training the 2D experiment requires several hours on a single A100 GPU. The colored MNIST experiment takes approximately two days to train using two A100 GPUs. The most computationally demanding tasks, the CelebA and Amazon Reviews experiments, require around five days of training on four A100 GPUs. |
| Software Dependencies | No | The paper mentions several software tools and models, such as the 'AdamW optimizer', 'Hugging Face pipeline', 'GPT-2 Large', 'unigram SentencePiece model', and 'DiT model'. It also refers to official repositories like D3PM, VQ-GAN, VQ-Diffusion, and mdlm for implementation references. However, it does not explicitly provide version numbers for programming languages (e.g., Python), libraries (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | Table 5. Hyperparameters for experiments. Lr denotes the learning rate, and m represents millions. Params indicates the number of model parameters, where for the CelebA dataset, the first value corresponds to the model and the second to the VQ-GAN. |