Progressive Tempering Sampler with Diffusion

Authors: Severi Rissanen, Ruikang Ouyang, Jiajun He, Wenlin Chen, Markus Heinonen, Arno Solin, José Miguel Hernández-Lobato

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental In this section, we evaluate our proposed approach and compare with other baselines. In Sec. 5.1, we first test our temperature guidance method on the Lennard-Jones potential with 55 particles (LJ-55), as introduced in (Köhler et al., 2020; Klein et al., 2024). As we will show, this guidance enables effective extrapolation. In Sec. 5.2, we compare PTSD on two distinct multi-modal distributions, Mixture of 40 Gaussians (MoG-40) and Many-Well-32 (MW-32), with other neural samplers, including FAB (Midgley et al., 2023), iDEM (Akhound-Sadegh et al., 2024), BNEM (Ouyang et al., 2024), DiKL (He et al., 2024), DDS (Vargas et al., 2023), and CMCD (Vargas et al., 2024). We also evaluate the performance of a diffusion model trained directly on PT-generated data (PT+DM). Table 1. Comparing PTSD with other neural sampler baselines. We measure (best, second best) the TVD and MMD between energy histograms, and the W2 distance between data samples. Table 2. Number of target density calls for different approaches to achieve the performance reported in Table 1.
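The metrics quoted above (TVD between energy histograms, W2 between samples) are straightforward to compute. A minimal NumPy sketch, under assumptions: the paper computes W2 on full multi-dimensional samples (via the POT package), while for brevity this shows only the 1D matched-order-statistics case, and the function names are our own:

```python
import numpy as np

def tvd_energy_hist(e_model, e_ref, bins=100):
    """Total variation distance between normalized energy histograms."""
    lo = min(e_model.min(), e_ref.min())
    hi = max(e_model.max(), e_ref.max())
    p, _ = np.histogram(e_model, bins=bins, range=(lo, hi))
    q, _ = np.histogram(e_ref, bins=bins, range=(lo, hi))
    p = p / p.sum()
    q = q / q.sum()
    return 0.5 * np.abs(p - q).sum()

def w2_1d(x, y):
    """1D 2-Wasserstein distance between equal-size sample sets,
    computed by matching order statistics (exact in 1D)."""
    xs, ys = np.sort(x), np.sort(y)
    return np.sqrt(np.mean((xs - ys) ** 2))
```

Identical sample sets give 0 for both metrics; translating one set by a constant c gives a 1D W2 of exactly c.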
Researcher Affiliation Academia 1Department of Computer Science, Aalto University, Finland; 2Department of Engineering, University of Cambridge, United Kingdom; 3Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany. Correspondence to: Severi Rissanen <EMAIL>, Ruikang Ouyang <EMAIL>, Jiajun He <EMAIL>.
Pseudocode Yes Algorithm 1 Training for PTSD
Input: Target density p; temperatures {T_k}_{k=1}^K; empty buffers B = {B_k}_{k=1}^K; initial parallel tempering (PT) steps L; refinement PT steps l; truncation quantile τ; training iterations M; buffer size B.
Output: Model θ_1.
# Initialize at the two highest temperatures:
Initialize buffers B_{K-1}, B_K with L steps of PT; train models θ_{K-1}, θ_K for M iterations.
# Progressively decrease the temperature:
for k from K down to 3 do
  # Sample with temperature guidance:
  Draw B samples {x_n}_{n=1}^B for T_{k-2} by the PF ODE with temperature guidance, using models θ_{k-1}, θ_k.
  # Calculate truncated IS weights:
  Calculate the IS weights {w_n}_{n=1}^B by Eq. (12);
  w_max ← τ-quantile of {w_n}_{n=1}^B;
  for n = 1, ..., B, set w_n ← min(w_n, w_max);
  renormalize {w_n}_{n=1}^B.
  # Importance resample:
  for i from 1 to B do
    n ~ Categorical({w_n}_{n=1}^B);
    append x_n to B_{k-2};
  end for
  # Local PT refinement:
  Refine samples by l-step PT in B_{k-2} and B_{k-1}.
  # Fine-tune models:
  Initialize θ_{k-2} ← θ_{k-1};
  train θ_{k-2} on B_{k-2} for M iterations;
  train θ_{k-1} on B_{k-1} for M iterations.
end for
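The truncated-importance-resampling step of Algorithm 1 can be sketched in a few lines. A minimal NumPy sketch, assuming the log-weights have already been computed per the paper's Eq. (12) (not reproduced here); the function name is our own:

```python
import numpy as np

def truncated_importance_resample(samples, log_w, tau=0.9, rng=None):
    """Clip IS weights at their tau-quantile, renormalize, and resample.

    samples: (B, d) array of proposals; log_w: (B,) unnormalized log-weights.
    Returns B points drawn with replacement, mirroring Algorithm 1's
    truncate-and-resample loop.
    """
    rng = rng or np.random.default_rng()
    # Exponentiate in a numerically stable way (subtract the max log-weight).
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    # Truncate: cap every weight at the tau-quantile, then renormalize.
    w_max = np.quantile(w, tau)
    w = np.minimum(w, w_max)
    w /= w.sum()
    # Categorical resampling with replacement.
    idx = rng.choice(len(w), size=len(w), p=w)
    return samples[idx]
```

Truncating at the τ-quantile bounds the influence of any single proposal, trading a small bias for a large reduction in the variance of the resampled population.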
Open Source Code Yes The code for the paper will be available at https://github.com/cambridge-mlg/Progressive-Tempering-Sampler-with-Diffusion.
Open Datasets Yes Mixture of 40 Gaussians (MoG-40) is a mixture of Gaussians with 40 components in 2-dimensional space, proposed by Midgley et al. (2023). Many-Well-32 (MW-32) is a multi-modal distribution in 32-dimensional space with 2^32 modes, proposed by Midgley et al. (2023). Lennard-Jones-n (LJ-n) describes an n-particle system... We employ the cubic spline interpolation introduced by Moore et al. (2024). Alanine Dipeptide (ALDP) is a 22-particle system... We use the implementation in Midgley et al. (2023) to calculate the energy.
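A target of the MoG-40 kind can be written down in a few lines. A minimal NumPy sketch, assuming equally weighted isotropic components with means placed uniformly at random in a box; these placements are illustrative placeholders, not the actual means of Midgley et al. (2023):

```python
import numpy as np

def make_mog(n_components=40, box=40.0, scale=1.0, seed=0):
    """Random 2-D mixture of equally weighted isotropic Gaussians."""
    rng = np.random.default_rng(seed)
    means = rng.uniform(-box, box, size=(n_components, 2))

    def log_density(x):
        # x: (N, 2) -> (N,) log p(x) under the mixture.
        d2 = ((x[:, None, :] - means[None, :, :]) ** 2).sum(-1)
        log_comp = -0.5 * d2 / scale**2 - np.log(2 * np.pi * scale**2)
        # Stable log-sum-exp over components.
        m = log_comp.max(axis=1, keepdims=True)
        lse = m[:, 0] + np.log(np.exp(log_comp - m).sum(axis=1))
        return lse - np.log(n_components)

    return means, log_density
```

Such targets are useful sampler benchmarks precisely because the 40 well-separated modes defeat local MCMC moves: a correct sampler must place mass on all of them.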
Dataset Splits No The paper describes generating samples for training diffusion models (e.g., 'collect samples into buffers BK and BK 1', 'Draw B samples {xn}B n=1 for Tk 2'). It refers to 'Mixture of 40 Gaussians (Mo G-40)', 'Many-Well-32 (MW-32)', 'Lennard-Jones-n (LJ-n)', and 'Alanine Dipeptide (ALDP)' as target distributions for sampling tasks, not as datasets with conventional train/test/validation splits for supervised learning. The evaluation metrics (W2, TVD, MMD) are calculated between generated samples and ground truth/reference distributions, not on separate test sets from a fixed dataset split.
Hardware Specification No The paper acknowledges resources like 'Cambridge Service for Data-Driven Discovery (CSD3)' provided by 'Dell EMC and Intel' and 'LUMI supercomputer', but does not provide specific details on GPU models, CPU processors, or memory configurations used for the experiments.
Software Dependencies No The paper mentions software components such as the 'EGNN implemented by Satorras et al. (2021)', the 'Python optimal transport package (POT, Flamary et al., 2021)', and the 'codebase of DiGS (Chen et al., 2024b)', but it does not provide specific version numbers for these or other key software dependencies such as Python or PyTorch.
Experiment Setup Yes Table 5 (Hyperparameter settings for PTSD on different targets) provides specific hyperparameter values for tasks including Temperature range, Number of temperatures, Buffer Size, Batch size, Number of initial PT chains, Number of initial PT steps, PT swap interval, Burn-in at the initial PT, Interval for subsampling the initial PT chain, Number of generated samples at extrapolation, Number of PT chains at extrapolation, Number of PT steps after extrapolation, Number of training iterations, and whether Importance resampling is used at the last step.
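The Table 5 hyperparameters can be gathered into one configuration object for a reimplementation. A minimal sketch: the field names follow the list above, but every default value below is an illustrative placeholder, not a number taken from the paper's Table 5:

```python
from dataclasses import dataclass

@dataclass
class PTSDConfig:
    """Hyperparameters named in Table 5. All defaults are placeholders."""
    temperature_range: tuple = (1.0, 10.0)        # placeholder
    num_temperatures: int = 8                     # placeholder
    buffer_size: int = 10_000                     # placeholder
    batch_size: int = 256                         # placeholder
    num_initial_pt_chains: int = 16               # placeholder
    num_initial_pt_steps: int = 1_000             # placeholder
    pt_swap_interval: int = 10                    # placeholder
    initial_pt_burn_in: int = 100                 # placeholder
    initial_pt_subsample_interval: int = 10       # placeholder
    num_extrapolation_samples: int = 10_000       # placeholder
    num_extrapolation_pt_chains: int = 16         # placeholder
    num_pt_steps_after_extrapolation: int = 100   # placeholder
    num_training_iterations: int = 5_000          # placeholder
    importance_resampling_at_last_step: bool = True
```

Keeping every Table 5 knob in one dataclass makes per-target settings easy to diff and serialize when reproducing the experiments.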