Lazy Testing of Machine-Learning Models
Authors: Anastasia Isychev, Valentin Wüstholz, Maria Christakis
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 Experimental Evaluation. We evaluate the effectiveness of LAZ by focusing on the following research questions: RQ1: Does LAZ improve test throughput? RQ2: Does the exploration phase improve test throughput? RQ3: How significant is the analysis overhead? RQ4: What is the effect of LAZ's hyperparameters? 6.1 Benchmarks. We use LAZ to test models from a variety of different domains, involving tabular data (German Credit [Hofmann, 1994] and COMPAS [Larson et al., 2016]), images (MNIST [LeCun et al., 1999]), natural language (Hotel Review [Liu, 2017]), speech (Speech Command [Warden, 2018]), and action policies (Lunar Lander and Bipedal Walker [Brockman et al., 2016]). |
| Researcher Affiliation | Collaboration | TU Wien, Austria; Consensys |
| Pseudocode | Yes | Algorithm 1: The LAZ exploration phase. Input: spec (NOMOS specification); testBudget (total number of tests in the campaign); explorePt (percentage of tests to use for exploration); marginPt (time margin, in percent, across orders). Output: bestCfg (configuration with best invocation order). |
| Open Source Code | Yes | We implement our technique in the publicly available tool LAZ: https://github.com/Rigorous-Software-Engineering/LaZ |
| Open Datasets | Yes | We use LAZ to test models from a variety of different domains, involving tabular data (German Credit [Hofmann, 1994] and COMPAS [Larson et al., 2016]), images (MNIST [LeCun et al., 1999]), natural language (Hotel Review [Liu, 2017]), speech (Speech Command [Warden, 2018]), and action policies (Lunar Lander and Bipedal Walker [Brockman et al., 2016]). |
| Dataset Splits | No | For each testing campaign, we use a budget of 1000 tests. To account for fluctuations in running time due to randomness in the testing process, we run each experiment with 5 different random seeds. ... The default setting uses 10% of the tests for exploration (parameter explorePt from Alg. 1) and a 10% margin to decide the winning configuration (parameter marginPt from Alg. 1). The paper describes the testing budget and exploration phase of the LAZ framework but does not specify training/validation/test splits for the underlying machine-learning models in the benchmarks, stating instead that 'We use the pre-trained models'. |
| Hardware Specification | Yes | Hardware. We run all experiments sequentially (no parallelism) on a machine with a Quadro RTX 8000 GPU and an Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz. |
| Software Dependencies | No | We implemented LAZ in Python as an extension of the NOMOS framework. For the static analysis, we use the MOPSA abstract interpreter [Journault et al., 2019] with the default Intervals domain and a setting to unroll all loops. The paper mentions software names (Python, MOPSA) but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Hyperparameters. For variants 3–6 that include the exploration phase, we compare several hyperparameter settings. The default setting uses 10% of the tests for exploration (parameter explorePt from Alg. 1) and a 10% margin to decide the winning configuration (parameter marginPt from Alg. 1). To evaluate this setting, we independently double and halve each of the two parameters to obtain the following explorePt/marginPt settings expressed in percent: 5/10, 10/5, 20/10, 10/20. We compare the default 10/10 setting with these. |
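The quoted Algorithm 1 spends an explorePt fraction of the test budget timing candidate invocation orders and picks the one with the best running time, using marginPt to break near-ties. A minimal sketch of that idea, assuming illustrative names (`explore`, `run_test`, `candidate_orders` are not LAZ's actual API):

```python
import time

def explore(candidate_orders, run_test, test_budget, explore_pt=10, margin_pt=10):
    """Sketch of an exploration phase: time each candidate invocation order
    on an equal share of explore_pt percent of the test budget, then return
    the first order whose total time is within margin_pt percent of the best."""
    explore_budget = test_budget * explore_pt // 100
    per_order = max(1, explore_budget // len(candidate_orders))
    timings = []
    for order in candidate_orders:
        start = time.perf_counter()
        for _ in range(per_order):
            run_test(order)  # run one test under this invocation order
        timings.append((time.perf_counter() - start, order))
    best_time = min(t for t, _ in timings)
    # Orders within the margin of the best time are treated as ties,
    # resolved in favor of the earlier candidate.
    for t, order in timings:
        if t <= best_time * (1 + margin_pt / 100):
            return order
```

With the paper's default 10/10 setting and a budget of 1000 tests, this sketch would spend 100 tests on exploration, split evenly across the candidate orders, before the remaining budget runs under the winning order.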