Lazy Testing of Machine-Learning Models
Authors: Anastasia Isychev, Valentin Wüstholz, Maria Christakis
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 6 Experimental Evaluation. We evaluate the effectiveness of LAZ by focusing on the following research questions: RQ1: Does LAZ improve test throughput? RQ2: Does the exploration phase improve test throughput? RQ3: How significant is the analysis overhead? RQ4: What is the effect of LAZ's hyperparameters? 6.1 Benchmarks. We use LAZ to test models from a variety of different domains, involving tabular data (German Credit [Hofmann, 1994] and COMPAS [Larson et al., 2016]), images (MNIST [LeCun et al., 1999]), natural language (Hotel Review [Liu, 2017]), speech (Speech Command [Warden, 2018]), and action policies (Lunar Lander and Bipedal Walker [Brockman et al., 2016]). |
| Researcher Affiliation | Collaboration | TU Wien, Austria; Consensys |
| Pseudocode | Yes | Algorithm 1: The LAZ exploration phase. Input: spec (NOMOS specification); testBudget (total number of tests in the campaign); explorePt (percentage of tests to use for exploration); marginPt (time margin, in percent, across orders). Output: bestCfg (configuration with best invocation order). |
| Open Source Code | Yes | We implement our technique in the publicly available tool LAZ: https://github.com/Rigorous-Software-Engineering/LaZ |
| Open Datasets | Yes | We use LAZ to test models from a variety of different domains, involving tabular data (German Credit [Hofmann, 1994] and COMPAS [Larson et al., 2016]), images (MNIST [LeCun et al., 1999]), natural language (Hotel Review [Liu, 2017]), speech (Speech Command [Warden, 2018]), and action policies (Lunar Lander and Bipedal Walker [Brockman et al., 2016]). |
| Dataset Splits | No | For each testing campaign, we use a budget of 1000 tests. To account for fluctuations in running time due to randomness in the testing process, we run each experiment with 5 different random seeds. ... The default setting uses 10% of the tests for exploration (parameter explorePt from Alg. 1) and a 10% margin to decide the winning configuration (parameter marginPt from Alg. 1). The paper describes the testing budget and exploration phase of the LAZ framework but does not specify training/validation/test splits for the underlying machine-learning models in the benchmarks, stating instead that 'We use the pre-trained models'. |
| Hardware Specification | Yes | Hardware. We run all experiments sequentially (no parallelism) on a machine with a Quadro RTX 8000 GPU and an Intel(R) Xeon(R) Gold 6248R CPU @ 3.00GHz. |
| Software Dependencies | No | We implemented LAZ in Python as an extension of the NOMOS framework. For the static analysis, we use the MOPSA abstract interpreter [Journault et al., 2019] with the default Intervals domain and a setting to unroll all loops. The paper mentions software names (Python, MOPSA) but does not provide specific version numbers for them. |
| Experiment Setup | Yes | Hyperparameters. For variants 3–6 that include the exploration phase, we compare several hyperparameter settings. The default setting uses 10% of the tests for exploration (parameter explorePt from Alg. 1) and a 10% margin to decide the winning configuration (parameter marginPt from Alg. 1). To evaluate this setting, we independently double and halve each of the two parameters to obtain the following explorePt/marginPt settings expressed in percent: 5/10, 10/5, 20/10, 10/20. We compare the default 10/10 setting with these. |
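The quoted Algorithm 1 spends an explorePt fraction of the test budget timing candidate invocation orders and picks the one with the best running time, using marginPt to break near-ties. A minimal sketch of that idea, assuming illustrative names (`explore`, `run_test`, `candidate_orders` are not LAZ's actual API):

```python
import time

def explore(candidate_orders, run_test, test_budget, explore_pt=10, margin_pt=10):
    """Sketch of an exploration phase: time each candidate invocation order
    on an equal share of explore_pt percent of the test budget, then return
    the first order whose total time is within margin_pt percent of the best."""
    explore_budget = test_budget * explore_pt // 100
    per_order = max(1, explore_budget // len(candidate_orders))
    timings = []
    for order in candidate_orders:
        start = time.perf_counter()
        for _ in range(per_order):
            run_test(order)  # run one test under this invocation order
        timings.append((time.perf_counter() - start, order))
    best_time = min(t for t, _ in timings)
    # Orders within the margin of the best time are treated as ties,
    # resolved in favor of the earlier candidate.
    for t, order in timings:
        if t <= best_time * (1 + margin_pt / 100):
            return order
```

With the paper's default 10/10 setting and a budget of 1000 tests, this sketch would spend 100 tests on exploration, split evenly across the candidate orders, before the remaining budget runs under the winning order.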