Pitfalls and Best Practices in Algorithm Configuration
Authors: Katharina Eggensperger, Marius Lindauer, Frank Hutter
JAIR 2019
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Although the usability of AC systems improved over the years (e.g., SpySMAC; Falkner, Lindauer, & Hutter, 2015), we still often observe fundamental issues in the design and execution of experiments with algorithm configuration methods by both experts and new users. The goals of this work are therefore to: highlight the many pitfalls we have encountered in AC experiments (run by ourselves and others); present best practices to avoid most of these pitfalls; and propose a unified interface between an AC system and the algorithm it optimizes (the so-called target algorithm) that directly implements best practices related to properly measuring the target algorithm's performance with different parameter settings. ... As an exemplary application where AC yields dramatic speedups, we ran SMAC to optimize 75 parameters of the solver Clasp (Gebser et al., 2012) to solve N-Rooks (Manthey & Steinke, 2014a) instances. We will return to this scenario in more detail in Subsection 4.2. Here, we used a training set of 484 instances and a test set of 351 instances to evaluate the best found configurations over time. We used a cutoff of 300 seconds, within which the default configuration solves 82% of all training instances. Figure 2 reports results from 16 independent SMAC runs, showing that AC using an adequate setup can robustly yield large speedups compared to not tuning the algorithm. |
| Researcher Affiliation | Academia | Katharina Eggensperger EMAIL Marius Lindauer EMAIL Frank Hutter EMAIL Institut für Informatik, Albert-Ludwigs-Universität Freiburg, Georges-Köhler-Allee 74, 79110 Freiburg, Germany |
| Pseudocode | Yes | Listing 1 shows an example for how to extend the GenericWrapper4AC to wrap the well-known SAT solver MiniSAT (Eén & Sörensson, 2004). Since the output format is standardized in the SAT community, we already provide a domain-specific generic wrapper, called SatWrapper, which can parse and verify the SAT solver's output using standard tools from the annual SAT competitions. Therefore, SAT solver users only need to implement one method, which constructs a command line call string for their SAT solver from the provided input arguments (parameter settings, instance, cutoff time, seed). ... Listing 2: Example Generic Wrapper from scratch |
| Open Source Code | Yes | We also provide an open-source implementation of a generic wrapper that provides a unified interface for the communication between target algorithms and configurators and for limiting resource consumption. ... Our package called GenericWrapper4AC is available at https://github.com/automl/GenericWrapper4AC. |
| Open Datasets | Yes | our algorithm configuration benchmark library AClib (Hutter et al., 2014a) provides a very broad collection of such benchmarks. ... See www.aclib.net |
| Dataset Splits | Yes | Split your instances into training and test instances (see Section 5.1); the test instances are later used to safeguard against over-tuning effects (see Section 4.2); ... Here, we used a training set of 484 instances and a test set of 351 instances to evaluate the best found configurations over time. We used a cutoff of 300 seconds, within which the default configuration solves 82% of all training instances. |
| Hardware Specification | Yes | Where not specified otherwise, we ran all experiments on the University of Freiburg's META cluster, each of whose nodes shares 64 GB of RAM among two Intel Xeon E5-2650v2 8-core CPUs with 20 MB L3 cache and runs Ubuntu 14.04 LTS 64 bit. ... As different machine types, we used AWS m4.4xlarge instances with 2.4-GHz Intel Xeon E5-2676 v3 CPUs with 30 MB level-3 cache and the META cluster at the University of Freiburg with 2.6-GHz Intel Xeon E5-2650v2 8-core CPUs with 20 MB L3 cache. |
| Software Dependencies | Yes | Where not specified otherwise, we ran all experiments on the University of Freiburg's META cluster, each of whose nodes shares 64 GB of RAM among two Intel Xeon E5-2650v2 8-core CPUs with 20 MB L3 cache and runs Ubuntu 14.04 LTS 64 bit. |
| Experiment Setup | Yes | Define the resource limitations your algorithm (cutoff time and memory limit) and the configurator (configuration budget) should respect (see Section 5.4); ... Define your cost metric to be optimized; if the cost metric is runtime, configurators typically optimize PAR10 as the metric of interest, which is the penalized average runtime (in CPU seconds) counting runs exceeding the cutoff time κ as 10 κ; furthermore please consider Pitfalls 2 and 3 (see Section 3.2) and recommendations in Section 5.8 for runtime optimization; ... Here, we used a training set of 484 instances and a test set of 351 instances to evaluate the best found configurations over time. We used a cutoff of 300 seconds, within which the default configuration solves 82% of all training instances. |
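The PAR10 metric quoted in the Experiment Setup row (runs exceeding the cutoff time κ are counted as 10 κ) can be computed in a few lines. The `par10` helper below is purely illustrative and not part of any configurator's API:

```python
def par10(runtimes, cutoff):
    """Penalized average runtime: runs that hit the cutoff count as 10x cutoff."""
    return sum(t if t < cutoff else 10 * cutoff for t in runtimes) / len(runtimes)

# Three runs with a 300-second cutoff; the run recorded at 300 s timed out
# and is penalized as 3000 s.
print(par10([120.0, 45.0, 300.0], 300))  # -> (120 + 45 + 3000) / 3 = 1055.0
```

The heavy 10× penalty is what makes PAR10 sensitive to timeouts: a configuration that solves slightly fewer instances within the cutoff scores much worse than one that is merely slower on solved instances.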
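The Pseudocode row quotes the paper's wrapper design: SAT solver users implement a single method that turns the configurator's inputs (parameter settings, instance, cutoff time, seed) into a command line call string. A minimal sketch of that idea follows; the class and method names (`MiniSatWrapper`, `build_call`) and the flag conventions are hypothetical stand-ins, not the actual GenericWrapper4AC API:

```python
class MiniSatWrapper:
    """Illustrative sketch: the one method a solver-specific wrapper supplies."""

    def build_call(self, instance, cutoff, seed, params):
        # params: dict of parameter name -> value chosen by the configurator.
        # Sorting keeps the call string deterministic across runs.
        flags = " ".join(f"-{name}={value}" for name, value in sorted(params.items()))
        return f"minisat -cpu-lim={int(cutoff)} -rnd-seed={seed} {flags} {instance}"

# Example: the configurator proposes one parameter setting for one instance.
call = MiniSatWrapper().build_call("inst.cnf", 300, 42, {"luby": "True"})
print(call)  # -> minisat -cpu-lim=300 -rnd-seed=42 -luby=True inst.cnf
```

In the real package, parsing and verifying the solver's output (the SatWrapper functionality quoted above) and enforcing resource limits are handled by the shared wrapper infrastructure, so only this call-construction step is solver-specific.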