RZ-NAS: Enhancing LLM-guided Neural Architecture Search via Reflective Zero-Cost Strategy

Authors: Zipeng Ji, Guanghui Zhu, Chunfeng Yuan, Yihua Huang

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | We evaluate RZ-NAS on multiple widely adopted Zero-Cost NAS proxies for different downstream tasks. The details of Zero-Cost proxies are provided in Appendix A.1. We define the mutation space based on the set of architecture operators specified in the system prompt. The search procedure runs for 1500 evolutionary iterations. The population size is set to 100 for NAS-Bench-201 and CIFAR-10, and 256 for CIFAR-100, ImageNet, and COCO. All populations are initialized from scratch using random sampling. The search spaces differ by task: for NAS-Bench-201, CIFAR-10, and CIFAR-100, we use the micro cell-based search space; for ImageNet, we adopt the MobileNet macro search space, consistent with prior works such as Zico and Zen-NAS. For COCO object detection, we stack operators to build the backbone following the same configuration used in MAE-DET (Sun et al., 2022). In all experiments, we use the GPT-4o model to generate mutations. We also perform an ablation study with different LLMs in Appendix A.4.2. We sample the model's temperature from [0.2, 0.4, 0.6, 0.8, 1.0] to encourage output diversity. The remaining settings match those of the respective Zero-Cost proxies. In RZ-NAS, the number of input tokens and output tokens is in the range of 2300-2600 and 150-200, respectively. We perform 1500 iterations per proxy in each search space, so the total cost per proxy is around $75.
Researcher Affiliation | Academia | State Key Laboratory for Novel Software Technology, Nanjing University, Nanjing, China. Correspondence to: Guanghui Zhu <EMAIL>.
Pseudocode | Yes | Algorithm 1: LLM-guided Mutation and Zero-Cost Evaluation Strategy
Open Source Code | Yes | RZ-NAS is available at https://github.com/PasaLab/RZ-NAS.
Open Datasets | Yes | We evaluate RZ-NAS on multiple widely adopted Zero-Cost NAS proxies for different downstream tasks. ... The population size is set to 100 for NAS-Bench-201 and CIFAR-10, and 256 for CIFAR-100, ImageNet, and COCO. ... We compare the accuracy performance on the NAS-Bench-201 Benchmark as shown in Table 2. ... For training, we use ResNet-like backbones on the COCO dataset (Lin et al., 2014), incorporating multi-scale training and Synchronized Batch Normalization.
Dataset Splits | No | The paper mentions common datasets (CIFAR-10, CIFAR-100, ImageNet, NAS-Bench-201, and COCO) and discusses training/validation/test accuracy, but the main text does not state specific split percentages or sample counts. It implies standard benchmark splits without detailing them.
Hardware Specification | No | The paper mentions "GPU days" in Table 1, and code snippets refer to `torch.cuda.set_device(gpu)`, implying the use of GPUs. However, no specific GPU models (e.g., NVIDIA A100, Tesla V100), CPU models, or other hardware details are provided for the experimental setup.
Software Dependencies | No | The paper mentions using the GPT-4o model (and other LLMs such as LLaMA 3.1 and Claude 3.5) for generating mutations and includes Python code snippets using libraries such as `torch`, `nn`, `numpy`, and `torch.nn.functional`. However, it does not provide version numbers for any of these software dependencies.
Experiment Setup | Yes | The search procedure runs for 1500 evolutionary iterations. The population size is set to 100 for NAS-Bench-201 and CIFAR-10, and 256 for CIFAR-100, ImageNet, and COCO. All populations are initialized from scratch using random sampling. In all experiments, we use the GPT-4o model to generate mutations. We sample the temperature of the model from [0.2, 0.4, 0.6, 0.8, 1.0] to encourage output diversity. ... For training, we use ResNet-like backbones on the COCO dataset (Lin et al., 2014), incorporating multi-scale training and Synchronized Batch Normalization.
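The search procedure described above (Algorithm 1: LLM-guided mutation with zero-cost evaluation) can be sketched as a standard evolutionary loop. This is a minimal, hypothetical sketch, not the authors' implementation: `llm_mutate` stands in for the GPT-4o mutation call and `zero_cost_score` for whichever Zero-Cost proxy (e.g. Zico, Zen-score) is being used; the per-call temperature is sampled from the set reported in the paper.

```python
import random

def evolutionary_search(init_population, llm_mutate, zero_cost_score,
                        iterations=1500,
                        temperatures=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Hypothetical sketch of an LLM-guided zero-cost evolutionary search.

    llm_mutate(arch, temperature) -> mutated architecture (the LLM call)
    zero_cost_score(arch) -> scalar proxy score (training-free evaluation)
    """
    population = list(init_population)
    scores = [zero_cost_score(a) for a in population]
    for _ in range(iterations):
        # Mutate the current best candidate (one of several possible
        # parent-selection schemes; the paper's exact scheme may differ).
        parent = population[max(range(len(population)),
                                key=scores.__getitem__)]
        # Sample a fresh temperature per call to encourage output diversity.
        child = llm_mutate(parent, temperature=random.choice(temperatures))
        child_score = zero_cost_score(child)
        # Replace the worst member only if the child improves on it.
        worst = min(range(len(population)), key=scores.__getitem__)
        if child_score > scores[worst]:
            population[worst], scores[worst] = child, child_score
    best = max(range(len(population)), key=scores.__getitem__)
    return population[best], scores[best]
```

Because replacement happens only on improvement, the best proxy score in the population is non-decreasing over iterations, which matches the monotone search curves typical of zero-cost evolutionary NAS.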
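The reported API budget (1500 iterations, 2300-2600 input tokens and 150-200 output tokens, roughly $75 per proxy) can be sanity-checked with simple arithmetic. The sketch below assumes the token counts are per LLM call; the per-million-token prices are hypothetical placeholders, since the actual GPT-4o pricing at the time of the experiments is not stated in the paper.

```python
# Sanity-check the per-proxy API cost from the reported token counts.
ITERATIONS = 1500
AVG_INPUT_TOKENS = (2300 + 2600) / 2    # assumed per LLM call
AVG_OUTPUT_TOKENS = (150 + 200) / 2     # assumed per LLM call

def total_cost(price_in_per_m, price_out_per_m):
    """Total API cost (USD) for one proxy in one search space,
    given input/output prices per million tokens."""
    tokens_in = ITERATIONS * AVG_INPUT_TOKENS
    tokens_out = ITERATIONS * AVG_OUTPUT_TOKENS
    return (tokens_in / 1e6 * price_in_per_m
            + tokens_out / 1e6 * price_out_per_m)
```

At these token counts a full run consumes about 3.7M input tokens and 0.26M output tokens, so the dollar total depends almost entirely on the input-token rate; the $75 figure reported in the paper implies whatever rate was in effect for their account and time period.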