Automated Superscalar Processor Design by Learning Data Dependencies

Authors: Shuyao Cheng, Rui Zhang, Wenkai He, Pengwei Jin, Chongxiao Li, Zidong Du, Xing Hu, Yifan Hao, Guanglin Xu, Yuanbo Wen, Ling Li, Qi Guo, Yunji Chen

IJCAI 2025

Reproducibility Variable | Result | LLM Response
Research Type: Experimental. From Section 4 (Evaluation): In this section, we evaluate the proposed State-BSD method in terms of processor design performance, predictor effectiveness, and the efficiency of the State-BSD components. First, we compare State-BSD with state-of-the-art automated processor design methods. Second, we compare State-BSD with the human design paradigm. Third, we provide a detailed algorithm analysis of State-BSD, demonstrating the effectiveness of every State-BSD component. We evaluate State-BSD on a large-scale, real-world RISC-V 32IA CPU, outperforming the largest-scale processor that state-of-the-art automated methods can design. It is the second version of the automated CPU design after the one proposed in [Cheng et al., 2024], and is called QiMeng-CPU-v2. Our design is a 4-ALU superscalar design with a 2KB buffer in each predictor. The CPU is functionally correct on over 10^12 instructions from real-world programs, including the Linux system, the SPEC CPU benchmark, and others. Additionally, the CPU is being taped out with 28nm technology; the hardware characteristics are shown in Table 1. The designed CPU is evaluated on the standard CPU benchmarks, Dhrystone [Weicker, 1984] and Coremark [Coremark, 2024], measured by both the benchmark result and the corresponding Cycles per Instruction (CPI) on both a Xilinx Zynq UltraScale+ ZCU104 FPGA and commercial simulation tools.
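The CPI metric quoted above is simple arithmetic over cycle and instruction counters; a minimal sketch of the two reported measurements (function names are illustrative, not from the paper; the 1757 Dhrystones/s baseline is the standard VAX 11/780 normalization for DMIPS):

```python
def cpi(total_cycles: int, total_instructions: int) -> float:
    """Cycles per instruction; lower is better for a superscalar core."""
    return total_cycles / total_instructions

def dmips_per_mhz(dhrystones_per_second: float, clock_mhz: float) -> float:
    """Dhrystone MIPS per MHz: 1757 Dhrystones/s equals 1 DMIPS
    (VAX 11/780 baseline), then normalized by the clock frequency."""
    return (dhrystones_per_second / 1757.0) / clock_mhz
```

For example, a run retiring 2M instructions in 4M cycles yields a CPI of 2.0, and a core sustaining 175,700 Dhrystones/s at 100 MHz scores 1.0 DMIPS/MHz.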
Researcher Affiliation: Collaboration. 1) State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences; 2) University of Chinese Academy of Sciences; 3) Cambricon Technologies; 4) Shanghai Innovation Center for Processor Technologies; 5) Institute of Software, Chinese Academy of Sciences
Pseudocode: No. The paper describes the methods used for training the state-selector (simulated annealing) and the state-speculator (BSD expansion) in detail in the Methodology section (Section 3.2), including algorithmic steps and formulas (e.g., the acceptance probability P in simulated annealing). However, these descriptions are presented in prose and mathematical notation, not as distinct 'Pseudocode' or 'Algorithm' blocks.
Open Source Code: Yes. The resources are open-sourced at https://qimeng-ict.github.io.
Open Datasets: Yes. The CPU is functionally correct on over 10^12 instructions from real-world programs, including the Linux system, the SPEC CPU benchmark, and others. Additionally, the CPU is being taped out with 28nm technology; the hardware characteristics are shown in Table 1. The designed CPU is evaluated on the standard CPU benchmarks, Dhrystone [Weicker, 1984] and Coremark [Coremark, 2024], measured by both the benchmark result and the corresponding Cycles per Instruction (CPI) on both a Xilinx Zynq UltraScale+ ZCU104 FPGA and commercial simulation tools.
Dataset Splits: No. The paper uses standard CPU benchmarks (Dhrystone and Coremark) and real-world programs, including Linux and the SPEC benchmarks, for evaluation. It discusses performance metrics but does not specify any explicit training/test/validation dataset splits typically associated with machine learning models. The methods described involve training a predictor using simulated annealing and BSD expansion, which do not inherently rely on predefined dataset splits in the conventional sense.
Hardware Specification: Yes. The designed CPU is evaluated on the standard CPU benchmarks, Dhrystone [Weicker, 1984] and Coremark [Coremark, 2024], measured by both the benchmark result and the corresponding Cycles per Instruction (CPI) on both a Xilinx Zynq UltraScale+ ZCU104 FPGA and commercial simulation tools. ... The design and evaluation run on CentOS 7.8 with 2 Intel Xeon Gold 6230 CPUs and 512 GB of memory.
Software Dependencies: No. The paper mentions that "the design and evaluation run on CentOS 7.8" and that the predictor is implemented in "Verilog code". It also refers to "commercial simulation tools" but does not specify version numbers for these tools or for any other key software (e.g., Python packages, machine learning frameworks, or specific compilers) that would be necessary for reproducibility.
Experiment Setup: Yes. The state-selector is trained by a simulated annealing method, starting with a randomly selected set of states S_0, and iteratively updating the selected state set to optimize the energy function (i.e., the reusability of the selected states). At step k, it randomly samples a candidate set of states S'_k, which differs from the current set S_k by a small change, then calculates the energy function E(S'_k). If E(S'_k) < E(S_k), showing that S'_k is a more reusable set of states, S'_k is taken to update the selected set of states, i.e., S_{k+1} = S'_k. Otherwise, if E(S'_k) >= E(S_k), S'_k is taken to update the selected set with an adaptive probability P = e^(-(E(S'_k) - E(S_k)) / T), where T is an adaptive hyperparameter that decreases as k increases, so that a candidate S'_k is more likely to be accepted if it has better reusability.
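The annealing loop described above can be sketched as follows. This is a minimal illustration of the standard Metropolis acceptance rule, not the paper's implementation: the `states` universe, `energy` function (lower = more reusable), `n_select`, step count, and cooling schedule are all hypothetical placeholders.

```python
import math
import random

def simulated_annealing_select(states, energy, n_select,
                               steps=1000, t0=1.0, cooling=0.995):
    """Select a reusable subset of states by simulated annealing.

    energy(S) must return a lower value for a more reusable set S;
    n_select must be smaller than len(states) so a swap is always possible.
    """
    current = random.sample(states, n_select)   # S_0: random initial set
    e_current = energy(current)
    t = t0
    for _ in range(steps):
        # S'_k: small perturbation of S_k (swap one selected state)
        candidate = current[:]
        i = random.randrange(len(candidate))
        candidate[i] = random.choice([s for s in states if s not in candidate])
        e_candidate = energy(candidate)
        # Accept if strictly better; otherwise accept with probability
        # P = exp(-(E(S'_k) - E(S_k)) / T)
        if (e_candidate < e_current or
                random.random() < math.exp(-(e_candidate - e_current) / t)):
            current, e_current = candidate, e_candidate
        t *= cooling                            # T decreases as k increases
    return current
```

With a toy energy such as `lambda s: sum(s)` over `range(20)`, the loop drifts toward low-valued subsets as T cools, mirroring how a more reusable state set becomes increasingly favored late in training.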