Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

HexGen-2: Disaggregated Generative Inference of LLMs in Heterogeneous Environment

Authors: Youhe Jiang, Ran Yan, Binhang Yuan

ICLR 2025 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We conduct extensive experiments to evaluate HEXGEN-2, i.e., on OPT (30B) and LLAMA-2 (70B) models in various real-world settings; the results reveal that HEXGEN-2 delivers up to a 2.0× and on average a 1.3× improvement in serving throughput, reduces the average inference latency by 1.5× compared with state-of-the-art systems given the same price budget, and achieves comparable inference performance with a 30% lower price budget.
Researcher Affiliation Academia Youhe Jiang, Ran Yan, Binhang Yuan, Department of Computer Science and Engineering, The Hong Kong University of Science and Technology. Correspond to Binhang Yuan (EMAIL).
Pseudocode No The paper describes a scheduling algorithm in Section 3 and its steps (Graph partition, Max-flow, Iterative refinement) but does not provide a formal pseudocode block or algorithm figure.
Open Source Code No HEXGEN-2 uses a task coordinator to handle the dispatch of incoming LLM inference requests, which is based on an open-source implementation of decentralized computation coordination (Yao, 2023) that utilizes libP2P (LibP2P, 2023) to establish connections among the work groups in a peer-to-peer network. All parallel communications in HEXGEN-2 are implemented using NVIDIA Collective Communication Library (NCCL) (NVIDIA, 2024), and all required communication groups for different parallelism plans are established in advance to avoid the overhead associated with constructing NCCL groups.
Open Datasets Yes Models. We evaluate HEXGEN-2 on OPT (30B) (Zhang et al., 2022) and LLAMA-2 (70B) (Touvron et al., 2023) models, both are representative and popular open-source transformer models, to study the system performance on models of different sizes. LLM inference workloads. To evaluate the performances in different LLM inference workloads, we run four different types of workloads: heavy prefill with light decoding (HPLD), heavy prefill with heavy decoding (HPHD), light prefill with heavy decoding (LPHD), light prefill with light decoding (LPLD). Prefill requests that have more than 512 tokens are categorized as heavy, others are light, and decoding requests with more than 128 tokens are categorized as heavy (Hu et al., 2024). We generate these workloads using samples from the Azure Conversation dataset (Patel et al., 2024).
Dataset Splits No The paper defines categories for workloads (heavy/light prefill/decoding based on token count) and mentions generating these workloads from the Azure Conversation dataset. It also describes online and offline testing with arrival rates and shows a distribution for online request traces (Figure 5). However, it does not specify explicit training/test/validation splits for any dataset used for model training or evaluation in terms of percentages or sample counts.
Hardware Specification Yes Homogeneous setup: We rent one on-demand instance equipped with 8 NVIDIA H100-80G GPUs, with a budget of $29.52/hour to represent the standard homogeneous case. Heterogeneous setups: We utilize four types of GPUs: H100, A100, L40, and A6000, to construct five different heterogeneous cluster setups, where the first four settings use a similar budget as the homogeneous setting, while the last setting uses a 70% budget of the homogeneous setting. The detailed configuration is illustrated in Figure 4.
Software Dependencies No HEXGEN-2 uses a task coordinator to handle the dispatch of incoming LLM inference requests, which is based on an open-source implementation of decentralized computation coordination (Yao, 2023) that utilizes libP2P (LibP2P, 2023) to establish connections among the work groups in a peer-to-peer network. All parallel communications in HEXGEN-2 are implemented using NVIDIA Collective Communication Library (NCCL) (NVIDIA, 2024), and all required communication groups for different parallelism plans are established in advance to avoid the overhead associated with constructing NCCL groups. Furthermore, HEXGEN-2 integrates popular features for optimizing LLM inference such as continuous batching (Yu et al., 2022), Flash Attention (Dao et al., 2022; Dao, 2024), Paged Attention (Kwon et al., 2023), and supports open-source LLMs such as OPT (Zhang et al., 2022) and LLAMA (Touvron et al., 2023).
Experiment Setup Yes LLM inference workloads. To evaluate the performances in different LLM inference workloads, we run four different types of workloads: heavy prefill with light decoding (HPLD), heavy prefill with heavy decoding (HPHD), light prefill with heavy decoding (LPHD), light prefill with light decoding (LPLD). Prefill requests that have more than 512 tokens are categorized as heavy, others are light, and decoding requests with more than 128 tokens are categorized as heavy (Hu et al., 2024). We generate these workloads using samples from the Azure Conversation dataset (Patel et al., 2024). Online and offline testing. We test two different arrival rates: In the online setting, we scale the average arrival rate to 75% of the cluster's peak throughput to prevent request bursts that could cause system outages due to out-of-memory (OOM) errors; Figure 5 illustrates the distribution of input and output lengths in our trace. In the offline setting, we permit requests to arrive at a rate that fully utilizes the cluster, testing all four types of workloads (HPLD, HPHD, LPHD, LPLD). Memory requirement estimation for a single model replica. The memory cost model in Table 1 estimates the memory required for a single model replica. To determine the total memory requirement for a single model replica, we assume a batch size of 32 concurrent requests (i.e., bt = 32). Thus, the total memory requirement is calculated as: model parameter size + 32 * KV cache size per request.
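The workload categorization quoted under Experiment Setup (prefill heavy above 512 tokens, decoding heavy above 128 tokens) can be sketched as a small helper. The function name and interface are illustrative, not taken from the paper:

```python
def classify_workload(prefill_tokens: int, decode_tokens: int) -> str:
    """Label a request HPLD/HPHD/LPHD/LPLD using the paper's thresholds:
    prefill requests with more than 512 tokens are heavy (HP), and
    decoding requests with more than 128 tokens are heavy (HD)."""
    prefill = "HP" if prefill_tokens > 512 else "LP"
    decode = "HD" if decode_tokens > 128 else "LD"
    return prefill + decode

# For example, a 600-token prompt with a 50-token completion is
# heavy prefill with light decoding:
assert classify_workload(600, 50) == "HPLD"
```

Note that both thresholds are strict ("more than"), so a request at exactly 512 prefill or 128 decode tokens is light under this reading.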
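The memory rule quoted above (model parameter size + 32 × KV cache size per request) can also be sketched numerically. The per-token KV-cache formula and the LLAMA-2 (70B) shape constants used below (80 layers, 8 grouped-query KV heads, head dimension 128, fp16 weights and cache) are standard published model details, not taken from the quoted text, so treat the resulting numbers as an illustration rather than the paper's Table 1 cost model:

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int,
                             head_dim: int, dtype_bytes: int = 2) -> int:
    # One K and one V vector per layer, each num_kv_heads * head_dim wide.
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def replica_memory_bytes(param_count: int, seq_len: int, batch_size: int = 32,
                         num_layers: int = 80, num_kv_heads: int = 8,
                         head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # Total = model parameter size + batch_size * KV cache size per request,
    # matching the paper's rule with bt = 32 concurrent requests by default.
    params = param_count * dtype_bytes
    kv_per_request = seq_len * kv_cache_bytes_per_token(
        num_layers, num_kv_heads, head_dim, dtype_bytes)
    return params + batch_size * kv_per_request

# Illustrative estimate for a 70B-parameter model with 4096-token requests:
total = replica_memory_bytes(param_count=70_000_000_000, seq_len=4096)
print(f"{total / 2**30:.0f} GiB")  # -> 170 GiB
```

Under these assumptions each 4096-token request holds 1.25 GiB of KV cache, so the 32-request batch adds 40 GiB on top of the 140 GB of fp16 weights.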