Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation

Authors: Shunfan Zheng, Xiechi Zhang, Gerard de Melo, Xiaoling Wang, Linlin Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results demonstrate that HDCEval significantly outperforms existing baseline methods across various medical scenarios. Notably, compared to the PandaLM evaluator, HDCEval achieves an overall improvement in consistency with human evaluations of 23.92%.
Researcher Affiliation | Academia | East China Normal University; Hasso Plattner Institute; University of Potsdam
Pseudocode | Yes | Algorithm 1: Attribute-Driven Token Optimization (ADTO)
Open Source Code | Yes | Models and supplementary materials: https://huggingface.co/collections/AAAzsf/hdceval-6762cda19a07c157778aa22d
Open Datasets | Yes | Data Source: First, we integrate medical questions from different sources, including medical_meadow_wikidoc [1], MedBench (Cai et al. 2024), MedText [2], and MedDialog (Zeng et al. 2020). [1] https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc [2] https://huggingface.co/datasets/BI55/MedText
Dataset Splits | Yes | For the test data, we initially extracted 2,994 samples from the constructed dataset to form the test set, with the remaining samples used as the training set.
Hardware Specification | Yes | We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs.
Software Dependencies | No | Our evaluation models are based on the MedLlama2-7B model. We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. To maximize GPU memory usage and accelerate training, we employed the Fully Sharded Data Parallel (Zhao et al. 2023) strategy and the FlashAttention (Dao et al. 2022) algorithm.
Experiment Setup | Yes | Our evaluation models are based on the MedLlama2-7B model. We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. To maximize GPU memory usage and accelerate training, we employed the Fully Sharded Data Parallel (Zhao et al. 2023) strategy and the FlashAttention (Dao et al. 2022) algorithm. The learning rates for the instruction tuning and direct preference optimization phases are set to 2e-5 and 5e-7, respectively. During inference, we use greedy decoding with a temperature of 0 to minimize randomness.
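The reported split (a fixed 2,994-sample test set carved from the constructed dataset, with the remainder used for training) could be reproduced along these lines. This is a minimal sketch: the shuffling, the seed, and the placeholder records are illustrative assumptions, not details given in the paper.

```python
import random

def split_dataset(samples, test_size=2994, seed=0):
    """Shuffle and carve off a fixed-size test set; the rest is training data.

    The test-set size of 2,994 matches the paper; the seed and shuffle
    strategy are assumptions for illustration.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    test = [samples[i] for i in indices[:test_size]]
    train = [samples[i] for i in indices[test_size:]]
    return train, test

# Toy usage with placeholder records standing in for the real dataset:
data = [{"id": i} for i in range(10000)]
train, test = split_dataset(data)
assert len(test) == 2994 and len(train) == 10000 - 2994
```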
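The training and inference settings quoted above can be collected into one configuration sketch. Only the reported numbers (global batch size 128, max token length 4,096, 4 A100-80GB GPUs, learning rates 2e-5 and 5e-7, temperature 0) come from the paper; the per-device batch size, gradient-accumulation steps, and field names are assumptions made so the arithmetic is explicit.

```python
# Hedged configuration sketch; field names are illustrative, not the
# authors' actual training script.
NUM_GPUS = 4            # 4x NVIDIA A100-80GB (from the paper)
PER_DEVICE_BATCH = 8    # assumption
GRAD_ACCUM_STEPS = 4    # assumption: 4 GPUs * 8 * 4 = 128 global batch

config = {
    "base_model": "MedLlama2-7B",
    "global_batch_size": NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS,
    "max_token_length": 4096,
    "parallelism": "fsdp_full_shard",    # Fully Sharded Data Parallel
    "attention": "flash_attention",      # FlashAttention kernel
    "lr_instruction_tuning": 2e-5,
    "lr_dpo": 5e-7,                      # direct preference optimization phase
    # Greedy decoding at inference to minimize randomness:
    "inference": {"do_sample": False, "temperature": 0.0},
}

assert config["global_batch_size"] == 128
```

The split of the 128-example global batch across per-device batch and accumulation steps is one plausible choice; any factorization across the 4 GPUs that multiplies out to 128 is consistent with the report.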