Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation
Authors: Shunfan Zheng, Xiechi Zhang, Gerard de Melo, Xiaoling Wang, Linlin Wang
AAAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | The experimental results demonstrate that HDCEval significantly outperforms existing baseline methods across various medical scenarios. Notably, compared to the PandaLM evaluator, HDCEval achieves an overall improvement in consistency with human evaluations by 23.92%. |
| Researcher Affiliation | Academia | 1 East China Normal University 2 Hasso Plattner Institute 3 University of Potsdam |
| Pseudocode | Yes | Algorithm 1: Attribute-Driven Token Optimization (ADTO) |
| Open Source Code | Yes | Models and supplementary materials: https://huggingface.co/collections/AAAzsf/hdceval6762cda19a07c157778aa22d |
| Open Datasets | Yes | Data Source: First, we integrate medical questions from different sources including medical_meadow_wikidoc¹, MedBench (Cai et al. 2024), MedText², and MedDialog (Zeng et al. 2020). ¹https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc ²https://huggingface.co/datasets/BI55/MedText |
| Dataset Splits | Yes | For the test data, we initially extracted 2,994 samples from the constructed dataset to form the test set, with the remaining samples used as the training set. |
| Hardware Specification | Yes | We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. |
| Software Dependencies | No | Our evaluation models are based on the MedLlama2-7B model. We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. To maximize GPU memory usage and accelerate training, we employed the Fully Sharded Data Parallel (Zhao et al. 2023) strategy and the Flash Attention (Dao et al. 2022) algorithm. |
| Experiment Setup | Yes | Our evaluation models are based on the MedLlama2-7B model. We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. To maximize GPU memory usage and accelerate training, we employed the Fully Sharded Data Parallel (Zhao et al. 2023) strategy and the Flash Attention (Dao et al. 2022) algorithm. The learning rates for the instruction tuning and direct preference optimization phases are set to 2e-5 and 5e-7, respectively. During inference, we use greedy decoding with a temperature of 0 to minimize randomness. |
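The hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch for anyone attempting a reproduction. This is a minimal illustration only: the field names and the `HDCEvalConfig` class are hypothetical, since the paper does not publish a configuration schema.

```python
# Hypothetical config mirroring the reported HDCEval setup.
# All field names are illustrative; only the values come from the paper.
from dataclasses import dataclass


@dataclass(frozen=True)
class HDCEvalConfig:
    base_model: str = "MedLlama2-7B"
    batch_size: int = 128
    max_token_length: int = 4096
    num_gpus: int = 4                        # NVIDIA A100-80GB
    parallel_strategy: str = "FSDP"          # Fully Sharded Data Parallel (Zhao et al. 2023)
    attention_impl: str = "flash_attention"  # Flash Attention (Dao et al. 2022)
    lr_instruction_tuning: float = 2e-5      # instruction tuning phase
    lr_dpo: float = 5e-7                     # direct preference optimization phase
    decoding: str = "greedy"                 # inference decoding strategy
    temperature: float = 0.0                 # temperature 0 to minimize randomness


cfg = HDCEvalConfig()
print(cfg.base_model, cfg.batch_size, cfg.lr_instruction_tuning)
```

A frozen dataclass keeps the reported values immutable and in one place, which makes it easy to diff a reproduction run against the paper's stated setup.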