Hierarchical Divide-and-Conquer for Fine-Grained Alignment in LLM-Based Medical Evaluation

Authors: Shunfan Zheng, Xiechi Zhang, Gerard de Melo, Xiaoling Wang, Linlin Wang

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | The experimental results demonstrate that HDCEval significantly outperforms existing baseline methods across various medical scenarios. Notably, compared to the PandaLM evaluator, HDCEval achieves an overall improvement in consistency with human evaluations of 23.92%.
Researcher Affiliation | Academia | East China Normal University; Hasso Plattner Institute; University of Potsdam
Pseudocode | Yes | Algorithm 1: Attribute-Driven Token Optimization (ADTO)
Open Source Code | Yes | Models and supplementary materials: https://huggingface.co/collections/AAAzsf/hdceval-6762cda19a07c157778aa22d
Open Datasets | Yes | Data Source: First, we integrate medical questions from different sources, including medical_meadow_wikidoc [1], MedBench (Cai et al. 2024), MedText [2], and MedDialog (Zeng et al. 2020). [1] https://huggingface.co/datasets/medalpaca/medical_meadow_wikidoc [2] https://huggingface.co/datasets/BI55/MedText
Dataset Splits | Yes | For the test data, we initially extracted 2,994 samples from the constructed dataset to form the test set, with the remaining samples used as the training set.
Hardware Specification | Yes | We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs.
Software Dependencies | No | Our evaluation models are based on the MedLlama2-7B model. We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. To maximize GPU memory usage and accelerate training, we employed the Fully Sharded Data Parallel (Zhao et al. 2023) strategy and the FlashAttention (Dao et al. 2022) algorithm.
Experiment Setup | Yes | Our evaluation models are based on the MedLlama2-7B model. We train using a batch size of 128 and a maximum token length of 4,096 on 4 NVIDIA A100-80GB GPUs. To maximize GPU memory usage and accelerate training, we employed the Fully Sharded Data Parallel (Zhao et al. 2023) strategy and the FlashAttention (Dao et al. 2022) algorithm. The learning rates for the instruction tuning and direct preference optimization phases are set to 2e-5 and 5e-7, respectively. During inference, we use greedy decoding with a temperature of 0 to minimize randomness.
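The reported split (a fixed 2,994-sample test set carved from the constructed dataset, with the remainder used for training) could be reproduced along these lines. This is a minimal sketch: the shuffling, the seed, and the placeholder records are illustrative assumptions, not details given in the paper.

```python
import random

def split_dataset(samples, test_size=2994, seed=0):
    """Shuffle and carve off a fixed-size test set; the rest is training data.

    The test-set size of 2,994 matches the paper; the seed and shuffle
    strategy are assumptions for illustration.
    """
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    test = [samples[i] for i in indices[:test_size]]
    train = [samples[i] for i in indices[test_size:]]
    return train, test

# Toy usage with placeholder records standing in for the real dataset:
data = [{"id": i} for i in range(10000)]
train, test = split_dataset(data)
assert len(test) == 2994 and len(train) == 10000 - 2994
```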
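The training and inference settings quoted above can be collected into one configuration sketch. Only the reported numbers (global batch size 128, max token length 4,096, 4 A100-80GB GPUs, learning rates 2e-5 and 5e-7, temperature 0) come from the paper; the per-device batch size, gradient-accumulation steps, and field names are assumptions made so the arithmetic is explicit.

```python
# Hedged configuration sketch; field names are illustrative, not the
# authors' actual training script.
NUM_GPUS = 4            # 4x NVIDIA A100-80GB (from the paper)
PER_DEVICE_BATCH = 8    # assumption
GRAD_ACCUM_STEPS = 4    # assumption: 4 GPUs * 8 * 4 = 128 global batch

config = {
    "base_model": "MedLlama2-7B",
    "global_batch_size": NUM_GPUS * PER_DEVICE_BATCH * GRAD_ACCUM_STEPS,
    "max_token_length": 4096,
    "parallelism": "fsdp_full_shard",    # Fully Sharded Data Parallel
    "attention": "flash_attention",      # FlashAttention kernel
    "lr_instruction_tuning": 2e-5,
    "lr_dpo": 5e-7,                      # direct preference optimization phase
    # Greedy decoding at inference to minimize randomness:
    "inference": {"do_sample": False, "temperature": 0.0},
}

assert config["global_batch_size"] == 128
```

The split of the 128-example global batch across per-device batch and accumulation steps is one plausible choice; any factorization across the 4 GPUs that multiplies out to 128 is consistent with the report.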