Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Improving Natural Language Understanding for LLMs via Large-Scale Instruction Synthesis
Authors: Lin Yuan, Jun Xu, Honghao Gui, Mengshu Sun, Zhiqiang Zhang, Lei Liang, Jun Zhou
AAAI 2025 | Venue PDF | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We conduct extensive experiments on 5 NLU tasks and 28 general capability evaluation datasets for LLMs. Experimental results show that Hum enhances the NLU capabilities of six LLMs by an average of 3.1%, with no significant decline observed in other general capabilities. |
| Researcher Affiliation | Industry | Ant Group, Hangzhou, China EMAIL |
| Pseudocode | No | The paper describes methods through architectural diagrams (Figure 2) and prompt templates (Figure 3) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper mentions using the "Llama Factory (Zheng et al. 2024) framework" for training but does not provide any statement or link for the open-sourcing of their own methodology's code. |
| Open Datasets | Yes | To evaluate the effectiveness of the Hum dataset for natural language understanding, we perform zero-shot experiments on five NLU datasets: CrossNER (Liu et al. 2021) for named entity recognition, FewRel (Han et al. 2018) for relation extraction, CCF Law for event extraction, C3 (Sun et al. 2020) for machine reading comprehension, and IMDB (Maas et al. 2011) for text classification. Additionally, to determine if the Hum dataset adversely affects LLMs, we conduct zero-shot testing across seven dimensions (...) using a total of 28 datasets. |
| Dataset Splits | No | The paper mentions performing "zero-shot experiments" on five NLU datasets and "zero-shot testing" on 28 general capability evaluation datasets, stating "We employ the same experimental settings as in previous work" for the latter. However, it does not provide specific percentages, sample counts, or explicit descriptions of how the datasets (including the synthesized Hum dataset) were split into training, validation, or test sets. |
| Hardware Specification | Yes | The training is conducted using the Llama Factory (Zheng et al. 2024) framework, leveraging 32 H100 GPUs, 384 CPU cores, and 3.2TB of memory. |
| Software Dependencies | No | The training is conducted using the Llama Factory (Zheng et al. 2024) framework. No specific version number for Llama Factory or other key software components (e.g., Python, PyTorch, CUDA) is provided. |
| Experiment Setup | Yes | We consistently apply LoRA for this finetuning, with a LoRA rank and alpha both set at 64 and a dropout rate of 0.05. The batch size is established at 320, accompanied by a learning rate of 5e-5. The input length is configured to 1500 tokens, while the output length is capped at 500 tokens. We utilize an Adam optimizer with weight decay at a rate of 1e-4 for training. The learning rate warm-up proportion is set to 0.1, alongside a dropout rate of 0.1. Additionally, the temperature for adjusting next token probabilities is fixed at 0.2, with the topmost probable tokens summing to a probability of 0.95. |
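For readers attempting a re-run, the hyperparameters quoted in the Experiment Setup row can be collected into one place. The sketch below is illustrative only: the dictionary keys and the `scaled_lr` helper are our own naming, not from the paper, and the warm-up schedule is a common interpretation of a 0.1 warm-up proportion, not a confirmed detail of the authors' training code.

```python
# Hypothetical summary of the reported fine-tuning setup; key names are
# illustrative, not taken from the paper or the Llama Factory config format.
finetune_config = {
    "lora_rank": 64,           # LoRA rank
    "lora_alpha": 64,          # LoRA scaling alpha
    "lora_dropout": 0.05,      # dropout inside LoRA layers
    "batch_size": 320,
    "learning_rate": 5e-5,
    "max_input_tokens": 1500,
    "max_output_tokens": 500,
    "weight_decay": 1e-4,      # Adam optimizer with weight decay
    "warmup_proportion": 0.1,
    "dropout": 0.1,            # general dropout rate
    "temperature": 0.2,        # sampling temperature at generation time
    "top_p": 0.95,             # nucleus sampling threshold
}

def scaled_lr(config, step, total_steps):
    """Linear warm-up to the base learning rate, then constant.

    Assumes 'warm-up proportion 0.1' means the first 10% of steps;
    the paper does not specify the schedule shape after warm-up.
    """
    warmup_steps = int(config["warmup_proportion"] * total_steps)
    if step < warmup_steps:
        return config["learning_rate"] * (step + 1) / warmup_steps
    return config["learning_rate"]
```

Under these assumptions, a 100-step run would warm up over the first 10 steps and then hold the learning rate at 5e-5.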