LOGO — Long cOntext aliGnment via efficient preference Optimization
Authors: Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By training with only 0.3B tokens of data on a single 8×A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable to GPT-4 in real-world long-context tasks... We assess the LOGO training strategy across real-world long-context tasks and the synthetic retrieval task. Additionally, to explore the impact of LOGO training on short-context tasks, we evaluate models on MMLU (Hendrycks et al., 2020) and TruthfulQA (Lin et al., 2021). |
| Researcher Affiliation | Academia | 1School of Computer Science and Technology, Soochow University 2Key Laboratory of Data Intelligence and Advanced Computing, Soochow University. Correspondence to: Juntao Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Training Dataset Construction We construct the synthetic datasets based on two corpora: (1) 4,000 instances sampled from long-llm-data (Zhang et al., 2024b)...; (2) 2,000 instances sampled from RedPajama (Computer, 2023) to mitigate forgetting... The reference list includes: Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data. |
| Dataset Splits | No | We construct the synthetic datasets based on two corpora... Finally, we have a total number of 12,000 training samples, with a total data size of approximately 12,000 × 512 × 16 × 3 ≈ 0.3B tokens. While Appendix F states that 'The best model checkpoint is then selected based on performance on the validation set.', the specifics of this validation split (its size and how it was created) are not provided for the constructed dataset. |
| Hardware Specification | Yes | All the experiments are conducted on an 8×A800 (80GB) GPU machine, and all the training experiments are completed within 16 hours. |
| Software Dependencies | No | The paper states 'We adopt DeepSpeed ZeRO-3 (Aminabadi et al., 2022).' and 'We use the spaCy model, a named entity recognition (NER) model that can identify all the entities within a context, as the evaluator Eval( ).' Neither DeepSpeed nor spaCy is given with a specific version number. |
| Experiment Setup | Yes | We set M = 2 in Eq. 4... We set λ = 0.1 in Eq. 3 and search the hyper-parameters of Eq. 4 following (Meng et al., 2024) for different models, where β = 10, γ = 3 for the Llama3-8B-based model, β = 2.5, γ = 0.25 for the Mistral-7B-Instruct-v0.2-based model, and β = 3, γ = 0.6 for the Llama-2-7B-based model. More training details are in Appendix F. |
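The β and γ values in the Experiment Setup row are searched following SimPO (Meng et al., 2024), where β scales a length-normalized log-probability reward and γ is a target reward margin. The paper's own Eq. 3 and Eq. 4 are not reproduced here, so the sketch below shows only the generic SimPO-style preference loss those hyper-parameters parameterize; the function name and the use of average (per-token) log-probabilities are assumptions, not the paper's exact LOGO objective.

```python
import math

def simpo_style_loss(avg_logp_chosen: float,
                     avg_logp_rejected: float,
                     beta: float = 10.0,
                     gamma: float = 3.0) -> float:
    """SimPO-style preference loss (Meng et al., 2024).

    avg_logp_*: average per-token log-probability the policy assigns to the
    chosen / rejected response. beta scales the length-normalized reward;
    gamma is the target reward margin. Defaults are the Llama3-8B settings
    quoted in the table above. Illustrative sketch only.
    """
    # reward gap between chosen and rejected, minus the required margin
    margin = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    # negative log-sigmoid of the margin: -log σ(margin) = log(1 + e^{-margin})
    return math.log1p(math.exp(-margin))

# A clearly preferred chosen response yields a near-zero loss; a narrow
# preference gap that fails to clear the margin γ is penalized heavily.
loss_wide = simpo_style_loss(-1.0, -2.0)   # gap 1.0 → margin 7
loss_narrow = simpo_style_loss(-1.9, -2.0) # gap 0.1 → margin -2
```

Under this formulation, raising γ (as in the Llama3-8B setting, γ = 3) demands a larger reward gap before the loss vanishes, while β controls how sharply the loss responds to that gap.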