LOGO — Long cOntext aliGnment via efficient preference Optimization
Authors: Zecheng Tang, Zechen Sun, Juntao Li, Qiaoming Zhu, Min Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By training with only 0.3B tokens of data on a single 8×A800 GPU machine for 16 hours, LOGO allows the Llama-3-8B-Instruct-80K model to achieve performance comparable to GPT-4 in real-world long-context tasks... We assess the LOGO training strategy across real-world long-context tasks and the synthetic retrieval task. Additionally, to explore the impact of LOGO training on short-context tasks, we evaluate models on MMLU (Hendrycks et al., 2020) and TruthfulQA (Lin et al., 2021). |
| Researcher Affiliation | Academia | 1School of Computer Science and Technology, Soochow University 2Key Laboratory of Data Intelligence and Advanced Computing, Soochow University. Correspondence to: Juntao Li <EMAIL>. |
| Pseudocode | No | The paper describes the methodology in prose and diagrams (e.g., Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not contain an explicit statement about releasing the source code for the methodology described, nor does it provide a link to a code repository. |
| Open Datasets | Yes | Training Dataset Construction We construct the synthetic datasets based on two corpora: (1) 4,000 instances sampled from long-llm-data (Zhang et al., 2024b)...; (2) 2,000 instances sampled from RedPajama (Computer, 2023) to mitigate forgetting... The reference list includes: Computer, T. RedPajama: an open dataset for training large language models, 2023. URL https://github.com/togethercomputer/RedPajama-Data. |
| Dataset Splits | No | We construct the synthetic datasets based on two corpora... Finally, we have a total number of 12,000 training samples, with a total data size of approximately 12,000 × 512 × 16 × 3 ≈ 0.3B tokens. While Appendix F states that 'The best model checkpoint is then selected based on performance on the validation set.', the specifics of this validation split (its size and how it was created) are not provided for the constructed dataset. |
| Hardware Specification | Yes | All the experiments are conducted on an 8×A800 (80GB) GPU machine, and all the training experiments are completed within 16 hours. |
| Software Dependencies | No | The paper states 'We adopt DeepSpeed ZeRO-3 (Aminabadi et al., 2022).' and 'We use the spaCy model, a named entity recognition (NER) model that can identify all the entities within a context, as the evaluator Eval( ).' Neither DeepSpeed nor spaCy is given with a specific version number. |
| Experiment Setup | Yes | We set M = 2 in Eq. 4... We set λ = 0.1 in Eq. 3 and search the hyper-parameters of Eq. 4 following (Meng et al., 2024) for different models, where β = 10, γ = 3 for the Llama3-8B-based model, β = 2.5, γ = 0.25 for the Mistral-7B-Instruct-v0.2-based model, and β = 3, γ = 0.6 for the Llama-2-7B-based model. More training details are in Appendix F. |
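The β and γ values in the Experiment Setup row are searched following SimPO (Meng et al., 2024), where β scales a length-normalized log-probability reward and γ is a target reward margin. The paper's own Eq. 3 and Eq. 4 are not reproduced here, so the sketch below shows only the generic SimPO-style preference loss those hyper-parameters parameterize; the function name and the use of average (per-token) log-probabilities are assumptions, not the paper's exact LOGO objective.

```python
import math

def simpo_style_loss(avg_logp_chosen: float,
                     avg_logp_rejected: float,
                     beta: float = 10.0,
                     gamma: float = 3.0) -> float:
    """SimPO-style preference loss (Meng et al., 2024).

    avg_logp_*: average per-token log-probability the policy assigns to the
    chosen / rejected response. beta scales the length-normalized reward;
    gamma is the target reward margin. Defaults are the Llama3-8B settings
    quoted in the table above. Illustrative sketch only.
    """
    # reward gap between chosen and rejected, minus the required margin
    margin = beta * (avg_logp_chosen - avg_logp_rejected) - gamma
    # negative log-sigmoid of the margin: -log σ(margin) = log(1 + e^{-margin})
    return math.log1p(math.exp(-margin))

# A clearly preferred chosen response yields a near-zero loss; a narrow
# preference gap that fails to clear the margin γ is penalized heavily.
loss_wide = simpo_style_loss(-1.0, -2.0)   # gap 1.0 → margin 7
loss_narrow = simpo_style_loss(-1.9, -2.0) # gap 0.1 → margin -2
```

Under this formulation, raising γ (as in the Llama3-8B setting, γ = 3) demands a larger reward gap before the loss vanishes, while β controls how sharply the loss responds to that gap.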