DelTA: An Online Document-Level Translation Agent Based on Multi-Level Memory
Authors: Yutong Wang, Jiali Zeng, Xuebo Liu, Derek Wong, Fandong Meng, Jie Zhou, Min Zhang
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results indicate that DELTA significantly outperforms strong baselines in terms of translation consistency and quality across four open/closed-source LLMs and two representative document translation datasets, achieving an increase in consistency scores by up to 4.58 percentage points and in COMET scores by up to 3.16 points on average. |
| Researcher Affiliation | Collaboration | ¹Institute of Computing and Intelligence, Harbin Institute of Technology, Shenzhen, China ²Pattern Recognition Center, WeChat AI, Tencent Inc., China ³NLP2CT Lab, Department of Computer and Information Science, University of Macau |
| Pseudocode | Yes | The main framework of DELTA is illustrated in Figure 1, the algorithm of DELTA is detailed in Algorithm 1, and the prompts used for each module are given in Appendix C. |
| Open Source Code | Yes | The code and data of our approach are released at https://github.com/YutongWang1216/DocMTAgent. |
| Open Datasets | Yes | We conduct our experiments on the two test sets. The first is the tst2017 test sets from the IWSLT2017 translation task (Akiba et al., 2004), which consists of parallel documents sourced from TED talks, covering 12 language pairs. ... The second is Guofeng Webnovel (Wang et al., 2023c; 2024b), a high-quality and discourse-level corpus of web fiction. |
| Dataset Splits | Yes | We conduct our experiments on the two test sets. The first is the tst2017 test sets from the IWSLT2017 translation task... The second is Guofeng Webnovel... We conduct our experiments on the Guofeng V1 TEST 2 set in the Zh→En direction. |
| Hardware Specification | Yes | As shown in Figure 3, we compared the memory usage by utilizing Qwen2-72B-Instruct to translate a document in En→Zh on a device with 2 NVIDIA A800 80GB GPUs. |
| Software Dependencies | Yes | In this work, we utilize two versions of GPT models, GPT-3.5-Turbo-0125 and GPT-4o-mini, as our base models. ... We also introduce the open-source Qwen2-7B-Instruct and Qwen2-72B-Instruct in our experiments. ... We utilize two neural metrics to assess the quality of document translation. The first is the sentence-level COMET (sCOMET) score, for which we utilize the model Unbabel/wmt22-comet-da to obtain the scores. The second metric is the document-level COMET (dCOMET) score proposed by Vernikos et al. (2022), for which we use wmt21-comet-qe-mqm to derive reference-free scores. |
| Experiment Setup | Yes | The max new tokens is set to 2048 and other hyper-parameters remain default. The updating window of Bilingual summary m and length of Long-Term Memory l are set to 20. The number of retrieved relative sentences from Long-Term Memory n is set to 2. The length of Short-Term Memory k is set to 3. |
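The hyper-parameters in the last row (Short-Term Memory length k=3, Long-Term Memory length l=20, n=2 retrieved sentences) can be illustrated with a minimal sketch of a two-level translation memory. This is a hypothetical implementation for clarity only: the class name, the deque-based storage, and the word-overlap retrieval heuristic are assumptions, not the paper's actual mechanism.

```python
from collections import deque

class TranslationMemory:
    """Illustrative two-level memory with the paper's reported sizes
    (k=3, l=20, n=2). The retrieval heuristic below is a stand-in."""

    def __init__(self, k=3, l=20, n=2):
        self.short_term = deque(maxlen=k)  # last k source/target pairs, always in context
        self.long_term = deque(maxlen=l)   # last l pairs available for retrieval
        self.n = n                         # number of pairs retrieved per query

    def add(self, src, tgt):
        # Every translated sentence pair enters both memory levels.
        self.short_term.append((src, tgt))
        self.long_term.append((src, tgt))

    def retrieve(self, query):
        # Rank long-term entries by word overlap with the query sentence
        # and return the top-n most similar pairs (a simple proxy for
        # whatever similarity measure the agent actually uses).
        q_words = set(query.lower().split())
        def overlap(pair):
            return len(set(pair[0].lower().split()) & q_words)
        return sorted(self.long_term, key=overlap, reverse=True)[:self.n]
```

With these defaults, only the three most recent pairs stay in the short-term window, while older pairs remain retrievable from the long-term store until it exceeds 20 entries.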