Large Language Models Are Read/Write Policy-Makers for Simultaneous Generation

Authors: Shoutao Guo, Shaolei Zhang, Zhengrui Ma, Yang Feng

AAAI 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "Experiments on simultaneous translation and streaming automatic speech recognition tasks show that our method can achieve state-of-the-art performance utilizing the open-source LLMs and demonstrate practicality in real-world scenarios."
Researcher Affiliation | Academia | ¹Key Laboratory of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences (ICT/CAS); ²Key Laboratory of AI Safety, Chinese Academy of Sciences; ³University of Chinese Academy of Sciences, Beijing, China
Pseudocode | No | The paper describes the method using text and mathematical equations, but does not include a dedicated pseudocode block or algorithm section.
Open Source Code | Yes | Code: https://github.com/ictnlp/LSG
Open Datasets | Yes | WMT15 German-English (De-En): "We conduct SimulT2TT task on this dataset. Consistent with Ma et al. (2020b), we use the newstest2015 set as the test set." MuST-C English-German (En-De): "This dataset (Di Gangi et al. 2019) is collected from TED talks and we conduct the SimulT2TT task using its text data." CoVoST2 French-English (Fr-En): "We use this dataset (Wang, Wu, and Pino 2020) to conduct both SimulS2TT and streaming ASR tasks."
Dataset Splits | Yes | WMT15 German-English (De-En): "We conduct SimulT2TT task on this dataset. Consistent with Ma et al. (2020b), we use the newstest2015 set as the test set."
Hardware Specification | Yes | "Additionally, for the SimulS2TT task, we evaluate computation-aware latency on an NVIDIA RTX 3090 GPU, which assesses the latency of the systems in practical applications."
Software Dependencies | No | The paper mentions several LLMs and methods, such as Llama2-7B-chat, LoRA, Qwen-Audio, Wav2Vec2-large, Whisper-base, and SacreBLEU, but does not provide specific version numbers for underlying software or libraries such as Python, PyTorch, or CUDA.
Experiment Setup | Yes | "We set δ = 9.0 and α = 0.6 for the De-En task, δ = 7.5 and α = 0.6 for the En-De task, and δ = 7.0 and α = 0.5 for the Fr-En task. For different latency scenarios, we set [L, U] as [1, 4], [3, 4], [5, 6], and [7, 6], respectively."
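The reported hyperparameters can be collected into a small configuration table for anyone attempting to reproduce the runs. The sketch below only restates the values quoted above; the variable names (`CONFIGS`, `LATENCY_RANGES`, the `delta`/`alpha` keys) are illustrative assumptions, not identifiers from the authors' code.

```python
# Per-task hyperparameters as reported in the paper's experiment setup.
# Key and structure names are hypothetical; only the numbers come from the paper.
CONFIGS = {
    "De-En": {"delta": 9.0, "alpha": 0.6},  # WMT15 SimulT2TT
    "En-De": {"delta": 7.5, "alpha": 0.6},  # MuST-C SimulT2TT
    "Fr-En": {"delta": 7.0, "alpha": 0.5},  # CoVoST2 SimulS2TT / streaming ASR
}

# [L, U] ranges swept to cover the different latency scenarios.
LATENCY_RANGES = [(1, 4), (3, 4), (5, 6), (7, 6)]

for task, cfg in CONFIGS.items():
    print(f"{task}: delta={cfg['delta']}, alpha={cfg['alpha']}")
```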