SimulPL: Aligning Human Preferences in Simultaneous Machine Translation

Authors: Donglei Yu, Yang Zhao, Jie Zhu, Yangyifan Xu, Yu Zhou, Chengqing Zong

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Experimental results indicate that SimulPL exhibits better alignment with human preferences across all latency levels in Zh→En, De→En and En→Zh SiMT tasks. Our data and code will be available at https://github.com/EurekaForNLP/SimulPL.
Researcher Affiliation | Academia | Donglei Yu (1,2), Yang Zhao (1,2), Jie Zhu (3), Yangyifan Xu (1,2), Yu Zhou (1,2), Chengqing Zong (1,2). (1) School of Artificial Intelligence, University of Chinese Academy of Sciences; (2) State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; (3) Graduate School of Translation and Interpretation, Beijing Foreign Studies University.
Pseudocode | Yes | Algorithm 1: Confidence-based Policy in Inference
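The paper's Algorithm 1 itself is not reproduced in this report. As a rough illustration of what a confidence-based inference policy for SiMT typically looks like, here is a minimal sketch of a confidence-thresholded read/write loop; all function names (`confidence_policy`, `model_step`, `toy_model`), the threshold value, and the toy model are hypothetical and not taken from the paper.

```python
def confidence_policy(model_step, source, threshold=0.9, max_len=20):
    """Generic confidence-based read/write loop for simultaneous MT.

    model_step(src_prefix, written) -> (token, confidence) is a stand-in
    for the SiMT model. At each step: WRITE the candidate token if the
    model is confident enough (or the source is exhausted), else READ
    one more source token. Illustrative only, not the paper's Algorithm 1.
    """
    read, written = 1, []               # begin after reading one source token
    while len(written) < max_len:
        token, conf = model_step(source[:read], written)
        if token == "<eos>":
            break
        if conf >= threshold or read == len(source):
            written.append(token)       # WRITE action
        else:
            read += 1                   # READ action
    return written


# Hypothetical toy model: confidently emits the upper-cased source token
# at the current target position once that position has been read.
def toy_model(src_prefix, written):
    i = len(written)
    if i >= 3:                          # toy target length
        return ("<eos>", 1.0)
    if len(src_prefix) > i:
        return (src_prefix[i].upper(), 0.95)
    return ("?", 0.5)


print(confidence_policy(toy_model, ["a", "b", "c"]))  # -> ['A', 'B', 'C']
```

With a lower threshold the policy writes earlier (lower latency, lower quality); with a higher threshold it reads more source first, which is the usual quality/latency trade-off in SiMT.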
Open Source Code | Yes | Our data and code will be available at https://github.com/EurekaForNLP/SimulPL.
Open Datasets | Yes | We validate our method on text-to-text SiMT tasks using our annotated datasets with human-preferred references. For the training data, we select subsets from three datasets, WMT15 De→En, WMT22 Zh→En, and MuST-C En→Zh, for annotation.
Dataset Splits | Yes | Table 1: Statistics of our constructed datasets. We present the reference-free COMET scores of our annotated target sentences with GPT-4/4o and the original target sentences.
Dataset | Train | Test
Zh→En | 13,491 | 2,000
De→En | 15,717 | 2,168
En→Zh | 19,967 | 2,841
Hardware Specification | No | The paper does not explicitly describe the hardware used for running its experiments, such as specific GPU/CPU models or processor types.
Software Dependencies | No | The paper mentions software such as Fairseq, Llama2-7B-chat, SimulEval, awesome-align, and Stanza, but does not provide version numbers for these components, which are required for reproducibility.
Experiment Setup | Yes | Table 7: Hyper-parameters of Transformer-based SiMT models in our experiments: encoder layers 6; encoder attention heads 8; encoder embed dim 512; encoder ffn embed dim 1024; decoder layers 6; decoder attention heads 8; decoder embed dim 512; decoder ffn embed dim 1024; dropout 0.1; optimizer adam; adam-β (0.9, 0.98); clip-norm 1e-7; lr 5e-4; lr scheduler inverse sqrt; warmup-updates 4000; warmup-init-lr 1e-7; weight decay 0.0001; label-smoothing 0.1; max tokens 8192. Table 8: Hyper-parameters of SimulPL in our experiments: LoRA (lora_r 64, lora_alpha 16, lora_dropout 0.1); batch size 64, micro batch size 32, learning rate 2e-4, training steps 1000, α 0.1, β 0.1; batch size 64, micro batch size 16, learning rate 2e-6, training steps 400.
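Table 8 lists two groups of batch size / learning rate / step settings, which suggests two training phases. The sketch below consolidates those values into plain Python dicts; the key names, the two-phase labels (`PHASE_1`, `PHASE_2`), and the helper `grad_accum_steps` are assumptions for illustration, not names from the paper's code.

```python
# Hypothetical consolidation of the SimulPL Table 8 hyper-parameters.
# Key names and phase labels are illustrative, not from the paper's repo.
LORA_CFG = {"lora_r": 64, "lora_alpha": 16, "lora_dropout": 0.1}

PHASE_1 = {  # first settings group in Table 8 (alpha/beta included there)
    "batch_size": 64, "micro_batch_size": 32,
    "learning_rate": 2e-4, "training_steps": 1000,
    "alpha": 0.1, "beta": 0.1,
}

PHASE_2 = {  # second settings group in Table 8
    "batch_size": 64, "micro_batch_size": 16,
    "learning_rate": 2e-6, "training_steps": 400,
}

def grad_accum_steps(cfg):
    """Gradient-accumulation steps implied by batch vs. micro-batch size."""
    return cfg["batch_size"] // cfg["micro_batch_size"]

print(grad_accum_steps(PHASE_1), grad_accum_steps(PHASE_2))  # -> 2 4
```

One practical reading of these numbers: both phases keep the same effective batch size (64), but the second phase halves the micro-batch and cuts the learning rate by two orders of magnitude, consistent with a gentler fine-tuning stage.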