Function-to-Style Guidance of LLMs for Code Translation
Authors: Longhui Zhang, Bin Wang, Jiahao Wang, Xiaofeng Zhao, Min Zhang, Hao Yang, Meishan Zhang, Yu Li, Jing Li, Jun Yu, Min Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. Notably, our approach enables Qwen1.5B to outperform prompt-enhanced Qwen32B and GPT-4 on average across 20 diverse code translation scenarios. |
| Researcher Affiliation | Collaboration | 1Harbin Institute of Technology, Shenzhen, China. 2Huawei Translation Services Center, Beijing, China. 3Zhejiang University, Hangzhou, China. |
| Pseudocode | No | The paper describes its methodology in natural language and block diagrams (Figure 2) but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing code or a link to a source-code repository for the described methodology. |
| Open Datasets | Yes | Experiments on both our new benchmark and existing datasets demonstrate that our approach significantly improves code translation performance. ... We further evaluate F2STRANS on xCodeEval (Khan et al., 2024), as shown in Table 5. ... The latest data for the CodeNet benchmark comes from 2020 (Puri et al., 2021). |
| Dataset Splits | No | In the function-oriented training, we construct approximately 5,000 code pairs for each translation scenario, such as translating from C++ to Python, with a corresponding scale of 10,000 in the style-oriented training. ... The paper does not provide specific train/test/validation splits for the datasets used in evaluation. |
| Hardware Specification | Yes | All our experiments are carried out on a machine equipped with eight NVIDIA A800-SXM4-80GB GPUs. |
| Software Dependencies | No | The paper mentions using LLMs (Qwen, GPT-4) and general concepts like Instruction Fine-tuning, but does not provide specific version numbers for any software libraries, frameworks, or environments used for implementation or experimentation (e.g., Python, PyTorch, CUDA versions). |
| Experiment Setup | Yes | In the function-oriented guidance, we set the maximum algorithmic consistency label K in Eq. 1 to 5. In the style-oriented guidance, we set both the numbers of positive translations T+ and negative translations T−, namely m and n, to 10, with the value of α in negative translation collection construction set to 0.8 and the trade-off hyperparameter β in Eq. 5 fixed at 0.6. ... Throughout both training stages, we maintain consistent hyperparameters, employing 2 epochs and a learning rate of 1×10⁻⁵. During inference, we set the temperature of the LLMs to 0.7. |
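For reference, the hyperparameters quoted in the Experiment Setup row can be collected into a single configuration sketch. The key names below are our own labels for readability, not identifiers from the paper; only the values come from the quoted text.

```python
# Hypothetical consolidation of the reported F2STRANS hyperparameters.
# Key names are illustrative; values are taken from the paper's quoted setup.
F2STRANS_CONFIG = {
    # Function-oriented guidance
    "max_consistency_label_K": 5,       # K in Eq. 1
    # Style-oriented guidance
    "num_positive_translations_m": 10,  # size of T+
    "num_negative_translations_n": 10,  # size of T-
    "alpha_negative_collection": 0.8,   # α in negative-collection construction
    "beta_tradeoff": 0.6,               # β in Eq. 5
    # Shared across both training stages
    "epochs": 2,
    "learning_rate": 1e-5,
    # Inference
    "temperature": 0.7,
}
```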