Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study
Authors: Razan Baltaji, Saurabh Pujar, Martin Hirzel, Louis Mandel, Luca Buratti, Lav R. Varshney
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Addressing this gap, we investigate the potential of transfer learning to enhance LLM performance on low-resource programming languages by leveraging data from high-resource counterparts. Our extensive empirical study evaluates transferability across 10 to 41 programming languages and five key tasks: code generation, clone detection, code repair, solution domain classification, and error detection. Additionally, we develop a performance prediction model to predict the best source languages for a given target and task, and analyze the features that influence transfer performance. We further replicate a representative subset of experiments with a larger model to test the generalizability of our conclusions to contemporary large-scale LLMs. |
| Researcher Affiliation | Collaboration | Razan Baltaji and Lav R. Varshney: Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. Saurabh Pujar, Louis Mandel, Martin Hirzel, and Luca Buratti: IBM Research. |
| Pseudocode | No | The paper describes its methodology using textual explanations and flow diagrams (e.g., Figure 1 'Overview') but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/baltaci-r/XPL. |
| Open Datasets | Yes | The experiments are based on the publicly available CodeT5-base (220M parameters) model (Wang et al., 2021) and the open-sourced datasets CodeNet (Puri et al., 2021) and xCodeEval (Khan et al., 2023). |
| Dataset Splits | Yes | Data Splits: Train and test splits for Error Detection and Solution Domain Classification are provided from the original dataset as described in the dataset statistics from xCodeEval (Khan et al., 2023). For Code Repair, 50,000 training examples and 1,000 test examples are synthetically generated for each language. For Clone Detection, a distinct set of problems is used for train and test splits, including languages with a minimum of 450 test examples. Table 1 shows the number of examples for Clone Detection for each language. In the few-shot prompting experiment, we conduct a cross-lingual evaluation across all language pairs, using nearly 1,000 examples for each. |
| Hardware Specification | Yes | For each source language, we finetune the model using one A100 GPU for 6 to 30 hours depending on the task. For the few-shot prompting experiments, we generated responses with the Llama 3.3 70B-Instruct model via the Together AI platform (Together AI, 2025), using a temperature of 0.8. |
| Software Dependencies | Yes | Our finetuning experiments are based on CodeT5-base (220M parameters) using the Hugging Face transformers library (Wolf et al., 2020). |
| Experiment Setup | Yes | We keep the same hyperparameters for all experiments: learning rate of 2e-5, batch size of 8, and 20 epochs. |
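The finetuning setup reported above (learning rate 2e-5, batch size 8, 20 epochs) can be captured in a minimal configuration sketch. The `FinetuneConfig` class and its `total_steps` helper are illustrative, not from the paper's released code; the step arithmetic assumes the Code Repair split of 50,000 training examples per language.

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    """Hyperparameters held fixed across all experiments, per the report."""
    learning_rate: float = 2e-5
    batch_size: int = 8
    epochs: int = 20

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps implied by the config for a given training set size."""
        steps_per_epoch = -(-num_examples // self.batch_size)  # ceiling division
        return steps_per_epoch * self.epochs

cfg = FinetuneConfig()
# Code Repair: 50,000 examples -> 6,250 steps/epoch x 20 epochs
print(cfg.total_steps(50_000))  # -> 125000
```

In a real run these values would be passed to the Hugging Face `Trainer` via `TrainingArguments`; the sketch only makes the stated configuration explicit and checkable.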
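The cross-lingual few-shot prompting experiment (solved examples in one language, a task posed in another) can be sketched as a simple prompt builder. The `build_fewshot_prompt` function, its field names, and the prompt wording are hypothetical assumptions for illustration, not the paper's actual prompts.

```python
def build_fewshot_prompt(source_examples, target_problem, source_lang, target_lang):
    """Assemble a prompt that shows solved examples in a source language
    and asks the model to solve the final task in the target language."""
    parts = [
        f"You are given solved {source_lang} examples. "
        f"Solve the final task in {target_lang}.\n"
    ]
    for i, ex in enumerate(source_examples, 1):
        parts.append(
            f"### Example {i} ({source_lang})\n"
            f"Problem: {ex['problem']}\n"
            f"Solution:\n{ex['solution']}\n"
        )
    parts.append(f"### Task ({target_lang})\nProblem: {target_problem}\nSolution:")
    return "\n".join(parts)

# One-shot transfer from Python to Rust, with a toy example.
demo = build_fewshot_prompt(
    [{"problem": "Sum two integers.",
      "solution": "print(sum(map(int, input().split())))"}],
    "Reverse a string.",
    "Python",
    "Rust",
)
print(demo)
```

Under the reported setup, a prompt like this would be sent to Llama 3.3 70B-Instruct at temperature 0.8, once per source-target language pair.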