Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study
Authors: Razan Baltaji, Saurabh Pujar, Martin Hirzel, Louis Mandel, Luca Buratti, Lav R. Varshney
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Addressing this gap, we investigate the potential of transfer learning to enhance LLM performance on low-resource programming languages by leveraging data from high-resource counterparts. Our extensive empirical study evaluates transferability across 10 to 41 programming languages and five key tasks: code generation, clone detection, code repair, solution domain classification, and error detection. Additionally, we develop a performance prediction model to predict the best source languages for a given target and task, and analyze the features that influence transfer performance. We further replicate a representative subset of experiments with a larger model to test the generalizability of our conclusions to contemporary large-scale LLMs. |
| Researcher Affiliation | Collaboration | Razan Baltaji and Lav R. Varshney: Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign. Saurabh Pujar, Louis Mandel, Martin Hirzel, and Luca Buratti: IBM Research. |
| Pseudocode | No | The paper describes its methodology using textual explanations and flow diagrams (e.g., Figure 1 'Overview') but does not contain explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | Our code is available at https://github.com/baltaci-r/XPL. |
| Open Datasets | Yes | The experiments are based on the publicly available CodeT5-base (220M parameters) model (Wang et al., 2021) and the open-sourced datasets CodeNet (Puri et al., 2021) and xCodeEval (Khan et al., 2023). |
| Dataset Splits | Yes | Data Splits: Train and test splits for Error Detection and Solution Domain Classification are provided from the original dataset as described in the dataset statistics from xCodeEval (Khan et al., 2023). For Code Repair, 50,000 training examples and 1,000 test examples are synthetically generated for each language. For Clone Detection, a distinct set of problems is used for train and test splits, including languages with a minimum of 450 test examples. Table 1 shows the number of examples for Clone Detection for each language. In the few-shot prompting experiment, we conduct a cross-lingual evaluation across all language pairs, using nearly 1,000 examples for each. |
| Hardware Specification | Yes | For each source language, we finetune the model using one A100 GPU for 6 to 30 hours depending on the task. For the few-shot prompting experiments, we generated responses with the Llama 3.3 70B-Instruct model via the Together AI platform (Together AI, 2025), using a temperature of 0.8. |
| Software Dependencies | Yes | Our finetuning experiments are based on CodeT5-base (220M parameters) using the Hugging Face transformers library (Wolf et al., 2020). |
| Experiment Setup | Yes | We keep the same hyperparameters for all experiments: learning rate of 2e-5, batch size of 8, and 20 epochs. |
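The finetuning setup reported above (learning rate 2e-5, batch size 8, 20 epochs) can be captured in a minimal configuration sketch. The `FinetuneConfig` class and its `total_steps` helper are illustrative, not from the paper's released code; the step arithmetic assumes the Code Repair split of 50,000 training examples per language.

```python
from dataclasses import dataclass

@dataclass
class FinetuneConfig:
    """Hyperparameters held fixed across all experiments, per the report."""
    learning_rate: float = 2e-5
    batch_size: int = 8
    epochs: int = 20

    def total_steps(self, num_examples: int) -> int:
        """Optimizer steps implied by the config for a given training set size."""
        steps_per_epoch = -(-num_examples // self.batch_size)  # ceiling division
        return steps_per_epoch * self.epochs

cfg = FinetuneConfig()
# Code Repair: 50,000 examples -> 6,250 steps/epoch x 20 epochs
print(cfg.total_steps(50_000))  # -> 125000
```

In a real run these values would be passed to the Hugging Face `Trainer` via `TrainingArguments`; the sketch only makes the stated configuration explicit and checkable.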
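The cross-lingual few-shot prompting experiment (solved examples in one language, a task posed in another) can be sketched as a simple prompt builder. The `build_fewshot_prompt` function, its field names, and the prompt wording are hypothetical assumptions for illustration, not the paper's actual prompts.

```python
def build_fewshot_prompt(source_examples, target_problem, source_lang, target_lang):
    """Assemble a prompt that shows solved examples in a source language
    and asks the model to solve the final task in the target language."""
    parts = [
        f"You are given solved {source_lang} examples. "
        f"Solve the final task in {target_lang}.\n"
    ]
    for i, ex in enumerate(source_examples, 1):
        parts.append(
            f"### Example {i} ({source_lang})\n"
            f"Problem: {ex['problem']}\n"
            f"Solution:\n{ex['solution']}\n"
        )
    parts.append(f"### Task ({target_lang})\nProblem: {target_problem}\nSolution:")
    return "\n".join(parts)

# One-shot transfer from Python to Rust, with a toy example.
demo = build_fewshot_prompt(
    [{"problem": "Sum two integers.",
      "solution": "print(sum(map(int, input().split())))"}],
    "Reverse a string.",
    "Python",
    "Rust",
)
print(demo)
```

Under the reported setup, a prompt like this would be sent to Llama 3.3 70B-Instruct at temperature 0.8, once per source-target language pair.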