Cross-lingual Transfer in Programming Languages: An Extensive Empirical Study

Authors: Razan Baltaji, Saurabh Pujar, Martin Hirzel, Louis Mandel, Luca Buratti, Lav R. Varshney

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Addressing this gap, we investigate the potential of transfer learning to enhance LLM performance on low-resource programming languages by leveraging data from high-resource counterparts. Our extensive empirical study evaluates transferability across 10 to 41 programming languages and five key tasks: code generation, clone detection, code repair, solution domain classification, and error detection. Additionally, we develop a performance prediction model to guess the best source languages for a given target and task, and analyze the features that influence transfer performance. We further replicate a representative subset of experiments with a larger model to test the generalizability of our conclusions to contemporary large-scale LLMs.
Researcher Affiliation | Collaboration | Razan Baltaji (EMAIL), Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign; Saurabh Pujar (EMAIL); Louis Mandel (EMAIL); Martin Hirzel (EMAIL); Luca Buratti (EMAIL), IBM Research; Lav R. Varshney (EMAIL), Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign
Pseudocode | No | The paper describes its methodology using textual explanations and flow diagrams (e.g., Figure 1 'Overview') but does not contain explicitly labeled pseudocode or algorithm blocks.
Open Source Code | Yes | Our code is available at https://github.com/baltaci-r/XPL.
Open Datasets | Yes | The experiments are based on the publicly available CodeT5-base (220M parameters) model (Wang et al., 2021) and the open-sourced datasets CodeNet (Puri et al., 2021) and xCodeEval (Khan et al., 2023).
Dataset Splits | Yes | Data Splits: Train and test splits for Error Detection and Solution Domain Classification are provided from the original dataset as described in the dataset statistics from xCodeEval (Khan et al., 2023). For Code Repair, 50,000 training examples and 1,000 test examples are synthetically generated for each language. For Clone Detection, a distinct set of problems is used for train and test splits, including languages with a minimum of 450 test examples. Table 1 shows the number of examples for Clone Detection for each language. In the few-shot prompting experiment, we conduct a cross-lingual evaluation across all language pairs, using nearly 1,000 examples for each.
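The clone-detection split keeps the problem sets of train and test disjoint, so the model is never tested on clones of problems it saw during training. A minimal sketch of such a problem-level split (the `problem_id` field name is an assumption for illustration, not taken from the paper's released code):

```python
import random

def split_by_problem(examples, test_fraction=0.2, seed=0):
    """Split examples so train and test share no problems.

    `examples` is a list of dicts carrying a "problem_id" key
    (an assumed field name); all examples of a given problem land
    entirely in train or entirely in test.
    """
    problems = sorted({ex["problem_id"] for ex in examples})
    rng = random.Random(seed)
    rng.shuffle(problems)
    n_test = max(1, int(len(problems) * test_fraction))
    test_problems = set(problems[:n_test])
    train = [ex for ex in examples if ex["problem_id"] not in test_problems]
    test = [ex for ex in examples if ex["problem_id"] in test_problems]
    return train, test
```

Splitting on problem IDs rather than individual examples is what prevents near-duplicate solutions of the same problem from leaking across the boundary.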
Hardware Specification | Yes | For each source language, we finetune the model using one A100 GPU for 6 to 30 hours depending on the task. For the few-shot prompting experiments, we generated responses with the Llama 3.3 70B-Instruct model via Together AI (2025) using a temperature of 0.8.
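The few-shot setup pairs in-context examples with a query and samples at temperature 0.8. A minimal sketch of assembling such a request payload, assuming an OpenAI-style chat-completions schema as used by Together's API; the model identifier and field names are assumptions, not taken from the paper:

```python
def build_fewshot_request(shots, query,
                          model="meta-llama/Llama-3.3-70B-Instruct-Turbo"):
    """Assemble a chat-completions payload for few-shot prompting.

    `shots` is a list of (prompt, completion) pairs shown as prior
    user/assistant turns; `query` is the new input to complete.
    The model identifier is an assumed Together hub name.
    """
    messages = []
    for prompt, completion in shots:
        messages.append({"role": "user", "content": prompt})
        messages.append({"role": "assistant", "content": completion})
    messages.append({"role": "user", "content": query})
    return {
        "model": model,
        "messages": messages,
        "temperature": 0.8,  # sampling temperature reported in the study
    }
```

Encoding shots as alternating user/assistant turns lets the same payload builder serve any of the cross-lingual language pairs.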
Software Dependencies | Yes | Our finetuning experiments are based on CodeT5-base (220M parameters) using the Hugging Face transformers library (Wolf et al., 2020).
Experiment Setup | Yes | We keep the same hyperparameters for all the experiments: learning rate of 2e-5, batch size of 8, and the number of epochs set to 20.
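Since one hyperparameter set is shared across every task, it can be pinned in a single config. A sketch collecting the reported values, with keys mirroring Hugging Face `TrainingArguments` names; that key mapping and the `Salesforce/codet5-base` hub identifier are assumptions, not quoted from the paper:

```python
# Reported fine-tuning hyperparameters, held fixed across all tasks.
# Keys follow Hugging Face TrainingArguments naming (an assumption);
# the checkpoint identifier is an assumed hub name for CodeT5-base.
FINETUNE_CONFIG = {
    "model_name_or_path": "Salesforce/codet5-base",  # 220M parameters
    "learning_rate": 2e-5,
    "per_device_train_batch_size": 8,
    "num_train_epochs": 20,
}
```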