Leveraging Automated Unit Tests for Unsupervised Code Translation
Authors: Baptiste Rozière, Jie Zhang, François Charton, Mark Harman, Gabriel Synnaeve, Guillaume Lample
ICLR 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | 4 EXPERIMENTS |
| Researcher Affiliation | Collaboration | Baptiste Rozière Facebook AI Research Paris-Dauphine University EMAIL Jie M. Zhang University College London EMAIL François Charton Facebook AI Research EMAIL Mark Harman Facebook EMAIL Gabriel Synnaeve Facebook AI Research EMAIL Guillaume Lample Facebook AI Research EMAIL |
| Pseudocode | No | The paper describes its methods in narrative text and with diagrams but does not include any explicitly labeled pseudocode or algorithm blocks. |
| Open Source Code | Yes | We submit our code with this submission, along with a README file detailing clear steps to reproduce our results, including a script to set up a suitable environment. We will open-source our code and release our trained models. |
| Open Datasets | Yes | Datasets. As TransCoder and DOBF, we use the GitHub public dataset available on Google BigQuery, filtered to keep only projects with open-source licenses. |
| Dataset Splits | Yes | We evaluate our models on the full validation and test sets of TransCoder. |
| Hardware Specification | Yes | Our models were trained using standard hardware (Tesla V100 GPUs) and libraries (e.g. PyTorch, CUDA) for machine-learning research. |
| Software Dependencies | No | The paper mentions 'PyTorch, CUDA' as libraries used but does not provide specific version numbers for these software dependencies. |
| Experiment Setup | Yes | For the online version, we set a cache warm-up parameter to ensure that we always generate new parallel examples if there are fewer than 500 examples in the cache for any language pair. Otherwise, we sample from the cache with probability 0.5, or create new parallel functions to add to the cache. Sampled elements are removed from the cache with probability 0.3, so that each element we create is trained on about 4 times on average before being removed from the cache. We initialize the cache with parallel examples created offline. During beam decoding, we compute the score of generated sequences by dividing the sum of token log-probabilities by l^α, where l is the sequence length. We found that taking α = 0.5 (and penalizing long generations) leads to the best performance on the validation set. |
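The experiment-setup row describes two concrete mechanisms: a self-training cache (warm-up threshold of 500, sample-from-cache probability 0.5, removal probability 0.3) and a length-penalized beam score (sum of token log-probabilities divided by l^α with α = 0.5). The sketch below is a minimal illustration of that description, not the authors' released code; the function names, cache representation, and `generate_pair` callback are assumptions.

```python
import random

# Constants taken from the setup described above.
CACHE_WARMUP = 500         # always generate new pairs while the cache holds fewer than this
P_SAMPLE_FROM_CACHE = 0.5  # otherwise, sample from the cache with this probability
P_REMOVE = 0.3             # a sampled element is dropped with this probability

def next_training_example(cache, generate_pair, rng=random):
    """Return one parallel example, maintaining the cache as described.

    `cache` is a mutable list of parallel examples; `generate_pair` is a
    hypothetical callback that creates a new test-validated parallel pair.
    """
    if len(cache) < CACHE_WARMUP or rng.random() >= P_SAMPLE_FROM_CACHE:
        pair = generate_pair()           # create a new parallel example
        cache.append(pair)               # and add it to the cache
        return pair
    idx = rng.randrange(len(cache))      # sample an existing cached example
    pair = cache[idx]
    if rng.random() < P_REMOVE:          # remove it with probability 0.3, so each
        cache.pop(idx)                   # element is trained on ~4 times on average
    return pair

def beam_score(token_logprobs, alpha=0.5):
    """Length-penalized beam score: sum of log-probs divided by l**alpha."""
    l = len(token_logprobs)
    return sum(token_logprobs) / (l ** alpha)
```

With α = 0.5 the divisor grows sub-linearly in the sequence length, so long generations are penalized less aggressively than plain per-token averaging (α = 1) but more than unnormalized scoring (α = 0).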