A Pattern Language for Machine Learning Tasks
Authors: Benjamin Rodatz, Ian Fan, Tuomas Laakkonen, Neil John Ortega, Thomas Hoffmann, Vincent Wang
TMLR 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | As preliminary experimental validation of our theoretical framework, we exhibit and implement a novel manipulation task that minimally edits input data to have a desired attribute. Our model-agnostic approach achieves this end-to-end, and without the need for custom architectures, adversarial training, random sampling, or interventions on the data, hence enabling capable, small-scale, and training-stable models. |
| Researcher Affiliation | Collaboration | Benjamin Rodatz, Compositional Intelligence, Quantinuum & University of Oxford; Ian Fan, Jump Trading, London; Tuomas Laakkonen, MIT; Neil John Ortega, Compositional Intelligence, Quantinuum; Thomas Hoffmann, Artidis AG; Vincent Wang-Maścianica, HAILab, Philosophy, Oxford |
| Pseudocode | No | The paper describes tasks using mathematical formulations and string diagrams, and mentions an online tool that can generate code, but it does not contain explicit pseudocode or algorithm blocks within its text. |
| Open Source Code | Yes | This is available at https://patlang.vercel.app, and the source code can be found at https://github.com/tlaakkonen/patlang-editor. |
| Open Datasets | Yes | For the faces experiment, we use the Celeb Faces Attributes dataset (Liu et al., 2015), with an off-the-shelf data augmentation method called Trivial Augment (Müller & Hutter, 2021). We trained the manipulation task on the MNIST dataset, using the digit label as the property. We used the YELP dataset used by (Li et al., 2018), reusing the same train-dev-test split they used. |
| Dataset Splits | Yes | We used the YELP dataset used by (Li et al., 2018), reusing the same train-dev-test split they used. It consists of 270K positive and 180K negative sentences for the training set, 2000 sentences each for the dev set, and 500 sentences each for the test set. |
| Hardware Specification | No | The paper mentions "Training is performed on a GPU" in Appendix C.1 but does not specify any particular GPU model, CPU, or other detailed hardware specifications for any of the experiments. |
| Software Dependencies | No | For the Get of the manipulation task, we used the PyTorch version of the pretrained Transformer by Hugging Face, which uses the OpenAI GPT model pretrained by (Radford & Narasimhan, 2018) on the BookCorpus dataset... While PyTorch and Hugging Face are mentioned, no specific version numbers are provided for these or any other software dependencies. |
| Experiment Setup | Yes | Training is performed on a GPU, with a batch size of 64, learning rate of 1×10⁻⁴, weight decay of 1×10⁻², and gradient clipping at 1. The model trains for 100,000 steps, logging every 10,000 steps. Reported hyper-parameters: Steps 100,000; Batch Size 512; Optimiser AdamW; Learning Rate 10⁻³; Weight Decay 10⁻²; Gradient Clipping 1 (element-wise); Image Loss L2 + 0.25·L1; Discrete Value Loss binary cross-entropy; Continuous Value Loss mean squared error; Seed 0; Task Weights: autoencoding 100, Get-Put 1, Put-Put 1, Undo 10, Put-Get 10 (blue-circleness) / 1 (shape and colour), Classification 10 (blue-circleness) / 1 (shape and colour). |
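The reported setup combines two less common choices that are easy to misread when reproducing: element-wise gradient clipping (each gradient entry clamped to ±1, rather than rescaling by global norm) and a combined L2 + 0.25·L1 image loss. The sketch below illustrates both in plain, framework-free Python; all function names are illustrative and are not taken from the paper's released code.

```python
# Hyper-parameters as reported in the paper's appendix table.
HYPERPARAMS = {
    "steps": 100_000,
    "batch_size": 512,
    "optimiser": "AdamW",
    "learning_rate": 1e-3,
    "weight_decay": 1e-2,
    "grad_clip": 1.0,  # element-wise
    "seed": 0,
}

def clip_elementwise(grads, limit):
    """Element-wise gradient clipping: clamp each entry to [-limit, limit].
    (In PyTorch this would correspond to torch.nn.utils.clip_grad_value_.)"""
    return [max(-limit, min(limit, g)) for g in grads]

def image_loss(pred, target):
    """Reported image loss: L2 + 0.25 * L1, averaged over elements."""
    n = len(pred)
    l2 = sum((p - t) ** 2 for p, t in zip(pred, target)) / n
    l1 = sum(abs(p - t) for p, t in zip(pred, target)) / n
    return l2 + 0.25 * l1

print(clip_elementwise([2.5, -0.3, -7.0], HYPERPARAMS["grad_clip"]))  # [1.0, -0.3, -1.0]
print(image_loss([1.0, 0.0], [0.0, 0.0]))  # 0.625
```

Note that element-wise clipping changes the gradient's direction whenever any entry exceeds the limit, unlike norm-based clipping, which only rescales; the two are not interchangeable when reproducing training dynamics.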