Learning Diagrams: A Graphical Language for Compositional Training Regimes
Authors: Mason Lary, Richard Samuelson, Alexander Wilentz, Alina Zare, Matthew Klawonn, James Fairbanks
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By using more powerful architectures and replicating other portions of the model and training procedures, we were able to outperform the original paper's BLEU score Papineni et al. (2002) on the Flickr8k Hodosh et al. (2013) and Flickr30k Plummer et al. (2015) image captioning data sets. The results of our experiments are captured in Table 1. We see in Table 2 that the learning diagram produces results comparable to those of the official implementation. The performance is noted in Table 3. |
| Researcher Affiliation | Collaboration | Mason Lary (University at Buffalo); Richard Samuelson (University of Florida); Alexander Wilentz (Harvard University); Alina Zare (University of Florida); Matthew Klawonn (Air Force Research Lab Information Directorate); James P. Fairbanks (University of Florida) |
| Pseudocode | Yes | Listing 1: Key functionality for the NIC implementation. Users specify diagrams as Python or Julia data structures (Julia shown) similar to an edge list representation of the underlying graph and assign models to components of the diagram. |
| Open Source Code | Yes | We introduce a software library, DiagrammaticLearning.jl, that realizes the theory of learning diagrams to supply convenient operations for building and manipulating training setups, both before models are trained and after, when they may be used as components of other training setups. https://github.com/AlgebraicJulia/DiagrammaticLearning.jl |
| Open Datasets | Yes | By using more powerful architectures and replicating other portions of the model and training procedures, we were able to outperform the original paper's BLEU score Papineni et al. (2002) on the Flickr8k Hodosh et al. (2013) and Flickr30k Plummer et al. (2015) image captioning data sets. We see in Table 2 that the learning diagram produces results comparable to those of the official implementation [on] CIFAR-10. In the case of Tian et al. (2020), the data used comes from Tiered-ImageNet Ren et al. (2018), Mini-ImageNet Vinyals et al. (2016), FC100 Oreshkin et al. (2018), and CIFAR-FS Bertinetto et al. (2018). |
| Dataset Splits | No | The paper mentions that for few-shot learning, "few shot tasks are sampled at random from the respective test sets" and "N-way K-shot data set". It also refers to "meta-train and meta-test steps". However, it does not provide specific percentages or counts for how the primary datasets (Flickr8k, Flickr30k, CIFAR-10, Tiered Imagenet, Mini-Imagenet, FC100, Cifar FS) were partitioned into training, validation, and test sets for the general model training. The few-shot descriptions relate to task sampling, not global dataset splits for reproducibility of the main models. |
| Hardware Specification | No | The paper describes implementing models using PyTorch and Flux.jl, and mentions evaluating different CNN encoders (GoogLeNet, Resnet-50, ViT-B-16). However, it does not provide specific details about the hardware used, such as GPU models, CPU specifications, or memory, that would be required to reproduce the experiments. |
| Software Dependencies | Yes | We further implement learning diagrams in a library that allows users to build diagrams of PyTorch and Flux.jl models. The official PyTorch knowledge distillation introduction Chariton (2024) has a more detailed specification of the experiments than the original paper Hinton et al. (2015). (Reference Chariton (2024): PyTorch tutorials 2.4.0+cu121 documentation, 2024). |
| Experiment Setup | No | The paper describes re-implementing classic machine learning setups using their framework, stating goals like "recreate them" and "replicating other portions of the model and training procedures". It mentions using specific CNN encoders (GoogLeNet, ResNet-50, ViT-B-16) and referring to "official PyTorch knowledge distillation introduction" benchmarks. However, the main text does not explicitly provide concrete hyperparameters (e.g., learning rates, batch sizes, number of epochs, specific optimizers) used in their experiments, deferring to the original papers' procedures or benchmarks without stating the values. |
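The Pseudocode row above notes that users specify diagrams "as Python or Julia data structures similar to an edge list representation of the underlying graph and assign models to components of the diagram." A minimal Python sketch of that idea follows; the dict layout and the names `diagram`, `assign_models`, and `run` are illustrative assumptions, not DiagrammaticLearning.jl's actual interface.

```python
# Hypothetical sketch (not the DiagrammaticLearning.jl API): a training
# "diagram" as an edge list over named spaces, with a model assigned to
# each edge by name, mirroring the representation described in Listing 1.

diagram = {
    "spaces": ["image", "feature", "caption"],
    "edges": [
        ("image", "feature", "encoder"),   # (source, target, model name)
        ("feature", "caption", "decoder"),
    ],
}

def assign_models(diagram, models):
    """Attach a callable model to each edge of the diagram."""
    return [(src, tgt, models[name]) for src, tgt, name in diagram["edges"]]

def run(diagram_with_models, x):
    """Compose the edge models in order, as one forward pass."""
    for _, _, model in diagram_with_models:
        x = model(x)
    return x

# Toy stand-ins for the encoder/decoder components.
models = {"encoder": lambda x: x * 2, "decoder": lambda x: x + 1}
wired = assign_models(diagram, models)
print(run(wired, 3))  # 3 -> 6 -> 7
```

In the actual library, the models attached to each edge would be PyTorch or Flux.jl modules rather than plain lambdas, and the diagram structure also drives how losses over the components are combined during training.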