DOLPHIN: A Programmable Framework for Scalable Neurosymbolic Learning

Authors: Aaditya Naik, Jason Liu, Claire Wang, Amish Sethi, Saikat Dutta, Mayur Naik, Eric Wong

ICML 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We evaluate DOLPHIN on a diverse set of neurosymbolic tasks involving text, image, and video, using rich reasoning features like recursion and black-box Python functions. On simpler problems, neurosymbolic programs written using DOLPHIN match the accuracy of state-of-the-art methods, while achieving these results 47x, 62x, 8x, and 1.7x faster than baselines like Scallop, sampling-based frameworks like ISED and IndeCateR+, and solely GPU-based methods like LTN, respectively. We also observe that DOLPHIN efficiently scales to more complex benchmarks and larger datasets, achieving state-of-the-art accuracies. While baselines fail to converge on 5 out of 8 such benchmarks within 10 hours, DOLPHIN requires 5.5 hours in the worst case.
Researcher Affiliation Academia 1Department of Computer and Information Science, University of Pennsylvania 2Department of Computer Science, Cornell University. Correspondence to: Aaditya Naik <EMAIL>.
Pseudocode Yes Figure 2: DOLPHIN code for the MNIST Sum-N task.

class SumNNet(torch.nn.Module):
    def __init__(self):
        super(SumNNet, self).__init__()
        self.CNN = MNISTNet()

    def forward(self, imgs):
        d = range(10)
        D_res = Distribution(self.CNN(imgs[0]), d)
        for i in range(1, len(imgs)):
            D_i = Distribution(self.CNN(imgs[i]), d)
            D_res = apply(D_res, D_i, lambda x, y: x + y)
        return get_logits(D_res)
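To make the semantics of the code above concrete, here is a minimal, dependency-light sketch of what `apply(D_res, D_i, lambda x, y: x + y)` plausibly computes: marginalizing over all pairs of symbols from two independent distributions and accumulating probability mass on each result of the combining function. The function name `apply_sum` and the exhaustive-enumeration strategy are illustrative assumptions, not DOLPHIN's actual implementation (which operates on batched GPU tensors).

```python
import numpy as np

def apply_sum(p, q):
    """Combine two independent distributions over digits 0..9 into a
    distribution over their sum (0..18).

    This is a hypothetical sketch of the semantics of DOLPHIN's
    `apply` with `lambda x, y: x + y`: enumerate all symbol pairs
    (i, j) and add p[i] * q[j] to the probability of outcome i + j.
    """
    out = np.zeros(len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

# Example: two "certain" digits, 3 and 4, yield a sum of 7 with
# probability 1; total probability mass is preserved.
p = np.zeros(10); p[3] = 1.0
q = np.zeros(10); q[4] = 1.0
r = apply_sum(p, q)
```

Chaining this operation across N digit distributions, as the `forward` loop in Figure 2 does, yields the distribution over the N-digit sum whose logits are returned for training.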
Open Source Code Yes The code is published at https://github.com/Dolphin-NeSy/Dolphin.
Open Datasets Yes MNIST Sum-N. The MNIST Sum-N (or briefly, Sum-N) task from (De Smet et al., 2024) takes as inputs N handwritten digits from the MNIST dataset and returns their sum. CLUTRR. In this task from (Sinha et al., 2019), given some text containing information about several individuals and some of their relationships, the model must infer the relationship between two given individuals, which is not explicitly provided in the input.
Dataset Splits Yes Sum5's dataset consisted of 12000 train samples and 2000 test samples. Sum10's dataset consisted of 6000 train samples and 1000 test samples. Sum15's dataset consisted of 4000 train samples and 666 test samples. Length 7's dataset consisted of 9600 samples for training and 2400 samples for testing. Length 15's consisted of 24000 training samples and 6000 testing samples. Length 19's consisted of 32000 training samples and 8000 testing samples. Each task's dataset consisted of 539459 images for training and 59940 images for testing. The length of the training dataset for CLUTRR (Small) was 11,093 and that of the test set was 1146. The training set for CLUTRR (Medium) contained 15,083 samples and the test set contained 1048 samples. From the full Mugen dataset, we sample a training set of 5000 examples for Mugen (Medium), and from that set, we sample a training set of 1000 for Mugen (Small). Both Small and Medium are evaluated on a fixed holdout set of 1000 samples.
Hardware Specification Yes All experiments, except CLUTRR, were run on machines with two 20-core Intel Xeon Gold 6248 CPUs, four NVIDIA GeForce RTX 2080 Ti (11 GB) GPUs, and 768 GB RAM. Since CLUTRR demands more GPU memory due to running the RoBERTa model with a standard batch size of 16, all programs for this benchmark were run with an NVIDIA A100 40GB GPU.
Software Dependencies No The paper mentions PyTorch as the deep learning framework used and specific models like RoBERTa-base, DistilBERT, and S3D, but does not provide version numbers for PyTorch or other libraries, which is required for reproducibility.
Experiment Setup Yes Each of the MNIST Sum-N tasks had a training batch size of 64 samples, a learning rate of 0.001, and a top-k value of 1. We trained each task with a training batch size of 64 samples. The learning rate was 0.0001, the global sampling value was 7, and the top-k value was 3. For each of the Path Finder tasks, we used a training batch size of 64 samples, a learning rate of 0.0001, and a top-k value of 1. For each CLUTRR task, we used a single A100 GPU (40 GB), with a learning rate of 0.00001 and a batch size of 16. For each Mugen task, we used a batch size of 3 and a learning rate of 0.0001. We trained and evaluated for up to 100 epochs.