Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
Authors: Douwe Kiela, Stephen Clark
JAIR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on a standard similarity and relatedness dataset: the MEN test collection (Bruni et al., 2014). This dataset consists of concept pairs together with a human-annotated relatedness score. ... The results are reported in Table 5, according to whether they are (a) uni-modal representations obtained from a single modality or (b) multi-modal representations that have undergone multi-modal fusion. ... We explore categorization using an unsupervised clustering algorithm over the learned representations. These experiments provide two contributions. |
| Researcher Affiliation | Collaboration | Douwe Kiela (EMAIL), Facebook Artificial Intelligence Research, 770 Broadway, New York, NY 10003, USA; Stephen Clark (EMAIL), Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK |
| Pseudocode | Yes | The neural auditory embedding approach can be summarized to comprise the following steps: 1. (train step) Train a neural network classifier C on the dataset {(f(s), L_s) \| s ∈ S_C}, where S_C is a set of audio files, f is a pre-processing function and L_s is the label for that file. 2. (transfer step) For each label L_x (where L_x is not necessarily also a label in S_C, but may be): (a) Retrieve a set of audio files S_x. (b) For each file s ∈ S_x: i. Obtain the auditory representation q_s = g(f(s)), where g is the neural network C up to the penultimate pre-softmax layer. (c) (aggregation) The overall representation for label L_x is then obtained by aggregating the per-file representations, that is, we take the mean of the relevant auditory representations, i.e., r_x = (1/\|S_x\|) Σ_{s_i ∈ S_x} q_{s_i}. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using a script from a third-party (word2vec) to obtain a corpus, but not their own implementation. |
| Open Datasets | Yes | We evaluate on a standard similarity and relatedness dataset: the MEN test collection (Bruni et al., 2014). ... we use the online search engine Freesound (Font, Roma, & Serra, 2013) to obtain audio files. ... For the textual representations we use the continuous vector representations from the log-linear skip-gram model of Mikolov, Chen, Corrado, and Dean (2013). Specifically, 300-dimensional vector representations were obtained by training on a dump of the English Wikipedia plus newswire (8 billion words in total). The demo-train-big-model-v1.sh script from https://code.google.com/archive/p/word2vec was used to obtain this corpus. ... We use a standard AlexNet architecture (Krizhevsky et al., 2012), ... trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). ... Gygi, Kidd, and Watson (2007) performed an extensive psychological study of auditory perception and its relation to environmental sound categories. |
| Dataset Splits | Yes | We divide the data into a training and a validation set, sampling 75% for the former and taking the remainder for the latter. ... we do a five-way cross-validated comparison where we tune the α parameter (and in the tri-modal case also the β) on a held-out validation set of 20% of the data and obtain the Spearman ρs correlation score for the other 80%. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or memory amounts) used for running the experiments. It only mentions using convolutional neural networks, which implies computational resources, but no specifics are given. |
| Software Dependencies | No | The paper mentions architectural models like "AlexNet architecture (Krizhevsky et al., 2012)" and algorithms such as "mini-batch k-means (Sculley, 2010)" and the "log-linear skip-gram model of Mikolov, Chen, Corrado, and Dean (2013)", but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | We use standard stochastic gradient descent (SGD) optimization, with an initial learning rate of 0.01. The learning rate was set to degrade in a stepwise fashion by a factor of 0.1 every 1000 iterations, until convergence. ... we set an initial learning rate of 0.01 for the fully connected layers and 0.001 for the earlier convolutional layers and learn for up to 4000 iterations using SGD. ... We set k = 300 ... For each word, we retrieve the first 100 sound samples from Freesound with a maximum duration of 1 minute. |
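The aggregation step quoted in the Pseudocode row (mean-pooling the per-file penultimate-layer representations q_s into a single label representation r_x) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the toy vectors are our own.

```python
def aggregate_label_embedding(file_embeddings):
    """Mean-pool per-file embeddings q_s into one label representation r_x.

    file_embeddings: non-empty list of equal-length vectors, one per audio
    file s in S_x (each the penultimate-layer output g(f(s))).
    """
    n = len(file_embeddings)
    dim = len(file_embeddings[0])
    # r_x = (1/|S_x|) * sum over s_i in S_x of q_{s_i}, componentwise.
    return [sum(vec[d] for vec in file_embeddings) / n for d in range(dim)]


# Toy usage: three 2-dimensional "embeddings" retrieved for one label.
q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(aggregate_label_embedding(q))  # -> [3.0, 4.0]
```

In the paper the inputs would be real penultimate pre-softmax activations; here plain lists stand in for them to keep the sketch dependency-free.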
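The Experiment Setup row describes a stepwise learning-rate schedule: an initial rate of 0.01, decayed by a factor of 0.1 every 1000 iterations. A minimal sketch of that schedule (function name ours, framework-agnostic):

```python
def stepwise_lr(iteration, initial_lr=0.01, decay=0.1, step_size=1000):
    """Learning rate after `iteration` steps under stepwise decay.

    Multiplies the initial rate by `decay` once per completed `step_size`
    iterations, matching "degrade ... by a factor of 0.1 every 1000
    iterations" from the quoted setup.
    """
    return initial_lr * decay ** (iteration // step_size)


# Iterations 0-999 use 0.01, iterations 1000-1999 use 0.001, and so on.
for it in (0, 999, 1000, 2500):
    print(it, stepwise_lr(it))
```

The paper additionally uses a lower initial rate (0.001) for the earlier convolutional layers in the fine-tuning setting; that per-layer detail is omitted here.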