Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty, so scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Learning Neural Audio Embeddings for Grounding Semantics in Auditory Perception
Authors: Douwe Kiela, Stephen Clark
JAIR 2017
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We evaluate on a standard similarity and relatedness dataset: the MEN test collection (Bruni et al., 2014). This dataset consists of concept pairs together with a human-annotated relatedness score. ... The results are reported in Table 5, according to whether they are (a) uni-modal representations obtained from a single modality or (b) multi-modal representations that have undergone multi-modal fusion. ... We explore categorization using an unsupervised clustering algorithm over the learned representations. These experiments provide two contributions. |
| Researcher Affiliation | Collaboration | Douwe Kiela (EMAIL), Facebook Artificial Intelligence Research, 770 Broadway, New York, NY 10003, USA; Stephen Clark (EMAIL), Computer Laboratory, University of Cambridge, 15 JJ Thomson Avenue, Cambridge CB3 0FD, UK |
| Pseudocode | Yes | The neural auditory embedding approach can be summarized to comprise the following steps: 1. (train step) Train a neural network classifier C on the dataset {(f(s), L_s) \| s ∈ S_C}, where S_C is a set of audio files, f is a pre-processing function and L_s is the label for that file. 2. (transfer step) For each label L_x (where L_x is not necessarily also a label in S_C, but may be): (a) Retrieve a set of audio files S_x. (b) For each file s ∈ S_x: i. Obtain the auditory representation q_s = g(f(s)), where g is the neural network C up to the penultimate pre-softmax layer. (c) (aggregation) The overall representation for label L_x is then obtained by aggregating the per-file representations, that is, we take the mean of the relevant auditory representations, i.e., r_x = (1/\|S_x\|) Σ_{s_i ∈ S_x} q_{s_i}. |
| Open Source Code | No | The paper does not provide concrete access to source code for the methodology described. It mentions using a script from a third-party (word2vec) to obtain a corpus, but not their own implementation. |
| Open Datasets | Yes | We evaluate on a standard similarity and relatedness dataset: the MEN test collection (Bruni et al., 2014). ... we use the online search engine Freesound (Font, Roma, & Serra, 2013) to obtain audio files. ... For the textual representations we use the continuous vector representations from the log-linear skip-gram model of Mikolov, Chen, Corrado, and Dean (2013). Specifically, 300-dimensional vector representations were obtained by training on a dump of the English Wikipedia plus newswire (8 billion words in total). The demo-train-big-model-v1.sh script from https://code.google.com/archive/p/word2vec was used to obtain this corpus. ... We use a standard AlexNet architecture (Krizhevsky et al., 2012), ... trained on the ImageNet Large Scale Visual Recognition Challenge (ILSVRC). ... Gygi, Kidd, and Watson (2007) performed an extensive psychological study of auditory perception and its relation to environmental sound categories. |
| Dataset Splits | Yes | We divide the data into a training and a validation set, sampling 75% for the former and taking the remainder for the latter. ... we do a five-way cross-validated comparison where we tune the α parameter (and in the tri-modal case also the β) on a held-out validation set of 20% of the data and obtain the Spearman ρs correlation score for the other 80%. |
| Hardware Specification | No | The paper does not provide specific hardware details (like GPU/CPU models or memory amounts) used for running the experiments. It only mentions using convolutional neural networks, which implies computational resources, but no specifics are given. |
| Software Dependencies | No | The paper mentions architectural models like "AlexNet architecture (Krizhevsky et al., 2012)" and algorithms such as "mini-batch k-means (Sculley, 2010)" and the "log-linear skip-gram model of Mikolov, Chen, Corrado, and Dean (2013)", but it does not specify any software libraries or frameworks with version numbers (e.g., Python, PyTorch, TensorFlow, CUDA versions) that would be needed to replicate the experiment. |
| Experiment Setup | Yes | We use standard stochastic gradient descent (SGD) optimization, with an initial learning rate of 0.01. The learning rate was set to degrade in a stepwise fashion by a factor of 0.1 every 1000 iterations, until convergence. ... we set an initial learning rate of 0.01 for the fully connected layers and 0.001 for the earlier convolutional layers and learn for up to 4000 iterations using SGD. ... We set k = 300 ... For each word, we retrieve the first 100 sound samples from Freesound with a maximum duration of 1 minute. |
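The aggregation step quoted in the Pseudocode row (mean-pooling the per-file penultimate-layer representations q_s into a single label representation r_x) can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name and the toy vectors are our own.

```python
def aggregate_label_embedding(file_embeddings):
    """Mean-pool per-file embeddings q_s into one label representation r_x.

    file_embeddings: non-empty list of equal-length vectors, one per audio
    file s in S_x (each the penultimate-layer output g(f(s))).
    """
    n = len(file_embeddings)
    dim = len(file_embeddings[0])
    # r_x = (1/|S_x|) * sum over s_i in S_x of q_{s_i}, componentwise.
    return [sum(vec[d] for vec in file_embeddings) / n for d in range(dim)]


# Toy usage: three 2-dimensional "embeddings" retrieved for one label.
q = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(aggregate_label_embedding(q))  # -> [3.0, 4.0]
```

In the paper the inputs would be real penultimate pre-softmax activations; here plain lists stand in for them to keep the sketch dependency-free.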
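The Experiment Setup row describes a stepwise learning-rate schedule: an initial rate of 0.01, decayed by a factor of 0.1 every 1000 iterations. A minimal sketch of that schedule (function name ours, framework-agnostic):

```python
def stepwise_lr(iteration, initial_lr=0.01, decay=0.1, step_size=1000):
    """Learning rate after `iteration` steps under stepwise decay.

    Multiplies the initial rate by `decay` once per completed `step_size`
    iterations, matching "degrade ... by a factor of 0.1 every 1000
    iterations" from the quoted setup.
    """
    return initial_lr * decay ** (iteration // step_size)


# Iterations 0-999 use 0.01, iterations 1000-1999 use 0.001, and so on.
for it in (0, 999, 1000, 2500):
    print(it, stepwise_lr(it))
```

The paper additionally uses a lower initial rate (0.001) for the earlier convolutional layers in the fine-tuning setting; that per-layer detail is omitted here.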