Multimodal Distributional Semantics

Authors: E. Bruni, N. K. Tran, M. Baroni

JAIR 2014

Each reproducibility variable is listed below with its result and the supporting evidence drawn from the paper.
Research Type: Experimental
Evidence: "We report the results of a systematic comparison of the network of semantic relations entertained by a set of concrete nouns in the traditional text-based and novel image-based distributional spaces, confirming that image-based features are, indeed, semantically meaningful. Moreover, as expected, they provide somewhat complementary information with respect to text-based features. Having thus found a practical and effective way to extract perceptual information, we must consider next how to combine text- and image-derived features to build a multimodal distributional semantic model. We propose a flexible architecture to integrate text- and image-based distributional information, and we show in a set of empirical tests that our integrated model is superior to the purely text-based approach, and it provides somewhat complementary semantic information with respect to the latter."
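The passage above describes combining text- and image-derived features into a single multimodal representation. A minimal sketch of weighted-concatenation fusion in that spirit (the function names, toy vectors, and alpha value are illustrative assumptions, not the paper's implementation):

```python
import numpy as np

def l2_normalize(v):
    """Scale a vector to unit Euclidean length (zero vectors pass through)."""
    n = np.linalg.norm(v)
    return v / n if n > 0 else v

def fuse(text_vec, image_vec, alpha=0.5):
    """Concatenate normalized text- and image-based vectors.

    alpha weights the textual channel; (1 - alpha) weights the visual one.
    """
    t = alpha * l2_normalize(np.asarray(text_vec, dtype=float))
    v = (1.0 - alpha) * l2_normalize(np.asarray(image_vec, dtype=float))
    return np.concatenate([t, v])

def cosine(a, b):
    """Cosine similarity between two fused representations."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy text and image vectors for two concrete nouns (made-up numbers)
dog_text, dog_img = [1.0, 0.2, 0.0], [0.8, 0.1]
cat_text, cat_img = [0.9, 0.3, 0.1], [0.7, 0.2]

sim = cosine(fuse(dog_text, dog_img), fuse(cat_text, cat_img))
```

Normalizing each channel before weighting keeps one modality's larger raw magnitudes from dominating the fused similarity.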
Researcher Affiliation: Academia
Evidence: Elia Bruni, Center for Mind/Brain Sciences, University of Trento, Italy; Nam Khanh Tran, L3S Research Center, Hannover, Germany; Marco Baroni, Center for Mind/Brain Sciences, University of Trento, Italy; Department of Information Engineering and Computer Science, University of Trento, Italy.
Pseudocode: No
Evidence: The paper describes its methods in prose and mathematical equations, for example in Section 3 ("A Framework for Multimodal Distributional Semantics") and Section 4 ("Implementation Details"), but provides no explicit pseudocode or algorithm blocks.
Open Source Code: Yes
Evidence: "Both our implementation of the multimodal framework and of the visual feature extraction procedure are publicly available and open source." (Footnote 8: https://github.com/s2m/FUSE/ and https://github.com/vsem/, respectively.) The visual feature extraction procedure is presented by Bruni, Bordignon, Liska, Uijlings, and Sergienya (2013).
Open Datasets: Yes
Evidence: "We adopt as our source corpus the ESP-Game data set that contains 100K images..." (footnote 11: http://www.cs.cmu.edu/~biglou/resources/). MEN is publicly available and can be downloaded from http://clic.cimec.unitn.it/~elia.bruni/MEN. "We collect co-occurrence counts from the concatenation of two corpora, ukWaC and Wackypedia... they are freely and publicly available, and they are widely used in linguistic research." (Footnote 10: http://wacky.sslmit.unibo.it/.)
Dataset Splits: Yes
Evidence: "We use indeed 2,000 MEN pairs (development set) for model tuning and 1,000 pairs for evaluation (test set)."
Hardware Specification: No
Evidence: The paper describes the computational methods and software libraries used (e.g., SVDLIBC, VLFeat, Gensim, CLUTO) but gives no details about the hardware on which the experiments were run, such as CPU or GPU models.
Software Dependencies: No
Evidence: The paper names specific tools and libraries (TreeTagger, SVDLIBC, the VLFeat toolbox, Gensim, the CLUTO toolkit) but gives no version numbers for them. For example, "For their extraction we use the vl_phow command included in the VLFeat toolbox (Vedaldi & Fulkerson, 2010)" identifies the toolbox but not its version.
Experiment Setup: Yes
Evidence: "We performed two separate parameter optimizations, one specifically for the semantic relatedness task (using MEN development, see Section 5.2.1) and the other specifically for the clustering task (using Battig, see Section 5.3.1). We determined the best model by performing an exhaustive search across SVD k (from 2^4 to 2^12 in powers of 2), FL and SL with α varying from 0 to 1 (inclusive) in steps of 0.1 and similarly for β. In total, 198 models were explored and the one with the highest performance on the development data was chosen. To tune the parameter k we used the MEN development set (see Section 5.2.1). By varying k between 500 and 5000 in steps of 500, we found the optimal k being 5000. The Latent Dirichlet Allocation (Mix LDA) model is trained on this matrix and tuned on the MEN development set by varying the number of topics Kt. The optimal value we find is Kt = 128."
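The quoted tuning procedure amounts to an exhaustive grid search over the stated parameter ranges. A minimal sketch, assuming a hypothetical `evaluate(k, alpha, beta)` callback that scores a configuration on the development set; note that the full Cartesian grid below has more cells than the 198 configurations the paper reports, so their search evidently constrained which combinations were tried:

```python
from itertools import product

def grid_search(evaluate):
    """Exhaustively score every (k, alpha, beta) configuration and keep the best.

    evaluate(k, alpha, beta) -> development-set score (hypothetical scorer).
    """
    ks = [2 ** p for p in range(4, 13)]               # SVD k: 2^4 .. 2^12, powers of 2
    weights = [round(0.1 * i, 1) for i in range(11)]  # 0.0 .. 1.0 in steps of 0.1
    return max(product(ks, weights, weights),
               key=lambda cfg: evaluate(*cfg))

# Toy scorer standing in for development-set performance;
# it peaks at k=512, alpha=0.5, beta=0.3 by construction.
best_k, best_a, best_b = grid_search(
    lambda k, a, b: -abs(k - 512) - abs(a - 0.5) - abs(b - 0.3))
```

In practice each call to `evaluate` would train and score one model on the development pairs, so the grid's size directly determines tuning cost.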