Small Transformers Compute Universal Metric Embeddings

Authors: Anastasis Kratsios, Valentin Debarnot, Ivan Dokmanić

JMLR 2023

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In this section we complement the theoretical embedding guarantees from Section 4 with preliminary computer experiments on synthetic data. We show that the proposed feature maps can indeed be trained in a standard deep learning framework, that the theoretical advantages of PT mixture-Wasserstein embeddings over Euclidean and hyperbolic embeddings carry over to practice, and that the PT-based feature maps generalize beyond X_n.
Researcher Affiliation | Academia | Anastasis Kratsios (EMAIL), McMaster University, Department of Mathematics, 1280 Main Street West, Hamilton, Ontario, L8S 4K1, Canada; Valentin Debarnot (EMAIL), Universität Basel, Department of Computer Science, Basel, 4051, Switzerland; Ivan Dokmanić (EMAIL), Universität Basel, Department of Computer Science, Basel, 4051, Switzerland
Pseudocode | Yes | Algorithm 1 (Initialize Bias):

    Require: set of N vectors X := {x^(1), ..., x^(N)} in R^K; b_1 := 0
    for n = 1, ..., N do                        # initialize first shift
        x̃^(n)_1 := x^(n)_1 + b_1               # dummy vectors
    end for
    for k = 2, ..., K do                        # iteratively build bias components
        b_k := max_{n ≤ N} ReLU(x̃^(n)_{k-1} - x^(n)_k)
        for n = 1, ..., N do
            x̃^(n)_k := x^(n)_k + b_k           # dummy vectors
        end for
    end for
    return b := (b_1, ..., b_K)                 # return bias
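The extracted pseudocode is partly garbled, so the following NumPy sketch shows only one plausible reading of Algorithm 1: each bias component b_k is the smallest shift that makes the "dummy vectors" coordinate-wise non-decreasing. The function and variable names are ours, and the exact argument of the ReLU is an assumption, not the authors' verified formula.

```python
import numpy as np

def initialize_bias(X):
    """Sketch of Algorithm 1 ("Initialize Bias") under our reading.

    X: (N, K) array of N vectors in R^K.
    Returns b in R^K such that every shifted vector x + b is
    coordinate-wise non-decreasing (the ReLU step is our assumption).
    """
    N, K = X.shape
    b = np.zeros(K)                    # b_1 := 0 (index 0 in Python)
    X_tilde = np.empty_like(X)         # the "dummy vectors"
    X_tilde[:, 0] = X[:, 0] + b[0]     # first shift
    for k in range(1, K):              # iteratively build bias components
        # Smallest shift making coordinate k at least as large as the
        # shifted coordinate k-1, across all N vectors.
        b[k] = np.max(np.maximum(X_tilde[:, k - 1] - X[:, k], 0.0))
        X_tilde[:, k] = X[:, k] + b[k]
    return b
```

Under this reading, adding the returned bias sorts each input vector's coordinates into non-decreasing order without changing coordinate gaps that are already non-negative.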
Open Source Code | Yes | The Python code used to produce the results of this section is available at https://github.com/swing-research/Universal-Embeddings.
Open Datasets | No | The paper uses synthetic data only. For example: "We consider a regular binary tree X = (V, E) (Figure 7a) of depth six with a total of |V| = 127 vertices.", and "We randomly sample data points {x_i}_{i=1}^n from the uniform probability measure on S^N." No concrete access information for a publicly available dataset is provided.
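The uniform samples on the sphere S^N quoted above can be drawn by normalizing i.i.d. Gaussian vectors, a standard construction; the function name and seed below are ours, not the paper's:

```python
import numpy as np

def sample_sphere(n, dim, seed=0):
    """Draw n points uniformly on the unit sphere in R^dim by
    normalizing i.i.d. standard-Gaussian vectors (standard trick;
    naming and seed are illustrative assumptions)."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)
```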
Dataset Splits | Yes | We partition the vertices V into training and testing sets, V_train ∪ V_test = V, with |V_train| = 111 and |V_test| = 16. The test vertices (colored white in Figure 7a) are used to evaluate the quality of out-of-sample representations (that is to say, the generalization) computed by the different representation maps.
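A 111/16 partition of the 127 tree vertices could be reproduced along the following lines; the random selection and the seed are our assumptions, since the paper only reports the split sizes, not the selection procedure:

```python
import random

def split_vertices(n_vertices=127, n_test=16, seed=0):
    """Partition vertex ids 0..n_vertices-1 into disjoint train/test
    sets (111 train, 16 test for the depth-six binary tree). The
    uniform random choice and seed are illustrative assumptions."""
    rng = random.Random(seed)
    test = set(rng.sample(range(n_vertices), n_test))
    train = [v for v in range(n_vertices) if v not in test]
    return train, sorted(test)
```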
Hardware Specification | No | The paper mentions training in PyTorch with the Adam optimizer but does not specify any particular hardware (GPU/CPU models, memory, etc.).
Software Dependencies | No | "All networks are trained by the Adam optimizer in PyTorch, with weight decay parameter 10^-6, initial learning rate 10^-4 and final learning rate 10^-6." The paper mentions PyTorch and the Adam optimizer but does not provide version numbers for these or any other software dependencies.
Experiment Setup | Yes | All networks are trained by the Adam optimizer in PyTorch, with weight decay parameter 10^-6, initial learning rate 10^-4 and final learning rate 10^-6. In practice we set α = 1 as it does not have a strong influence on empirical performance. We use K = 5 mixture components and d = 15 for the dimension of the hyperbolic space to ensure a fair comparison with the probabilistic transformer's effective dimension. We train a PT for 160 iterations with the Adam optimizer (Kingma and Ba, 2015). In each iteration, we use a random batch of 32 points among the 10,000 fixed training points chosen from a uniform measure on the sphere.
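The reported learning-rate endpoints (10^-4 initial, 10^-6 final, over 160 Adam iterations) are consistent with, for instance, an exponential decay; in the sketch below only the endpoints and step count come from the paper, while the exponential shape and the function name are our assumptions:

```python
def lr_schedule(step, total_steps=160, lr_init=1e-4, lr_final=1e-6):
    """Exponentially interpolate the learning rate from lr_init to
    lr_final over total_steps iterations. Endpoints and step count
    match the reported setup; the decay shape is an assumption."""
    t = step / (total_steps - 1)               # progress in [0, 1]
    return lr_init * (lr_final / lr_init) ** t
```

In PyTorch, an equivalent effect could be obtained by attaching a multiplicative scheduler (e.g. `torch.optim.lr_scheduler.ExponentialLR`) to an `Adam` optimizer created with `weight_decay=1e-6`.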