Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Generalization on the Unseen, Logic Reasoning and Degree Curriculum

Authors: Emmanuel Abbe, Samy Bengio, Aryo Lotfi, Kevin Rizk

JMLR 2024 | Venue PDF | LLM Run Details

Reproducibility Variable Result LLM Response
Research Type Experimental We study how different network architectures trained by (S)GD perform under GOTU and provide both theoretical and experimental evidence that for sparse functions and a class of network models including instances of Transformers, random features models, and linear networks, a min-degree-interpolator is learned on the unseen. More specifically, this means an interpolator of the training data that has minimal Fourier mass on the higher degree basis elements. These findings lead to two implications: (1) we provide an explanation to the length generalization problem for Boolean functions (e.g., Anil et al. 2022); (2) we introduce a curriculum learning algorithm called Degree-Curriculum that learns monomials more efficiently by incrementing supports. Finally, we discuss extensions to other models or non-sparse regimes where the min-degree bias may still occur or fade, as well as how it can be potentially corrected when undesirable. Keywords: reasoning, out-of-distribution generalization, implicit bias, length generalization, curriculum learning
Researcher Affiliation Collaboration Emmanuel Abbe (EPFL, Lausanne, VD, Switzerland; Apple); Samy Bengio (Apple, Cupertino, CA, USA); Aryo Lotfi (EPFL, Lausanne, VD, Switzerland); Kevin Rizk (EPFL, Lausanne, VD, Switzerland)
Pseudocode Yes Algorithm 1: Degree-Curriculum algorithm
Input: Training samples S = {(x_i, y_i)}_{i=1}^m; curriculum B_{r_1} ⊂ B_{r_2} ⊂ ... ⊂ B_{r_k} = B_d; loss threshold ε
for i = 1 to k do
    S_{r_i} := {(x, y) ∈ S | x ∈ B_{r_i}}  (samples in B_{r_i})
    initialize train_loss = 1 + ε
    while train_loss > ε do
        train model with SGD on S_{r_i}
        update train_loss
    end while
end for
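A minimal Python sketch of this loop, assuming the nested sets B_r are Hamming balls of radius r around the all-ones vector (consistent with the paper's ±1 Boolean setting) and that `train_step` and `eval_loss` are user-supplied hooks; all names here are illustrative, not from the released code:

```python
import numpy as np

def hamming_radius(x):
    # Number of -1 coordinates, i.e. Hamming distance from the all-(+1)
    # vector (assumes x is a ±1 vector, as in the Boolean setting).
    return int(np.sum(x == -1))

def degree_curriculum(samples, radii, train_step, eval_loss, eps=1e-2, max_iters=10_000):
    """Sketch of Algorithm 1 (Degree-Curriculum).

    samples:    list of (x, y) pairs with x a ±1 numpy vector
    radii:      increasing r_1 < ... < r_k = d defining the nested balls B_r
    train_step: callable performing one SGD step on a batch (user-supplied)
    eval_loss:  callable returning the current training loss (user-supplied)
    """
    for r in radii:
        # S_r: samples whose input lies in the Hamming ball B_r
        S_r = [(x, y) for (x, y) in samples if hamming_radius(x) <= r]
        train_loss = 1.0 + eps   # "initialize train_loss = 1 + eps"
        it = 0
        while train_loss > eps and it < max_iters:
            train_step(S_r)
            train_loss = eval_loss(S_r)
            it += 1
```

The `max_iters` guard is an added safety cap not present in the pseudocode; the curriculum simply re-trains on progressively larger supports until the loss threshold is met at each stage.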
Open Source Code Yes 1. Our code is available at https://github.com/aryol/GOTU
Open Datasets No The paper describes generating synthetic datasets for Boolean functions: "For each experiment, we generate all binary sequences in U^c = {±1}^d \ U for training." and mentions learning tasks like "learning the full parity function on 15 bits". It does not provide access information for any pre-existing public datasets.
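This construction can be illustrated in a few lines: enumerate {±1}^d and hold out an unseen domain U. The particular U below (points with the first coordinate frozen to +1) is an assumed example for illustration, not necessarily the paper's exact choice:

```python
import itertools
import numpy as np

d = 6  # small dimension for illustration; the paper uses d = 15
# Hypothetical unseen domain U: all points with x[0] frozen to +1.
all_points = [np.array(p) for p in itertools.product([-1, 1], repeat=d)]
train = [x for x in all_points if x[0] != 1]   # U^c = {±1}^d \ U
unseen = [x for x in all_points if x[0] == 1]  # U, held out at training time
```

Since the full hypercube has 2^d points, explicit enumeration like this is only feasible for moderate d, which matches the paper's remark that dimension 15 is "a large dimension where the training data can be generated explicitly".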
Dataset Splits Yes For each experiment, we generate all binary sequences in U^c = {±1}^d \ U for training. We then train models under the ℓ2 loss. We employ the Adam (Kingma and Ba, 2015) optimizer for the Transformer model and mini-batch SGD for the rest of the architectures. We also use moderate learning rates, as the learning rate can affect the results (refer to Appendix B.2). During training, we evaluate the coefficients of the function learned by the neural network using f̂_NN(T) = E_{x ~ Unif({±1}^d)}[χ_T(x) f_NN(x)] to understand which interpolating solution has been learned by the model. Moreover, each experiment is repeated 10 times and averaged results are reported. For more information on the setup of experiments, hyperparameter sensitivity analysis, and additional experiments, refer to Appendix B. Here, we consider the following 3 functions and unseen domains on input dimension 15. Dimension 15 is used as a large dimension where the training data can be generated explicitly but has otherwise no specific meaning (Appendix B provides other instances). The first function is an example of degree 2 where the unseen domain induces a degree-1 MD interpolator. The second example is the classic degree-2 parity or XOR function. The third example is such that the function is symmetric under cyclic permutations while its MD interpolator is not, in order to test whether certain models would favor symmetric interpolators.
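The coefficient estimate f̂_NN(T) above can be computed exactly for small d by enumerating the hypercube, using the Fourier character χ_T(x) = ∏_{i∈T} x_i. A sketch (function and variable names are illustrative, not from the paper's code):

```python
import itertools
import numpy as np

def fourier_coefficient(f, T, d):
    """Exact Fourier coefficient f_hat(T) = E_{x ~ Unif({±1}^d)}[chi_T(x) f(x)],
    where chi_T(x) = prod_{i in T} x_i. Enumerates all 2^d inputs (small d only)."""
    total = 0.0
    for bits in itertools.product([-1.0, 1.0], repeat=d):
        x = np.array(bits)
        chi_T = np.prod(x[list(T)]) if T else 1.0  # chi_emptyset is constant 1
        total += chi_T * f(x)
    return total / 2 ** d
```

Applying this to the trained network's output function reveals which interpolator was learned: a min-degree interpolator concentrates its Fourier mass on low-degree sets T.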
Hardware Specification Yes The implementation of experiments has been done using the PyTorch framework (Paszke et al., 2019). Additionally, the experiments were executed on NVIDIA A100 GPUs and took around 90 GPU hours in total (excluding the selection of hyperparameters).
Software Dependencies No The paper mentions the "PyTorch framework (Paszke et al., 2019)" but does not specify a version number. It also mentions the "Adam (Kingma and Ba, 2015) optimizer" and "mini-batch SGD", but these are algorithms/optimizers, not software dependencies with version numbers.
Experiment Setup Yes For the Transformer, we have used the Adam (Kingma and Ba, 2015) optimizer with batch size 256. For the RF models, we have used mini-batch SGD with a batch size of 256. For the rest of the architectures, SGD with batch size 64 has been used. We did not observe any significant difference in the results of experiments by varying the batch sizes. We generally selected the learning rates per model (and task) by the stability of the training and the speed of convergence. We have included more details about the learning rate in Appendix B.2. We also set the number of training epochs large enough that the loss of the models is always less than 10^-2. We also note that Transformers usually learn the target function in a few epochs, reaching a loss of the order of 10^-4; after that, the training becomes unstable in some instances. Note that Transformers are usually trained with learning rate schedulers; however, we did not use any learning rate schedulers for simplicity and instead opted for early stopping to avoid unstable phases of training. Also, for the results reported for causal attention masking in Table 1, and particularly for the instance x_7 x_13 with x_7 = 1, some of the seeds became unstable and did not converge, so for this particular instance we reported the results for the first 10 seeds where the training loss converged. We note that even for the unstable seeds the min-degree bias was never violated. Note that our main objective is to demonstrate the min-degree bias of neural networks and not to optimize any performance metric. As a result, we did not focus on hyperparameter tuning in these experiments. Generally, the hyperparameters used for our experiments are available in our code online: https://github.com/aryol/GOTU.
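The training regime described here (mini-batch SGD under the ℓ2 loss, with early stopping once the training loss drops below 10^-2) can be sketched with a toy numpy example; the linear model, target function, and hyperparameters below are stand-in assumptions for illustration, not the paper's architectures:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 15
# Toy stand-in: ±1 inputs and a hypothetical degree-1 target (not from the paper).
X = rng.choice([-1.0, 1.0], size=(1024, d))
y = X[:, 0]
w = np.zeros(d)
lr, batch_size, threshold = 0.05, 64, 1e-2

loss = float("inf")
for epoch in range(1000):
    perm = rng.permutation(len(X))
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        # Gradient of the mean squared (ℓ2) loss on this mini-batch.
        grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)
        w -= lr * grad
    loss = float(np.mean((X @ w - y) ** 2))  # full training loss
    if loss < threshold:                     # early stopping, as in the setup
        break
```

Checking the loss once per epoch and stopping at the threshold mirrors the described practice of training "long enough" rather than for a fixed epoch count.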