Improving Adaptivity via Over-Parameterization in Sequence Models
Authors: Yicheng Li, Qian Lin
NeurIPS 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this section, we provide some numerical experiments to validate the theoretical results. For more detailed numerical experiments, please refer to Section C. |
| Researcher Affiliation | Academia | Yicheng Li, Department of Statistics and Data Science, Tsinghua University, Beijing, China (EMAIL); Qian Lin (corresponding author), Department of Statistics and Data Science, Tsinghua University, Beijing, China (EMAIL). Qian Lin is also affiliated with the Beijing Academy of Artificial Intelligence, Beijing, China. |
| Pseudocode | No | We approximate the gradient flow equations (22) and (30) by discrete-time gradient descent and truncate the sequence model to the first N terms for some very large N. |
| Open Source Code | Yes | The codes are provided in the supplementary material. |
| Open Datasets | No | We consider the settings as in Corollary 3.3, where θ is given by (4) for some p > 0 and q ≥ 1. We set ϵ² = n⁻¹, where n can be regarded as the sample size, and consider the asymptotic performance of the generalization error as n grows. [...] We consider two real-world datasets: California Housing and Concrete Compressive Strength. |
| Dataset Splits | No | No explicit mention of training/validation/test dataset splits is found. The paper focuses on generalization error related to training process and sample size. |
| Hardware Specification | Yes | The experiments can be done on a laptop with 64 CPU cores and 32 GB of memory in one day. |
| Software Dependencies | No | The paper mentions 'discrete-time gradient descent' and implies computation, but does not specify software names with version numbers for libraries or programming languages used. |
| Experiment Setup | Yes | We approximate the gradient flow equations (22) and (30) by discrete-time gradient descent with a sufficiently small step size. Moreover, we truncate the sequence model to the first N terms for some very large N. We consider the settings as in Corollary 3.3, where θ is given by (4) for some p > 0 and q ≥ 1. We set ϵ² = n⁻¹, where n can be regarded as the sample size, and consider the asymptotic performance of the generalization error as n grows. For the stopping time, we choose the oracle one that minimizes the generalization error for each method. |
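The paper's own code is in its supplementary material; the sketch below is only a minimal illustration of the setup quoted in the Experiment Setup row. Everything specific here is an assumption: the decay profile θ_j = j^{-p} stands in for equation (4), the Hadamard-product over-parameterization θ̂_j = u_j·v_j stands in for the gradient flows in (22) and (30), and the constants N, n, p, the step size, and the iteration count are illustrative values, not the paper's.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Assumed constants (not taken from the paper) ---
N = 10_000      # truncation level ("first N terms for some very large N")
n = 1_000       # sample size, so the noise level is eps^2 = 1/n
p = 1.0         # decay exponent: theta_j = j^{-p} stands in for equation (4)
lr = 1e-3       # "sufficiently small step size"
steps = 5_000   # number of discrete gradient-descent iterations

j = np.arange(1, N + 1, dtype=float)
theta = j ** (-p)                                # true coefficients
y = theta + rng.standard_normal(N) / np.sqrt(n)  # noisy sequence observations

# Over-parameterize each coefficient as theta_hat_j = u_j * v_j (an assumed
# stand-in for the paper's parameterization) and run gradient descent on the
# squared loss 0.5 * ||u * v - y||^2, discretizing the gradient flow.
u = np.full(N, 0.1)
v = np.full(N, 0.1)

best_err = np.inf
for _ in range(steps):
    resid = u * v - y
    u, v = u - lr * resid * v, v - lr * resid * u  # simultaneous update
    # Oracle stopping time: keep the iterate with the smallest
    # generalization error ||theta_hat - theta||^2, as in the quoted setup.
    err = float(np.sum((u * v - theta) ** 2))
    best_err = min(best_err, err)

print(f"oracle generalization error at n={n}: {best_err:.4e}")
```

Repeating this over a grid of sample sizes n and plotting the oracle error against n on a log-log scale yields the kind of asymptotic error-decay curves the quoted setup describes.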