Towards Understanding the Mixture-of-Experts Layer in Deep Learning
Authors: Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu, Yuanzhi Li
NeurIPS 2022
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. This motivates us to consider a challenging classification problem with intrinsic cluster structures. ... Finally, we also conduct extensive experiments on both synthetic and real datasets to corroborate our theory. |
| Researcher Affiliation | Academia | Zixiang Chen, Yihe Deng, Yue Wu, Quanquan Gu (Department of Computer Science, University of California, Los Angeles, Los Angeles, CA 90095, USA); Yuanzhi Li (Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA) |
| Pseudocode | Yes | Algorithm 1: Gradient descent with random initialization (a minimal sketch in this spirit appears after the table) |
| Open Source Code | Yes | The code and data for our experiments can be found on GitHub: https://github.com/uclaml/MoE |
| Open Datasets | Yes | We consider the CIFAR-10 dataset (Krizhevsky, 2009) |
| Dataset Splits | Yes | We generate 16,000 training examples and 16,000 test examples from the data distribution defined in Definition 3.1 |
| Hardware Specification | No | The paper does not specify the hardware (e.g., CPU, GPU models, memory) used for running the experiments. |
| Software Dependencies | No | The paper does not provide specific version numbers for software dependencies (e.g., Python, PyTorch, TensorFlow versions). |
| Experiment Setup | Yes | For the CNN model, we use 2 convolution layers followed by 2 fully connected layers. The input channel is 3 and the output channel is 64. The kernel size is 3 and padding is 1. We use a max pooling layer with kernel size 2 and stride 2. We set the learning rate to 0.001 and the batch size to 128. We use the Adam optimizer for all experiments. (A minimal PyTorch sketch of this setup appears after the table.) |
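
The pseudocode row refers to the paper's Algorithm 1, gradient descent with random initialization. The following is a minimal sketch in that spirit, not the authors' implementation: the softmax gating, the small experts with a cubic activation, the expert width, the number of experts, the learning rate, and the ±1-label setup are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """Small nonlinear expert; the cubic activation and width are assumptions."""

    def __init__(self, dim, width=16):
        super().__init__()
        self.fc1 = nn.Linear(dim, width)
        self.fc2 = nn.Linear(width, 1)

    def forward(self, x):
        return self.fc2(self.fc1(x) ** 3)


class SimpleMoE(nn.Module):
    """Mixture-of-experts classifier: a softmax gate mixes scalar expert logits."""

    def __init__(self, dim, num_experts=4):
        super().__init__()
        self.gate = nn.Linear(dim, num_experts, bias=False)
        self.experts = nn.ModuleList([Expert(dim) for _ in range(num_experts)])

    def forward(self, x):
        pi = F.softmax(self.gate(x), dim=-1)                    # (batch, experts)
        outs = torch.cat([e(x) for e in self.experts], dim=-1)  # (batch, experts)
        return (pi * outs).sum(dim=-1)                          # (batch,)


def train_moe(X, y, epochs=200, lr=0.1, seed=0):
    """Full-batch gradient descent from a random (default PyTorch) initialization."""
    torch.manual_seed(seed)                           # random initialization
    model = SimpleMoE(X.shape[1])
    opt = torch.optim.SGD(model.parameters(), lr=lr)  # plain gradient descent
    for _ in range(epochs):
        opt.zero_grad()
        loss = F.binary_cross_entropy_with_logits(model(X), (y > 0).float())
        loss.backward()
        opt.step()
    return model
```

For ±1 labels `y`, `train_moe(X, y)` runs plain gradient descent on the logistic loss; the paper's Algorithm 1 may include additional steps (e.g., a specific initialization scale or normalization) not reproduced here.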
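
The experiment-setup row describes the CNN used on CIFAR-10. Below is a minimal PyTorch sketch consistent with the stated hyperparameters (two conv layers with kernel size 3 and padding 1, 64 output channels, 2x2 max pooling, two fully connected layers, Adam with learning rate 0.001, batch size 128); the ReLU activations, the second conv layer's channel count, the FC hidden width of 128, and the 32x32 input resolution are assumptions where the excerpt is silent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class CNN(nn.Module):
    """Two conv layers (kernel 3, padding 1, 64 channels) + two fully connected layers."""

    def __init__(self, num_classes=10):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 64, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(64, 64, kernel_size=3, padding=1)  # second conv width assumed
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        # A 32x32 CIFAR-10 image becomes 64 x 8 x 8 after two pooled conv blocks.
        self.fc1 = nn.Linear(64 * 8 * 8, 128)                     # hidden width 128 assumed
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))   # ReLU activation assumed
        x = self.pool(F.relu(self.conv2(x)))
        x = x.flatten(start_dim=1)
        return self.fc2(F.relu(self.fc1(x)))


model = CNN()
# Optimizer settings stated in the excerpt: Adam with learning rate 0.001.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# The stated batch size of 128 would be set on the data loader, e.g.
# torch.utils.data.DataLoader(train_set, batch_size=128, shuffle=True)
```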