M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture
Authors: Hongyang Lei, Xiaolong Cheng, Qi Qin, Dan Wang, Huazhen Huang, Qingqing Gu, Yetao Wu, Luo Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By thoroughly designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. |
| Researcher Affiliation | Collaboration | 1Geely AI Lab, Zhejiang, China 2Peking University, Beijing, China 3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. |
| Pseudocode | No | The paper describes the methodology in Section 3 and the theoretical derivation in Section 4, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The paradigm of M3-JEPA is visualized by Figure 1, and the project code can be found at https://github.com/HongyangLL/M3-JEPA. |
| Open Datasets | Yes | COCO (Lin et al., 2014b) and Flickr30K (Plummer et al., 2015)... Clotho (Drossos et al., 2020), AudioCaps (Kim et al., 2019)... WavText5K (Deshmukh et al., 2023), Freesound (Font et al., 2013)... ImageNet-1K (Deng et al., 2009)... VQAv2 (Antol et al., 2015) and NLVR-2 (Suhr et al., 2019). |
| Dataset Splits | Yes | COCO (Lin et al., 2014b): ...5,000 training and 1,000 testing images per category. The test set contains 5,000 samples. Flickr30K (Plummer et al., 2015): ...The test set contains 1,000 samples. ImageNet-1K (Deng et al., 2009): ...1,281,167 training images, 50,000 validation images, and 100,000 test images... VQAv2 (Antol et al., 2015)... NLVR-2 (Suhr et al., 2019). Additionally, for audio-text retrieval: "we zero-shot evaluate the performance of M3-JEPA on Clotho and AudioCaps based on the knowledge of other datasets. In more details, we train M3-JEPA on the mixture of AudioCaps, WavText5K, and Freesound then test it on the Clotho dataset; then we train M3-JEPA on the mixture of Clotho, WavText5K, and Freesound then test it on the AudioCaps dataset." |
| Hardware Specification | No | The paper discusses computational efficiency and inference time but does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions models/frameworks such as Llama3-8B, DINOv2-Large, and LanguageBind, and the Adam optimizer, but does not provide version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | We implement an MMoE predictor with N = 12, K = 4 and L = 2. The inner hidden size (h) is 2048 and the dropout rate is set to 0.1. All tasks use a batch size of 128, solved by the Adam optimizer with a cosine lr schedule, warmup of 0.1 and weight decay of 0.005. For retrieval tasks, we evaluate recall-based metrics including R@1, R@5 and R@10. For classification tasks, we provide metrics such as Accuracy, Precision, Recall and F1 score. Specific learning rates for different tasks are provided in Appendix A. |
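The recall-based retrieval metrics named in the experiment-setup row (R@1, R@5, R@10) can be sketched as follows. This is a minimal illustration assuming a square similarity matrix where the ground-truth match for query `i` is candidate `i`; the function name `recall_at_k` is our own, not from the paper's code.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for cross-modal retrieval.

    sim[i, j] is the similarity of query i to candidate j;
    the correct candidate for query i is assumed to be j = i.
    """
    # Rank candidates for each query by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    # Position of the ground-truth candidate in each query's ranking.
    gt_pos = np.array([np.where(ranks[i] == i)[0][0] for i in range(sim.shape[0])])
    # Fraction of queries whose ground truth appears in the top K.
    return {f"R@{k}": float(np.mean(gt_pos < k)) for k in ks}
```

A query counts toward R@K as soon as its ground-truth candidate ranks among the top K, so R@1 ≤ R@5 ≤ R@10 always holds.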
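The predictor configuration in the setup row (N = 12 experts, top-K = 4 routing, hidden size 2048) corresponds to a standard top-K mixture-of-experts layer. The sketch below illustrates that routing pattern in plain numpy with small toy dimensions; it is not the authors' MMoE implementation, and all names (`moe_layer`, `experts_w`, `gate_w`) are hypothetical.

```python
import numpy as np

def moe_layer(x, experts_w, gate_w, top_k=4):
    """One top-K mixture-of-experts layer (illustrative sketch).

    x: (d,) input vector; experts_w: (N, d, h) expert weights;
    gate_w: (d, N) gating weights selecting which experts fire.
    """
    logits = x @ gate_w                        # gating scores over the N experts
    top = np.argsort(-logits)[:top_k]          # indices of the K highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Weighted sum of the selected experts' (ReLU-activated) outputs.
    return sum(w * np.maximum(x @ experts_w[i], 0.0) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, h, N = 16, 32, 12                           # toy sizes; the paper uses h = 2048, N = 12
x = rng.standard_normal(d)
experts = rng.standard_normal((N, d, h))
gates = rng.standard_normal((d, N))
y = moe_layer(x, experts, gates, top_k=4)
print(y.shape)  # (32,)
```

Only the K selected experts contribute to the output, which is what makes sparse MoE predictors computationally cheaper than running all N experts per token.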