M3-JEPA: Multimodal Alignment via Multi-gate MoE based on the Joint-Embedding Predictive Architecture
Authors: Hongyang Lei, Xiaolong Cheng, Qi Qin, Dan Wang, Huazhen Huang, Qingqing Gu, Yetao Wu, Luo Ji
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | By thoroughly designed experiments, we show that M3-JEPA can obtain state-of-the-art performance on different modalities and tasks, generalize to unseen datasets and domains, and is computationally efficient in both training and inference. |
| Researcher Affiliation | Collaboration | 1Geely AI Lab, Zhejiang, China 2Peking University, Beijing, China 3Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, China. |
| Pseudocode | No | The paper describes the methodology in Section 3 and the theoretical derivation in Section 4, but it does not contain structured pseudocode or algorithm blocks. |
| Open Source Code | Yes | The paradigm of M3-JEPA is visualized by Figure 1, and the project code can be found at https://github.com/HongyangLL/M3-JEPA. |
| Open Datasets | Yes | COCO (Lin et al., 2014b) and Flickr30K (Plummer et al., 2015)... Clotho (Drossos et al., 2020), AudioCaps (Kim et al., 2019)... WavText5K (Deshmukh et al., 2023), Freesound (Font et al., 2013)... ImageNet-1K (Deng et al., 2009)... VQAv2 (Antol et al., 2015) and NLVR-2 (Suhr et al., 2019). |
| Dataset Splits | Yes | COCO (Lin et al., 2014b): ...5,000 training and 1,000 testing images per category. The test set contains 5,000 samples. Flickr30K (Plummer et al., 2015): ...The test set contains 1,000 samples. ImageNet-1K (Deng et al., 2009): ...1,281,167 training images, 50,000 validation images, and 100,000 test images... VQAv2 (Antol et al., 2015)... NLVR-2 (Suhr et al., 2019). Additionally, for audio-text retrieval: "we zero-shot evaluate the performance of M3-JEPA on Clotho and AudioCaps based on the knowledge of other datasets. In more details, we train M3-JEPA on the mixture of AudioCaps, WavText5K, and Freesound then test it on the Clotho dataset; then we train M3-JEPA on the mixture of Clotho, WavText5K, and Freesound then test it on the AudioCaps dataset." |
| Hardware Specification | No | The paper discusses computational efficiency and inference time but does not provide specific hardware details (exact GPU/CPU models, processor types, or memory amounts) used for running its experiments. |
| Software Dependencies | No | The paper mentions models/frameworks such as Llama3-8B, DINOv2-Large, and LanguageBind, and the Adam optimizer, but does not provide version numbers for general software dependencies such as programming languages, libraries (e.g., PyTorch, TensorFlow), or CUDA. |
| Experiment Setup | Yes | We implement an MMoE predictor with N = 12, K = 4 and L = 2. The inner hidden size (h) is 2048 and the dropout rate is set to 0.1. All tasks use a batch size of 128, solved by the Adam optimizer with a cosine lr schedule, warmup of 0.1 and weight decay of 0.005. For retrieval tasks, we evaluate recall-based metrics including R@1, R@5 and R@10. For classification tasks, we provide metrics such as Accuracy, Precision, Recall and F1 score. Specific learning rates for different tasks are provided in Appendix A. |
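The recall-based retrieval metrics named in the experiment-setup row (R@1, R@5, R@10) can be sketched as follows. This is a minimal illustration assuming a square similarity matrix where the ground-truth match for query `i` is candidate `i`; the function name `recall_at_k` is our own, not from the paper's code.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Recall@K for cross-modal retrieval.

    sim[i, j] is the similarity of query i to candidate j;
    the correct candidate for query i is assumed to be j = i.
    """
    # Rank candidates for each query by descending similarity.
    ranks = np.argsort(-sim, axis=1)
    # Position of the ground-truth candidate in each query's ranking.
    gt_pos = np.array([np.where(ranks[i] == i)[0][0] for i in range(sim.shape[0])])
    # Fraction of queries whose ground truth appears in the top K.
    return {f"R@{k}": float(np.mean(gt_pos < k)) for k in ks}
```

A query counts toward R@K as soon as its ground-truth candidate ranks among the top K, so R@1 ≤ R@5 ≤ R@10 always holds.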
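The predictor configuration in the setup row (N = 12 experts, top-K = 4 routing, hidden size 2048) corresponds to a standard top-K mixture-of-experts layer. The sketch below illustrates that routing pattern in plain numpy with small toy dimensions; it is not the authors' MMoE implementation, and all names (`moe_layer`, `experts_w`, `gate_w`) are hypothetical.

```python
import numpy as np

def moe_layer(x, experts_w, gate_w, top_k=4):
    """One top-K mixture-of-experts layer (illustrative sketch).

    x: (d,) input vector; experts_w: (N, d, h) expert weights;
    gate_w: (d, N) gating weights selecting which experts fire.
    """
    logits = x @ gate_w                        # gating scores over the N experts
    top = np.argsort(-logits)[:top_k]          # indices of the K highest-scoring experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                   # softmax over the selected experts only
    # Weighted sum of the selected experts' (ReLU-activated) outputs.
    return sum(w * np.maximum(x @ experts_w[i], 0.0) for w, i in zip(weights, top))

rng = np.random.default_rng(0)
d, h, N = 16, 32, 12                           # toy sizes; the paper uses h = 2048, N = 12
x = rng.standard_normal(d)
experts = rng.standard_normal((N, d, h))
gates = rng.standard_normal((d, N))
y = moe_layer(x, experts, gates, top_k=4)
print(y.shape)  # (32,)
```

Only the K selected experts contribute to the output, which is what makes sparse MoE predictors computationally cheaper than running all N experts per token.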