On Path to Multimodal Generalist: General-Level and General-Bench
Authors: Hao Fei, Yuan Zhou, Juncheng Li, Xiangtai Li, Qingshan Xu, Bobo Li, Shengqiong Wu, Yaoting Wang, Junbao Zhou, Jiahao Meng, Qingyu Shi, Zhiyuan Zhou, Liangtao Shi, Minghe Gao, Daoan Zhang, Zhiqi Ge, Siliang Tang, Kaihang Pan, Yaobo Ye, Haobo Yuan, Tao Zhang, Weiming Wu, Tianjie Ju, Zixiang Meng, Shilin Xu, Liyu Jia, Wentao Hu, Meng Luo, Jiebo Luo, Tat-Seng Chua, Shuicheng Yan, Hanwang Zhang
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this project, we introduce an evaluation framework to delineate the capabilities and behaviors of current multimodal generalists. This framework, named General-Level, establishes 5-scale levels of MLLM performance and generality, offering a methodology to compare MLLMs and gauge the progress of existing systems towards more robust multimodal generalists and, ultimately, towards AGI (Artificial General Intelligence). Central to our framework is the use of Synergy as the evaluative criterion, categorizing capabilities based on whether MLLMs preserve synergy across comprehension and generation, as well as across multimodal interactions. To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. The evaluation results that involve over 100 existing state-of-the-art MLLMs uncover the capability rankings of generalists, highlighting the challenges in reaching genuine AI. We expect this project to pave the way for future research on next-generation multimodal foundation models, providing a robust infrastructure to accelerate the realization of AGI. |
| Researcher Affiliation | Collaboration | 1NUS 2NTU 3ZJU 4KAUST 5PKU 6HFUT 7UR 8WHU 9NJU 10SJTU 11Skywork AI. Correspondence to: Shuicheng Yan <EMAIL>, Hanwang Zhang <EMAIL>. The affiliations include various universities (academic) such as National University of Singapore (NUS), Nanyang Technological University (NTU), Zhejiang University (ZJU), King Abdullah University of Science and Technology (KAUST), Peking University (PKU), Hefei University of Technology (HFUT), University of Rochester (UR), Wuhan University (WHU), Nanjing University (NJU), Shanghai Jiao Tong University (SJTU), and an industry entity, Skywork AI. |
| Pseudocode | No | The paper describes scoring specifications and algorithms in Table 1 and accompanying text (Section 3.2, A.1) using mathematical formulas and definitions, but it does not present these in a structured pseudocode or algorithm block format. |
| Open Source Code | No | Project Page: https://generalist.top/ Leaderboard: https://generalist.top/leaderboard/ Benchmark: https://huggingface.co/General-Level/ Our open-source codebase supports multi-GPU distributed inference, effectively accelerating the evaluation process. While the paper mentions an "open-source codebase" for evaluation and provides links to a project page and benchmark data, it does not provide a direct link to a specific source-code repository (e.g., GitHub, GitLab) for the methodology itself. |
| Open Datasets | Yes | To evaluate the comprehensive abilities of various generalists, we present a massive multimodal benchmark, General-Bench, which encompasses a broader spectrum of skills, modalities, formats, and capabilities, including over 700 tasks and 325,800 instances. Benchmark: https://huggingface.co/General-Level/ The dataset for General-Bench is built using publicly available resources or through collaborations with contributors who explicitly consented to their data being included. In alignment with our commitment to fostering inclusivity in the AI research community, all code, tasks, and datasets related to General-Bench are openly available. |
| Dataset Splits | Yes | For most of the tasks, we maintain around 500 testing instances each. Considering that not all practitioners in the community may be interested in participating in the leaderboard (for example, some may simply wish to use our dataset for their research or publications), we propose dividing the test set for each task into a closed set and an open set. The closed set is reserved for leaderboard evaluations: only the input data is released, and users are required to submit their model's predicted outputs for centralized assessment. In contrast, the open set provides full access to both inputs and corresponding outputs, enabling practitioners to explore and utilize the data more freely. Each task's test set is split into closed and open subsets with a ratio of 2:3. |
| Hardware Specification | No | The inference time varies across models. Smaller models complete evaluations within a few minutes, while larger models require significantly more time. On pure text-based NLP tasks, model inference is highly efficient; however, on video tasks, models demand more memory and have slower inference speeds. Our open-source codebase supports multi-GPU distributed inference, effectively accelerating the evaluation process. Also, we organize personnel into multiple groups to run models in parallel, further optimizing efficiency. The paper mentions multi-GPU distributed inference and relative inference costs, but does not specify concrete hardware such as GPU models, counts, or memory. |
| Software Dependencies | No | For different models, we consistently follow the settings provided in their respective GitHub repositories, including model parameters and hyperparameters. We do not perform additional pre-training or fine-tuning. The paper refers to using existing models' settings and repositories but does not list specific version numbers for software dependencies of its own evaluation framework or tools. |
| Experiment Setup | Yes | For different models, we consistently follow the settings provided in their respective GitHub repositories, including model parameters and hyperparameters. We do not perform additional pre-training or fine-tuning. Each task and dataset comes with a predefined instruction prompt text. During evaluation, we use the same default prompt across all MLLMs to ensure fairness. The overall results of part of the models on image comprehension and generation are presented in Table 2 and Table 3, respectively; due to space limitations, the remaining main results are moved to Appendix C.2: video results are shown in Table 11, audio results in Table 12, 3D results in Table 13, and the results of all generalists on NLP tasks in Table 14. The complete performance scores of all MLLMs across all tasks and datasets are presented in Appendix C. Overall, we have the following observations. We note that all the generalists run the evaluation on our General-Bench dataset under a zero-shot setting. |
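The 2:3 closed/open partition described in the Dataset Splits row can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual partitioning code; the seeded shuffle is an assumption made here for reproducibility.

```python
import random

def split_task_test_set(instances, closed_ratio=2 / 5, seed=0):
    """Split one task's test instances into a closed subset (leaderboard-only,
    inputs released) and an open subset (inputs and outputs released),
    following the stated 2:3 closed-to-open ratio."""
    rng = random.Random(seed)  # fixed seed: an assumption, for a reproducible split
    shuffled = list(instances)
    rng.shuffle(shuffled)
    n_closed = round(len(shuffled) * closed_ratio)
    return shuffled[:n_closed], shuffled[n_closed:]

# With ~500 instances per task, this yields roughly 200 closed / 300 open.
closed, open_set = split_task_test_set(range(500))
```

A deterministic split keeps the closed leaderboard subset stable across releases, which matters when predictions are submitted for centralized assessment.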
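The evaluation protocol in the Experiment Setup row (same default prompt per task, zero-shot, no fine-tuning) can be sketched as below. `Task`, `EchoModel`, and the exact-match scorer are hypothetical stand-ins, not the benchmark's real interfaces or metrics, which are task-specific.

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    prompt: str      # predefined instruction prompt, shared by all models
    instances: list  # each instance: {"input": ..., "output": ...}

    def score(self, preds):
        # toy exact-match scorer; the real benchmark uses task-specific metrics
        gold = [inst["output"] for inst in self.instances]
        return sum(p == g for p, g in zip(preds, gold)) / len(gold)

class EchoModel:
    """Hypothetical stand-in for an MLLM: echoes its input as the prediction."""
    name = "echo"
    def generate(self, prompt, inputs):
        return inputs  # zero-shot: the default prompt is passed through unchanged

def evaluate_zero_shot(models, tasks):
    # every model sees the same default prompt per task; no fine-tuning step
    return {
        m.name: {
            t.name: t.score([m.generate(t.prompt, i["input"]) for i in t.instances])
            for t in tasks
        }
        for m in models
    }

task = Task("copy", "Repeat the input.",
            [{"input": "a", "output": "a"}, {"input": "b", "output": "c"}])
scores = evaluate_zero_shot([EchoModel()], [task])
```

Here the echo model matches one of the two gold outputs, so `scores["echo"]["copy"]` is 0.5; fixing the prompt per task isolates model capability from prompt engineering, which is the fairness argument the paper makes.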