Modality-Fair Preference Optimization for Trustworthy MLLM Alignment
Authors: Songtao Jiang, Yan Zhang, Ruizhe Chen, Tianxiang Hu, Yeying Jin, Qinglin He, Yang Feng, Jian Wu, Zuozhu Liu
IJCAI 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Extensive experiments on three trustworthiness benchmarks demonstrate that MFPO significantly enhances the trustworthiness of MLLMs. In particular, it enables the 7B models to attain trustworthiness levels on par with, or even surpass, those of the 13B, 34B, and larger models. |
| Researcher Affiliation | Collaboration | ¹Zhejiang University, ²ByteDance, ³National University of Singapore, ⁴Angelalign Inc., China |
| Pseudocode | No | The paper describes methodologies using text and mathematical equations (e.g., LRM, LDPO, Ltext, Limage, Lmargin, H(P)) but does not include any explicitly labeled pseudocode or algorithm blocks with structured steps. |
| Open Source Code | No | The paper does not provide any explicit statement about releasing source code for the described methodology, nor does it include links to a code repository. |
| Open Datasets | Yes | We employ three widely used benchmarks to evaluate trustworthiness reflecting the degree of hallucination. Object Hal Bench [Rohrbach et al., 2018]... MMHal-Bench [Sun et al., 2023]... AMBER [Wang et al., 2023a]... For general capabilities, we employ the LLaVA-Bench [Liu et al., 2024b]. |
| Dataset Splits | Yes | After calculating entropy for all training samples, we rank the training dataset according to their entropy scores, with higher values denoting more challenging inputs. We then divide the dataset into three distinct difficulty levels: easy, medium, and hard. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory specifications used for running the experiments. It only mentions using 'LLaVA-v1.5 as the backbone for all experiments' without hardware context. |
| Software Dependencies | No | The paper mentions using 'LLaVA-v1.5 as the backbone' and validating with 'LLaVA-v1.6' but does not specify version numbers for these or any other ancillary software components, libraries, or frameworks used in the implementation. |
| Experiment Setup | No | The paper states, 'The training consists of three stages: the first two stages follow standard LLaVA training, while MFPO is introduced in the third stage. Here, we construct image preference data based on Section 3.1, using text preference data from RLHF-V [Yu et al., 2024a], and apply MFPO optimization. Details are in Supplementary Section 4.' The specific hyperparameters or detailed training configurations are deferred to the supplementary material and not provided in the main text. |
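The entropy-based difficulty split quoted under "Dataset Splits" can be sketched in a few lines. This is a hypothetical illustration, not the authors' code: the paper computes an entropy score H(P) per training sample and partitions the ranked dataset into thirds, but the exact distribution P and the split boundaries are not specified in the main text, so the uniform three-way split below is an assumption.

```python
import math

def entropy(probs):
    """Shannon entropy H(P) of a probability distribution (natural log)."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def split_by_difficulty(samples, probs_per_sample):
    """Rank samples by entropy (higher = more challenging) and split the
    ranked list into easy/medium/hard thirds, as described in the paper.
    Assumes a uniform three-way split, which the main text does not confirm."""
    scored = sorted(zip(samples, map(entropy, probs_per_sample)),
                    key=lambda pair: pair[1])
    n = len(scored)
    easy = [s for s, _ in scored[: n // 3]]
    medium = [s for s, _ in scored[n // 3 : 2 * n // 3]]
    hard = [s for s, _ in scored[2 * n // 3 :]]
    return easy, medium, hard
```

For example, a sample whose model output distribution is peaked (`[1.0]`) scores lower entropy than a uniform one (`[0.5, 0.5]`) and therefore lands in the easy bucket.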