Multimodal Knowledge Retrieval-Augmented Iterative Alignment for Satellite Commonsense Conversation

Authors: Qian Li, Xuchen Li, Zongyu Chang, Yuzheng Zhang, Cheng Ji, Shangguang Wang

IJCAI 2025

Reproducibility Assessment (Variable: Result. LLM Response)
Research Type: Experimental. Experimental results demonstrate that Sat-RIA outperforms existing large language models and provides more comprehensible answers with fewer hallucinations. The paper includes a dedicated section '5 Experiments' with subsections '5.1 Evaluation Datasets', '5.2 Evaluation Metrics', '5.3 Comparison Methods', and '5.5 Main Results', the last of which features a performance comparison table.
Researcher Affiliation: Academia. 1) School of Computer Science, Beijing University of Posts and Telecommunications, China; 2) Institute of Automation, Chinese Academy of Sciences and Zhongguancun Academy, China; 3) SKLCCSE, School of Computer Science and Engineering, Beihang University, China.
Pseudocode: No. The paper describes its methods through text and mathematical formulas (Equations 1-7) but does not include any clearly labeled pseudocode or algorithm blocks.
Open Source Code: No. The paper contains no explicit statement about releasing the source code for the methodology, nor does it provide a link to a code repository.
Open Datasets: No. 'To evaluate our models on satellite commonsense conversation, we construct two datasets: one for satellite multi-turn dialogues (Sat Diag) and one for satellite visual question-answering (Sat VQA) (more details in Appendix C).' The paper does not provide concrete access information (link, DOI, repository, or external citation) for these constructed datasets in the main body.
Dataset Splits: No. The paper describes the size and content of the constructed datasets (e.g., 'The Sat Diag dataset includes 2,000 dialogues'; 'The Sat VQA dataset consists of 2,000 labeled examples') but does not specify training, validation, or test splits, and mentions neither cross-validation nor any other splitting methodology.
Hardware Specification: Yes. 'We have trained our model through the method of full parameter fine-tuning, using a 2x A800 80G machine'; all experiments were conducted on the same machine.
Software Dependencies: No. The paper mentions the PyTorch framework and specific models such as InternVL2-8B and LLaMA3-8B, but it does not provide version numbers for PyTorch or any other ancillary software libraries or tools.
Experiment Setup: Yes. A total batch size of 1 is used throughout the training process. The AdamW optimizer [Loshchilov and Hutter, 2019] is applied with cosine learning-rate decay and a warm-up period. In the training stage, each alignment epoch count is 1, with a learning rate of 1e-5 and a warmup ratio of 0.05.
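The quoted schedule (AdamW with cosine learning-rate decay, a warm-up ratio of 0.05, and a peak learning rate of 1e-5) can be sketched as a plain-Python learning-rate function. This is a minimal illustration of the standard warmup-plus-cosine pattern, not the paper's code; the function name and step counts are illustrative assumptions.

```python
import math

def cosine_lr_with_warmup(step, total_steps, base_lr=1e-5, warmup_ratio=0.05):
    """Learning rate at a given optimizer step: linear warm-up to base_lr
    over the first warmup_ratio fraction of steps, then cosine decay
    toward zero over the remaining steps."""
    warmup_steps = max(1, int(total_steps * warmup_ratio))
    if step < warmup_steps:
        # Linear warm-up from near zero up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine decay over the remaining steps (progress goes 0 -> 1).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

# Peak LR is reached at the end of warm-up, then decays smoothly.
peak = cosine_lr_with_warmup(49, 1000)   # last warm-up step (warmup_steps = 50)
tail = cosine_lr_with_warmup(999, 1000)  # near the end of training, close to 0
```

In a PyTorch training loop this corresponds to pairing `torch.optim.AdamW` with a scheduler implementing the same curve, e.g. `torch.optim.lr_scheduler.LambdaLR` wrapping a function like the one above, or a ready-made warmup-cosine scheduler from a training library.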