Multilingual Mathematical Reasoning: Advancing Open-Source LLMs in Hindi and English

Authors: Avinash Anand, Kritarth Prasad, Chhavi Kirtani, Ashwin R Nair, Manvendra Kumar Nema, Raj Jaiswal, Rajiv Ratn Shah

AAAI 2025 | Venue PDF | Archive PDF | Plain Text | LLM Run Details

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Our experiments result in notable performance enhancements: Wizard Math 7B exceeds Gemini's accuracy on English datasets by +6% and matches Gemini's performance on Hindi datasets.
Researcher Affiliation | Academia | Indraprastha Institute of Information Technology, Delhi
Pseudocode | No | The paper describes methodologies such as the Decomposition Strategy and the Structured Solution Approach with Curriculum Learning in text and through figures, but does not present them in structured pseudocode or algorithm blocks.
Open Source Code | Yes | Code and Dataset: https://github.com/midasresearch/Multilingual-Mathematical-Reasoning.git
Open Datasets | Yes | Our evaluations on the GSM8K (Cobbe et al. 2021) and MATH (Hendrycks et al. 2021) datasets reveal a stark contrast in their capabilities. ... (Sharma, Mishra, and Sharma 2022) released HAWP (Hindi Arithmetic Word Problems), which is the only publicly available dataset of Hindi mathematical questions. ... Code and Dataset: https://github.com/midasresearch/Multilingual-Mathematical-Reasoning.git
Dataset Splits | Yes | These refined solutions, with a 70%/30% training/testing split, were then used to fine-tune the models OpenHathi 7B, WizardMath-v1.1 7B, and LLeMMa 7B. ... Each dataset was divided into 70% for training and 30% for testing, ensuring this split was consistently applied across all problem categories: easy, medium, and hard.
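The per-category 70/30 split described above can be sketched as a stratified split. This is a minimal illustration, not the authors' code; the `difficulty` field name and the fixed seed are assumptions.

```python
import random

def stratified_split(problems, train_frac=0.7, seed=0):
    """Split problems 70/30 within each difficulty category (easy,
    medium, hard), so the ratio is consistent across categories."""
    rng = random.Random(seed)
    by_category = {}
    for p in problems:
        by_category.setdefault(p["difficulty"], []).append(p)
    train, test = [], []
    for items in by_category.values():
        rng.shuffle(items)                      # shuffle within the category
        cut = int(len(items) * train_frac)      # 70% boundary per category
        train.extend(items[:cut])
        test.extend(items[cut:])
    return train, test
```

Splitting the pooled dataset instead of each category would not guarantee the paper's stated property that the 70/30 ratio holds for easy, medium, and hard problems alike.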
Hardware Specification | No | The paper does not provide specific details about the hardware (e.g., GPU models, CPU types, memory) used for running the experiments.
Software Dependencies | No | The paper mentions using GPT-4 and LLAMA 3 (405B) for generating and translating data, but does not provide specific version numbers for these or other ancillary software components (e.g., programming languages, libraries, frameworks) required for reproducibility.
Experiment Setup | No | The paper describes the general methodology, including zero-shot, few-shot chain-of-thought, supervised fine-tuning, curriculum learning, and instruction-tuning. However, it does not provide concrete hyperparameter values such as learning rates, batch sizes, number of epochs, or optimizer settings used during training.
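To make the reproducibility gap concrete, the fragment below lists the fine-tuning settings a reproduction would need the paper to report. Every value here is hypothetical and invented for illustration; none appears in the paper.

```python
# Hypothetical fine-tuning configuration. The keys show what a full
# experiment-setup section would need to specify; the values are
# placeholders, NOT taken from the paper.
finetune_config = {
    "model": "WizardMath-7B-v1.1",  # one of the three fine-tuned models
    "learning_rate": 2e-5,          # hypothetical
    "batch_size": 16,               # hypothetical
    "num_epochs": 3,                # hypothetical
    "optimizer": "adamw",           # hypothetical
    "lr_scheduler": "cosine",       # hypothetical
    "max_seq_length": 2048,         # hypothetical
    "seed": 42,                     # hypothetical
}
```

Without such values being stated, two reproductions of the same paper can diverge substantially in accuracy even on identical data.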