Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].
Collapsed Language Models Promote Fairness
Authors: Jingxuan Xu, Wuyang Chen, Linyi Li, Yao Zhao, Yunchao Wei
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Comprehensive experiments on both intrinsic and extrinsic evaluations demonstrate that our regularization can consistently debias language models. It is orthogonal to a wide range of highly tailored fairness algorithms, and thus can be plug-and-play adopted without sacrificing the models' performance on typical downstream language tasks. |
| Researcher Affiliation | Academia | Jingxuan Xu¹, Wuyang Chen², Linyi Li², Yao Zhao¹, Yunchao Wei¹ — ¹Beijing Jiaotong University, ²Simon Fraser University |
| Pseudocode | No | The paper does not contain explicitly labeled pseudocode or algorithm blocks. It describes methods using mathematical equations and descriptive text. |
| Open Source Code | Yes | We attach our code at https://github.com/Xujxyang/Fairness-NC-main. |
| Open Datasets | Yes | Mabel on SNLI (Bowman et al., 2015) and MNLI (Williams et al., 2017); ASE on OntoNotes (Hovy et al., 2006); BEC on TinyStories (Eldan & Li, 2023); Among these, only BEC's training dataset (Webster et al., 2018) is relatively small, while the datasets in other works exceed 100K sentences. |
| Dataset Splits | No | The paper mentions training models for a certain number of epochs (e.g., "training it for two epochs," "trained for 50 epochs," "trained for three epochs") and fine-tuning on specific datasets (e.g., "We fine-tune models on the OntoNotes 5.0 dataset... and then evaluate on the WinoBias benchmark"). However, it does not provide explicit percentages, absolute sample counts, or specific methodologies for dividing datasets into training, validation, and test sets. It refers to existing benchmarks and datasets without detailing how splits were handled for the experiments presented. |
| Hardware Specification | No | The paper does not provide specific hardware details such as GPU models, CPU types, or memory used for running the experiments. |
| Software Dependencies | No | The paper does not list specific software dependencies with their version numbers, such as Python libraries, frameworks (e.g., PyTorch, TensorFlow), or CUDA versions. |
| Experiment Setup | Yes | Mabel+(U)NC3: We implement Mabel with a batch size of 24, a learning rate of 5×10⁻⁵, and use the Adam optimizer, training it for two epochs. ASE+(U)NC3: ASE is trained for 50 epochs with the Adam optimizer. The learning rate is set to 2×10⁻⁵, a dropout probability of 0.1 is used, and a batch size of 6 is chosen. BEC+(U)NC3: BEC is trained for three epochs using the Adam optimizer, with a learning rate of 2×10⁻⁵ and a batch size of 16. |
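The hyperparameters reported in the Experiment Setup row can be collected into a single configuration sketch. This is an illustrative summary only — the variable names and structure are assumptions, not taken from the paper's released code:

```python
# Hypothetical summary of the three training configurations described in the
# paper's Experiment Setup; all values are quoted from the row above.
TRAINING_CONFIGS = {
    "Mabel+(U)NC3": {"optimizer": "Adam", "lr": 5e-5, "batch_size": 24, "epochs": 2},
    "ASE+(U)NC3":   {"optimizer": "Adam", "lr": 2e-5, "batch_size": 6,  "epochs": 50,
                     "dropout": 0.1},
    "BEC+(U)NC3":   {"optimizer": "Adam", "lr": 2e-5, "batch_size": 16, "epochs": 3},
}

def describe(name: str) -> str:
    """Render one configuration as a short human-readable line."""
    cfg = TRAINING_CONFIGS[name]
    extras = "".join(f", {k}={v}" for k, v in cfg.items()
                     if k not in ("optimizer", "lr", "batch_size", "epochs"))
    return (f"{name}: {cfg['optimizer']}, lr={cfg['lr']}, "
            f"batch={cfg['batch_size']}, epochs={cfg['epochs']}{extras}")

for name in TRAINING_CONFIGS:
    print(describe(name))
```

Note that all three setups share the Adam optimizer but differ substantially in epoch count (2 vs. 50 vs. 3), reflecting the different sizes of the underlying training datasets.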