ENSUR: Equitable and Statistically Unbiased Recommendation

Authors: Nitin Bisht, Xiuwen Gong, Guandong Xu

ICML 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Extensive experiments validate the effectiveness of the proposed framework, which aligns with our theoretical analysis. ... Finally, we conduct comprehensive experiments on top of five commonly used recommendation models and various datasets across multiple domains and fairness definitions, demonstrating the empirical efficiency and effectiveness of the proposed ENSUR, which also aligns with our theoretical analysis.
Researcher Affiliation | Academia | 1 University of Technology Sydney; 2 The Education University of Hong Kong. Correspondence to: Xiuwen Gong <EMAIL>, Guandong Xu <EMAIL>.
Pseudocode | Yes | Algorithm 1: Guaranteed User Fairness Algorithm (GUFA)
Open Source Code | Yes | The code and implementation details are available at https://github.com/kalpiree/ENSUR
Open Datasets | Yes | We conduct experiments on four datasets with specific sensitive user attributes: (1) Amazon Office dataset (e-commerce) (McAuley et al., 2015); (2) Last.fm dataset (music streaming) (Cantador et al., 2011); (3) MovieLens dataset (movie ratings) (Harper & Konstan, 2015); and (4) Book-Crossing dataset (book ratings) (Ziegler et al., 2005).
Dataset Splits | Yes | We employed the Leave-One-Out (LOO) strategy (He et al., 2017; Han et al., 2023) to partition the dataset into training, calibration, and testing sets. Specifically, for each user, one interaction was isolated for calibration and one for testing, while the remaining interactions were used for training.
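The leave-one-out partition described above can be sketched in plain Python. This is an illustrative reading of the described protocol, not the paper's released code: it assumes each user's interactions are chronologically ordered and holds out the last two interactions (one for calibration, one for testing); the function name and the handling of users with too few interactions are assumptions.

```python
from collections import defaultdict

def leave_one_out_split(interactions):
    """Per-user leave-one-out split into train / calibration / test.

    `interactions` is a list of (user, item) pairs, assumed ordered in
    time within each user. The last interaction goes to the test set,
    the second-to-last to the calibration set, and the rest to training.
    This mirrors the LOO description in the paper; exact edge-case
    handling here is illustrative.
    """
    by_user = defaultdict(list)
    for user, item in interactions:
        by_user[user].append(item)

    train, calib, test = [], [], []
    for user, items in by_user.items():
        if len(items) < 3:
            # Too few interactions to hold out two; keep all for training
            # (an assumption -- the paper does not specify this case).
            train.extend((user, it) for it in items)
            continue
        train.extend((user, it) for it in items[:-2])
        calib.append((user, items[-2]))
        test.append((user, items[-1]))
    return train, calib, test
```

For a user with interactions [a, b, c], this yields a in training, b in calibration, and c in testing.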
Hardware Specification | No | The paper does not provide specific details about the hardware used for running the experiments. It mentions using software libraries but no CPU/GPU models or memory specifications.
Software Dependencies | No | The paper mentions using Python for optimization with MIP (Santos & Toffolo, 2020) and refers to Gurobi (Gurobi Optimization, LLC, 2024). It also mentions the Adam optimizer and Binary Cross Entropy Loss (BCELoss). However, it does not provide specific version numbers for Python itself or for other key libraries/frameworks (e.g., PyTorch, TensorFlow) used to implement the models, which are necessary for full reproducibility.
Experiment Setup | Yes | All base recommender models are trained for 20 epochs with a batch size of 256, a learning rate of 0.001, the Adam optimizer, and Binary Cross Entropy Loss (BCELoss). ... DeepFM: Combines 8 latent factors with deep layers of [50, 25, 10] and ReLU activation. GMF: Utilizes an embedding size of 8 for capturing linear interactions between user and item embeddings. MLP: Employs layers of [64, 32, 16] with ReLU activation for modeling non-linear interactions. NeuMF: Integrates GMF and MLP with a GMF embedding size of 8 and MLP layers of [64, 32, 16], using ReLU activation. LightGCN: Configured with an embedding size of 8 and 3 graph convolution layers.
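The hyperparameters reported in this row can be collected into a single configuration, which makes the shared-versus-per-model split explicit. A minimal sketch in plain Python: the values come from the quoted setup, but the key names and the `config_for` helper are illustrative and not taken from the paper's released code.

```python
# Shared training settings reported for all five base recommenders.
SHARED = {
    "epochs": 20,
    "batch_size": 256,
    "learning_rate": 0.001,
    "optimizer": "Adam",
    "loss": "BCELoss",
}

# Model-specific settings as quoted in the experiment setup; key names
# are illustrative.
MODELS = {
    "DeepFM":   {"latent_factors": 8, "deep_layers": [50, 25, 10], "activation": "ReLU"},
    "GMF":      {"embedding_size": 8},
    "MLP":      {"layers": [64, 32, 16], "activation": "ReLU"},
    "NeuMF":    {"gmf_embedding_size": 8, "mlp_layers": [64, 32, 16], "activation": "ReLU"},
    "LightGCN": {"embedding_size": 8, "num_gcn_layers": 3},
}

def config_for(model_name):
    """Merge the shared training settings with one model's settings."""
    return {**SHARED, **MODELS[model_name]}
```

For example, `config_for("LightGCN")` yields the shared optimizer, loss, and schedule alongside the embedding size of 8 and 3 graph convolution layers.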