DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization
Authors: Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, we conduct experiments across multiple modalities on 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of our DivIL framework. |
| Researcher Affiliation | Academia | Jiaqi Wang (The Chinese University of Hong Kong); Qiguang Chen (Harbin Institute of Technology); Yongqiang Chen (CMU CLeaR Group); James Cheng (The Chinese University of Hong Kong) |
| Pseudocode | Yes | Algorithm 1 Overall Training Objective of DivIL |
| Open Source Code | Yes | Our code is available at https://github.com/kokolerk/DivIL. |
| Open Datasets | Yes | We employed one synthetic dataset along with eight realistic datasets, including the Spurious-Motif datasets introduced in Wu et al. (2022). We evaluate DivIL on the synthetic dataset Colored MNIST following Arjovsky et al. (2019). Inspired by Qin et al. (2024), we also demonstrated the effectiveness of our method in NLP through a Natural Language Inference (NLI) (Dagan et al., 2013) task, which assesses the logical relationship between two sentences... Our model was trained on a subset of the SNLI (Bowman et al., 2015) training set and evaluated on selected cases from the SNLI validation set, as well as the matched and mismatched subsets of the MNLI (Williams et al., 2017) validation set. DrugOOD datasets: To evaluate the OOD performance in realistic scenarios with realistic distribution shifts, we also include three datasets from the DrugOOD benchmark (Ji et al., 2022). |
| Dataset Splits | Yes | For each dataset, we generate 3,000 graphs for each class in the training set, and 1,000 graphs for each class in the validation set and testing set, respectively. ... Table 5: Statistics of the constructed OOD NLI dataset. Train set: SNLI, 7,992 examples, train partition, Image Captions from the Flickr30k Corpus, ACC. Test set: SNLI, 991 examples, validation partition, Image Captions from the Flickr30k Corpus, ACC; MNLI, 1,000 examples, validation-matched partition, Fiction, Government, Slate, Telephone, Travel, ACC; MNLI, 1,000 examples, validation-mismatched partition, 9/11, Face-to-Face, Letters, OUP, Verbatim, ACC. |
| Hardware Specification | Yes | We ran our experiments on Linux Servers installed with 3090 graphics cards and CUDA 10.2. |
| Software Dependencies | Yes | We implement our methods with PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019). We ran our experiments on Linux Servers installed with 3090 graphics cards and CUDA 10.2. |
| Experiment Setup | Yes | In the experimental setup in Section 5.2, the network is a 3-layer MLP with ReLU activation, optimized with Adam (Kingma & Ba, 2015). IRM selected the following hyperparameters by random search over 50 trials: hidden dimension of 390, l2 regularizer weight of 0.00110794568, learning rate of 0.0004898536566546834, penalty anneal iters (or warmup iters) of 190, penalty weight (λ) of 91257.18613115903, 501 epochs, and batch size 25,000 (half of the dataset size). For the implementation of the invariant losses (IRM, VREx, and Fishr), we strictly keep the same hyperparameter values in our implementation and the code is almost unchanged from https://github.com/alexrame/fishr. To account for the varying degrees of over-invariance introduced by different IL methods, we performed a straightforward search over β values of {0.01, 0.05, 0.1, 0.2} and projection mask probabilities of {0.3, 0.5, 0.7}, while keeping the random augmentation mask probability fixed at 0.2. We employed a pretrained GPT-2 model with a randomly initialized classification head. We set the maximum token length to 64 and trained the model for 5 epochs using the AdamW optimizer. The learning rate was set to 2e-5, with a weight decay of 0.01 and a linear learning rate scheduler. We used a training batch size of 32. By default, we fix the temperature to 1 in the unsupervised contrastive loss, and only search the penalty weight of the contrastive loss from {0.1, 0.2, 0.5, 1, 2} according to the validation performance. We select the best random mask percentage p from {0.2, 0.3, 0.5, 0.7} according to the validation performance. For the implementation of graph data augmentation, we use the tool from You et al. (2020). We select the best percentage p2 of node dropping, edge removal, and subgraph extraction from {0.05, 0.1, 0.15, 0.2} according to the validation performance to create the positive pair, and keep p1 = 0, representing the sample itself. |
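The Experiment Setup row describes a combined objective: an ERM loss, an invariant-learning penalty (IRM/VREx/Fishr), and a β-weighted unsupervised contrastive loss over a view produced by a random feature mask (mask probability fixed at 0.2, temperature fixed at 1, β searched over {0.01, 0.05, 0.1, 0.2}). The following is a minimal PyTorch sketch of that structure as we read it from the quoted setup; all function and variable names are illustrative and are not taken from the DivIL codebase:

```python
import torch
import torch.nn.functional as F

def random_feature_mask(z, p=0.2):
    """Random augmentation mask (probability fixed at 0.2 in the setup):
    zero out each feature independently with probability p."""
    mask = (torch.rand_like(z) > p).float()
    return z * mask

def info_nce_loss(z1, z2, temperature=1.0):
    """Unsupervised contrastive (InfoNCE-style) loss between two views;
    temperature defaults to 1 as in the reported setup."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (N, N) cosine-similarity matrix
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

def divil_objective(erm_loss, il_penalty, z, penalty_weight, beta=0.1, p=0.2):
    """Illustrative combined objective: ERM loss + weighted IL penalty
    (e.g. IRM/VREx/Fishr) + beta-weighted contrastive term, with beta
    searched over {0.01, 0.05, 0.1, 0.2} per the reported search."""
    z_aug = random_feature_mask(z, p)
    return erm_loss + penalty_weight * il_penalty + beta * info_nce_loss(z, z_aug)
```

For graph data, the masked view would instead come from the You et al. (2020) augmentations (node dropping, edge removal, subgraph extraction with percentage p2); the weighting structure stays the same.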