DivIL: Unveiling and Addressing Over-Invariance for Out-of-Distribution Generalization
Authors: Jiaqi Wang, Yuhang Zhou, Zhixiong Zhang, Qiguang Chen, Yongqiang Chen, James Cheng
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Furthermore, we conduct experiments across multiple modalities on 12 datasets and 6 classic models, verifying our over-invariance insight and the effectiveness of our DivIL framework. |
| Researcher Affiliation | Academia | Jiaqi Wang (The Chinese University of Hong Kong); Qiguang Chen (Harbin Institute of Technology); Yongqiang Chen (CMU CLeaR Group); James Cheng (The Chinese University of Hong Kong) |
| Pseudocode | Yes | Algorithm 1 Overall Training Objective of DivIL |
| Open Source Code | Yes | Our code is available at https://github.com/kokolerk/DivIL. |
| Open Datasets | Yes | We employed one synthetic dataset along with eight realistic datasets, including the Spurious-Motif datasets introduced in Wu et al. (2022). We evaluate DivIL on the synthetic dataset Colored MNIST following Arjovsky et al. (2019). Inspired by Qin et al. (2024), we also demonstrated the effectiveness of our method in NLP through a Natural Language Inference (NLI) (Dagan et al., 2013) task, which assesses the logical relationship between two sentences... Our model was trained on a subset of the SNLI (Bowman et al., 2015) training set and evaluated on selected cases from the SNLI validation set, as well as the matched and mismatched subsets of the MNLI (Williams et al., 2017) validation set. DrugOOD datasets: To evaluate the OOD performance in realistic scenarios with realistic distribution shifts, we also include three datasets from the DrugOOD benchmark (Ji et al., 2022). |
| Dataset Splits | Yes | For each dataset, we generate 3,000 graphs for each class in the training set, and 1,000 graphs for each class in the validation set and testing set, respectively. ... Table 5: Statistics of the constructed OOD NLI dataset. Train set: SNLI, 7,992 examples, train partition, Image Captions from the Flickr30k Corpus, ACC. Test set: SNLI, 991 examples, validation partition, Image Captions from the Flickr30k Corpus, ACC; MNLI, 1,000 examples, validation-matched partition, Fiction, Government, Slate, Telephone, Travel, ACC; MNLI, 1,000 examples, validation-mismatched partition, 9/11, Face-to-Face, Letters, OUP, Verbatim, ACC. |
| Hardware Specification | Yes | We ran our experiments on Linux Servers installed with 3090 graphics cards and CUDA 10.2. |
| Software Dependencies | Yes | We implement our methods with PyTorch (Paszke et al., 2019) and PyTorch Geometric (Fey & Lenssen, 2019). We ran our experiments on Linux Servers installed with 3090 graphics cards and CUDA 10.2. |
| Experiment Setup | Yes | In the experimental setup in Section 5.2, the network is a 3-layer MLP with ReLU activation, optimized with Adam (Kingma & Ba, 2015). IRM selected the following hyperparameters by random search over 50 trials: hidden dimension of 390, l2 regularizer weight of 0.00110794568, learning rate of 0.0004898536566546834, penalty anneal iters (or warmup iters) of 190, penalty weight (λ) of 91257.18613115903, 501 epochs, and batch size 25,000 (half of the dataset size). For the implementation of the invariant losses (IRM, VREx, and Fishr), we strictly keep the same hyperparameter values in our implementation and the code is almost unchanged from https://github.com/alexrame/fishr. To account for the varying degrees of over-invariance introduced by different IL methods, we performed a straightforward search over β values of {0.01, 0.05, 0.1, 0.2} and projection mask probabilities of {0.3, 0.5, 0.7}, while keeping the random augmentation mask probability fixed at 0.2. We employed a pretrained GPT-2 model with a randomly initialized classification head. We set the maximum token length to 64 and trained the model for 5 epochs using the AdamW optimizer. The learning rate was set to 2e-5, with a weight decay of 0.01 and a linear learning rate scheduler. We used a training batch size of 32. By default, we fix the temperature to 1 in the unsupervised contrastive loss, and only search the penalty weight of the contrastive loss from {0.1, 0.2, 0.5, 1, 2} according to the validation performance. We select the best random mask percentage p from {0.2, 0.3, 0.5, 0.7} according to the validation performance. For the implementation of graph data augmentation, we use the tool from You et al. (2020). We select the best percentage p2 of node dropping, edge removal, and subgraph extraction from {0.05, 0.1, 0.15, 0.2} according to the validation performance to create the positive pair, and keep p1 = 0, representing the sample itself. |
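The Experiment Setup row describes a combined objective: an ERM loss, an invariant-learning penalty (IRM/VREx/Fishr), and a β-weighted unsupervised contrastive loss over a view produced by a random feature mask (mask probability fixed at 0.2, temperature fixed at 1, β searched over {0.01, 0.05, 0.1, 0.2}). The following is a minimal PyTorch sketch of that structure as we read it from the quoted setup; all function and variable names are illustrative and are not taken from the DivIL codebase:

```python
import torch
import torch.nn.functional as F

def random_feature_mask(z, p=0.2):
    """Random augmentation mask (probability fixed at 0.2 in the setup):
    zero out each feature independently with probability p."""
    mask = (torch.rand_like(z) > p).float()
    return z * mask

def info_nce_loss(z1, z2, temperature=1.0):
    """Unsupervised contrastive (InfoNCE-style) loss between two views;
    temperature defaults to 1 as in the reported setup."""
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature   # (N, N) cosine-similarity matrix
    targets = torch.arange(z1.size(0))   # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

def divil_objective(erm_loss, il_penalty, z, penalty_weight, beta=0.1, p=0.2):
    """Illustrative combined objective: ERM loss + weighted IL penalty
    (e.g. IRM/VREx/Fishr) + beta-weighted contrastive term, with beta
    searched over {0.01, 0.05, 0.1, 0.2} per the reported search."""
    z_aug = random_feature_mask(z, p)
    return erm_loss + penalty_weight * il_penalty + beta * info_nce_loss(z, z_aug)
```

For graph data, the masked view would instead come from the You et al. (2020) augmentations (node dropping, edge removal, subgraph extraction with percentage p2); the weighting structure stays the same.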