An Information Criterion for Controlled Disentanglement of Multimodal Data

Authors: Chenyu Wang, Sharut Gupta, Xinyi Zhang, Sana Tonekaboni, Stefanie Jegelka, Tommi Jaakkola, Caroline Uhler

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | Empirically, we demonstrate that DISENTANGLEDSSL successfully achieves both distinct coverage and disentanglement for representations on a suite of synthetic datasets and multiple real-world multimodal datasets. It consistently outperforms baselines on prediction tasks for vision-language data, as well as molecule-phenotype retrieval tasks for biological data. We conduct a simulation study and two real-world multimodal experiments to evaluate the efficacy of our proposed DISENTANGLEDSSL.
Researcher Affiliation | Academia | Chenyu Wang 1,2, Sharut Gupta 1, Xinyi Zhang 1,2, Sana Tonekaboni 2, Stefanie Jegelka 1,3, Tommi Jaakkola 1, Caroline Uhler 1,2 (1 MIT, 2 Broad Institute of MIT and Harvard, 3 TU Munich)
Pseudocode | Yes | We introduce a two-step training procedure. The first step optimizes the shared latent representation, ensuring it captures as close to the minimal necessary information as possible. The second step then uses the shared representations learned in step 1 to facilitate learning the modality-specific representations. This sequential approach is formalized in the optimization objectives given in Equations 3 and 4, with pseudocode provided in Appendix I.
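The two-step procedure quoted above can be sketched as follows. This is a minimal illustration, not the paper's implementation: an InfoNCE loss stands in for the shared-representation objective of Equation 3, and a cross-covariance penalty (weight `lam`) stands in for the disentanglement term of Equation 4; all dimensions, architectures, and hyperparameters are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def mlp(d_in, d_out, hidden=64):
    return nn.Sequential(nn.Linear(d_in, hidden), nn.ReLU(), nn.Linear(hidden, d_out))

def info_nce(za, zb, tau=0.1):
    # Contrastive alignment of paired representations across modalities.
    za, zb = F.normalize(za, dim=-1), F.normalize(zb, dim=-1)
    logits = za @ zb.T / tau
    return F.cross_entropy(logits, torch.arange(len(za)))

def decorrelate(a, b):
    # Squared cross-covariance: a cheap proxy for independence of two codes.
    a, b = a - a.mean(0), b - b.mean(0)
    return (a.T @ b / len(a)).pow(2).mean()

x1, x2 = torch.randn(32, 16), torch.randn(32, 16)  # toy paired modalities

# Step 1: learn shared encoders f1, f2.
f1, f2 = mlp(16, 8), mlp(16, 8)
opt1 = torch.optim.Adam([*f1.parameters(), *f2.parameters()], lr=1e-3)
for _ in range(10):
    loss1 = info_nce(f1(x1), f2(x2))
    opt1.zero_grad(); loss1.backward(); opt1.step()

# Step 2: freeze the shared encoders, then learn modality-specific
# encoders g1, g2 (with decoders d1, d2 for a reconstruction objective).
for p in [*f1.parameters(), *f2.parameters()]:
    p.requires_grad_(False)
g1, g2, d1, d2 = mlp(16, 8), mlp(16, 8), mlp(16, 16), mlp(16, 16)
opt2 = torch.optim.Adam(
    [*g1.parameters(), *g2.parameters(), *d1.parameters(), *d2.parameters()], lr=1e-3)
lam = 1e-3
for _ in range(10):
    s1, s2 = f1(x1), f2(x2)  # frozen shared representations
    m1, m2 = g1(x1), g2(x2)  # trainable modality-specific representations
    rec = F.mse_loss(d1(torch.cat([s1, m1], -1)), x1) \
        + F.mse_loss(d2(torch.cat([s2, m2], -1)), x2)
    loss2 = rec + lam * (decorrelate(s1, m1) + decorrelate(s2, m2))
    opt2.zero_grad(); loss2.backward(); opt2.step()
```

The key structural point is the sequencing: step 2 optimizes only `g1`, `g2` (and the decoders), treating the step-1 shared representations as fixed targets to be complemented rather than re-learned.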
Open Source Code | Yes | The code is available at https://github.com/uhlerlab/DisentangledSSL.
Open Datasets | Yes | Empirically, we demonstrate that DISENTANGLEDSSL successfully achieves both distinct coverage and disentanglement for representations on a suite of synthetic datasets and multiple real-world multimodal datasets... We utilize the real-world multimodal benchmark from MultiBench (Liang et al., 2021)... We use two high-content drug screening datasets which provide phenotypic profiles after drug perturbation: RxRx19a (Cuccarese et al., 2020) containing cell imaging profiles, and LINCS (Subramanian et al., 2017) containing L1000 gene expression profiles.
Dataset Splits | Yes | We follow the same setting (dataset splitting, encoder architecture, pre-extracted features) as in Liang et al. (2024). We conduct train-validation-test splitting according to molecules.
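Splitting "according to molecules" means every measurement of a given molecule must land in exactly one of train/validation/test, so the test set contains only unseen molecules. A minimal group-wise split can be sketched as below; this is an illustrative stand-in, not the authors' script, and the function and field names are hypothetical.

```python
import random
from collections import defaultdict

def split_by_group(samples, group_key, fracs=(0.8, 0.1, 0.1), seed=0):
    """Split samples so all entries sharing a group (e.g. a molecule)
    land in the same split."""
    groups = sorted({group_key(s) for s in samples})
    rng = random.Random(seed)
    rng.shuffle(groups)
    n_train = int(fracs[0] * len(groups))
    n_val = int(fracs[1] * len(groups))
    assign = {}
    for i, g in enumerate(groups):
        assign[g] = "train" if i < n_train else "val" if i < n_train + n_val else "test"
    out = defaultdict(list)
    for s in samples:
        out[assign[group_key(s)]].append(s)
    return out["train"], out["val"], out["test"]

# Toy usage: 10 molecules, 2 phenotype measurements each.
samples = [(f"mol{i // 2}", i) for i in range(20)]
train, val, test = split_by_group(samples, group_key=lambda s: s[0])
```

A random per-sample split would leak molecules across splits and inflate retrieval scores, which is why the grouping matters here.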
Hardware Specification | Yes | Each experiment was conducted on one NVIDIA RTX A5000 GPU with 24GB of accelerator RAM.
Software Dependencies | No | All experiments were implemented using the PyTorch deep learning framework. We utilize Mol2vec (Jaeger et al., 2018) to featurize the molecular structures into 300-dimensional feature vectors. The paper does not provide specific version numbers for PyTorch or Mol2vec.
Experiment Setup | Yes | We assess the performance of the learned shared and modality-specific representations for different values of β and λ, as shown in Figure 4. For DISENTANGLEDSSL, we sweep β ∈ {0.0, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0, 50.0, 100.0, 300.0, 500.0, 1000.0} and λ ∈ {0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0}. For the final models, we use β = 1.0 and λ = 10^-3 for all datasets, except for MOSI where β = 0.01. For both molecular structures and phenotypes, we employ 3-layer MLP encoders with a hidden dimension of 2560.
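The reported setup can be sketched as follows. The 3-layer MLP with hidden width 2560 and the β/λ grids come from the quoted text; the input dimension matches the 300-dim Mol2vec features mentioned earlier, while the 128-dim output is an assumption for illustration.

```python
import itertools
import torch
import torch.nn as nn

def encoder(d_in, d_out, hidden=2560):
    """3-layer MLP encoder (three Linear layers, hidden width 2560)."""
    return nn.Sequential(
        nn.Linear(d_in, hidden), nn.ReLU(),
        nn.Linear(hidden, hidden), nn.ReLU(),
        nn.Linear(hidden, d_out),
    )

# Hyperparameter grid from the reported sweep.
betas = [0.0, 0.001, 0.01, 0.1, 0.5, 1.0, 5.0, 10.0,
         50.0, 100.0, 300.0, 500.0, 1000.0]
lambdas = [0.0, 0.001, 0.01, 0.1, 1.0, 10.0, 100.0]
grid = list(itertools.product(betas, lambdas))

# e.g. a molecule encoder over 300-dim Mol2vec features
# (the 128-dim latent size is a placeholder, not from the paper).
enc = encoder(300, 128)
z = enc(torch.randn(4, 300))
```

Sweeping the full grid means training one model per (β, λ) pair, which is what Figure 4's performance curves across β and λ values correspond to.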