Neural Networks beyond explainability: Selective inference for sequence motifs
Authors: Antoine Villié, Philippe Veber, Yohann De Castro, Laurent Jacob
TMLR 2023
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Here we intend to go beyond explainable machine learning and introduce SEISM, a selective inference procedure to test the association between these extracted features and the predicted phenotype. ... We illustrate the behavior of our method in terms of calibration, power and speed and discuss its power/speed trade-off with a simpler data-split strategy. ... In order to assess the statistical validity of the SEISM procedure with the different strategies, we simulate datasets under the null hypothesis. ... Figure 5 (top) shows the Q-Q plot of the distribution of quantiles of the uniform distribution against the p-values obtained across 1000 datasets under the null hypothesis ... Figure 5 (bottom) shows the same Q-Q plot on data generated under the alternative hypothesis. |
| Researcher Affiliation | Academia | Antoine Villié: Université de Lyon, Université Lyon 1, CNRS, Vet Agro Sup, Laboratoire de Biométrie et Biologie Evolutive, UMR5558, Villeurbanne, France; Philippe Veber: Université de Lyon, Université Lyon 1, CNRS, Vet Agro Sup, Laboratoire de Biométrie et Biologie Evolutive, UMR5558, Villeurbanne, France; Yohann De Castro: Institut Camille Jordan, École Centrale Lyon, CNRS UMR 5208, Institut universitaire de France (IUF); Laurent Jacob: Sorbonne Université, CNRS, IBPS, Laboratory of Computational and Quantitative Biology (LCQB), UMR 7238, Paris 75005, France |
| Pseudocode | Yes | Algorithm 1 SEISM algorithm (general formulation) # Description: SEISM selects a set of sequence motifs $(z_1, \ldots, z_q)$ based on an association score $s(\cdot, \cdot)$, and evaluates their p-values based on a partition $\mathcal{Z} = \bigsqcup_i \mathcal{M}_i$. Inputs: Response $y \in \mathcal{Y} \subseteq \mathbb{R}^n$, sequence samples $X$, feature function $z \in \mathcal{Z} \mapsto \varphi^{z,X} \in \mathbb{R}^n$, association score $s : \mathcal{Z} \times \mathcal{Y} \to \mathbb{R}$, number of selected motifs $q \geq 1$, meshes $\mathcal{Z} = \bigsqcup_{i \geq 1} \mathcal{M}_i$, sampling algorithm HR. Result: $((p_1, z_1), \ldots, (p_q, z_q))$, sequence of p-values and sequence motifs. ... Algorithm 2 Hypersphere Directions hit-and-run sampler /* Description: The Hypersphere Directions hit-and-run sampler creates a discrete-time Markov chain on an open and bounded region and is used to approximate a uniform distribution on the selection event $E$. */ |
| Open Source Code | Yes | We provide a PyTorch implementation of SEISM at: https://gitlab.in2p3.fr/antoine.villie1/seism. |
| Open Datasets | Yes | In order to compare the accuracy of our selection step with existing motif discovery algorithms, we use the 40 ENCODE Transcription Factors ChIP-seq datasets from K562 cells (ENCODE Project Consortium, 2004), each of which contains a known TF motif, denoted m, derived using completely independent assays (Jolma et al., 2013). ... We rely on the ChIP-seq dataset from Chatagnon et al. (2015). |
| Dataset Splits | No | The data-split version of SEISM applies the same (i)-(ii) steps on a fraction of the data, and simply compares the scores of the selected motifs on the remaining data to the distribution of scores for the same motif with data sampled under the null distribution as opposed to the selective null generated by (iii)-(v). ... To that end, we draw one sequence motif $z$ with length $k = 8$ for each simulated dataset using a uniform distribution on $\mathcal{Z}$ restricted to motifs with an information level fixed at 10 bits. Then, we draw a set of $n = 30$ biological sequences $X$ as follows: all sites are generated according to a uniform distribution over {A, C, T, G} for all sequences, and for half of the sequences one k-mer is drawn according to the categorical model parameterized by $z$. The phenotypes $y$ are drawn from $\mathcal{N}(0, \sigma^2 C_n)$ to generate data under the null hypothesis for calibration experiments, and from $\mathcal{N}(\varphi^{z,X}, \sigma^2 C_n)$ to generate data under the alternative for experiments on statistical power, with $\sigma = 0.1$ in both cases. We then run the SEISM procedure to select and test two sequence motifs. |
| Hardware Specification | No | This work has been supported by ANR grants (FAST-BIG project ANR-17-CE23-0011-01 and PIECES project ANR-20-CE45-0017) and was performed using the computation facilities of the LBBE/PRABI. |
| Software Dependencies | No | We provide a PyTorch implementation of SEISM at: https://gitlab.in2p3.fr/antoine.villie1/seism. ... We rely on the Tomtom method (Gupta et al., 2007), which quantifies the probability that the Euclidean distance between a random motif and m is lower than the distance between the discovered motif and m. ... The k-mer list is obtained using the DSK software (Rizk et al., 2013). |
| Experiment Setup | Yes | SEISM is run with a regularization parameter $\lambda = 0.01$. ... The phenotypes $y$ are drawn from $\mathcal{N}(0, \sigma^2 C_n)$ to generate data under the null hypothesis for calibration experiments, and from $\mathcal{N}(\varphi^{z,X}, \sigma^2 C_n)$ to generate data under the alternative for experiments on statistical power, with $\sigma = 0.1$ in both cases. ... For the data-split strategy, we sample 1000 replicates under the null hypothesis to compute the p-value. For SEISM, we sample 50,000 replicates under the conditional null hypothesis using the hypersphere direction sampler, after 10,000 burn-in iterations. |
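
The data-split strategy quoted above computes a p-value by comparing the score of an already selected motif to scores obtained on 1000 phenotypes sampled under the null. A minimal sketch of that Monte Carlo test follows. It is not the authors' implementation and rests on several assumptions: `C_n` is taken to be the centering matrix `I - 11^T/n`, the association score is a hypothetical squared inner product, and the feature vector `phi` is a made-up binary indicator standing in for the motif activation. It does not implement the selective null sampled by the hit-and-run procedure.

```python
import numpy as np


def sample_null_phenotype(rng, n, sigma=0.1):
    """Draw y ~ N(0, sigma^2 C_n), ASSUMING C_n = I - 11^T/n (centering)."""
    g = rng.normal(0.0, sigma, size=n)
    # Centering g applies C_n; since C_n is symmetric and idempotent,
    # the result has covariance sigma^2 C_n.
    return g - g.mean()


def association_score(phi, y):
    """Hypothetical score: squared inner product of centered feature and y."""
    phi_c = phi - phi.mean()
    return float(phi_c @ y) ** 2


def monte_carlo_p_value(phi, y_obs, n_replicates=1000, sigma=0.1, seed=0):
    """Empirical p-value of the observed score against null replicates."""
    rng = np.random.default_rng(seed)
    s_obs = association_score(phi, y_obs)
    s_null = np.array([
        association_score(phi, sample_null_phenotype(rng, len(y_obs), sigma))
        for _ in range(n_replicates)
    ])
    # The add-one correction keeps the estimate strictly positive.
    return float((1 + np.sum(s_null >= s_obs)) / (1 + n_replicates))


# Usage with n = 30 and sigma = 0.1 as in the quoted setup.
rng = np.random.default_rng(1)
n = 30
phi = np.repeat([1.0, 0.0], n // 2)  # hypothetical feature: half the samples carry the motif
y_null = sample_null_phenotype(rng, n)
y_alt = (phi - phi.mean()) + sample_null_phenotype(rng, n)
p_null = monte_carlo_p_value(phi, y_null)
p_alt = monte_carlo_p_value(phi, y_alt)
```

With a signal this strong relative to `sigma`, `p_alt` sits at the smallest value the estimator can return, `1 / (n_replicates + 1)`. This test is only valid when the motif was chosen on data independent of `y_obs`, which is why the data-split strategy holds out part of the samples.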