Statistical Collusion by Collectives on Learning Platforms

Authors: Etienne Gauthier, Francis Bach, Michael I. Jordan

ICML 2025

Reproducibility Variable Result LLM Response
Research Type Experimental We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain. Our primary contribution is the introduction of a novel framework which empowers collectives via statistical inference. We thoroughly explore strategies that a collective can employ to achieve each of these goals and provide theoretical guarantees for the effectiveness of these strategies. We construct a synthetic dataset to validate our theoretical findings and to examine the influence of various parameters empirically.
Researcher Affiliation Academia 1 Inria, École Normale Supérieure, PSL Research University, Paris; 2 Departments of EECS and Statistics, University of California, Berkeley. Correspondence to: Etienne Gauthier <EMAIL>.
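Pseudocode Yes B. Algorithms. Algorithm 1: Signal planting lower bound, feature-label strategy.
  1: Input: $X$, $Y$, $N$, $N_{\text{test}}$, $n < N$, $D^{(n)}$, $g$, $y^*$, $\delta > 0$, $\varepsilon > 0$
  2: Define $\tilde{X} := \{ g(x) \mid x \in X \}$
  3: Observe $D^{(n)}$ and compute the per-feature estimate with superscript $(n)$ and subscript $\tilde{x}$ for every $\tilde{x} \in \tilde{X}$
  4: Define $h : (x, y) \mapsto (g(x), y^*)$
  5: Compute $\tilde{D}^{(n)}$ by applying $h$ to all samples in $D^{(n)}$
  6: Compute $\hat{P}_{\tilde{D}^{(n)}}(\tilde{x})$ for every $\tilde{x} \in \tilde{X}$
  7: Define $\delta' := \delta / (2 + 2\#\tilde{X} + 2\#\tilde{X}\,\#Y)$
  8: Compute $R_{\delta'}(n)$, $R_{\delta'}(N - n)$, and $R_{\delta'}(N_{\text{test}})$
  9: Compute and return the lower bound, an expression in $\hat{P}_{\tilde{D}^{(n)}}(\tilde{x})$, the step-3 estimates, $R_{\delta'}(n)$, $R_{\delta'}(N - n)$, $R_{\delta'}(N_{\text{test}})$, $(N - n)/N$, and $\varepsilon / (1 - \varepsilon)$; the exact formula is given in Appendix B of the paper.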
Open Source Code Yes Details on the algorithm for computing the lower bound are provided in Appendix B, and the code is available at: https://github.com/GauthierE/statistical-collusion.
Open Datasets No We construct a synthetic dataset to validate our theoretical findings and to examine the influence of various parameters empirically. Our empirical results highlight, among other things... In our experiments, we simulate a platform that collects data on vehicles... We generate a dataset of 3,000,000 instances. Each instance represents a car, characterized by multiple categorical features... For further details on the dataset composition, we refer to the code available at: https://github.com/GauthierE/statistical-collusion.
Dataset Splits Yes We generate separate consumer datasets, sampled without replacement from this base dataset. Unless otherwise specified, we choose N = 1,000,000 for the training set and Ntest = 100,000 for the test set.
Hardware Specification No The paper does not provide specific hardware details used for running its experiments.
Software Dependencies No The paper does not provide specific ancillary software details with version numbers.
Experiment Setup Yes In all the experiments, we set δ = 0.05 and ε = 0. In our experiments, the collective attempts to influence the platform by targeting features with specific characteristics defined through the transformation g: Model Type = SUV, Fuel Type = Diesel, Transmission Type = Manual, Drive Type = RWD, Safety Rating = 4 stars, Interior Material = Synthetic, Infotainment System = Premium, Warranty Length = 10 years, Number of Doors = 5, Number of Seats = 5, Air Conditioning = Yes, Navigation System = Advanced, Tire Type = All-Season, Sunroof = Yes, Sound System = Premium, Cruise Control = Yes, and Bluetooth Connectivity = Yes.
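
The rows above describe three concrete mechanics: sampling the training and test sets without replacement, the feature-label transformation h : (x, y) -> (g(x), y*), and the corrected confidence level from step 7 of Algorithm 1. The following Python sketch illustrates them on a scaled-down toy dataset. It is not the authors' code. The feature domains, the "Color" feature, the label set, the target label, the collective size n = 200, and all function names are assumptions made for illustration; only the three target values in TARGET_VALUES, δ = 0.05, and the form of the step-7 denominator come from the table. The sizes (3,000 / 1,000 / 100) are the reported ones divided by 1,000.

```python
import itertools
import random
from collections import Counter

# Subset of the target values listed under Experiment Setup; g overwrites these.
TARGET_VALUES = {"Model Type": "SUV", "Fuel Type": "Diesel", "Transmission Type": "Manual"}
Y_STAR = "Excellent"  # hypothetical target label y*

# Hypothetical categorical domains for a toy base dataset.
DOMAINS = {
    "Model Type": ["SUV", "Sedan", "Hatchback"],
    "Fuel Type": ["Diesel", "Petrol", "Electric"],
    "Transmission Type": ["Manual", "Automatic"],
    "Color": ["Red", "Blue"],  # hypothetical feature that g leaves untouched
}
LABELS = ["Excellent", "Good", "Poor"]  # hypothetical label set Y


def make_base(size, rng):
    """Draw a toy base dataset of (features, label) pairs uniformly at random."""
    return [
        ({k: rng.choice(v) for k, v in DOMAINS.items()}, rng.choice(LABELS))
        for _ in range(size)
    ]


def g(x):
    """Overwrite the targeted features with their target values."""
    return {**x, **TARGET_VALUES}


def h(sample):
    """Feature-label strategy: (x, y) -> (g(x), y*)."""
    x, _ = sample
    return g(x), Y_STAR


def key(x):
    """Hashable representation of a feature dictionary."""
    return tuple(sorted(x.items()))


def split_without_replacement(base, n_train, n_test, rng):
    """Sample disjoint training and test sets from the base dataset."""
    drawn = rng.sample(base, n_train + n_test)
    return drawn[:n_train], drawn[n_train:]


def empirical_feature_freq(dataset):
    """Empirical frequency of each feature vector in a dataset (step 6)."""
    counts = Counter(key(x) for x, _ in dataset)
    return {k: c / len(dataset) for k, c in counts.items()}


def transformed_feature_space():
    """Image of the full feature space under g (step 2)."""
    names = list(DOMAINS)
    return {
        key(g(dict(zip(names, values))))
        for values in itertools.product(*(DOMAINS[k] for k in names))
    }


rng = random.Random(0)
base = make_base(3000, rng)
train, test = split_without_replacement(base, 1000, 100, rng)

n = 200                                   # hypothetical collective size
planted = [h(s) for s in train[:n]]       # the collective's transformed data
platform_train = planted + train[n:]      # what the platform would train on
freq = empirical_feature_freq(planted)

delta = 0.05
x_tilde = transformed_feature_space()
# Step 7: split delta across 2 + 2#X~ + 2#X~#Y events.
delta_prime = delta / (2 + 2 * len(x_tilde) + 2 * len(x_tilde) * len(LABELS))
```

In this toy setting only "Color" survives the transformation, so the transformed feature space has two elements and the step-7 denominator is 2 + 4 + 12 = 18. The concentration terms of step 8 and the final bound of step 9 are left out because their definitions are not recoverable from this page.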