Statistical Collusion by Collectives on Learning Platforms
Authors: Etienne Gauthier, Francis Bach, Michael I. Jordan
ICML 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | We develop a framework that provides a theoretical and algorithmic treatment of these issues and present experimental results in a product evaluation domain. Our primary contribution is the introduction of a novel framework which empowers collectives via statistical inference. We thoroughly explore strategies that a collective can employ to achieve each of these goals and provide theoretical guarantees for the effectiveness of these strategies. We construct a synthetic dataset to validate our theoretical findings and to examine the influence of various parameters empirically. |
| Researcher Affiliation | Academia | ¹Inria, École Normale Supérieure, PSL Research University, Paris; ²Departments of EECS and Statistics, University of California, Berkeley. Correspondence to: Etienne Gauthier <EMAIL>. |
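| Pseudocode | Yes | B. Algorithms. Algorithm 1: Signal planting lower bound, feature-label strategy. 1: Input: $X$, $Y$, $N$, $N_{\text{test}}$, $n < N$, $D^{(n)}$, $g$, $y^*$, $\delta > 0$, $\varepsilon > 0$. 2: Define $\tilde{X} := \{g(x) \mid x \in X\}$. 3: Observe $D^{(n)}$ and compute $\Delta^{(n)}_{\tilde{x}}$ for every $\tilde{x} \in \tilde{X}$. 4: Define $h : (x, y) \mapsto (g(x), y^*)$. 5: Compute $\tilde{D}^{(n)}$ by applying $h$ to all samples in $D^{(n)}$. 6: Compute $\hat{P}_{\tilde{D}^{(n)}}(\tilde{x})$ for every $\tilde{x} \in \tilde{X}$. 7: Define $\tilde{\delta} := \delta / (2 + 2\,\#\tilde{X} + 2\,\#\tilde{X}\,\#Y)$. 8: Compute $R_{\tilde{\delta}}(n)$, $R_{\tilde{\delta}}(N - n)$, and $R_{\tilde{\delta}}(N_{\text{test}})$. 9: Compute and return the lower bound, an expression in $\hat{P}_{\tilde{D}^{(n)}}(\tilde{x})$, $\Delta^{(n)}_{\tilde{x}}$, $R_{\tilde{\delta}}(n)$, $R_{\tilde{\delta}}(N - n)$, $R_{\tilde{\delta}}(N_{\text{test}})$, the sample-size terms $N - n$ and $N$, and $\varepsilon / (1 - \varepsilon)$, with a positivity condition ($> 0$) on part of it (see Appendix B of the paper for the exact form). |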
| Open Source Code | Yes | Details on the algorithm for computing the lower bound are provided in Appendix B, and the code is available at: https://github.com/GauthierE/statistical-collusion. |
| Open Datasets | No | We construct a synthetic dataset to validate our theoretical findings and to examine the influence of various parameters empirically. Our empirical results highlight, among other things... In our experiments, we simulate a platform that collects data on vehicles... We generate a dataset of 3,000,000 instances. Each instance represents a car, characterized by multiple categorical features... For further details on the dataset composition, we refer to the code available at: https://github.com/GauthierE/statistical-collusion. |
| Dataset Splits | Yes | We generate separate consumer datasets, sampled without replacement from this base dataset. Unless otherwise specified, we choose N = 1,000,000 for the training set and Ntest = 100,000 for the test set. |
| Hardware Specification | No | The paper does not provide specific hardware details used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific ancillary software details with version numbers. |
| Experiment Setup | Yes | In all the experiments, we set δ = 0.05 and ε = 0. In our experiments, the collective attempts to influence the platform by targeting features with specific characteristics defined through the transformation g: Model Type = SUV, Fuel Type = Diesel, Transmission Type = Manual, Drive Type = RWD, Safety Rating = 4 stars, Interior Material = Synthetic, Infotainment System = Premium, Warranty Length = 10 years, Number of Doors = 5, Number of Seats = 5, Air Conditioning = Yes, Navigation System = Advanced, Tire Type = All-Season, Sunroof = Yes, Sound System = Premium, Cruise Control = Yes, and Bluetooth Connectivity = Yes. |
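
The dataset split and the feature-label strategy quoted in the table can be illustrated with a minimal Python sketch. It is not the authors' code (that lives in the linked repository). The helper names `split_without_replacement`, `make_g`, and `plant_signal` are hypothetical, and so are the toy base dataset, the scaled-down sizes, and the label values `"Good"` and `"Excellent"`. The sketch also assumes that `g` overwrites the targeted features and leaves the others unchanged, which the table does not state explicitly.

```python
import random


def split_without_replacement(base, n_train, n_test, seed=0):
    """Draw disjoint train and test sets from `base` without replacement."""
    if n_train + n_test > len(base):
        raise ValueError("base dataset is too small for the requested split")
    rng = random.Random(seed)
    # One draw of distinct indices guarantees the two sets never overlap.
    drawn = rng.sample(range(len(base)), n_train + n_test)
    train = [base[i] for i in drawn[:n_train]]
    test = [base[i] for i in drawn[n_train:]]
    return train, test


def make_g(target):
    """Build g: overwrite the targeted features, keep the rest (assumption)."""
    def g(x):
        return {**x, **target}  # returns a new dict; x itself is not mutated
    return g


def plant_signal(samples, g, y_star):
    """Apply h: (x, y) -> (g(x), y_star) to every sample of the collective."""
    return [(g(x), y_star) for x, _ in samples]


# Toy base dataset: 3,000 cars instead of the 3,000,000 in the experiments.
base = [
    ({"Model Type": "Sedan", "Fuel Type": "Petrol", "id": i}, "Good")
    for i in range(3000)
]

# Scaled-down counterparts of N = 1,000,000 and Ntest = 100,000.
train, test = split_without_replacement(base, n_train=1000, n_test=100, seed=1)

# Two of the targeted characteristics listed under Experiment Setup.
g = make_g({"Model Type": "SUV", "Fuel Type": "Diesel"})

# The collective controls the first n = 50 training samples (hypothetical n).
planted = plant_signal(train[:50], g, y_star="Excellent")
```

Drawing all indices in a single `random.sample` call keeps the training and test sets disjoint by construction, which matches the "sampled without replacement" description in the Dataset Splits row.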