How Much is Unseen Depends Chiefly on Information About the Seen
Authors: Seongmin Lee, Marcel Boehme
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimator's MSE is roughly 80% of the Good-Turing estimator's. |
| Researcher Affiliation | Academia | Seongmin Lee and Marcel Böhme, MPI for Security and Privacy, Germany EMAIL |
| Pseudocode | Yes | Algorithm 1 Genetic Algorithm. Input: target frequency k, sample X^n, iteration limit G, mutant size m. 1: Population P_0 = {r_0}; 2: Fitness f_best = f_0 = fitness(r_0); 3: Limit G_L = G; 4: for g from 1 to G_L do; 5: P' = selectTopM(P_{g−1}, m); 6: P' = lapply(P', mutate); 7: P_g = P' ∪ {r_0} ∪ selectTopM(P_{g−1}, 3); 8: f_g = min(lapply(P_g, fitness)); 9: if (g = G_L) ∧ ((f_g = f_0) ∨ (f_best > 0.95 · f_g)) then; 10: G_L = G_L + G; 11: f_best = f_g; 12: Estimator M̂^Evo_k = instantiate(selectTopM(P_{G_L}, 1)). Output: minimal-MSE estimator M̂^Evo_k |
| Open Source Code | Yes | Open Science and Replication. For scrutiny and replicability, we publish all our evaluation scripts at: https://github.com/niMgnoeSeeL/UnseenGA. |
| Open Datasets | Yes | Distributions. We use the same six multinomial distributions that are used in previous evaluations (Orlitsky & Suresh, 2015; Orlitsky et al., 2016; Hao & Li, 2020): a uniform distribution (uniform), a half-and-half distribution where half of the elements have three times the probability of the other half (half&half), two Zipf distributions with parameters s = 1 and s = 0.5 (zipf-1, zipf-0.5), and distributions generated by a Dirichlet-1 prior and a Dirichlet-0.5 prior (diri-1, diri-0.5, respectively). [...] We first use the Australian population-data-by-region dataset ((Arvidsson, 2023), S = 104) to estimate M0 for n = 50 random data points. [...] We also apply our method to the Shakespeare dataset (Sha) (S = 935, |Dataset| = 111,396), commonly used in the literature (Efron & Thisted, 1976), focusing on missing mass for player frequency. |
| Dataset Splits | No | The paper describes repeating experiments 100 times by taking 100 different samples Xn of size n, and 10K times for extended samples. This describes a sampling strategy but does not provide specific train/test/validation dataset splits typically used for model training and evaluation. |
| Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments. |
| Software Dependencies | No | The paper does not provide specific software dependency details with version numbers, such as programming languages or libraries. |
| Experiment Setup | Yes | E.1 HYPERPARAMETERS For evaluating Algorithm 1, we use the following hyperparameters: ... The number of generations G = 100. To prevent the algorithm from converging to a local minimum, we limit the maximum number of generations to 2000. The mutant size m = 40. When selecting the individuals for the mutation, we use tournament selection with tournament size t = 3... To keep the estimator from becoming too complex, we limit the maximum number of terms in the estimator to 20. |
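The Good-Turing estimator that the paper's evolved estimators are compared against has a one-line definition: the estimated missing mass is the fraction of the sample made up of singletons. A minimal sketch (function name is ours, not the paper's):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass M0: f1 / n,
    where f1 is the number of elements seen exactly once."""
    counts = Counter(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / len(sample)

# Two singletons ("a", "c") out of four draws -> 2/4
print(good_turing_missing_mass(["a", "b", "b", "c"]))  # → 0.5
```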
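The control flow of Algorithm 1 can be sketched as a plain generational loop with elitism and an extendable budget. The `mutate` and `fitness` callables stand in for the paper's operators (mutation over estimator expression trees, MSE over sampled distributions), and the exact placement of the `f_best` update is our reading of the pseudocode; treat this as an illustrative sketch, not the authors' implementation:

```python
import random

def run_ga(r0, mutate, fitness, G=100, m=40, G_max=2000):
    """Sketch of Algorithm 1: evolve representations toward minimal fitness.

    r0 is the initial representation; the budget of G generations is
    repeatedly extended (up to G_max) while fitness is still improving.
    """
    population = [r0]
    f0 = f_best = fitness(r0)
    limit, g = G, 0
    while g < limit:
        g += 1
        top = sorted(population, key=fitness)[:m]   # select top-m parents
        mutants = [mutate(r) for r in top]          # one mutant per parent
        # Elitism: keep the seed r0 and the 3 best of the previous generation.
        population = mutants + [r0] + sorted(population, key=fitness)[:3]
        f_g = min(fitness(r) for r in population)
        # At the budget boundary, extend it while progress continues.
        if g == limit and (f_g == f0 or f_best > 0.95 * f_g) and limit < G_max:
            limit += G
        f_best = f_g
    return min(population, key=fitness)
```

On a toy problem (minimising `abs(x)` with Gaussian mutations), the loop reliably returns a candidate at least as fit as the seed, since `r0` is re-inserted every generation.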
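The six evaluation distributions listed under Open Datasets are all easy to materialise. A hedged sketch (the function name, support size `S`, and name parsing are our conventions; the paper only specifies the distribution families):

```python
import numpy as np

def make_distribution(name, S=1000, rng=None):
    """Build one of the six multinomial evaluation distributions over S classes."""
    rng = rng or np.random.default_rng(0)
    if name == "uniform":
        p = np.ones(S)
    elif name == "half&half":
        # First half of the elements carry 3x the probability of the second half.
        p = np.r_[np.full(S // 2, 3.0), np.full(S - S // 2, 1.0)]
    elif name.startswith("zipf-"):
        s = float(name.split("-", 1)[1])            # zipf-1, zipf-0.5
        p = 1.0 / np.arange(1, S + 1) ** s
    elif name.startswith("diri-"):
        a = float(name.split("-", 1)[1])            # diri-1, diri-0.5
        p = rng.dirichlet(np.full(S, a))
    else:
        raise ValueError(f"unknown distribution: {name}")
    return p / p.sum()
```

Samples `X^n` are then drawn with `rng.choice(S, size=n, p=make_distribution(name, S))`, repeated 100 times as described under Dataset Splits.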