How Much is Unseen Depends Chiefly on Information About the Seen

Authors: Seongmin Lee, Marcel Boehme

ICLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | In our experiments, our search algorithm discovers estimators that have a substantially smaller MSE than the state-of-the-art Good-Turing estimator. This holds for over 93% of runs when there are at least as many samples as classes. Our estimators' MSE is roughly 80% of the Good-Turing estimator's.
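For context, the Good-Turing baseline the paper compares against estimates the missing mass (the total probability of classes never observed) as f1/n, where f1 is the number of classes seen exactly once in a sample of size n. A minimal sketch (function name is ours):

```python
from collections import Counter

def good_turing_missing_mass(sample):
    """Good-Turing estimate of the missing mass M0: the fraction of
    sample elements that are singletons (seen exactly once)."""
    counts = Counter(sample)
    n = len(sample)
    f1 = sum(1 for c in counts.values() if c == 1)
    return f1 / n

# Two singletons ("c", "d") out of six draws -> estimate 2/6
print(good_turing_missing_mass(["a", "a", "b", "b", "c", "d"]))
```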
Researcher Affiliation | Academia | Seongmin Lee and Marcel Böhme, MPI for Security and Privacy, Germany. EMAIL
Pseudocode | Yes | Algorithm 1: Genetic Algorithm
  Input: Target frequency k, Sample X^n, Iteration limit G, mutant size m
  1: Population P_0 = {r_0}
  2: Fitness f_best = f_0 = fitness(r_0)
  3: Limit G_L = G
  4: for g from 1 to G_L do
  5:   P' = selectTopM(P_{g-1}, m)
  6:   P' = lapply(P', mutate)
  7:   P_g = P' ∪ {r_0} ∪ selectTopM(P_{g-1}, 3)
  8:   f_g = min(lapply(P_g, fitness))
  9:   if (g = G_L) ∧ ((f_g = f_0) ∨ (f_best > 0.95 · f_g)) then
  10:    G_L = G_L + G
  11:  f_best = f_g
  12: Estimator M̂_k^Evo = instantiate(selectTopM(P_{G_L}, 1))
  Output: Minimal-MSE Estimator M̂_k^Evo
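The loop in Algorithm 1 can be sketched as plain Python. The individual representation, mutation operator, and fitness function are problem-specific, so they are passed in as parameters; this is one illustrative reading of the pseudocode (including its limit-extension rule), not the authors' implementation:

```python
import random

def genetic_search(r0, fitness, mutate, G=100, m=40, G_max=2000):
    """Sketch of Algorithm 1: evolve candidates from seed r0, extending
    the generation limit G_L by G at the boundary while the search has
    not improved on r0 or is still improving; G_max caps G_L."""
    population = [r0]
    f0 = f_best = fitness(r0)
    GL = G
    g = 1
    while g <= GL and GL <= G_max:
        top = sorted(population, key=fitness)[:m]       # selectTopM(P_{g-1}, m)
        mutants = [mutate(r) for r in top]              # lapply(P', mutate)
        # P_g = mutants ∪ {r0} ∪ top-3 elites of P_{g-1}
        population = mutants + [r0] + sorted(population, key=fitness)[:3]
        f_g = min(fitness(r) for r in population)
        if g == GL and (f_g == f0 or f_best > 0.95 * f_g):
            GL += G                                     # extend the limit
        f_best = f_g
        g += 1
    return min(population, key=fitness)                 # instantiate top-1

# Toy usage: minimise |x - 3| starting from x = 0 (deterministic with a seed)
random.seed(0)
best = genetic_search(
    r0=0.0,
    fitness=lambda x: abs(x - 3.0),
    mutate=lambda x: x + random.gauss(0, 1),
    G=10, m=5, G_max=100,
)
```

Since `r0` is re-inserted into every generation, the returned individual is never worse than the seed.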
Open Source Code | Yes | Open Science and Replication. For scrutiny and replicability, we publish all our evaluation scripts at: https://github.com/niMgnoeSeeL/UnseenGA.
Open Datasets | Yes | Distributions. We use the same six multinomial distributions that are used in previous evaluations (Orlitsky & Suresh, 2015; Orlitsky et al., 2016; Hao & Li, 2020): a uniform distribution (uniform), a half-and-half distribution where half of the elements have three times the probability of the other half (half&half), two Zipf distributions with parameters s = 1 and s = 0.5 (zipf-1, zipf-0.5), and distributions generated by a Dirichlet-1 prior and a Dirichlet-0.5 prior (diri-1, diri-0.5, respectively). [...] We first use the Australian population-data-by-region dataset ((Arvidsson, 2023), S = 104) to estimate M0 for n = 50 random data points. [...] We also apply our method to the Shakespeare dataset (Sha) (S = 935, |Dataset| = 111,396), commonly used in the literature (Efron & Thisted, 1976), focusing on missing mass for player frequency.
Dataset Splits | No | The paper describes repeating experiments 100 times by taking 100 different samples X^n of size n, and 10K times for extended samples. This describes a sampling strategy but does not provide the specific train/validation/test dataset splits typically used for model training and evaluation.
Hardware Specification | No | The paper does not provide specific details about the hardware used for running its experiments.
Software Dependencies | No | The paper does not provide specific software dependency details with version numbers, such as programming languages or libraries.
Experiment Setup | Yes | E.1 HYPERPARAMETERS. For evaluating Algorithm 1, we use the following hyperparameters: ... The number of generations G = 100. To keep the algorithm from converging to a local minimum, we limit the maximum number of generations to 2000. The mutant size m = 40. When selecting the individuals for the mutation, we use tournament selection with tournament size t = 3... To keep the estimator from becoming too complex, we limit the maximum number of terms in the estimator to 20.
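The hyperparameters quoted from Appendix E.1 could be collected into a single configuration; the values come from the paper, while the key names are ours:

```python
# Hyperparameters reported in Appendix E.1 of the paper (key names are ours)
GA_HYPERPARAMS = {
    "generations_per_round": 100,  # G: generations before the limit check
    "max_generations": 2000,       # hard cap on the extended limit G_L
    "mutant_size": 40,             # m: individuals selected for mutation
    "tournament_size": 3,          # t: tournament selection size
    "max_terms": 20,               # complexity cap on the evolved estimator
}
```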