reproducibilityindex.ai

Notice: The reproducibility variables underlying each score are classified using an automated LLM-based pipeline, validated against a manually labeled dataset. LLM-based classification introduces uncertainty and potential bias; scores should be interpreted as estimates. Full accuracy metrics and methodology are described in [1].

Spatio-Causal Patterns of Sample Growth

Authors: Andre F. Ribeiro

JAIR 2025

Reproducibility Variable	Result	LLM Response
Research Type	Experimental	We illustrate these theoretic patterns in the full American census from 1840 to 1940, and samples ranging from the street-level all the way to the national. This reveals new conditions for the generalizability of samples over space and time, and connections among the Shapley value, counterfactual statistics, and hyperbolic geometry. We then consider 100 years of the American census (and all variables in the census) as case study. For each cross-section (decade), we consider the important task of predicting economic growth for over 60K individual locations under increasing spatial samples. We demonstrate how (1) generalizability tradeoffs evolve across spatial levels, and (2) repeat the validation of generalizability limits derived in [27] for the spatial domain, and with the current census micro-data.
Researcher Affiliation	Academia	ANDRE F. RIBEIRO, Harvard University, USA and University of Sao Paulo, Brazil
Pseudocode	No	The paper describes methods and processes using mathematical formulations and descriptive text, but it does not contain any explicitly labeled 'Pseudocode' or 'Algorithm' blocks, nor does it present structured steps in a code-like format.
Open Source Code	No	The paper does not contain an explicit statement about releasing source code, nor does it provide any links to a code repository.
Open Datasets	Yes	The datasets analyzed are available in the IPUMS repository [17]. IPUMS. U.S. Individual-level Census (United States Bureau of the Census). 2022. url: https://usa.ipums.org/usa/complete_count.shtml.
Dataset Splits	No	The paper mentions using 'held-out sample' for accuracy calculation and that 'One million location and year were chosen randomly', but it does not specify the percentages, exact counts, or methodology used for training, validation, or test splits of the dataset.
Hardware Specification	No	The paper does not provide any specific details about the hardware (e.g., CPU, GPU models, memory, or cluster specifications) used to run the experiments.
Software Dependencies	No	The paper lists various types of models and algorithms used (e.g., 'Neural Network Models, Generalized Linear Models, Boosting Models'), but it does not specify the names or version numbers of any particular software libraries, frameworks, or solvers.
Experiment Setup	No	The paper states that 'Detailed description of algorithms used, and their hyperparameter optimization, can be found on [27],' deferring the crucial experimental setup details to a separate publication rather than providing them in the main text.