Data Acquisition: A New Frontier in Data-centric AI

Authors: Lingjiao Chen, Bilge Acun, Newsha Ardalani, Yifan Sun, Feiyang Kang, Hanrui Lyu, Yongchan Kwon, Ruoxi Jia, Carole-Jean Wu, Matei Zaharia, James Zou

DMLR 2025

Reproducibility (variable: result, followed by the LLM response)
Research Type: Experimental. We first present an investigation of current data marketplaces, revealing a lack of platforms offering detailed information about datasets, transparent pricing, and standardized data formats. With the objective of inciting participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers in a data marketplace. The benchmark was released as a part of DataPerf (Mazumder et al., 2023). Our evaluation of the submitted strategies underlines the need for effective data acquisition strategies in ML.
Researcher Affiliation: Collaboration. Lingjiao Chen, Stanford University; Bilge Acun, FAIR, Meta; Newsha Ardalani, FAIR, Meta; Yifan Sun, Columbia University; Feiyang Kang, Virginia Tech; Hanrui Lyu, Columbia University; Yongchan Kwon, Columbia University; Ruoxi Jia, Virginia Tech; Carole-Jean Wu, FAIR, Meta; Matei Zaharia, University of California, Berkeley; James Zou, Stanford University.
Pseudocode: No. The paper describes several strategies (e.g., "Strategy-Single", "Strategy-All", "Strategy-p%", "Strategy-RFE", "Strategy-CoFR", "Strategy-LP") using explanatory text and mathematical formulations, but it does not include any clearly labeled pseudocode blocks or algorithms with structured steps formatted like code.
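The strategy names suggest simple budget-allocation heuristics. As a purely illustrative sketch (not the authors' implementation; the function, provider scores, and the 20% split are hypothetical), a "Strategy-p%"-style rule might probe every provider with a small fraction of the budget and spend the remainder on the provider judged most useful:

```python
# Hypothetical sketch of a "Strategy-p%"-style acquisition rule.
# Not from the paper: provider names, scores, and p=0.2 are illustrative.
def strategy_p_percent(budget, provider_scores, p=0.2):
    """Return a {provider: dollars} allocation.

    provider_scores: estimated usefulness of each provider's data,
    e.g. similarity of a purchased sample to the acquirer's evaluation set.
    """
    providers = list(provider_scores)
    probe_total = budget * p
    # Spend p% of the budget uniformly to probe all providers.
    allocation = {name: probe_total / len(providers) for name in providers}
    # Spend the remaining (1 - p) share on the best-scoring provider.
    best = max(providers, key=provider_scores.get)
    allocation[best] += budget - probe_total
    return allocation

# Example with the paper's $150 budget and made-up scores.
alloc = strategy_p_percent(150.0, {"A": 0.3, "B": 0.9, "C": 0.5})
```

With these made-up scores, providers A and C each receive the $10 probe share while B receives $130, so the full $150 budget is spent.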
Open Source Code: Yes. To encourage more research on this emerging topic, we have released our code at https://github.com/facebookresearch/Data_Acquisition_for_ML_Benchmark
Open Datasets: Yes. With the objective of inciting participation from the data-centric AI community, we then introduce the DAM challenge, a benchmark to model the interaction between the data providers and acquirers in a data marketplace. The benchmark was released as a part of DataPerf (Mazumder et al., 2023) at https://www.dataperf.org/training-set-acquisition
Dataset Splits: No. The paper mentions "a small evaluation dataset Db" for the acquirer and discusses the generation of data for the benchmark (e.g., "five distinct market instances"; "The original data pool contains 21 categories. For each data provider, we sample a different number of samples from each category."). However, it does not provide specific details on how this evaluation dataset Db is split into training, validation, or test sets, or how any acquired data would be split for model training and evaluation.
Hardware Specification: No. The paper does not provide specific details regarding the hardware used for running the experiments. It mentions that "the submitted strategies are lightweight, i.e., require a small amount of computational resources (e.g., training some small models to measure similarity)", but no exact specifications such as GPU models or CPU types are listed.
Software Dependencies: No. The paper states that "A logistic regression model is used as the ML model" and refers to "recursive feature elimination (RFE) (Darst et al., 2018)", but it does not provide specific version numbers for any software libraries, programming languages (e.g., Python), or frameworks (e.g., scikit-learn, PyTorch) used in the implementation.
Experiment Setup: No. The paper describes the overall setup of the DAM benchmark, including the number of data providers, the task (sentiment analysis), the acquirer's budget ($150), and the evaluation metric's alpha parameter (0.98). However, it does not specify concrete experimental setup details for the machine learning models, such as hyperparameters for the logistic regression model (e.g., learning rate, regularization, solver, batch size, number of epochs) or specific training configurations beyond the model type.