DMLR: Data-centric Machine Learning Research - Past, Present and Future

Authors: Luis Oala, Manil Maskey, Lilith Bat-Leah, Alicia Parrish, Nezihe Merve Gürel, Tzu-Sheng Kuo, Yang Liu, Rotem Dror, Danilo Brajovic, Xiaozhe Yao, Max Bartolo, William A Gaviria Rojas, Ryan Hileman, Rainier Aliment, Michael W. Mahoney, Meg Risdal, Matthew Lease, Wojciech Samek, Debojyoti Dutta, Curtis G Northcutt, Cody Coleman, Braden Hancock, Bernard Koch, Girmaw Abebe Tadesse, Bojan Karlaš, Ahmed Alaa, Adji Bousso Dieng, Natasha Noy, Vijay Janapa Reddi, James Zou, Praveen Paritosh, Mihaela van der Schaar, Kurt Bollacker, Lora Aroyo, Ce Zhang, Joaquin Vanschoren, Isabelle Guyon, Peter Mattson

DMLR 2024

Reproducibility Variable | Result | LLM Response
Research Type | Theoretical | Drawing from discussions at the inaugural DMLR workshop at ICML 2023 and meetings prior, in this report we outline the relevance of community engagement and infrastructure development for the creation of next-generation public datasets that will advance machine learning science. We chart a path forward as a collective effort to sustain the creation and maintenance of these datasets and methods towards positive scientific, societal and business impact.
Researcher Affiliation | Collaboration | Luis Oala1, Manil Maskey2, Lilith Bat-Leah3, Alicia Parrish4, Nezihe Merve Gürel5, Tzu-Sheng Kuo6, Yang Liu7,8, Rotem Dror9, Danilo Brajovic10, Xiaozhe Yao34, Max Bartolo11, William Gaviria Rojas12, Ryan Hileman13, Rainier Aliment4, Michael W. Mahoney14,15,16, Meg Risdal17, Matthew Lease18, Wojciech Samek19,20, Debo Dutta21, Curtis Northcutt22, Cody Coleman12, Braden Hancock23, Bernard Koch24, Girmaw Abebe Tadesse25, Bojan Karlaš26, Ahmed Alaa14, Adji Bousso Dieng27, Natasha Noy4, Vijay Janapa Reddi26, James Zou28, Praveen Paritosh29, Mihaela van der Schaar30, Kurt Bollacker29, Lora Aroyo4, Ce Zhang31,24, Joaquin Vanschoren32, Isabelle Guyon4,33,25, Peter Mattson4,29 -- 1Dotphoton, 2NASA, 3Mod Op, 4Google, 5TU Delft, 6Carnegie Mellon University, 7UC Santa Cruz, 8ByteDance Research, 9University of Haifa, 10Fraunhofer IPA, 11Cohere, 12Coactive AI, 13Talon, 14UC Berkeley, 15ICSI, 16LBNL, 17Kaggle, 18UT Austin, 19TU Berlin, 20Fraunhofer HHI, 21Nutanix, 22Cleanlab, 23Snorkel AI, 24University of Chicago, 25Microsoft AI for Good Lab, 26Harvard University, 27Princeton University, 28Stanford University, 29MLCommons, 30University of Cambridge, 31Together, 32TU Eindhoven, 33University of Paris-Saclay, 34ETH Zurich, 35ChaLearn
Pseudocode | No | The paper is an editorial and review of the data-centric machine learning field, and as such, it does not present any novel algorithms or pseudocode.
Open Source Code | No | The paper is an editorial discussing the field of data-centric machine learning. It does not present a novel methodology for which source code would be released. It does, however, refer to several existing open-source projects and initiatives, such as 'open-source projects such as Lance4', 'Croissant 5[41]', and 'open source data-centric libraries such as https://github.com/vanderschaarlab/datagnosis'.
Open Datasets | Yes | Historically, the ambivalence towards data has manifested in different ways. In the early 1990s, Wilson, Garris and Wilkinson [1-4] distributed Handwriting Sampling Forms at the National Institute of Standards and Technology (NIST), digitizing the resulting data into the raw ingredients that were later turned into the now infamous machine learning staple MNIST. ... One must only look at ImageNet [8] or CIFAR [9] for great success stories. ... Exceptions do exist, especially by cooperative-style communities such as LAION [31], Common Crawl [32], or Eleuther [33], among others.
Dataset Splits | No | The paper is an editorial discussing the field of data-centric machine learning research. It does not present novel experimental results or methodologies that would require specifying dataset splits.
Hardware Specification | No | The paper is an editorial discussing the field of data-centric machine learning research. It does not describe any specific hardware used for running experiments, as the paper itself does not conduct experiments.
Software Dependencies | No | The paper is an editorial discussing the field of data-centric machine learning research. It does not specify particular software dependencies with version numbers, as it does not conduct its own experiments or present a specific software implementation.
Experiment Setup | No | The paper is an editorial and review of data-centric machine learning research. It does not detail a specific experimental setup, including hyperparameters or system-level training settings, as it does not conduct its own experiments.