DoubleML - An Object-Oriented Implementation of Double Machine Learning in Python

Authors: Philipp Bach, Victor Chernozhukov, Malte S. Kurz, Martin Spindler

JMLR 2022

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | To illustrate this effect, we simulate data from a PLR model. A naive ML approach consists of estimating g0 with ML methods, for example using random forests, and then plugging-in predictions ĝ0 to eventually obtain a naive estimate of θ0 from an OLS regression of Equation (1). The arising bias is substantial as illustrated in Figure 1a. As an alternative, we can partial out the effect of X on Y and X on D by estimating ĝ0 and m̂0 with ML methods. θ0 can then be estimated from an OLS regression of Y − ĝ0(X) on D − m̂0(X). This approach implements a Neyman orthogonal score function that identifies θ0. As shown in Figure 1c, the corresponding estimator is robust to the regularization bias. (...) Figure 2 provides a summary of the object-oriented structure and a code snippet demonstrating the API of the DoubleML package. (...) Summary output for d: coef 0.5161, std err 0.0750, t 6.8805, P>|t| 0.0000, 2.5 % 0.3691, 97.5 % 0.6631
Researcher Affiliation | Academia | Philipp Bach EMAIL Victor Chernozhukov EMAIL Malte S. Kurz EMAIL Martin Spindler EMAIL Faculty of Business Administration, University of Hamburg, Moorweidenstraße 18, 20148 Hamburg, Germany; Department of Economics and Center for Statistics and Data Science, Massachusetts Institute of Technology, 50 Memorial Drive, Cambridge, MA 02142, USA
Pseudocode | No | The paper includes a code snippet in Figure 2, but it is an actual Python code example demonstrating the API rather than a pseudocode block or algorithm.
Open Source Code | Yes | DoubleML is an open-source Python library implementing the double machine learning framework of Chernozhukov et al. (2018) for a variety of causal models. (...) Source code, documentation and an extensive user guide can be found at https://github.com/DoubleML/doubleml-for-py and https://docs.doubleml.org.
Open Datasets | No | The paper mentions simulating data to illustrate effects: "To illustrate this effect, we simulate data from a PLR model." and "df = make_irm_data(return_type='DataFrame', n_obs=1000, theta=0.5)" in the code snippet. It does not use any pre-existing publicly available datasets.
Dataset Splits | No | The paper discusses 'sample splitting' as a key ingredient of the DML framework, stating "Sample splitting in K folds is applicable and the usage of repeated cross-fitting is recommended to obtain more efficient estimates." This describes the general methodology but does not provide specific split information (percentages, counts, or references to predefined splits) for any experiment conducted in the paper. The synthetic data generation does not specify splits.
Hardware Specification | No | The paper does not explicitly describe any specific hardware (e.g., GPU/CPU models, memory) used for running experiments or developing the software.
Software Dependencies | No | The package is distributed under the MIT license and relies on core libraries from the scientific Python ecosystem: scikit-learn, numpy, pandas, scipy, statsmodels and joblib. The paper lists software dependencies but does not provide specific version numbers for them (e.g., 'scikit-learn 0.24.1').
Experiment Setup | Yes | dml_model = DoubleMLIRM(dml_data, RandomForestRegressor(max_depth=5), RandomForestClassifier(max_depth=5), score='ATE')
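
The partialling-out idea and the K-fold sample splitting quoted in the table can be sketched with NumPy alone. This is a minimal illustration under stated assumptions and does not reproduce the DoubleML package API or the paper's simulation design: the function names (simulate_plr, cross_fit_predict, partialling_out) and the data-generating process are hypothetical, and polynomial least squares stands in for the ML learners such as random forests. Here the outcome is residualized on a cross-fitted regression of Y on X, the treatment on a cross-fitted regression of D on X, and θ0 is estimated by OLS of one residual on the other.

```python
import numpy as np


def simulate_plr(n_obs=2000, theta=0.5, seed=0):
    """Hypothetical PLR design: Y = theta * D + g0(X) + noise, D = m0(X) + noise."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(-2.0, 2.0, size=n_obs)
    d = np.sin(x) + rng.normal(size=n_obs)                   # m0(X) = sin(X)
    y = theta * d + 0.5 * x**2 + rng.normal(size=n_obs)      # g0(X) = X^2 / 2
    return x, d, y


def poly_features(x, degree=5):
    """Polynomial basis 1, x, ..., x^degree (a stand-in for an ML learner)."""
    return np.vander(x, degree + 1, increasing=True)


def cross_fit_predict(x, target, n_folds=5, seed=0):
    """Out-of-fold predictions: fit on K-1 folds, predict on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(x)), n_folds)
    preds = np.empty(len(x))
    for test_idx in folds:
        train_mask = np.ones(len(x), dtype=bool)
        train_mask[test_idx] = False
        coef, *_ = np.linalg.lstsq(
            poly_features(x[train_mask]), target[train_mask], rcond=None
        )
        preds[test_idx] = poly_features(x[test_idx]) @ coef
    return preds


def partialling_out(x, d, y, n_folds=5, seed=0):
    """Estimate theta by OLS of the Y residual on the D residual."""
    # The same seed yields the same fold assignment for both nuisance fits.
    y_res = y - cross_fit_predict(x, y, n_folds, seed)
    d_res = d - cross_fit_predict(x, d, n_folds, seed)
    theta_hat = (d_res @ y_res) / (d_res @ d_res)
    # Plug-in standard error based on the residual-on-residual score.
    psi = (y_res - theta_hat * d_res) * d_res
    se = np.sqrt(np.mean(psi**2) / len(x)) / np.mean(d_res**2)
    return theta_hat, se


if __name__ == "__main__":
    x, d, y = simulate_plr(n_obs=2000, theta=0.5, seed=0)
    theta_hat, se = partialling_out(x, d, y)
    print(f"theta_hat = {theta_hat:.4f}, std err = {se:.4f}")
```

Because every prediction is made on a fold that was held out during fitting, the residuals are not contaminated by overfitting on the same observations. Repeating the procedure over several random fold assignments and aggregating the estimates would correspond to the repeated cross-fitting that the paper recommends.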