Learning to Be Cautious
Authors: Montaser Mohammedalamen, Dustin Morrill, Alexander Sieusahai, Yash Satsangi, Michael Bowling
TMLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious, as well as an algorithm to demonstrate that it is possible for a system to learn to be cautious. ... Experiments and results are shown in Section 4, and our theoretical contribution extending k-of-N CFR to continuing MDPs is provided in Section 5. |
| Researcher Affiliation | Collaboration | Montaser Mohammedalamen EMAIL University of Alberta; Alberta Machine Intelligence Institute (Amii) Dustin Morrill EMAIL Sony AI Alexander Sieusahai EMAIL University of Alberta Yash Satsangi EMAIL Independent Researcher* Michael Bowling EMAIL University of Alberta; Alberta Machine Intelligence Institute (Amii) |
| Pseudocode | Yes | Algorithm 1 Learning to Be Cautious ... Algorithm 2 Learning to Be Cautious |
| Open Source Code | Yes | Our code is available at https://github.com/montaserFath/Learning-to-be-Cautious. |
| Open Datasets | Yes | The images are hand-drawn digits from MNIST (LeCun et al., 1998)... a shoe image from MNIST fashion (Xiao et al., 2017)... or a letter from EMNIST (Cohen et al., 2017) |
| Dataset Splits | Yes | The familiar states are the 60K training images in the MNIST digit dataset... We construct novel MDPs from the MNIST fashion test set (Xiao et al., 2017) and EMNIST letters test set (Cohen et al., 2017)... training data are [1%, 10%, 100%] of the full digit dataset. |
| Hardware Specification | Yes | For all MNIST experiments, we used an NVIDIA Tesla V100 GPU and a 2.2 GHz Intel Xeon CPU with 100 GB memory. ... This experiment was run on a 3.60GHz Intel Core i9-9900K CPU with 7.7 GB of memory without a GPU. |
| Software Dependencies | No | PyTorch (Paszke et al., 2019) is used to build and train all neural networks. ... Adam optimizer (Kingma & Ba, 2015)... |
| Experiment Setup | Yes | Table 1: The batch size and number of epochs used to train the neural network reward models for each setting in the how caution depends on the extent of training data experiment. ... Networks are trained to minimize the mean-squared error (MSE) between reward predictions and target rewards with the Adam optimizer (Kingma & Ba, 2015) using a learning rate of 0.0016 (we also try 0.01, and 0.001). The remaining parameters for Adam in PyTorch (β1, β2, ϵ, and weight decay) are set to their defaults (0.9, 0.999, 10⁻⁸, 0)... We use a discount factor of γ = 0.99. |
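To make the reported optimizer settings concrete, here is a minimal pure-Python sketch of an Adam update using the hyperparameters quoted above (learning rate 0.0016; β1 = 0.9, β2 = 0.999, ε = 10⁻⁸, weight decay 0), applied to fitting a one-parameter reward model by MSE. This is an illustration of the stated configuration only, not the paper's actual PyTorch training code; the toy model and target are invented for the example.

```python
import math

# Hyperparameters as reported in the paper's setup.
LR, BETA1, BETA2, EPS, GAMMA = 0.0016, 0.9, 0.999, 1e-8, 0.99

def adam_step(theta, grad, m, v, t):
    """One Adam update (bias-corrected) with the reported hyperparameters."""
    m = BETA1 * m + (1 - BETA1) * grad
    v = BETA2 * v + (1 - BETA2) * grad ** 2
    m_hat = m / (1 - BETA1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - BETA2 ** t)          # bias-corrected second moment
    theta = theta - LR * m_hat / (math.sqrt(v_hat) + EPS)
    return theta, m, v

# Toy example: fit a scalar reward model r(x) = w * x to the target 2x
# by minimizing the squared error, as in the paper's MSE objective.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 5001):
    x, target = 1.0, 2.0
    grad = 2 * (w * x - target) * x       # d/dw of (w*x - target)^2
    w, m, v = adam_step(w, grad, m, v, t)
```

In PyTorch this corresponds to `torch.optim.Adam(params, lr=0.0016)` with the remaining arguments left at their defaults, which match the quoted values.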