Learning to Be Cautious

Authors: Montaser Mohammedalamen, Dustin Morrill, Alexander Sieusahai, Yash Satsangi, Michael Bowling

TMLR 2025

Reproducibility Variable | Result | LLM Response
Research Type | Experimental | "In this paper, we present both a sequence of tasks where cautious behavior becomes increasingly non-obvious, as well as an algorithm to demonstrate that it is possible for a system to learn to be cautious. ... Experiments and results are shown in Section 4, and our theoretical contribution extending k-of-N CFR to continuing MDPs is provided in Section 5."
Researcher Affiliation | Collaboration | Montaser Mohammedalamen (University of Alberta; Alberta Machine Intelligence Institute (Amii)); Dustin Morrill (Sony AI); Alexander Sieusahai (University of Alberta); Yash Satsangi (Independent Researcher*); Michael Bowling (University of Alberta; Alberta Machine Intelligence Institute (Amii))
Pseudocode | Yes | "Algorithm 1 Learning to Be Cautious ... Algorithm 2 Learning to Be Cautious"
Open Source Code | Yes | "Our code is available at https://github.com/montaserFath/Learning-to-be-Cautious."
Open Datasets | Yes | "The images are hand-drawn digits from MNIST (LeCun et al., 1998)... a shoe image from MNIST fashion (Xiao et al., 2017)... or a letter from EMNIST (Cohen et al., 2017)"
Dataset Splits | Yes | "The familiar states are the 60K training images in the MNIST digit dataset... We construct novel MDPs from the MNIST fashion test set (Xiao et al., 2017) and EMNIST letters test set (Cohen et al., 2017)... training data are [1%, 10%, 100%] of the full digit dataset."
Hardware Specification | Yes | "For all MNIST experiments, we used an NVIDIA Tesla V100 GPU and a 2.2 GHz Intel Xeon CPU with 100 GB memory. ... This experiment was run on a 3.60GHz Intel Core i9-9900K CPU with 7.7 GB of memory without a GPU."
Software Dependencies | No | "PyTorch (Paszke et al., 2019) is used to build and train all neural networks. ... Adam optimizer (Kingma & Ba, 2015)..."
Experiment Setup | Yes | "Table 1: The batch size and number of epochs used to train the neural network reward models for each setting in the how caution depends on the extent of training data experiment. ... Networks are trained to minimize the mean-squared error (MSE) between reward predictions and target rewards with the Adam optimizer (Kingma & Ba, 2015) using a learning rate of 0.0016 (we also try 0.01 and 0.001). The remaining parameters for Adam in PyTorch (β1, β2, ϵ, and weight decay) are set to their defaults (0.9, 0.999, 10⁻⁸, 0)... We use a discount factor of γ = 0.99."
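The quoted experiment setup pins down the optimizer configuration precisely, so it can be sketched directly. The following is a minimal PyTorch sketch of one reward-model training step under those settings; the network architecture, batch size, and input shape are assumptions (the paper specifies these per setting in its Table 1), while the learning rate (0.0016), MSE objective, Adam defaults (0.9, 0.999, 10⁻⁸, 0), and discount factor γ = 0.99 come from the quoted text.

```python
# Sketch of the reward-model training setup quoted above. Only the
# optimizer settings, loss, and discount factor are from the report;
# the model and data shapes are illustrative assumptions.
import torch
import torch.nn as nn

GAMMA = 0.99  # discount factor gamma = 0.99, as quoted

# Hypothetical reward model over flattened 28x28 MNIST-style images.
model = nn.Sequential(
    nn.Flatten(),
    nn.Linear(28 * 28, 256),
    nn.ReLU(),
    nn.Linear(256, 1),
)

# Adam with lr=0.0016; beta1, beta2, eps, and weight decay are left at
# PyTorch defaults (0.9, 0.999, 1e-8, 0), matching the quoted setup.
optimizer = torch.optim.Adam(model.parameters(), lr=0.0016)
loss_fn = nn.MSELoss()

def train_step(images: torch.Tensor, target_rewards: torch.Tensor) -> float:
    """One gradient step minimizing MSE between predicted and target rewards."""
    optimizer.zero_grad()
    preds = model(images).squeeze(-1)
    loss = loss_fn(preds, target_rewards)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage with random stand-ins for a batch of images and target rewards.
x = torch.randn(64, 1, 28, 28)
y = torch.randn(64)
loss_value = train_step(x, y)
```

The alternative learning rates the authors report trying (0.01 and 0.001) would be swapped in via the `lr` argument to `torch.optim.Adam`.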