Actions Speak Louder Than Words: Rate-Reward Trade-off in Markov Decision Processes
Authors: Haotian Wu, Gongpu Chen, Deniz Gunduz
ICLR 2025
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental results validate Act2Comm's capability to enable reliable communication while maintaining a certain level of control performance. We evaluate Act2Comm across three distinct MDP environments, as detailed in Appendix D, with communication performance measured by the bit error rate (BER). |
| Researcher Affiliation | Academia | Haotian Wu, Gongpu Chen, Deniz Gündüz. Department of Electrical and Electronic Engineering, Imperial College London, London SW7 2AZ, U.K. |
| Pseudocode | Yes | To enhance readers' understanding of the training and inference process, we provide the pseudocode and illustration figures (see Fig. 7) for Act2Comm, detailing both the training and inference phases, as shown in Algorithms 1 and 2. |
| Open Source Code | Yes | For additional details about the training, the training logs and source code are also available on the project page of this paper. |
| Open Datasets | No | The paper describes custom MDP environments: "Lucky Wheel", "Catch the Ball", and "Erratic Robot". It does not reference any publicly available datasets or provide access to these environments as datasets. For example, "The Catch the Ball game is set in a 3x3 grid..." and "The Erratic robot game takes place on a 4x4 grid map..." are descriptions of simulation setups, not external datasets. |
| Dataset Splits | No | The paper does not provide specific training/test/validation dataset splits. Instead, it describes simulations run on custom MDP environments and mentions that "The performance presented is averaged over 20,000 execution times," which implies repeated simulations rather than fixed dataset splits. |
| Hardware Specification | Yes | The experimental results, presented in Table 4, were obtained using a single GPU-A5000 with 10,000 runs for the Erratic Robot environment. |
| Software Dependencies | No | The paper mentions "an Adam-based lookahead optimizer (Zhang et al., 2019)" but does not specify version numbers for any programming languages, libraries, or solvers used for the implementation. |
| Experiment Setup | Yes | For the Act2Comm scheme, we train the model with a batch size of 4096, a learning rate of 0.001, and an Adam-based lookahead optimizer (Zhang et al., 2019). The inner-training for the critic network consists of s_in = 20 steps, with a noise variance of σ_w² = 0.1. Each block has a length of µ = 3, and the temperature parameter is set as γ = 10, γ = 50, γ = 100, γ = 200. The performance presented is averaged over 20,000 execution times. To investigate the trade-offs, we train the Act2Comm model with λ ∈ [0.01, 20]. The detailed architecture of the Act2Comm scheme is provided in Fig. 8b. ...we set d = 32, L_t = 2 and L_t = 4 during the experiments. |
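
The reported setup and the BER metric can be collected into a short sketch for anyone attempting a reproduction. This is a minimal illustration, not the authors' code: the class name `Act2CommSetup`, its field names, and the helper `bit_error_rate` are hypothetical, and only the numeric values come from the Experiment Setup row. The λ interval is stored as its two endpoints because the table does not say which values inside [0.01, 20] were trained.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Act2CommSetup:
    """Hyperparameters quoted in the table; field names are hypothetical."""
    batch_size: int = 4096
    learning_rate: float = 1e-3
    optimizer: str = "Adam-based lookahead (Zhang et al., 2019)"
    critic_inner_steps: int = 20                         # s_in
    noise_variance: float = 0.1                          # sigma_w^2
    block_length: int = 3                                # mu
    temperatures: Tuple[int, ...] = (10, 50, 100, 200)   # gamma values
    lambda_range: Tuple[float, float] = (0.01, 20.0)     # endpoints only
    num_eval_runs: int = 20_000                          # averaging runs


def bit_error_rate(sent: Sequence[int], decoded: Sequence[int]) -> float:
    """Fraction of positions where the decoded bit differs from the sent bit."""
    if len(sent) != len(decoded):
        raise ValueError("bit sequences must have the same length")
    if len(sent) == 0:
        raise ValueError("bit sequences must be non-empty")
    errors = sum(1 for s, d in zip(sent, decoded) if s != d)
    return errors / len(sent)


if __name__ == "__main__":
    setup = Act2CommSetup()
    print(setup)
    # One flipped bit out of four.
    print(bit_error_rate([0, 1, 1, 0], [0, 1, 0, 0]))
```

Freezing the dataclass keeps a run's configuration immutable once constructed, which makes it easier to log alongside results.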