The Cross-entropy of Piecewise Linear Probability Density Functions
Authors: Tom S. F. Haines
TMLR 2024
| Reproducibility Variable | Result | LLM Response |
|---|---|---|
| Research Type | Experimental | Experimental validation is presented, including a rigorous analysis of accuracy and a demonstration of using the presented result as the objective of a neural network. Previously, cross-entropy would need to be approximated via numerical integration, or equivalent, for which calculating gradients is impractical. Machine learning models with high parameter counts are optimised primarily with gradients, so if piecewise linear density representations are to be used then the presented analytic solution is essential. This paper contributes the necessary theory for the practical optimisation of information theoretic objectives when dealing with piecewise linear distributions directly. Removing this limitation expands the design space for future algorithms. |
| Researcher Affiliation | Academia | Tom S. F. Haines, EMAIL, Department of Computer Science, University of Bath |
| Pseudocode | No | The paper includes Python code in Appendix B, which is actual implementation code, not pseudocode or a clearly labeled algorithm block as defined by the question. |
| Open Source Code | Yes | A complete implementation, including code to generate the included figures, is in the supplementary material and also available from https://github.com/thaines/orogram. |
| Open Datasets | No | There is no data. |
| Dataset Splits | No | The paper states "There is no data.", so no dataset splits are provided. |
| Hardware Specification | No | The paper states only: "This research made use of Hex, the GPU Cloud in the Department of Computer Science at the University of Bath." This names a GPU cloud ('Hex') but lacks the specific GPU models, processor types, or detailed specifications needed for reproduction. |
| Software Dependencies | Yes | The below Python code is for Jax and has been developed with version 0.4.25. Validation, including of gradients, has been performed and may be found in the supplementary material alongside code for the demonstrations within the main text. |
| Experiment Setup | Yes | Nesterov’s accelerated gradient descent (Nesterov, 1983) is used, with 2048 iterations reducing the KL-divergence from 0.740 to 0.007. The network has two hidden layers of width 32, with Gaussian activations on all layers except the last, which remains linear. It is used as an offset (residual) for point positions, such that the final layer can be initialised with small values so it starts close to an identity transform. ADAM (Kingma & Ba, 2015) with 8192 iterations reduces the KL-divergence from 1.207 to 0.009. Stochastic gradient descent is used, i.e. each iteration a new sample of 256 points is drawn and pushed through the network for calculating the gradient. |
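The Research Type row notes that, before the paper's analytic result, the cross-entropy of piecewise linear densities had to be approximated via numerical integration, for which gradients are impractical. A minimal sketch of that baseline (all names hypothetical; this is not the paper's code or the orogram API) approximates H(p, q) = -&int; p(x) log q(x) dx with the trapezoid rule over densities given by their values at shared knots:

```python
import numpy as np

def cross_entropy_numeric(xs, p, q, eps=1e-12):
    """Trapezoid-rule approximation of H(p, q) = -integral p(x) log q(x) dx,
    where p and q are piecewise linear densities given by their values at
    the shared, sorted knot positions xs.  This is the kind of numerical
    approximation the paper's analytic solution replaces."""
    integrand = -p * np.log(np.maximum(q, eps))  # clamp avoids log(0)
    # Composite trapezoid rule over (possibly non-uniform) knot spacing.
    return 0.5 * np.sum((integrand[1:] + integrand[:-1]) * np.diff(xs))

# A triangular density on [0, 1] (peak at 0.5) against the uniform density.
xs = np.linspace(0.0, 1.0, 4097)
p = np.where(xs < 0.5, 4.0 * xs, 4.0 * (1.0 - xs))
q = np.ones_like(xs)

h_pq = cross_entropy_numeric(xs, p, q)  # -integral p log 1 = 0
h_pp = cross_entropy_numeric(xs, p, p)  # differential entropy: 1/2 + ln(1/2)
```

Even with 4097 knots the result is only approximate, and differentiating through such a quadrature with respect to the knot positions is what the paper describes as impractical; the analytic expression sidesteps the problem entirely.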
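The network quoted in the Experiment Setup row (two hidden layers of width 32, Gaussian activations everywhere except a linear final layer, used as a residual offset for point positions and initialised with small final-layer values so it starts near the identity) can be sketched as below. This is a hypothetical plain-NumPy reconstruction, not the paper's JAX code; in particular, reading "Gaussian activation" as exp(-x^2) and the 1e-3 initialisation scale are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_layer(n_in, n_out, scale):
    """Random weights at the given scale, zero biases."""
    return scale * rng.standard_normal((n_in, n_out)), np.zeros(n_out)

def gaussian(x):
    # "Gaussian activation" assumed here to mean exp(-x^2).
    return np.exp(-x ** 2)

class ResidualWarp:
    """Point-position offset network per the quoted setup: two hidden
    layers of width 32, Gaussian activations on all layers except the
    last, which is linear and initialised with small values so the
    network starts close to an identity transform."""
    def __init__(self, width=32):
        self.w1, self.b1 = init_layer(1, width, 1.0)
        self.w2, self.b2 = init_layer(width, width, 1.0)
        self.w3, self.b3 = init_layer(width, 1, 1e-3)  # small -> near identity

    def __call__(self, x):
        h = gaussian(x @ self.w1 + self.b1)
        h = gaussian(h @ self.w2 + self.b2)
        return x + h @ self.w3 + self.b3  # residual offset on the input

x = np.linspace(-1.0, 1.0, 5).reshape(-1, 1)
net = ResidualWarp()
y = net(x)  # at initialisation, y stays close to x
```

Training it as quoted would draw a fresh sample of 256 points each iteration, push them through the warp, and apply ADAM to the analytic cross-entropy objective; those steps are omitted here since the objective's closed form is the paper's contribution.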