“What is the most information/physics-theoretical correct way to compute the entropy of an image?”
An excellent and timely question.
Contrary to popular belief, it is indeed possible to define an intuitively (and theoretically) natural information-entropy for an image.
Consider the following figure:
We can see that the differential image has a much more compact histogram, so its Shannon information-entropy is lower. In other words, the second order Shannon entropy (the entropy derived from the differential data) gives a lower, less redundant estimate of the information content. If we can extend this idea isotropically into 2D, then we might expect good estimates of image information-entropy.
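As a quick illustration of the 1D effect (not from the preprint; the ramp signal and bin choices below are my own), here is a minimal NumPy sketch comparing the first order histogram entropy of a smooth signal with the entropy of its differences:

```python
import numpy as np

def hist_entropy(values, bins, value_range):
    """First order Shannon entropy (bits) of the histogram of `values`."""
    counts, _ = np.histogram(values, bins=bins, range=value_range)
    p = counts / counts.sum()
    p = p[p > 0]                          # empty bins contribute 0 * log 0 = 0
    return -np.sum(p * np.log2(p))

# A smooth 8-bit ramp: its raw histogram is almost flat, but the histogram
# of its first differences is concentrated in just a couple of bins.
signal = np.round(np.linspace(0, 255, 1024)).astype(int)
diff = np.diff(signal)                    # the "differential" (second order) data

print(hist_entropy(signal, 256, (0, 256)))     # ~8 bits: the histogram alone looks incompressible
print(hist_entropy(diff, 512, (-256, 256)))    # ~0.8 bits: the redundancy is exposed
```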
A two dimensional histogram of gradients allows the 2D extension.
We can formalise the arguments and, indeed, this has been completed recently. Recapping briefly:
The observation that the simple definition (see, for example, MATLAB’s definition of image entropy) ignores spatial structure is crucial. To understand what is going on it is worth returning to the 1D case briefly. It has long been known that using the histogram of a signal to compute its Shannon information/entropy ignores the temporal or spatial structure and gives a poor estimate of the signal’s inherent compressibility or redundancy. The solution was already available in Shannon’s classic text: use the second order properties of the signal, i.e. transition probabilities. Rice & Plaunt observed in 1971 that a very good predictor of a pixel value in a raster scan is the value of the preceding pixel, which immediately leads to a differential predictor and a second order Shannon entropy that aligns with simple compression ideas such as run-length encoding. These ideas were refined over the following decades into the classic lossless (differential/predictive) image coding techniques that are still in use (e.g. PNG’s prediction filters and lossless JPEG), whilst DCTs and wavelets are mainly used for lossy encoding.
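To make the raster-scan predictor concrete, here is a hedged sketch (the function name and the simple handling of the first pixel are my own illustrative choices, not a particular codec’s):

```python
import numpy as np

def raster_residual(img):
    """Residual of the 'previous pixel in the raster scan' predictor."""
    flat = img.astype(np.int32).ravel()    # raster scan order (row by row)
    res = np.empty_like(flat)
    res[0] = flat[0]                       # first pixel stored verbatim
    res[1:] = flat[1:] - flat[:-1]         # prediction = the preceding pixel
    return res

# The histogram entropy of `res` approximates the second order (differential)
# Shannon entropy, i.e. the rate a simple predictive lossless coder can approach.
```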
Moving now to 2D: researchers found it very hard to extend Shannon’s ideas to higher dimensions without introducing an orientation dependence. Intuitively we might expect the Shannon information-entropy of an image to be independent of its orientation. We also expect images with complicated spatial structure (like the questioner’s random noise example) to have higher information-entropy than images with simple spatial structure (like the questioner’s smooth gray-scale example). It turns out that the reason it was so hard to extend Shannon’s ideas from 1D to 2D is that there is a (one-sided) asymmetry in Shannon’s original formulation that prevents a symmetric (isotropic) formulation in 2D. Once the 1D asymmetry is corrected, the 2D extension proceeds easily and naturally.
Cutting to the chase (interested readers can find the detailed exposition in the arXiv preprint at https://arxiv.org/abs/1609.01117 ): the image entropy is computed from a 2D histogram of gradients (the gradient probability density function).
First the 2D pdf is computed by binning estimates of the image’s x and y derivatives. This resembles the binning operation used to generate the more common 1D intensity histogram. The derivatives can be estimated by 2-pixel finite differences computed in the horizontal and vertical directions. For an NxN square image f(x,y) we compute NxN values of the partial derivative fx and NxN values of fy. We scan through the differentiated image and, for every pixel, use the pair (fx, fy) to locate a discrete bin in the destination (2D pdf) array, which is then incremented by one. We repeat for all NxN pixels. The resulting 2D pdf must be normalised to have unit total probability (simply dividing by NxN achieves this). The 2D pdf is now ready for the next stage.
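A minimal sketch of this binning step (the wrap-around finite difference and the bin range are my own simplifications; the preprint discusses the exact derivative estimator and boundary handling):

```python
import numpy as np

def gradient_pdf(img, bins=511):
    """2D histogram (pdf) of the (fx, fy) gradient pairs of an 8-bit image."""
    f = img.astype(np.float64)
    # 2-pixel finite differences; wrap-around keeps fx and fy at NxN values
    fx = np.roll(f, -1, axis=1) - f            # horizontal derivative
    fy = np.roll(f, -1, axis=0) - f            # vertical derivative
    # Each pixel's (fx, fy) pair increments one bin of the 2D histogram
    pdf, _, _ = np.histogram2d(fx.ravel(), fy.ravel(), bins=bins,
                               range=[[-255.5, 255.5], [-255.5, 255.5]])
    return pdf / pdf.sum()                     # normalise to unit total probability
```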
The computation of the 2D Shannon information entropy from the 2D gradient pdf is simple. Shannon’s classic logarithmic summation formula applies directly, except for a crucial factor of one half which originates from special bandlimited sampling considerations for a gradient image (see the arXiv paper for details). The half factor makes the computed 2D entropy lower still than that of other (more redundant) methods for estimating 2D entropy or lossless compression.
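In plain-text form the formula just described is H = -(1/2) * sum over (i,j) of p(i,j) * log2 p(i,j), which a direct sketch can implement (again, see the preprint for the derivation of the one-half factor):

```python
import numpy as np

def gradient_entropy(pdf):
    """Shannon entropy (bits/pixel) of a 2D gradient pdf, with the 1/2 factor."""
    p = pdf[pdf > 0]                       # empty bins contribute 0 * log 0 = 0
    return -0.5 * np.sum(p * np.log2(p))   # the crucial factor of one half
```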
I’m sorry I haven’t written the necessary equations down here, but everything is available in the preprint text. The computations are direct (non-iterative) and the computational complexity is of order NxN (the number of pixels). The final computed Shannon information-entropy is rotation independent and corresponds precisely to the number of bits required to encode the image in a non-redundant gradient representation.
By the way, the new 2D entropy measure predicts an (intuitively pleasing) entropy of 8 bits per pixel for the random image and 0.000 bits per pixel for the smooth gradient image in the original question.
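Using the gradient_pdf and gradient_entropy sketches above, those two ballpark figures can be checked directly (exact values depend on image size, the derivative estimator, and finite-sample binning effects):

```python
import numpy as np

rng = np.random.default_rng(0)
noise = rng.integers(0, 256, size=(256, 256))     # uniform random 8-bit image
ramp = np.tile(np.arange(256), (256, 1))          # smooth horizontal gradient

print(gradient_entropy(gradient_pdf(noise)))      # in the region of 8 bits per pixel
print(gradient_entropy(gradient_pdf(ramp)))       # very close to 0 bits per pixel
```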