The Maximum Entropy Principle:

In today’s post, I am looking at the Maximum Entropy principle, a brainchild of the eminent physicist E. T. Jaynes. This idea is based on Claude Shannon’s Information Theory. The Maximum Entropy principle (an extension of the Principle of Insufficient Reason) is the ideal epistemic stance. Loosely put, we should model only what is known, and we should assign maximum uncertainty for what is unknown. To explain this further, let’s look at an example of a coin toss.

If we don’t know anything about the coin, our prior assumption should be that heads and tails are equally likely. This is a stance of maximum entropy. If we assumed that the coin was loaded, we would be “loading” our model and claiming certainty that we have not earned.

Entropy is a measure proposed by Claude Shannon as part of his information theory. Low entropy messages have low information content or low surprise content. High entropy messages, on the other hand, have high information content or high surprise content. The information content of an event falls as its probability rises (it is the negative logarithm of the probability), so low probability events have high information content. For example, an unlikely defeat of a reigning sports team generates more surprise than a likely win. Entropy is the average level of information when we consider all of the probabilities. In the case of the coin toss, the entropy is the average level of information when we consider the probabilities of heads and tails.

For discrete events, the entropy is maximum for equally likely events, or in other words for a uniform distribution. Thus, when we say that the probability of heads or tails is 0.5, we are assuming a maximum entropy model. In the case of a uniform distribution, the maximum entropy model is the same as Laplace’s principle of insufficient reason. If the coin always landed on heads, we would have a zero entropy case because no new information is available. If it is a loaded coin that makes one side more likely to occur, then the entropy is lower than for a fair coin. This is shown below, where the X-axis is the probability of heads, and the Y-axis is the information entropy. We can see that the entropy is zero when the probability of heads is 0 (no heads) or 1 (100% heads). The highest value for entropy occurs when the probability of heads is 0.5 or 50%. For those who are interested, John von Neumann had a great idea to make a loaded coin fair. You can check that out here.
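The entropy curve for a coin can be sketched in a few lines of Python. This is only an illustration, assuming entropy is measured in bits (logarithm base 2); the function name `binary_entropy` and the sample probabilities are my own choices.

```python
import math


def binary_entropy(p_heads: float) -> float:
    """Shannon entropy, in bits, of a coin that lands heads with probability p_heads."""
    entropy = 0.0
    for p in (p_heads, 1.0 - p_heads):
        if p > 0.0:  # by convention, an impossible outcome contributes nothing
            entropy -= p * math.log2(p)
    return entropy


# Sample points along the X-axis (probability of heads); values are illustrative.
for p_heads in (0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0):
    print(f"Pr(Heads) = {p_heads:.1f} -> entropy = {binary_entropy(p_heads):.3f} bits")
```

The printed values rise from zero at Pr(Heads) = 0, peak at one bit for the fair coin, and fall back to zero at Pr(Heads) = 1.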

From this standpoint, if we take a game, where one team is more favored to win, we could say that the most informative part of a game is sometimes the coin toss.
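As a worked illustration, assume (my number, not a measured one) that the favored team wins with probability 0.9. Measuring entropy in bits, the game outcome carries less than half the information of the fair coin toss:

```latex
\begin{aligned}
H_{\text{game}} &= -0.9\log_2 0.9 - 0.1\log_2 0.1 \approx 0.137 + 0.332 = 0.469 \text{ bits},\\
H_{\text{coin toss}} &= -0.5\log_2 0.5 - 0.5\log_2 0.5 = 1 \text{ bit}.
\end{aligned}
```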

Let’s consider the case of a die. There are six possible events (1 through 6) when we roll a die. The maximum entropy model is to assume a uniform distribution, i.e., to assign a probability of 1/6 to each of the values 1 through 6. Now suppose we somehow knew that 6 is more likely to happen; for example, the manufacturer of a loaded die says that the number 6 comes up 3/6 of the time. Per the maximum entropy model, we should divide the remaining 3/6 equally among the remaining 5 numbers, giving each a probability of 1/10. With each additional piece of information, we should change our model so that the entropy is at its maximum subject to what we now know. What I have discussed here is the basic information regarding maximum entropy. Each new piece of “valid” information that we need to incorporate into our model is called a constraint. The maximum entropy approach uses Lagrange multipliers to find the solutions. For discrete events, with no additional information, the maximum entropy model is the uniform distribution. In a similar vein, if you are looking at a continuous distribution, and you know its mean and variance, the maximum entropy model is the normal distribution.
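The die example can be checked numerically. The sketch below is a rough illustration rather than a general maximum entropy solver: it handles only the simple kind of constraint used above (some face probabilities are stated outright), where splitting the leftover probability equally is the maximum entropy choice. The helper names `entropy_bits` and `max_entropy_die`, and the `lopsided` comparison distribution, are hypothetical and introduced just for this example.

```python
import math


def entropy_bits(probabilities):
    """Shannon entropy in bits; zero-probability outcomes contribute nothing."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0.0)


def max_entropy_die(known, faces=6):
    """Keep the stated face probabilities and split the leftover mass equally.

    `known` maps a face (1..faces) to its stated probability.
    """
    unknown_faces = [face for face in range(1, faces + 1) if face not in known]
    if not unknown_faces:  # every face was specified; nothing left to assign
        return [known[face] for face in range(1, faces + 1)]
    leftover = 1.0 - sum(known.values())
    share = leftover / len(unknown_faces)
    return [known.get(face, share) for face in range(1, faces + 1)]


no_information = max_entropy_die({})           # uniform: 1/6 for each face
maker_claim = max_entropy_die({6: 0.5})        # 6 gets 1/2, the rest 1/10 each
lopsided = [0.3, 0.05, 0.05, 0.05, 0.05, 0.5]  # also honours Pr(6) = 1/2

for label, dist in [
    ("no information", no_information),
    ("Pr(6) = 1/2, equal split", maker_claim),
    ("Pr(6) = 1/2, lopsided split", lopsided),
]:
    print(f"{label}: entropy = {entropy_bits(dist):.3f} bits")
```

Both of the last two distributions satisfy the manufacturer’s claim, but the equal split has the higher entropy, and neither is as uncertain as the unconstrained uniform die.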

The Role of The Observer:

Jaynes asked a great question about the information content of a message. He noted:

In a communication process, the message m(i) is assigned probability p(i), and the entropy H, is a measure of information. But WHOSE information?… The probabilities assigned to individual messages are not measurable frequencies; they are only a means of describing a state of knowledge.

In the frequentist version of statistics, a probability is generally treated as something fixed. In the Bayesian version, however, the probability is not a fixed entity. It represents a state of knowledge. Jaynes continues:

Entropy, H, measures not the information of the sender, but the ignorance of the receiver that is removed by the receipt of the message.

To me, this brings up the importance of the observer and circularity. As the great cybernetician Heinz von Foerster said:

“The essential contribution of cybernetics to epistemology is the ability to change an open system into a closed system, especially as regards the closing of a linear, open, infinite causal nexus into closed, finite, circular causality.”

Let’s go back to the example of a coin. If I were an alien who knew nothing about coins, should my maximum entropy model include only the two possibilities of heads or tails? Why should it not include the coin landing on its edge? Or if a magician is tossing the coin, should I account for the coin vanishing into thin air? The assumption of just two possibilities (heads or tails) is prior information that we are already accounting for when we say that the probability of heads or tails is 0.5. As we gain more knowledge about the coin toss, we can update the model to reflect it, and at the same time change the model to a new state of maximum entropy. This iterative, closed loop process is the backbone of scientific enquiry and skepticism. The use of the maximum entropy model is a stance that we are taking to state our knowledge. Perhaps a better way to explain the coin toss is this: given our lack of knowledge about the coin, we are saying that heads is not more likely to happen than tails until we find more evidence. Let’s look at another interesting example where I think the maximum entropy model comes up.

The Veil of Ignorance:

The veil of ignorance is an idea about ethics proposed by the great American political philosopher John Rawls. Loosely put, in this thought experiment, Rawls asks us what kind of society we should aim for. He asks us to imagine that we are behind a veil of ignorance, where we are completely ignorant of our natural abilities, societal standing, family, etc. We are then randomly assigned a role in society. The big question then is: what should society be like so that this random assignment promotes fairness and equality? The random assignment is a maximum entropy model since any societal role is equally likely.

Final Words:

The maximum entropy principle is a way of saying do not put all of your eggs in one basket. It is a way to be aware of your biases and it is an ideal position for learning. It is similar to Epicurus’ principle of Multiple Explanations, which says: “Keep all the different hypotheses that are consistent with the facts.”

It is important to understand that “I don’t know” is a valid and acceptable answer. It marks the boundary for learning.

Jaynes explained maximum entropy as follows:

The maximum entropy distribution may be asserted for the positive reason that it is uniquely determined as the one which is maximally noncommittal with regard to missing information, instead of the negative one that there was no reason to think otherwise… Mathematically, the maximum entropy distribution has the important property that no possibility is ignored; it assigns positive weight to every situation that is not absolutely excluded by the given information.

We learned that probability and entropy are dependent on the observer. I will finish off with the wise words from James Dyke and Axel Kleidon.

Probability can now be seen as assigning a value to our ignorance about a particular system or hypothesis. Rather than the entropy of a system being a particular property of a system, it is instead a measure of how much we know about a system.

Please maintain social distance and wear masks. Stay safe and Always keep on learning…

In case you missed it, my last post was Destruction of Information/The Performance Paradox.