An Unforeseen Equivalence Between Uncertainty and Entropy

Uncertainty and entropy are related concepts, so we would expect some overlap, but the equality shown in this paper is unexpected. In Beta models, interactions between agents are evidence used to construct Beta distributions. In models based on the Beta model, such as Subjective Logic, uncertainty is defined to be inversely proportional to evidence. Entropy measures quantify how much information is lacking in a distribution. Uncertainty was neither intended nor expected to be an entropy measure. We discover that a specific entropy measure, which we call EDRB, coincides with uncertainty whenever uncertainty is defined. EDRB is the expected Kullback-Leibler divergence between two Bernoulli trials with parameters randomly selected from the distribution. EDRB allows us to apply the notion of uncertainty to other distributions that may occur in the context of Beta models.


Introduction
The Beta model paradigm is a powerful formal approach to studying trust. Bayesian logic is at the core of the Beta model: "agents with high integrity behave honestly" becomes "honest behaviour evidences high integrity". Its simplest incarnation is to apply Beta distributions naively, and this approach has limited success. However, more powerful and sophisticated approaches are widespread (e.g. [13,3,17]). A commonality among many approaches is that more evidence (in the form of observing instances of behaviour) yields more certainty of an opinion. Uncertainty is inversely proportional to the amount of evidence.
Evidence is often used in machine learning. It is no surprise that there is a close link between trust models and machine learning, since the goal is to automatically create a model based on observed data. The Beta model is based on a simple Bayesian technique found in machine learning. More involved techniques may introduce hidden variables [13] or hidden Markov models [3,18]. Uncertainty as the inverse of (or lack of) evidence makes sense in this context.
We have obtained successful results applying information theory to analyse trust ratings [15,16]. Informative ratings are more useful than uninformative ones. Others have applied information theory to trust modelling in different ways, e.g. [1,2]. However, these approaches contrast with the evidence-based approaches; they were not considered to be equivalent approaches. In fact, we have studied the possibility of combining uncertainty and entropy, to understand their interplay, in [12], and we had not expected that they would turn out to coincide.
The purpose of this paper is to demonstrate a surprising equivalence. The uncertainty used in this paper is fundamentally different from entropy in information theory. There are various entropy measures that one can define, but the standard measures do not yield an equivalence to uncertainty. However, we formulate a specific entropy measure, which we call expected Kullback-Leibler divergence of random-parameter Bernoulli trials (EDRB), that does equate to uncertainty. The proof is based on specific properties of functions related to Beta distributions, and does not seem to provide insight into why the two are equivalent.
The main motivation for this paper is to present this surprising result. However, there are possible practical applications too. First, EDRB allows us to compute the uncertainty of a given Beta distribution with unknown parameters. Secondly, EDRB can provide the uncertainty of distributions other than the Beta distribution, generalising uncertainty. Thirdly, using EDRB, we can apply techniques from information theory to uncertainty (e.g. apply MAXENT on uncertainty).
The paper is organised as follows: In Section 2, we introduce and briefly discuss existing definitions and properties. In Section 3, we discuss the general relation between uncertainty and entropy in the setting of the Beta model. In Section 4, we present our main result, Theorem 1. Finally, in Section 5, we look at the application of Theorem 1 to more general opinions.

Preliminaries
In this section, we introduce the existing definitions and formalisms that are relevant to our work. The definitions can be grouped into two types: definitions surrounding the Beta model and related models (Section 2.1), and information-theoretic definitions (Section 2.2).

Beta models
The Beta models are a paradigm, and whether a specific model is a Beta model is up for debate. The core idea behind Beta models is a specific Bayesian approach to evidence [4]. Interactions with agents form evidence, and they are used to construct an opinion. The interactions correspond to Bernoulli trials [5]:

Definition 1. A Bernoulli trial has two outcomes, "success" and "failure", and the probability of success is the same every time the trial is performed. A Bernoulli distribution is a discrete distribution with two outcomes, 0 and 1. Its probability mass function f_B satisfies f_B(0; p) = 1 − p and f_B(1; p) = p. A random variable B_i from a Bernoulli trial is distributed according to the Bernoulli distribution, so P(B_i = 1) = p and P(B_i = 0) = 1 − p.
There are agents A ∈ A. Each agent A has an unknown parameter x_A, called its integrity. An agent may betray another agent, or it may cooperate. Which choice an agent makes is assumed to be a Bernoulli trial, where the probability of cooperating is equal to its integrity. A series of interactions, therefore, is a series of Bernoulli trials. Let B_{A,i} be the random variable corresponding to the i-th interaction with agent A; then P(B_{A,i} = 1) = x_A. We refer to outcome 1 as success and 0 as failure. However, x_A is not a known quantity, so we apply the Bayesian idea of introducing a random variable X_A for the integrity of agent A. An opinion about an agent can be denoted as the probability density function p_{X_A}. We assume that the opinion without evidence is the uniform distribution, so p_{X_A}(x_A) = 1. One reason to select this prior distribution is the principle of maximum entropy, which essentially dictates that we should pick the distribution with the highest entropy if we want to model that we do not have any evidence, and this distribution is the uniform distribution. Another reason to select this prior distribution is that it simplifies the notion of combining opinions. Most importantly, the prior can be changed to any arbitrary probability density function f, simply by multiplying by f.

The name "Beta model" comes from a special relationship to Beta distributions. The Beta distribution is defined as follows [5,8]:

Definition 2. The Beta distribution is a continuous distribution with support on the range [0, 1], with probability density function f_β(x; α, β) = x^(α−1)(1 − x)^(β−1) / B(α, β), where B is the Beta function, B(α, β) = ∫₀¹ x^(α−1)(1 − x)^(β−1) dx, which acts as a normalisation factor. Its cumulative distribution function is F(x; α, β) = (1/B(α, β)) ∫₀ˣ y^(α−1)(1 − y)^(β−1) dy, which is also known as the regularised incomplete Beta function I_x(α, β).
We use important properties of the Beta function and the regularised incomplete Beta function (see [8]):

Proposition 1. The following two equalities hold: B(α + 1, β) = B(α, β) · α/(α + β), and I_x(α + 1, β) = I_x(α, β) − x^α(1 − x)^β / (α · B(α, β)).

Given the relations between the random variables, we find that any opinion p_{X_A}(x_A | B_{A,1}, B_{A,2}, . . .) is a Beta distribution. In fact, if the outcomes of the Bernoulli trials B_{A,1}, . . ., B_{A,n} contain n_s successes and n − n_s = n_f failures, then the opinion is p_{X_A}(x | B_{A,1}, . . ., B_{A,n}) = f_β(x; n_s + 1, n_f + 1). We can define a fusion operator ⊕ as the normalised product, p_1(x) ⊕ p_2(x) = p_1(x) · p_2(x) / ∫₀¹ p_1(y) · p_2(y) dy.

Using a distribution to denote an opinion is a feasible approach, based on Bayesian logic, but the results are not intuitively obvious to the people who may use the opinions. Subjective Logic is a formalism within the Beta model paradigm, developed with the purpose of being understandable to non-experts [7]. A Subjective Logic opinion is defined as follows [6]:

Definition 3. An opinion is a triple of components (b, d, u), for positive real b, d, u with b + d + u = 1. The first component is belief, the second is disbelief, and the third is uncertainty.
Subjective Logic also has a fusion operator, denoted (b, d, u) ⊕ (b′, d′, u′). The purpose of fusion in Subjective Logic is the same as fusion of distributions, namely to merge evidence. See [6].
That there is an isomorphism between fusion of Beta distributions and Subjective Logic fusion is a known result [7]. In fact, this isomorphism is the primary argument in favour of the shape of Definition 4. It turns out that there is a family of isomorphisms between the two:

Proposition 4. Let B, S be the groups of Beta distributions with fusion, and of SL opinions with SL fusion. Let f_r be the function mapping f_β(x; α, β) to the opinion (b, d, u) with u = r/(α + β − 2 + r), b = (α − 1)u/r and d = (β − 1)u/r. For r > 0, f_r is an isomorphism between B and S.

Proof. Keep in mind Proposition 3, so fusion simply adds α's and β's. The inverse of f_r is f_r^{-1}(b, d, u) = (br/u + 1, dr/u + 1), which can be checked by substitution (w.l.o.g. for α). It remains to prove that f_r and f_r^{-1} are homomorphisms between B and S.

Since Beta distributions and Subjective Logic opinions are isomorphic w.r.t. fusion, we can apply notions of Subjective Logic directly to Beta distributions. So we can speak of the uncertainty unc_r(f_β(x; α, β)) = r/(α + β − 2 + r). Unless we explicitly state which isomorphism f_r we use, we assume that f_1 was used, so unc = unc_1. Observe that a Beta distribution based on n = n_s + n_f pieces of evidence has uncertainty unc(f_β(x; n_s + 1, n_f + 1)) = 1/(n + 1), so the inverse of uncertainty is equal to the amount of evidence (plus 1, to avoid divide-by-zero).
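The correspondence between evidence counts, Beta parameters and Subjective Logic opinions can be sketched in code. This is a minimal sketch assuming u_r = r/(α + β − 2 + r), which is consistent with the stated inverse f_r^{-1} = (br/u + 1, dr/u + 1); the helper names (beta_to_sl, sl_to_beta, fuse_beta) are ours, not part of any Subjective Logic library.

```python
def beta_to_sl(alpha, beta, r=1.0):
    # f_r: Beta parameters -> Subjective Logic opinion (b, d, u).
    # Reconstructed from the inverse f_r^{-1} = (b r/u + 1, d r/u + 1)
    # together with the constraint b + d + u = 1.
    u = r / (alpha + beta - 2 + r)
    return (alpha - 1) * u / r, (beta - 1) * u / r, u

def sl_to_beta(b, d, u, r=1.0):
    # f_r^{-1}, as stated in the text.
    return b * r / u + 1, d * r / u + 1

def fuse_beta(p1, p2):
    # Fusion merges evidence: the alpha's and beta's add, minus the
    # shared uniform prior (cf. Proposition 3).
    return p1[0] + p2[0] - 1, p1[1] + p2[1] - 1

# 3 successes and 1 failure give Beta(4, 2); unc_1 = 1/(n_e + 1) = 1/5.
b, d, u = beta_to_sl(4, 2)
```

For example, fusing Beta(4, 2) with Beta(3, 3) yields Beta(6, 4), i.e. 8 pieces of evidence and uncertainty 1/9 under f_1.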

Information Theory
A core notion in information theory is the notion of surprisal, also known as self-information or information content. The symbol I_X is often used, but it is also used for the regularised incomplete Beta function, so we denote the surprisal of X with J_X instead. The surprisal is defined as J_X(x) = − log(P(X = x)) or J_X(x) = − log(p_X(x)) for a discrete and continuous random variable X, respectively.
Shannon entropy is used to measure the expected amount of information carried in a random variable, which is determined by the uncertainty of the random variable [9]:

Definition 5. The Shannon entropy of a discrete random variable X is H(X) = − Σ_x P(X = x) log(P(X = x)) = E(J_X(X)).

The Shannon entropy is maximal when all possible outcomes are equiprobable. This means that our expected surprisal is maximal, which is a common way to express that we know nothing about the random variable. Shannon entropy can be generalised for continuous random variables to differential entropy. Differential entropy does not provide absolute values, as values can go below 0, but it is useful for measuring the difference in information present in distributions.

Definition 6. The differential entropy of a continuous random variable X is h(X) = − ∫ p_X(x) log(p_X(x)) dx.

Kullback-Leibler divergence, also known as relative entropy, measures the distance from one distribution to another.

Definition 7. For discrete random variables X, Y, the Kullback-Leibler divergence from X to Y is D_KL(X||Y) = Σ_x P(X = x) log(P(X = x)/P(Y = x)). For continuous random variables X, Y, it is D_KL(X||Y) = ∫ p_X(x) log(p_X(x)/p_Y(x)) dx.

Typically, X is the "true" random variable and Y is a model, in which case D_KL(X||Y) tells us how far the model is from the truth. A divergence of 0 implies that the two random variables are identically distributed.
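Since the rest of the paper applies these definitions to Bernoulli trials, it may help to see them spelled out for that special case. A minimal sketch; the function names are ours.

```python
import math

def surprisal(p, base=2.0):
    # J_X(x) = -log P(X = x), for an outcome with probability p.
    return -math.log(p, base)

def bernoulli_entropy(p, base=2.0):
    # Shannon entropy of a Bernoulli(p) trial.
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log(p, base) - (1 - p) * math.log(1 - p, base)

def kl_bernoulli(x, y, base=2.0):
    # D_KL(Bern(x) || Bern(y)): expected extra surprisal incurred by
    # modelling a Bern(x) source with Bern(y).
    return (x * math.log(x / y, base)
            + (1 - x) * math.log((1 - x) / (1 - y), base))
```

A fair coin has maximal entropy (1 bit), and the divergence from a distribution to itself is 0.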

Beta Models and Entropy
In this section, we discuss different entropy measures that can be applied to a Beta distribution. We formally state each of these measures, and we discuss their intuitive meaning, their application, and how they differ from uncertainty. The measure of entropy that does match uncertainty will be introduced in the next section. This section helps to appreciate why that measure of entropy is the way it is.

Integrity Parameter Entropy
The most obvious measure of entropy that can be applied is the (differential) entropy of the integrity parameter. To be precise, the entropy measure is h(X_A) = − ∫₀¹ p_{X_A}(x) log(p_{X_A}(x)) dx. The standard intuition of differential entropy applies. In the case of differential entropy, values are negative, and the absolute quantity tells you how much information is gained relative to the uniform distribution. The information that is gained is about the precise value of the integrity parameter. Differently put, it measures how far away from the uniform distribution the values in the distribution tend to be. Figure 1 provides two examples of graphs; Figure 1a depicts a distribution with less information about integrity than Figure 1b. In reality, it is not important whether the integrity value is exactly 0.7 or, say, 0.705. For the purpose of measuring the entropy of the integrity value, these two values are considered to be completely different. For graphs such as the ones depicted in Figure 1, this is not a major issue, since the probabilities of similar integrity values tend to be similar too. However, in more extreme cases, such as in the graph depicted in Figure 2, it becomes an issue for our intuition. The graphs in Figures 2a and 2b are identical through the lens of the information measure, since both distributions have support on half the interval, and are uniformly distributed over the part with support. In both cases, the information gained over the uniform distribution is 1 bit, since we can exclude exactly half the possibilities. However, if we want to know whether we are dealing with a reliable person, the distribution in Figure 2a is likely to be helpful, but the one in Figure 2b is not.
Uncertainty is inversely proportional to the amount of evidence (i.e., the sum of the parameters of the Beta distribution). Adding evidence tends to increase the information about the integrity parameter too, as the peak tends to become narrower, as illustrated in Figure 1. However, it is not necessarily the case that adding evidence decreases the entropy, as illustrated in Figure 3. The distribution f_β(x; 8, 1) has a differential entropy of −1.7376 bits, whereas the distribution f_β(x; 8, 2) has a differential entropy of −1.1468 bits; adding one piece of evidence increased the entropy. Therefore, entropy of the integrity parameter fails to meet the basic criterion of an uncertainty measure, namely that it is monotonically decreasing as evidence is added.
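The two entropy values can be checked numerically. The sketch below estimates differential entropy by midpoint-rule integration; the values come out negative (−1.7376 and −1.1468 bits), i.e. 1.7376 and 1.1468 bits of information gained relative to the uniform distribution. Function names are ours.

```python
import math

def beta_pdf(x, a, b):
    # f_beta(x; a, b), with B(a, b) computed via Gamma functions.
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

def differential_entropy_bits(pdf, n=200000):
    # Midpoint-rule estimate of h(X) = -integral of f(x) log2 f(x) on (0, 1).
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        f = pdf((i + 0.5) * h)
        if f > 0.0:
            total -= f * math.log2(f) * h
    return total

h81 = differential_entropy_bits(lambda x: beta_pdf(x, 8, 1))  # ~ -1.7376
h82 = differential_entropy_bits(lambda x: beta_pdf(x, 8, 2))  # ~ -1.1468
```

The check confirms that h82 > h81: the extra piece of (negative) evidence increased the entropy.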

Bernoulli Trial Entropy
An ingredient that was missing from integrity parameter entropy was taking into account the values of the integrity parameter, rather than just its probability density. Arguably, we are not necessarily interested in the exact integrity of other agents, but we are interested in knowing whether they will betray us or not. Whether an agent would betray us is determined by a Bernoulli trial based on the integrity parameter. In other words, an agent will not betray us with a probability equal to its integrity parameter. Since the Beta distribution is the estimate of that integrity parameter, the expected entropy of the Bernoulli trial is E_x(H(f_B(x))) = ∫₀¹ p_{X_A}(x) · (−x log(x) − (1 − x) log(1 − x)) dx. Although we are computing the expectation of the entropy, the standard intuition of entropy applies: how much information about the outcome of the Bernoulli trial do we (expect to) have. The entropy of a Bernoulli trial is between 0 and 1 bits, where values close to 0 bits mean near certainty about whether we will be betrayed or not. The Beta distribution with maximal uncertainty, the uniform distribution, has an entropy of 0.7213 bits in this measure; strictly less than 1.
It can certainly be useful to measure how much you know about the Bernoulli trial, but this measure has barely any connection to uncertainty. Consider a user with an integrity parameter of 0.5. A reasonable progression of Beta distributions as more evidence is accumulated is depicted in Figure 4. What we see in Figure 4 is that we are increasingly certain that the integrity parameter must be near 0.5. If the integrity parameter is 0.5, then the Bernoulli trial has 1 bit of entropy, whereas values near the extremes have near 0 bits of entropy. As the evidence accumulates, this measure converges to 1 bit of entropy. Again, this breaks the most basic requirement of an uncertainty measure, namely that it decreases as evidence is added.
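For integer parameters, the expected Bernoulli trial entropy has a closed form, using the identity E[x ln x] = (α/(α + β))(ψ(α + 1) − ψ(α + β + 1)); this is a standard Beta moment identity that we assume here (it is not taken from the text), and digamma differences at integer arguments reduce to harmonic sums. A sketch:

```python
import math

def expected_bernoulli_entropy_bits(a, b):
    # E_{x ~ Beta(a, b)}[H(Bern(x))] for integer a, b >= 1, via the
    # (assumed) identity E[x ln x] = (a/(a+b)) (psi(a+1) - psi(a+b+1)).
    n = a + b
    d_a = -sum(1.0 / k for k in range(a + 1, n + 1))  # psi(a+1) - psi(n+1)
    d_b = -sum(1.0 / k for k in range(b + 1, n + 1))  # psi(b+1) - psi(n+1)
    return -(a * d_a + b * d_b) / (n * math.log(2))

uniform = expected_bernoulli_entropy_bits(1, 1)   # ~ 0.7213 bits
peaked = expected_bernoulli_entropy_bits(50, 50)  # close to 1 bit
```

This reproduces the 0.7213 bits quoted for the uniform distribution, and shows the convergence towards 1 bit as symmetric evidence accumulates.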

KL-Divergence from Truth
The problem with Bernoulli trial entropy as a measure for uncertainty is that, as evidence is added, it provides a value that is closer to the true Bernoulli entropy of that agent, rather than a smaller value. Assume that, somehow, we have access to the true integrity parameter of an agent; then we can measure the information-theoretic distance to that value. The standard technique is to use Kullback-Leibler divergence. Given a true integrity parameter of value x, we can apply KL-divergence to the Bernoulli trials as E_y(D_KL(f_B(x)||f_B(y))) = ∫₀¹ p_{X_A}(y) · (x log(x/y) + (1 − x) log((1 − x)/(1 − y))) dy. As an example, say we measure 6 successes and 1 failure with an agent with parameter 0.85; then we get the KL-divergence from the truth as ∫₀¹ f_β(y; 6, 1) · (0.85 log(0.85/y) + 0.15 log(0.15/(1 − y))) dy = 0.1247 bits. However, it is possible that we measure 6 successes and 1 failure with an agent with parameter 0.4, in which case the distance is 1.2460 bits. The measure does not depend on just the distribution itself.
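Both worked values can be reproduced without numerical integration, using E[ln y] = ψ(α) − ψ(α + β) for y ∼ f_β(y; α, β); this is a standard identity (an assumption of ours, not a formula from the text), and the helper name is ours.

```python
import math

def kl_from_truth_bits(x, a, b):
    # E_{y ~ Beta(a, b)}[ D_KL(Bern(x) || Bern(y)) ] in bits, for integer
    # a, b, using the (assumed) identity E[ln y] = psi(a) - psi(a+b);
    # the digamma difference is a harmonic sum at integer arguments.
    n = a + b
    e_ln_y = -sum(1.0 / k for k in range(a, n))     # psi(a) - psi(n)
    e_ln_1my = -sum(1.0 / k for k in range(b, n))   # psi(b) - psi(n)
    nats = (x * (math.log(x) - e_ln_y)
            + (1 - x) * (math.log(1 - x) - e_ln_1my))
    return nats / math.log(2)
```

kl_from_truth_bits(0.85, 6, 1) reproduces the 0.1247 bits above, and kl_from_truth_bits(0.4, 6, 1) reproduces 1.2460 bits, illustrating that the measure depends on the true parameter as well as the distribution.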
This measure cannot be applied to compute the entropy of an arbitrary Beta distribution, since the true integrity parameter is unknown. Notice that the shape of the equation is such that the formula for the expectation of what behaviour will be observed is similar to the equation for the expectation of the integrity parameter given the observed behaviour. By applying Bayes' theorem, we can alter this term to talk about the expected true integrity parameter given the observed behaviour: E_{x,y}(D_KL(f_B(x)||f_B(y))). This formula turns out to be EDRB, as we see in the next section.

Entropy-Uncertainty Equivalence
It may not be immediately obvious what it means for entropy measures and uncertainty measures to be equivalent. Both uncertainty and EDRB (expected KL-divergence of random Bernoulli trials) are actually families of measures, rather than a single measure. Recall that if n_e is the amount of evidence, the general expression for uncertainty is r/(n_e + r). EDRB provides different outcomes depending on the choice of the base of the logarithm b; we will prove that it is log(b)/(n_e + 2). In the case r = 2, b = e², the two formulas are equal. However, we argue that the equivalence is stronger, since every member of the two families shares the crucial property that its inverse is a linear function of the amount of evidence.
Our goal, therefore, is to prove that E_{x,y}(D_KL(f_B(x)||f_B(y))) = log(b)/(n_e + 2). Note that if we have s successes and f failures, our Beta distribution is f_β(x; α, β), with α = s + 1 and β = f + 1. Therefore, α + β = s + f + 2 = n_e + 2, and we can state our theorem as follows:

Theorem 1. For x and y independently distributed according to f_β(x; α, β), E_{x,y}(D_KL(f_B(x)||f_B(y))) = log(b)/(α + β).

Proof. We will prove that E_{x,y}(x log(x/y)) = log(b) · β/(α + β)². Swapping α and β while substituting x for 1 − x and y for 1 − y, it follows that E_{x,y}((1 − x) log((1 − x)/(1 − y))) = log(b) · α/(α + β)². This suffices to prove the theorem, since log(b) · β/(α + β)² + log(b) · α/(α + β)² = log(b)/(α + β). The computation uses Proposition 1; the terms in square brackets evaluate to 0 at both 0 and 1, after which the formula simplifies.

There are two ways to interpret the theorem. Firstly, we can use the intuition from Section 3.3, and say that f_β(x; α, β) is the Bayesian estimate of the true integrity parameter that generated the history, and we measure the expected KL-divergence between the Bernoulli trial with the true integrity parameter and one with a new randomly selected parameter (y). Simply put, we reuse the measure from Section 3.3, but substitute the expected integrity for the true integrity. KL-divergence is an oft-used way to measure the quality of a model distribution, compared to the real one. EDRB measures the expected KL-divergence between the Bernoulli trial based on an estimated true parameter and one based on an estimated model. Of course, taking the expectation of the true integrity used for the Bernoulli trial is intuitively dubious.
The alternative intuition does not involve true integrities, for this reason. EDRB can be interpreted as saying: given two agents with the same history, how much do we learn about one agent if we observe a new interaction with the other? As more evidence accumulates, the possible choices of the parameter for the Bernoulli trial become more centred around a specific value. If the probability that two Bernoulli trials use similar parameters increases, then the KL-divergence between the two decreases. This intuition is a more direct reading of the actual formula, as we are taking the expectation over a pair of integrity parameters, distributed along the same Beta distribution. The weakness of this intuition is that KL-divergence is an asymmetric measure, where one distribution represents the true distribution and the other the model distribution, whereas this intuition measures the distance between two model distributions.
While both intuitions are imperfect, they do offer an explanation of why we might expect uncertainty and EDRB to be related. The fact that they are indeed equivalent is non-obvious, however. The proof does not provide us with insight as to why they are equivalent, other than the fact that they are. Based on the fact that the intuitions are imperfect, and the proof does not provide any intuition either, we consider the equivalence to be surprising.
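Theorem 1 lends itself to a numerical check. The sketch below computes EDRB in nats (i.e. base b = e) for integer-parameter Beta distributions, using two standard Beta moment identities stated in the comments as assumptions of ours (they are not steps from the proof above), and compares the result with 1/(α + β):

```python
import math

def edrb_nats(a, b):
    # E_{x,y ~ Beta(a, b)}[ D_KL(Bern(x) || Bern(y)) ] with natural logs,
    # for integer parameters, via two assumed moment identities:
    #   E[ln y]   = psi(a) - psi(a+b)
    #   E[x ln x] = (a/(a+b)) (psi(a+1) - psi(a+b+1))
    # Digamma differences at integers reduce to harmonic sums.
    n = a + b
    psi_diff = lambda lo, hi: -sum(1.0 / k for k in range(lo, hi))
    e_x_ln_x = (a / n) * psi_diff(a + 1, n + 1)
    e_1mx_ln_1mx = (b / n) * psi_diff(b + 1, n + 1)
    return (e_x_ln_x - (a / n) * psi_diff(a, n)
            + e_1mx_ln_1mx - (b / n) * psi_diff(b, n))

# Theorem 1 predicts log(e)/(alpha + beta) = 1/(alpha + beta) in nats.
checks = {(1, 1): 1 / 2, (6, 1): 1 / 7, (8, 4): 1 / 12, (9, 2): 1 / 11}
```

For every case in `checks`, the computed EDRB agrees with 1/(α + β) to machine precision.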
Uncertainty is a useful concept and a basic tenet of Subjective Logic. To compute the uncertainty from a Beta distribution f_β(x; α, β), simply take u = 1/(α + β − 1). However, this definition uses the parameters of the distribution, rather than the probability density function. Given a probability density function f that happens to represent a Beta distribution, there is no elegant way to compute the uncertainty. For example, if f = 6(x(1 − x)^5 + (1 − x)^6), then how do we determine its uncertainty, given that it may not be trivial to realise that f = f_β(x; 1, 6)? Alternatively, we can compute E_{X,Y}(D_KL(f_B(X)||f_B(Y))) for X, Y ∼ f, and obtain 1/(α + β) without knowing α and β.
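This parameter-free computation can be sketched directly: treat f purely as a black-box density and approximate the required expectations by midpoint-rule integration. The helper name and grid size are ours; since X and Y are independent draws from f, the double expectation factorises into one-dimensional moments.

```python
import math

def edrb_from_pdf_nats(pdf, n=2000):
    # E_{X,Y ~ pdf}[ D_KL(Bern(X) || Bern(Y)) ] in nats, by midpoint-rule
    # integration on (0, 1). Independence of X and Y lets the expectation
    # factorise into one-dimensional moments of the density.
    h = 1.0 / n
    xs = [(i + 0.5) * h for i in range(n)]
    fs = [pdf(x) for x in xs]
    e_x = sum(f * x for f, x in zip(fs, xs)) * h
    e_ln_y = sum(f * math.log(x) for f, x in zip(fs, xs)) * h
    e_ln_1my = sum(f * math.log(1 - x) for f, x in zip(fs, xs)) * h
    e_x_ln_x = sum(f * x * math.log(x) for f, x in zip(fs, xs)) * h
    e_m = sum(f * (1 - x) * math.log(1 - x) for f, x in zip(fs, xs)) * h
    return e_x_ln_x - e_x * e_ln_y + e_m - (1 - e_x) * e_ln_1my

# The example density from the text; it simplifies to 6(1 - x)^5.
f = lambda x: 6 * (x * (1 - x) ** 5 + (1 - x) ** 6)
est = edrb_from_pdf_nats(f)  # ~ 1/7, without ever identifying alpha, beta
```

The estimate comes out near 1/7 in nats, as predicted by α + β = 7, even though the code never inspects the parameters.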
The fact that we can use our new measure as an alternative way to compute the uncertainty of a Beta distribution, without explicitly using the parameters, is interesting in itself. More interesting, however, is the fact that the input probability density function need not be a Beta distribution at all for it to work. As we argue more rigorously in the next section, there are cases where it does not make sense to use Beta distributions as opinions. These cases have been recognised implicitly in the literature (e.g. [14]), but are not typically explicitly addressed. We can now reason about the uncertainty present in the more esoteric distributions that may pop up. In the next section, we present some of the implications for these generalised distributions.

Generalised Opinions
That the models of information fusion found in Subjective Logic are isomorphic to Beta distributions is not surprising. After all, these models are created with this purpose in mind. Subjective Logic further incorporates logical operations and transitive trust operations. As shown in [10] and [11] respectively, the resulting distribution of these operations is not a Beta distribution (using the assumptions of the Beta model). In other words, the isomorphism does not hold if we add the new operations. In this section, we show examples of distributions resulting from logical or transitive operations, discuss why they are not Beta distributions, and extend the result from the previous section to these distributions.

Opinion Logic
Consider performing logic on the opinions. For example, we have a distribution for A and one for A′, but in order to obtain a success, we need both A and A′ to succeed. In the case that A and A′ are independent agents, the probability that A ∧ A′ succeeds is a Bernoulli trial with parameter x_A × x_A′ [10]. According to [10], if we want to obtain our opinion on A ∧ A′, based on our opinions on A and A′, then we need to take their product distribution: p_{X_A · X_A′}(x) = ∫ₓ¹ p_{X_A}(y) · p_{X_A′}(x/y) · (1/y) dy, the product distribution of the opinions on A and A′.
In Subjective Logic, conjunction is defined directly on the opinion triples (b, d, u). In Figure 5, we see the conjunction of f_β(x; 8, 4) and f_β(x; 9, 2) as derived from the product distribution, as well as the results computed using the Subjective Logic conjunction definition under f_1 and f_5. We can see that neither f_1 nor f_5 is an isomorphism w.r.t. conjunction, since the graphs differ. In fact, for no choice of r, nor for any other Subjective Logic definition of conjunction, will f_r be an isomorphism. The reason is that all opinions in Subjective Logic are isomorphic to a Beta distribution, but the result of the product distribution is generally not (in fact, almost never) a Beta distribution. Therefore, no isomorphism can exist.
Although the resulting opinion is not a Beta distribution, we can compute the uncertainty via its equivalence to EDRB. The uncertainty of the conjunction of f_β(x; 8, 4) and f_β(x; 9, 2) is, therefore, equal to 0.0775.
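Such uncertainties can be estimated by Monte Carlo sampling, since EDRB only requires drawing pairs of integrity parameters from the opinion. The sketch below first sanity-checks the estimator against Theorem 1 on a plain Beta distribution (in nats), then applies it to the product distribution; we do not assert the quoted 0.0775 here, since recovering it also depends on the convention used to convert EDRB back into an uncertainty value.

```python
import math
import random

def kl_bern_nats(x, y):
    # D_KL(Bern(x) || Bern(y)) with natural logarithms.
    return x * math.log(x / y) + (1 - x) * math.log((1 - x) / (1 - y))

def mc_edrb_nats(sample, n=200000, seed=1):
    # Monte Carlo estimate of E[D_KL(Bern(X) || Bern(Y))] for i.i.d.
    # X, Y produced by `sample` (a function of a random.Random instance).
    rng = random.Random(seed)
    return sum(kl_bern_nats(sample(rng), sample(rng)) for _ in range(n)) / n

# Sanity check against Theorem 1: Beta(9, 2) should give 1/11 in nats.
plain = mc_edrb_nats(lambda r: r.betavariate(9, 2))

# Conjunction opinion: success requires both agents to succeed, so the
# integrity parameter is the product of two independent Beta draws.
conj = mc_edrb_nats(lambda r: r.betavariate(8, 4) * r.betavariate(9, 2))
```

The same estimator works for any opinion we can sample from, which is exactly what makes EDRB usable on generalised opinions.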

Transitive Trust
Transitive trust is a fiercely debated topic. Using the assumptions of the Beta model, an issue arises. The formula contains a term χ, which is the attacker's strategy. In other words: how to use the advice of another agent depends on how that agent would act if he were malicious. See [11] for more details. The attacker strategy is not a topic for this paper, so we will assume the simplest attack strategy: random behaviour.
If an advisor A is honest with probability x_A, and the advisor gives us the opinion p_{X_c}, then our resulting opinion is simply x_A · p_{X_c}(x_c) + (1 − x_A). This can be derived from Theorem 2 in [11]. However, the intuition behind it is also clear: if the advisor speaks the truth, we should listen, and if he lies, we know nothing. We do not typically know x_A, but we can use our opinion p_{X_A} to estimate this value.
The result of obtaining an opinion from advice, therefore, is not a Beta distribution, but a weighted sum of Beta distributions. However, as in Section 5.1, Subjective Logic must return a Beta distribution as the result of transitive trust; in fact, Subjective Logic defines its own transitive trust operator. In Figure 6, we see the propagation of f_β(x; 8, 4) and f_β(x; 9, 2) as derived from summing Beta distributions, as well as the results computed using the Subjective Logic propagation definition under f_1 and f_5. Compared to conjunction, we see that the difference between the two approaches is even larger. In particular, we notice that Figure 6a has raised flat tails. These raised flat tails are a consequence of the fact that, no matter what malicious agents say, if they are lying, then extremely high/low integrity values remain probable.
Although the resulting opinion is not a Beta distribution, we can compute the uncertainty via its equivalence to EDRB. The uncertainty of the opinion resulting from hearing f_β(x; 9, 2) from an agent that we have the opinion f_β(x; 8, 4) about is equal to 0.5354. In this case, the uncertainty is (significantly) larger than the uncertainty of f_β(x; 9, 2) (which is 0.1000). However, it need not be the case that summing Beta distributions changes the uncertainty in a meaningful way. In particular, the uncertainty of (1/3)(f_β(x; 3, 1) + f_β(x; 2, 2) + f_β(x; 1, 3)) is the maximum, 1, even though the individual distributions have far smaller uncertainty. There may be a more subtle pattern in the EDRB entropy of a sum of Beta distributions, but this is future work.

A reasonable approach to selecting a strategy for the attacker is to select the strategy that is the least informative. Typically, that means the strategy that gives the highest entropy. No closed formula has been found that maximises either the integrity entropy or the Bernoulli trial entropy. An open question is whether this approach can be more fruitful when using EDRB as the measure of entropy.
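The mixture example can be verified symbolically: the three Beta densities average out to exactly the uniform distribution, which is the distribution of maximal uncertainty. A quick check (function names are ours):

```python
import math

def beta_pdf(x, a, b):
    # f_beta(x; a, b), with B(a, b) computed via Gamma functions.
    norm = math.gamma(a) * math.gamma(b) / math.gamma(a + b)
    return x ** (a - 1) * (1 - x) ** (b - 1) / norm

def mixture(x):
    # (1/3)(f_beta(x; 3, 1) + f_beta(x; 2, 2) + f_beta(x; 1, 3))
    # = x^2 + 2x(1 - x) + (1 - x)^2 = 1: exactly the uniform density.
    return (beta_pdf(x, 3, 1) + beta_pdf(x, 2, 2) + beta_pdf(x, 1, 3)) / 3
```

Since the mixture is the uniform distribution (α = β = 1), Theorem 1 gives it the largest EDRB of any Beta distribution, corresponding to the maximal uncertainty of 1 quoted above.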

Conclusion
Theorem 1 is our main result. It states that uncertainty (the inverse of the amount of evidence) is equal to a specific measure of entropy that we introduce: expected Kullback-Leibler divergence of random-parameter Bernoulli trials (EDRB). The intuition behind EDRB is that it measures the expected distance between two Bernoulli trials selected from a distribution; a narrower distribution will have less distance between the Bernoulli trials.
While both entropy and uncertainty can be used to describe a lack of knowledge, any entropy measure is based on surprisal, whereas uncertainty is based on Bayesian evidence. Hence it is surprising that they should coincide.
We discuss alternative measures of entropy in Section 3. Measures such as integrity entropy and Bernoulli trial entropy certainly have use-cases. Uncertainty simply measures something other than what these two measures do.
Finally, we study the implications of having EDRB on generalised opinions. These are distributions other than the Beta distributions. In Section 5, we show how these distributions arise, and why they are of interest. We plan to further study the implications of generalised opinions under EDRB. In particular, we want to explore the notion of malicious advisors maximising EDRB entropy.

Fig. 1: Two Beta distributions equal in expected value, but not uncertainty.

Fig. 2: Two distributions with uniform support on half the interval.

Fig. 3: Two Beta distributions with entropy increasing when adding evidence.