From Bayesian Inference to Logical Bayesian Inference: A New Mathematical Frame for Semantic Communication and Machine Learning

Abstract. Bayesian Inference (BI) uses the Bayesian posterior as its inference tool, whereas Logical Bayesian Inference (LBI) uses the truth function or membership function. LBI was proposed because BI is not compatible with the classical Bayes' prediction and does not use logical probability, and hence cannot express semantic meaning. In LBI, statistical probability and logical probability are strictly distinguished, used at the same time, and linked by the third kind of Bayes' Theorem. A Shannon channel consists of a set of transition probability functions, whereas a semantic channel consists of a set of truth functions. When a sample is large enough, we can derive the semantic channel directly from the Shannon channel; otherwise, we can construct truth functions with parameters and optimize them by the Maximum Semantic Information (MSI) criterion. The MSI criterion is equivalent to the Maximum Likelihood (ML) criterion and compatible with the Regularized Least Squares (RLS) criterion. By matching the two channels with each other, we obtain the Channels' Matching (CM) algorithm. This algorithm can improve multi-label classification, maximum likelihood estimation (including unseen instance classification), and mixture models. In comparison with BI, LBI 1) uses the prior P(X) of X instead of that of Y or θ and fits cases where the source P(X) changes, 2) can be used to solve the denotations of labels, and 3) is more compatible with the classical Bayes' prediction and the likelihood method. LBI also provides a confirmation measure between -1 and 1 for induction.


1 Introduction
Bayesian Inference (BI) [1,2] was proposed by Bayesians. Bayesianism and Frequentism are opposed [3]. Frequentism claims that probability is objective and can be defined as the limit of the relative frequency of an event, whereas Bayesianism claims that probability is subjective or logical. Some Bayesians consider probability a degree of belief [3], whereas others, such as Keynes [4], Carnap [5], and Jaynes [6], the so-called logical Bayesians, consider probability a truth value. There is also a minority of logical Bayesians, such as Reichenbach [7] and the author of this paper, who use frequency to explain logical probability and the truth function.
Many frequentists, such as Fisher [8] and Shannon [9], also use Bayes' Theorem, but they are not Bayesians. The frequentists' main tool for hypothesis testing is Likelihood Inference (LI), which has achieved great successes. However, LI cannot make use of prior knowledge. For example, after the prior distribution P(x) of an instance x changes, the likelihood function P(x|θ_j) is no longer valid. To make use of prior knowledge and to emphasize subjective probability, some Bayesians proposed BI [1], which uses the Bayesian posterior P(θ|X), where X is a sequence of instances, as the inference tool. The Maximum Likelihood Estimation (MLE) was revised into the Maximum A Posteriori (MAP) estimation [2]. Although BI demonstrates some advantages, especially for working with small samples and for solving the frequency distribution of a frequency producer, it also has some limitations. The main limitations are: 1) it is incompatible with the classical Bayes' prediction, as shown by Eq. (1); 2) it does not use logical probabilities or truth functions and hence cannot solve semantic problems. To overcome these limitations, we propose Logical Bayesian Inference (LBI), following the earlier logical Bayesians in using the truth function as the inference tool, and following Fisher in using the likelihood method. The author also sets up a new mathematical frame employing LBI to improve semantic communication and machine learning.
LBI has the following features:
- It strictly distinguishes statistical probability and logical probability, uses both at the same time, and links them by the third kind of Bayes' Theorem, with which the likelihood function and the truth function can be converted into each other.
- It uses frequency to explain the truth function, as Reichenbach did, so that an optimized truth function can be used as the transition probability function P(y_j|x) to make the Bayes' prediction even if P(x) changes.
- It brings truth functions and likelihood functions into information formulas to obtain the generalized Kullback-Leibler (KL) information and the semantic mutual information, and uses the Maximum Semantic Information (MSI) criterion to optimize truth functions. The MSI criterion is equivalent to the Maximum Likelihood (ML) criterion and compatible with the Regularized Least Squares (RLS) criterion [10].
Within the new frame, we convert sampling sequences into sampling distributions and then use the cross-entropy method [10]. This method has become popular in the recent two decades because it suits large samples and resembles information-theoretic methods. This study is based on the author's studies twenty years ago on semantic information theory, with cross entropy and mutual cross entropy as tools [11-14]. It also relates to the author's recent studies on machine learning: simplifying multi-label classification [15], speeding up the MLE for tests and unseen instance classification [16], and improving the convergence of mixture models [17].
In the following sections, the author will discuss why LBI is employed (Section 2), introduce the mathematical basis (Section 3), state LBI (Section 4), introduce its applications to machine learning (Section 5), discuss induction (Section 6), and finally summarize the paper.

2 Why We Need LBI

A sample D includes n different sub-samples X_j. If D is large enough, we can obtain the distribution P(x, y) from D and the distribution P(x|y_j) from X_j.
A Shannon channel P(Y|X) consists of a set of Transition Probability Functions (TPFs) P(y_j|x), j=1, 2, …, n. A TPF P(y_j|x) is a good prediction tool. With Bayes' Theorem II (discussed in Section 3), we can make the probability prediction P(x|y_j) according to P(y_j|x) and P(x). Even if P(x) changes into P'(x), we can still obtain

P'(x|y_j) = P(y_j|x)P'(x)/P'(y_j),  P'(y_j) = Σ_i P(y_j|x_i)P'(x_i).   (1)

We call this probability prediction the "classical Bayes' prediction". However, if samples are not large enough, we cannot obtain continuous distributions P(y_j|x) or P(x|y_j). Therefore, Fisher proposed Likelihood Inference (LI) [8].
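As an illustration, the classical Bayes' prediction of Eq. (1) can be sketched in Python; the transition probabilities and priors below are hypothetical numbers, not taken from the paper:

```python
import numpy as np

def bayes_prediction(tpf_yj, prior):
    """Classical Bayes' prediction: combine a transition probability
    function P(y_j|x) with any prior P(x) to get P(x|y_j)."""
    joint = tpf_yj * prior            # P(y_j|x)P(x) for every x
    return joint / joint.sum()        # divide by P(y_j), the sum of the joint

# Hypothetical binary instance space x in {x_0, x_1}.
tpf  = np.array([0.2, 0.9])          # P(y_j|x), learned once
p_x  = np.array([0.5, 0.5])          # original prior P(x)
p_x2 = np.array([0.9, 0.1])          # changed prior P'(x)

print(bayes_prediction(tpf, p_x))    # P(x|y_j)
print(bayes_prediction(tpf, p_x2))   # P'(x|y_j): same TPF, new prior
```

The same TPF serves both priors; this is the compatibility with changing sources that, as discussed below, BI lacks.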
For given X_j, the likelihood of θ_j is

P(X_j|θ_j) = Π_k P(x_k|θ_j),  x_k ∈ X_j.   (2)

We use θ_j instead of θ in the above equation because, unlike the model θ in BI, the model in LI does not have a probability distribution. If x_i appears N_ji times in X_j, with N_j = Σ_i N_ji, then P(x_i|y_j) = N_ji/N_j, and the log-likelihood can be expressed by a negative cross entropy:

log P(X_j|θ_j) = Σ_i N_ji log P(x_i|θ_j) = -N_j H(X|θ_j),
H(X|θ_j) = -Σ_i P(x_i|y_j) log P(x_i|θ_j).   (3)

For a conditional sample X_j whose distribution is P(x|j) (the label is uncertain), we can find the MLE:

θ_j* = argmax_θj Σ_i P(x_i|j) log P(x_i|θ_j) = argmin_θj H(X|θ_j).   (4)

When P(x|θ_j) = P(x|j), H(X|θ_j) reaches its minimum.
The main limitation of LI is that it cannot make use of prior knowledge, such as P(x), P(y), or P(θ), and does not fit cases where P(x) may change. BI brings the prior distribution P(θ) of θ into Bayes' Theorem II to obtain [2]

P(θ|X) = P(X|θ)P(θ)/P_θ(X),   (5)

where P_θ(X) is the normalizing constant related to θ. For one Bayesian posterior, we need n or more likelihood functions. The MLE becomes the MAP:

θ* = argmax_θ [log P(X|θ) + log P(θ)],   (6)

where P_θ(X) is neglected. It is easy to find that 1) if P(θ) is neglected or is an equiprobable distribution, the MAP is equivalent to the MLE; and 2) as the sample size N increases, the MAP gradually approaches the MLE. There is also the Bayesian posterior of Y:

P(Y|X, θ) = P(X|Y, θ)P(Y)/Σ_j P(X|y_j, θ)P(y_j).   (7)

It is different from P(θ|X) because P(Y|X, θ) is a distribution over the label space, whereas P(θ|X) is a distribution over the parameter space; the parameter space is larger than the label space. P(Y|X, θ) is easier to understand than P(θ|X) and is also often used, such as for mixture models and hypothesis testing.
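The two numbered observations (the MAP reduces to the MLE for a flat prior, and approaches it as the sample size grows) can be checked with a small numerical sketch; the Bernoulli model, the parameter grid, and the prior below are all hypothetical:

```python
import numpy as np

# Hypothetical Bernoulli model: theta = P(success), on a discrete grid.
thetas = np.linspace(0.05, 0.95, 19)
log_prior = -(thetas - 0.5) ** 2 / 0.02   # log of a prior peaked at 0.5

def map_and_mle(successes, n):
    """Return (MAP, MLE) for n Bernoulli trials with the given successes.
    The MAP adds log P(theta); the MLE uses the log-likelihood alone."""
    log_lik = successes * np.log(thetas) + (n - successes) * np.log(1 - thetas)
    return thetas[np.argmax(log_lik + log_prior)], thetas[np.argmax(log_lik)]

# With 80% successes observed, the MLE stays at 0.8 while the MAP
# moves from near the prior's peak toward the MLE as N grows.
for n in (10, 100, 10000):
    print(n, map_and_mle(int(0.8 * n), n))
```

On small samples the prior pulls the MAP toward 0.5; on large samples the likelihood term dominates and the two estimates coincide.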
BI has some advantages: 1) It considers the prior of Y or θ, so that P(Y) or P(θ) is useful when P(X) is unknown, especially for small samples. 2) It can convert the current posterior P(θ|X) into the next prior P(θ).
3) The distribution P(θ|X) over the θ space gradually concentrates as the sample size N increases. When N→∞, P(θ*|X)=1 only at the MAP estimate θ*. So, P(θ|X) can intuitively show learning results.
However, there are also some serious problems with BI.

1) About Bayesian prediction

BI predicts the posterior and prior distributions of x by [2]

P_θ(x|X) = Σ_θ P(x|θ)P(θ|X),  P_θ(x) = Σ_θ P(x|θ)P(θ).   (8)

From a huge sample D, we can directly obtain P(x|y_j) and P(x). However, BI cannot ensure that P_θ(x|X) = P(x|y_j) or P_θ(x) = P(x). After P(x) changes into P'(x), BI cannot obtain a posterior that is equal to P'(x|y_j) in Eq. (1). Hence, the Bayesian prediction is not compatible with the classical Bayes' prediction. Therefore, we need an inference tool that works like the TPF P(y_j|x) and is constructed with parameters.
2) About logical probability

BI does not use logical probability because logical probability is not normalized, whereas all probabilities BI uses are normalized. Consider the labels "Non-rain", "Rain", "Light rain", "Moderate rain", "Light to moderate rain", …, in a weather forecast: the sum of their logical probabilities is greater than 1. The conditional logical probability, or truth value, of a label, whose maximum is 1, is not normalized either. BI uses neither truth values nor truth functions and hence cannot solve the denotation (or semantic meaning) of a label. Fuzzy mathematics [18,19] uses membership functions, which can also serve as truth functions. Therefore, we need an inference method that can derive truth functions or membership functions from sampling distributions.

3) About prior knowledge
In BI, P(θ) is subjective. However, we often need objective prior knowledge. For example, to make a probability prediction about a disease according to a medical test result "positive" or "negative", we need to know the prior distribution P(x) [16]. To predict a car's real position according to a GPS indicator on a GPS map, we need to know the road conditions, which tell us P(x).

4) About optimization criterion
According to Popper's theory [20], a hypothesis with smaller logical probability can convey more information. Shannon's information theory [9] contains a similar conclusion. The MAP criterion is not well compatible with the information criterion.
The following example further explains why we need LBI.

Fig. 1. Solving the denotation of y_1 = "x is adult" and the probability prediction P'(x|y_1 is true).

Example 1. Given the age population's prior distribution P(x) and the posterior distribution P(x|"adult" is true), which are continuous (see Fig. 1), please answer:
1) How do we obtain the denotation (e.g., the truth function) of "adult"?
2) Can we make a new probability prediction or produce a new likelihood function with the denotation when P(x) is changed into P'(x)?
3) If the set {Adult} is fuzzy, can we obtain its membership function?

It is difficult to answer these questions using either LI or BI. Nevertheless, using LBI, we can easily obtain the denotation and make the new probability prediction.

3 Mathematical Basis: Three Kinds of Probabilities and Three Kinds of Bayes' Theorems
All probabilities [3] can be divided into three kinds: 1) Statistical probability: relative frequency or its limit of an event; 2) Logical Probability (LP): how frequently a hypothesis is judged true or how true a hypothesis is; 3) Predicted (or subjective) probability: possibility or likelihood.
We may treat the predicted probability as the hybrid of the former two kinds. Hence, there are only two kinds of basic probabilities: the statistical and the logical.
A hypothesis or label has both a statistical (or selected) probability and an LP, and they are very different. Consider two labels in a weather forecast: "Light rain" and "Light to heavy rain". The former has the larger selected probability but the smaller LP. The LP of a tautology, such as "He is old or not old", is 1, whereas its selected probability is close to 0.
Each existing probability system [3,5-7] contains only one kind of probability. We now define a probability system with both statistical probabilities and LPs.
Definition 2. A label y_j is also a predicate y_j(X) = "X ∈ A_j". For y_j, the universe U has a subset A_j, every x in which makes y_j true. Let P(Y=y_j) denote the statistical probability of y_j, and P(X∈A_j) denote the LP of y_j. For simplicity, let P(y_j) = P(Y=y_j) and T(A_j) = P(X∈A_j).
We call P(X∈A_j) the LP because, according to Tarski's theory of truth [21], P(X∈A_j) = P("X∈A_j" is true) = P(y_j is true). The conditional LP T(A_j|X) of y_j for given X is the feature function of A_j and the truth function of y_j. Hence the LP of y_j is

T(A_j) = Σ_i P(x_i) T(A_j|x_i).   (9)

According to Davidson's truth-conditional semantics [21], T(A_j|X) ascertains the semantic meaning of y_j. Note that statistical probability distributions, such as P(Y), P(Y|x_i), P(X), and P(X|y_j), are normalized, whereas LP distributions are not. In general, T(A_1)+T(A_2)+…+T(A_n) > 1 and T(A_1|x_i)+T(A_2|x_i)+…+T(A_n|x_i) > 1.
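As an illustration of these definitions, the LP T(A_j) = Σ_i P(x_i)T(A_j|x_i) and the non-normalization of LPs can be checked numerically; the rain labels and the prior below are hypothetical:

```python
import numpy as np

# Hypothetical daily rain amounts (mm) and a prior P(x) over them.
x_values = np.array([0.0, 2.0, 10.0, 30.0])
p_x = np.array([0.5, 0.2, 0.2, 0.1])

# Truth functions T(A_j|x): feature functions of overlapping crisp sets.
truth = {
    "rain":                   np.array([0, 1, 1, 1]),
    "light rain":             np.array([0, 1, 0, 0]),
    "light to moderate rain": np.array([0, 1, 1, 0]),
}

# LP of each label: T(A_j) = sum_i P(x_i) T(A_j|x_i).
lp = {label: float(p_x @ t) for label, t in truth.items()}
print(lp)
print("sum of LPs =", sum(lp.values()))  # exceeds 1: LPs are not normalized
```

Because the sets overlap, the LPs sum to more than 1, which is exactly why BI, whose probabilities are all normalized, cannot accommodate them.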
If A_j is fuzzy, T(A_j|X) becomes the membership function, and T(A_j) becomes the fuzzy event probability defined by Zadeh [19]. For fuzzy sets, we use θ_j to replace A_j; then T(θ_j|X) becomes the membership function of θ_j. That means

T(θ_j) = Σ_i P(x_i) T(θ_j|x_i).   (10)

We can also treat θ_j as a sub-model of a predictive model θ. In this paper, the likelihood function P(X|θ_j) is equal to P(X|y_j; θ) in the popular likelihood method. T(θ_j|X) is different from P(θ|X) and is longitudinally normalized, i.e.,

max(T(θ_j|X)) = max(T(θ_j|x_1), T(θ_j|x_2), …, T(θ_j|x_m)) = 1.   (11)

There are three kinds of Bayes' Theorems, which are used by Bayes [23], Shannon [9], and the author, respectively.
Bayes' Theorem I (used by Bayes): for two sets A and B on the same universe,

T(B|A) = T(A|B)T(B)/T(A).   (12)

There is also a symmetrical formula for T(A|B). Note that there is only one random variable X and two logical probabilities T(A) and T(B).
Bayes' Theorem II (used by Shannon):

P(x_i|y_j) = P(y_j|x_i)P(x_i)/P(y_j),  P(y_j) = Σ_i P(x_i)P(y_j|x_i).   (13)

There is also a symmetrical formula for P(y_j|X) or P(Y|x_i). Note that there are two random variables and two statistical probabilities.
Bayes' Theorem III (used by the author):

P(X|θ_j) = T(θ_j|X)P(X)/T(θ_j),  T(θ_j) = Σ_i P(x_i)T(θ_j|x_i);   (14)

T(θ_j|X) = P(X|θ_j)T(θ_j)/P(X),  T(θ_j) = 1/max(P(X|θ_j)/P(X)).   (15)

The two formulas are asymmetrical because there is a statistical probability and a logical probability. T(θ_j) in Eq. (15) may be called the longitudinally normalizing constant.
The proof of Bayes' Theorem III: the joint probability P(X, θ_j) = P(X=x, X∈θ_j); hence P(X|θ_j)T(θ_j) = P(X=x, X∈θ_j) = T(θ_j|X)P(X), and therefore

P(X|θ_j) = T(θ_j|X)P(X)/T(θ_j).   (16)

Using this formula, we can answer the questions of Example 1 in Section 2 to obtain the denotation of "adult" and the posterior distribution P'(x|y_1 is true), as shown in Fig. 1.
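A minimal numerical sketch of Bayes' Theorem III applied to Example 1; the age range, the flat and exponential priors, and the logistic truth function for "adult" are all hypothetical modeling assumptions, not values from the paper:

```python
import numpy as np

ages = np.arange(0, 81)                              # instance space U
p_x = np.full(len(ages), 1.0 / len(ages))            # a flat prior P(x)
t_adult = 1.0 / (1.0 + np.exp(-(ages - 18.0)))       # T(theta_1|x) for "adult"

def likelihood_from_truth(truth, prior):
    """Bayes' Theorem III, Eq. (14): P(x|theta_j) = T(theta_j|x)P(x)/T(theta_j),
    with T(theta_j) = sum_i P(x_i) T(theta_j|x_i)."""
    lp = (prior * truth).sum()
    return prior * truth / lp

p_post = likelihood_from_truth(t_adult, p_x)         # P(x | "adult" is true)

# The same truth function still works after the prior changes to P'(x):
p_x2 = np.exp(-ages / 30.0)
p_x2 /= p_x2.sum()
p_post2 = likelihood_from_truth(t_adult, p_x2)       # P'(x | "adult" is true)
```

The denotation T(θ_1|X) is learned once; the prediction adapts to any new P'(x), which is exactly what Example 1 asks for.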

4 Logical Bayesian Inference (LBI)
LBI has three tasks:
1) To derive truth functions or a semantic channel from D or sampling distributions (e.g., multi-label learning [24,25]);
2) To select hypotheses or labels to convey information for given x or P(X|j) according to the semantic channel (e.g., multi-label classification);
3) To make the logical Bayes' prediction P(X|θ_j) according to T(θ_j|X) and P(X) or P'(X).
The third task is simple: we can use Eq. (14) for it.
For the first task, we first consider continuous sampling distributions, from which we can obtain the Shannon channel P(Y|X). The TPF P(y_j|X) has an important property: multiplying P(y_j|X) by a constant k makes the same probability prediction, because

P(X)kP(y_j|X)/Σ_i P(x_i)kP(y_j|x_i) = P(X)P(y_j|X)/P(y_j) = P(X|y_j).   (17)

A semantic channel T(θ|X) consists of a set of truth functions T(θ_j|X), j=1, 2, …, n. According to Eq. (17), if T(θ_j|X) ∝ P(y_j|X), then P(X|θ_j) = P(X|y_j). Hence the optimized truth function is

T*(θ_j|X) = P(y_j|X)/max(P(y_j|X)).   (18)

We can prove that the truth function derived from Eq. (18) is the same as that from Wang's random sets falling shadow theory [26]. According to Bayes' Theorem II, from Eq. (18) we obtain

T*(θ_j|X) = [P(X|y_j)/P(X)]/max[P(X|y_j)/P(X)].   (19)

Eq. (19) is more useful in general because it is often hard to find P(y_j|X) or P(y_j) for Eq. (18). Eqs. (18) and (19) fit cases with large samples. When samples are not large enough, we need to construct truth functions with parameters and optimize them. The semantic information conveyed by y_j about x_i is defined with the log-normalized-likelihood [12,14]:

I(x_i; θ_j) = log[P(x_i|θ_j)/P(x_i)] = log[T(θ_j|x_i)/T(θ_j)].   (20)

For an unbiased estimation y_j, its truth function may be expressed by a Gaussian distribution without the coefficient: T(θ_j|X) = exp[-(X-x_j)²/(2d²)]. Hence

I(x_i; θ_j) = log[1/T(θ_j)] - (x_i - x_j)²/(2d²).   (21)

The log[1/T(θ_j)] is the Bar-Hillel-Carnap semantic information measure [27]. Eq. (21) shows that the larger the deviation is, the less information there is; the smaller the LP is, the more information there is; and a wrong estimation may convey negative information. These conclusions accord with Popper's thought [20].
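Eqs. (18) and (19) can be verified numerically: both routes yield the same optimized truth function. The four-point instance space and the TPF below are hypothetical:

```python
import numpy as np

p_x = np.array([0.4, 0.3, 0.2, 0.1])            # prior P(x)
tpf = np.array([0.05, 0.2, 0.6, 0.8])           # TPF P(y_j|x)

# Eq. (18): T*(theta_j|x) = P(y_j|x) / max P(y_j|x).
t_star_18 = tpf / tpf.max()

# Eq. (19): the same function from P(x|y_j) and P(x) via Bayes' Theorem II.
p_yj = (p_x * tpf).sum()                        # P(y_j)
p_x_given_yj = p_x * tpf / p_yj                 # P(x|y_j)
ratio = p_x_given_yj / p_x                      # P(x|y_j)/P(x)
t_star_19 = ratio / ratio.max()

print(t_star_18)
print(t_star_19)                                # identical to t_star_18
```

Since P(x|y_j)/P(x) = P(y_j|x)/P(y_j) is proportional to the TPF, the two normalizations agree, as the constant-k property in Eq. (17) predicts.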
Averaging I(x_i; θ_j), we obtain the generalized KL information

I(X; θ_j) = Σ_i P(x_i|y_j) log[T(θ_j|x_i)/T(θ_j)],   (22)

where P(x_i|y_j) (i=1, 2, …) is the sampling distribution, which may be unsmooth or discontinuous. With the Gaussian truth function above, I(X; θ_j) = H(θ) - H(θ|X), where H(θ) = log[1/T(θ_j)] and H(θ|X) = Σ_i P(x_i|y_j)(x_i - x_j)²/(2d²). Clearly, the MSI criterion is like the RLS criterion: H(θ|X) is like the mean squared error, and H(θ) is like the negative regularization term. The relationship between the log normalized likelihood and the generalized KL information is

(1/N_j) log[P(X_j|θ_j)/P(X_j)] = Σ_i P(x_i|y_j) log[P(x_i|θ_j)/P(x_i)] = I(X; θ_j).   (23)

The MSI criterion is equivalent to the ML criterion because P(X) does not change while we optimize θ_j.
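The generalized KL (semantic) information above can be computed directly; the truth function and the two sampling distributions below are hypothetical, chosen to show that an aptly used label yields positive semantic information while a misused one yields negative information:

```python
import numpy as np

def semantic_information(p_x_given_yj, truth, p_x):
    """Generalized KL (semantic) information I(X; theta_j):
    sum_i P(x_i|y_j) * log2[ T(theta_j|x_i) / T(theta_j) ],
    with the LP T(theta_j) = sum_i P(x_i) T(theta_j|x_i)."""
    lp = (p_x * truth).sum()
    return float((p_x_given_yj * np.log2(truth / lp)).sum())

p_x   = np.array([0.25, 0.25, 0.25, 0.25])   # prior P(x)
truth = np.array([0.01, 0.10, 0.80, 1.00])   # T(theta_j|x)
good  = np.array([0.00, 0.05, 0.45, 0.50])   # sample where y_j is apt
bad   = np.array([0.70, 0.20, 0.05, 0.05])   # sample where y_j is misused

print(semantic_information(good, truth, p_x))   # positive
print(semantic_information(bad, truth, p_x))    # negative
```

This matches the conclusion drawn from Eq. (21): a wrong estimation may convey negative information.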
For the second task of LBI, given x_i, we select a hypothesis or label by the classifier

y_j* = f(x_i) = argmax_j I(x_i; θ_j) = argmax_j log[T(θ_j|x_i)/T(θ_j)].   (24)

This classifier produces a noiseless Shannon channel. Using T(θ_j), we can overcome the class-imbalance problem [24]. If T(θ_j|x) ∈ {0, 1}, the classifier becomes

y_j* = f(x_i) = argmax_{j: T(θ_j|x_i)=1} log[1/T(θ_j)] = argmin_{j: T(θ_j|x_i)=1} T(θ_j).   (25)

This means that we should select a label with the smallest LP and hence with the richest connotation. The above method of multi-label learning and classification is like the Binary Relevance (BR) method [25]. However, it does not demand too much of samples and fits cases where P(X) changes (see [15] for details).
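The classifier for the second task can be sketched in a few lines; the truth values and LPs of the three hypothetical nested labels are illustrative only:

```python
import numpy as np

# Hypothetical truth values T(theta_j|x_i) at one instance x_i, and the
# logical probabilities T(theta_j), for three labels:
# 0: "rain"   1: "light to moderate rain"   2: "light rain".
truth_at_xi = np.array([1.0, 1.0, 0.1])
lp = np.array([0.5, 0.4, 0.2])

# Classifier: j* = argmax_j log[T(theta_j|x_i)/T(theta_j)].
scores = np.log(truth_at_xi / lp)
j_star = int(np.argmax(scores))
print(j_star)  # -> 1: among labels true at x_i, the one with the least LP
```

"Light rain" has the smallest LP but is nearly false at this x_i, so the classifier picks "light to moderate rain": the truest label with the richest connotation.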

5 Logical Bayesian Inference for Machine Learning
In Section 4, we have introduced the main method of using LBI for multi-label learning and classification. From LBI, we can also obtain an iterative algorithm, the Channels' Matching (CM) algorithm, for the MLE [16] and mixture models [17].
Step I (the semantic channel matches the Shannon channel): for a given classifier Y=f(Z), we obtain P(Y|X), T(θ|X), and the conditional semantic information for given Z:

I(X; θ_j|z) = Σ_i P(x_i|z) log[T(θ_j|x_i)/T(θ_j)].   (26)

Step II (the Shannon channel matches the semantic channel): the classifier becomes

y_j* = f(z) = argmax_j I(X; θ_j|z).   (27)

Repeating the above two steps, we can achieve the MSI and ML classification. The convergence can be proved with the help of the R(G) function [16].
For mixture models, the aim of Step II is to minimize the Shannon mutual information R minus the semantic mutual information G [17]. The convergence of the CM algorithm for mixture models is more reliable than that of the EM algorithm.
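The two matching steps structurally resemble the E- and M-steps of the EM algorithm. The following minimal sketch exploits that resemblance with standard weighted-MLE updates on a hypothetical two-Gaussian sample; it illustrates the alternating channel-matching structure only, not the exact CM updates of [17]:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical data: two 1-D Gaussian clusters with means -2 and 3.
x = np.concatenate([rng.normal(-2, 1, 500), rng.normal(3, 1, 500)])

mu, sd, p_y = np.array([-1.0, 1.0]), np.array([1.0, 1.0]), np.array([0.5, 0.5])

def gauss(x, m, s):
    """Component densities P(x|theta_j) for every x (shape: len(x) x 2)."""
    return np.exp(-(x[:, None] - m) ** 2 / (2 * s ** 2)) / (np.sqrt(2 * np.pi) * s)

for _ in range(50):
    # One step: P(y_j|x) from the current components and P(Y)
    # (the Shannon channel derived from the current semantic model).
    w = gauss(x, mu, sd) * p_y
    w /= w.sum(axis=1, keepdims=True)
    # Other step: refit each component to its weighted sample (weighted MLE).
    n = w.sum(axis=0)
    mu = (w * x[:, None]).sum(axis=0) / n
    sd = np.sqrt((w * (x[:, None] - mu) ** 2).sum(axis=0) / n)
    p_y = n / n.sum()

print(mu)  # close to the true means (-2, 3)
```

The alternation of a channel-update step and a model-update step is the shared skeleton; the CM algorithm replaces the update objective with R - G minimization.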

6 Confirmation Measure b* for Induction
Early logical Bayesians [4-7] were also inductivists who used the conditional LP or truth function to indicate the degree of inductive support. However, contemporary inductivists use a confirmation measure, or a degree of belief between -1 and 1, for induction [28,29]. By LBI, we can derive a new confirmation measure b* ∈ [-1, 1]. We now use the medical test as an example to introduce b*. Let x_1 be a person with a disease, x_0 a person without the disease, y_1 the test-positive, and y_0 the test-negative. The label y_1 also conveys a universal hypothesis: "For all people, if one's test result is positive, then he/she has the disease." According to Eq. (18), the truth value of the proposition y_1(x_0) (x_0 is the counterexample of y_1) is

b'* = T*(θ_1|x_0) = P(y_1|x_0)/P(y_1|x_1).   (28)

With the confidence level CL of the universal hypothesis, the confirmation measure is

b* = 1 - CL'/CL, if CL ≥ 0.5;  b* = CL/CL' - 1, if CL < 0.5,   (29)
where CL' = 1 - CL. If the evidence or sample fully supports a hypothesis, then CL=1 and b*=1. If the evidence is irrelevant to a hypothesis, CL=0.5 and b*=0. If the evidence fully supports the negative hypothesis, CL=0 and b*=-1. The measure b* can indicate the degree of inductive support better than CL because inductive support may be negative. For example, the confirmation measure of "All ravens are white" should be negative.
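The piecewise formula for b* can be implemented directly; this sketch takes the confidence level CL as given:

```python
def b_star(cl):
    """Confirmation measure b* from the confidence level CL:
    b* = 1 - CL'/CL if CL >= 0.5, else CL/CL' - 1, with CL' = 1 - CL."""
    cl_neg = 1.0 - cl
    return 1.0 - cl_neg / cl if cl >= 0.5 else cl / cl_neg - 1.0

for cl in (1.0, 0.5, 0.0):
    print(cl, b_star(cl))   # 1.0 -> 1.0, 0.5 -> 0.0, 0.0 -> -1.0
```

The three printed cases reproduce the full-support, irrelevant-evidence, and full-refutation values stated in the text, and b* increases monotonically with CL.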
If |U| > 2 and P(y_j|X) is a distribution over U, we may use a confidence interval to convert a predicate into a universal hypothesis and then calculate its confidence level and confirmation measure.
BI provides the credible level for a given credible interval [3] of a parameter distribution rather than of a hypothesis. The credible level, like the Bayesian posterior, does not well indicate the degree of inductive support of a hypothesis. In comparison with BI, LBI should be a better tool for induction.

7 Summary
This paper proposes Logical Bayesian Inference (LBI), which uses the truth function as the inference tool, like the logical Bayesians, and uses the likelihood method, like the frequentists. LBI also uses frequencies to explain logical probabilities and truth functions, and hence is a combination of extreme frequentism and extreme Bayesianism. The truth function that LBI uses can indicate the semantic meaning of a hypothesis or label and can be used for probability predictions that are compatible with the classical Bayes' prediction. LBI is based on the third kind of Bayes' Theorem and the semantic information method. Together they form a new mathematical frame for semantic communication, machine learning, and induction. This new frame may support and improve many existing methods, such as the likelihood method and fuzzy mathematics, rather than replace them. As a new theory, it must be imperfect. The author welcomes researchers to criticize or improve it.