Learning Bivariate Functional Causal Models

Finding the causal direction in the cause-effect pair problem has been addressed in the literature by comparing two alternative generative models, X → Y and Y → X. In this chapter, we first define what is meant by generative modeling and what are the main assumptions usually invoked in the literature in this bivariate setting. Then we present the theoretical identifiability problem that arises when considering a causal graph with only two variables. This will lead us to present the general ideas used in the literature to perform a model selection based on the evaluation of a complexity/fit trade-off. Three main families of methods can be identified: methods making restrictive assumptions on the class of admissible causal mechanisms, methods computing a smooth trade-off between fit and complexity, and methods exploiting the independence between cause and mechanism.


Introduction
A natural approach to address the cause-effect pair problem is from a reverse engineering perspective: given the available measurements {(x_i, y_i)}_{i=1}^n of the two variables X and Y, the task is to discover the underlying causal process that led the variables to take the values they have. To find this model, one implicitly needs to answer two questions: 1. First, is there a causal relationship between X and Y, and what is the causal direction? Was X generated first and then Y generated from X, or the opposite? 2. Second, what is the causal mechanism that can explain the behavior of the system? How was Y generated from X, or X generated from Y? Therefore this approach for causal discovery goes beyond finding the causal structure, as it also requires defining a generative model of the data, which does not seem mandatory at first sight if one is only interested in finding whether X → Y or Y → X.
This type of generative model has notably been formalized with the framework of Functional Causal Models (FCMs) [21], also known as Structural Equation Models (SEMs), which can well represent the underlying data-generating process, support interventions and allow counterfactual reasoning.
As a first toy example, consider data generated according to the following model (Eq. (3.3)):

X := N_X, with N_X ∼ U(−1, 1),
Y := X² + N_Y, with N_Y ∼ U(−1, 1)/3,

where U(−1, 1) denotes the uniform distribution between −1 and 1. Here we use the symbol ":=" when writing the model to signify that it has to be seen as an assignment from cause to effect.
The correlation coefficient is equal to zero, but there is a nonlinear dependency between X and Y. Moreover, if one assumes that this dependency is due to the influence of one of the variables on the other and not due to a third variable influencing both, one of the two following causal hypotheses holds, as stated in Chap. 1 of this book: • Y := f_Y(X, N_Y) with N_Y ⊥⊥ X (hypothesis 1, X → Y), • X := f_X(Y, N_X) with N_X ⊥⊥ Y (hypothesis 2, Y → X), where the noise variable N_Y (respectively N_X) summarizes all the other unobserved influences on Y (respectively on X).
In order to recover the causal mechanism, one can propose to search in the class of polynomial functions of degree 4 by fitting regression models ŷ = f̂_Y(x) and x̂ = f̂_X(y) on the data with a mean squared error loss (Fig. 3.2).
The expected mean squared error is lower for the model f̂_Y from X to Y than for the other model f̂_X from Y to X. The residuals of each polynomial regression are displayed on Fig. 3.3.
To some extent, as the residual Y − f̂_Y(X) is independent of X, one can build an explanatory model of the data as Y := f̂_Y(X) + N_Y, with f̂_Y quadratic, N_Y approximately independent of X and N_Y ∼ U(−1, 1)/3. In this case the underlying data-generating process corresponding to the true model (see Eq. (3.3)) is recovered. The causal direction X → Y is identified and one has also built a simulator close to the true mechanism (up to small parameter adjustments) that can be used to simulate the effect of interventions on the system. Indeed, with this functional model, one can now compute Y^{do(X=x)} = f̂_Y(x) + N_Y, with N_Y ∼ U(−1, 1)/3. It should give results on average similar to the true model of Eq. (3.3).

In the opposite direction, as the residual is not independent of Y, one cannot build any model of the form X := f̂_X(Y) + N_X with N_X ⊥⊥ Y. However, we will see later in this chapter that it is possible to prove that a model X := f̂_X(Y, N_X) with N_X ⊥⊥ Y exists in this case, but its expression may be more complex, as the noise term is not additive and the mechanism f̂_X is not continuous on its definition domain. Therefore, from an explanatory perspective, the additive noise model Y := f_Y(X) + N_Y (hypothesis 1) offers a simpler explanation of the phenomenon than the other explanation X := f̂_X(Y, N_X) (hypothesis 2) with a non-additive noise. For this reason, it seems intuitive to "prefer" hypothesis 1 to hypothesis 2. It is not a formal proof of causality, as it comes from an Occam's razor principle that favors a simple explanation, with the prior assumption that an additive noise model is simpler than a non-additive noise model [12]. Another notion of simplicity could be preferred, such as multiplicative noise Y := f_Y(X) × N_Y, as explained later in this chapter.
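The regression-and-residual check on the toy example can be sketched in a few lines of NumPy. This is a minimal illustration, not the chapter's exact protocol: the sample size and the use of the squared-residual/input correlation as a crude dependence proxy are our own choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 3000

# Toy model: X ~ U(-1, 1), Y := X^2 + N_Y with N_Y ~ U(-1, 1)/3
x = rng.uniform(-1, 1, n)
y = x**2 + rng.uniform(-1, 1, n) / 3

def poly_fit_residuals(u, v, degree=4):
    """Fit v = p(u) by least squares; return residuals and their MSE."""
    coeffs = np.polyfit(u, v, degree)
    resid = v - np.polyval(coeffs, u)
    return resid, np.mean(resid**2)

# Hypothesis 1 (X -> Y): regress Y on X
res_fwd, mse_fwd = poly_fit_residuals(x, y)
# Hypothesis 2 (Y -> X): regress X on Y
res_bwd, mse_bwd = poly_fit_residuals(y, x)

# Crude dependence proxy: correlation of the squared residual with the
# regressor (picks up the variance of the residual changing with the input).
dep_fwd = abs(np.corrcoef(res_fwd**2, x)[0, 1])
dep_bwd = abs(np.corrcoef(res_bwd**2, y)[0, 1])

print(mse_fwd, mse_bwd)  # the forward fit is much tighter
print(dep_fwd, dep_bwd)  # the forward residual looks independent of X
```

In the causal direction the residual is (approximately) independent of the input, while in the anticausal direction its variance grows with Y, which the squared-residual correlation detects.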
Given the available measurements, the goal is to recover an explanatory model of the data by using statistical tools to test causal hypotheses. However, even if in this toy example the preferred explanatory model also corresponds to the model with the best predictive power, one has to keep in mind that this is not always the case.

Real Example: To Explain or to Predict?
Let us now introduce a real example well known in the cause-effect pair community: the first real cause-effect pair of the Tübingen database. Figure 3.4 displays collected data on altitude (X-axis) and temperature (Y-axis) in the atmosphere from 349 meteorological stations in Germany over the years 1961-1990, where T denotes the temperature and Z the corresponding altitude. We assume for this simple example that the observed dependency between T and Z is only due to a direct causal relation from one of the two variables to the other. We will see later in this chapter that this is not actually obvious and the model may be more complex, as at least one hidden variable, the latitude, may also have an impact on both variables.
In order to test causal hypotheses, the data are re-scaled to zero mean and unit variance, the dataset is split a large number of times between train (80%) and test (20%) sets, and two alternative nonlinear Gaussian process regression models are learned on the train set. When regressing the temperature on the altitude one obtains the model t̂ = f̂_t(z) (Fig. 3.5-left), and when regressing the altitude on the temperature one obtains the model ẑ = f̂_z(t) (Fig. 3.6-left).
Interestingly enough, the expected mean squared errors on the train and test sets, averaged over 100 runs, are lower for the false causal orientation T → Z than for the true causal orientation Z → T. Therefore the overall predictive accuracy measured in terms of mean squared error is better, even if the model ẑ = f̂_z(t) does not seem to accurately reproduce the data-generating process, notably in the non-reversible part of the relation (circled area on Fig. 3.6-left).
The best predictive model does not necessarily correspond to the true causal orientation [29]. This comes from the fact that minimizing a predictive score such as an expected mean squared error does not necessarily give an explanation of the data-generating process, and may then lead to misunderstandings when it comes to causal interpretation.

(Fig. 3.6. Left: real pairs of points (altitude, temperature) in blue and regression model ẑ = f̂_z(t) in red; the circled area corresponds to the non-reversible part of the relation (Z, T). Right: residual of the nonlinear regression of the altitude on the temperature (red). To some extent this residual is not independent of the temperature T; therefore, the causal hypothesis Z := f̂_Z(T) + N_Z with N_Z ⊥⊥ T does not hold.)

If one uses the same nonlinear regression model, but rather than looking at the mean squared error on the test set one looks at the residuals on the test set, defined respectively as n_t(z) = t(z) − f̂_t(z) and n_z(t) = z(t) − f̂_z(t), one can see that a causal footprint may be detected. Indeed, the residual N_T is almost independent of Z (Fig. 3.5-right), while the residual N_Z is not independent of T (Fig. 3.6-right).
This confirms, to some extent, that the causal hypothesis T := f̂_T(Z) + N_T with N_T ⊥⊥ Z holds, while the causal hypothesis Z := f̂_Z(T) + N_Z with N_Z ⊥⊥ T does not hold. Therefore, from an explanatory perspective, using the same conclusion as in the previous example, the additive noise model T := f̂_T(Z) + N_T (model 1) is preferred to the other explanation Z := f̂_Z(T, N_Z) (model 2) with non-additive noise.
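The train/test protocol above can be sketched on synthetic altitude/temperature-like data. This is a rough illustration under stated assumptions: we substitute a simple Nadaraya-Watson kernel smoother for the chapter's Gaussian process regression, and a hypothetical saturating mechanism for the real Tübingen pair.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000

# Hypothetical stand-in for the (altitude, temperature) pair:
# a saturating mechanism with additive noise (not the real data).
z = rng.uniform(0, 3000, n)                              # altitude
t = 15 - 12 * np.tanh(z / 800) + rng.normal(0, 0.5, n)   # temperature

# Re-scale to zero mean and unit variance, split train/test 80/20
z = (z - z.mean()) / z.std()
t = (t - t.mean()) / t.std()
idx = rng.permutation(n)
tr, te = idx[:3200], idx[3200:]

def kernel_regression(u_tr, v_tr, u_q, bw=0.1):
    """Nadaraya-Watson smoother, used here in place of a GP regressor."""
    w = np.exp(-0.5 * ((u_q[:, None] - u_tr[None, :]) / bw) ** 2)
    return (w @ v_tr) / w.sum(axis=1)

# Residuals on the test set for both candidate orientations
res_t = t[te] - kernel_regression(z[tr], t[tr], z[te])   # Z -> T
res_z = z[te] - kernel_regression(t[tr], z[tr], t[te])   # T -> Z

# Crude dependence proxy: correlation of squared residual with the input
dep_zt = abs(np.corrcoef(res_t**2, z[te])[0, 1])
dep_tz = abs(np.corrcoef(res_z**2, t[te])[0, 1])
print(dep_zt, dep_tz)  # the causal direction leaves a ~independent residual
```

In the flat, non-reversible part of the mechanism the backward residual variance changes strongly with T, reproducing the causal footprint discussed above.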
From a probabilistic point of view, the model T := f̂_T(Z) + N_T corresponds to the stochastic model P_{T|Z} that accounts for uncertainty about the mechanism involved. Indeed, for a given altitude several values of temperature are possible, because the temperature may depend on other unobserved factors such as the latitude, the type of vegetation, the type of soil, the degree of humidity in the air, etc. In order to characterize a full generative model, one may consider that the altitude depends on other unknown variables such as the variations in terrain elevation: Z := f̂_Z(N_Z), where N_Z models latent source causes. This gives a distribution P_Z, which when combined with P_{T|Z} gives a full generative model P_Z P_{T|Z} = P_{Z,T} that can be used as a simulator to draw samples (z, t) in the region.
This explanatory model may be given a causal interpretation, formally defined with the do-notation [22]. The altitude Z is said to be a cause of the temperature T if P_T^{do(Z=z)} ≠ P_T^{do(Z=z′)} for some z, z′ [20]. An intervention, denoted as do(Z = z), forces the variable altitude Z to take the value z, while the rest of the system remains unchanged. Concretely, this mathematical formulation can be translated into: "all other things being equal, modifying the altitude (climbing a mountain) has an impact on the temperature (it decreases)". However, P_Z^{do(T=t)} = P_Z, as modifying the temperature (heating the air) does not increase the altitude. Nevertheless this causal implication would not be true for hot air balloons! Indeed, in causality, random variables cannot be isolated from their context, because they are intimately related to an underlying specific system.
Moreover, this functional causal model could also be used to derive counterfactual statements [22]. Indeed, for any specific meteorological station with a pair of data points altitude and temperature (z_i, t_i), if one knows f_T, one can calculate the value n_T^i such that t_i = f_T(z_i) + n_T^i. Therefore, for any specific station, one can answer the question "what would have been the temperature t'_i in this meteorological station if the altitude had changed from z_i to z'_i?" by using the mathematical expression t'_i = f_T(z'_i) + n_T^i. This counterfactual reasoning on specific individuals would not be possible when having only a model P_{T|Z} at the population level and not the underlying functional causal model, including both the causal orientation (Z → T) and the mechanism f_T.
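The counterfactual computation is short enough to spell out. The mechanism below is a hypothetical lapse-rate model chosen for illustration, not the f̂_T fitted in the chapter.

```python
# Hypothetical mechanism for illustration: temperature drops roughly
# 6.5 degC per 1000 m of altitude (standard atmospheric lapse rate).
def f_T(z):
    return 15.0 - 6.5e-3 * z

# One observed station: altitude 500 m, temperature 10.2 degC
z_i, t_i = 500.0, 10.2

# Abduction: recover the station-specific noise term n_T^i
n_i = t_i - f_T(z_i)          # here: 10.2 - 11.75 = -1.55

# Action + prediction: counterfactual temperature had altitude been 1500 m
t_cf = f_T(1500.0) + n_i
print(round(t_cf, 2))
```

The noise term n_i carries everything specific to that station (exposure, humidity, ...), which is why the answer differs from simply evaluating f_T at the new altitude.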

Comparing Alternative Data Generating Models
As shown with this real introductory example, finding the causal direction, when assuming that there are no confounding effects, consists in comparing two alternative data generating models and deciding whether the causal process Z → T is more natural or simpler than the backward process T → Z. Intuitively, we can think of it as a rudimentary physical model that generates the temperature (effect) from the altitude (cause), which provides a better explanation in some way (more natural or simpler) than generating the altitude from the temperature.
In order to compare these alternative models, an Occam's razor principle is always invoked in one way or another in the literature. Generally speaking, an Occam's razor principle can be seen as a general heuristic used in science to guide the modeler to find the simplest explanation when testing different causal hypotheses on the data. In order to apply this principle, two things must be defined: what do we mean by simplest explanation, which refers to the notion of the complexity of a model? And what do we mean by testing a causal hypothesis on the data, which refers to the notion of the fit of a model? These two notions of complexity and model fit have been formalized in different ways in the literature. We will detail them in this chapter.
Furthermore, one always has to keep in mind that this heuristic choice is not an irrefutable principle. It is impossible in the cause-effect pair problem to formally prove from purely observational data that an explanatory causal model is true. It is easy to find examples where Occam's razor principle favors the wrong theory given the available data. Indeed, in the introductory example with altitude and temperature, the true causal mechanism could have been Z := f_Z(T, N_Z) (as shown later in this chapter, one can always exhibit such a mechanism f_Z and variable N_Z with N_Z ⊥⊥ T), and a conclusion based on the idea that T := f_T(Z) + N_T is simpler would have led to a false conclusion.
However, this Occam's razor principle has been implemented in the literature with good empirical success on artificial and real data [20]. Looking at the overall picture, we can distinguish three types of methods implementing this principle in different ways. The first class of methods uses a fixed complexity of models and chooses the causal direction corresponding to the model that best fits the data. A second type of methods evaluates a weighted aggregation of two criteria: the complexity and the fit of the model. The last approaches exhibit two candidate models that are assumed to perfectly correspond to the data-generating process and compare their complexities.
In Sect. 3.2, we introduce the bivariate problem setting with the usual assumptions invoked. In Sect. 3.3 we discuss the specific problem of identifiability that appears in this setting. Section 3.4 is devoted to the general methods developed in the literature to tackle this identifiability issue; it will allow us to define a typology of the cause-effect inference methods, which we present in more detail in Sect. 3.5 with their practical implementations. In Sect. 3.6 we propose a benchmark of the various methods presented in this chapter on artificial and real data. Section 3.7 is a discussion of open problems and extensions for the cause-effect pair setting. Section 3.8 concludes.

Problem Setting
In this section, we present the cause-effect pair problem from the generative approach perspective. We first introduce the notations used in the chapter as well as the main assumptions usually involved. Then we present the general bivariate structural model.

Notations and Assumptions
X and Y are two one-dimensional random variables in R with joint distribution P_{X,Y}.

Independent and Identically Distributed Samples
The given observations D = {(x_i, y_i)}_{i=1}^n of the random variables X and Y are drawn independently and identically distributed from P_{X,Y}.

Time
The time at which the observed data have been collected is not available. It is therefore impossible to exploit Granger causality tests for time series, which rely on the principle that if X → Y, then the predictions of the value of Y based on its own past values and on the past values of X are better than predictions of Y based only on its own past values [8]. In the approaches presented in this chapter, time is not explicitly modeled, even though it is assumed that causes precede their effects.

Faithfulness Assumption
This is the classical faithfulness assumption used in graphical causal inference, transposed to two variables: "if there is a causal relation between X and Y, the two variables are not independent".
Pathological cases could arise, for example, as depicted on Fig. 3.7. The altitude has a negative effect on the temperature, but the altitude could also have a negative impact on the degree of humidity in the air, which could itself have a negative effect on the temperature (as strange as it may seem, this is an illustrative case). In this scenario the altitude would directly inhibit the temperature and indirectly increase it, which could result in a statistical independence between altitude and temperature by coincidence, even though it is well known that altitude causes temperature.
In this chapter we will consider only cause-effect pair problems where the two variables X and Y are statistically dependent. When detecting a dependency between X and Y, six main cases may arise:
1. X causes Y.
2. Y causes X.
3. Selection bias: there is an unobserved common effect W of both X and Y on which X and Y were conditioned during the data acquisition. This selection bias creates a spurious dependency between X and Y.
4. Confounder: X and Y are both common consequences of a same third variable X ← Z → Y.
5. Feedback loop: X is a cause of Y and Y is a cause of X.
6. Constraint relation: X and Y are linked together, but there is no causal relationship between them.
Let us note that multiple combinations, such as cases 1 and 4, may arise at the same time: X causes Y and both variables are also caused by an unobserved variable Z.

Selection Bias
A selection bias corresponds to an unobserved variable on which the two variables X and Y were implicitly conditioned: graph X → W ← Y (Fig. 3.8-(3)). In our introductory example, we may imagine for instance that the variable temperature is in fact independent of the variable altitude. However, the temperature has an impact on the number of dwarf mountain pines present in the region, because this type of tree grows only in cold regions. The altitude also has an impact on this type of vegetation, as this type of tree grows only in mountains. Then, if the weather stations were only built in areas with this type of vegetation (as strange as it may seem), an artificial dependency link would be created between altitude and temperature when collecting the data (Fig. 3.9).
Throughout this chapter, it is assumed that the sample {(x_i, y_i)}_{i=1}^n, corresponding to the variables X and Y, was collected without selection bias.

Causal Sufficiency Assumption
Under this assumption, X and Y are not common consequences of the same hidden variables (the case corresponding to X ← Z → Y (Fig. 3.8-(4)) is excluded). This causal sufficiency assumption is made by many methods of the cause-effect pair literature that we review in this chapter, as it considerably simplifies the cause-effect pair problem. However, there are many realistic cases where potential confounding effects can affect both variables (the most typical examples of confounding factors are the age or the gender in epidemiological studies). In this regard, if we come back to our introductory example, a confounding effect may be present, as illustrated on Fig. 3.10. Indeed, the hidden variable latitude is known to be a cause of the temperature, but it is also a cause of the altitude, because in Germany all the mountains are situated in the south of the country. Therefore the link between altitude and temperature could be completely spurious and disappear when conditioning on the latitude variable. This is however not the case in this example, as we still have a well-known causal effect from altitude to temperature, but it highlights the fact that for the cause-effect pair analysis, one should always study the relationship between X and Y after conditioning on the potential observed confounding variables. In Sect. 3.7, we will provide a discussion of the confounding case, which is a great challenge and a problem far from being solved for the cause-effect pair setting. To keep the problem simple in the first instance and clearly explain the cause-effect pair problem where only X → Y or Y → X are considered, we will first assume in this chapter that the confounding case is excluded.

Feedback Loops
A feedback loop appears when both X causes Y and Y causes X: graph X ⇋ Y (Fig. 3.8-(5)). Even if the notion of time is not present, this case may happen, for example, in cross-sectional studies where data are collected over a certain period of time. Mooij et al. [20] give an example where the two variables are the temperature and the amount of sea ice: "an increase of the global temperature causes sea ice to melt, which causes the temperature to rise further (because ice reflects more sun light)".
We will assume in this chapter that this case is excluded. Therefore the causal graph between X and Y refers to the literature on Directed Acyclic Graphs (DAGs) (with only two variables).

Constraint Relation
This case of dependency with a relation X − Y (Fig. 3.8-(6)) may arise, for example, if the two variables are linked by a logical or mathematical formula and not related by a causal relation. As an example, if we write that the productivity P in a firm is equal to the total added value VA of the firm divided by the total number of employees N, the two variables P and VA are linked by a mathematical expression: P = VA/N. By observing VA and knowing N we can immediately deduce P, but the opposite is also true, as knowing P and N gives VA with an equivalent mathematical relation, without any notion of an underlying system that could have generated one of the variables from the other. We do not consider this case in this chapter.

Measurement Noise
In general, it is assumed in the literature that there is no measurement noise. Such measurement noise may occur if, for example, the altitude is not measured precisely, but only its noisy version, noted Z̃. However, the variable temperature T is still a function of the original variable Z, which is not corrupted by measurement noise. This problem amounts to a cause-effect pair problem between T and Z̃ in the presence of a latent hidden variable, namely the original Z. We refer the reader to [20], who propose a benchmark that verifies the robustness of various cause-effect algorithms in the presence of small perturbations of the data.

Variables Units
The orders of magnitude of the variables as well as their physical units are not considered in the problem. In most methods, the variables X and Y are re-scaled to zero mean and unit variance.

Bivariate Functional Causal Models
In the literature on causality, if G denotes an acyclic causal graph (DAG) obtained by drawing arrows from causes X_{Pa(i;G)} towards their effects X_i, it is often assumed that the effects are expressed as a linear function of their causes and an additive Gaussian noise. These models are linear structural equation models, where each variable is continuous and modeled as:

X_i := Σ_{j ∈ Pa(i;G)} α_j X_j + N_i,

with Pa(i;G) the subset of indices of the parents of each variable X_i in graph G and N_i a random noise variable, meant to account for all unobserved variables. The parameters α_j are real values. Each equation characterizes the direct causal relation explaining variable X_i from the set of its causes X_{Pa(i;G)}, based on a linear causal mechanism. These models are used a lot in social science fields such as econometrics and sociology. Although this simplified model with linear mechanisms and additive Gaussian noise appears to be very convenient from a theoretical point of view, it is not often realistic, as the interactions between cause and noise may be more complex in reality. Therefore a more general framework has been proposed by Pearl [21], with potential nonlinear interactions between cause and noise:

X_i := f_i(X_{Pa(i;G)}, N_i). (3.6)

When the DAG G is reduced to X → Y, this system of equations refers to a bivariate structural model.
A generative bivariate FCM B_{G,f,P_N} with causal structure G: X → Y is then defined as:

X := f_X(N_X),
Y := f_Y(X, N_Y),

where f = (f_X, f_Y) is a couple of possibly nonlinear functions and (N_X, N_Y) are two independent random variables drawn according to continuous distributions P_N = (P_{N_X}, P_{N_Y}). One can notice that this definition holds for any type of continuous distribution of the noise P_N. For example, P_{N_Y} can be set to the uniform distribution on [0, 1], and this is not a general restriction, since one can always write Y := f_Y(X, F_{N_Y}^{-1}(N'_Y)) with N'_Y ∼ U[0, 1], where F_{N_Y}^{-1} denotes the quantile function of N_Y. Without loss of generality, we could also write X := N_X, with P_{N_X} = P_X, but in the following we prefer to keep the formulation with two functions (f_X, f_Y) in order to stay consistent with the general formulation of FCM given by Eq. (3.6).
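The claim that the noise distribution can be taken uniform without loss of generality can be checked numerically. The mechanisms below are illustrative choices (quadratic f_Y, exponential noise), not from the chapter.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 50_000

# A bivariate FCM X -> Y with illustrative mechanisms:
#   X := f_X(N_X)      with N_X ~ U[0, 1]
#   Y := f_Y(X, N_Y)   with N_Y ~ Exp(1), N_X independent of N_Y
n_x = rng.uniform(0, 1, n)
x = 2 * n_x - 1                       # f_X: quantile function of U(-1, 1)
y = x**2 + rng.exponential(1.0, n)    # f_Y: additive exponential noise

# Same model with the noise re-expressed as uniform: for U ~ U[0, 1],
# F_{N_Y}^{-1}(U) = -log(1 - U) has the Exp(1) distribution.
u = rng.uniform(0, 1, n)
y2 = x**2 - np.log(1 - u)

# Both formulations induce the same joint distribution Q_{X,Y};
# here we compare the first two moments of Y as a sanity check.
print(abs(y.mean() - y2.mean()), abs(y.std() - y2.std()))
```

Composing the mechanism with the noise's quantile function absorbs any continuous noise distribution into f_Y, which is exactly the reparametrization used in the text.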
According to [20], the assumption that N_X and N_Y are independent is justified by the assumptions that there is no confounding effect, no selection bias, and no feedback between X and Y (see Sect. 3.2.1).

The Problem of Identifiability with Two Variables
Given the formulation of the processes described in Sect. 3.2.2, the task is to identify the causal structure X → Y or Y → X that could have generated the observed data. By identifying we mean proving that f and P_N exist so that the hypothesis B_{G,f,P_N} holds in the causal direction G with respect to the observed data, while there do not exist any f′ and P′_N so that B_{G′,f′,P′_N} holds in the opposite causal direction G′. This problem faces two difficulties. The first is a classical empirical problem: in general one only has access to a finite sample D = {(x_j, y_j)}_{j=1}^n, making it impossible to evaluate P_{X,Y} perfectly. This is why the evaluation will rely on the definition of a model Q_{X,Y} of the data distribution, which we will discuss later on. The second difficulty is more profound, as it is related to the inference of a DAG when only two variables are observed. In this case, it is impossible to identify the causal direction by using classical conditional independence tests (e.g. as in the PC algorithm of [30]), because the two graphs X → Y and Y → X are Markov equivalent.

The Particular Linear Gaussian Case
A first well known identifiability issue arises in the linear Gaussian case because it induces a perfectly symmetric distribution after rescaling ( Fig. 3.11).
A linear Gaussian generative bivariate FCM is defined by the system of equations:

X := α_X N_X,
Y := β_Y X + α_Y N_Y,

with N_X, N_Y independent standard Gaussian variables and (α_X, α_Y, β_Y) ∈ R³. As shown by Mooij et al. [20], it is always possible in this case to find parameters (α′_X, α′_Y, β′_X) and independent standard Gaussian variables N′_X, N′_Y defining a backward model

Y := α′_Y N′_Y,
X := β′_X Y + α′_X N′_X, (3.8)

which induces the same joint distribution. Therefore, for this pair a perfectly symmetric generative model exists in both directions (the two models can only be dissociated by the values of their parameters) and it is impossible to determine the causal direction from these observational data.

General Case
Given any two random variables X and Y with continuous support, [36] shows that if F_{Y|X} is the conditional cumulative distribution function of Y given X and q an arbitrary continuous and strictly monotonic function with a non-zero derivative, then the quantity Ñ = q ∘ F_{Y|X}, where ∘ denotes function composition, is independent of X. Furthermore, the transformation from (X, Y) to (X, Ñ) is always invertible, in the sense that Y can be uniquely reconstructed from (X, Ñ).
Stated in another way, given any two random variables X and Y with continuous support, one can always construct a function f_Y and another variable, denoted by N_Y, which is statistically independent of X and such that Y := f_Y(X, N_Y). And equivalently, one can always construct a function f_X and another variable, denoted by N_X, which is statistically independent of Y and such that X := f_X(Y, N_X).

Characterization with Continuous Quantile Functions and Uniform Noise
More specifically, if the joint density function h of P_{X,Y} is continuous and strictly positive on a compact and convex subset of R², and zero elsewhere, the generative bivariate FCM can be constructed explicitly with uniform noise variables. Indeed, consider the cumulative distribution function F_X defined over the domain of X (F_X(x) = Pr(X < x)). F_X is strictly monotonic, as the joint density function is strictly positive; therefore its inverse, the quantile function Q_X = F_X^{-1}, is well defined, and by setting f_X = Q_X and N_X ∼ U[0, 1], we obtain X := f_X(N_X). For any noise value n_X, let x be the value of X defined from n_X. The conditional cumulative distribution F_Y(y|X = x) = Pr(Y < y|X = x) is continuous and strictly monotonic with respect to y, and can be inverted using the same argument as above. Then we can define f_Y(x, n_Y) = F_Y^{-1}(n_Y|X = x) and we obtain Y := f_Y(X, N_Y) with N_Y ∼ U[0, 1]. In an equivalent manner, we can show that there exists a couple of functions (f'_X, f'_Y) characterizing a generative bivariate FCM in the opposite direction Y → X. Furthermore, it has been shown by Goudet et al. [7] that under the same assumptions on P_{X,Y}, for both candidate generative bivariate FCM X → Y and Y → X, the functions f_X and f_Y defined above are continuous on their definition domain.
An example of a continuous joint density function P_{X,Y}, strictly positive on a compact and convex subset of R² and zero elsewhere, is depicted on Fig. 3.12. The figure also shows the quantile regression of X on Y, corresponding to the estimation of the FCM x := f_X(y, n_X), with n_X set to the values 0.25 (orange curve), 0.5 (green curve) and 0.75 (red curve). It highlights the fact that continuous FCM may be recovered in both directions in this case. However, for the causal orientation X → Y, the FCM is linear for any fixed noise value n_Y, while for the causal orientation Y → X, the FCM is more complicated, as it seems to be linear only for n_X = 0.5 (green curve).
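The conditional-CDF construction can be approximated from data: within narrow bins of X, the normalized rank of y estimates F_{Y|X}(y|x), yielding a noise variable that is uniform and (approximately) independent of X. The joint distribution below is an illustrative heteroscedastic example of our own.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20_000

# A strongly non-additive joint: the conditional spread of Y grows with X.
x = rng.uniform(0, 1, n)
y = rng.normal(0, 0.1 + x)   # scale of Y | X depends on X

# Empirical conditional CDF via within-bin ranks: n_tilde ~ F_{Y|X}(Y|X)
bins = np.digitize(x, np.linspace(0, 1, 101))
n_tilde = np.empty(n)
for b in np.unique(bins):
    idx = np.where(bins == b)[0]
    ranks = np.argsort(np.argsort(y[idx]))       # rank of y within the bin
    n_tilde[idx] = (ranks + 0.5) / len(idx)      # normalized to (0, 1)

# Y is clearly dependent on X, but the constructed noise is not:
dep_raw = abs(np.corrcoef(y**2, x)[0, 1])        # large
dep_noise = abs(np.corrcoef(n_tilde**2, x)[0, 1])  # ~0
print(dep_raw, dep_noise, n_tilde.mean(), n_tilde.std())
```

This is the empirical counterpart of setting N_Y = F_{Y|X}(Y|X): the construction always succeeds, which is precisely why both directions admit an FCM and extra assumptions are needed for identifiability.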

How to Overcome This Identifiability Problem?
This identifiability problem is a negative result for the cause-effect pair problem, because without any additional assumptions the problem is unsolvable.
However, even if both B_{X→Y,f,P_N} and B_{Y→X,f′,P′_N} hold, there is almost always an asymmetry in the data-generating processes X → Y and Y → X, because in general the mechanisms f and f′ do not belong to the same class of functions (except in the linear Gaussian case mentioned before).
If we go back to the introductory example with altitude and temperature, there exist two plausible causal models: T := f_T(Z, N_T) with N_T ⊥⊥ Z (Eq. (3.9)), and Z := f_Z(T, N_Z) with N_Z ⊥⊥ T. However, the first model of Eq. (3.9) can be rewritten to some extent with an additive mechanism, T := f_T(Z) + N_T, while the alternative causal model with causal orientation T → Z cannot be expressed using the same type of expression with additive noise. If one accepts the fact that an additive mechanism is a "simpler" form of conditional, one may prefer the causal orientation Z → T according to Occam's razor principle.
In general, one can see that the factorization of the joint density function P X,Y into P X P Y |X or P Y P X|Y may lead to models with different complexities, with respect to some appropriate notion of complexity to be defined.

Computing a Trade-Off Fit/Complexity
The determination of the best explanatory model is then based in the literature on the following main lines:
• Different candidate bivariate models (hypotheses) are evaluated in both directions.
• For each candidate model, one evaluates a score monitoring the trade-off between the fit of the model (meaning its adequacy to the observational data) and the complexity of the mechanisms involved.
• The model with the best score is returned, with its corresponding causal arrow X → Y or Y → X.

Defining Candidate Bivariate Models and Sampling Data
In order to model such a continuous underlying bivariate generative process B_{G,f,P_N}, which is assumed to have generated the data, we introduce the notion of candidate model B_{G,f̂,Q_N}, described by:
• a structure defined by a causal orientation X → Y or Y → X,
• its estimated mechanisms modeled by f̂,
• an estimated distribution of noise Q_N.

Structure of a Candidate Model
We note θ_X and θ_Y the vectors of parameters of the estimated mechanisms f̂_X and f̂_Y, and we note respectively θ_{N_X} and θ_{N_Y} the vectors of parameters of the distributions of the modeled noise variables Q_{N_X} and Q_{N_Y}. The noise variables are independent. The global vector of parameters of the model is noted θ = (θ_X, θ_Y, θ_{N_X}, θ_{N_Y}). In some approaches proposed in the literature, the cause variable is not modeled but taken as the observed variable; in this case, Q_X = P_X. We refer the reader to Chap. 1 of this book, which explains why this could be a problem for cause-effect inference.

Mechanisms
After having defined the overall structure, one needs to model the mechanisms:
• A first characterization that needs to be specified is the type of interaction between the noise variable and the cause. In the literature, we mainly distinguish additive noise interactions of the form Y := f̂_Y(X) + N_Y from more general interactions of the form Y := f̂_Y(X, N_Y), where the cause variable and the noise are mixed with a non-linear mechanism.
• The second characterization concerns the class of functions used to define f̂. It may range from linear mechanisms, as in the LiNGAM algorithm [28], to complex non-linear mechanisms modeled with Gaussian processes [31] or neural networks [16]. In general, the more complex the mechanisms are, the better the candidate model can fit the data and the more general the model is (meaning that it can be applied to a wide variety of cases). However, it may then be more difficult to assess the causal orientation, because the candidate model has more chances to fit the data equally well in both directions. A method for controlling the complexity of the mechanisms involved will be discussed later on in Sect. 3.4.3.

Sampling Data Points with a Candidate Model
Now we have all the ingredients required to sample data points of the estimated distribution Q_{X,Y} with the generative model depicted in Fig. 3.13, by proceeding as follows: 1. Draw {(n_{X,j}, n_{Y,j})}_{j=1}^n, n samples independent and identically distributed from the joint distribution Q_{N_X}(θ_{N_X}) × Q_{N_Y}(θ_{N_Y}) of the independent noise variables (N_X, N_Y).
2. Generate n samples D̂ = {(x̂_j, ŷ_j)}_{j=1}^n, where each estimated sample x̂_j of variable X is computed from f̂_X(θ_X) with the j-th noise sample n_{X,j}; then each estimated sample ŷ_j of variable Y is computed from f̂_Y(θ_Y) with the j-th noise sample n_{Y,j} and x̂_j.
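As an illustration, this two-step sampling procedure can be sketched as follows; the identity mechanism for the cause and the cubic additive-noise mechanism for the effect are hypothetical choices, not prescribed by any particular method:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_candidate_model(n, f_y, q_nx, q_ny):
    """Sample n points from a candidate model X -> Y.

    Step 1: draw i.i.d. noise samples (n_x, n_y) from Q_NX x Q_NY.
    Step 2: compute x_hat = f_X(n_x) (here f_X is the identity) and
            y_hat = f_Y(x_hat, n_y).
    """
    n_x = q_nx(n)                # noise for the cause
    n_y = q_ny(n)                # noise for the effect
    x_hat = n_x                  # f_X taken as the identity here
    y_hat = f_y(x_hat, n_y)
    return x_hat, y_hat

# Hypothetical additive-noise mechanism: f_Y(x, n) = x**3 + n.
x, y = sample_candidate_model(
    1000,
    f_y=lambda x, n: x**3 + n,
    q_nx=lambda n: rng.normal(0.0, 1.0, n),
    q_ny=lambda n: rng.normal(0.0, 0.3, n),
)
print(x.shape, y.shape)
```

The sample {(x̂_j, ŷ_j)} can then be compared with the observed sample to assess the fit of the candidate model.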
Generative candidate models support interventions, that is, freezing the variable X to some constant v_i. The resulting distribution of Y, noted Q_Y^{do(X=v_i)} and called the interventional distribution [22], can be computed from this model by clamping the value of X to v_i. However, freezing the value of Y to some constant w_i has no impact on X: Q_X^{do(Y=w_i)} = Q_X. Generative candidate models also support counterfactual reasoning [22], as explained with the introductory example.

Model Fitness Score
In order to evaluate the quality of a candidate generative model, we introduce a fit score S_B(θ) of a candidate model B_{G, f̂, Q_N}(θ). This score has been implemented in different ways in the literature. It always has the property of being minimal when P = Q (perfect fit in the large sample limit).

Log-Likelihood Parametric Scores
An example of score used in [31, 32] is the negative log-likelihood score, defined for a candidate generative model with causal orientation X → Y by:

S_B(θ) = −Σ_{i=1}^n [log Q_X(x_i; θ) + log Q_{Y|X}(y_i | x_i; θ)]

This likelihood score is often computed in a parametric context, with special constraints imposed on the class of densities for the distribution of the cause Q_X and the conditional distribution Q_{Y|X}. For example in [31], a Gaussian mixture model is used as a prior distribution of the cause and a Gaussian process with a zero mean function and a squared-exponential covariance function is chosen as prior of the conditional.
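As a minimal sketch of such a score, one can model the cause with a histogram density and the conditional with a linear-Gaussian model (illustrative choices only; [31] uses a Gaussian mixture and a Gaussian process) and compare the negative log-likelihood in both directions:

```python
import numpy as np

def hist_log_density(z, bins=30):
    """Nonparametric log-density of each sample under a histogram model."""
    dens, edges = np.histogram(z, bins=bins, density=True)
    idx = np.clip(np.digitize(z, edges) - 1, 0, bins - 1)
    return np.log(dens[idx])

def nll_score(cause, effect):
    """S = -sum_i [log Q_cause(c_i) + log Q_{effect|cause}(e_i | c_i)].
    Marginal: histogram model; conditional: linear-Gaussian model."""
    a, b = np.polyfit(cause, effect, 1)
    resid = effect - (a * cause + b)
    var = resid.var()
    cond_ll = -0.5 * np.log(2 * np.pi * var) - resid**2 / (2 * var)
    return -(hist_log_density(cause).sum() + cond_ll.sum())

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, 4000)            # non-Gaussian cause
y = x + 0.3 * rng.normal(size=4000)     # linear mechanism, Gaussian noise

print(nll_score(x, y) < nll_score(y, x))   # the causal direction fits better
```

Here the score is lower in the causal direction because the forward residual is exactly Gaussian, whereas the backward residual is not and is penalized by the misspecified Gaussian conditional.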

Implicit Fit Score Computed as an Independence Score Between Cause and Noise Variable
If one considers a model with causal orientation X → Y, [36] shows that minimizing this negative log-likelihood score of the model with respect to the observational data is equivalent to minimizing a mutual information term I(X, N̂_Y; θ) between the estimated noise N̂_Y and the cause X. We re-transcribe here the theoretical justification given by Zhang et al. [36] with our notations.
We consider a candidate model B_{G, f̂, Q_N}(θ) with causal orientation X → Y and where the distribution of the source cause is not modelled but taken as P_X.
One can write Q_{X,N_Y} = P_X Q_{N_Y}, as in this model X and N_Y are assumed to be independent. The Jacobian matrix of the transformation from (X, N_Y) to (X, Y) is:

J_{X→Y} = ( ∂X/∂X   ∂X/∂N_Y ; ∂f_Y/∂X   ∂f_Y/∂N_Y ) = ( 1   0 ; ∂f_Y/∂X   ∂f_Y/∂N_Y )

The absolute value of its determinant is |J_{X→Y}| = |∂f_Y/∂N_Y|. Then Q_{X,Y} = Q_{X,N_Y} / |J_{X→Y}|, which implies

log Q_{X,Y}(x, y) = log P_X(x) + log Q_{N_Y}(n_Y) − log |J_{X→Y}|    (3.15)

As the transformation from (X, N_Y) to (X, Y) is invertible, given any parameter set θ_Y involved in the function f_Y, the noise N_Y can be recovered, and the authors denote by N̂_Y the estimated noise variable. For any parameter set θ = (θ_Y, θ_{N_Y}), using Eq. (3.15) the negative log-likelihood score attained by the model is:

S_B(θ) = −Σ_{i=1}^n [log P_X(x_i) + log Q_{N_Y}(n̂_{Y,i}; θ) − log |J_{X→Y}|_i]    (3.16)

Now the fit of the model can be seen differently. Instead of fitting the model (Sect. 3.2.2) by modeling the noise N_Y which is independent of X and then modeling the conditional Q_{Y|X}, one can start from the true distribution P_{X,Y} and look for an estimate N̂_Y such that N̂_Y and X are independent. In this approach, (X, Y) is recovered from (X, N̂_Y) as:

X = X,    Y = f̂_Y(X, N̂_Y; θ_Y)    (3.17)

In order to make N̂_Y and X independent, one can compute the mutual information between X and N̂_Y, which depends on the parameters θ of the model:

I(X, N̂_Y; θ) = E[ log P_{X,N̂_Y}(x, n̂_Y) − log P_X(x) − log P_{N̂_Y}(n̂_Y) ]    (3.18)

By using the sample version of this quantity and Eq. (3.17) the authors obtain:

I(X, N̂_Y; θ) ≈ (1/n) Σ_{i=1}^n [log P_{X,Y}(x_i, y_i) + log |J_{X→Y}|_i − log P_X(x_i) − log P_{N̂_Y}(n̂_{Y,i})]    (3.19)

Thus,

S_B(θ) = n I(X, N̂_Y; θ) − Σ_{i=1}^n log P_{X,Y}(x_i, y_i)    (3.20)

As the term Σ_{i=1}^n log P(X = x_i, Y = y_i) does not depend on the parameters θ of the model, minimizing the likelihood score S_B(θ) amounts to minimizing the mutual information between the cause X and the estimated noise N̂_Y.
This kind of evaluation score with an independence test is used to fit the ANM [12] and PNL [35] models presented in Sect. 3.5.1.

Non-parametric Scores
Other methods such as [7, 16] use a non-parametric approach. In these methods the authors do not specify any type of distribution for the model of the data Q_{X,Y}, but instead directly compare, with a two-sample test, the observed data sample D with the generated data sample D̂ coming from a candidate model.
In [16], the score of the model is non-parametric and computed with Conditional Generative Adversarial Networks, or CGANs [18] (see Sect. 3.5.2.2). It has been shown by Goodfellow et al. [6] that in the large sample limit, training a Generative Adversarial Network amounts to approximately minimizing the Jensen-Shannon divergence between the true distribution P and the generated distribution Q:

JSD(P ‖ Q) = (1/2) KL(P ‖ M) + (1/2) KL(Q ‖ M), with M = (1/2)(P + Q)    (3.21)

The Jensen-Shannon divergence is an information-theoretic measure of the similarity between two probability distributions. This metric is always positive and equal to zero when P = Q.
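The properties of the Jensen-Shannon divergence can be checked on discrete distributions with a few lines of code (a sketch of the definition only; GAN training approximates this quantity for continuous distributions):

```python
import numpy as np

def jensen_shannon(p, q):
    """JSD(P||Q) = 0.5*KL(P||M) + 0.5*KL(Q||M), with M = 0.5*(P+Q),
    for two discrete distributions given as probability vectors."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    m = 0.5 * (p + q)
    def kl(a, b):
        mask = a > 0                      # 0 * log 0 = 0 by convention
        return np.sum(a[mask] * np.log(a[mask] / b[mask]))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jensen_shannon([0.5, 0.5], [0.5, 0.5]))  # 0.0 when P = Q
print(jensen_shannon([1.0, 0.0], [0.0, 1.0]))  # log(2) for disjoint supports
```

The divergence is zero exactly when the two distributions coincide, and bounded above by log 2 (in nats), which makes it a convenient fit score.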

Complexity of the Model
We now introduce a complexity term C_B(θ) for a candidate bivariate model B_{G, f̂, Q_N}(θ). This complexity notion refers to a simplicity prior on the underlying data generative process. It has been handled in different ways in the literature, ranging from fixing a class of admissible mechanisms to more flexible criteria.

Explicit Class of Admissible Mechanisms
A first type of methods in the literature defines a rudimentary notion of complexity: all mechanisms f̂ belonging to a particular class F of bivariate FCMs are assumed to be simple, while the others are assumed to be complex. In [12], for a causal model with orientation X → Y, a mechanism is assumed to be simple if it can be written in the form Y := f̂_Y(X) + N_Y with X ⊥⊥ N_Y (additive noise model). A more general class of mechanisms is defined by the Post-NonLinear model (PNL) [35], involving an additional nonlinear function on top of an additive noise: Y := ĝ_Y(f̂_Y(X) + N_Y), with ĝ_Y an invertible function and X ⊥⊥ N_Y. For these methods, we write C_B(θ) = 0 if f̂ ∈ F and C_B(θ) = 1 otherwise.
In [7] (CGNN), the class of functional causal models is defined as Y := f̂_Y(X, N_Y), where f̂_Y is a one-hidden-layer neural network with ReLU activation functions and N_Y ∼ N(0, 1) with X ⊥⊥ N_Y. The class of admissible mechanisms is variable, as its size is controlled by the number n_h of hidden units in this one-hidden-layer neural network. In this framework, the complexity term can be expressed as C_B(θ) = n_h.

Kolmogorov Complexity
In the previous section, the complexity term was defined in terms of an explicit class of functionals, but another approach, coming from information theory, has been developed in the literature.
In this information-theoretic framework, the basic postulate that "the factorization of the joint density function P_{cause,effect} into P_{cause} P_{effect|cause} should lead to a simpler model than P_{effect} P_{cause|effect}" can be expressed with the Kolmogorov complexity framework, as shown by [14]:

K(P_{cause}) + K(P_{effect|cause}) ≤ K(P_{effect}) + K(P_{cause|effect})    (up to additive constants)

This inequality comes from the postulate of algorithmic independence between the distribution of the cause P_{cause} and the distribution of the causal mechanism P_{effect|cause}, stated by Janzing and Schölkopf [14] as:

I(P_{cause} : P_{effect|cause}) = 0    (3.23)

where I(P_{cause} : P_{effect|cause}) denotes the algorithmic mutual information.
Kolmogorov complexity and algorithmic mutual information are not computable in practice, but they have led to two different practical implementations in the literature. For a candidate bivariate generative model B_{G, f̂, Q_N}, we have defined a family of functions f̂ (i.e. linear mechanisms, neural networks, Gaussian processes) and a family of noise distributions Q_N. The overall model (mechanisms and noise distributions) is parametrized by the vector of parameters θ ∈ Θ. Furthermore, one can also define a prior probability distribution π of θ over Θ, which can be seen as a simplicity prior over the parameter space. Now according to the MML principle, the problem of transmitting the observed data D = {(x_i, y_i)}_{i=1}^n to a receiver can be decomposed into two parts: 1. First, the model B_{G, f̂, Q_N} with parameter θ from the parameter space Θ is sent by the transmitter to the receiver (assertion). According to [33], the complexity of this model, seen in terms of total code length, can be approximated (up to constant terms) as:

C_B(θ) ≈ −log π(θ) + (1/2) log |J_θ|

where |J_θ| is the determinant of the Fisher information matrix, defined as the matrix of expected second derivatives of the negative log-likelihood function with respect to all continuous parameters of the model, with entry (i, j) given by:

−∫ f̂((x, y)|θ) (∂²/∂θ_i ∂θ_j) log f̂((x, y)|θ) dx dy

2. Second, the data D are encoded with the model (detail), with code length −log f̂(D|θ). This MML principle is used for example by Stegle et al. [31] to select the model associated to X → Y or Y → X with the lowest total code length A_B(θ) = C_B(θ) − log f̂(D|θ) (see Sect. 3.5.2).

Independence Between Cause and Mechanism
Another characterization of complexity builds on the algorithmic independence principle of Eq. (3.23) to derive the idea that if X → Y, the distribution of the causal mechanism P_{Y|X} should be independent of the distribution of the cause P_X; hence estimating a model Q_{Y|X} from P_X should hardly be possible, while estimating a model Q_{X|Y} based on P_Y may be possible.
This complexity measure has been evaluated directly as a covariance estimation between Q_{Y|X} and P_X in [4], or by a characterization of the variance of the kernel embeddings of Q_{Y|X} as X varies over its domain [19] (see Sect. 3.5.3).
One can see that there is a direct connection between "the postulate of independence between the cause and the mechanism" and the complexity of the mechanism: when P_{Y|X} is independent of P_X, the function f̂_Y required to model Q_{Y|X} generally takes a simpler form than the function f̂_X required to model Q_{X|Y}.

Bi-objective Trade-Off for Cause-Effect Inference
In the previous sections, we have defined the complexity C_B(θ) and the fit S_B(θ) of a candidate bivariate model parameterized by the vector of parameters θ. From a general point of view, we can now frame model selection as a bi-objective trade-off with a Pareto front of optimal models (cf. Fig. 3.14).
This bi-objective trade-off between fit and complexity is ubiquitous in science. It has been seen from two different angles: some modeling approaches favor models of lower complexity, while others favor models with better explanatory power (Fig. 3.14):

Fig. 3.14 Bi-objective trade-off between complexity and reproduction
• The first is the KISS approach (Keep It Simple, Stupid), proposed by Axelrod and Hamilton [1], stating that models should include the minimal number of parameters and mechanisms, and add new ones only if required. • On the other hand, the KIDS principle (Keep It Descriptive, Stupid) is an anti-simplistic approach that favors the accuracy of the model and states that all parameters and mechanisms that appear relevant should be included, but parameters that do not add to the quality of the model should be removed [5].
When transposing these modeling approaches to the cause-effect problem, we can see that some methods favor models with low complexity and low explanatory power, such as linear Gaussian models, while others have a better explanatory power as they can model complex interactions between noise and causes (for example when using generative neural networks as in [16]), but at the cost of more complex mechanisms.
In the cause-effect pair literature, we can distinguish three different approaches to deal with this complexity/fit trade-off: • Methods such as those by [7, 12, 16, 35] that reason at fixed complexity C_B(θ) and choose the causal direction according to the best fit. For any candidate model B_{G, f̂, Q_N}, we note θ̂ the set of parameters that minimizes the score S_B(θ) and, to simplify notations, we note S_{X→Y}(θ̂) the best fit score corresponding to the model B_{G, f̂, Q_N} with orientation X → Y. • Methods such as those by [4, 27] that assume a comparable fit P = Q in both directions and choose the causal direction by comparing the complexity terms. In this case, for any candidate model B_{G, f̂, Q_N}, we note θ̂ the set of parameters that minimizes the score C_B(θ) and C_{X→Y}(θ̂) the lowest complexity score corresponding to the model with orientation X → Y. • Methods such as those by Stegle et al. [31] that compute a weighted bi-criteria aggregation between fit and complexity, A_B(θ) = S_B(θ) + λ C_B(θ). We use the same notation θ̂ for the set of parameters that minimizes the score A_B(θ). This bi-objective aggregation corresponds for example to cause-effect inference based on the MML principle. In this case, the total score is directly S_B(θ) + C_B(θ), without any aggregation parameter λ; the aggregation parameter is however "implicitly" set in the chosen prior π(θ).
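The three ways of handling the trade-off can be condensed into a toy decision rule; the scores below are placeholder numbers, not outputs of any actual method:

```python
def decide(s_xy, s_yx, c_xy=0.0, c_yx=0.0, mode="fit", lam=1.0):
    """Choose a causal direction from fit scores S and complexity scores C.

    mode="fit":        compare fit at fixed complexity (e.g. ANM, CGNN).
    mode="complexity": compare complexity at comparable fit (e.g. IGCI).
    mode="aggregate":  compare A = S + lam * C (e.g. MML-based methods).
    """
    if mode == "fit":
        a_xy, a_yx = s_xy, s_yx
    elif mode == "complexity":
        a_xy, a_yx = c_xy, c_yx
    else:
        a_xy, a_yx = s_xy + lam * c_xy, s_yx + lam * c_yx
    # causal score: confidence of the approach in the preferred orientation
    delta = abs(a_yx - a_xy)
    return ("X->Y" if a_xy < a_yx else "Y->X"), delta

print(decide(1.0, 3.0))                           # ('X->Y', 2.0)
print(decide(1.0, 1.0, 5.0, 2.0, "complexity"))   # ('Y->X', 3.0)
```

The returned difference between the two aggregated scores is exactly the causal score discussed next.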
With all these different approaches, a causal score that measures the confidence of the approach in the causal orientation can be defined as the difference between the two scores evaluated in each direction. For example, for the methods that reason at fixed complexity and compare the fit scores, if S_{X→Y}(θ̂) < S_{Y→X}(θ̂), X → Y is preferred, with causal score Δ_{X→Y} = S_{Y→X}(θ̂) − S_{X→Y}(θ̂) > 0.

Review of the Main Pairwise Methods
According to the complexity/fit trade-off framework defined in the previous section we can propose a reading grid for the main algorithms developed in the literature (Table 3.1).
We introduce a taxonomy of methods according to the type of functional involved, their fit scores, their complexity scores and the way they deal with the fit/complexity trade-off. Three main families of methods emerge: 1. Methods with a restricted class of mechanisms. 2. Non-parametric methods. 3. Methods exploiting the independence between cause and mechanism.
We present these families in more detail in Sects. 3.5.1-3.5.3.

Methods with a Restricted Class of Mechanisms
This first class of methods, developed from 2006 to 2010, relies on a restrictive class of admissible mechanisms F and focuses on identifiability results. The main idea is to show that in some cases there exist f ∈ F and Q_N such that the hypothesis B_{G, f, Q_N} holds in the causal direction G with respect to the observed data, while there exist no f′ ∈ F and Q′_N such that B_{G′, f′, Q′_N} holds in the opposite causal direction G′.
These methods range from a very simple class of functions with LiNGAM [28] to more complex classes of functions such as PNL [34], each time with a trade-off between the identifiability and the generality of the proposed approach, as depicted in Fig. 3.15.
Indeed, when the class of functions is very restricted, there are fewer non-identifiable cases, but the model can only be used successfully on very specific observed data (such as data generated by linear mechanisms). When the class of functions becomes larger, the model becomes more general and can be used for more types of distributions, but at the cost of more non-identifiable cases. In the extreme case of a completely general model without restriction on the class of mechanisms, all pairs become non-identifiable, as shown in Sect. 3.3.

Pairwise LiNGAM Inference
The LiNGAM [28] method was first developed for directed acyclic graph orientation with more than two variables. LiNGAM handles linear structural equation models, where each variable is continuous and modeled as:

X_i := Σ_k α_k Pa_k(X_i) + N_i

with Pa_k(X_i) the k-th parent of X_i and α_k a real value. Assuming further that all probability distributions of source nodes in the causal graph are non-Gaussian, [28] show that the causal structure is fully identifiable (all edges can be oriented). In the bivariate case, the authors assume that the variables X and Y are non-Gaussian, as well as standardized to zero mean and unit variance. The goal is to distinguish between two candidate linear causal models.
The first is denoted by X → Y and defined as:

Y := ρX + N_Y, with X ⊥⊥ N_Y

The second model, with orientation Y → X, is defined as:

X := ρY + N_X, with Y ⊥⊥ N_X

The parameter ρ is the same in the two models because it is equal to the correlation coefficient between X and Y.

Identifiability Result
A theoretical identifiability result has been derived by Shimizu et al. [28], who prove that if P_{X,Y} admits the linear model Y := aX + N_Y with X ⊥⊥ N_Y (model 1), then there exist b ∈ R and a random variable N_X such that X := bY + N_X with Y ⊥⊥ N_X (model 2) if and only if X and N_Y are Gaussian.
In other words, there is only one non-identifiable case, corresponding to the linear Gaussian case presented in Sect. 3.3.1. Moreover, if X or N_Y is non-Gaussian, when the candidate linear model with orientation X → Y holds, the candidate linear model with orientation Y → X does not hold.

Practical Evaluation
The candidate models correspond to a restrictive class of mechanisms, and the comparison of the candidate models is based on a comparison of fit scores at fixed complexity. The fit score used for comparison is the likelihood score as defined in Sect. 3.4.2.1, derived by Hyvärinen and Smith [13] in this case as:

S_{X→Y} = Σ_{i=1}^n [G_X(x_i) + G_{N_Y}((y_i − ρx_i)/σ)]

where G_X(u) = log P_X(u), G_{N_Y} is the standardized log probability density function of the residual when regressing Y on X, and σ the standard deviation of this residual. S_{Y→X} is computed similarly, and the causal direction is decided by comparing S_{X→Y} with S_{Y→X}.
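A rough sketch of this likelihood comparison, using a standardized Laplace log-density as the non-Gaussian prior (an illustrative assumption; [13] derive the score for general log-densities G):

```python
import numpy as np

def laplace_logpdf(u):
    # standardized Laplace log-density (unit variance => scale 1/sqrt(2))
    return -np.sqrt(2) * np.abs(u) - 0.5 * np.log(2)

def lingam_score(cause, effect):
    """S_{cause->effect} = sum_i [G(cause_i) + G(resid_i / sd(resid))],
    for standardized variables, with G a non-Gaussian log-density."""
    rho = np.mean(cause * effect)       # correlation after standardization
    resid = effect - rho * cause
    resid = resid / resid.std()
    return np.sum(laplace_logpdf(cause) + laplace_logpdf(resid))

rng = np.random.default_rng(3)
n = 4000
x = rng.laplace(size=n)
y = 0.6 * x + rng.laplace(size=n)       # linear mechanism, non-Gaussian noise
x = (x - x.mean()) / x.std()
y = (y - y.mean()) / y.std()

# the likelihood is higher in the causal direction
print(lingam_score(x, y) > lingam_score(y, x))
```

In the anti-causal direction both the marginal and the residual are sums of Laplace variables, hence closer to Gaussian, and the non-Gaussian prior assigns them a lower likelihood.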

Additive Noise Model
An extension of the previous model to deal with nonlinear mechanisms has been derived by Hoyer et al. [12].

Model
A bivariate additive noise model (ANM) X → Y is defined as:

Y := f_Y(X) + N_Y, with X ⊥⊥ N_Y

Identifiability Result
Hoyer et al. [12] proved that the ANM is generally identifiable, meaning that if P_{X,Y} satisfies an additive noise model with orientation X → Y, P_{X,Y} cannot in general satisfy an additive noise model with orientation Y → X. However, specific cases have been identified where the ANM is non-identifiable (see [35]). In particular, the linear Gaussian case presented in Sect. 3.3.1 is non-identifiable.
Practical Evaluation Based on Independence Score (ANM-HSIC)
In the original paper [12], the ANM is seen as a discriminative model and the fit score is based on an independence test between noise and cause (Sect. 3.4.2.2). More precisely, for the two alternatives X → Y and Y → X, the estimated mechanisms f̂_Y and f̂_X are obtained via Gaussian process regressions. These estimated regression functions are used to estimate the residuals n̂_Y = y − f̂_Y(x) and n̂_X = x − f̂_X(y). The scores S_{X→Y} and S_{Y→X} correspond respectively to the kernel HSIC independence test [9] between n̂_Y and x (for X → Y) and between n̂_X and y (for Y → X).
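The ANM-HSIC procedure can be sketched as follows, substituting a polynomial fit for the Gaussian process regression of the original method and using a biased empirical HSIC estimator with the median heuristic for the kernel bandwidth:

```python
import numpy as np

def rbf_gram(z, gamma=None):
    d2 = (z[:, None] - z[None, :]) ** 2
    if gamma is None:                       # median heuristic
        gamma = 1.0 / np.median(d2[d2 > 0])
    return np.exp(-gamma * d2)

def hsic(a, b):
    """Biased empirical HSIC statistic with RBF kernels [9]."""
    n = len(a)
    K, L = rbf_gram(a), rbf_gram(b)
    H = np.eye(n) - np.ones((n, n)) / n     # centering matrix
    return np.trace(K @ H @ L @ H) / n**2

def anm_direction(x, y, deg=3):
    """Regress each variable on the other (polynomial stand-in for the
    Gaussian process regression of [12]), then compare residual-input HSIC."""
    r_y = y - np.polyval(np.polyfit(x, y, deg), x)   # residual of Y given X
    r_x = x - np.polyval(np.polyfit(y, x, deg), y)   # residual of X given Y
    return "X->Y" if hsic(x, r_y) < hsic(y, r_x) else "Y->X"

rng = np.random.default_rng(4)
x = rng.uniform(-1, 1, 400)
y = x + x**3 + 0.2 * rng.uniform(-1, 1, 400)   # additive-noise pair
print(anm_direction(x, y))
```

The residual in the causal direction is (by construction) independent of the cause, while the backward residual inherits a dependence on the effect, which HSIC detects.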

Post Nonlinear Model
The ANM may fail to recover the causal direction when the noise is not additive. Therefore, a generalization of the ANM called the Post-NonLinear model (PNL), which takes into account nonlinear interactions between the cause and the noise, has been proposed by Zhang and Hyvärinen [34].

Model
A bivariate Post-NonLinear model (PNL) X → Y is defined as:

Y := ĝ_Y(f̂_Y(X) + N_Y), with ĝ_Y an invertible function and X ⊥⊥ N_Y

One can notice that LiNGAM and ANM are special cases of PNL. Indeed, by setting ĝ_Y to the identity function we recover the ANM. By choosing in addition f̂_Y to be linear and one of X or N_Y to be non-Gaussian, we recover the LiNGAM model.

Identifiability Result
Zhang and Hyvärinen [35] proved that the PNL model is generally identifiable, meaning that if P_{X,Y} satisfies a PNL model X → Y, P_{X,Y} cannot satisfy a PNL model Y → X, except when specific conditions are encountered. The set of non-identifiable distributions P_{X,Y} is larger than for the ANM. However, PNL is more general and can handle more types of observed distributions.

Practical Evaluation
The model has been evaluated in [34] using an independence score between cause and noise (cf. Sect. 3.4.2.2).
As ĝ_Y is assumed to be invertible, the idea is that the noise variable in Eq. (3.31) can be recovered from P_{X,Y} as:

N_Y = ĝ_Y^{-1}(Y) − f̂_Y(X)

The noise variable is then estimated by functions l_1 and l_2 such that N̂_Y = l_1(Y) − l_2(X), with N̂_Y independent of X. This comes down to solving a constrained nonlinear ICA problem, which can be achieved by minimizing I(X, N̂_Y; θ), the mutual information between X and N̂_Y [34], with respect to the parameters θ of the model. Symmetrically, an optimization of I(Y, N̂_X; θ) is performed.
The causal direction X → Y is preferred if I(X, N̂_Y; θ̂) < I(Y, N̂_X; θ̂), and Y → X otherwise.

Causal Inference by Choosing Graphs with Most Plausible Markov Kernel
In [32], the candidate generative models are derived from plausible Markov kernels and evaluated with log-likelihood scores.

Model
For a generative bivariate model X → Y, the evaluations of cause and mechanism are the following: • the distribution of the modeled cause Q_X is recovered by maximizing the entropy H(Q_X) under constraints on the first and second moments of Q_X, to make them correspond to those of the observed data P_X; • following the same idea, the distribution of the mechanism Q_{Y|X} is modeled by maximizing the entropy H(Q_{Y|X}) under constraints on the first and second moments.

Evaluation
Once the model is recovered, the log-likelihood scores of the models Q_X Q_{Y|X} and Q_Y Q_{X|Y} are computed as explained in Sect. 3.4.2.1, and the causal direction is determined by comparing the two scores.

Non-parametric Methods
The methods presented in the last section always assumed that the causal mechanisms belong to a restricted class of functions F. However, this a priori restriction poses serious practical limitations when the task is to infer the causal direction on real data. Indeed, in reality the mechanisms are often far from linear, and the interaction between noise and cause may be more complex than additive or even post-nonlinear noise. This is why more general methods have been proposed, following the pioneering work of Stegle et al. [31]. These methods generally offer better overall results on real data as they are more flexible, but they come with a loss of theoretical identifiability results, as no explicit restriction on the class of functions is imposed (Sect. 3.3). The causal direction is often recovered by setting a smooth prior on the complexity of the mechanisms.

Probabilistic Latent Variable Models
A fully non-parametric Bayesian approach was proposed by Stegle et al. [31]. The name of the algorithm is GPI, for Gaussian Process Inference.

Model
The approach aims to address the most general formulation of a generative bivariate model with orientation X → Y, with complex interaction between cause and noise: Y := f_Y(X, N_Y). The distribution of the cause in GPI is modeled with a Gaussian mixture model with k modes:

Q_X(x) = Σ_{j=1}^k α_j N(x; μ_j, σ_j²)

with parameters θ_X = {(α_j, μ_j, σ_j)}_{j=1}^k. The mechanism of GPI is a Gaussian process with zero mean and a squared-exponential covariance function K_{θ_Y}, whose entry (i, j) is computed from the inputs (x_i, n_i) and (x_j, n_j) with length scales λ_X and λ_N and amplitude λ_Y, where θ_Y = (λ_X, λ_Y, λ_N) are hyper-parameters. Gamma priors are set on all these lambda parameters.

Practical Evaluation
The model is then evaluated using the MML framework exposed in Sect. 3.4.3.2. According to [31], for a GPI model in the direction X → Y, the global aggregated MML score between fit and complexity can be expressed as the sum of an MML score for the cause (the code length of the Gaussian mixture model of Q_X) and an MML score for the mechanism (the code length of the Gaussian process model of Q_{Y|X}).

Conditional GAN for Causal Discovery
Lopez-Paz and Oquab [16] propose to use Conditional Generative Adversarial Networks, or CGANs [18] in order to model realistic causal mechanisms.

Model
The model with direction X → Y takes the form of a discriminative model, where the causal mechanism is defined by a one-hidden-layer neural network f̂_Y with parameters θ_Y. The variable Y is generated as:

Y := f̂_Y(X, N_Y)    (3.38)

N_Y is independent of X and drawn according to the standard normal distribution N(0, 1). X is the conditioning variable input to the generator and follows the observed distribution P_X.

Evaluation
In order to train this conditional GAN, a discriminative neural network d with parameters ω is used, and the problem amounts to solving the min-max optimization problem:

min_{θ_Y} max_ω Σ_{i=1}^n [log d_ω(x_i, y_i) + log(1 − d_ω(x_i, ŷ_i))]    (3.39)

Then the principle used for causal discovery is to learn two CGANs: one with a generator f̂_Y from X to Y to synthesize the dataset D̂_{X→Y} = {(x_i, ŷ_i)}_{i=1}^n, and the other with a generator f̂_X from Y to X to synthesize the dataset D̂_{Y→X} = {(x̂_i, y_i)}_{i=1}^n. Then, the causal direction is decided to be X → Y if the two-sample test statistic between the real sample D = {(x_i, y_i)}_{i=1}^n and D̂_{X→Y} is smaller than the one between D and D̂_{Y→X}.

Causal Generative Neural Network
Goudet et al. [7] proposed an extension of [16] for multivariate causal discovery, called CGNN for Causal Generative Neural Network. As in [16], the mechanisms are modelled with generative neural networks.
If the joint density function h of P_{X,Y} is continuous and strictly positive on a compact and convex subset of R², and zero elsewhere, it has been shown that there exist two CGNNs X → Y and Y → X that approximate P_{X,Y} with arbitrary accuracy. This result highlights the generality of the approach. However, it also raises the issue that a CGNN can reproduce the observational distribution equally well in both directions. This non-identifiability issue is empirically mitigated by restricting the class of CGNNs considered, specifically by limiting the number n_h of hidden neurons in each causal mechanism. This parameter n_h can be seen as a complexity parameter that governs the CGNN's ability to model the causal mechanisms: with too small an n_h, data patterns may be missed; with too large an n_h, overly complicated causal mechanisms may be retained.

Practical Evaluation
For practical use in the cause-effect pair setting, n_h is empirically set to 30 hidden units in [7]. The fit score of the model is evaluated with a kernel two-sample test between the sample D coming from the observed distribution P_{X,Y} and the sample D̂ coming from the generated distribution Q_{X,Y}:

S_B(θ) = MMD_k(D, D̂)    (3.40)

where MMD_k(D, D̂) is the empirical Maximum Mean Discrepancy (MMD) [10] defined as:

MMD_k(D, D̂) = (1/n²) Σ_{i,j=1}^n k(z_i, z_j) + (1/n²) Σ_{i,j=1}^n k(ẑ_i, ẑ_j) − (2/n²) Σ_{i,j=1}^n k(z_i, ẑ_j)    (3.41)

where z = [x, y] is the two-dimensional vector composed of x and y. The kernel k is usually taken as the Gaussian kernel (k(z, z′) = exp(−γ‖z − z′‖²₂)). For a fixed number of hidden units n_h in each mechanism f̂_X and f̂_Y, the causal direction is then based on the comparison of the best fit scores S_{X→Y}(θ̂) and S_{Y→X}(θ̂) in both directions.
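The empirical MMD statistic can be sketched in a few lines (a biased estimator with a Gaussian kernel and a fixed bandwidth; the CGNN implementation details differ):

```python
import numpy as np

def mmd2(d_obs, d_gen, gamma=1.0):
    """Empirical squared MMD between samples d_obs and d_gen (rows are
    2-d points z = [x, y]), with kernel k(z, z') = exp(-gamma*||z - z'||^2)."""
    def gram(a, b):
        d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
        return np.exp(-gamma * d2)
    return gram(d_obs, d_obs).mean() + gram(d_gen, d_gen).mean() \
        - 2 * gram(d_obs, d_gen).mean()

rng = np.random.default_rng(5)
same = rng.normal(size=(300, 2))
also_same = rng.normal(size=(300, 2))
shifted = rng.normal(size=(300, 2)) + 2.0

# samples from the same distribution have a much smaller MMD
print(mmd2(same, also_same) < mmd2(same, shifted))
```

The statistic vanishes in the large sample limit when the two samples come from the same distribution, which is what makes it usable as a fit score.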

Methods That Exploit Independence Between Cause and Mechanism
This class of methods exploits the notion of complexity of the causal mechanisms from a different perspective. These methods are based on the principle stated in Sect. 3.4.3.2, saying that when X → Y, the distribution of the cause P_X should not contain information that can be useful to derive the conditional model Q_{Y|X} from the data.

NonLinear Deterministic Mechanism
One of the first methods that exploits the postulate of independence between cause and mechanism is the Information Geometric Causal Inference (IGCI) algorithm [4].

Model
When X → Y, the authors assume that Y was generated from X by a nonlinear deterministic and invertible function:

Y := h(X)

Then the authors exploit a certain type of independence between P_X and the estimate of the function h. In particular, they interpret x → p_X(x) and x → log h′(x) as random variables on the probability space [0, 1]. Then they compute the covariance of these two random variables with respect to the uniform distribution on [0, 1]:

Cov(log h′, p_X) = ∫₀¹ log h′(x) p_X(x) dx − ∫₀¹ log h′(x) dx ∫₀¹ p_X(x) dx    (3.43)

The authors postulate that Cov(log h′, p_X) is close to zero, meaning that P_X does not contain information about Q_{Y|X}. Moreover, it implies that for the reverse direction, p_Y and log (h⁻¹)′ are positively correlated. Therefore P_Y contains information about Q_{X|Y}, which implies an asymmetry between X and Y.

Practical Evaluation
To evaluate this model, the authors compute:

C_{X→Y} = ∫₀¹ log h′(x) p_X(x) dx    (3.44)

The algorithm IGCI infers X → Y whenever C_{X→Y} is negative. The authors also show that the evaluation can be simplified into an entropy comparison:

C_{X→Y} = H(Y) − H(X)    (3.46)

Then it is decided that X → Y if H(X) > H(Y), and Y → X otherwise.
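The entropy-based IGCI criterion can be sketched with a simple spacing-based entropy estimator (the estimator choice is illustrative; the original implementation differs in details, and both variables are assumed rescaled to [0, 1]):

```python
import numpy as np

def spacing_entropy(z, m=1):
    """Differential entropy estimate from order-statistics spacings.
    The estimator has a constant bias that cancels when comparing
    two entropies."""
    z = np.sort(z)
    n = len(z)
    gaps = np.maximum(z[m:] - z[:-m], 1e-12)
    return np.mean(np.log(n / m * gaps))

def igci_direction(x, y):
    # C_{X->Y} = H(Y) - H(X); infer X -> Y when it is negative,
    # i.e. when H(X) > H(Y)
    return "X->Y" if spacing_entropy(x) > spacing_entropy(y) else "Y->X"

rng = np.random.default_rng(6)
x = rng.uniform(0.0, 1.0, 2000)
y = x ** 3                      # deterministic invertible mechanism on [0, 1]
print(igci_direction(x, y))
```

Here H(Y) = H(X) + E[log h′(X)] < H(X), so the entropy comparison recovers the causal direction.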

Unsupervised Inverse Regression
Sgouritsa et al. [27] propose a method based on the idea that if X → Y, P_X should not contain information about P_{Y|X}, while P_Y may contain information about P_{X|Y}. Therefore, the estimation of Q_{Y|X} based on P_X should be less accurate than the estimation of Q_{X|Y} based on P_Y.

Model
The causal mechanism Q_{Y|X} is modeled with a Gaussian process latent variable model (GPLVM): the likelihood of the data under the model is a Gaussian distribution whose covariance matrix K is built from an RBF kernel evaluated on the latent inputs.

Causal Inference via Kernel Deviance Measures
Mitrovic et al. [19] exploit the same postulate, that Q_{Y|X} should be independent of P_X whenever X → Y. The idea is that the estimates of the conditional distributions {Q_{Y|X=x_i}}_{i=1}^n should be less sensitive to the different values x_i taken by the variable X than the conditionals {Q_{X|Y=y_i}}_{i=1}^n are sensitive to the different values y_i of Y.

Model
For each direction, each conditional distribution is evaluated with a conditional RBF kernel embedding into the Hilbert space of infinitely differentiable functions.

Evaluation
A score for the causal direction X → Y is evaluated as the deviance of this set of conditional embeddings with respect to the Reproducing Kernel Hilbert Space (RKHS) norm:

C_{X→Y} = (1/n) Σ_{i=1}^n ( ‖μ_{Y|X=x_i}‖_H − (1/n) Σ_{j=1}^n ‖μ_{Y|X=x_j}‖_H )²

where μ_{Y|X=x_i} denotes the embedding of Q_{Y|X=x_i}. C_{Y→X} is computed analogously, and the causal direction X → Y is preferred if C_{X→Y} < C_{Y→X}.

Cause-Effect Inference by Comparing Regression Errors
A more recent method proposed by Bloebaum et al. [2], called RECI, is based on a slightly different idea than the assumption of independence between cause and mechanism. The authors assume that the causal mechanism represents a law of nature that persists when the distribution of the cause and the distribution of the noise "change due to changing background conditions".

Model
They denote by φ the function that minimizes the expected least-squares error when predicting Y from X, φ(x) = E[Y|X = x], and by ψ the minimizer of the least-squares error when predicting X from Y, ψ(y) = E[X|Y = y].
For a model with the causal orientation X → Y, the authors rewrite the bivariate functional causal model of Definition 3.1 in the following new form:

Y := φ(X) + αN    (3.50)

where α ∈ R⁺ and N is a noise variable not necessarily independent of X.
They show, in the regime of almost deterministic relations (when α → 0), that this implies:

E[(Y − φ(X))²] ≤ E[(X − ψ(Y))²]    (3.51)

which can be translated into "the MSE of regressing Y on X is lower than the MSE of regressing X on Y". To obtain this result, the authors make the following main assumptions, which we briefly summarize here: • Invertible function: φ is a strictly monotonically increasing (or decreasing), twice differentiable function. • Compact supports: the distribution of X has compact support. • Independence postulate: the functions x → φ′(x) and x → Var[N|X = x] p_X(x), seen as random variables, are assumed to be uncorrelated, which is formally stated, after re-scaling of the variables between 0 and 1, as:

∫₀¹ φ′(x) Var[N|X = x] p_X(x) dx = ∫₀¹ φ′(x) dx ∫₀¹ Var[N|X = x] p_X(x) dx    (3.52)

Practical Evaluation
The RECI method consists in fitting a non-linear regression model in both directions, after re-scaling both variables between 0 and 1, and comparing the mean squared error losses. The regression models used by the authors may be logistic functions, polynomial functions or neural networks. In practice, the authors report better empirical results with simple polynomial regression models such as shifted monomial functions ax³ + c.
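A minimal sketch of this procedure, using a full cubic polynomial fit instead of the shifted monomials of the authors:

```python
import numpy as np

def rescale(z):
    return (z - z.min()) / (z.max() - z.min())

def reci_direction(x, y, deg=3):
    """Fit polynomial regressions in both directions after rescaling to
    [0, 1] and prefer the direction with the smaller mean squared error."""
    x, y = rescale(x), rescale(y)
    mse_xy = np.mean((y - np.polyval(np.polyfit(x, y, deg), x)) ** 2)
    mse_yx = np.mean((x - np.polyval(np.polyfit(y, x, deg), y)) ** 2)
    return "X->Y" if mse_xy < mse_yx else "Y->X"

rng = np.random.default_rng(7)
x = rng.uniform(0.0, 1.0, 2000)
y = x ** 2 + 0.05 * rng.normal(size=2000)   # almost deterministic mechanism
print(reci_direction(x, y))
```

In the backward direction the noise is amplified where the inverse mechanism is steep, which inflates the regression error and reveals the asymmetry.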

Experimental Comparison of Cause-Effect Inference Algorithms
In the last section we have presented an overview of the main methods proposed for the cause-effect pair problem. We now propose to evaluate the different algorithms whose code is available online on several cause-effect pair datasets.

Datasets
Five datasets with continuous variables are considered: • CE-Cha: 300 continuous variable pairs from the cause-effect pair challenge [11], presented in more detail in the next chapter. Here we only consider pairs with label +1 (X → Y) and −1 (Y → X) (notably, the confounding case is excluded). • CE-Net: 300 artificial pairs generated with a neural network initialized with random weights and a random distribution for the cause (exponential, gamma, lognormal, laplace...). • CE-Gauss: 300 artificial pairs without confounder, sampled with the generator of [20]: Y := f_Y(X, N_Y) and X := f_X(N_X), with N_X ∼ P_{N_X} and N_Y ∼ P_{N_Y}. P_{N_X} and P_{N_Y} are randomly generated Gaussian mixture distributions. The causal mechanisms f_X and f_Y are randomly generated Gaussian processes. • CE-Multi: 300 artificial pairs generated with linear and polynomial mechanisms. The effect variables are built with additive or multiplicative noise settings. • CE-Tueb: 99 real-world cause-effect pairs from the Tuebingen cause-effect pairs dataset, version August 2016 [20]. This dataset covers various domains: climate, census, medicine data.
For all variable pairs, the sample size n is capped at 1500 for the sake of an acceptable overall computational load. To provide an overview of the type of pairs encountered in each dataset, one hundred pairs of each of them are displayed in Fig. 3.16 (the real pair with altitude and temperature given as introductory example corresponds to the first pair in the top left corner of CE-Tueb).
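For illustration, a CE-Gauss-like pair can be sampled along these lines (a simplified sketch with NumPy: the mechanism f_Y is approximated by a GP sample of X plus additive noise, whereas the actual generator of [20] draws f_Y(X, N_Y) jointly; all names are ours):

```python
import numpy as np

def sample_gmm(n, rng, k=3):
    """Sample n points from a randomly parameterized Gaussian mixture."""
    means = rng.normal(0.0, 2.0, k)
    stds = rng.uniform(0.5, 1.5, k)
    comp = rng.integers(0, k, n)
    return rng.normal(means[comp], stds[comp])

def sample_gp(x, rng, length_scale=1.0):
    """Draw one function sample at inputs x from a zero-mean GP (RBF kernel)."""
    d = x[:, None] - x[None, :]
    cov = np.exp(-0.5 * (d / length_scale) ** 2) + 1e-8 * np.eye(len(x))
    return rng.multivariate_normal(np.zeros(len(x)), cov)

def generate_pair(n=300, seed=0):
    """X := f_X(N_X), Y := f_Y(X, N_Y), with f_Y simplified to GP(X) + noise."""
    rng = np.random.default_rng(seed)
    x = sample_gmm(n, rng)
    y = sample_gp(x / x.std(), rng) + 0.3 * sample_gmm(n, rng)
    return x, y
```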

Algorithms
We compare the performance of the following algorithms presented in this chapter:
• Best mse: the method presented in the introduction that consists in fitting a non-linear Gaussian process regression model in both directions, after re-scaling both variables to zero mean and unit variance, and comparing the mean squared error losses (mse). The direction corresponding to the fit with the lowest mse is preferred.
• LiNGAM [28]: relies on Independent Component Analysis to identify the linear relations between variables (see Sect. 3.5.1.1).
For the methods Best mse and RECI we use the implementation provided in the CausalDiscoveryToolbox. For the algorithms ANM, IGCI, PNL, GPI and LiNGAM we use the R program available at https://github.com/ssamot/causality. For CGNN we use the code available on GitHub at https://github.com/GoudetOlivier/CGNN. We use the authors' default parameters for each algorithm implementation.

Performance Metric
The task of orienting each observed pair as X → Y or Y → X is a binary classification problem. We propose two scores to evaluate the performance of the algorithms:
• The accuracy score, computed as the ratio of well-oriented edges over the total number of edges for the datasets CE-Cha, CE-Net, CE-Gauss and CE-Multi. For the CE-Tueb dataset, a weighted accuracy is computed as in [20] in order to take into account dependent pairs from the same domain.
• The area under the ROC curve, computed using the causal score given by each model, which measures the confidence of the approach in the causal orientation of each pair. This score is defined as the difference between the two scores of each candidate model evaluated in both directions (cf. Sect. 3.4).
Table 3.2 reports the area under the ROC curve (AUROC) and the accuracy (in parentheses) for each benchmark and each algorithm. The corresponding ROC curves are displayed in Fig. 3.17.
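Both metrics are straightforward to compute from the per-pair labels and signed causal scores (a sketch with NumPy; the AUROC is obtained via the equivalent Mann-Whitney rank statistic rather than an explicit ROC curve, and the weighting of dependent CE-Tueb pairs is omitted):

```python
import numpy as np

def accuracy(labels, scores):
    """Fraction of pairs whose predicted orientation matches the label.
    labels: +1 for X -> Y, -1 for Y -> X; scores: signed causal scores,
    positive meaning the X -> Y model is preferred."""
    return np.mean(np.sign(scores) == np.asarray(labels))

def auroc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic:
    the fraction of (X->Y, Y->X) pair combinations ranked correctly,
    counting ties as one half."""
    labels, scores = np.asarray(labels), np.asarray(scores, dtype=float)
    pos, neg = scores[labels == 1], scores[labels == -1]
    diff = pos[:, None] - neg[None, :]
    return (diff > 0).mean() + 0.5 * (diff == 0).mean()
```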

Results
The method LiNGAM is outperformed, as it uses linear mechanisms to model the data-generating process, which is in many cases unrealistic. IGCI does not seem to be very robust: it takes advantage of some specific features of a dataset (e.g. the cause entropy being lower than the effect entropy in CE-Multi), but remains near chance level otherwise. The method ANM yields good results when the additive assumption holds (e.g. on CE-Gauss), but fails otherwise. PNL, less restrictive than ANM, yields overall good results compared to the former methods. The method RECI, based on the comparison of mean squared error scores after proper re-scaling, provides overall good results too; however, it fails for more complex pairs due to restrictive assumptions on the causal mechanisms involved. Lastly, methods like GPI and CGNN, which admit the most flexible class of causal mechanisms and use non-parametric metrics to match the distributions, perform well on most datasets, including the real-world cause-effect pairs CE-Tueb, at the price of a higher computational cost (resp. 32 min on CPU for GPI and 24 min on GPU for CGNN). Let us note that better scores for the Tübingen dataset (CE-Tueb) are reported in recent papers such as [16] (accuracy of 82.0%) and [19] (accuracy of 78.7%), using ensemble methods with the algorithms presented in Sects. 3.5.2.2 and 3.5.3.3. Underlined values correspond to the best score for each dataset. Bold values correspond to the best overall score on all datasets.

Discussion and Open Problems
The cause-effect inference problem is relatively new in the Machine Learning literature, and many open problems still need to be addressed in order to build a robust tool that practitioners can use, for example to confirm whether a given treatment has an impact on a given disease, or to discover whether a gene has a regulatory power on another one.

Relax the Causal Sufficiency Assumption
To build such useful tools for practitioners, one of the first assumptions that needs to be relaxed is causal sufficiency: in real problems it is very rare to find a cause-effect pair that is not affected by a hidden common confounder, such as age or gender, influencing both variables. One idea proposed in the literature is to model confounders by introducing correlation between the noise variables N_X and N_Y that affect X and Y, as in [26], or by modelling all the unobserved confounding effects by a new noise variable N_XY entering the generation process of both X and Y [7, 15].
If we reformulate the generative bivariate framework of Sect. 3.4.1.1 in the presence of confounders, three alternative candidate models must be considered, corresponding to the structures X → Y, Y → X and X ↔ Y. A diagram of the model is presented in Fig. 3.18. We use the same notation for the vector of parameters as in Sect. 3.4.1.1, but with a new set of parameters θ_N_XY for the distribution of the common noise variable N_XY.
The same approach with a trade-off between fit and complexity could be used to compare these three alternative models in the presence of potential confounders. An additional penalty term depending on the total number of edges |G| in the model could be introduced, as in score-based methods for multivariate inference [3], in order to remove a possible spurious link between X and Y due to the confounding effect. We have indeed |G| = 1 for the models with structures X → Y or Y → X, and |G| = 0 for the model with structure X ↔ Y.
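Schematically, such a penalized comparison could look as follows (a sketch only: the BIC-style complexity term and the edge-penalty weight are illustrative assumptions, and the log-likelihoods, parameter counts and sample size are made-up numbers):

```python
import numpy as np

def model_score(log_likelihood, n_params, n_edges, n, edge_penalty=1.0):
    """BIC-style score (higher is better) with an extra |G| edge penalty.
    The edge_penalty weight is an illustrative assumption."""
    return log_likelihood - 0.5 * n_params * np.log(n) - edge_penalty * n_edges

# Three candidate structures for a pair (X, Y) with a possible confounder N_XY.
# The confounded model X <-> Y has |G| = 0 edges between observed variables.
scores = {
    'X -> Y':  model_score(-100.0, 5, 1, 200),
    'Y -> X':  model_score(-110.0, 5, 1, 200),
    'X <-> Y': model_score(-98.0, 6, 0, 200),
}
best = max(scores, key=scores.get)
```

Here the edge penalty tips the balance toward the confounded structure when the two directed models do not fit markedly better.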
This framework could also take into account known variables as confounders. In this case the variable N XY would be replaced by a set of observed variables (e.g. the latitude in the introductory example of this chapter with altitude and temperature).
Furthermore, one can notice that this extension of the generative approach to the confounded case would not be possible for discriminative approaches, where the cause variable is not modelled (i.e. when only comparing P(Y|X) with P(X|Y)).

Need for Real Datasets of a Big Size
Another important problem, often overlooked by the "cause-effect pair community", concerns benchmarking. As of the writing of this chapter, the main real-data benchmark used to compare the different methods is the Tübingen dataset, composed of only about one hundred pairs which are often very similar. It is indeed very difficult to collect cause-effect pairs from the real world with enough data points and an authenticated ground truth. But one must keep in mind that it is very easy for most of the methods presented in this chapter to tune their hyperparameters (even unintentionally) in order to obtain the best results. This overfitting problem is compounded by the fact that this dataset is, most of the time, not separated into train/validation/test sets. To overcome this problem, a Cause-Effect Pair Challenge has been proposed by Guyon [11], with real and artificial data generated with various mechanisms. It will be discussed in the next chapter.

Biased Assessment Due to Artifacts in Data
It is often observed that the cause variable is discrete ordinal while the effect is continuous. This induces an artificial asymmetry between cause and effect which can lead to biased assessments. Let us consider for example the real abalone data, representing the age of snails (X-axis) and the corresponding weight of their shell (Y-axis), as depicted in Fig. 3.19. The ground truth is that age causes the weight of the shell.
We see that, due to the experimental conditions when collecting the data, the age variable is discrete (a categorical ordered, or ordinal, variable) and not continuous. Because of this artifact on the age variable, the conditional P(weight|age) is much more likely to be simpler than P(age|weight). This favors approaches that only compare P(Y|X) with P(X|Y), and it may lead to inconsistent methods for the cause-effect pair problem, as stated in Chap. 1 of this book.
An open problem could be to find a way to correct this bias when one encounters this type of data. This was done in the design of the dataset of the Cause-effect Pair Challenge.

Extension of the Generative Approach for Categorical Variables
In this chapter we have discussed the cause-effect pair problem for continuous variables, but the same ideas could be applied to categorical or mixed variables. To the best of our knowledge, only a few attempts have been made to solve the cause-effect pair problem with categorical data [17, 24]. This may be explained by the fact that the cause-effect problem for categorical data is likely much harder than for continuous data, because there is in general less information to exploit in the distribution of the noise and in the asymmetry of the causal mechanisms.
Consider for example the extreme case of two binary variables: what kind of causal information can really be exploited from a 2 × 2 matrix that represents the distribution of the data points over the four cells (1,0), (0,1), (1,1) and (0,0)?
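The following small example illustrates the difficulty: any 2 × 2 joint distribution is reproduced exactly by both factorizations, so the fit term alone carries no information about the causal direction (the counts below are made up):

```python
import numpy as np

# Made-up 2x2 contingency table of counts: rows index X, columns index Y.
counts = np.array([[40.0, 10.0],
                   [20.0, 30.0]])
joint = counts / counts.sum()

# Factorize the joint in both causal directions.
p_x = joint.sum(axis=1)                  # P(X)
p_y_given_x = joint / p_x[:, None]       # P(Y|X)
p_y = joint.sum(axis=0)                  # P(Y)
p_x_given_y = joint / p_y[None, :]       # P(X|Y)

# Both candidate models reproduce the observed joint exactly.
assert np.allclose(p_x[:, None] * p_y_given_x, joint)
assert np.allclose(p_y[None, :] * p_x_given_y, joint)
```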

Extension of the Pairwise Setting for Complete Graph Inference
There is a need for methods that can truly bridge the gap between the cause-effect pair problem and the complete problem of causal discovery with more than two variables. A promising research direction would be an efficient approach for the general multivariate case that can potentially exploit all the information available, including the asymmetry between cause and effect and the conditional independence relations.
In this direction, an extension of the bivariate Post-NonLinear model (PNL) has been proposed [35], where an FCM is trained for every plausible causal structure and each model is tested a posteriori for the required independence between errors and causes; its main limitation is a super-exponential cost in the number of variables [35]. Another hybrid approach uses a constraint-based algorithm to identify a Markov equivalence class, and thereafter uses bivariate modeling to orient the remaining edges [35]. For example, the constraint-based PC algorithm [30] can identify v-structures, enabling the bivariate PNL method to further infer the remaining arrows that are not identifiable with conditional independence tests. However, an effective combination of constraint-based and bivariate approaches requires a final verification phase to test the consistency between the v-structures and the edge orientations based on asymmetry.
An extension of the ANM model (Sect. 3.5.1.2) called CAM [25] proposes a consistent framework that can exploit the interaction and the asymmetry between cause and effect, but the approach is restricted by the assumption of causal additive noise. The CGNN algorithm presented in Sect. 3.5.2.3 can be extended to the multivariate case [7]. It is more general than the CAM algorithm, but needs to start from a supposedly known skeleton of the causal graph and, for computational reasons, employs simple exploratory heuristics to search the space of DAGs.

Computational Complexity Limitations
Some methods suffer from computational complexity limitations. In the bivariate case there are only two alternative DAGs to compare, but in the multivariate case with more than 1000 variables, the number of candidate DAGs grows super-exponentially with the number of variables. Therefore it is a real challenge to make the successful methods of the cause-effect problem scale to large problems. In particular, the methods that can model complex interactions between cause and noise, such as those using Gaussian process regressions or neural networks, are often very slow to compute and do not scale well in terms of number of variables and number of data points.
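To make this growth concrete, the number of labeled DAGs on n nodes can be computed with Robinson's classical recurrence:

```python
from functools import lru_cache
from math import comb

@lru_cache(maxsize=None)
def num_dags(n):
    """Number of labeled DAGs on n nodes (Robinson's recurrence)."""
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

# There are 3 DAGs on 2 nodes, but already 543 on 4 nodes
# and about 4.2e18 on 10 nodes.
```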

Relax Restrictive Assumptions on Causal Mechanisms
Another open problem concerns the fact that all these methods rely on specific assumptions about the underlying data-generating process. Each of them can work well when its specific assumptions are met by the observed data. This is why a new Machine Learning approach, presented in the next chapter, has appeared in recent years. It is based on the idea of combining all the successful algorithms presented in this chapter into a single meta-algorithm that could benefit from the advantages of each of them.

Conclusion
We have briefly explained the difference between explanatory models and predictive models, and seen that causal discovery consists in finding the best explanatory model of the data. We have then defined the problem of cause-effect inference in the bivariate setting and the main assumptions usually invoked in the literature. Under these assumptions we have formalized the notion of a bivariate structural equation model. We have then seen that the task of recovering the correct causal direction, X → Y or Y → X, consists in comparing alternative candidate models estimated from data. This comparison is always based, in one way or another, on a complexity/fit trade-off. We have reviewed how the complexity and fit terms are usually evaluated in order to compute a causal score in both directions. This has led us to propose a reading grid of the main methods in the literature.
Three main families have emerged: (1) methods that restrict the class of admissible causal mechanisms and focus on identifiability results; (2) methods that do not explicitly restrict the class of admissible mechanisms and focus on the generality of the approach at the expense of theoretical identifiability guarantees; and (3) methods that exploit the postulate that the cause should be independent of the causal mechanism.
We have then compared the main methods of the literature whose code is available online. It appears that methods allowing for flexible causal mechanisms and complex, realistic interactions can obtain consistent scores on a wide range of cause-effect problems (artificial and real data). These results, however, need to be confirmed on larger real datasets with known ground truth.
One fruitful research direction could be to extend the best generative approaches for causal discovery presented in this chapter to deal with potential confounding effects. Another interesting direction could be to provide a theoretical framework unifying the cause-effect pair methods presented in this chapter with the multivariate methods for causal discovery based on conditional independence tests.