Variance reduction in purely random forests

Random forests (RFs), introduced by Leo Breiman in 2001, are a very effective statistical method. The complexity of the mechanism, however, makes theoretical analysis difficult. Simplified versions of RFs, called purely random forests (PRFs), which are easier to handle theoretically, have therefore been considered. In this paper, we study the variance of such forests. First, we show a general upper bound which emphasises the fact that a forest reduces the variance. We then introduce a simple variant of PRFs, which we call purely uniformly random forests. For this variant, and in the context of regression problems with a one-dimensional predictor space, we show that both random trees and RFs reach the minimax rate of convergence. In addition, we prove that, compared with random trees, RFs improve accuracy by reducing the estimator variance by a factor of three-fourths.


Introduction
Random forests (RFs), introduced in Breiman (2001), are a very effective statistical method. They show outstanding performance in many situations, for both regression and classification problems. However, the mathematical understanding of this good behaviour remains limited. As defined in Breiman (2001), a RF is a collection of tree predictors $\{h(x, \Theta_l), 1 \le l \le q\}$, where $(\Theta_l)_{1 \le l \le q}$ are i.i.d. random vectors, and the RF predictor is obtained by aggregating this collection of trees. Besides consistency results, one of the main theoretical challenges is to explain why a RF improves so much on the performance of a single tree. Breiman (2001) introduced a specific instance of RFs, called RFs-random inputs (RFRI), which has been adopted in many fields as a reference method. Indeed, RFRI are simple to use and are efficiently implemented in the popular R package randomForest (Liaw and Wiener 2002). They are effective for prediction and can also be used for variable selection (see e.g. Díaz-Uriarte and Alvarez de Andrés 2006; Genuer, Poggi and Tuleau 2010).

Framework
The framework we consider throughout the paper is the classical random design regression framework.
More precisely, consider a learning set $L_n = \{(X_1, Y_1), \ldots, (X_n, Y_n)\}$ made of n i.i.d. observations of a pair (X, Y) with unknown distribution. Y is real-valued, since we are in a regression framework, and X takes values in a measurable space $\mathcal{X}$ (e.g. $\mathcal{X} = \mathbb{R}^d$ with any $d \ge 1$). We consider the following statistical model:
$Y = s(X) + \varepsilon, \quad (1)$
where $s : \mathcal{X} \to \mathbb{R}$ is the unknown regression function, and the goal is to estimate s. Finally, we suppose that $(\varepsilon_1, \ldots, \varepsilon_n)$ are i.i.d. observations of a real-valued noise variable $\varepsilon$, independent of $(X_1, \ldots, X_n)$, with $\mathbb{E}[\varepsilon] = 0$ and $\mathrm{Var}(\varepsilon) = \sigma^2 < +\infty$.
This paper aims at comparing the performance, for estimating s, of a single random tree and of a RF. In doing so, we emphasise the variance reduction brought by the forest.

A general upper bound for the variance of PRFs
We begin by giving a general analysis of the PRFs variance.

PRT definition
Let us first mention that, throughout the paper, we make a slight abuse of language: we use the term random tree to refer to the tree itself (as a graph), to the corresponding partition of $\mathcal{X}$, and to the corresponding estimator.
Thus, the main difference between RFs and PRFs is that, in the purely random case, the partitions of the input space $\mathcal{X}$ are drawn randomly, independently of $L_n$. Recall that in classical RFs, partitions of the input space are most of the time obtained by random perturbations of a partitioning scheme whose splits are computed using $L_n$. Hence, for classical RFs, the random perturbations are independent of $L_n$, but the partition is not.
We denote by U a random partition of $\mathcal{X}$ into k cells $\lambda_1, \ldots, \lambda_k$, drawn from a distribution $\mathcal{U}$; k is a natural integer which will depend on the number of observations n. A PRT, associated with U, is defined for $x \in \mathcal{X}$ as
$\hat{s}_U(x) = \sum_{j=1}^{k} \Bigg( \frac{1}{|\{i : X_i \in \lambda_j\}|} \sum_{i : X_i \in \lambda_j} Y_i \Bigg) 1_{x \in \lambda_j},$
with $|E|$ denoting the cardinality of a set E (and with the convention that the estimator equals 0 on cells containing no observation).
In addition, let us define, for $x \in \mathcal{X}$,
$\bar{s}_U(x) = \sum_{j=1}^{k} \mathbb{E}[Y \mid X \in \lambda_j] \, 1_{x \in \lambda_j}.$
Conditionally on U, $\bar{s}_U$ is the best approximation of s among all regressograms based on U, but of course it depends on the unknown distribution of (X, Y).
With these notations, we can write a bias-variance decomposition of the quadratic risk of $\hat{s}_U$ as follows:
$\mathbb{E}\big[(\hat{s}_U(X) - s(X))^2\big] = \mathbb{E}\big[(\hat{s}_U(X) - \bar{s}_U(X))^2\big] + \mathbb{E}\big[(\bar{s}_U(X) - s(X))^2\big] = \text{variance term} + \text{bias term}. \quad (2)$
To clarify these variance and bias terms, we emphasise that, for a given partition u and a given x, $\mathbb{E}\big[(\hat{s}_u(x) - \bar{s}_u(x))^2\big]$ is the variance of the estimator $\hat{s}_u(x)$, while $(\bar{s}_u(x) - s(x))^2$ is its bias. We then integrate with respect to (w.r.t.) X and U to get decomposition (2).
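To fix ideas, here is a minimal Python sketch (not from the paper) of a regressogram built on a given partition; the helper names fit_prt and predict_prt, and the restriction to one-dimensional interval cells, are illustrative assumptions.

```python
import numpy as np

def fit_prt(X, Y, breaks):
    """Purely random tree = regressogram on the partition of [0, 1] whose cells
    are the intervals delimited by the sorted split points 'breaks'.
    Returns the vector of cell means (empty cells predict 0, by convention)."""
    cell = np.searchsorted(breaks, X, side="right")  # cell index of each observation
    n_cells = len(breaks) + 1
    means = np.zeros(n_cells)
    for j in range(n_cells):
        in_cell = (cell == j)
        if in_cell.any():
            means[j] = Y[in_cell].mean()
    return means

def predict_prt(means, breaks, x):
    """Evaluate the regressogram at the points x."""
    return means[np.searchsorted(breaks, x, side="right")]
```

The key point is that the split points would be drawn independently of the data, which is precisely what distinguishes a purely random tree from a CART-style tree.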

PRF definition
A RF is the aggregation of a collection of random trees. Therefore, in the context of PRFs, the principle is to generate several PRTs by drawing several random partitions, and to aggregate the corresponding trees.

A PRF, associated with a collection $V_q = (U_1, \ldots, U_q)$ of q i.i.d. random partitions drawn from $\mathcal{U}$, is defined for $x \in [0, 1]$ as follows:
$\hat{s}_{V_q}(x) = \frac{1}{q} \sum_{l=1}^{q} \hat{s}_{U_l}(x).$
Let us also define, for $x \in [0, 1]$,
$\bar{s}_{V_q}(x) = \frac{1}{q} \sum_{l=1}^{q} \bar{s}_{U_l}(x).$
Again, we have a bias-variance decomposition of the quadratic risk of $\hat{s}_{V_q}$, given by
$\mathbb{E}\big[(\hat{s}_{V_q}(X) - s(X))^2\big] = \mathbb{E}\big[(\hat{s}_{V_q}(X) - \bar{s}_{V_q}(X))^2\big] + \mathbb{E}\big[(\bar{s}_{V_q}(X) - s(X))^2\big] = \text{variance term} + \text{bias term}. \quad (3)$
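Under the same illustrative assumptions as before, a PRF can be sketched by averaging trees built on independently drawn partitions; this snippet reuses the hypothetical fit_prt/predict_prt helpers above and is not the paper's code.

```python
def predict_prf(X, Y, x, q, draw_partition):
    """Purely random forest: average of q purely random trees, each built on an
    independent random partition drawn by the callable 'draw_partition'."""
    preds = np.zeros_like(x, dtype=float)
    for _ in range(q):
        breaks = draw_partition()          # random partition, independent of the data
        means = fit_prt(X, Y, breaks)      # tree = regressogram on that partition
        preds += predict_prt(means, breaks, x)
    return preds / q
```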

PRT variance
We start by dealing with the variance term of Decomposition (2). First, we work conditionally on U; the problem then reduces to the case of a regressogram on a deterministic partition, and we can apply the following proposition, which comes from Arlot (2008).
Proposition 3.1 gives, conditionally on U, an expression for the variance term of Decomposition (2); we refer to it as Equation (4). Integrating Equation (4) w.r.t. U yields Equality (5). We will see in Section 4.1 that, in a specific case, the three last terms of Equality (5) are negligible compared with the constant term $\sigma^2$; hence, in this case, the variance of a tree is equivalent to $\sigma^2 k/n$. We claim that if $\mathcal{X}$ is bounded, say $\mathcal{X} = [0, 1]^d$, any reasonable distribution of U leads to a variance equivalent to $\sigma^2 k/n$, as soon as $k \to +\infty$ and $k/n \to 0$ as $n \to +\infty$. These conditions on k and n are exactly what Biau et al. (2008) require to obtain the universal consistency of some PRFs.
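As a quick numerical sanity check of the $\sigma^2 k/n$ behaviour (not part of the paper), one can simulate the pure-noise case $s \equiv 0$, for which $\bar{s}_U \equiv 0$ and the quadratic risk of a tree reduces to its variance term. The partition distribution, sample size and cell number below are arbitrary choices, and the snippet reuses the hypothetical helpers sketched above.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma, n, k = 0.5, 5000, 20           # noise s.d., sample size, number of cells
n_rep, n_test = 200, 1000

risks = []
for _ in range(n_rep):
    X = rng.uniform(0.0, 1.0, n)
    Y = sigma * rng.normal(size=n)    # s == 0: the risk reduces to the variance term
    breaks = np.sort(rng.uniform(0.0, 1.0, k - 1))   # random partition into k cells
    means = fit_prt(X, Y, breaks)     # hypothetical helper from the sketch above
    x_test = rng.uniform(0.0, 1.0, n_test)
    risks.append(np.mean(predict_prt(means, breaks, x_test) ** 2))

print(np.mean(risks), sigma**2 * k / n)   # the two numbers should be comparable
```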

PRF variance upper bound
We now study the variance term of Decomposition (3). We begin by showing that, when the number of trees q grows to infinity, the variance of a PRF gets close to the covariance between two PRTs.
Indeed, since $\hat{s}_{V_q}(x) = \frac{1}{q}\sum_{l=1}^{q} \hat{s}_{U_l}(x)$, the variance term satisfies
$\mathbb{E}\big[(\hat{s}_{V_q}(X) - \bar{s}_{V_q}(X))^2\big] = \frac{1}{q^2} \sum_{l, l'=1}^{q} \mathbb{E}\big[(\hat{s}_{U_l}(X) - \bar{s}_{U_l}(X))(\hat{s}_{U_{l'}}(X) - \bar{s}_{U_{l'}}(X))\big] = \frac{1}{q}\,\mathbb{E}\big[(\hat{s}_{U_1}(X) - \bar{s}_{U_1}(X))^2\big] + \frac{q-1}{q}\,\mathbb{E}\big[(\hat{s}_{U_1}(X) - \bar{s}_{U_1}(X))(\hat{s}_{U_2}(X) - \bar{s}_{U_2}(X))\big],$
where the last equality comes from the fact that the $(\hat{s}_{U_l}(X) - \bar{s}_{U_l}(X))_{1 \le l \le q}$ have the same distribution. Now, letting q grow to infinity, we get
$\mathbb{E}\big[(\hat{s}_{V_q}(X) - \bar{s}_{V_q}(X))^2\big] \xrightarrow[q \to +\infty]{} \mathbb{E}\big[(\hat{s}_{U_1}(X) - \bar{s}_{U_1}(X))(\hat{s}_{U_2}(X) - \bar{s}_{U_2}(X))\big]. \quad (6)$
The next step is to upper bound the covariance between two PRTs. Let us denote by $U_1 = \{\lambda^1_1, \ldots, \lambda^1_k\}$ and $U_2 = \{\lambda^2_1, \ldots, \lambda^2_k\}$ the cells of the partitions respectively associated with the trees $\hat{s}_{U_1}$ and $\hat{s}_{U_2}$.
Finally, we state the following theorem (Theorem 3.2), which gives a general upper bound for the PRF variance; the bound, Equation (7), involves a quantity $N_U$ counting terms of the same kind as those of Equality (5). In the statement, the index t runs over several consecutive values, l equals 1 or 2, and $p_t$ denotes, for some $j \in \{1, \ldots, k\}$, either $p_{\lambda^1_j}$ or $p_{\lambda^2_j}$, depending on the relative positions of the cells $(\lambda^1_1, \ldots, \lambda^1_k)$ and $(\lambda^2_1, \ldots, \lambda^2_k)$.
Theorem 3.2 is to be compared with Equality (5): it tells us that the variance of a PRF is upper bounded by a sum of terms of the same kind as those appearing in the variance of a PRT. The actual gain comes from the number of such terms in the sum. Indeed, $N_U \le k$, and the larger the double sum in the r.h.s. of (7), the smaller $N_U$.
We stress that the quantities appearing in $N_U$ only depend on the two partitions $U_1$ and $U_2$. Our result thus means that if $U_1$ and $U_2$ are different enough, the covariance between $\hat{s}_{U_1}$ and $\hat{s}_{U_2}$ will be smaller than the variance of a single tree.
In Section 4.4, we will see that, in a special case, $\mathbb{E}[N_U]$ equals $3k/4$. This allows us to claim that, in this case, the forest reduces the variance by a factor of $\frac{3}{4}$.

Risk bounds for PURFs
We now give a detailed analysis of a specific variant of PRFs, in the context of a one-dimensional predictor space. Hence, in this section, we assume $\mathcal{X} = [0, 1]$. The principle of a PURT is that we draw k uniform random variables on [0, 1], which define the partition of the input space [0, 1] (into k + 1 intervals). We then build a regressogram on this partition, which we call a tree.
Note that, unlike for PRFs based on recursive partitioning or for RFRI, the tree structure of the individual predictors is not obvious here. This comes from the fact that, in a PURT, the partition is not obtained in a recursive manner. Nevertheless, we keep the vocabulary of trees and forests to distinguish individual predictors from aggregated ones.
More precisely, U is built from k i.i.d. random variables with uniform distribution on [0, 1]: their order statistics, together with the endpoints 0 and 1, delimit the k + 1 intervals forming the random partition of [0, 1].
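In terms of the illustrative helpers sketched earlier (fit_prt, predict_prt, predict_prf, all hypothetical), a PURF then only requires the following partition sampler; the value of k and the seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(1)
k = 10   # number of uniform split points, giving k + 1 cells

def draw_uniform_partition():
    """Order statistics of k i.i.d. uniforms on [0, 1]: the split points of a PURT."""
    return np.sort(rng.uniform(0.0, 1.0, size=k))

# A purely uniformly random forest prediction would then read, for instance:
# y_hat = predict_prf(X, Y, x_test, q=100, draw_partition=draw_uniform_partition)
```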

PURT variance
Using the fact that, in our case, U is made of k i.i.d. random variables with uniform distribution on [0, 1], we deduce from Equality (5) the following result (Corollary 4.1): under its hypotheses, the variance term of a PURT is equivalent to $\sigma^2 k/n$, i.e. equal to $(\sigma^2 k/n)(1 + o(1))$; this is the bound referred to as (8) in the sequel. Details of the proof of Corollary 4.1 can be found in Section 7.2. The first two hypotheses of Corollary 4.1 ($k \to +\infty$ and $k/n \to 0$ as $n \to +\infty$) are the same natural conditions found by Biau et al. (2008) for the consistency of PRFs. They guarantee that the number of splits of the tree grows to infinity, but more slowly than the number of observations.

PURT bias
We now turn to the bias term of Decomposition (2). Direct calculations (see Section 7.3 for details) lead to the following upper bound for the bias term of a PURT:

Proposition 4.2 If μ is bounded by M > 0 and s is C-Lipschitz, the bias of a PURT is upper bounded by
$\mathbb{E}\big[(\bar{s}_U(X) - s(X))^2\big] \le \frac{6MC^2}{(k+1)^2}. \quad (9)$

Risk bounds for PURT
Putting together (8) and (9) leads to the following risk bound for a PURT.
The balance between the first two terms of the right-hand side (r.h.s.) of (10) leads to taking $(k + 1) = n^{1/3}$, and gives an upper bound for the risk of a PURT of order $n^{-2/3}$.
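To make the balancing explicit, and assuming (as discussed above) that the r.h.s. of (10) behaves, up to negligible terms, like the sum of the variance bound (8) and the bias bound (9), one gets
\[
\frac{\sigma^2 k}{n} \;\asymp\; \frac{6MC^2}{(k+1)^2}
\;\Longleftrightarrow\; (k+1)^3 \asymp n \quad (\text{since } k \asymp k+1),
\]
so that with $(k+1) = n^{1/3}$ both terms are of order $n^{-2/3}$.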
Therefore, a PURT reaches the minimax rate of convergence associated with the class of Lipschitz functions (see e.g. Ibragimov and Khasminskii 1981).
Let us now analyse PURFs.

PURF variance
From Theorem 3.2, we deduce the following proposition (Proposition 4.5): under the same hypotheses as in Corollary 4.1, the variance of a PURF (in the limit (6) of a large number of trees) satisfies the upper bound
$\mathbb{E}\big[(\hat{s}_{U_1}(X) - \bar{s}_{U_1}(X))(\hat{s}_{U_2}(X) - \bar{s}_{U_2}(X))\big] \le \frac{3}{4}\,\frac{\sigma^2 k}{n}\,\big(1 + o(1)\big). \quad (11)$
We give details of the proof of Proposition 4.5 in Section 7.4. Proposition 4.5 is to be compared with Corollary 4.1: it tells us that the variance of a PURF is upper bounded by three-fourths times the variance of a PURT. Hence, the rate of decay (in terms of powers of n) of the PURF variance is the same as that of the PURT variance, and the actual gain appears in the multiplicative constant.
Let us, finally, comment on the hypotheses of Proposition 4.5. First, note that the hypotheses on k and n are the same as in Corollary 4.1, which allows a fair comparison between the two results. Finally, the other hypotheses (μ > 0, s is C-Lipschitz) are the same as in Corollary 4.1 and help to control negligible terms.

PURF bias
We now deal with the bias term of Decomposition (3). A convexity inequality shows that the bias of a forest is not larger than the bias of a single tree:
$\mathbb{E}\big[(\bar{s}_{V_q}(X) - s(X))^2\big] \le \mathbb{E}\big[(\bar{s}_{U_1}(X) - s(X))^2\big].$
So, from Proposition 4.2, we deduce the following.
Proposition 4.6 If μ is bounded by M > 0 and s is C-Lipschitz, the bias of a PURF satisfies the same inequality as (9), that is,
$\mathbb{E}\big[(\bar{s}_{V_q}(X) - s(X))^2\big] \le \frac{6MC^2}{(k+1)^2}. \quad (12)$
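For the reader's convenience, the convexity step is simply Jensen's inequality applied, for each fixed x, to the average over the q trees:
\[
\big(\bar{s}_{V_q}(x) - s(x)\big)^2
= \Big(\frac{1}{q}\sum_{l=1}^{q}\big(\bar{s}_{U_l}(x) - s(x)\big)\Big)^2
\;\le\; \frac{1}{q}\sum_{l=1}^{q}\big(\bar{s}_{U_l}(x) - s(x)\big)^2 ,
\]
and taking expectations, since the $U_l$ are identically distributed, gives $\mathbb{E}[(\bar{s}_{V_q}(X) - s(X))^2] \le \mathbb{E}[(\bar{s}_{U_1}(X) - s(X))^2]$.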

Risk bounds for PURF
Putting together (11) and (12) leads to the following risk bound for a PURF.
Again, taking $(k + 1) = n^{1/3}$ gives the following upper bound for the risk:
$\mathbb{E}\big[(\hat{s}_{V_q}(X) - s(X))^2\big] \le K\, n^{-2/3},$
where K is a positive constant.
So, first, a PURF reaches the minimax rate of convergence for C-Lipschitz functions. Secondly, since the variance of a PURF is systematically reduced compared with that of a PURT, and the bias of a PURF is not larger than that of a PURT, the risk of a PURF is actually lower.

Simulations
In this section, we describe simulation experiments that illustrate the results of Section 4 and compare PURF with RFRI.

Experiments
Experiments were performed on four simulated data sets. We keep the framework of Model (1). In addition, we assume that $\varepsilon \sim \mathcal{N}(0, \frac{1}{4})$, and we take for s four functions, referred to below as sinus, square, abs and stump data. We also consider several values for the number of observations: $n \in \{100, 500, 1000, 5000, 10000\}$. For each value of n, we fix $k = n^{1/3}$ and $q = n/k$ (the choice of k is motivated by the discussion of PURT and PURF risks in Sections 4.3 and 4.6; the choice of q is a sufficient condition for Equation (6) to hold).
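As an illustration of this protocol, the following Python sketch (not the paper's actual code) reuses the hypothetical helpers fit_prt, predict_prt and predict_prf introduced earlier; the sinus-type regression function in the usage example and the exact handling of k and q are assumptions, since the precise simulation settings are not reproduced here.

```python
import numpy as np

def run_experiment(s, n, n_rep=20, n_test=50, seed=0):
    """Hedged sketch of the protocol described above: for each replication,
    simulate a data set of size n from Model (1) with N(0, 1/4) noise, fit one
    PURT and one PURF with k = n^(1/3) and q = n/k, and estimate their risks on
    n_test additional points.  's' is the regression function."""
    rng = np.random.default_rng(seed)
    k = max(1, round(n ** (1 / 3)))
    q = max(1, round(n / k))
    tree_risk, forest_risk = [], []
    for _ in range(n_rep):
        X = rng.uniform(0.0, 1.0, n)
        Y = s(X) + 0.5 * rng.normal(size=n)          # Var(eps) = 1/4
        x_test = rng.uniform(0.0, 1.0, n_test)
        # single purely uniformly random tree
        breaks = np.sort(rng.uniform(0.0, 1.0, k))
        tree_pred = predict_prt(fit_prt(X, Y, breaks), breaks, x_test)
        # purely uniformly random forest
        forest_pred = predict_prf(X, Y, x_test, q,
                                  lambda: np.sort(rng.uniform(0.0, 1.0, k)))
        tree_risk.append(np.mean((tree_pred - s(x_test)) ** 2))
        forest_risk.append(np.mean((forest_pred - s(x_test)) ** 2))
    return np.mean(tree_risk), np.mean(forest_risk)

# Example with an assumed sinus-type regression function (not the paper's exact choice):
# print(run_experiment(lambda x: np.sin(2 * np.pi * x), n=1000))
```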
We are interested in comparing PURT and PURF on the one hand, and random trees-random inputs (RTRI) and RFRI on the other hand. Finally, we confront PURF with RFRI.
RFRI are extensively used in practice and show very good performance. We stress that they do not belong to the PRF family, because the partitions associated with each tree of RFRI are optimised using the learning sample. Hence, the results of the paper do not apply to them, but we still include RFRI in our simulation study to compare their behaviour with that of PURF.
For each function s and each value of n, we simulate 20 data sets, on which we run 20 forests (or trees) and evaluate performances on 50 additional points.

Results
The results of experiments on sinus data are summarised in Figure 1.
In the top-left graph, we plot the ratio PURF variance/PURT variance as a function of n. The ratios PURF bias/PURT bias and PURF risk/PURT risk are plotted in the top-right and bottom-left graphs, respectively. Finally, we plot in the bottom-right graph the ratio RFRI risk/RTRI risk: the solid line corresponds to the RFRI algorithm with default values, that is, when each individual tree is grown to its maximal size (the rule being that we do not split a node containing fewer than 5 observations). The dashed line corresponds to the RFRI(k) algorithm, where each tree is grown until it has k terminal nodes. We choose this variant to allow a fair comparison between PURF and RFRI, that is, a comparison in which the trees associated with both methods construct partitions into k cells. Figures 2-4 are obtained in the same way as Figure 1 and present results for square data, abs data and stump data, respectively. Finally, we give in Figure 5 the estimated risks of RFRI, RFRI(k) and PURF on the four simulated data sets.

Comments and discussions
We see that the graphs of Figures 1-3 are very similar. We summarise the results with the following comments:
• the ratio PURF variance/PURT variance remains almost constant (as n grows), around the value 0.4. This confirms that the PURF and PURT variances are of the same order of magnitude, and shows that PURF effectively reduces the variance, here by a factor smaller than 1/2. So our upper bound with factor 3/4 seems to be improvable.
• the ratio PURF bias/PURT bias decreases as n grows. This suggests that PURF reduces the order of magnitude of the bias.
• as a consequence of the previous remark, PURF seems to improve the rate of convergence of the risk (we see that the ratio PURF risk/PURT risk decreases as n increases).
• the ratios between forest and tree risks for RFRI and RFRI(k) are roughly constant, again around 0.4. This suggests that RFRI reduces the risk by a factor smaller than 1/2.
Results for stump data in Figure 4 are quite different. Indeed, we again have a constant variance ratio (around a value slightly below 0.4), but here both the bias ratio and the risk ratio remain constant too (around 0.4 as well). So for stump data, PURF reduces the bias, but only by a constant factor. As a consequence, the rates of convergence of PURF and PURT seem to be the same in this case.
The ratio RFRI risk/RTRI risk remains almost constant around 0.4 (perhaps with a slow decrease for the variant RFRI(k)).
These differences are natural because the regression function of stump data is discontinuous, whereas it is more regular in the other data sets (Lipschitz for abs data and infinitely differentiable for sinus data and square data).
As an intermediate conclusion, we can say that, in our simulation experiments, forests always improve the risk compared with trees. More precisely, we observe that the risk of a forest is always at least two times smaller than that of a tree. Moreover, the improvement brought by forests can even affect the rate of convergence, especially for the PURF method when the regression function is regular.
Let us now give some comments on Figure 5. For sinus data, square data and abs data, we see that RFRI gives the worst performance, its risk remaining constant as n becomes larger. RFRI(k) significantly improves the performance, with a risk converging to 0. Finally, PURF is even better than RFRI(k).
We do not manage to explain the constant behaviour of the RFRI risk. But the comparison between RFRI and RFRI(k) is interesting, because it exhibits a case where using fully grown trees in a forest performs worse than using relatively small trees. However, we stress that this phenomenon is likely due to the fact that we deal with one-dimensional input data. Indeed, RFRI with fully grown trees has repeatedly shown very good performance when the dimension of the input data is large (and potentially very large), see e.g. Breiman (2001) and Goldstein, Hubbard, Cutler, and Barcellos (2010). This point surely deserves a more intensive study.

The situation for stump data is again very different from the other data sets. Here, PURF performs worst (even if it manages to give a reasonable risk for very large values of n), the RFRI risk is again constant, and RFRI(k) significantly reduces the risk. This suggests that RFRI(k) is much better than PURF for non-regular functions. Our explanation is that, since RFRI(k) optimises the partitions associated with trees using the learning sample, it can track a discontinuity of the regression function better than PURF, in which partitions are chosen independently of the learning sample. In addition, even if PURF is better than RFRI(k) on the three other data sets, RFRI(k) remains competitive there, whereas for stump data PURF is significantly worse than RFRI(k). Hence, when estimating a regression function of unknown regularity, RFRI(k) seems to be the best choice among the three methods compared in this study.

Conclusion
In the context of PRFs, we give a general upper bound showing that the variance of a forest can actually be smaller than the variance of a tree.
We also emphasise, for a very simple version of RFs, the actual gain of using a RF instead of a single random tree. First, we show that both trees and forests reach the minimax rate of convergence. Then, we highlight a reduction of the variance of a forest compared with the variance of a tree. This is, in this specific context, a proof of the well-known conjecture for RFs, 'a RF, by aggregating several random trees, reduces variance and leaves the bias unchanged', which can be found for example in Hastie, Tibshirani and Friedman (2009).
Furthermore, our simulation study indicates that there is room for improvement, because a forest seems to be able to reduce the bias as well. In addition, for a sufficiently regular regression function, a forest could even reduce the order of magnitude of the bias.
An interesting open problem would be to generalise this result, in order to handle more complex versions of RFs and to relax the hypotheses made here. Obviously, a more ambitious goal would be to give some precise insights explaining the outstanding performance of RFRI (especially when the dimension of the input data is large).

Proof of Theorem 3.2
Before entering into the details of the proof of Theorem 3.2, we recall that, in the proof of Proposition 3.1 (which can be found in Arlot 2008), calculations lead to an equality, referred to as (13), involving the empirical frequencies $\hat{p}_\lambda = |\{i : X_i \in \lambda\}|/n$. Then, an estimation of $p_\lambda\, \mathbb{E}[1/(n \hat{p}_\lambda)]$ gives the expression $(1/n)(1 + \delta_{n, p_\lambda})$ in Proposition 3.1. In the sequel, we keep this notation for a generic term of the sum in the r.h.s. of (13). We now address the proof of Theorem 3.2 and begin by introducing some notation. We have to control the covariance between two PRTs, $\mathbb{E}[(\hat{s}_{U_1}(X) - \bar{s}_{U_1}(X))(\hat{s}_{U_2}(X) - \bar{s}_{U_2}(X))]$. Let us consider $U_1 = (\lambda^1_1, \ldots, \lambda^1_k)$ and $U_2 = (\lambda^2_1, \ldots, \lambda^2_k)$, the partitions respectively associated with $\hat{s}_{U_1}$ and $\hat{s}_{U_2}$.
We then denote by W the intersection of the partitions $U_1$ and $U_2$: every set of W is obtained by intersecting one set of $U_1$ with one set of $U_2$. We write $(\mu_1, \ldots, \mu_m)$ for the sets of the partition W, where m is the number of sets of W; we have $2k - 1 \le m \le k^2$. Now, let us give some details for the first term of (15), denoted by $S_1(X)$. Without loss of generality, we suppose that $\mu_1 = \lambda^1_1 \cap \lambda^2_1 \neq \emptyset$. If we denote by $\mathbb{E}_{1,2}[\cdot]$ the conditional expectation $\mathbb{E}[\,\cdot \mid (1_{X_{i_1} \in \lambda^1_1})_{1 \le i_1 \le n}, (1_{X_{i_2} \in \lambda^2_1})_{1 \le i_2 \le n}]$, we obtain an expression in which $q_1 = \mathbb{P}(X \in \mu_1)$ appears, and which simplifies because $Y_{i_1}$ and $Y_{i_2}$ are independent whenever $i_1 \neq i_2$. Hence:
Indeed, the same computation applies for all $t \in K_r$, where $q_t = \mathbb{P}(X \in \mu_t)$. Thus:
• For the last term, the following inequality is sufficient to conclude:

Proof of Proposition 4.2
We keep the notation of Section 7.2. In addition, we define, for any $0 \le j \le k$, $\beta_j$ as the length of the j-th cell of the partition, that is, the spacing between two consecutive order statistics of the uniform sample (with the conventions of Section 4). The function s is supposed to be C-Lipschitz and μ is bounded by M, so
$\mathbb{E}\big[(\bar{s}_U(X) - s(X))^2\big] \le M C^2\, \mathbb{E}\Big[\sum_{j=0}^{k} \beta_j^3\Big] = \frac{6MC^2}{(k+2)(k+3)} \le \frac{6MC^2}{(k+1)^2}.$
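The moment computation used in the last display is standard: each spacing $\beta_j$ of k i.i.d. uniforms on [0, 1] follows a Beta(1, k) distribution, so that
\[
\mathbb{E}[\beta_j^3] = \frac{\Gamma(4)\,\Gamma(k+1)}{\Gamma(k+4)} = \frac{6}{(k+1)(k+2)(k+3)},
\qquad
\mathbb{E}\Big[\sum_{j=0}^{k}\beta_j^3\Big] = (k+1)\,\mathbb{E}[\beta_0^3] = \frac{6}{(k+2)(k+3)} \;\le\; \frac{6}{(k+1)^2}.
\]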

Proof of Proposition 4.5
Using the fact that we explicitly know the distribution of U, we deduce from Theorem 3.2 the following corollary.

Because the random partitions are drawn in a simple way, the number $M_U$ is explicitly computable (we know the distribution of the order statistics), and it is shown to be equivalent to $\frac{1}{4}(k + 1)$ as k tends to $+\infty$ (see Lemma 7.4).
As in Corollary 4.1, we have to prove that all the terms of the sum are negligible compared with the constant term $\sigma^2$. To deal with the fact that the number of terms in the sum is now random, we use the following inequality. Finally, the following technical result (Lemma 7.4) allows us to conclude the proof of Corollary 7.3 and thus, using Equality (6), the proof of Proposition 4.5.
Let us prove Lemma 7.4.