Selected topics in multiscale data assimilation



Introduction
In geophysical sciences, physics unfolds and should be modelled at many spatial and temporal scales. This stems from the very large scope of the systems often considered in ambitious applications, from the global synoptic motions of the atmosphere accounting for heat transfer and dominant winds, down to the microphysics accounting for clouds, hydrometeors, and aerosols. This is also intrinsic to the physics of turbulence that shapes geophysical flows and that correlates scales so that it becomes difficult to model them separately. In addition to the complexity of multiscale models that need to be dealt with in geophysical data assimilation systems, one needs to incorporate observations pertaining to many scales, such as in situ observations, satellite observations, radar, and lidar. Even though all of this is to a large extent ignored in current data assimilation systems, it is easy to envision future data assimilation systems accounting for these multiscale aspects, not only at a technical level by interfacing models and observations at different scales, but also at a more fundamental mathematical level. Potentially useful analysis tools exist, such as multigrid methods and wavelets, but developing a fully consistent data assimilation system based on them is still a challenge.
This chapter provides an exposition of a selection of problems considered in this field. Most of the results were obtained in the context of the ANR (French National Science Foundation) MSDAG project.
There are a few insightful papers in the geophysical literature addressing mathematical approaches to multiscale data assimilation, for instance Liu and Rabier (2002), Janjić and Cohn (2006), and Oke and Sakov (2008). They offer views of the topic that differ from, and often complement, what is presented here, although several concepts are common.
This chapter deals with a multiscale extension of the best linear unbiased estimate (BLUE) inference, and its application to several problems. It also deals with a more pragmatic and perhaps more rewarding approach of the same problem, focusing on the multiple-scales treatment of innovation statistics. These methods are applied to atmospheric transport and chemistry problems.

Bayesian multiscale analysis
In the following, we will discuss how to account for multiple scales in the BLUE estimation principle at the heart of most current data assimilation methods. Our starting point is the observation equation with the simplest assumptions,

µ = Hσ + ε, (18.1)

where µ and ε are the observation and error vectors defined in R^p and σ is the control space state vector in R^N. In the following discussion, σ also coincides with the state vector. The observation errors are assumed to follow a normal distribution, ε ∼ N(0, R), whereas the background errors on the state vector σ are assumed to be distributed according to a Gaussian, σ ∼ N(σ^b, B).
We present some partial results drawn from Bocquet (2009).

Scaling operators
To climb up or down the spatial or temporal scales, one needs to define a restriction operator that tells how a state vector σ is coarse-grained and a prolongation operator that tells how the state vector is refined through the scales. See Rodgers (2000) for an in-depth discussion of the topic. First, let us consider the restriction operator. Assume σ is a state vector that is known in the finest regular grid Ω, isomorphic to R^{N_fg}. Let ω be a discretization of the control space, which we also call a representation. It is isomorphic to R^N, meaning that it has N grid cells. These N grid cells are possibly of different sizes and shapes, but they form a partition of control space. It could be a regular grid, or an adaptive grid that partitions the space and possibly the time domain. The coarse-graining of σ in ω is defined by σ_ω = Γ_ω σ, where Γ_ω: R^{N_fg} → R^N stands for the linear coarse-graining operator. This operator is supposed to be unambiguously defined. It could stand for a simple averaging, but it could also be a more complex coarse-graining, with an associated prolongation operator given by a spline interpolation, or model-specific coarser Jacobian matrices, such as H.
A state vector can also be refined thanks to a prolongation operator Γ⋆_ω: R^N → R^{N_fg} that refines σ_ω into σ = Γ⋆_ω σ_ω. This linear operator is ambiguous, since additional information is needed to reconstruct a source at higher resolution. A schematic of the use of the restriction and prolongation operators is displayed in Fig. 18.1.

Fig. 18.1 Schematic of the restriction and prolongation operators from the finest regular grid to a representation (adaptive grid) ω, and vice versa.
One simple choice, which we shall call the deterministic one, is to set Γ⋆_ω = Γ^T_ω. However, in this data assimilation framework, one has prior information on the control state vectors that may be exploited, in the form of a probability density function q(σ). Following our initial assumption, q is Gaussian: q(σ) ∼ N(σ^b, B). From this prior defined in the finest regular grid, one can infer, thanks to Γ_ω, the prior pdf of σ_ω in the representation ω:

q(σ_ω) ∼ N(Γ_ω σ^b, Γ_ω B Γ^T_ω).

Conversely, assume that one knows σ_ω in the representation ω. Since the problem is underdetermined, one could opt for the most likely refinement. This is given by the mode of q(σ|σ_ω). From Bayes' rule, it is clear that

q(σ|σ_ω) ∝ q(σ) δ(σ_ω − Γ_ω σ),

where δ is the Dirac distribution in R^N. Then the mode of this posterior Gaussian distribution is given by

σ = σ^b + B Γ^T_ω (Γ_ω B Γ^T_ω)^{-1} (σ_ω − Γ_ω σ^b).

Thus, Γ⋆_ω would be an affine operator. We denote by Λ⋆_ω its tangent linear component:

Λ⋆_ω = B Γ^T_ω (Γ_ω B Γ^T_ω)^{-1},

so that we can choose as a prolongation operator

Γ⋆_ω σ_ω = (I_{N_fg} − Λ⋆_ω Γ_ω) σ^b + Λ⋆_ω σ_ω, (18.8)

where I_{N_fg} is the identity operator in the finest grid control space. Since the refinement is now a probabilistic process, errors are attached to it and can be estimated. Introducing Π_ω = Λ⋆_ω Γ_ω, the corresponding error covariance matrix is

(I_{N_fg} − Π_ω) B.

As expected, if the representation ω is close to the finest grid, then the refinement error is negligible. If the representation is coarse, Rank(Π_ω)/N_fg ≪ 1, the refinement error is limited by that of the background. These scaling operators first satisfy

Γ_ω Γ⋆_ω = I_N,

which is a consistency identity. Any reasonable prolongation operator should satisfy it. Then, one verifies that

Γ⋆_ω Γ_ω σ = σ^b + Π_ω (σ − σ^b).

The linear operator Π_ω is a projector, since it can be checked that

Π_ω Π_ω = Π_ω, and ⟨x, Π_ω y⟩_{B^{-1}} = ⟨Π_ω x, y⟩_{B^{-1}}, (18.12)

where ⟨·, ·⟩_{B^{-1}} is the scalar product built on B^{-1}. In matrix form, this is equivalent to

Π^T_ω B^{-1} = B^{-1} Π_ω.

Π_ω cannot be the identity, because the coarse-graining implies a loss of information that, in general, cannot be fully recovered.
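As a concrete illustration, the restriction, Bayesian prolongation, and projector properties above can be checked numerically. The following sketch relies on illustrative choices that are not from the original study: a toy 1-D grid, a pairwise-averaging Γ_ω, and a randomly generated positive definite B.

```python
import numpy as np

rng = np.random.default_rng(0)
N_fg, N = 8, 4  # toy sizes: finest grid and coarse representation

# Restriction Gamma: average pairs of fine cells (a simple coarse-graining)
Gamma = np.kron(np.eye(N), np.full((1, 2), 0.5))

# A generic symmetric positive definite background covariance B
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)

# Tangent linear part of the Bayesian prolongation:
# Lambda* = B Gamma^T (Gamma B Gamma^T)^{-1}
Lam = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)
Pi = Lam @ Gamma  # projector in the finest-grid control space

# Consistency identity: coarse-graining a refined field returns it
assert np.allclose(Gamma @ Lam, np.eye(N))
# Pi is a projector: Pi Pi = Pi
assert np.allclose(Pi @ Pi, Pi)
# Pi is self-adjoint for the B^{-1} scalar product: Pi^T B^{-1} = B^{-1} Pi
Binv = np.linalg.inv(B)
assert np.allclose(Pi.T @ Binv, Binv @ Pi)
```

Note that Π_ω depends on B: the prior shapes how information is transported between scales.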

Observation equation in any representation
The mathematical formalism having been laid out, the observation equation (18.1) can be written in any representation ω. The Jacobian H becomes H_ω = HΓ⋆_ω. By inheritance from Γ⋆_ω, H_ω is an affine operator. The observation equation reads

µ = H_ω σ_ω + ε_ω. (18.14)

The error ε_ω has been made scale-dependent, because several sources of error depend on the scale, such as the aggregation errors (to be defined soon), or the errors in model subgrid physical parameterizations.

General scale-dependent errors
Indeed, a general decomposition of the error that accounts for scales could involve (i) the scale-independent observation error ε, which would also include model error that could be scale-independent, (ii) an error due to discretization ε^c_ω (coarse-graining), and (iii) a scale-dependent model error ε^m_ω:

ε_ω = ε + ε^c_ω + ε^m_ω.

Representativeness/aggregation errors
Let us assume that errors are specified in the finest grid level, ε = µ − Hσ, and that they may originate from many sources, including scale-independent model error.
One assumes that there is no scale-dependent model error (ε^m_ω = 0), so that errors at larger scales, ε_ω = µ − H_ω σ_ω, are supposed to be due solely to this original error, plus an error due entirely to coarsening, or aggregation error, which is a form of representativeness error, albeit from the control-space viewpoint. In that case, the model scaling is entirely described by the coarsening H_ω = HΓ⋆_ω. Since µ = Hσ + ε, one obtains

ε_ω = ε + H(I_{N_fg} − Π_ω)(σ − σ^b).

Assuming independence of the error and source error priors, the computation of the covariance matrix of these errors yields

R_ω = R + H(I_{N_fg} − Π_ω)B H^T. (18.17)

Since H(I_{N_fg} − Π_ω)BH^T is a positive matrix, the mean variance of the errors always increases because of the aggregation. Intuitively, the statistics of the innovation vector µ − Hσ^b should not depend on the scale. However, when written in terms of errors, the innovation depends formally on the representation ω:

µ − Hσ^b = ε_ω + HΠ_ω(σ − σ^b).

We have used the fact that H_ω σ^b_ω = HΓ⋆_ω Γ_ω σ^b = Hσ^b. Fortunately, this paradox is only superficial, since one can check that the statistics of the innovation are truly scale-independent:

R_ω + H̄_ω B_ω H̄^T_ω = R + HBH^T,

where H̄_ω = HΛ⋆_ω is the tangent linear part of the affine operator H_ω and B_ω = Γ_ω B Γ^T_ω.
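A quick numerical check of the aggregation-error covariance (18.17) and of the scale independence of the innovation statistics, under the same illustrative toy choices as before (pairwise-averaging Γ_ω, random positive definite B, random H):

```python
import numpy as np

rng = np.random.default_rng(1)
N_fg, N, p = 8, 4, 3
Gamma = np.kron(np.eye(N), np.full((1, 2), 0.5))      # pairwise averaging
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)                     # SPD background covariance
H = rng.standard_normal((p, N_fg))                    # finest-grid Jacobian
R = np.eye(p)                                         # observation error covariance

Lam = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)
Pi = Lam @ Gamma
B_w = Gamma @ B @ Gamma.T                             # background covariance in omega
Hbar_w = H @ Lam                                      # tangent linear part of H_omega
R_w = R + H @ (np.eye(N_fg) - Pi) @ B @ H.T           # aggregation errors, eq. (18.17)

# The mean error variance can only grow under aggregation
assert np.trace(R_w) >= np.trace(R)
# Innovation statistics are scale-independent:
# R_w + Hbar_w B_w Hbar_w^T = R + H B H^T
assert np.allclose(R_w + Hbar_w @ B_w @ Hbar_w.T, R + H @ B @ H.T)
```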

Estimation of the performance in the finest grid Ω
More generally, an analysis performed in the representation ω is obtained by coarsening the analysis at the finest scale. Hence, in this case, the multiscale formalism has no theoretical benefit compared with performing data assimilation in the finest grid. (There are major practical advantages, though, such as dividing the computational load.) This can be understood by directly applying Bayes' rule, using Gaussian statistics. This leads to the estimate

σ^a_ω = Γ_ω σ^a, (18.22)

with σ^a the state estimation in the finest grid. The analysis error covariance matrix transforms similarly, according to

P^a_ω = Γ_ω P^a Γ^T_ω, (18.23)

where P^a is the analysis error covariance matrix in the finest grid. This can also be consistently obtained through the finest scale, which yields (18.22) and (18.23) by a simple convolution of Gaussian probability density functions.
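The commutation of analysis and coarse-graining stated in (18.22) and (18.23) can also be verified numerically. The sketch below (same illustrative toy setup: pairwise-averaging Γ_ω, random SPD B, random H) performs the BLUE analysis in the finest grid and in ω with the aggregation-error statistics R_ω, and checks that both routes agree:

```python
import numpy as np

rng = np.random.default_rng(2)
N_fg, N, p = 8, 4, 3
Gamma = np.kron(np.eye(N), np.full((1, 2), 0.5))
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)
H = rng.standard_normal((p, N_fg))
R = np.eye(p)
Lam = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)
Pi = Lam @ Gamma
B_w, Hbar_w = Gamma @ B @ Gamma.T, H @ Lam
R_w = R + H @ (np.eye(N_fg) - Pi) @ B @ H.T

sigma_b = rng.standard_normal(N_fg)                  # background state
mu = rng.standard_normal(p)                          # observations

# BLUE in the finest grid
K = B @ H.T @ np.linalg.inv(R + H @ B @ H.T)
sigma_a = sigma_b + K @ (mu - H @ sigma_b)
P_a = (np.eye(N_fg) - K @ H) @ B

# BLUE in omega; note H_omega(Gamma sigma_b) = H sigma_b, so the innovation is unchanged
K_w = B_w @ Hbar_w.T @ np.linalg.inv(R_w + Hbar_w @ B_w @ Hbar_w.T)
sigma_a_w = Gamma @ sigma_b + K_w @ (mu - H @ sigma_b)
P_a_w = (np.eye(N) - K_w @ Hbar_w) @ B_w

assert np.allclose(sigma_a_w, Gamma @ sigma_a)       # coarsened analysis
assert np.allclose(P_a_w, Gamma @ P_a @ Gamma.T)     # coarsened analysis covariance
```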

Estimation of the performance in the coarser grid ω
We have just shown that, theoretically, it is better to estimate the truth σ^t (living in the finest grid) within the finest grid. First, this assumed that (i) it is numerically affordable and (ii) there is no scale-dependent model error. Therefore, in practice and with real data, the finest grid will not necessarily be the best space in which to perform the analysis, and we believe that there should be an optimal degree of refinement of control space. The concept is illustrated in Fig. 18.2. Secondly, we could be interested in the more modest but safer objective of estimating the truth projected in a coarser grid. Therefore, we are interested in the analysis of the error σ^a_ω − σ^t_ω, within the targeted representation ω coarser than Ω. We evaluate the inversion errors at different scales by decomposing the errors at the finest scale in B^{-1}-norm into two parts: one for the variability at the finest scale and the other for coarser variations:

‖σ^a − σ^t‖²_{B^{-1}} = ‖γ_f‖²_{B^{-1}} + ‖γ_c‖²_{B^{-1}},

with γ_f = (I_{N_fg} − Π_ω)(σ^a − σ^t) and γ_c = Π_ω(σ^a − σ^t) = Λ⋆_ω(σ^a_ω − σ^t_ω). Here, γ_f characterizes the variability at the finest scale and γ_c is related to the departure σ^a_ω − σ^t_ω at coarse scales. As the representation ω is refined, the error ‖σ^a − σ^t‖²_{B^{-1}} is bound to decrease as fine details of the background state are better and better exploited. Meanwhile, the error ‖γ_c‖²_{B^{-1}} is bound to increase, since retrieval at finer and finer scales becomes erroneous.
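Since Π_ω is a projector that is self-adjoint for the B^{-1} scalar product, the split of the finest-scale error into fine and coarse parts is B^{-1}-orthogonal, so the cross term vanishes. The following sketch checks this (same illustrative toy setup; the error vector is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
N_fg, N = 8, 4
Gamma = np.kron(np.eye(N), np.full((1, 2), 0.5))
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)
Binv = np.linalg.inv(B)
Lam = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)
Pi = Lam @ Gamma

e = rng.standard_normal(N_fg)        # stands for sigma_a - sigma_t in the finest grid
g_f = (np.eye(N_fg) - Pi) @ e        # finest-scale variability
g_c = Pi @ e                         # coarse-scale departure

def sq_norm(x):
    """Squared B^{-1}-norm."""
    return float(x @ Binv @ x)

# the cross term vanishes: the split is B^{-1}-orthogonal
assert np.isclose(sq_norm(e), sq_norm(g_f) + sq_norm(g_c))
# g_c only depends on the coarse-grained error
assert np.allclose(g_c, Lam @ (Gamma @ e))
```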

Application to Bayesian control space design
One application of the above theoretical results is the Bayesian design of adaptive representations of control space. A typical problem is to know whether there is an optimal representation ω of control space using N grid cells. The above mathematical formalism can help us solve this problem. However, we first need to define 'optimal', by choosing an optimality criterion.

Design criteria
The Fisher criterion Given our original incentive, to construct an adaptive grid of control space that is optimal for data assimilation, the optimality criterion must be a measure of the quality of the analysis. In Bocquet (2009), the following criterion was chosen:

J = Tr[R^{-1} H B H^T].

It is inspired by the Fisher information matrix,¹ normalized by the background error covariance matrix, so that the criterion is invariant under a change of coordinates in control space (for a given grid). Specifically, it measures the reduction of uncertainty granted by the observations. In a representation ω, the criterion reads

J_ω = Tr[R^{-1}_ω H̄_ω B_ω H̄^T_ω].

The operator H̄_ω = HΛ⋆_ω is the tangent linear operator of the affine operator H_ω (which explains the difference of notation). Because only the linear part of H_ω survives when averaging over the errors to obtain second-order moments, H̄_ω appears in the criterion rather than H_ω.
If one assumes that the errors are essentially scale-independent, then R_ω ≃ R. In that case, J_ω can be written in terms of Π_ω using the machinery developed earlier:

J_ω = Tr[R^{-1} H Π_ω B H^T].

Using the Bayesian prolongation operator Γ⋆_ω, which makes use of the prior, one further obtains

J_ω = Tr[(R + H(I_{N_fg} − Π_ω)B H^T)^{-1} H Π_ω B H^T],

which is more difficult to optimize because of the nonlinear dependence of J_ω on Π_ω.

Fisher criterion and representativeness error
In the absence of scale-dependent model error, one can identify the normalized scale-covariant aggregation error, which is a measure of the representativeness error induced by the aggregation of grid cells:

Tr[R^{-1} H(I_{N_fg} − Π_ω)B H^T] = J − J_ω.

It is clear that the larger the Fisher criterion, the smaller the aggregation error. Hence, optimal representations for the Fisher criterion are representations that minimize the representativeness errors.
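Under the illustrative toy setup used earlier (pairwise averaging, random SPD B, random H) and the approximation R_ω ≃ R, one can verify that the Fisher information lost by aggregation is exactly the aggregation error:

```python
import numpy as np

rng = np.random.default_rng(4)
N_fg, N, p = 8, 4, 3
Gamma = np.kron(np.eye(N), np.full((1, 2), 0.5))
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)
H = rng.standard_normal((p, N_fg))
Rinv = np.eye(p)                                     # R = I, so R^{-1} = I
Lam = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)
Pi = Lam @ Gamma

J_fine = np.trace(Rinv @ H @ B @ H.T)                # Fisher criterion, finest grid
J_w = np.trace(Rinv @ H @ Pi @ B @ H.T)              # Fisher criterion in omega (R_w ~ R)
agg = np.trace(Rinv @ H @ (np.eye(N_fg) - Pi) @ B @ H.T)

assert J_w <= J_fine                  # aggregation can only lose information
assert np.isclose(J_fine - J_w, agg)  # the loss is the aggregation error
```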

Degrees of freedom for the signal
This criterion,

J_ω = Tr[I_N − B^{-1}_ω P^a_ω],

is known to measure the number of degrees of freedom for the signal (DFS), i.e. the information load that helps resolve the parameter space. It is actually more common in the data assimilation literature than the cost function (18.32). In the absence of any source of errors, the DFS are equal to the number of scalar observations that are assimilated (p here). In the presence of errors, the DFS range between 0 and the number of observations p, because the information in the observations is also used to resolve the noise (Rodgers, 2000). So the maximization of J_ω entails maximizing these degrees of freedom, which seems very natural.
Assuming scale-covariant errors of the aggregation form described earlier, the dependence of the DFS criterion on Π_ω is actually simple:

J_ω = Tr[(R + HBH^T)^{-1} H Π_ω B H^T].
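With the aggregation-form errors, the DFS criterion is linear in Π_ω and bounded by the number of observations, as the following sketch confirms (same illustrative toy setup):

```python
import numpy as np

rng = np.random.default_rng(5)
N_fg, N, p = 8, 4, 3
Gamma = np.kron(np.eye(N), np.full((1, 2), 0.5))
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)
H = rng.standard_normal((p, N_fg))
R = np.eye(p)
Lam = B @ Gamma.T @ np.linalg.inv(Gamma @ B @ Gamma.T)
Pi = Lam @ Gamma

S = np.linalg.inv(R + H @ B @ H.T)      # scale-independent innovation statistics
dfs_w = np.trace(S @ H @ Pi @ B @ H.T)  # DFS in omega: linear in Pi
dfs_fine = np.trace(S @ H @ B @ H.T)    # DFS in the finest grid (Pi = I)

# 0 <= DFS(omega) <= DFS(finest grid) < p
assert 0.0 <= dfs_w <= dfs_fine < p
```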

Numerical solution
The explicit dependence of the criteria on the representation ω, through Π_ω, is the key to an optimization of the criteria over a library of potential discretizations of control space. This permits an algebraic formulation for any adaptive discretization that follows a hierarchy of discretizations. This formulation enables the definition of a conventional objective function to be minimized. All the mathematical details can be found in Bocquet (2009).
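A brute-force version of such an optimization can be sketched on a toy 1-D problem: build a library of candidate representations (here, all partitions of eight fine cells into four contiguous blocks) and retain the one maximizing the DFS. All names and sizes are illustrative; the actual algorithms in Bocquet (2009) operate on hierarchical multiscale grids.

```python
import itertools
import numpy as np

rng = np.random.default_rng(6)
N_fg, N, p = 8, 4, 3
A = rng.standard_normal((N_fg, N_fg))
B = A @ A.T + N_fg * np.eye(N_fg)
H = rng.standard_normal((p, N_fg))
R = np.eye(p)
S = np.linalg.inv(R + H @ B @ H.T)      # scale-independent innovation statistics

def dfs(cuts):
    """DFS of the representation whose N blocks are delimited by `cuts`."""
    bounds = (0,) + cuts + (N_fg,)
    Gamma = np.zeros((N, N_fg))
    for i in range(N):
        Gamma[i, bounds[i]:bounds[i + 1]] = 1.0 / (bounds[i + 1] - bounds[i])
    Pi = B @ Gamma.T @ np.linalg.solve(Gamma @ B @ Gamma.T, Gamma)
    return float(np.trace(S @ H @ Pi @ B @ H.T))

# library of candidate discretizations: 35 contiguous partitions
library = list(itertools.combinations(range(1, N_fg), N - 1))
best = max(library, key=dfs)

# the optimal adaptive grid does at least as well as the regular one
assert dfs(best) >= dfs((2, 4, 6))
```

In realistic settings the library is hierarchical (e.g. dyadic tilings) and the search exploits the linearity of the DFS in Π_ω rather than enumerating every grid.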

Example: the United Nations International Monitoring Network
Optimal representations of the control space can be constructed by the Bayesian multiscale framework and the design methodology introduced above. These optimal representations, under information criteria such as the Fisher criterion or the DFS, are in fact information maps indicating how information from observation sites could be propagated towards control space. One direct application of such information maps is to evaluate the performance of a monitoring network in the context of inverse modelling. In this section, we summarize one such example applied to the International Monitoring System (IMS) radionuclide network.
The IMS radionuclide network is operated by the United Nations CTBT Organisation (CTBTO) to enforce the Comprehensive Nuclear-Test-Ban Treaty (CTBT), which bans nuclear explosions. There is a total of 80 stations, of which 79 have assigned locations. We shall consider these 79 stations. As of June 2011, 60 stations were certified and operational to collect seismic, infrasound, and hydroacoustic data as well as radionuclide (particulate matter and noble gases) activity concentrations (triangles in Fig. 18.4 mark their locations). This IMS network is usually assessed by its detection capability using global atmospheric transport models (ATM), whereas, in our approach, we evaluate the potential of the IMS network for inverse modelling of the source.
The year 2009 is the focus of the study. Activity concentration measurements are integrated over 24 hours. Therefore, 79 × 365 = 28 835 observations are considered. The relationship between the emission source parameters and the observations is derived from the influence functions (also known as adjoint solutions, footprints, or retroplumes) obtained using the FLEXPART Lagrangian transport model. The temporal extent of each influence function is 14 days, with a time step of ∆t = 3 hours. In the absence of significant correlations between observation errors, and between background errors, we assume diagonal observation and background error covariance matrices.
We compute optimal adaptive grids of the source parameter space by maximizing the Fisher criterion or, alternatively, the DFS. This optimization takes into account the monitoring network, the uncertainty prescription, and the relationship between the source parameters and the observations derived from atmospheric transport over 2009.
The comparison of the performance of the regular and optimal representations shown in Fig. 18.3 demonstrates the efficiency of the optimal grids in terms of uncertainty reduction. We present one resulting optimal grid under the Fisher criterion in Fig. 18.4. Areas of the domain where the grid cells of the optimal adaptive grid are large emphasize zones where the retrieval is more uncertain, whereas areas where the grid cells are smaller and denser stress regions where more source variables can be resolved.
The observability of the globe through inverse modelling has been studied in cases with strong, realistic, and small model error. The strong-error and realistic-error cases yield heterogeneous adaptive grids, indicating that information does not propagate far from the monitoring stations, whereas in the small-error case, the grid is much more homogeneous. In all cases, several specific continental regions remain poorly observed, such as Africa and the tropics, because of the trade winds. The Northern Hemisphere is better observed through inverse modelling, mostly because it contains more IMS stations. This imbalance leads to a better performance of inverse modelling in the Northern Hemisphere winter.

Example: Ring 2
The inversion of CO 2 surface fluxes from atmospheric concentration measurements involves discretizing the flux domain in time and space. Because of the sparsity of the available concentration observations, the spatial extent of fluxes, and the dispersive nature of the atmospheric transport, the inversion of carbon fluxes is an ill-posed inverse problem. One popular remedy addressing this observation-flux imbalance is to reduce the effective degrees of freedom of fluxes by aggregation of flux variables within large regions (so-called ecoregions). However, the failure to take into account the flux variations at fine scales leads to the aggregation error that, in some cases, can be of the same order as the flux magnitude.
Thanks to the Bayesian multiscale framework and the control space design methodology, optimal multiscale representations can be obtained by maximizing the DFS. Moreover, scale-dependent aggregation errors can be identified and explicitly formulated for more reliable inversions. We demonstrate this by performing inversions using synthetic continuous hourly CO2 concentration data in the context of the Ring 2 experiment in support of the North American Carbon Program Mid Continent Intensive (MCI, <http://www.ring2.psu.edu>).
A ring of eight towers around the state of Iowa collects hourly averaged CO2 concentration observations (in ppm) in and out of the corn belt area. The locations of these towers are shown in Fig. 18.5. The time period of the experiment is from 1 June 2007 at 0000 UTC to 16 June 2007 at 0000 UTC. The time length is 15 days. The total number of hourly observations p is thus 2880 (8 × 24 × 15). The two-dimensional spatial domain (an area of 980 km × 980 km) is discretized into a finest regular grid of 128 × 128 points. Simulations of the vegetation model SiBcrop within this spatio-temporal domain are used as the reference true fluxes. Backward particle trajectories over 15 days are generated using the Lagrangian transport model LPDM driven by the atmospheric transport simulated by the meteorological model WRF. These particles within the surface grid cells are recorded to compute the influences of the fluxes on the concentration observations to form the Jacobian matrix H. The background error covariance matrix B either is diagonal or follows a Balgovind model that parameterizes an isotropic exponentially decaying correlation. We assume R to be diagonal; that is, the observation errors are spatio-temporally independent.
We plot optimal representations obtained with the DFS criterion in Fig. 18.5, which demonstrate the heterogeneity of the optimal propagation of information from observation sites to the whole domain. The Balgovind correlation in background errors implicitly imposes aggregation on fluxes. This results in a more uniformly distributed optimal grid than in the case of diagonal B. We compare the performance of the regular and optimal representations, with or without taking into account the aggregation error, in Fig. 18.6. For diagonal B, the optimal grids are more efficient than regular grids in terms of information gain (similar to Fig. 18.3). For Balgovind B, this gain of optimal grids is less obvious (Fig. 18.6(a)), since those optimal grids are more uniformly distributed. When the correlations in the errors of a priori fluxes are physically unrealistic (Fig. 18.6(b)), it is recommended that the aggregation errors be taken into account explicitly. In general, only a small part of the observations from the sparse Ring 2 network are effectively assimilated (DFS/p < 20%).

Empirical multiscale statistics
In previous sections, we have introduced restriction and prolongation operators to transfer fluxes between scales. The aggregation error related to these operators has been identified and formulated for viable inversions. In CO2 inversions, we have shown the important role of the background error covariance matrix.
In this section, we sketch a quite different approach (Chevallier et al., 2012).

Objective statistics at the finest scale
The objective statistics are based on the comparison between the surface fluxes calculated by the process-based terrestrial ecosystem model ORCHIDEE and daily averages of CO2 flux measurements at 156 sites across the world in the global FLUXNET network. We use flux observations collected between 1991 and 2007. ORCHIDEE simulations are performed for these sites and are driven by meteorological data (Interim Reanalysis of the European Centre for Medium-Range Weather Forecasts). The background flux errors have been estimated from the statistics of the model-observation differences (the innovation). At the daily scale, the standard deviation of the model-data fit is 2.5 g C m⁻² d⁻¹. Temporal autocorrelations (Fig. 18.7(b)) are significant at the weekly scale (>0.3 for lags less than four weeks). The spatial structure of the error appears to be a function of the lag distance between pairs of sites, based on the Pearson correlation coefficient of the model-observation differences (Fig. 18.7(a)). The median reveals spatial structure at short distances (<100 km). The correlation median is 0.33 for distances less than 100 km, 0.26 for distances between 100 and 200 km, and negligible beyond 400 km. The short spatial correlation length estimated from FLUXNET data does not seem to justify aggregation by imposing biome-dependent correlations.

Background error at coarser scales
If each flux variable at a coarse scale is assumed to be the arithmetic mean of fluxes at finest pixels of 1 day and 1 km², then the variance of the flux error on a coarser grid cell can be computed by multiplying the error variance of the finest pixel by the arithmetic mean of the correlations of flux errors between two arbitrary finest pixels in that coarser grid cell. The correlation of flux errors between two coarser-scale grid cells can be computed in a similar way. In Fig. 18.8, we show the effect of temporal and spatial aggregation on the error statistics at various aggregation scales. In general, aggregation reduces error variances, while increasing correlations. As an example, for a typical inversion at a global scale of grid-point (300 km × 300 km) monthly fluxes, the prior flux error follows an approximate e-folding correlation length of 500 km only, with correlations from one month to the next as large as 0.6.
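The aggregation rule described above is easy to reproduce on a toy 1-D transect. In the sketch below, finest pixels have unit error variance and an exponential correlation exp(−d/L) (an assumed correlation model; L and the sizes are illustrative, not the FLUXNET estimates). Aggregation lowers the variance of block-mean errors while, at a given centre-to-centre distance, raising their correlation:

```python
import numpy as np

n, L = 64, 8.0                         # number of fine pixels; correlation length
d = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
C = np.exp(-d / L)                     # fine-scale error correlation (unit variances)

def block_stats(k):
    """Error variance of a k-pixel block mean and the correlation
    between two adjacent block means (centres k pixels apart)."""
    m = n // k
    W = np.kron(np.eye(m), np.full((1, k), 1.0 / k))   # block averaging
    Cov = W @ C @ W.T
    return Cov[0, 0], Cov[0, 1] / Cov[0, 0]

v4, c4 = block_stats(4)
v16, c16 = block_stats(16)

# aggregation reduces error variances ...
assert v16 < v4 < 1.0
# ... while increasing correlations at the same physical separation:
# block means correlate more than single pixels that far apart
assert c4 > np.exp(-4 / L) and c16 > np.exp(-16 / L)
```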

Conclusion
These selected topics have offered only a restricted view of a few facets of the multiscale aspects of geophysical data assimilation. In spite of the apparent heterogeneity of the formalisms and concepts, a few pivotal ideas and requirements have emerged. The first is the need for properly formalized operators for changing scales and converting/transferring information through the scales. These operators need to be able to consistently transfer not only the fields but also their statistics. Another key idea is the importance of accounting for and formalizing representativeness errors or, alternatively, aggregation errors. Finally, the paradigm of the information budget that underlies modern data assimilation is even more salient when dealing with the multiscale aspects of data assimilation, since changing scales may involve losing or gaining information, much as in the forecast and analysis steps of standard data assimilation.