%!TEX root = ../s2hat_jpaper.tex
\section{Introduction}\label{sec:introduction} % (fold)
Spherical harmonic functions define a complete orthonormal basis for signals defined
on the 2-dimensional sphere. Spherical harmonic transforms are therefore common in
all scientific applications where such signals are encountered. These include a number of diverse areas, ranging
from weather forecasting and climate modelling, through geophysics and planetology, to various
applications in astrophysics and cosmology. In these contexts a direct Spherical Harmonic Transform (SHT) is used to calculate
harmonic domain representations of the signals, which often
possess simpler properties and are therefore more amenable to further investigation. An inverse SHT
is then used to synthesize a sky image given its harmonic representation. Both transforms are also used
to facilitate the multiplication of huge matrices, defined on the sphere, which are diagonal in the harmonic
domain. Such matrices play an important role in the statistical analysis of signals that
are statistically isotropic, and the corresponding multiplications are among the key operations involved in Markov Chain Monte Carlo sampling
approaches used to study such signals, e.g.,~\cite{Tanner1992}.
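To make the two transforms concrete, the following minimal Python/SciPy sketch (an illustration only, unrelated to the {\sc s$^2$hat} implementation; the band limit $\ell_{max}=6$, the grid sizes, and the random coefficients are arbitrary choices) synthesizes a band-limited map from harmonic coefficients on an iso-latitude grid and then recovers the coefficients by quadrature:

```python
import numpy as np

# SciPy >= 1.15 renamed sph_harm to sph_harm_y (polar angle passed first);
# support both so the sketch runs on either version.
try:
    from scipy.special import sph_harm_y
    def Ylm(l, m, theta, phi):          # theta: polar, phi: azimuth
        return sph_harm_y(l, m, theta, phi)
except ImportError:
    from scipy.special import sph_harm
    def Ylm(l, m, theta, phi):          # older call order: (m, l, azimuth, polar)
        return sph_harm(m, l, phi, theta)

lmax = 6                                # illustrative band limit
ntheta, nphi = lmax + 1, 2 * lmax + 1   # smallest grid exact up to lmax

# Gauss-Legendre nodes in cos(theta) make the analysis quadrature exact
# for band-limited input
x, w = np.polynomial.legendre.leggauss(ntheta)
theta = np.arccos(x)                              # polar angle of each ring
phi = 2.0 * np.pi * np.arange(nphi) / nphi        # equidistant points per ring
P, T = np.meshgrid(phi, theta)                    # (ntheta, nphi) angle grids

# random band-limited harmonic coefficients a_lm
rng = np.random.default_rng(0)
alm = {(l, m): complex(rng.normal(), rng.normal())
       for l in range(lmax + 1) for m in range(-l, l + 1)}

# inverse SHT (synthesis): f(theta, phi) = sum_lm a_lm Y_lm(theta, phi)
f = sum(a * Ylm(l, m, T, P) for (l, m), a in alm.items())

# direct SHT (analysis): a_lm = integral of f * conj(Y_lm) over the sphere,
# i.e. a Gauss-Legendre sum over rings times a plain sum in azimuth
def analyse(l, m):
    return np.sum(w[:, None] * f * np.conj(Ylm(l, m, T, P))) * 2.0 * np.pi / nphi

err = max(abs(analyse(l, m) - a) for (l, m), a in alm.items())
print(f"max round-trip error: {err:.2e}")
```

A diagonal harmonic-domain operator, such as a smoothing beam $b_\ell$, would then simply rescale each recovered $a_{\ell m}$ by $b_\ell$ before the synthesis step, instead of applying a dense pixel-domain matrix.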
\begin{figure}[htp]
\begin{center}
\includegraphics[width=0.8\textwidth]{figures/f8_map_example.png}
\end{center}
\caption{Sky map example synthesised using our MPI/CUDA implementation of the \alm2map routine. The units are $\mu$K.
An overall monopole term, with an amplitude of $\sim 2.7\times 10^6\,\mu$K, has been subtracted to uncover the minute fluctuations.}\label{fig:skymap}
\end{figure}
The specific goal of this work is to assist simulation and analysis efforts related to Cosmic Microwave Background
(CMB) studies. The CMB is electromagnetic radiation left over from the hot and dense initial
phase of the evolution of the Universe, popularly referred to as the Big Bang. Its observations
are one of the primary ways to study the Universe. The CMB photons reaching us from afar carry
an image of the Universe at a time when it had just a few percent of its current age of $\sim 13\,$Gyr.
This image confirms that the early
Universe was nearly homogeneous, with only small deviations, of 1 part in $10^5$, present, as displayed in Fig.~\ref{fig:skymap}. Those deviations initiated the process
of so-called structure formation, which eventually led to the Universe as we observe it today.
Studies of the statistical properties of these small deviations are one of the major pillars
on which present-day cosmology is built. On the observational front the CMB anisotropies
are targeted by an entire slew of operating and forthcoming experiments. These include a currently
observing European satellite called Planck\footnote{{\scriptsize {\sc planck:} http://sci.esa.int/science-e/www/area/index.cfm?fareaid=17}}, which follows in the footsteps of two very successful American
missions, COBE\footnote{{\scriptsize {\sc cobe:} http://lambda.gsfc.nasa.gov/product/cobe/}} and WMAP\footnote{{\scriptsize {\sc wmap:} http://map.gsfc.nasa.gov/}},
as well as a number of balloon-borne and ground-based observatories.
Owing to quickly evolving detector technology, the volumes of data collected by these experiments have been
increasing at a Moore's-law rate, at present reaching values on the order of petabytes. These new data sets,
aimed at progressively
more challenging science goals, not only provide, or are expected to provide, images of the sky with
unprecedented resolution, but their scientific exploitation will also require intensive, high-precision,
and voluminous simulations, which in turn will require highly efficient novel approaches to the calculation
of SHTs.
For instance, current and forthcoming balloon-borne and ground-based experiments will produce maps of the sky
containing as many as ${\cal O}(10^5-10^6)$ pixels and up to ${\cal O}(10^6-10^7)$ harmonic modes. Maps from
the already operating Planck satellite will consist of between ${\cal O}(10^6)$ and ${\cal O}(10^8)$ pixels and harmonic
modes. The production of high-precision simulations reliably reproducing the polarized properties of the CMB
signal, in particular those generated by so-called gravitational lensing, requires an effective
number of pixels and harmonic modes as large as ${\cal O}(10^{9})$. These latter effects constitute some of the most
exciting new lines of research in the CMB area. An SHT implementation that could successfully address all
these needs would not only have to scale well in the range of interest, but also be sufficiently fast to
be used as part of massive Monte Carlo simulations or extensive sampling solvers,
which are both important steps in the scientific exploitation of CMB data.
At this time a few software packages are available that permit calculation of SHTs. These include
{\sc healpix}~\cite{healpix_www}, {\sc glesp}~\cite{glesp_www}, {\sc ccsht}~\cite{ccsht_www},
{\sc libpsht}~\cite{libpsht_www}, and {\sc s$^2$hat}~\cite{s2hat_www}, which are commonly used in CMB
research, and some others, such as {\sc spharmonickit/s2kit}~\cite{spharmonickit_www}
and {\sc spherpack}~\cite{spherpack_www}.
They offer different levels of parallelism and optimization, while implementing, with the exception of the last two, essentially the same
algorithm. Among those, the {\sc libpsht} package~\cite{libpsht} typically offers
the best performance, thanks to efficient code optimizations, and is considered the state-of-the-art implementation on serial and shared-memory platforms, at least for some of the popular sky pixelizations.
Conversely, the implementation offered by the {\sc s$^2$hat} library is best adapted to distributed-memory machines, being fully parallelized and scalable with
respect to both memory usage and calculation time. Moreover, it provides a convenient and flexible framework that is straightforwardly extensible to multiple
parallelization levels based on hybrid programming models. Given the specific applications considered in this work, distributed-memory parallelism seems
inevitable, and we therefore select the {\sc s$^2$hat} package as the starting point for this work.
The basic algorithm we use hereafter, which is implemented in both the {\sc libpsht} and {\sc s$^2$hat} libraries as described in the next section, scales as
${\cal O}({\cal R}_N\,\ell^2_{max})$,
where ${\cal R}_N$ is the number of rings in the analyzed map and $\ell_{max}$ fixes the resolution and thus
determines the number of modes in the harmonic domain, equal to $\simeq \ell_{max}^2$. For full-sky maps we usually have
$\ell_{max} \propto {\cal R}_N \propto n_{pix}^{1/2}$, where $n_{pix}$ is the total number of sky pixels,
and therefore the typical complexity is ${\cal O}(\ell_{max}^{3}) \propto {\cal O}(n_{pix}^{3/2})$.
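In rough numbers, this scaling means that quadrupling the pixel count increases the cost roughly eightfold, i.e., by $4^{3/2}$. The back-of-the-envelope sketch below illustrates this (the unit cost and all proportionality constants are arbitrary and omitted):

```python
# toy cost model for the standard algorithm: cost ~ R_N * lmax^2,
# with lmax ~ R_N ~ npix^(1/2) for full-sky maps (prefactors omitted)
def sht_cost(npix):
    lmax = npix ** 0.5        # lmax grows as npix^(1/2)
    rings = lmax              # R_N grows as lmax for full-sky iso-latitude grids
    return rings * lmax ** 2  # O(R_N * lmax^2) = O(npix^(3/2))

print(sht_cost(4 * 10**6) / sht_cost(10**6))  # -> 8.0, i.e. 4^(3/2)
```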
We note that further improvements of this overall scaling are possible, for instance if multiple identical transforms have to, or can, be performed simultaneously,
which could lower the numerical cost to essentially that of a single transform. In this paper, however, we focus on the core algorithm and leave an investigation of these potential
extensions to future work.
We note that alternative algorithms have also been proposed, some of which display superior complexity. In particular, Driscoll and Healy~\cite{Driscoll_Healy_1994} proposed
a divide-and-conquer algorithm with a theoretical scaling of ${\cal O}(n_{pix}\,\ln^2 n_{pix})$. This approach is limited to special equidistant sphere grids and, being
inherently numerically unstable, requires corrective measures to ensure high-precision results, which in turn affect its overall performance. For instance,
the software {\sc spharmonickit/s2kit}, the most widely used implementation of this approach, has been found to be a factor of 3 slower than the {\sc healpix} transforms
implementing the standard ${\cal O}(n_{pix}^{3/2})$ method at the intermediate resolution of $\ell_{max} \sim 1024$~\cite{Wiaux_etal_2006}. Other algorithms also
exist and typically involve a precomputation step of the same complexity as the standard approach, but often with less favorable prefactors, followed by the actual calculation of the
transforms, which typically exploits either Fast Multipole Methods, e.g.,~\cite{SudaTakami2002, Tygert2008}, or matrix compression techniques, e.g.,~\cite{Mohlenkamp1999, Tygert2010}, to bring down
the overall scaling to ${\cal O}(n_{pix}\ln n_{pix})$~\cite{SudaTakami2002}, ${\cal O}(n_{pix}\ln^2 n_{pix})$~\cite{Tygert2008, Tygert2010}, or ${\cal O}(n_{pix}^{5/4} \ln n_{pix})$~\cite{Mohlenkamp1999}.
Methods of this class typically require significant memory resources to store the precomputation products and are advantageous
only if a sufficient number of successive transforms of the same type is performed to compensate for the precomputation costs.
The most recent, and arguably satisfactory, implementation of such ideas is a package called {\sc wavemoth}\footnote{{\scriptsize {\sc wavemoth:} https://github.com/wavemoth/wavemoth}}~\cite{wavemoth}, which achieves a speed-up of the inverse SHT with respect to the {\sc libpsht} library by a factor ranging from $3$ to $6$ for $\ell_{max} \sim 4000$. In such a case the required extra
memory is on the order of $40\,$GB and depends strongly, i.e., $\propto \ell_{max}^3$, on the resolution.
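To put the $\propto \ell_{max}^3$ dependence in perspective, one can extrapolate from this figure (a hedged, order-of-magnitude estimate anchored solely to the $\sim 40\,$GB at $\ell_{max} \sim 4000$ quoted above; the true prefactor of any given implementation may differ):

```python
# order-of-magnitude extrapolation of the precomputation memory,
# anchored to the quoted ~40 GB at lmax = 4000 and assuming mem ~ lmax^3
def precompute_mem_gb(lmax, ref_lmax=4000, ref_gb=40.0):
    return ref_gb * (lmax / ref_lmax) ** 3

print(precompute_mem_gb(8000))  # doubling lmax costs 8x the memory -> 320.0 GB
```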
We also note that in some applications the need for SHTs can sometimes be bypassed by resorting to approximate but numerically efficient means, such as
Fast Fourier Transforms, as for instance in the context of convolutions on the sphere discussed in~\cite{Elsner2011}.
In many practical applications, such as the ones driving this research, SHTs are just one of many processing steps that need to be performed to accomplish the final task. In such a context, it is the memory high-water mark and algorithm scalability that frequently emerge as the two most relevant requirements, as the cost of the SHTs
is often subdominant and their complexity more favorable than that of many other typical operations. From the memory point of view, the standard algorithm, having the smallest memory footprint, therefore remains an attractive option. The important question is then whether an efficient, scalable, parallel implementation of it is indeed possible.
This question is particularly pertinent in the context of heterogeneous architectures and hybrid programming models, and this is the context investigated in this work.
Past work on the parallelization of SHTs includes a parallel version of the two-step algorithm of Driscoll and Healy introduced by~\cite{inda2001efficient}, the algorithm of~\cite{Drake_algorithm888} based on rephrasing the transforms as matrix operations, thereby making them effectively vectorizable, and a shared-memory implementation
available in the {\sc libpsht} package. More recently, algorithms
have been developed for NVIDIA General-Purpose Graphics Processing Units (GPGPUs)~\cite{gpu_soman,hupca}. This latter work provided a direct
motivation for the investigation presented here, which describes what is, to the best of our knowledge, the first hybrid design of parallel SHT algorithms suitable for clusters of GPU-based
architectures and current high-performance multi-core processors, involving hybrid OpenMP/MPI and MPI/CUDA programming.
We find that, once carefully optimized and
implemented, the algorithm displays nearly perfect scaling in both
cases, at least up to 128 MPI processes mapped onto the same number of multi-core CPU/GPU pairs. We find that the inverse SHT executed on a GeForce 400 Series GPU based on the latest CUDA architecture (``Fermi'') outperforms the state-of-the-art multi-core implementation running on the latest Intel Core i7-2600K, while the direct transforms
in both cases perform comparably.
This paper is organised as follows. In section~\ref{sec:background} we introduce the algebraic background of spherical harmonic transforms. In section~\ref{sec:basic_algorithm} we describe a basic, sequential algorithm and list useful assumptions concerning sphere pixelisation, which facilitate a number of acceleration techniques used in our approach. In the following section we give a detailed description of our parallel algorithms, with two variants suitable for clusters of GPUs and clusters of multi-cores. Section~\ref{sec:experiments} presents results for both implementations, and finally section~\ref{sec:conclusions} concludes the paper.
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "../s2hat_jpaper"
%%% End: