N. J. Higham, Computing the Polar Decomposition with Applications, SIAM Journal on Scientific and Statistical Computing, vol.7, issue.4, pp.1160-1174, 1986.

G. H. Golub and C. F. Van Loan, Matrix Computations, ser. Johns Hopkins Studies in the Mathematical Sciences, 1996.

I. Bar-Itzhack, Iterative optimal orthogonalization of the strapdown matrix, IEEE Transactions on Aerospace and Electronic Systems, vol.11, issue.1, pp.30-37, 1975.

J. A. Goldstein and M. Levy, Linear algebra and quantum chemistry, Am. Math. Monthly, vol.98, issue.10, pp.710-718, 1991.

Y. Nakatsukasa, Z. Bai, and F. Gygi, Optimizing Halley's Iteration for Computing the Matrix Polar Decomposition, SIAM Journal on Matrix Analysis and Applications, pp.2700-2720, 2010.

Y. Nakatsukasa and N. J. Higham, Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, SIAM Journal on Scientific Computing, vol.35, issue.3, pp.1325-1349, 2013.

L. S. Blackford, J. Choi, A. Cleary, E. F. D'Azevedo, J. W. Demmel et al., ScaLAPACK Users' Guide. Philadelphia: Society for Industrial and Applied Mathematics, 1997.

A. Danalis, G. Bosilca, A. Bouteiller, T. Herault, and J. Dongarra, PTG: An abstraction for unhindered parallelism, Proceedings of WOLFHPC 2014: 4th International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing -Held in Conjunction with SC 2014: The International Conference for High Performance Computing, Networking, Stor, pp.21-30, 2014.

E. Agullo, J. Demmel, J. Dongarra, B. Hadri, J. Kurzak et al., Numerical Linear Algebra on Emerging Architectures: The PLASMA and MAGMA projects, Journal of Physics: Conference Series, vol.180, 2009.

E. Chan, E. S. Quintana-Ortí, G. Quintana-Ortí, and R. van de Geijn, SuperMatrix out-of-order scheduling of matrix operations for SMP and multicore architectures, SPAA '07: Proceedings of the nineteenth annual ACM symposium on Parallel algorithms and architectures, pp.116-125, 2007.

The Chameleon Project, 2018.

J. Dongarra, M. Faverge, T. Hérault, M. Jacquelin, J. Langou et al., Hierarchical QR factorization algorithms for multicore clusters, Parallel Computing, vol.39, issue.4, pp.212-232, 2013.
URL : https://hal.archives-ouvertes.fr/hal-00809770

R. A. van de Geijn and J. Watts, SUMMA: scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, vol.9, issue.4, pp.255-274, 1997.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, A. Haidar et al., Flexible Development of Dense Linear Algebra Algorithms on Massively Parallel Architectures with DPLASMA, IPDPS Workshops. IEEE, pp.1432-1441, 2011.

Q. Meng, A. Humphrey, J. Schmidt, and M. Berzins, Investigating Applications Portability with the Uintah DAG-based Runtime System on PetaScale Supercomputers, Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, ser. SC '13, 2013.

The HiCMA Library, 2018.

K. Akbudak, H. Ltaief, A. Mikhalev, and D. Keyes, Tile Low Rank Cholesky Factorization for Climate/Weather Modeling Applications on Manycore Architectures, High Performance Computing: 32nd International Conference, pp.22-40, 2017.

K. Akbudak, H. Ltaief, A. Mikhalev, A. Charara, A. Esposito et al., Exploiting data sparsity for large-scale matrix computations, Euro-Par 2018: Parallel Processing, pp.721-734, 2018.

L. N. Trefethen and D. Bau, Numerical Linear Algebra, 1997.

D. Sukkari, H. Ltaief, and D. E. Keyes, A High Performance QDWH-SVD Solver Using Hardware Accelerators, ACM Trans. Math. Softw, vol.43, issue.1, pp.1-6, 2016.

Matrix Algebra on GPU and Multicore Architectures, MAGMA, 2009.

E. Anderson, Z. Bai, C. H. Bischof, L. S. Blackford, J. W. Demmel et al., LAPACK Users' Guide, 1999.

J. Poulson, B. Marker, R. A. van de Geijn, J. R. Hammond, and N. A. Romero, Elemental: A New Framework for Distributed Memory Dense Matrix Computations, ACM Trans. Math. Softw., vol.39, issue.2, p.13, 2013.

D. Sukkari, H. Ltaief, and D. Keyes, High Performance Polar Decomposition on Distributed Memory Systems, Euro-Par 2016: Parallel Processing - 22nd International Conference on Parallel and Distributed Computing, vol.9833, pp.605-616, 2016.

D. Sukkari, H. Ltaief, A. Esposito, and D. Keyes, A QDWH-based SVD software framework on distributed-memory manycore systems, ACM Trans. Math. Softw, vol.45, issue.2, pp.1-18, 2019.

Cray LibSci.

D. Sukkari, H. Ltaief, M. Faverge, and D. Keyes, Asynchronous task-based polar decomposition on single node manycore architectures, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.2, pp.312-323, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01585079

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A Unified Platform for Task Scheduling on Heterogeneous Multicore Architectures, Concurrency and Computation: Practice and Experience, vol.23, issue.2, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

G. Bosilca, A. Bouteiller, A. Danalis, T. Herault, P. Lemarinier et al., DAGuE: A generic distributed DAG Engine for High Performance Computing, Parallel Computing, vol.38, issue.1-2, 2012.

M. Cosnard, E. Jeannot, and T. Yang, Compact dag representation and its symbolic scheduling, Journal of Parallel and Distributed Computing, vol.64, issue.8, pp.921-935, 2004.
URL : https://hal.archives-ouvertes.fr/inria-00099958

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, A class of parallel tiled linear algebra algorithms for multicore architectures, Parallel Computing, vol.35, pp.38-53, 2009.

A. Buttari, J. Langou, J. Kurzak, and J. Dongarra, Parallel tiled QR factorization for multicore architectures, Concurrency: Practice and Experience, vol.20, issue.13, pp.1573-1590, 2008.

G. Quintana-Ortí, E. S. Quintana-Ortí, R. A. van de Geijn, F. G. Van Zee, and E. Chan, Programming matrix algorithms-by-blocks for thread-level parallelism, ACM Transactions on Mathematical Software, vol.36, issue.3, 2009.

J. W. Demmel, L. Grigori, M. Hoemmen, and J. Langou, Communication-avoiding parallel and sequential QR and LU factorizations: theory and practice, 2008.

B. Hadri, H. Ltaief, E. Agullo, and J. Dongarra, Tile QR factorization with parallel panel processing for multicore architectures, IPDPS'10, the 24th IEEE Int. Parallel and Distributed Processing Symposium, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00548899

H. Bouwmeester, M. Jacquelin, J. Langou, and Y. Robert, Tiled QR factorization algorithms, SC'2011, the IEEE/ACM Conference on High Performance Computing Networking, Storage and Analysis, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00585721

F. Song, H. Ltaief, B. Hadri, and J. Dongarra, Scalable tile communication-avoiding QR factorization on multicore cluster systems, SC'10, the 2010 ACM/IEEE conference on Supercomputing, 2010.

M. Faverge, J. Langou, Y. Robert, and J. Dongarra, Bidiagonalization and R-Bidiagonalization: Parallel Tiled Algorithms, Critical Paths and Distributed-Memory Implementation, IPDPS'17 - 31st IEEE International Parallel and Distributed Processing Symposium, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01484113

E. Solomonik and J. Demmel, Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms, Euro-Par 2011 Parallel Processing, E. Jeannot, R. Namyst, and J. Roman, Eds., pp.90-109, 2011.

J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz et al., Communication-optimal parallel recursive rectangular matrix multiplication, 2013 IEEE 27th International Symposium on Parallel and Distributed Processing, pp.261-272, 2013.

N. J. Higham, Estimating the matrix p-norm, Numerische Mathematik, vol.62, issue.1, pp.539-555, 1992.

J. Bruck, C. Ho, S. Kipnis, E. Upfal, and D. Weathersby, Efficient algorithms for all-to-all communications in multiport message-passing systems, IEEE Transactions on Parallel and Distributed Systems, vol.8, issue.11, pp.1143-1156, 1997.

J. A. Calvin, C. A. Lewis, and E. F. Valeev, Scalable task-based algorithm for multiplication of block-rank-sparse matrices, Proceedings of the 5th Workshop on Irregular Applications: Architectures and Algorithms, ser. IA3 '15, vol.4, pp.1-4, 2015.

S. Blackford and J. J. Dongarra, Installation guide for LAPACK, 1992.

H. Ltaief, D. Sukkari, A. Esposito, Y. Nakatsukasa, and D. Keyes, Massively Parallel Polar Decomposition on Distributed-Memory Systems, Accepted at ACM Transactions on Parallel Computing, 2019.