M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis et al., TensorFlow: A system for large-scale machine learning, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp.265-283, 2016.

E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost et al., Harnessing Supercomputers with a Sequential Task-based Runtime System, 2014.

A. Asadchev and M. S. Gordon, Fast and Flexible Coupled Cluster Implementation, J. Chem. Theory Comput., vol.9, issue.8, pp.3385-3392, 2013.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, Conc. Comp. Pract. Exper., vol.23, pp.187-198, 2011.
URL : https://hal.archives-ouvertes.fr/inria-00384363

M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, Legion: Expressing locality and independence with logical regions, International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.

I. Bethune, A. Glöss, J. Hutter, A. Lazzaro, H. Pabst et al., Porting of the DBCSR library for sparse matrix-matrix multiplications to Intel Xeon Phi systems, Advances in Parallel Computing, pp.47-56, 2018.

U. Borštnik, J. VandeVondele, V. Weber, and J. Hutter, Sparse matrix multiplication: The distributed block-compressed sparse row library, Parallel Computing, vol.40, issue.5-6, pp.47-58, 2014.

J. Bosch, A. Filgueras, M. Vidal, D. Jimenez-Gonzalez, C. Alvarez et al., Exploiting Parallelism on GPUs and FPGAs with OmpSs, Proceedings of the 1st Workshop on AutotuniNg and ADaptivity AppRoaches for Energy Efficient HPC Systems (ANDARE '17), 2017.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault et al., PaRSEC: Exploiting Heterogeneity to Enhance Scalability, IEEE Computing in Science Engineering, vol.15, issue.6, pp.36-45, 2013.

J. A. Calvin, C. A. Lewis, and E. F. Valeev, Scalable task-based algorithm for multiplication of block-rank-sparse matrices, IA3 '15, pp.1-8, 2015.

Chameleon: A dense linear algebra software for heterogeneous architectures, 2020.

Open source molecular dynamics, 2020.

A. Danalis, G. Bosilca, A. Bouteiller, T. Hérault, and J. J. Dongarra, PTG: an abstraction for unhindered parallelism, Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC '14, pp.21-30, 2014.

T. Davis, SuiteSparse: A suite of sparse matrix software, 2020.

J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz et al., Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication, 2013 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2013.

DPLASMA: Distributed Parallel Linear Algebra Software for Multicore Architectures.

A. Duran, R. Ferrer, E. Ayguade, R. M. Badia, and J. Labarta, A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks, Intl. Journal of Parallel Programming, vol.37, issue.3, pp.292-305, 2009.

Elemental: C++ library for distributed-memory linear algebra and optimization.

M. Gates, J. Kurzak, A. Charara, A. Yarkhan, and J. Dongarra, SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library, SC'2019, 2019.

K. Goto and R. A. van de Geijn, Anatomy of High-performance Matrix Multiplication, ACM Trans. Math. Software, vol.34, issue.3, 2008.

S. Gray, A. Radford, and D. P. Kingma, GPU kernels for block-sparse weights, 2017.

T. Herault, Y. Robert, G. Bosilca, and J. J. Dongarra, Generic matrix multiplication for multi-GPU accelerated distributed-memory platforms over PaRSEC, 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp.33-41, 2019.
URL : https://hal.archives-ouvertes.fr/hal-02282529

T. Herault, Y. Robert, G. Bosilca, R. J. Harrison, C. A. Lewis et al., Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure: software artifact, 2020.

J. Hong and H. T. Kung, I/O complexity: the red-blue pebble game, STOC '81: Proceedings of the 13th ACM Symposium on Theory of Computing, pp.326-333, 1981.

D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distributed Computing, vol.64, issue.9, pp.1017-1026, 2004.

R. Kobayashi and A. P. Rendell, A direct coupled cluster algorithm for massively parallel computers, Chem. Phys. Lett., vol.265, issue.1-2, pp.1-11, 1997.

J. Kurzak, M. Gates, A. Charara, A. Yarkhan, I. Yamazaki et al., Linear systems solvers for distributed-memory machines with gpu accelerators, Euro-Par, pp.495-506, 2019.

G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà et al., Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication, 2019.

C. A. Lewis, J. A. Calvin, and E. F. Valeev, Clustered Low-Rank Tensor Format: Introduction and Application to Fast Construction of Hartree-Fock Exchange, J. Chem. Theory Comput, vol.12, issue.12, pp.5868-5880, 2016.

W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal, Optimizing tensor contraction expressions for hybrid CPU-GPU execution, Clust. Comput., vol.16, issue.1, pp.131-155, 2013.

OpenMP, 2013.

PLAPACK: Parallel Linear Algebra PACKage.

C. Peng, J. Calvin, and E. F. Valeev, Coupled-Cluster Singles, Doubles and Perturbative Triples with Density Fitting Approximation for Massively Parallel Heterogeneous Platforms, Int. J. Quant. Chem., vol.119, issue.12, p.25894, 2019.

C. Peng, J. A. Calvin, F. Pavošević, J. Zhang, and E. F. Valeev, Massively Parallel Implementation of Explicitly Correlated Coupled-Cluster Singles and Doubles Using TiledArray Framework, J. Phys. Chem. A, vol.120, issue.51, pp.10231-10244, 2016.

C. Peng, C. Lewis, X. Wang, M. Clement, F. Pavošević et al., The Massively Parallel Quantum Chemistry Program (MPQC), 2018.

J. Pineau, Y. Robert, F. Vivien, and J. Dongarra, Matrix product on heterogeneous master-worker platforms, ACM SIGPLAN PPoPP, pp.53-62, 2008.
URL : https://hal.archives-ouvertes.fr/hal-00803487

C. Riplinger, P. Pinski, U. Becker, E. F. Valeev, and F. Neese, Sparse maps - A systematic infrastructure for reduced-scaling electronic structure methods. II. Linear scaling domain based pair natural orbital coupled cluster theory, J. Chem. Phys., vol.144, issue.2, 2016.

E. H. Rubensson and E. Rudberg, Locality-aware parallel block-sparse matrix-matrix multiplication using the chunks and tasks programming model, Parallel Computing, vol.57, pp.87-106, 2016.

E. H. Rubensson, E. Rudberg, and P. Sałek, A hierarchic sparse matrix data structure for large-scale Hartree-Fock/Kohn-Sham calculations, J. Computational Chemistry, vol.28, issue.16, pp.2531-2537, 2007.

ScaLAPACK: Scalable Linear Algebra PACKage.

M. D. Schatz, R. A. van de Geijn, and J. Poulson, Parallel matrix multiplication: A systematic journey, SIAM J. Scientific Computing, vol.38, issue.6, pp.748-781, 2016.

O. Schütt, P. Messmer, J. Hutter, and J. VandeVondele, GPU-Accelerated Sparse Matrix-Matrix Multiplication for Linear Scaling Density Functional Theory, pp.173-190, 2016.

I. Shavitt and R. Bartlett, Many-Body Methods in Chemistry and Physics: MBPT and Coupled-Cluster Theory. Cambridge Molecular Science, 2009.

I. Sivkov, P. Seewald, A. Lazzaro, and J. Hutter, DBCSR: A blocked sparse tensor algebra library, Parallel Computing: Technology Trends, Proceedings of the International Conference on Parallel Computing, vol.36, pp.331-340, 2019.

E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, A massively parallel tensor contraction framework for coupled-cluster computations, Journal of Parallel and Distributed Computing, vol.74, issue.12, pp.3176-3190, 2014.

S. Toledo, A survey of out-of-core algorithms in numerical linear algebra, External Memory Algorithms and Visualization, pp.161-180, 1999.

Top500: Top 500 Supercomputer Sites, 2019.

S. J. Treichler, Realm: Performance Portability through Composable Asynchrony, 2014.

R. A. van de Geijn and J. Watts, SUMMA: scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, vol.9, issue.4, pp.255-274, 1997.