M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis et al., TensorFlow: A system for large-scale machine learning, 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pp.265-283, 2016.

E. Agullo, O. Aumage, M. Faverge, N. Furmento, F. Pruvost et al., Harnessing Supercomputers with a Sequential Task-based Runtime System, 2014.

A. Asadchev and M. S. Gordon, Fast and Flexible Coupled Cluster Implementation, J. Chem. Theory Comput, vol.9, issue.8, pp.3385-3392, 2013.

C. Augonnet, S. Thibault, R. Namyst, and P. Wacrenier, StarPU: A unified platform for task scheduling on heterogeneous multicore architectures, Concurrency and Computation: Practice and Experience, vol.23, pp.187-198, 2011. URL: https://hal.archives-ouvertes.fr/inria-00384363

M. Bauer, S. Treichler, E. Slaughter, and A. Aiken, Legion: Expressing locality and independence with logical regions, International Conference for High Performance Computing, Networking, Storage and Analysis, 2012.

I. Bethune, A. Glöss, J. Hutter, A. Lazzaro, H. Pabst et al., Porting of the DBCSR library for sparse matrix-matrix multiplications to Intel Xeon Phi systems, Parallel Computing is Everywhere, Advances in Parallel Computing, vol.32, pp.47-56, 2018.

U. Borštnik, J. VandeVondele, V. Weber, and J. Hutter, Sparse matrix multiplication: The distributed block-compressed sparse row library, Parallel Computing, vol.40, issue.5-6, pp.47-58, 2014.

J. Bosch, A. Filgueras, M. Vidal, D. Jimenez-Gonzalez, C. Alvarez et al., Exploiting Parallelism on GPUs and FPGAs with OmpSs, Proceedings of the 1st Workshop on AutotuniNg and ADaptivity AppRoaches for Energy Efficient HPC Systems (ANDARE '17), 2017.

G. Bosilca, A. Bouteiller, A. Danalis, M. Faverge, T. Herault et al., PaRSEC: Exploiting Heterogeneity to Enhance Scalability, IEEE Computing in Science & Engineering, vol.15, issue.6, pp.36-45, 2013.

J. A. Calvin, C. A. Lewis, and E. F. Valeev, Scalable task-based algorithm for multiplication of block-rank-sparse matrices, IA3 '15, pp.1-8, 2015.

Chameleon: A dense linear algebra software for heterogeneous architectures, 2020.

CP2K: Open source molecular dynamics, 2020.

A. Danalis, G. Bosilca, A. Bouteiller, T. Hérault, and J. J. Dongarra, PTG: an abstraction for unhindered parallelism, Proceedings of the Fourth International Workshop on Domain-Specific Languages and High-Level Frameworks for High Performance Computing, WOLFHPC '14, pp.21-30, 2014.

T. Davis, SuiteSparse: a suite of sparse matrix software, 2020.

J. Demmel, D. Eliahu, A. Fox, S. Kamil, B. Lipshitz et al., Communication-Optimal Parallel Recursive Rectangular Matrix Multiplication, 2013 IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2013.

DPLASMA: Distributed Parallel Linear Algebra Software for Multicore Architectures.

A. Duran, R. Ferrer, E. Ayguade, R. M. Badia, and J. Labarta, A Proposal to Extend the OpenMP Tasking Model with Dependent Tasks, Intl. Journal of Parallel Programming, vol.37, issue.3, pp.292-305, 2009.

Elemental: C++ library for distributed-memory linear algebra and optimization.

M. Gates, J. Kurzak, A. Charara, A. YarKhan, and J. Dongarra, SLATE: Design of a Modern Distributed and Accelerated Linear Algebra Library, SC '19, 2019.

K. Goto and R. A. van de Geijn, Anatomy of High-performance Matrix Multiplication, ACM Trans. Math. Software, vol.34, issue.3, 2008.

S. Gray, A. Radford, and D. P. Kingma, GPU kernels for block-sparse weights, 2017.

T. Herault, Y. Robert, G. Bosilca, and J. J. Dongarra, Generic matrix multiplication for multi-GPU accelerated distributed-memory platforms over PaRSEC, 10th IEEE/ACM Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems, pp.33-41, 2019. URL: https://hal.archives-ouvertes.fr/hal-02282529

T. Herault, Y. Robert, G. Bosilca, R. J. Harrison, C. A. Lewis et al., Distributed-memory multi-GPU block-sparse tensor contraction for electronic structure: software artifact, 2020.

J. Hong and H. T. Kung, I/O complexity: the red-blue pebble game, STOC '81: Proceedings of the 13th ACM symposium on Theory of Computing, pp.326-333, 1981.

D. Irony, S. Toledo, and A. Tiskin, Communication lower bounds for distributed-memory matrix multiplication, J. Parallel Distributed Computing, vol.64, issue.9, pp.1017-1026, 2004.

R. Kobayashi, A direct coupled cluster algorithm for massively parallel computers, Chem. Phys. Lett, vol.265, issue.1-2, pp.1-11, 1997.

J. Kurzak, M. Gates, A. Charara, A. YarKhan, I. Yamazaki, and J. Dongarra, Linear systems solvers for distributed-memory machines with GPU accelerators, Euro-Par, pp.495-506, 2019.

G. Kwasniewski, M. Kabić, M. Besta, J. VandeVondele, R. Solcà, and T. Hoefler, Red-blue pebbling revisited: near optimal parallel matrix-matrix multiplication, 2019.

C. A. Lewis, J. A. Calvin, and E. F. Valeev, Clustered Low-Rank Tensor Format: Introduction and Application to Fast Construction of Hartree-Fock Exchange, J. Chem. Theory Comput, vol.12, issue.12, pp.5868-5880, 2016.

, A sparse matrix library, 2020.

W. Ma, S. Krishnamoorthy, O. Villa, K. Kowalski, and G. Agrawal, Optimizing tensor contraction expressions for hybrid CPU-GPU execution, Clust. Comput, vol.16, issue.1, pp.131-155, 2013.

OpenMP Application Program Interface, 2013.

PLAPACK: Parallel Linear Algebra PACKage.

C. Peng, J. A. Calvin, and E. F. Valeev, Coupled-Cluster Singles, Doubles and Perturbative Triples with Density Fitting Approximation for Massively Parallel Heterogeneous Platforms, Int. J. Quant. Chem, vol.119, issue.12, p.e25894, 2019.

C. Peng, J. A. Calvin, F. Pavošević, J. Zhang, and E. F. Valeev, Massively Parallel Implementation of Explicitly Correlated Coupled-Cluster Singles and Doubles Using TiledArray Framework, J. Phys. Chem. A, vol.120, issue.51, pp.10231-10244, 2016.

C. Peng, C. Lewis, X. Wang, M. Clement, F. Pavosevic et al., The Massively Parallel Quantum Chemistry Program (MPQC), 2018.

J. Pineau, Y. Robert, F. Vivien, and J. Dongarra, Matrix product on heterogeneous master-worker platforms, ACM SIGPLAN PPoPP, pp.53-62, 2008.

C. Riplinger, P. Pinski, U. Becker, E. F. Valeev et al., Sparse maps - A systematic infrastructure for reduced-scaling electronic structure methods. II. Linear scaling domain based pair natural orbital coupled cluster theory, J Chem Phys, vol.144, issue.2, 2016.

E. H. Rubensson and E. Rudberg, Locality-aware parallel block-sparse matrix-matrix multiplication using the Chunks and Tasks programming model, Parallel Computing, vol.57, pp.87-106, 2016.

E. H. Rubensson, E. Rudberg, and P. Sałek, A hierarchic sparse matrix data structure for large-scale Hartree-Fock/Kohn-Sham calculations, J. Computational Chemistry, vol.28, issue.16, pp.2531-2537, 2007.

ScaLAPACK: Scalable Linear Algebra PACKage.

M. D. Schatz, R. A. van de Geijn, and J. Poulson, Parallel matrix multiplication: A systematic journey, SIAM J. Scientific Computing, vol.38, issue.6, pp.748-781, 2016.

O. Schütt, P. Messmer, J. Hutter, and J. VandeVondele, GPU-Accelerated Sparse Matrix-Matrix Multiplication for Linear Scaling Density Functional Theory, pp.173-190, 2016.

I. Shavitt and R. J. Bartlett, Many-Body Methods in Chemistry and Physics: MBPT and Coupled-Cluster Theory. Cambridge Molecular Science, 2009.

I. Sivkov, P. Seewald, A. Lazzaro, and J. Hutter, DBCSR: A blocked sparse tensor algebra library, Parallel Computing: Technology Trends, Proceedings of the International Conference on Parallel Computing, vol.36, pp.331-340, 2019.

E. Solomonik, D. Matthews, J. R. Hammond, J. F. Stanton, and J. Demmel, A massively parallel tensor contraction framework for coupled-cluster computations, Journal of Parallel and Distributed Computing, vol.74, issue.12, pp.3176-3190, 2014.

S. Toledo, A survey of out-of-core algorithms in numerical linear algebra, External Memory Algorithms and Visualization, pp.161-180, 1999.

Top500: Top 500 Supercomputer Sites, 2019.

S. J. Treichler, Realm: Performance Portability through Composable Asynchrony, 2014.

R. A. van de Geijn and J. Watts, SUMMA: scalable universal matrix multiplication algorithm, Concurrency: Practice and Experience, vol.9, issue.4, pp.255-274, 1997.