S. Alowayyed, D. Groen, P. V. Coveney, and A. G. Hoekstra, Multiscale computing in the exascale era, Journal of Computational Science, pp.15-25, 2017.

F. Chen, W. Ge, L. Guo, X. He, B. Li et al., Multi-scale HPC system for multi-scale discrete simulation-Development and application of a supercomputer with 1 Petaflops peak performance in single precision, pp.332-335, 2009.

J. Lüttgau, S. Snyder, P. Carns, J. M. Wozniak, J. Kunkel et al., Toward Understanding I/O Behavior in HPC Workflows, 2018 IEEE/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems (PDSW-DISCS), 2018.

European Technology Platform for High Performance Computing, ETP4HPC Strategic Research Agenda: Achieving HPC Leadership in Europe, 2013.

European Technology Platform for High Performance Computing, 2015.

Strategic Research Agenda 2017: European Multi-annual HPC Technology Roadmap, 2017.

Eurolab-4-HPC Long-Term Vision on High-Performance Computing, 2017.

The Opportunities and Challenges of Exascale Computing: Summary Report of the Advanced Scientific Computing Advisory Committee (ASCAC) Subcommittee, 2010.

Exascale Programming Challenges: Report of the 2011 Workshop on Exascale Programming Challenges, 2011.

Preliminary Conceptual Design for an Exascale Computing Initiative, 2014.

Top Ten Exascale Research Challenges: DOE ASCAC Subcommittee Report, 2014.

F. Cappello, A. Geist, W. Gropp, L. Kale, B. Kramer et al., Toward Exascale Resilience, 2009.

F. Cappello, A. Geist, W. Gropp, S. Kale, B. Kramer et al., Toward Exascale Resilience: 2014 Update, 2014.

A. Avizienis, J. Laprie, B. Randell, and C. Landwehr, Basic concepts and taxonomy of dependable and secure computing, IEEE Transactions on Dependable and Secure Computing, pp.11-33, 2004.

S. Mukherjee, Architecture Design for Soft Errors, 2008.

A. Geist, How To Kill A Supercomputer: Dirty Power, Cosmic Rays, and Bad Solder, IEEE Spectrum, 2013.

Intel Xeon Processor E7 Family: Reliability, Availability, and Serviceability: Advanced data integrity and resiliency support for mission-critical deployments, 2011.

Intel Xeon Processor E7-8800/4800/2800 v2 Product Family Based Platform Reliability, Availability and Serviceability (RAS) Integration and Validation Guide, 2014.

Intel Corporation, New Reliability, Availability, and Serviceability (RAS) Features in the Intel Xeon Processor Family, 2017.

Reliability, Availability, & Serviceability (RAS) of Intel Infrastructure Management Technologies Feature Support. Feature Brief, 2017.

Intel Xeon Scalable Platform. Product Brief, 2017.

Intel Corporation, Intel Product Quick Reference Matrix - Servers, 2018.

Intel Xeon Processor Scalable Family. Datasheet, Volume One: Electrical, 2018.

ARM Reliability, Availability, and Serviceability (RAS) Specification ARMv8, for the ARMv8-A architecture profile, 2017.

Ampere Computing, Ampere 64-bit Arm Processor. Product brief, 2018.

Bull/Atos Technologies, Bullion S4: the most advanced workspace for fast data. Fact sheet, 2015.

Atos, Bull Sequana S series. Technical specification, 2017.

Dell Inc., Advanced Reliability for Intel Xeon Processors on Dell PowerEdge Servers, Technical White Paper, 2010.

Dell Inc., PowerEdge R930, 2016.

Dell Inc., Five Ways to Ensure Reliability, Availability, and Serviceability in Your Enterprise Environment, 2016.

Hewlett-Packard Company, Avoiding server downtime from hardware errors in system memory with HP Memory Quarantine, 2012.

IBM Corp., Reliability, Availability, and Serviceability Features of the IBM eX5 Portfolio.

Lenovo Press, Lenovo X6 Server RAS Features, 2018.

RAS Features of the Lenovo ThinkSystem SR950 and SR850, 2018.

Lenovo, "Always-on" reliability on x86, 2018.

Oracle Server X5-4 System Architecture, 2016.

Lenovo Press, Five Highlights of the ThinkSystem SR950, 2018.

Oracle Server X7-2 and Oracle Server X7-2L System Architecture. White paper, 2017.

Oracle Server X7-2. Data sheet, 2017.

Oracle Server X7-8 Eight-Socket Configuration. Data sheet, 2017.

Memory RAS Configuration. User's guide, 2017.

Xilinx, Device Reliability Report. UG116 (v10.9), 2018.

Xilinx, 7 Series FPGAs Memory Resources. UG473 (v1.12), 2016.

Intel/Altera, Intel Stratix 10 Embedded Memory User Guide, v18.1, 2018.

Intel/Altera, AN 737: SEU Detection and Recovery in Intel Arria 10 Devices, 2018.

Intel/Altera, AN 711: Power Reduction Features in Intel Arria 10 Devices, 2018.

Intel/Altera, Reliability Report (MNL-1085), 2017.

D. Jauk, D. Yang, and M. Schulz, Predicting Faults in High Performance Computing Systems: An In-Depth Survey of the State-of-the-Practice, The International Conference for High Performance Computing, Networking, Storage, and Analysis (SC), 2019.

A. Frank, D. Yang, A. Brinkmann, M. Schulz, and T. Süss, Reducing False Node Failure Predictions in HPC, The 26th IEEE International Conference on High Performance Computing, Data, and Analytics, 2019.

A. Rico, J. A. Joao, C. Adeniyi-Jones, and E. Van Hensbergen, ARM HPC Ecosystem and the Reemergence of Vectors, Proceedings of the Computing Frontiers Conference, 2017.

J. Weloli, S. Bilavarn, M. De Vries, S. Derradji et al., Efficiency modeling and exploration of 64-bit ARM compute nodes for exascale, Microprocessors and Microsystems, vol.53, 2017.
URL: https://hal.archives-ouvertes.fr/hal-01586191