Argonne Leadership Computing Facility, Mira log traces.

B. Baker and J. Schwarz, Shelf algorithms for two-dimensional packing problems, SIAM Journal on Computing, vol.12, issue.3, pp.508-525, 1983.

B. S. Baker, E. G. Coffman, and R. L. Rivest, Orthogonal packings in two dimensions, SIAM Journal on Computing, vol.9, issue.4, pp.846-855, 1980.

L. Bautista-Gomez and F. Cappello, Detecting silent data corruption through data dynamic monitoring for scientific applications, PPoPP, 2014.

J. Bruno, P. Downey, and G. N. Frederickson, Sequencing tasks with exponential service times to minimize the expected flow time or makespan, J. ACM, vol.28, issue.1, pp.100-113, 1981.

K. M. Chandy and P. F. Reynolds, Scheduling partially ordered tasks with probabilistic execution times, SIGOPS Oper. Syst. Rev, vol.9, issue.5, pp.169-177, 1975.

B. Chen and A. P. Vestjens, Scheduling on identical machines: How good is LPT in an on-line setting, Operations Research Letters, vol.21, issue.4, pp.165-169, 1997.

C. Chen, G. Eisenhauer, M. Wolf, and S. Pande, LADR: Low-cost application-level detector for reducing silent output corruptions, HPDC, pp.156-167, 2018.

Z. Chen, Online-ABFT: An online algorithm based fault tolerance scheme for soft error detection in iterative methods, SIGPLAN Not, vol.48, issue.8, pp.167-176, 2013.

E. G. Coffman, M. R. Garey, D. S. Johnson, and R. E. Tarjan, Performance bounds for level-oriented two-dimensional packing algorithms, SIAM J. Comput, vol.9, issue.4, pp.808-826, 1980.

J. Csirik and G. J. Woeginger, Shelf algorithms for on-line strip packing, Information Processing Letters, vol.63, issue.4, pp.171-175, 1997.

J. Csirik and G. J. Woeginger, On-line packing and covering problems, Online Algorithms: The State of the Art, pp.147-177, 1998.

D. G. Feitelson, L. Rudolph, U. Schwiegelshohn, K. C. Sevcik, and P. Wong, Theory and practice in parallel job scheduling, JSSPP, pp.1-34, 1997.

A. Feldmann, M. Kao, J. Sgall, and S. Teng, Optimal on-line scheduling of parallel jobs with dependencies, Journal of Combinatorial Optimization, vol.1, issue.4, pp.393-411, 1998.

A. Feldmann, J. Sgall, and S. Teng, Dynamic scheduling on parallel machines, Theoretical Computer Science, vol.130, issue.1, pp.49-72, 1994.

M. R. Garey and R. L. Graham, Bounds for multiprocessor scheduling with resource constraints, SIAM J. Comput, vol.4, issue.2, pp.187-200, 1975.

M. R. Garey and D. S. Johnson, Computers and Intractability, a Guide to the Theory of NP-Completeness, 1979.

E. Gaussier, J. Lelong, V. Reis, and D. Trystram, Online tuning of EASY-backfilling using queue reordering policies, IEEE Transactions on Parallel and Distributed Systems, vol.29, issue.10, pp.2304-2316, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216

A. Goel and P. Indyk, Stochastic load balancing and related problems, Proceedings of the 40th Annual Symposium on Foundations of Computer Science (FOCS), 1999.

P. Guhur, H. Zhang, T. Peterka, E. Constantinescu, and F. Cappello, Lightweight and accurate silent data corruption detection in ordinary differential equation solvers, Euro-Par, 2016.

X. Han, K. Iwama, D. Ye, and G. Zhang, Strip packing vs. bin packing, Algorithmic Aspects in Information and Management, pp.358-367, 2007.

T. Herault and Y. Robert, Fault-Tolerance Techniques for High-Performance Computing, Computer Communications and Networks, 2015.

K. Huang and J. A. Abraham, Algorithm-based fault tolerance for matrix operations, IEEE Trans. Comput, vol.33, issue.6, pp.518-528, 1984.

J. L. Hurink and J. J. Paulus, Online algorithm for parallel job scheduling and strip packing, Approximation and Online Algorithms, pp.67-74, 2008.

D. B. Jackson, Q. Snell, and M. J. Clement, Core Algorithms of the Maui Scheduler, JSSPP, pp.87-102, 2001.

K. Jansen, A (3/2+ε) approximation algorithm for scheduling moldable and non-moldable parallel tasks, SPAA, pp.224-235, 2012.

B. Johannes, Scheduling parallel jobs to minimize the makespan, J. of Scheduling, vol.9, issue.5, pp.433-452, 2006.

J. Kleinberg, Y. Rabani, and E. Tardos, Allocating bandwidth for bursty connections, Proceedings of the 29th Annual ACM Symposium on Theory of Computing (STOC), pp.664-673, 1997.

K. Li, Analysis of the list scheduling algorithm for precedence constrained parallel tasks, Journal of Combinatorial Optimization, vol.3, issue.1, pp.73-88, 1999.

D. A. Lifka, The ANL/IBM SP Scheduling System, JSSPP, pp.295-303, 1995.

A. Lodi, S. Martello, and M. Monaci, Two-dimensional packing problems: A survey, European Journal of Operational Research, vol.141, issue.2, pp.241-252, 2002.

M. Snir, Addressing failures in exascale computing, Int. J. High Perform. Comput. Appl, vol.28, issue.2, pp.129-173, 2014.

A. W. Mu'alem and D. G. Feitelson, Utilization, Predictability, Workloads, and User Runtime Estimates in Scheduling the IBM SP2 with Backfilling, IEEE Trans. Parallel Distrib. Syst, vol.12, issue.6, pp.529-543, 2001.

E. Naroska and U. Schwiegelshohn, On an on-line scheduling problem for parallel jobs, Inf. Process. Lett, vol.81, issue.6, pp.297-304, 2002.

J. Niño-Mora, Stochastic scheduling, Encyclopedia of Optimization, pp.3818-3824, 2009.

T. O'Gorman, The effect of cosmic rays on the soft error rate of a DRAM at ground level, IEEE Trans. Electron Devices, vol.41, issue.4, pp.553-557, 1994.

M. L. Pinedo, Scheduling: Theory, Algorithms, and Systems, 2008.

D. B. Shmoys, J. Wein, and D. P. Williamson, Scheduling parallel machines on-line, SIAM J. Comput, vol.24, issue.6, pp.1313-1331, 1995.

J. Skovira, W. Chan, H. Zhou, and D. A. Lifka, The EASY - LoadLeveler API Project, JSSPP, pp.41-47, 1996.

S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, Characterization of backfilling strategies for parallel job scheduling, International Conference on Parallel Processing Workshop, 2002.

G. Staples, TORQUE resource manager, Proceedings of the ACM/IEEE Conference on Supercomputing, 2006.

J. Turek, J. L. Wolf, and P. S. Yu, Approximate algorithms scheduling parallelizable tasks, SPAA, 1992.

R. R. Weber, Scheduling jobs with stochastic processing requirements on parallel machines to minimize makespan or flowtime, J Appl Probab, vol.19, issue.1, pp.167-182, 1982.

G. Weiss and M. Pinedo, Scheduling tasks with exponential service times on non-identical processors to minimize various cost functions, J Appl Probab, vol.17, issue.1, pp.187-202, 1980.

A. K. Wong and A. M. Goscinski, Evaluating the EASY-backfill job scheduling of static workloads on clusters, CLUSTER, 2007.

P. Wu, C. Ding, L. Chen, F. Gao, T. Davies et al., Fault tolerant matrix-matrix multiplication: Correcting soft errors online, ScalA'11, pp.25-28, 2011.

D. Ye, X. Han, and G. Zhang, A note on online strip packing, Journal of Combinatorial Optimization, vol.17, issue.4, pp.417-423, 2009.

A. B. Yoo, M. A. Jette, and M. Grondona, SLURM: Simple Linux Utility for Resource Management, JSSPP, pp.44-60, 2003.

J. Ziegler, M. Nelson, J. Shell, R. Peterson, C. Gelderloos et al., Cosmic ray soft error rates of 16-Mb DRAM memory chips, IEEE Journal of Solid-State Circuits, vol.33, issue.2, pp.246-252, 1998.