A. Agarwal, D. Hsu, and S. Kale, Taming the monster: A fast and simple algorithm for contextual bandits, International Conference on Machine Learning, pp.1638-1646, 2014.

D. H. Ahn, J. Garlick, and M. Grondona, Flux: A Next-Generation Resource Management Framework for Large HPC Centers, 2014 43rd International Conference on Parallel Processing Workshops, pp.9-17, 2014.

K. Aida, Effect of Job Size Characteristics on Job Scheduling Performance, Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing. IPDPS '00/JSSPP '00, pp.1-17, 2000.

R. Akrour, M. Schoenauer, and M. Sebag, Preference-based Reinforcement Learning, Choice Models and Preference Learning Workshop at NIPS, vol.11, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00722744

P. Auer, N. Cesa-Bianchi, and P. Fischer, Finite-time Analysis of the Multiarmed Bandit Problem, Machine Learning, vol.47, pp.235-256, 2002.

P. Auer, N. Cesa-Bianchi, Y. Freund, and R. E. Schapire, The Nonstochastic Multi-Armed Bandit Problem, 2002.

A. Baños, On Pseudo-Games, Annals of Mathematical Statistics, pp.1932-1945, 1968.

L. Bottou, Stochastic learning, Advanced lectures on machine learning, 2004.

E. Breck, zymake: a computational workflow system for machine learning and natural language processing, Software Engineering, Testing, and Quality Assurance for Natural Language Processing, pp.5-13, 2008.

L. Breiman, Random Forests, Machine Learning, vol.45, pp.5-32, 2001.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone, Classification and regression trees, 1998.

L. Breiman, J. Friedman, C. J. Stone, and R. Olshen, Classification and regression trees, 1984.

S. Bubeck, Convex optimization: Algorithms and complexity, Foundations and Trends® in Machine Learning, vol.8, pp.231-357, 2015.

S. Bubeck and N. Cesa-Bianchi, Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems, Foundations and Trends® in Machine Learning, vol.5, issue.1, pp.1-122, 2012.

L. Buşoniu, D. Ernst, B. De Schutter, and R. Babuška, Approximate reinforcement learning: An overview, Adaptive Dynamic Programming And Reinforcement Learning (ADPRL), pp.1-8, 2011.

B. Bzeznik, O. Henriot, V. Reis, O. Richard, and L. Tavard, Nix as HPC package management system, 4th Workshop on HPC User Support Tools. Denver, United States, 2017.

N. Capit, G. Da Costa, and Y. Georgiou, A batch scheduler with high level components, CCGrid 2005. IEEE International Symposium on Cluster Computing and the Grid, vol.2, pp.776-783, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00005106

H. Casanova, A. Giersch, A. Legrand, M. Quinson, and F. Suter, Versatile, Scalable, and Accurate Simulation of Distributed Applications and Platforms, Journal of Parallel and Distributed Computing, vol.74, issue.10, pp.2899-2917, 2014.
URL : https://hal.archives-ouvertes.fr/hal-01017319

N. Cesa-Bianchi and G. Lugosi, Prediction, Learning, and Games, 2006.

S. Chaudhuri and A. Solar-Lezama, Smooth interpretation, ACM SIGPLAN Notices, vol.45, issue.6, pp.279-291, 2010.

S. Chiang, A. Arpaci-Dusseau, and M. K. Vernon, The Impact of More Accurate Requested Runtimes on Production Job Scheduling Performance, Job Scheduling Strategies for Parallel Processing, 2002.

Internet Systems Consortium, ISC license

DOE ASCAC, Synergistic Challenges in Data-Intensive Science and Exascale Computing, 2013.

E. Dolstra, The Purely Functional Software Deployment Model, 2006.

E. Dolstra, E. Visser, and M. de Jonge, Imposing a memory management discipline on software deployment, Proceedings of the 26th International Conference on Software Engineering (ICSE 2004), pp.583-592, 2004.

R. Duan, F. Nadeem, and J. Wang, A Hybrid Intelligent Method for Performance Modeling and Prediction of Workflow Activities in Grids, Cluster Computing and the Grid, 2009.

P. Dutot, M. Mercier, M. Poquet, and O. Richard, Batsim: a Realistic Language-Independent Resources and Jobs Management Systems Simulator, 20th Workshop on Job Scheduling Strategies for Parallel Processing, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01333471

P. Dutot, G. Mounié, and D. Trystram, Scheduling Parallel Tasks: Approximation Algorithms, in Handbook of Scheduling, 2004.

Y. Etsion and D. Tsafrir, A short survey of commercial cluster batch schedulers, Technical Report 2005-13, The Hebrew University of Jerusalem, 2005.

LSF (Load Sharing Facility) Features and Documentation

D. G. Feitelson, Parallel Workloads Archive: Cleaning Logs

D. G. Feitelson, Metrics for Parallel Job Scheduling and Their Convergence, Job Scheduling Strategies for Parallel Processing: 7th International Workshop, pp.188-205, 2001.

D. G. Feitelson, Resampling with Feedback: A New Paradigm of Using Workload Data for Performance Evaluation, Euro-Par 2016: Parallel Processing: 22nd International Conference on Parallel and Distributed Computing, pp.3-21, 2016.

D. G. Feitelson, Workload Modeling for Computer Systems Performance Evaluation, 1st ed., 2015.

D. G. Feitelson, Workload Modeling for Performance Evaluation, Performance Evaluation of Complex Systems: Techniques and Tools: Performance 2002 Tutorial Lectures, pp.114-141, 2002.

D. G. Feitelson and L. Rudolph, Metrics and benchmarking for parallel job scheduling, Job Scheduling Strategies for Parallel Processing, pp.1-24, 1998.

D. G. Feitelson and L. Rudolph, Toward convergence in job schedulers for parallel supercomputers, Workshop on Job Scheduling Strategies for Parallel Processing, pp.1-26, 1996.

D. G. Feitelson, L. Rudolph, and U. Schwiegelshohn, Parallel Job Scheduling: A Status Report, Proceedings of the 10th International Conference on Job Scheduling Strategies for Parallel Processing. JSSPP'04, pp.1-16, 2005.

D. G. Feitelson, D. Tsafrir, and D. Krakov, Experience with using the Parallel Workloads Archive, Journal of Parallel and Distributed Computing, vol.74, pp.2967-2982, 2014.

M. Feldman, Probably approximately correct, 2014.

Message Passing Interface Forum, MPI: A Message-Passing Interface Standard, 1994.

E. Frachtenberg and D. G. Feitelson, Pitfalls in Parallel Job Scheduling Evaluation, Job Scheduling Strategies for Parallel Processing, pp.257-282, 2005.

Job Scheduling Strategies for Parallel Processing: 14th International Workshop, 2009.

J. Friedman, T. Hastie, and R. Tibshirani, The elements of statistical learning, Springer series in statistics, vol.1, 2001.

E. Gaussier, J. Lelong, V. Reis, and D. Trystram, Online Tuning of EASY-Backfilling using Queue Reordering Policies, IEEE Transactions on Parallel and Distributed Systems, 2018.
URL : https://hal.archives-ouvertes.fr/hal-01963216

E. Gaussier, D. Glesser, V. Reis, and D. Trystram, Improving Backfilling by Using Machine Learning to Predict Running Times, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. SC '15, pp.64:1-64:10, 2015.
URL : https://hal.archives-ouvertes.fr/hal-01221186

Y. Georgiou, Resource and Job Management in High Performance Computing, 2010.
URL : https://hal.archives-ouvertes.fr/tel-01499598

Y. Georgiou, D. Glesser, K. Rzadca, and D. Trystram, A Scheduler-Level Incentive Mechanism for Energy Efficiency in HPC, 15th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), 2015.
URL : https://hal.archives-ouvertes.fr/hal-01230295

R. Gibbons, A historical application profiler for use by parallel schedulers, Job Scheduling Strategies for Parallel Processing, 1997.

J. P. Inala, S. Gao, S. Kong, and A. Solar-Lezama, REAS: Combining Numerical Optimization with SAT Solving, 2018.

D. Jackson, Q. Snell, and M. Clement, Core algorithms of the Maui scheduler, Job Scheduling Strategies for Parallel Processing, 2001.

T. Joachims, Optimizing search engines using clickthrough data, Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pp.133-142, 2002.

T. Joachims, Training linear SVMs in linear time, Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining, pp.217-226, 2006.

S. Kullback and R. A. Leibler, On Information and Sufficiency, The Annals of Mathematical Statistics, vol.22, pp.79-86, 1951.

A. Lazaric, M. Ghavamzadeh, and R. Munos, Analysis of a classification-based policy iteration algorithm, ICML-27th International Conference on Machine Learning, pp.607-614, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00482065

J. Lelong, V. Reis, and D. Trystram, Tuning EASY-Backfilling Queues, 21st Workshop on Job Scheduling Strategies for Parallel Processing. 31st IEEE International Parallel & Distributed Processing Symposium, 2017.
URL : https://hal.archives-ouvertes.fr/hal-01522459

J. Y.-T. Leung, Handbook of scheduling: algorithms, models, and performance analysis, 2004.

D. A. Lifka, The ANL/IBM SP Scheduling System, Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing. IPPS '95, pp.295-303, 1995.

D. A. Lifka, The ANL/IBM SP Scheduling System, Job Scheduling Strategies for Parallel Processing, vol.949, 1995.

A. M. Lindsay, M. Galloway-Carson, C. R. Johnson, D. P. Bunde, and V. J. Leung, Backfilling with guarantees made as jobs arrive, Concurrency and Computation: Practice and Experience, vol.25, 2013.

H. R. Maei, C. Szepesvári, S. Bhatnagar, and R. S. Sutton, Toward off-policy learning control with function approximation, Proceedings of the 27th International Conference on Machine Learning (ICML-10), pp.719-726, 2010.

A. Matsunaga and J. Fortes, On the Use of Machine Learning to Predict the Time and Resources Consumed by Applications, Cluster, Cloud and Grid Computing, 2010.

C. L. Mendes and D. A. Reed, Integrated compilation and scalability analysis for parallel systems, Parallel Architectures and Compilation Techniques, 1998.

T. Miu and P. Missier, Predicting the Execution Time of Workflow Activities Based on Their Input Features, High Performance Computing, Networking, Storage and Analysis (SCC), 2012 SC Companion, pp.64-72, 2012.

A. W. Mu'alem and D. G. Feitelson, Utilization, predictability, workloads, and user runtime estimates in scheduling the IBM SP2 with backfilling, IEEE Transactions on Parallel and Distributed Systems, vol.12, issue.6, 2001.

K. Muller and T. Vignaux, SimPy: Simulating Systems in Python, ONLamp.com Python Devcenter, 2003.

F. Nadeem and T. Fahringer, Predicting the Execution Time of Grid Workflow Applications Through Local Learning, Proceedings of the Conference on High Performance Computing Networking, Storage and Analysis. SC '09, vol.33, p.12, 2009.

Y. Ngoko, D. Trystram, V. Reis, and C. Cerin, An Automatic Tuning System for Solving NP-Hard Problems in Clouds, 2016 IEEE International Parallel and Distributed Processing Symposium Workshops, IPDPS Workshops, pp.1443-1452, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01427255

A. Nissimov, Locality and its usage in parallel job runtime distribution modeling using HMM, 2006.

A. Nissimov and D. G. Feitelson, Probabilistic Backfilling, Job Scheduling Strategies for Parallel Processing: 13th International Workshop, pp.102-115, 2007.

A. Nissimov and D. G. Feitelson, Probabilistic Backfilling, Job Scheduling Strategies for Parallel Processing, 2008.

B. Nitzberg, J. M. Schopf, and J. Jones, PBS Pro: Grid computing and scheduling attributes, Grid resource management, 2004.

F. Orabona, K. Crammer, and N. Cesa-Bianchi, A Generalized Online Mirror Descent with Applications to Classification and Regression, 2013.

A. Palczewska, J. Palczewski, R. Marchese Robinson, and D. Neagu, Interpreting random forest classification models using a feature contribution method, 2013.

PBS Pro 13.0 administrator's guide

F. Pedregosa, G. Varoquaux, and A. Gramfort, Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, vol.12, pp.2825-2830, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00650905

D. Perkovic and P. J. Keleher, Randomization, Speculation, and Adaptation in Batch Schedulers, Supercomputing, ACM/IEEE 2000 Conference, pp.7-7, 2000.

W. B. Powell, Approximate Dynamic Programming: Solving the Curses of Dimensionality, Series in Probability and Statistics, 2007.

L. Rabiner, A tutorial on hidden Markov models and selected applications in speech recognition, Proceedings of the IEEE, 1989.

E. R. Rodrigues, R. L. F. Cunha, M. A. S. Netto, and M. Spriggs, Helping HPC Users Specify Job Memory Requirements via Machine Learning, Proceedings of the Third International Workshop on HPC User Support Tools. HUST '16, pp.6-13, 2016.

S. Ross, P. Mineiro, and J. Langford, Normalized Online Learning, Uncertainty in Artificial Intelligence, 2013.

J. M. Schopf and F. Berman, Using Stochastic Intervals to Predict Application Behavior on Contended Resources, International Symposium on Parallel Architectures, Algorithms, and Networks, 1999.

U. Schwiegelshohn and R. Yahyapour, Analysis of First-come-first-serve Parallel Job Scheduling, Proceedings of the Ninth Annual ACM-SIAM Symposium on Discrete Algorithms. SODA '98, pp.629-638, 1998.

J. Skovira, W. Chan, H. Zhou, and D. A. Lifka, The EASY-LoadLeveler API Project, Proceedings of the Workshop on Job Scheduling Strategies for Parallel Processing. IPPS '96, pp.41-47, 1996.

SLURM Online documentation

W. Smith, I. Foster, and V. Taylor, Predicting Application Run Times with Historical Information, Journal of Parallel and Distributed Computing, 2004.

S. Srinivasan, R. Kettimuthu, V. Subramani, and P. Sadayappan, Characterization of backfilling strategies for parallel job scheduling, Parallel Processing Workshops, pp.514-519, 2002.

G. Staples, TORQUE Resource Manager, 2006.

V. Stodden, F. Leisch, and R. Peng, Implementing reproducible research, 2014.

A. Streit, The self-tuning dynP job-scheduler, Proceedings 16th International Parallel and Distributed Processing Symposium, 8 pp., 2002.

R. S. Sutton and A. G. Barto, Introduction to Reinforcement Learning, 1st ed., 1998.

O. The and . Simulator,

W. R. Thompson, On the Likelihood that One Unknown Probability Exceeds Another in View of the Evidence of Two Samples, Biometrika, vol.25, pp.285-294, 1933.

M. Tokic, Adaptive ε-greedy Exploration in Reinforcement Learning Based on Value Differences, Proceedings of the 33rd Annual German Conference on Advances in Artificial Intelligence, pp.203-210, 2010.

Top500 online ranking

TORQUE Resource Manager website

D. Tsafrir and D. G. Feitelson, Instability in parallel job scheduling simulation: the role of workload flurries, Proceedings 20th IEEE International Parallel Distributed Processing Symposium, p.10, 2006.

D. Tsafrir, Y. Etsion, and D. G. Feitelson, Backfilling using runtime predictions rather than user estimates, Tech. Rep. TR, vol.5, 2005.

D. Tsafrir, Y. Etsion, and D. G. Feitelson, Backfilling using system-generated predictions rather than user runtime estimates, 2007.

D. Tsafrir, Y. Etsion, and D. G. Feitelson, Modeling User Runtime Estimates, Job Scheduling Strategies for Parallel Processing, 2005.

I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun, Support vector machine learning for interdependent and structured output spaces, Proceedings of the twenty-first international conference on Machine learning, p.104, 2004.

Y. Ukidave, X. Li, and D. Kaeli, Mystic: Predictive Scheduling for GPU Based Cloud Servers Using Machine Learning, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.353-362, 2016.

A. Vishnu, H. V. Dam, N. R. Tallent, D. J. Kerbyson, and A. Hoisie, Fault Modeling of Extreme Scale Applications Using Machine Learning, 2016 IEEE International Parallel and Distributed Processing Symposium (IPDPS), pp.222-231, 2016.

C. Watkins, Learning from Delayed Rewards, 1989.

H. Xu, Robust decision making and its applications in machine learning, 2009.

A. B. Yoo, M. A. Jette, and M. Grondona, SLURM: Simple Linux Utility for Resource Management, Job Scheduling Strategies for Parallel Processing, 2003.

Y. Zhang, W. Sun, and Y. Inoguchi, Predicting Running Time of Grid Tasks based on CPU Load Predictions, Grid Computing, 7th IEEE/ACM International Conference on, pp.286-292, 2006.

Role of the Resource and Job Management System (RJMS). The RJMS manages the platform by giving users restricted access to some resource(s) during some amount of time. Both the resource request and allocated time are the result of the RJMS's decision, which is based on the job request.

Jobs are submitted at any time by users, scheduled by the RJMS, and then executed.

Illustration of the scheduling process (time steps 1 to 4) using a simple 'first-come, first-served' policy. This policy waits for space to be available for the oldest job in the waiting queue, and starts it. The characteristics of the three job submissions are given in the table above.

Example loss function L, plotted with respect to the difference of its second and third parameters, $f(x_j) - p_j$ (the prediction error).

The decision to launch job 1 is taken at $t_0$. The second ready job (2) cannot be executed since there are not enough CPUs/nodes left. Thus, job 3 can also be launched at $t_0$. The decision to launch job 2 is finally taken when jobs 1 and 3 complete.

Scatter plot of heuristic's relative performance between the MetaCentrum and SDSC-BLUE logs.

Experimental cumulative distribution functions of prediction errors obtained using the Curie log.

Experimental cumulative distribution functions of predicted values obtained using the Curie log.

AvgWait obtained for the 7 main queue policies with FCFS backfilling for 150 generated weeks on the KTH-SP2 trace, shown first in absolute value and then normalized with respect to EASY-FCFS-FCFS.

Comparison of the various predictive approaches.

AVEbsld performance of EASY (using requested times) and EASY-CLAIRVOYANT (using actual running times). Values in parentheses show the corresponding decrease in AVEbsld.

Features extracted from the SWF data, for job j, belonging to user k.

Weighting factors considered for training the model. The constants are chosen to ensure positivity of the weights with typical running times and resource requests in the HPC domain. Logarithms are used to alleviate the high range produced by ratios.

Workload logs used in the simulations.

Combinations of considered parameter values of the loss function.

For predictive techniques, only the best and the worst AVEbsld are given. The best non-clairvoyant heuristic triples are outlined in bold.

MAE and E-Loss for different prediction techniques. All values are in seconds.

AVEbsld performance of the heuristic triples resulting from cross validation. Values in parentheses show the AVEbsld reduction obtained with respect to EASY.

AvgWait performance of EASY-EXP-EXP and EASY-SQF-SQF on the original CTC-SP2 and SDSC-SP2 traces, in seconds.

AvgWait and MaxWait performance of EASY-SPF-SPF and EASY-FCFS-FCFS on the original CTC-SP2 trace, in seconds.

Workload logs used in the simulations.

Workload logs used in the simulations.

Hyper-parameter leave-one-out selection for

Average total waiting time reduction of fixed policies with respect to EASY(FCFS). The precise definition of the value reported is provided in Eq. (6.11).

Average cumulative waiting time improvement of the policies used with respect to EASY(FCFS).
