A. Umut, G. Acar, M. Blelloch, S. K. Fluet, and . Mullerand-ram-raghunathan, Coupling Memory and Computation for Locality Management, Summit on Advances in Programming Languages (SNAPL), 2015.

A. Umut, G. E. Acar, R. D. Blelloch, and . Blumofe, The data locality of work stealing, Theory of Computing Systems (TOCS), vol.35, pp.321-347, 2002.

A. Umut, A. Acar, M. Charguéraud, and . Rainey, Scheduling Parallel Programs by Work Stealing with Private Deques, Proceedings of the 19th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP '13), 2013.

A. Umut, A. Acar, M. Charguéraud, and . Rainey, Oracleguided scheduling for controlling granularity in implicitly parallel languages, Journal of Functional Programming, vol.26, p.23, 2016.

S. Agarwal, R. Barik, D. Bonachea, V. Sarkar, R. K. Shyamasundar et al., Deadlock-free scheduling of X10 computations with bounded resources, SPAA 2007: Proceedings of the 19th Annual ACM Symposium on Parallelism in Algorithms and Architectures, pp.229-240, 2007.

S. Nimar, R. D. Arora, C. Blumofe, and . Greg-plaxton, Thread scheduling for multiprogrammed multiprocessors, Proceedings of the tenth annual ACM symposium on Parallel algorithms and architectures (SPAA '98), pp.119-129, 1998.

S. Nimar, R. D. Arora, C. Blumofe, and . Greg-plaxton, Thread Scheduling for Multiprogrammed Multiprocessors. Theory of Computing Systems, vol.34, pp.115-144, 2001.

G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and J. Shun, Internally deterministic parallel algorithms can be fast, PPoPP '12, pp.181-192, 2012.

G. E. Blelloch, J. T. Fineman, P. B. Gibbons, and H. Simhadri, Scheduling irregular parallel computations on hierarchical caches, Proceedings of the 23rd ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '11, pp.355-366, 2011.

E. Guy, P. B. Blelloch, and . Gibbons, Effectively sharing a cache among threads, SPAA, 2004.

G. E. Blelloch, P. B. Gibbons, and Y. Matias, Provably efficient scheduling for languages with fine-grained parallelism, J. ACM, vol.46, pp.281-321, 1999.

D. Robert, C. E. Blumofe, and . Leiserson, Space-Efficient Scheduling of Multithreaded Computations, SIAM J. Comput, vol.27, pp.202-229, 1998.

D. Robert, C. E. Blumofe, and . Leiserson, Scheduling multithreaded computations by work stealing, J. ACM, vol.46, pp.720-748, 1999.

R. P. Brent, The parallel evaluation of general arithmetic expressions, J. ACM, vol.21, pp.201-206, 1974.

F. , W. Burton, and M. R. Sleep, Executing functional programs on a virtual tree of processors. In Functional Programming Languages and Computer Architecture (FPCA '81), pp.187-194, 1981.

P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra et al., X10: an object-oriented approach to non-uniform cluster computing, Proceedings of the 20th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications (OOPSLA '05), pp.519-538, 2005.

D. Chase and Y. Lev, Dynamic circular work-stealing deque, SPAA '05, pp.21-28, 2005.

A. Rezaul, V. Chowdhury, and . Ramachandran, Cacheefficient dynamic programming algorithms for multicores, Proc. 20th ACM Symposium on Parallelism in Algorithms and Architectures, pp.207-216, 2008.

A. Duran, J. Corbalan, and E. Ayguade, An adaptive cut-off for task parallelism, 2008 SC-International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-11, 2008.

D. L. Eager, J. Zahorjan, and E. D. Lazowska, Speedup versus efficiency in parallel systems, IEEE Transactions on Computing, vol.38, pp.408-423, 1989.

M. Feeley, A Message Passing Implementation of Lazy Task Creation, Parallel Symbolic Computing, pp.94-107, 1992.

M. Feeley, Polling efficiently on stock hardware, Proceedings of the conference on Functional programming languages and computer architecture (FPCA '93, pp.179-187, 1993.

M. Felleisen and D. P. Friedman, Control Operators, the SECD-Machine, and the Lambda-Calculus, Formal Description of Programming Concepts-III, pp.193-219, 1987.

M. Fluet, M. Rainey, J. Reppy, and A. Shaw, Implicitly threaded parallelism in Manticore, Journal of Functional Programming, vol.20, pp.1-40, 2011.

M. Fluet, M. Rainey, J. H. Reppy, and A. Shaw, Implicitly-threaded parallelism in Manticore, ICFP, pp.119-130, 2008.

M. Frigo, C. E. Leiserson, and K. H. Randall, The Implementation of the Cilk-5 Multithreaded Language, PLDI, pp.212-223, 1998.

K. E. Seth-copen-goldstein, D. Schauser, and . Culler, Enabling Primitives for Compiling Parallel Languages, Third Workshop on Languages, Compilers, and Run-Time Systems for Scalable Computers, 1995.

K. E. Seth-copen-goldstein, D. E. Schauser, and . Culler, Lazy threads: Implementing a fast parallel call, J. Parallel and Distrib. Comput, vol.37, pp.5-20, 1996.

J. Greiner and G. E. Blelloch, A Provably Time-efficient Parallel Implementation of Full Speculation, ACM Transactions on Programming Languages and Systems, vol.21, issue.2, pp.240-285, 1999.

A. Guatto, S. Westrick, R. Raghunathan, and U. Fluet, Hierarchical Memory Management for Mutable State, ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPOPP), 2018.

H. Robert and . Halstead, Implementation of Multilisp: Lisp on a Multiprocessor, Proceedings of the 1984 ACM Symposium on LISP and functional programming (LFP '84), pp.9-17, 1984.

E. A. Hauck and B. A. Dent, Burroughs' B6500/B7500 Stack Mechanism, Spring Joint Computer Conference (AFIPS '68 (Spring), pp.245-251, 1968.

T. Hiraishi, M. Yasugi, S. Umatani, and T. Yuasa, Backtracking-based load balancing, Proceedings of the 2009 ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming, vol.44, pp.55-64, 2009.

L. Huelsbergen, J. R. Larus, and A. Aiken, Using the run-time sizes of data structures to guide parallel-thread creation, Proceedings of the 1994 ACM conference on LISP and functional programming (LFP '94, pp.79-90, 1994.

M. Shams, V. Imam, and . Sarkar, Habanero-Java library: a Java 8 framework for multicore programming, 2014 International Conference on Principles and Practices of Programming on the Java Platform Virtual Machines, Languages and Tools, PPPJ '14, pp.75-86, 2014.

. Intel, Intel Threading Building Blocks, 2011.

S. Iwasaki and K. Taura, A static cut-off for task parallel programs, Proceedings of the 2016 International Conference on Parallel Architectures and Compilation, pp.139-150, 2016.

D. Lea, A Java fork/join framework, Proceedings of the ACM 2000 conference on Java Grande (JAVA '00, pp.36-43, 2000.

I. Lee, C. E. Leiserson, T. B. Schardl, Z. Zhang, and J. Sukha, On-the-Fly Pipeline Parallelism, TOPC, vol.2, pp.1-17, 2015.

I. Lee, S. Boyd-wickizer, Z. Huang, and C. E. Leiserson, Using Memory Mapping to Support Cactus Stacks in Work-stealing Runtime Systems, Proceedings of the 19th International Conference on Parallel Architectures and Compilation Techniques (PACT '10), pp.411-420, 2010.

D. Leijen, W. Schulte, and S. Burckhardt, The design of a task parallel library, Proceedings of the 24th ACM SIGPLAN conference on Object Oriented Programming Systems Languages and Applications (OOPSLA '09, pp.227-242, 2009.

P. Lopez, M. Hermenegildo, and S. Debray, A methodology for granularity-based control of parallelism in logic programs, Journal of Symbolic Computation, vol.21, pp.715-734, 1996.

S. Marlow, Parallel and Concurrent Programming in Haskell, 2013.

E. Mohr, D. A. Kranz, and R. H. Halstead, Lazy task creation: a technique for increasing the granularity of parallel programs, IEEE Transactions on Parallel and Distributed Systems, vol.2, pp.264-280, 1991.

J. Girija, G. E. Narlikar, and . Blelloch, Space-Efficient Scheduling of Nested Parallelism, ACM Transactions on Programming Languages and Systems, vol.21, 1999.

, OpenMP Architecture Review Board

, OpenMP Application Program Interface

J. Pehoushek and J. Weening, Low-cost process creation and dynamic partitioning in Qlisp, Parallel Lisp: Languages and Systems, vol.441, pp.182-199, 1990.

R. Raghunathan, S. K. Muller, U. A. Acar, and G. Blelloch, Hierarchical Memory Management for Parallel Programs, ICFP 2016, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01416237

D. Sanchez, R. M. Yoo, and C. Kozyrakis, Flexible architectural support for fine-grain scheduling, Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems (ASPLOS '10), pp.311-322, 2010.

J. Shun, G. E. Blelloch, J. T. Fineman, P. B. Gibbons, A. Kyrola et al., Brief Announcement: The Problem Based Benchmark Suite, Proceedings of the Twenty-fourth Annual ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '12, pp.68-70, 2012.

K. C. Sivaramakrishnan, L. Ziarek, and S. Jagannathan, MultiMLton: A multicore-aware runtime for standard ML, Journal of Functional Programming FirstView, pp.1-62, 2014.

D. Spoonhower, G. E. Blelloch, P. B. Gibbons, and R. Harper, Beyond Nested Parallelism: Tight Bounds on Workstealing Overheads for Parallel Futures, Proceedings of the Twentyfirst Annual Symposium on Parallelism in Algorithms and Architectures (SPAA '09), pp.91-100, 2009.

A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin, Lazy binary-splitting: a run-time adaptive work-stealing scheduler, Symposium on Principles & Practice of Parallel Programming, pp.179-190, 2010.

A. Tzannes, G. C. Caragea, R. Barua, and U. Vishkin, Lazy binary-splitting: a run-time adaptive work-stealing scheduler, PPoPP '10, pp.179-190, 2010.

A. Tzannes, G. C. Caragea, U. Vishkin, and R. Barua, Lazy Scheduling: A Runtime Adaptive Scheduler for Declarative Parallelism, TOPLAS, vol.36, issue.10, 2014.

L. G. Valiant, A bridging model for parallel computation, CACM, vol.33, pp.103-111, 1990.

J. S. Weening, Parallel Execution of Lisp Programs, 1989.

C. Yang and J. Mellor-crummey, A Practical Solution to the Cactus Stack Problem, Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures (SPAA '16, pp.61-70, 2016.