L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin et al., HPCTOOLKIT: tools for performance analysis of optimized parallel programs, Concurrency and Computation: Practice and Experience, vol.22, pp.685-701, 2010.

G. Ammons, T. Ball, and J. R. Larus, Exploiting Hardware Performance Counters with Flow and Context Sensitive Profiling, Proceedings of the ACM SIGPLAN 1997 Conference on Programming Language Design and Implementation (PLDI '97), 1997.

R. Ao, G. Tan, and M. Chen, ParaInsight: An Assistant for Quantitatively Analyzing Multi-granularity Parallel Region, High Performance Computing and Communications & 2013 IEEE International Conference on Embedded and Ubiquitous Computing (HPCC_EUC), pp.698-707, 2013.

U. K. Banerjee, Dependence Analysis for Supercomputing, 1988.

W. Bao, S. Krishnamoorthy, L. Pouchet, F. Rastello, and P. Sadayappan, PolyCheck: Dynamic Verification of Iteration Space Transformations on Affine Programs, Proceedings of the 43rd Annual ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL '16), pp.539-554, 2016.
URL : https://hal.archives-ouvertes.fr/hal-01234104

C. Bastoul, Generating loops for scanning polyhedra: CLooG user's guide, Polyhedron, vol.2, p.10, 2004.

F. Bellard, QEMU, a Fast and Portable Dynamic Translator, Proceedings of the Annual Conference on USENIX Annual Technical Conference (ATEC '05), 2005.

E. Berg and E. Hagersten, Fast data-locality profiling of native execution, In ACM SIGMETRICS Performance Evaluation Review, vol.33, pp.169-180, 2005.

K. Beyls and E. H. D'Hollander, Discovery of Locality-Improving Refactorings by Reuse Path Analysis, Proceedings of the Second International Conference on High Performance Computing and Communications (HPCC'06), pp.220-229, 2006.

U. Bondhugula, A. Hartono, J. Ramanujam, and P. Sadayappan, A Practical Automatic Polyhedral Program Optimization System, PLDI, 2008.

K. Butt, A. Qadeer, G. Mustafa, and A. Waheed, Runtime analysis of application binaries for function level parallelism potential using QEMU, Open Source Systems and Technologies (ICOSST), 2012 International Conference on. IEEE, pp.33-39, 2012.

A. S. Charif-Rubial, E. Oseret, J. Noudohouenou, W. Jalby, and G. Lartigue, CQA: A code quality analyzer tool at binary level, High Performance Computing (HiPC), 2014.

S. Che, M. Boyer, J. Meng, D. Tarjan, J. W. Sheaffer et al., Rodinia: A benchmark suite for heterogeneous computing, Workload Characterization (IISWC), 2009 IEEE International Symposium on, 2009.

S. Che, J. W. Sheaffer, M. Boyer, L. G. Szafaryn, L. Wang et al., A characterization of the Rodinia benchmark suite with comparison to contemporary CMP workloads, Workload Characterization (IISWC), 2010 IEEE International Symposium on, 2010.

J. Collard, D. Barthou, and P. Feautrier, Fuzzy array dataflow analysis, In ACM SIGPLAN Notices, vol.30, pp.92-101, 1995.

Intel Architecture Code Analyzer - User's Guide, 2009.

A. Carvalho de Melo, The new Linux 'perf' tools, Slides from Linux Kongress, vol.18, 2010.

L. Djoudi, D. Barthou, P. Carribault, C. Lemuet, J. Acquaviva et al., MAQAO: Modular assembler quality analyzer and optimizer for Itanium 2, The 4th Workshop on EPIC architectures and compiler technology, vol.200, 2005.
URL : https://hal.archives-ouvertes.fr/hal-00141075

J. Doerfert, C. Hammacher, K. Streit, and S. Hack, SPolly: speculative optimizations in the polyhedral model, IMPACT 2013, p.55, 2013.

K. Faxén and K. Popov, Embla - data dependence profiling for parallel programming, Complex, Intelligent and Software Intensive Systems, pp.780-785, 2008.

P. Feautrier, Parametric integer programming, RAIRO - Operations Research, vol.22, pp.243-268, 1988.

P. Feautrier and C. Lengauer, Polyhedron model, Encyclopedia of Parallel Computing, pp.1581-1592, 2011.

S. Girbal, N. Vasilache, C. Bastoul, A. Cohen, D. Parello et al., Semi-Automatic Composition of Loop Transformations for Deep Parallelism and Memory Hierarchies, Intl. J. of Parallel Programming, vol.34, issue.3, 2006.

S. L. Graham, P. B. Kessler, and M. K. McKusick, Gprof: A call graph execution profiler, ACM SIGPLAN Notices, vol.17, pp.120-126, 1982.

B. Gregg, The flame graph, Commun. ACM, vol.59, pp.48-57, 2016.

A. Griffith, GCC: the complete reference, 2002.

T. Grosser, A. Groesslinger, and C. Lengauer, Polly - Performing polyhedral optimizations on a low-level intermediate representation, Parallel Processing Letters, vol.22, p.4, 2012.

F. Gruber, M. Selva, D. Sampaio, C. Guillon, L. Pouchet et al., Building of a Polyhedral Representation from an Instrumented Execution: Making Dynamic Analyses of non-Affine Programs Scalable, 2019.
URL : https://hal.archives-ouvertes.fr/hal-01967828

C. Guillon, Program Instrumentation with QEMU, Proceedings of the International QEMU User's Forum, 2011.

P. Havlak, Nesting of Reducible and Irreducible Loops, ACM Trans. Program. Lang. Syst, vol.19, issue.4, 1997.

J. L. Henning, SPEC CPU2006 Benchmark Descriptions, SIGARCH Comput. Archit. News, vol.34, pp.1-17, 2006.

J. Holewinski, R. Ramamurthi, and M. Ravishankar, Dynamic trace-based analysis of vectorization potential of applications, ACM SIGPLAN Notices, vol.47, pp.371-382, 2012.

A. Jimborean, L. Mastrangelo, V. Loechner, and P. Clauss, VMAD: An Advanced Dynamic Program Analysis and Instrumentation Framework, 2012.

W. Kelly and W. Pugh, A unifying framework for iteration reordering transformations, Proceedings 1st International Conference on Algorithms and Architectures for Parallel Processing, 1995.

A. Ketterlin and P. Clauss, Prediction and Trace Compression of Data Access Addresses Through Nested Loop Recognition, Proceedings of the 6th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO '08), 2008.
URL : https://hal.archives-ouvertes.fr/inria-00504597

A. Ketterlin and P. Clauss, Profiling Data-Dependence to Assist Parallelization: Framework, Scope, and Optimization, p.45, 2012.
URL : https://hal.archives-ouvertes.fr/hal-00780782

, Annual IEEE/ACM International Symposium on Microarchitecture, pp.437-448

M. Kim, H. Kim, and C. Luk, Prospector: A dynamic data-dependence profiler to help parallel programming, HotPar'10: Proceedings of the USENIX workshop on Hot Topics in parallelism, 2010.

M. Kim, N. B. Lakshminarayana, H. Kim, and C. Luk, SD3: An Efficient Dynamic Data-Dependence Profiling Mechanism, IEEE Trans. Comput, vol.62, pp.2516-2530, 2013.

C. Lattner and V. Adve, LLVM: A Compilation Framework for Lifelong Program Analysis & Transformation, Proceedings of the International Symposium on Code Generation and Optimization: Feedback-directed and Runtime Optimization (CGO '04), 2004.

Z. Li, R. Atre, Z. Ul-huda, A. Jannesari, and F. Wolf, DiscoPoP: A profiling tool to identify parallelization opportunities, Tools for High Performance Computing, pp.37-54, 2014.

Z. Li, M. Beaumont, A. Jannesari, and F. Wolf, Fast data-dependence profiling by skipping repeatedly executed memory operations, International Conference on Algorithms and Architectures for Parallel Processing, pp.583-596, 2015.

Z. Li, A. Jannesari, and F. Wolf, An efficient data-dependence profiler for sequential and parallel programs, Parallel and Distributed Processing Symposium (IPDPS), pp.484-493, 2015.

X. Liu and J. Mellor-Crummey, Pinpointing data locality problems using data-centric analysis, Code Generation and Optimization (CGO), 2011 9th Annual IEEE/ACM International Symposium on. IEEE, pp.171-180, 2011.

C. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser et al., Pin: building customized program analysis tools with dynamic instrumentation, ACM SIGPLAN Notices, vol.40, pp.190-200, 2005.

G. Marin, J. Dongarra, and D. Terpstra, MIAMI: A framework for application performance diagnosis, 2014 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2014.

J. M. Martinez Caamaño, M. Selva, P. Clauss, A. Baloian, and W. Wolff, Full runtime polyhedral optimizing loop transformations with the generation, instantiation, and scheduling of code-bones, Concurrency and Computation: Practice and Experience, vol.29, 2017.

S. Mehta and P. Yew, Improving Compiler Scalability: Optimizing Large Programs at Small Price, Proceedings of the 36th ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI '15), pp.143-152, 2015.

T. Moseley, A. Shye, V. J. Reddi, D. Grunwald, and R. Peri, Shadow profiling: Hiding instrumentation costs with parallelism, Code Generation and Optimization, 2007. CGO'07. International Symposium on. IEEE, pp.198-208, 2007.

N. Nethercote and A. Mycroft, Redux: A dynamic dataflow tracer, Electronic Notes in Theoretical Computer Science, vol.89, pp.149-170, 2003.

N. Nethercote and J. Seward, Valgrind: A Framework for Heavyweight Dynamic Binary Instrumentation. SIGPLAN Not, 2007.

C. H. Papadimitriou and K. Steiglitz, Combinatorial optimization: algorithms and complexity, 1998.

S. Pop, A. Cohen, C. Bastoul, and S. Girbal, GRAPHITE: Polyhedral analyses and optimizations for GCC, Proceedings of the 2006 GCC Developers Summit, 2006.

S. Pop, A. Cohen, and G. Silber, Induction Variable Analysis with Delayed Abstractions, Proceedings of the First International Conference on High Performance Embedded Architectures and Compilers (HiPEAC'05), 2005.
URL : https://hal.archives-ouvertes.fr/hal-01257294

L. Pouchet,

L. Pouchet, Polybench: The polyhedral benchmark suite, pp.2017-2026, 2017.

B. Pradelle, B. Meister, M. Baskaran, A. Konstantinidis, T. Henretty et al., Scalable hierarchical polyhedral compilation, Parallel Processing (ICPP), pp.432-441, 2016.

G. Ramalingam, On Loops, Dominators, and Dominance Frontiers, ACM Trans. Program. Lang. Syst, vol.24, 2002.

J. Reinders, VTune performance analyzer essentials, 2005.

Y. Sato, Y. Inoguchi, and T. Nakamura, On-the-fly Detection of Precise Loop Nests Across Procedures on a Dynamic Binary Translation System, Proceedings of the 8th ACM International Conference on Computing Frontiers (CF '11), 2011.

M. Schordan, P. Lin, D. Quinlan, and L. Pouchet, Verification of polyhedral optimizations with constant loop bounds in finite state space computations, International Symposium On Leveraging Applications of Formal Methods, Verification and Validation, pp.493-508, 2014.

K. Serebryany, D. Bruening, A. Potapenko, and D. Vyukov, AddressSanitizer: A Fast Address Sanity Checker, USENIX Annual Technical Conference, pp.309-318, 2012.

A. Simbürger, S. Apel, A. Größlinger, and C. Lengauer, PolyJIT: Polyhedral Optimization Just in Time, International Journal of Parallel Programming, p.33, 2018.

M. D. Smith, Overcoming the Challenges to Feedback-directed Optimization (Keynote Talk), SIGPLAN Not, vol.35, pp.1-11, 2000.

O. A. Sopeju, M. Burtscher, A. Rane, and J. Browne, AutoSCOPE: Automatic Suggestions for Code Optimizations Using PerfExpert, 2011.

O. Tange, GNU Parallel - The Command-Line Power Tool, ;login: The USENIX Magazine, vol.36, pp.42-47, 2011.

G. Tournavitis and B. Franke, Semi-automatic extraction and exploitation of hierarchical pipeline parallelism using profiling information, Parallel Architectures and Compilation Techniques (PACT), pp.377-388, 2010.

K. Trifunovic, A. Cohen, D. Edelsohn, F. Li, T. Grosser et al., GRAPHITE Two Years After: First Lessons Learned From Real-World Polyhedral Compilation, GCC Research Opportunities Workshop, 2010.
URL : https://hal.archives-ouvertes.fr/inria-00551516

R. A. van Engelen, Efficient symbolic analysis for optimizing compilers, International Conference on Compiler Construction, 2001.

H. Vandierendonck, S. Rul, and K. De Bosschere, The Paralax infrastructure: automatic parallelization with a helping hand, Parallel Architectures and Compilation Techniques (PACT), pp.389-399, 2010.

S. Verdoolaege, isl: An Integer Set Library for the Polyhedral Model, Proceedings of the Third International Congress on Mathematical Software (ICMS '2010), 2010.

S. Verdoolaege, S. Guelton, T. Grosser, and A. Cohen, , 2014.

S. Verdoolaege, G. Janssens, and M. Bruynooghe, Equivalence checking of static affine programs using widening to handle recurrences, ACM Transactions on Programming Languages and Systems (TOPLAS), vol.34, p.11, 2012.

Z. Wang, G. Tournavitis, B. Franke, and M. F. P. O'Boyle, Integrating profile-driven parallelism detection and machine-learning-based mapping, ACM Transactions on Architecture and Code Optimization (TACO), vol.11, issue.2, 2014.

M. E. Wolf and M. S. Lam, A data locality optimizing algorithm, ACM SIGPLAN Notices, vol.26, pp.30-44, 1991.

Q. Zhao, D. Bruening, and S. Amarasinghe, Umbra: Efficient and scalable memory shadowing, Proceedings of the 8th annual IEEE/ACM international symposium on Code generation and optimization, pp.22-31, 2010.