E. Z. Zhang, Y. Jiang, and X. Shen, Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs?, PPoPP '10: Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2010.

L. Almagor, K. Cooper, A. Grosul, T. Harvey, S. Reeves et al., Finding effective compilation sequences, Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems, 2004.
DOI : 10.1145/997163.997196

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter et al., Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI : 10.1109/SC.2008.5222004

A. Qasem and K. Kennedy, Profitable loop fusion and tiling using model-driven empirical search, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, 2006.
DOI : 10.1145/1183401.1183437

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.8965

C. Whaley and J. Dongarra, Automatically Tuned Linear Algebra Software, Proceedings of the IEEE/ACM SC98 Conference, 1998.
DOI : 10.1109/SC.1998.10004

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.3487

M. Frigo, A fast Fourier transform compiler, Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, 1998.

F. Song, S. Moore, and J. Dongarra, Feedback-directed thread scheduling with memory considerations, Proceedings of the 16th international symposium on High performance distributed computing, HPDC '07, 2007.
DOI : 10.1145/1272366.1272380

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.8567

Q. Yi, The POET language manual, 2008.

R. Allen and K. Kennedy, Optimizing Compilers for Modern Architectures, 2002.

W. Thies, V. Chandrasekhar, and S. Amarasinghe, A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), 2007.
DOI : 10.1109/MICRO.2007.38

K. Papadopoulos, K. Stavrou, and P. Trancoso, HelperCore DB: Exploiting multicore technology for databases, Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, 2007.

M. E. Wolf and M. Lam, A data locality optimizing algorithm, Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, 1991.

S. Coleman and K. S. McKinley, Tile size selection using cache organization, Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, 1995.
DOI : 10.1145/223428.207162

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.9167

C. Ding and K. Kennedy, Improving effective bandwidth through compiler enhancement of global cache reuse, International Parallel and Distributed Processing Symposium, 2001.

M. Wolf, D. Maydan, and D. Chen, Combining loop transformations considering caches and scheduling, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture. MICRO 29, 1996.
DOI : 10.1109/MICRO.1996.566468

S. N. Vadlamani and S. F. Jenks, The synchronized pipelined parallelism model, The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, 2004.

S. Krishnamoorthy, M. Baskaran, U. Bondhugula, J. Ramanujam, A. Rountev et al., Effective automatic parallelization of stencil computations, PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, 2007.

M. Hall, J. Chame, C. Chen, J. Shin, G. Rudy et al., Loop Transformation Recipes for Code Generation and Auto-Tuning, The 22nd International Workshop on Languages and Compilers for Parallel Computing, 2009.
DOI : 10.1007/978-3-642-13374-9_4

D. Wonnacott, Using time skewing to eliminate idle time due to memory bandwidth and network limitations, Proceedings 14th International Parallel and Distributed Processing Symposium. IPDPS 2000, 2000.
DOI : 10.1109/IPDPS.2000.845979

URL : http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.7663

L. Adhianto, S. Banerjee, M. Fagan, M. Krentel, G. Marin et al., HPCTOOLKIT: tools for performance analysis of optimized parallel programs, Concurrency and Computation: Practice and Experience, 2009.
DOI : 10.1145/1654059.1654111

Q. Yi, K. Seymour, H. You, R. Vuduc, and D. Quinlan, POET: Parameterized Optimizations for Empirical Tuning, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007.
DOI : 10.1109/IPDPS.2007.370637

Q. Yi and C. Whaley, Automated transformation for performance-critical kernels, Proceedings of the 2007 Symposium on Library-Centric Software Design, LCSD '07, 2007.
DOI : 10.1145/1512762.1512773