Does cache sharing on modern CMP matter to the performance of contemporary multithreaded programs? PPoPP '10: Proceedings of the 15th ACM SIGPLAN symposium on Principles and practice of parallel programming, 2010.
Finding effective compilation sequences, Proceedings of the Conference on Languages, Compilers, and Tools for Embedded Systems, 2004.
DOI: 10.1145/997163.997196
Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, 2008.
DOI: 10.1109/SC.2008.5222004
Profitable loop fusion and tiling using model-driven empirical search, Proceedings of the 20th annual international conference on Supercomputing, ICS '06, 2006.
DOI: 10.1145/1183401.1183437
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.8965
Automatically Tuned Linear Algebra Software, Proceedings of the IEEE/ACM SC98 Conference, 1998.
DOI: 10.1109/SC.1998.10004
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.108.3487
A fast Fourier transform compiler, Proceedings of the SIGPLAN '98 Conference on Programming Language Design and Implementation, 1998.
Feedback-directed thread scheduling with memory considerations, Proceedings of the 16th international symposium on High performance distributed computing, HPDC '07, 2007.
DOI: 10.1145/1272366.1272380
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.110.8567
The POET language manual, 2008.
Optimizing Compilers for Modern Architectures, 2002.
A Practical Approach to Exploiting Coarse-Grained Pipeline Parallelism in C Programs, 40th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 2007), 2007.
DOI: 10.1109/MICRO.2007.38
HelperCoreDB: Exploiting multicore technology for databases, Proceedings of the 16th International Conference on Parallel Architecture and Compilation Techniques, 2007.
A data locality optimizing algorithm, Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation, 1991.
Tile size selection using cache organization, Proceedings of the SIGPLAN '95 Conference on Programming Language Design and Implementation, 1995.
DOI: 10.1145/223428.207162
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.128.9167
Improving effective bandwidth through compiler enhancement of global cache reuse, International Parallel and Distributed Processing Symposium, 2001.
Combining loop transformations considering caches and scheduling, Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 29, 1996.
DOI: 10.1109/MICRO.1996.566468
The synchronized pipelined parallelism model, The 16th IASTED International Conference on Parallel and Distributed Computing and Systems, 2004.
Effective automatic parallelization of stencil computations, PLDI '07: Proceedings of the 2007 ACM SIGPLAN conference on Programming language design and implementation, 2007.
Loop Transformation Recipes for Code Generation and Auto-Tuning, The 22nd International Workshop on Languages and Compilers for Parallel Computing, 2009.
DOI: 10.1007/978-3-642-13374-9_4
Using time skewing to eliminate idle time due to memory bandwidth and network limitations, Proceedings of the 14th International Parallel and Distributed Processing Symposium, IPDPS 2000, 2000.
DOI: 10.1109/IPDPS.2000.845979
URL: http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.84.7663
HPCTOOLKIT: tools for performance analysis of optimized parallel programs, Concurrency and Computation: Practice and Experience, 2009.
DOI: 10.1145/1654059.1654111
POET: Parameterized Optimizations for Empirical Tuning, 2007 IEEE International Parallel and Distributed Processing Symposium, 2007.
DOI: 10.1109/IPDPS.2007.370637
Automated transformation for performance-critical kernels, Proceedings of the 2007 Symposium on Library-Centric Software Design, LCSD '07, 2007.
DOI: 10.1145/1512762.1512773