R. Allen and K. Kennedy, Optimizing compilers for modern architectures: a dependence-based approach, 2002.

N. Arora, A. Shringarpure, and R. W. Vuduc, Direct N-body Kernels for Multicore Platforms, 2009 International Conference on Parallel Processing, pp.379-387, 2009.
DOI : 10.1109/ICPP.2009.71

R. Bordawekar, U. Bondhugula, and R. Rao, Can cpus match gpus on performance with productivity? technical report rc25033, IBM, 2010.

K. Datta, M. Murphy, V. Volkov, S. Williams, J. Carter et al., Stencil computation optimization and auto-tuning on state-of-the-art multicore architectures, 2008 SC, International Conference for High Performance Computing, Networking, Storage and Analysis, pp.1-12, 2008.
DOI : 10.1109/SC.2008.5222004

U. Drepper, what every programmer should know about memory. technical report, Red Hat, 2007.

P. Estérie, M. Gaunard, J. Falcou, J. Lapresté, and B. Rozoy, Boost. simd: generic programming for portable simdization, International Conference on Parallel architectures and compilation techniques, pp.431-432, 2012.

M. Garland, S. L. Grand, J. Nickolls, J. Anderson, J. Hardwick et al., Parallel Computing Experiences with CUDA, Parallel computing experiences with cuda, pp.13-27, 2008.
DOI : 10.1109/MM.2008.57

C. Harris and M. Stephens, A Combined Corner and Edge Detector, Procedings of the Alvey Vision Conference 1988, 1988.
DOI : 10.5244/C.2.23

J. Iliffe, The use of the genie system in numerical calculation, Annual Review in Automatic Programming, vol.2, pp.1-28, 1961.

E. Lee, Multidimensional streams rooted in dataflow, IFIP Working Conference on Architectures and Compilation Techniques for Fine and Medium Grain Parallelism, 1993.

V. W. Lee, C. Kim, J. Chhugani, M. Deisher, D. Kim et al., Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu, International Symposium on Computer Architecture, pp.451-460, 2010.

L. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos, Iterative optimization in the polyhedral model: part ii, multidimensional time, ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI'08), pp.90-100, 2008.
URL : https://hal.archives-ouvertes.fr/hal-01257273

W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C book set: Numerical Recipes in C: The Art of Scientific Computing, pp.20-23, 1992.

N. Satish, C. Kim, J. Chhugani, H. Saito, R. Krishnaiyer et al., Can traditional programming bridge the ninja performance gap for parallel computing applications, International Symposium on Computer Architecture, pp.440-451, 2012.

D. N. Truong, F. Bodin, and A. Seznec, Improving cache behavior of dynamically allocated data structures, Proceedings. 1998 International Conference on Parallel Architectures and Compilation Techniques (Cat. No.98EX192), pp.322-329, 1998.
DOI : 10.1109/PACT.1998.727268

V. Volkov, Better performance at lower occupancy, GPU Technology Conference, 2010.

V. Volkov and J. Demmel, Lu, qr and cholesky factorizations using vector capabilities of gpus, technical report, 2008.