E. Yang, S. Kim, and T. Kim, An adaptive batch-orchestration algorithm for the heterogeneous GPU cluster environment in distributed deep learning system, IEEE International Conference on Big Data and Smart Computing, vol.18, pp.725-728, 2018.

K. K. Pal and K. S. Sudeep, Preprocessing for image classification by convolutional neural networks, RTEICT'16: 2016 IEEE International Conference on Recent Trends in Electronics, Information Communication Technology, pp.1778-1781, 2016.

I. Bruha, Pre- and Post-processing in Machine Learning and Data Mining, pp.258-266, 2001.

Horovod repository

S. Tseng, B. Nicolae, G. Bosilca, E. Jeannot, and F. Cappello, Towards portable online prediction of network utilization using MPI-level monitoring, Euro-Par'19: 25th International European Conference on Parallel and Distributed Systems, pp.1-14, 2019.
URL: https://hal.archives-ouvertes.fr/hal-02184204

K. He, X. Zhang, S. Ren, and J. Sun, Deep residual learning for image recognition, CVPR'16: 2016 IEEE Conference on Computer Vision and Pattern Recognition, pp.770-778, 2016.

M. Li, D. G. Andersen, and A. Smola, Communication efficient distributed machine learning with the parameter server, NIPS'14: Proceedings of the 27th International Conference on Neural Information Processing Systems, vol.1, pp.19-27, 2014.

X. Lian, W. Zhang, C. Zhang, and J. Liu, Asynchronous decentralized parallel stochastic gradient descent, ICML'18: The 35th International Conference on Machine Learning, pp.3049-3058, 2018.

M. Abadi, A. Agarwal, and P. Barham, TensorFlow: Large-scale machine learning on heterogeneous systems, 2015, software available from tensorflow.org

Y. Jia, E. Shelhamer, and J. Donahue, Caffe: Convolutional architecture for fast feature embedding, MM'14: The 22nd ACM International Conference on Multimedia, pp.675-678, 2014.

Torch: A scientific computing framework for LuaJIT

M. Ott, S. Edunov, D. Grangier, and M. Auli, Scaling neural machine translation, WMT'18: Proceedings of the Third Conference on Machine Translation: Research Papers, pp.1-9, 2018.

D. Yu, A. Eversole, and M. Seltzer, An introduction to computational networks and the computational network toolkit, 2014.

Distributed deep learning on Hadoop and Spark clusters

F. Iandola, M. W. Moskewicz, K. Ashraf, and K. Keutzer, FireCaffe: Near-linear acceleration of deep neural network training on compute clusters, CVPR'16: 2016 Conference on Computer Vision and Pattern Recognition, pp.2592-2600, 2016.

F. Chollet, Keras, 2015.

Q. Meng, W. J. Chen, and Y. Wang, Convergence analysis of distributed stochastic gradient descent with shuffling, Neurocomputing, vol.337, pp.46-57, 2017.

X. Wu, V. Taylor, and J. M. Wozniak, Performance, energy, and scalability analysis and improvement of parallel cancer deep learning CANDLE benchmarks, ICPP'19: Proceedings of the 48th International Conference on Parallel Processing, vol.78, p.11, 2019.

Y. You, Z. Zhang, and C. Hsieh, ImageNet training in minutes, ICPP'18: Proceedings of the 47th International Conference on Parallel Processing, vol.1, pp.1-1, 2018.

A. A. Awan, J. Bédorf, C. Chu, H. Subramoni, and D. K. Panda, Scalable distributed DNN training using TensorFlow and CUDA-aware MPI: Characterization, designs, and performance evaluation, CCGRID'19: 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing, pp.498-507, 2019.

S. W. Chien, S. Markidis, and C. P. Sishtla, Characterizing deep-learning I/O workloads in TensorFlow, PDSW-DISCS'18: IEEE/ACM 3rd International Workshop on Parallel Data Storage & Data Intensive Scalable Computing Systems, 2018.

F. Chowdhury, Y. Zhu, and T. Heer, I/O characterization and performance evaluation of BeeGFS for deep learning, ICPP'19: Proceedings of the 48th International Conference on Parallel Processing, vol.80, pp.1-80, 2019.

J. Wozniak, R. Jain, and P. Balaprakash, CANDLE/Supervisor: A workflow framework for machine learning applied to cancer research, BMC Bioinformatics, issue.19, 2018.

CANDLE benchmarks

J. Deng, W. Dong, and R. Socher, ImageNet: A Large-Scale Hierarchical Image Database, CVPR'09: Conference on Computer Vision and Pattern Recognition, pp.248-255, 2009.

B. Nicolae, A. Moody, E. Gonsiorowski, K. Mohror, and F. Cappello, VeloC: Towards high performance adaptive asynchronous checkpointing at large scale, IPDPS'19: The 2019 IEEE International Parallel and Distributed Processing Symposium, pp.911-920, 2019.
URL: https://hal.archives-ouvertes.fr/hal-02184203