TY - JOUR
AU - Egwutuoha, I. P.
AU - Levy, D.
AU - Selic, B.
AU - Chen, S.
PY - 2013
DA - 2013//
TI - A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems
JO - J Supercomput
VL - 65
UR - https://doi.org/10.1007/s11227-013-0884-0
DO - 10.1007/s11227-013-0884-0
ID - Egwutuoha2013
ER -
TY - STD
TI - Di Martino C, Kalbarczyk Z, Iyer RK, Baccanico F, Fullop J, Kramer W (2014) Lessons learned from the analysis of system failures at petascale: The case of Blue Waters In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 610–621. https://doi.org/10.1109/DSN.2014.62.
ID - ref2
ER -
TY - JOUR
AU - Snir, M.
AU - Wisniewski, R. W.
AU - Abraham, J. A.
AU - Adve, S. V.
AU - Bagchi, S.
AU - Balaji, P.
AU - Belak, J.
AU - Bose, P.
AU - Cappello, F.
AU - Carlson, B.
AU - Chien, A. A.
AU - Coteus, P.
AU - Debardeleben, N. A.
AU - Diniz, P. C.
AU - Engelmann, C.
AU - Erez, M.
AU - Fazzari, S.
AU - Geist, A.
AU - Gupta, R.
AU - Johnson, F.
AU - Krishnamoorthy, S.
AU - Leyffer, S.
AU - Liberty, D.
AU - Mitra, S.
AU - Munson, T.
AU - Schreiber, R.
AU - Stearley, J.
AU - Hensbergen, E. V.
PY - 2014
DA - 2014//
TI - Addressing failures in exascale computing
JO - Int J High Perform Comput Appl
VL - 28
UR - https://doi.org/10.1177/1094342014522573
DO - 10.1177/1094342014522573
ID - Snir2014
ER -
TY - STD
TI - Tiwari D, Gupta S, Vazhkudai SS (2014) Lazy checkpointing: exploiting temporal locality in failures to mitigate checkpointing overheads on extreme-scale systems In: 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 25–36. https://doi.org/10.1109/DSN.2014.101.
ID - ref4
ER -
TY - STD
TI - Gioiosa R, Kestor G, Kerbyson DJ (2014) Online monitoring system for performance fault detection In: International Parallel Distributed Processing Symposium Workshops, 1475–1484. https://doi.org/10.1109/IPDPSW.2014.165.
ID - ref5
ER -
TY - BOOK
AU - Nielsen, F.
PY - 2016
DA - 2016//
TI - Introduction to HPC with MPI for data science
PB - Springer
CY - Switzerland
UR - https://doi.org/10.1007/978-3-319-21903-5
DO - 10.1007/978-3-319-21903-5
ID - Nielsen2016
ER -
TY - CHAP
AU - Fagg, G. E.
AU - Dongarra, J.
PY - 2000
DA - 2000//
TI - FT-MPI: Fault tolerant MPI, supporting dynamic applications in a dynamic world
BT - Proceedings of the 7th European PVM/MPI Users’ Group Meeting on Recent Advances in Parallel Virtual Machine and Message Passing Interface
PB - Springer
CY - London
ID - Fagg2000
ER -
TY - STD
TI - (2015) MPI Forum: document for a standard message-passing interface 3.1. Technical report, University of Tennessee.
ID - ref8
ER -
TY - JOUR
AU - Bland, W.
AU - Bouteiller, A.
AU - Hérault, T.
AU - Bosilca, G.
AU - Dongarra, J.
PY - 2013
DA - 2013//
TI - Post-failure recovery of MPI communication capability: design and rationale
JO - IJHPCA
VL - 27
ID - Bland2013
ER -
TY - JOUR
AU - Gropp, W.
AU - Lusk, E.
PY - 2004
DA - 2004//
TI - Fault tolerance in message passing interface programs
JO - Int J High Perform Comput Appl
VL - 18
UR - https://doi.org/10.1177/1094342004046045
DO - 10.1177/1094342004046045
ID - Gropp2004
ER -
TY - CHAP
AU - Birman, K.
PY - 2010
DA - 2010//
TI - A history of the virtual synchrony replication model
BT - Replication
PB - Springer
CY - Berlin
ID - Birman2010
ER -
TY - JOUR
AU - Veríssimo, P. E.
PY - 2006
DA - 2006//
TI - Travelling through wormholes: a new look at distributed systems models
JO - SIGACT News
VL - 37
UR - https://doi.org/10.1145/1122480.1122497
DO - 10.1145/1122480.1122497
ID - Veríssimo2006
ER -
TY - CHAP
AU - Huang, K. C.
AU - Huang, T. C.
AU - Tsai, M. J.
AU - Chang, H. Y.
ED - Park, J. J. J. H.
ED - Stojmenovic, I.
ED - Choi, M.
ED - Xhafa, F.
PY - 2014
DA - 2014//
TI - Moldable job scheduling for HPC as a service
BT - Future information technology: FutureTech 2013
PB - Springer
CY - Berlin, Heidelberg
UR - https://doi.org/10.1007/978-3-642-40861-8_7
DO - 10.1007/978-3-642-40861-8_7
ID - Huang2014
ER -
TY - CHAP
AU - Masson, G. M.
AU - Blough, D. M.
AU - Sullivan, G. F.
PY - 1996
DA - 1996//
TI - System diagnosis
BT - Fault-tolerant computer system design
PB - Prentice-Hall, Inc
CY - Upper Saddle River
ID - Masson1996
ER -
TY - JOUR
AU - Ye, T. L.
AU - Hsieh, S. Y.
PY - 2013
DA - 2013//
TI - A scalable comparison-based diagnosis algorithm for hypercube-like networks
JO - IEEE Trans Reliab
VL - 62
UR - https://doi.org/10.1109/TR.2013.2284743
DO - 10.1109/TR.2013.2284743
ID - Ye2013
ER -
TY - JOUR
AU - Weber, A.
AU - Kutzke, A. R.
AU - Chessa, S.
PY - 2012
DA - 2012//
TI - Energy-aware test connection assignment for the self-diagnosis of a wireless sensor network
JO - J Braz Comput Soc
VL - 18
UR - https://doi.org/10.1007/s13173-012-0057-7
DO - 10.1007/s13173-012-0057-7
ID - Weber2012
ER -
TY - JOUR
AU - Wagar, B.
PY - 1987
DA - 1987//
TI - Hyperquicksort: A fast sorting algorithm for hypercubes
JO - Hypercube Multiprocessors
VL - 1987
ID - Wagar1987
ER -
TY - STD
TI - Cappello F, Geist A, Gropp W, Kale S, Kramer B, Snir M (2014) Toward Exascale Resilience: 2014 update. Supercomputing Frontiers and Innovations 1(1). http://superfri.org/superfri/article/view/14.
UR - http://superfri.org/superfri/article/view/14
ID - ref18
ER -
TY - STD
TI - Ropars T, Martsinkevich TV, Guermouche A, Schiper A, Cappello F (2013) SPBC: Leveraging the characteristics of MPI HPC applications for scalable checkpointing In: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–12. https://doi.org/10.1145/2503210.2503271.
ID - ref19
ER -
TY - JOUR
AU - Bouteiller, A.
AU - Herault, T.
AU - Bosilca, G.
AU - Dongarra, J. J.
PY - 2013
DA - 2013//
TI - Correlated set coordination in fault tolerant message logging protocols for many-core clusters
JO - Concurr Comput Pract Exp
VL - 25
UR - https://doi.org/10.1002/cpe.2859
DO - 10.1002/cpe.2859
ID - Bouteiller2013
ER -
TY - JOUR
AU - Fagg, G. E.
AU - Dongarra, J. J.
PY - 2004
DA - 2004//
TI - Building and using a fault-tolerant MPI implementation
JO - Int J High Perform Comput Appl
VL - 18
UR - https://doi.org/10.1177/1094342004046052
DO - 10.1177/1094342004046052
ID - Fagg2004
ER -
TY - JOUR
AU - Batchu, R.
AU - Dandass, Y. S.
AU - Skjellum, A.
AU - Beddhu, M.
PY - 2004
DA - 2004//
TI - MPI/FT: A model-based approach to low-overhead fault tolerant message-passing middleware
JO - Clust Comput
VL - 7
UR - https://doi.org/10.1023/B:CLUS.0000039491.64560.8a
DO - 10.1023/B:CLUS.0000039491.64560.8a
ID - Batchu2004
ER -
TY - STD
TI - Suo G, Lu Y, Liao X, Xie M, Cao H (2013) NR-MPI: A non-stop and fault resilient MPI In: International Conference on Parallel and Distributed Systems, 190–199. https://doi.org/10.1109/ICPADS.2013.37.
ID - ref23
ER -
TY - JOUR
AU - Gropp, W.
AU - Lusk, E.
AU - Doss, N.
AU - Skjellum, A.
PY - 1996
DA - 1996//
TI - A high-performance, portable implementation of the MPI message passing interface standard
JO - Parallel Comput
VL - 22
UR - https://doi.org/10.1016/0167-8191(96)00024-5
DO - 10.1016/0167-8191(96)00024-5
ID - Gropp1996
ER -
TY - CHAP
AU - Hursey, J.
AU - Graham, R. L.
AU - Bronevetsky, G.
AU - Buntinas, D.
AU - Pritchard, H.
AU - Solt, D. G.
ED - Cotronis, Y.
ED - Danalis, A.
ED - Nikolopoulos, D. S.
ED - Dongarra, J.
PY - 2011
DA - 2011//
TI - Run-through stabilization: an MPI proposal for process fault tolerance
BT - Recent advances in the message passing interface: 18th European MPI Users’ Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. 
Proceedings
PB - Springer
CY - Berlin, Heidelberg
UR - https://doi.org/10.1007/978-3-642-24449-0_40
DO - 10.1007/978-3-642-24449-0_40
ID - Hursey2011
ER -
TY - CHAP
AU - Herault, T.
AU - Bouteiller, A.
AU - Bosilca, G.
AU - Gamell, M.
AU - Teranishi, K.
AU - Parashar, M.
AU - Dongarra, J.
PY - 2015
DA - 2015//
TI - Practical scalable consensus for pseudo-synchronous distributed systems
BT - Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis, SC
PB - ACM
CY - New York
ID - Herault2015
ER -
TY - STD
TI - Buntinas D (2012) Scalable distributed consensus to support MPI fault tolerance In: 26th International Parallel and Distributed Processing Symposium, 1240–1249. https://doi.org/10.1109/IPDPS.2012.113.
ID - ref27
ER -
TY - CHAP
AU - Hursey, J.
AU - Naughton, T.
AU - Vallee, G.
AU - Graham, R. L.
ED - Cotronis, Y.
ED - Danalis, A.
ED - Nikolopoulos, D. S.
ED - Dongarra, J.
PY - 2011
DA - 2011//
TI - A log-scaling fault tolerant agreement algorithm for a fault tolerant MPI
BT - Recent Advances in the Message Passing Interface: 18th European MPI Users’ Group Meeting, EuroMPI 2011, Santorini, Greece, September 18-21, 2011. Proceedings
PB - Springer
CY - Berlin, Heidelberg
UR - https://doi.org/10.1007/978-3-642-24449-0_29
DO - 10.1007/978-3-642-24449-0_29
ID - Hursey2011
ER -
TY - JOUR
AU - Huang, K. H.
AU - Abraham, J. A.
PY - 1984
DA - 1984//
TI - Algorithm-based fault tolerance for matrix operations
JO - IEEE Trans Comput
VL - C-33
UR - https://doi.org/10.1109/TC.1984.1676475
DO - 10.1109/TC.1984.1676475
ID - Huang1984
ER -
TY - JOUR
AU - Chen, Z.
AU - Dongarra, J.
PY - 2008
DA - 2008//
TI - Algorithm-based fault tolerance for fail-stop failures
JO - IEEE Trans Parallel Distrib Syst
VL - 19
UR - https://doi.org/10.1109/TPDS.2008.58
DO - 10.1109/TPDS.2008.58
ID - Chen2008
ER -
TY - STD
TI - Gamell M, Katz DS, Kolla H, Chen J, Klasky S, Parashar M (2014) Exploring automatic, online failure recovery for scientific applications at extreme scales In: SC14: International Conference for High Performance Computing, Networking, Storage and Analysis, 895–906. https://doi.org/10.1109/SC.2014.78.
ID - ref31
ER -
TY - STD
TI - Gamell M, Teranishi K, Heroux MA, Mayo J, Kolla H, Chen J, Parashar M (2015) Local recovery and failure masking for stencil-based applications at extreme scales In: SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, 1–12. https://doi.org/10.1145/2807591.2807672.
ID - ref32
ER -
TY - STD
TI - Zheng G, Ni X, Kalé LV (2012) A scalable double in-memory checkpoint and restart scheme towards exascale In: International Conference on Dependable Systems and Networks Workshops (DSN), 1–6. https://doi.org/10.1109/DSNW.2012.6264677.
ID - ref33
ER -
TY - CHAP
AU - Ferreira, K.
AU - Stearley, J.
AU - Laros III, J. H.
AU - Oldfield, R.
AU - Pedretti, K.
AU - Brightwell, R.
AU - Riesen, R.
AU - Bridges, P. G.
AU - Arnold, D.
PY - 2011
DA - 2011//
TI - Evaluating the viability of process replication reliability for exascale systems
BT - Proceedings of International Conference for High Performance Computing, Networking, Storage and Analysis, SC
PB - ACM
CY - New York
ID - Ferreira2011
ER -
TY - JOUR
AU - Genaud, S.
AU - Jeannot, E.
AU - Rattanapoka, C.
PY - 2009
DA - 2009//
TI - Fault-management in P2P-MPI
JO - Int J Parallel Prog
VL - 37
UR - https://doi.org/10.1007/s10766-009-0115-8
DO - 10.1007/s10766-009-0115-8
ID - Genaud2009
ER -
TY - CHAP
AU - Fiala, D.
AU - Mueller, F.
AU - Engelmann, C.
AU - Riesen, R.
AU - Ferreira, K.
AU - Brightwell, R.
PY - 2012
DA - 2012//
TI - Detection and correction of silent data corruption for large-scale high-performance computing
BT - Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC
PB - IEEE Computer Society Press
CY - Los Alamitos
ID - Fiala2012
ER -
TY - CHAP
AU - Huang, C.
AU - Zheng, G.
AU - Kalé, L.
AU - Kumar, S.
PY - 2006
DA - 2006//
TI - Performance Evaluation of Adaptive MPI
BT - Proceedings of the Eleventh ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
PB - ACM
CY - New York
ID - Huang2006
ER -
TY - CHAP
AU - Kale, L. V.
AU - Krishnan, S.
PY - 1993
DA - 1993//
TI - CHARM++: A Portable Concurrent Object Oriented System Based on C++
BT - Proceedings of the Eighth Annual Conference on Object-oriented Programming Systems, Languages, and Applications
PB - ACM
CY - New York
ID - Kale1993
ER -
TY - CHAP
AU - Petrini, F.
AU - Kerbyson, D. J.
AU - Pakin, S.
PY - 2003
DA - 2003//
TI - The Case of the Missing Supercomputer Performance: Achieving Optimal Performance on the 8,192 Processors of ASCI Q
BT - Proceedings of the 2003 ACM/IEEE Conference on Supercomputing
PB - ACM
CY - New York
UR - https://doi.org/10.1145/1048935.1050204
DO - 10.1145/1048935.1050204
ID - Petrini2003
ER -
TY - STD
TI - Aguilar X, Laure E, Fürlinger K (2013) Online performance data introspection with IPM In: 10th International Conference on High Performance Computing, 728–734. https://doi.org/10.1109/HPCC.and.EUC.2013.107.
ID - ref40
ER -
TY - CHAP
AU - Huck, K. A.
AU - Malony, A. D.
AU - Shende, S.
AU - Morris, A.
ED - Mohr, B.
ED - Träff, J. L.
ED - Worringen, J.
ED - Dongarra, J.
PY - 2006
DA - 2006//
TI - TAUg: runtime global performance data access using MPI
BT - Recent advances in parallel virtual machine and message passing interface: 13th European PVM/MPI User’s Group Meeting Bonn, Germany, September 17-20, 2006 Proceedings
PB - Springer
CY - Berlin, Heidelberg
UR - https://doi.org/10.1007/11846802_44
DO - 10.1007/11846802_44
ID - Huck2006
ER -
TY - CHAP
AU - Nataraj, A.
AU - Sottile, M.
AU - Morris, A.
AU - Malony, A. D.
AU - Shende, S.
ED - Kermarrec, A.-M.
ED - Bougé, L.
ED - Priol, T.
PY - 2007
DA - 2007//
TI - TAUoverSupermon: low-overhead online parallel performance monitoring
BT - 13th International Euro-Par Conference
PB - Springer
CY - Berlin, Heidelberg
ID - Nataraj2007
ER -
TY - JOUR
AU - Shende, S. S.
AU - Malony, A. D.
PY - 2006
DA - 2006//
TI - The TAU parallel performance system
JO - Int J High Perform Comput Appl
VL - 20
UR - https://doi.org/10.1177/1094342006064482
DO - 10.1177/1094342006064482
ID - Shende2006
ER -
TY - STD
TI - Sottile MJ, Minnich RG (2002) Supermon: a high-speed cluster monitoring system In: International Conference on Cluster Computing, 39–46. https://doi.org/10.1109/CLUSTR.2002.1137727.
ID - ref44
ER -
TY - JOUR
AU - Duarte Jr., E. P.
AU - Ziwich, R. P.
AU - Albini, L. C. P.
PY - 2011
DA - 2011//
TI - A survey of comparison-based system-level diagnosis
JO - ACM Comput Surv
VL - 43
UR - https://doi.org/10.1145/1922649.1922659
DO - 10.1145/1922649.1922659
ID - Duarte Jr.2011
ER -
TY - JOUR
AU - Preparata, F. P.
AU - Metze, G.
AU - Chien, R. T.
PY - 1967
DA - 1967//
TI - On the connection assignment problem of diagnosable systems
JO - IEEE Trans Electron Comput
VL - EC-16
UR - https://doi.org/10.1109/PGEC.1967.264748
DO - 10.1109/PGEC.1967.264748
ID - Preparata1967
ER -
TY - JOUR
AU - Hakimi, S. L.
AU - Nakajima, K.
PY - 1984
DA - 1984//
TI - On adaptive system diagnosis
JO - IEEE Trans Comput
VL - 33
UR - https://doi.org/10.1109/TC.1984.1676420
DO - 10.1109/TC.1984.1676420
ID - Hakimi1984
ER -
TY - JOUR
AU - Hosseini, S. H.
AU - Kuhl, J. G.
AU - Reddy, S. M.
PY - 1984
DA - 1984//
TI - A diagnosis algorithm for distributed computing systems with dynamic failure and repair
JO - IEEE Trans Comput
VL - C-33
UR - https://doi.org/10.1109/TC.1984.1676419
DO - 10.1109/TC.1984.1676419
ID - Hosseini1984
ER -
TY - JOUR
AU - Duarte, E. P.
AU - Nanya, T.
PY - 1998
DA - 1998//
TI - A hierarchical adaptive distributed system-level diagnosis algorithm
JO - IEEE Trans Comput
VL - 47
UR - https://doi.org/10.1109/12.656078
DO - 10.1109/12.656078
ID - Duarte1998
ER -
TY - JOUR
AU - Rangarajan, S.
AU - Dahbura, A. T.
AU - Ziegler, E. A.
PY - 1995
DA - 1995//
TI - A distributed system-level diagnosis algorithm for arbitrary network topologies
JO - IEEE Trans Comput
VL - 44
UR - https://doi.org/10.1109/12.364542
DO - 10.1109/12.364542
ID - Rangarajan1995
ER -
TY - STD
TI - Lamport L (2001) Paxos made simple. ACM SIGACT News (Distrib Comput Column) 32, 4 (Whole Number 121, December 2001). pp. 51–58.
ID - ref51
ER -
TY - CHAP
AU - Jacobson, V.
PY - 1988
DA - 1988//
TI - Congestion avoidance and control
BT - Symposium Proceedings on Communications Architectures and Protocols, SIGCOMM
PB - ACM
CY - New York
UR - https://doi.org/10.1145/52324.52356
DO - 10.1145/52324.52356
ID - Jacobson1988
ER -
TY - STD
TI - Paxson V, Allman M, Chu HKJ, Sargent M (2011) Computing TCP’s retransmission timer. http://www.rfc-editor.org/rfc/rfc6298.txt.
UR - http://www.rfc-editor.org/rfc/rfc6298.txt
ID - ref53
ER -
TY - STD
TI - Moraes DM, Duarte EP Jr (2011) A failure detection service for internet-based multi-AS distributed systems In: 17th International Conference on Parallel and Distributed Systems, 260–267. https://doi.org/10.1109/ICPADS.2011.5.
ID - ref54
ER -
TY - JOUR
AU - Zaharia, M.
AU - Xin, R. S.
AU - Wendell, P.
AU - Das, T.
AU - Armbrust, M.
AU - Dave, A.
AU - Meng, X.
AU - Rosen, J.
AU - Venkataraman, S.
AU - Franklin, M. J.
AU - Ghodsi, A.
AU - Gonzalez, J.
AU - Shenker, S.
AU - Stoica, I.
PY - 2016
DA - 2016//
TI - Apache Spark: a unified engine for big data processing
JO - Commun ACM
VL - 59
UR - https://doi.org/10.1145/2934664
DO - 10.1145/2934664
ID - Zaharia2016
ER -
TY - STD
TI - Manikandan SG, Ravi S (2014) Big data analysis using Apache Hadoop In: International Conference on IT Convergence and Security (ICITCS), 1–4. https://doi.org/10.1109/ICITCS.2014.7021746.
ID - ref56
ER -