[1] N. Ali, S. Krishnamoorthy, M. Halappanavar, and J. Daily. Tolerating correlated failures for generalized cartesian distributions via bipartite matching. In Proceedings of the 8th ACM International Conference on Computing Frontiers, CF ’11, pages 36:1–36:10, New York, NY, USA, 2011. ACM. [2] N. Ali, S. Krishnamoorthy, M. Halappanavar, and J. Daily. Multi-fault tolerance for cartesian data distributions. International Journal of Parallel Programming, 41(3):469–493, 2013. [3] G. Y. Ananth, V. Kumar, and P. Pardalos. Parallel processing of discrete optimization problems. In IN ENCYCLOPEDIA OF MICROCOMPUTERS, pages 129–147. Marcel Dekker Inc, 1993. [4] J. Anfinson and F. T. Luk. A linear algebraic model of algorithm-based fault tolerance. IEEE Trans. Comput., 37(12):1599–1604, Dec. 1988. [5] A. C. S. Association. The computer failure data repository (cfdr). [6] P. Banerjee, J. T. Rahmeh, C. Stunkel, V. S. Nair, K. Roy, V. Balasubramanian, and J. A. Abraham. Algorithm-based fault tolerance on a hypercube multiprocessor. IEEE Trans. Comput., 39(9):1132–1145, Sept. 1990. [7] L. Bautista-Gomez, S. Tsuboi, D. Komatitsch, F. Cappello, N. Maruyama, and S. Matsuoka. Fti: high performance fault tolerance interface for hybrid systems. In Proc. International Conference for High Performance Computing, Networking, Storage and Analysis, 2011, SC ’11, pages 32:1–32:32, New York, NY, USA, 2011. ACM. [8] R. Bellman. On the theory of dynamic programming. Proceedings of the National Academy of Sciences of the United States of America, 38(8):716–719, August 1952. [9] R. Bellman. Dynamic Programming. Princeton University Press, Princeton, NJ, USA, 1957. [10] K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Karp, S. Keckler, D. Klein, R. Lucas, M. Richards, A. Scarpelli, S. Scott, A. Snavely, T. Sterling, R. S. Williams, K. Yelick, K. Bergman, S. Borkar, D. Campbell, W. Carlson, W. Dally, M. Denneau, P. Franzon, W. Harrod, J. Hiller, S. Keckler, D. Klein, P. Kogge, R. S. Williams, and K. Yelick. Exascale computing study: Technology challenges in achieving exascale systems, 2008. [11] D. L. Boley, R. P. Brent, G. H. Golub, and F. T. Luk. Algorithmic fault tolerance using the lanczos method. SIAM J. Matrix Anal. Appl., 13(1):312–332, Jan. 1992. [12] G. Bosilca, A. Bouteiller, T. Herault, Y. Robert, and J. Dongarra. Assessing the impact of abft and checkpoint composite strategies. In 2014 IEEE International Parallel Distributed Processing Symposium Workshops, pages 679–688, May 2014. [13] G. Bosilca, A. Bouteiller, T. Hérault, Y. Robert, and J. J. Dongarra. Composing resilience techniques: Abft, periodic and incremental checkpointing. IJNC, 5(1):2–25, 2015. [14] G. Bosilca, R. Delmas, J. Dongarra, and J. Langou. Algorithm-based fault tolerance applied to high performance computing. J. Parallel Distrib. Comput., 69(4):410–416, Apr. 2009. [15] G. Boslica, A. Bouteiller, T. Hérault, Y. Robert, and J. Dongarra, Jack. Assessing the impact of ABFT and checkpoint composite strategies. In IEEE International Parallel & Distributed Processing Symposium Workshops, pages 679–688, 2014. [16] A. Bouteiller, F. Cappello, T. Herault, G. Krawezik, P. Lemarinier, and F. Magniette. Mpich-v2: A fault tolerant mpi for volatile nodes based on pessimistic sender based message logging. In Proceedings of the 2003 ACM/IEEE Conference on Supercomputing, SC ’03, pages 25–, New York, NY, USA, 2003. ACM. [17] A. Bouteiller, T. Hérault, G. Bosilca, P. Du, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations, multiple failures and accuracy. ACM Trans. Parallel Comput., 1(2):1–28, Feb. 2015. [18] G. Bronevetsky, D. Marques, K. Pingali, and P. Stodghill. Automated application-level checkpointing of mpi programs. In Proceedings of the Ninth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’03, pages 84–94. ACM, 2003. [19] S. Caminiti, I. Finocchi, and E. G. Fusco. Local dependency dynamic programming in the presence of memory faults. In 28th International Symposium on Theoretical Aspects of Computer Science (STACS 2011), volume 9 of Leibniz International Proceedings in Informatics (LIPIcs), pages 45–56, Dagstuhl, Germany, 2011. Schloss Dagstuhl–Leibniz-Zentrum fuer Informatik. [20] S. Caminiti, I. Finocchi, E. G. Fusco, and F. Silvestri. Resilient dynamic programming. Algorithmica, 77(2):389–425, February 2017. [21] F. Cappello. Fault tolerance in petascale/ exascale systems: Current knowledge, challenges and research opportunities. Int. J. High Perform. Comput. Appl., 23(3):212–226, Aug. 2009. [22] F. Cappello, G. Al, W. Gropp, S. Kale, B. Kramer, and M. Snir. Toward exascale resilience: 2014 update. Supercomput. Front. Innov.: Int. J., 1(1):5–28, Apr. 2014. [23] F. Cappello, H. Casanova, and Y. Robert. Checkpointing vs. migration for post-petascale supercomputers. In Proc. 39th International Conference on Parallel Processing, 2010, ICPP ’10, pages 168–177, Washington, DC, USA, 2010. IEEE Computer Society. [24] H. Casanova, F. Vivien, and D. Zaidouni. Using Replication for Resilience on Exascale Systems, pages 229–278. Springer International Publishing, Cham, 2015. [25] S. Chakravorty and L. V. Kale. A fault tolerance protocol with fast fault recovery. In IEEE International Parallel and Distributed Processing Symposium, IPDPS. IEEE, 2007. [26] Z. Chen. Extending algorithm-based fault tolerance to tolerate fail-stop failures in high performance distributed environments. In Proceedings of IEEE International Symposium on Parallel and Distributed Processing, IPDPS, pages 1–8, New York, NY, USA, 2008. ACM. [27] Z. Chen. Algorithm-based recovery for iterative methods without checkpointing. In Proceedings of the 20th international symposium on High performance distributed computing, HPDC ’11, pages 73–84. ACM, 2011. [28] Z. Chen and J. Dongarra. Algorithm-based checkpoint-free fault tolerance for parallel matrix computations on volatile resources. In Proceedings of the 20th international conference on Parallel and distributed processing, IPDPS’06. IEEE Computer Society, 2006. [29] Z. Chen and J. Dongarra. Algorithm-based fault tolerance for fail-stop failures. IEEE Transactions on Parallel and Distributed Systems, 19(12):1628–1641, Dec 2008. [30] Z. Chen, G. E. Fagg, E. Gabriel, J. Langou, T. Angskun, G. Bosilca, and J. Dongarra. Fault tolerant high performance computing by a coding approach. In Proceedings of the tenth ACM SIGPLAN symposium on Principles and practice of parallel programming, PPoPP ’05, pages 213–223, New York, NY, USA, 2005. ACM. [31] D. Cook and R. C. Varnell. Adaptive parallel iterative deepening search. Journal of Artificial Intelligence Research, 9:139–166, 1998. [32] I. Cores, G. Rodrı́guez, M. J. Martı́n, P. González, and R. R. Osorio. Improving scalability of application-level checkpoint-recovery by reducing checkpoint sizes. New Generation Comput., 31(3):163–185, 2013. [33] I. Cores, M. Rodriguez, P. González, and M. J. Martı́n. Reducing the overhead of an MPI application-level migration approach. Parallel Computing, 54:72–82, 2016. [34] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson. Introduction to Algorithms. McGraw-Hill Higher Education, 2006. [35] C. Coti, T. Herault, P. Lemarinier, L. Pilard, A. Rezmerita, E. Rodriguez, and F. Cappello. Blocking vs. non-blocking coordinated checkpointing for largescale fault tolerant mpi. In Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, SC ’06, New York, NY, USA, 2006. ACM. [36] DANA-Faber Cancer Institute and Harvard School of Public Health. The computational biology and functional genomics laboratory. http://compbio.dfci. harvard.edu/tgi/, 2014. [37] S. Dasgupta, C. Papadimitriou, and U. Vazirani. Algorithms. McGraw-Hill Higher Education, 2008. [38] T. Davies, Z. Chen, C. Karlsson, and H. Liu. Algorithm-based recovery for hpl. In Proceedings of the 16th ACM symposium on Principles and practice of parallel programming, PPoPP ’11, pages 303–304, New York, NY, USA, 2011. [39] T. Davies, C. Karlsson, H. Liu, C. Ding, and Z. Chen. High performance linpack benchmark: a fault tolerant implementation without checkpointing. In Proceedings of the international conference on Supercomputing, ICS ’11, pages 162–171, New York, NY, USA, 2011. ACM. [40] D. A. G. de Oliveira, L. Pilla, C. Lunardi, L. Carro, P. O. Navaux, and P. Rech. The path to exascale: Code optimizations and hardening solutions reliability. In Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale, FTXS ’15, pages 55–62, New York, NY, USA, 2015. ACM. [41] J. Dongarra, H. Meuer, and E. Strohmaier. Top 500 supercomputing sites, Jun 2017. [42] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. In Proceedings of the 17th ACM SIGPLAN symposium on Principles and Practice of Parallel Program- ming, PPoPP ’12, pages 225–234. ACM, 2012. [43] P. Du, A. Bouteiller, G. Bosilca, T. Herault, and J. Dongarra. Algorithm-based fault tolerance for dense matrix factorizations. SIGPLAN Not., 47(8):225–234, Feb. 2012. [44] I. P. Egwutuoha, D. Levy, B. Selic, and S. Chen. A survey of fault tolerance mechanisms and checkpoint/restart implementations for high performance computing systems. J. Supercomput., 65(3):1302–1326, Sept. 2013. [45] E. N. M. Elnozahy, L. Alvisi, Y.-M. Wang, and D. B. Johnson. A survey of rollback-recovery protocols in message-passing systems. ACM Comput. Surv., 34(3):375–408, Sept. 2002. [46] C. Engelmann and F. Lauer. Facilitating co-design for extreme-scale systems through lightweight simulation. In IEEE International Conference On Cluster Computing Workshops and Posters (CLUSTER WORKSHOPS), 2010. [47] G. E. Fagg and J. Dongarra. FT-MPI: fault tolerant mpi, supporting dynamic applications in a dynamic world. In Recent Advances in Parallel Virtual Machine and Message Passing Interface, 7th European PVM/MPI Users’ Group Meeting, Balatonfüred, Hungary, September 2000, Proceedings, pages 346–353, 2000. [48] P. Felber, X. Défago, R. Guerraoui, and P. Oser. Failure detectors as first class objects. In Proceedings of the International Symposium on Distributed Objects and Applications, DOA ’99, pages 132–, Washington, DC, USA, 1999. IEEE Computer Society. [49] P. F. Felzenszwalb and R. Zabih. Dynamic programming and graph algorithms in computer vision. IEEE Trans. Pattern Anal. Mach. Intell., 33(4):721–740, Apr. 2011. [50] K. Ferreira, J. Stearley, J. H. Laros, III, R. Oldfield, K. Pedretti, R. Brightwell, R. Riesen, P. G. Bridges, and D. Arnold. Evaluating the viability of process replication reliability for exascale systems. In Proceedings of 2011 International Conference for High Performance Computing, Networking, Storage and Analysis, SC ’11, pages 44:1–44:12, New York, NY, USA, 2011. ACM. [51] S. Fu. Failure-aware resource management for high-availability computing clusters with distributed virtual machines. J. Parallel Distrib. Comput., 70(4):384–393, Apr. 2010. [52] Q. Gao, W. Huang, M. J. Koop, and D. K. Panda. Group-based coordinated checkpointing for mpi: A case study on infiniband. In Proceedings of the 2007 International Conference on Parallel Processing, ICPP ’07, pages 47–, Wash- ington, DC, USA, 2007. IEEE Computer Society. [53] M. R. Garey and D. S. Johnson. Computers and Intractability; A Guide to the Theory of NP-Completeness. W. H. Freeman & Co., New York, NY, USA, 1990. [54] A. Geist and R. Lucas. Major computer science challenges at exascale. Int. J. High Perform. Comput. Appl., 23(4):427–436, Nov. 2009. [55] G. Gibson, B. Schroeder, and J. Digney. Failure tolerance in petascale computers. CTWatch Quarterly, 3(4):4–10, Nov. 2004. [56] A. Grama and V. Kumar. State of the art in parallel search techniques for discrete optimization problems. IEEE Trans. on Knowl. and Data Eng., 11(1):28–35, Jan. 1999. [57] W. Gropp, E. Lusk, N. Doss, and A. Skjellum. A high-performance, portableimplementation of the mpi message passing interface standard. Parallel Comput., 22(6):789–828, Sept. 1996. [58] I. Gupta, T. D. Chandra, and G. S. Goldszmidt. On scalable and efficient distributed failure detectors. In Proceedings of the Twentieth Annual ACM Symposium on Principles of Distributed Computing, PODC ’01, pages 170–179, New York, NY, USA, 2001. ACM. [59] T. J. Hacker, F. Romero, and C. D. Carothers. An analysis of clustered failures on large supercomputing systems. J. Parallel Distrib. Comput., 69(7):652–665, July 2009. [60] J. Haines, V. Lakamraju, I. Koren, and C. M. Krishna. Application-level fault tolerance as a complement to system-level fault tolerance. J. Supercomput., 16(1-2):53–68, May 2000. [61] P. H. Hargrove and J. C. Duell. Berkeley lab checkpoint/restart (BLCR) for LINUX. [62] P. H. Hargrove and J. C. Duell. Berkeley lab checkpoint/restart (BLCR) for Linux clusters. Journal of Physics: Conference Series, 46(1):494, 2006. [63] K.-H. Huang and J. A. Abraham. Algorithm-based fault tolerance for matrix operations. IEEE Trans. Comput., 33(6):518–528, June 1984. [64] G. Jakadeesan and D. Goswami. A classification-based approach to fault-tolerance support in parallel programs. Parallel and Distributed Computing Applications and Technologies, International Conference on, 0:255–262, 2009. [65] V. Kumar, A. Grama, A. Gupta, and G. Karypis. Introduction to parallel computing: design and analysis of algorithms. Benjamin-Cummings Publishing Co., Inc., Redwood City, CA, USA, 1994. [66] V. Kumar, K. Ramesh, and V. N. Rao. Parallel best-first search of state-space graphs: A summary of results. In in Proc. 10th Nat. Conf. AI, AAAI, pages 122–127. Press, 1988. [67] V. Kumar and V. N. Rao. Parallel depth first search. part ii. analysis. International Journal of Parallel Programming, 16(6):501–519, Dec 1987. [68] L. Lamport. Time, clocks, and the ordering of events in a distributed system. Commun. ACM, 21(7):558–565, July 1978. [69] H. Liu. An algorithm-based recovery scheme for extreme scale computing. In Proceedings of the 2011 IEEE International Symposium on Parallel and Distributed Processing Workshops and PhD Forum, IPDPSW ’11, pages 2010–2013, Washington, DC, USA, 2011. IEEE Computer Society. [70] F. T. Luk and H. Park. An analysis of algorithm-based fault tolerance techniques. J. Parallel Distrib. Comput., 5(2):172–184, Apr. 1988. [71] C. D. Martino, Z. T. Kalbarczyk, R. K. Iyer, F. Baccanico, J. Fullop, and W. Kramer. Lessons learned from the analysis of system failures at petascale: The case of blue waters. In 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, DSN, 2014, pages 610–621, 2014. [72] A. Oliner. What supercomputers say: A study of five system logs. In 37th Annual IEEE/IFIP International Conference on Dependable Systems and Networks, 2007. DSN ’07, pages 575–584, 2007. [73] J. S. Plank, K. Li, and M. A. Puening. Diskless checkpointing. IEEE Transactions on Parallel and Distributed Systems, 9(10):972–986, October 1998. [74] C. Quebec. Calcul quebec, compute canada. http://www.calculquebec.ca/en/, 2017. [75] V. N. Rao and V. Kumar. Parallel depth first search. part i. implementation. Int. J. Parallel Program., 16(6):479–499, Dec. 1987. [76] V. N. Rao, V. Kumar, and K. Ramesh. A parallel implementation of iterative-deepening-a. In Proceedings of the sixth National conference on Artificial intelligence - Volume 1, AAAI’87, pages 178–182. AAAI Press, 1987. [77] V. N. Rao, V. Kumar, and K. Ramesh. A parallel implementation of iterative-deepening-a. In Proceedings of the Sixth National Conference on Artificial Intelligence - Volume 1, AAAI’87, pages 178–182. AAAI Press, 1987. [78] A. Reinefeld and V. Schnecke. Work-load balancing in highly parallel depthfirst search. In In Scalable High Performance Computing Conference, pages 773–780, 1994. [79] R. K. Sahoo, A. Sivasubramaniam, M. S. Squillante, and Y. Zhang. Failure data analysis of a large-scale heterogeneous server environment. In Proceedings of the 2004 International Conference on Dependable Systems and Networks, DSN ’04, pages 772–, Washington, DC, USA, 2004. IEEE Computer Society. [80] R. D. Schlichting and F. B. Schneider. Fail-stop processors: an approach to designing fault-tolerant computing systems. ACM Trans. Comput. Syst., 1(3):222–238, Aug. 1983. [81] B. Schroeder and G. A. Gibson. A large-scale study of failures in high-performance computing systems. In Proceedings of the International Conference on Dependable Systems and Networks, DSN ’06, pages 249–258, Washington, DC, USA, 2006. IEEE Computer Society. [82] B. Schroeder and G. A. Gibson. Disk failures in the real world: what does an mttf of 1,000,000 hours mean to you? In Proceedings of the 5th USENIX conference on File and Storage Technologies, FAST ’07, Berkeley, CA, USA, 2007. USENIX Association. [83] B. Schroeder and G. A. Gibson. Understanding failures in petascale computers, 2007. [84] K. Shvachko, H. Kuang, S. Radia, and R. Chansler. The hadoop distributed file system. In Proceedings of the 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST), MSST ’10, pages 1–10, Washington, DC, USA, 2010. IEEE Computer Society. [85] M. Snir, R. W. Wisniewski, J. A. Abraham, S. V. Adve, S. Bagchi, P. Balaji, J. Belak, P. Bose, F. Cappello, B. Carlson, A. A. Chien, P. Coteus, N. A. Debardeleben, P. C. Diniz, C. Engelmann, M. Erez, S. Fazzari, A. Geist, R. Gupta, F. Johnson, S. Krishnamoorthy, S. Leyffer, D. Liberty, S. Mitra, T. Munson, R. Schreiber, J. Stearley, and E. V. Hensbergen. Addressing failures in exascale computing. Int. J. High Perform. Comput. Appl., 28(2):129–173, May 2014. [86] P. Stodghil, G. Bronevetsky, D. Marques, K. Pingali, and R. Fernandes. Optimizing checkpoint sizes in the c3 system. Parallel and Distributed Processing Symposium, International, 11:226a, 2005. [87] R. Subramaniyan, V. Aggarwal, A. Jacobs, and A. D. George. Fempi: A lightweight fault-tolerant mpi for embedded cluster systems. In Proc. International Conference on Embedded Systems and Applications (ESA), Las Vegas, pages 26–29, 2006. [88] B. W. Wah and G.-j. Li. Systolic processing for dynamic programming problems. Circuits, Systems and Signal Processing, 7(2):119–149, Jun 1988. [89] J. Walters and V. Chaudhary. Application-level checkpointing techniques for parallel programs. In Distributed Computing and Internet Technology, pages 221–234. Springer Berlin/Heidelberg, Springer Berlin/Heidelberg, 2006. [90] C.-L. Wang, F. C. M. Lau, and J. C. Y. Ho. Scalable group-based checkpoint/restart for large-scale message-passing systems. In IEEE International Parallel & Distributed Processing Symposium, pages 1–12, Los Alamitos, CA, USA, 2008. IEEE Computer Society. [91] P. Wang, Y. Du, H. Fu, X. Yang, and H. Zhou. Static analysis for application-level checkpointing of mpi programs. In 2008 10th IEEE International Conference on High Performance Computing and Communications, pages 548–555, Sept 2008. [92] R. Wang, E. Yao, M. Chen, G. Tan, P. Balaji, and D. Buntinas. Building algorithmically nonstop fault tolerant mpi programs. In HiPC, pages 1–9, 2011. [93] Z. Wei and D. Goswami. A synchronization-induced checkpoint protocol for group-synchronous parallel programs. In Proceedings of 13th International Conference on Parallel and Distributed Computing, Applications and Technologies (PDCAT 2012), pages 632–637. IEEE, 2012. [94] T. White. Hadoop: The Definitive Guide. O’Reilly Media, Inc., 1st edition, 2009. [95] M. Wu, X.-H. Sun, and H. Jin. Performance under failures of high-end computing. In Proceedings of the 2007 ACM/IEEE conference on Supercomputing, SC ’07, pages 48:1–48:11, New York, NY, USA, 2007. ACM. [96] P. Wu, N. DeBardeleben, Q. Guan, S. Blanchard, J. Chen, D. Tao, X. Liang, K. Ouyang, and Z. Chen. Silent data corruption resilient two-sided matrix factorizations. In Proceedings of the 22Nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, PPoPP ’17, pages 415–427, New York, NY, USA, 2017. ACM. [97] E. Yao, M. Chen, W. Zhang, and G. Tan. A new and efficient algorithm-based fault tolerance scheme for a million way parallelism. In Proceedings of CoRR, volume abs/1106.4213, 2011. [98] E. Yao, R. Wang, M. Chen, G. Tan, and N. Sun. A case study of designing efficient algorithm-based fault tolerant application for exascale parallelism. In Proceedings of the 2012 IEEE 26th International Parallel and Distributed Processing Symposium, IPDPS ’12, pages 438–448, Washington, DC, USA, 2012. IEEE Computer Society. [99] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker, and I. Stoica. Spark: Cluster computing with working sets. In Proceedings of the 2Nd USENIX Conference on Hot Topics in Cloud Computing, HotCloud’10, pages 10–10, Berkeley, CA, USA, 2010. USENIX Association. [100] Z. Zheng, L. Yu, W. Tang, Z. Lan, R. Gupta, N. Desai, S. Coghlan, and D. Buettner. Co-analysis of ras log and job log on blue gene/p. In Proceedings of the 2011 IEEE International Parallel & Distributed Processing Symposium, IPDPS ’11, pages 840–851, Washington, DC, USA, 2011. IEEE Computer Society. [101] Zuse Institute Berlin. Mp-testdata-traveling salesman problem instances. http://elib.zib.de/pub/mp-testdata/tsp/tsplib/tsplib.html.