Algorithm Based Fault Tolerance: A Perspective from Algorithmic and Communication Characteristics of Parallel Algorithms

Kabir, Upama (2017) Algorithm Based Fault Tolerance: A Perspective from Algorithmic and Communication Characteristics of Parallel Algorithms. PhD thesis, Concordia University.

Text (application/pdf)
Kabir_PhD_S2018.pdf - Accepted Version
Available under License Spectrum Terms of Access.

2MB

Abstract

The checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for high-performance computing (HPC) applications. In comparison, Algorithm-Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it lacks universal applicability: each ABFT scheme is tied to a specific application or algorithm. To date, ABFT research has focused on providing fault tolerance for matrix-based algorithms for linear systems. Consequently, a comprehensive exploration is needed to widen the scope of ABFT to other types of parallel algorithms and applications. In this thesis, we go beyond traditional ABFT and focus on parallel applications that it does not cover. Rather than emphasizing a single application at a time, we consider the algorithmic and communication characteristics of a class of parallel applications to design efficient fault tolerance and recovery strategies for that class. The communication characteristics determine how to replicate the fault recovery data of a process (which we call the critical data) in a distributed manner, and the algorithmic characteristics determine which application-specific data must be replicated to minimize fault tolerance and recovery cost. Based on communication characteristics, parallel algorithms can be broadly classified as (i) embarrassingly parallel algorithms, where processes interact rarely, and (ii) communication-intensive parallel algorithms, where processes interact significantly. In this thesis, through different case studies, we design ABFT for these two categories of algorithms by considering their algorithmic and communication characteristics.
Analysis of these parallel algorithms reveals that a process contains sufficient information to rebuild a computational state if a failure occurs during the computation. We define this information as critical data: the minimal application-level data that must be saved (securely) so that a failed process can be fully recovered from the most recent consistent state. How the communication dependencies among processes are used to replicate the critical data directly affects the system's fault tolerance performance. We propose ABFT for parallel search algorithms, which belong to the class of embarrassingly parallel algorithms. Parallel search algorithms are well-known solution techniques for discrete optimization problems (DOPs). DOPs cover a broad class of (parallel) applications, including AI search problems, computer games such as chess, and the traveling salesman problem. As a case study, we choose the parallel iterative deepening A* (PIDA*) algorithm and integrate application-level fault tolerance into the algorithm by periodically replicating critical data to make it resilient. In the category of communication-intensive algorithms, we choose dynamic programming (DP), a widely used algorithmic paradigm for optimization problems. We take parallel DP algorithms as a case study and propose ABFT for such applications. We present a detailed analysis of the characteristics of parallel DP algorithms and show that the algorithmic features reduce the cardinality of the critical data to a single data item in the case of an $n$-data-dependent task. We demonstrate the idea with two popular DP applications: (i) the traveling salesman problem (TSP), and (ii) the longest common subsequence (LCS) problem. Minimal storage and recovery overhead are the prime concerns in fault tolerance (FT) design.
In that regard, we demonstrate that the critical data can be optimized further for particular classes of DP problems in which the degree of dependency of a subproblem is small and fixed at each iteration. We discuss this with the 0/1 knapsack problem as a case study and propose an ABFT scheme in which, instead of replicating the critical data, we replicate a bit-vector flag in a peer process's memory; the flag is later used to rebuild the lost data of a failed process. Theoretical and experimental results demonstrate that our proposed methods perform significantly better than conventional CP/R in terms of fault tolerance and recovery overheads, as well as storage overhead, in the presence of single and multiple simultaneous failures.
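
The bit-vector idea for the 0/1 knapsack problem can be illustrated with a small, single-process sketch. It is not the scheme from the thesis, whose details the abstract does not give; it only shows why one decision bit per table cell can be enough to rebuild lost DP values. The function names `knapsack_with_flags` and `rebuild_row` are hypothetical, and the "peer process's memory" is simulated here by an ordinary Python list.

```python
def knapsack_with_flags(weights, values, capacity):
    """Fill the 0/1 knapsack table one item (row) at a time.

    Returns the final row and, for every row, a vector of decision
    flags: flags[i][w] is True when item i was taken at capacity w.
    The flags stand in for the data a peer would hold (assumption).
    """
    row = [0] * (capacity + 1)
    flags = []
    for wt, val in zip(weights, values):
        prev = row
        row = prev[:]                      # default: item not taken
        bits = [False] * (capacity + 1)
        for w in range(wt, capacity + 1):
            take = prev[w - wt] + val
            if take > prev[w]:
                row[w] = take
                bits[w] = True             # record the decision only
        flags.append(bits)
    return row, flags


def rebuild_row(weights, values, capacity, flags, upto):
    """Rebuild the row reached after the first `upto` items.

    Uses only the problem input and the saved flags: each cell has a
    small, fixed dependency (two cells of the previous row), so the
    flag says which one to copy and no comparison is repeated.
    """
    row = [0] * (capacity + 1)
    for i in range(upto):
        prev = row
        wt, val = weights[i], values[i]
        row = [prev[w - wt] + val if flags[i][w] else prev[w]
               for w in range(capacity + 1)]
    return row
```

Calling `knapsack_with_flags(weights, values, capacity)` and then `rebuild_row(weights, values, capacity, flags, len(weights))` should reproduce the final row. Conceptually, the saved state is one bit per cell rather than one integer per cell, which is the kind of storage saving the abstract points to.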

Divisions:	Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering
Item Type:	Thesis (PhD)
Authors:	Kabir, Upama
Institution:	Concordia University
Degree Name:	Ph. D.
Program:	Computer Science
Date:	October 2017
Thesis Supervisor(s):	Goswami, Dhrubojyoti
Keywords:	HPC, ABFT, Fault Tolerance, Parallel and Distributed Systems, MPI, Parallel Search, Checkpointing, Parallel Dynamic Programming
ID Code:	983343
Deposited By:	UPAMA KABIR
Deposited On:	05 Jun 2018 14:49
Last Modified:	05 Jun 2018 14:49
