Kabir, Upama (2017) Algorithm Based Fault Tolerance: A Perspective from Algorithmic and Communication Characteristics of Parallel Algorithms. PhD thesis, Concordia University.

Abstract
Checkpoint and recovery cost imposed by checkpoint/restart (CP/R) is a crucial performance issue for high-performance computing (HPC) applications. In comparison, Algorithm-Based Fault Tolerance (ABFT) is a promising fault tolerance method with low recovery overhead, but it lacks universal applicability, i.e., it is tied to a specific application or algorithm. To date, ABFT research has focused on providing fault tolerance for matrix-based algorithms for linear systems. Consequently, a comprehensive exploration of ABFT is needed to widen its scope to other types of parallel algorithms and applications. In this thesis, we go beyond traditional ABFT and focus on other types of parallel applications that it does not cover. Rather than emphasizing a single application at a time, we consider the algorithmic and communication characteristics of a class of parallel applications to design efficient fault tolerance and recovery strategies for that class. The communication characteristics determine how to distributively replicate the fault recovery data of a process (we call it the critical data), and the algorithmic characteristics determine what application-specific data must be replicated to minimize fault tolerance and recovery cost. Based on communication characteristics, parallel algorithms can be broadly classified as (i) embarrassingly parallel algorithms, where processes have infrequent or rare interactions, and (ii) communication-intensive parallel algorithms, where processes interact significantly. In this thesis, through different case studies, we design ABFT for these two categories of algorithms by considering their algorithmic and communication characteristics.
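As an illustration of the distributed replication idea described above, the following Python sketch shows a process periodically pushing its critical data into a peer's memory, from which a restarted process can restore itself. All names here (`Process`, `checkpoint_to`, `recover_from`) are hypothetical; a real implementation would replicate the data via MPI point-to-point messages between ranks.

```python
import copy

class Process:
    """Toy stand-in for an MPI rank that replicates its critical
    data in a peer's memory at fixed intervals (hypothetical API)."""

    def __init__(self, rank):
        self.rank = rank
        self.critical_data = {}   # minimal state needed for recovery
        self.peer_replica = {}    # copy of a peer's critical data

    def checkpoint_to(self, peer, iteration, interval=10):
        # Replicate only every `interval` iterations to bound overhead.
        if iteration % interval == 0:
            peer.peer_replica = copy.deepcopy(self.critical_data)

    def recover_from(self, peer):
        # A restarted process pulls its last replicated state back.
        self.critical_data = copy.deepcopy(peer.peer_replica)

p0, p1 = Process(0), Process(1)
p0.critical_data = {"best_bound": 42, "frontier": [3, 7]}
p0.checkpoint_to(p1, iteration=10)   # replication point reached
p0.critical_data = {}                # simulate a process failure
p0.recover_from(p1)                  # rebuild from the peer's replica
print(p0.critical_data)
```

The interval parameter captures the trade-off the thesis studies: more frequent replication shortens the rollback window but raises communication cost.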
Analysis of these parallel algorithms reveals that a process contains sufficient information to rebuild its computational state if a failure occurs during the computation. We define this information as critical data: the minimal application-level data that must be saved (securely) so that a failed process can be fully recovered to its most recent consistent state using this fault recovery data. How the communication dependencies among processes are utilized to replicate the fault recovery data directly affects the system's fault tolerance performance. We propose ABFT for parallel search algorithms, which belong to the class of embarrassingly parallel algorithms. Parallel search algorithms are well-known solution techniques for discrete optimization problems (DOP), which cover a broad class of (parallel) applications ranging from AI search problems to computer games such as Chess and to the traveling salesman problem. As a case study, we choose the parallel iterative deepening A* (PIDA*) algorithm and integrate application-level fault tolerance with the algorithm by replicating critical data periodically to make it resilient. In the category of communication-intensive algorithms, we choose dynamic programming (DP), a widely used algorithmic paradigm for optimization problems. We choose the parallel DP algorithm as a case study and propose ABFT for such applications. We present a detailed analysis of the characteristics of parallel DP algorithms and show that the algorithmic features reduce the cardinality of the critical data to a single data item in the case of an n-data-dependent task. We demonstrate the idea with two popular DP applications: (i) the traveling salesman problem (TSP) and (ii) the longest common subsequence (LCS) problem. Minimal storage and recovery overheads are the prime concerns in FT design.
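The LCS case illustrates why DP critical data can be so small. A minimal sketch, assuming the standard row-by-row LCS recurrence (the thesis's parallel decomposition is not reproduced here): only the previous DP row is live at any point, so it is the natural candidate for the critical data to replicate.

```python
def lcs_length(a, b):
    # Row-by-row LCS: dp[i][j] depends only on row i-1 and the
    # current row, so replicating one row suffices to resume at i.
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                curr[j] = prev[j - 1] + 1
            else:
                curr[j] = max(prev[j], curr[j - 1])
        prev = curr  # `prev` is the critical data for iteration i
    return prev[-1]

print(lcs_length("ABCBDAB", "BDCABA"))  # prints 4 ("BCBA")
```

A failed process restarted with the last replicated row and its loop index recomputes only the rows produced since that replication point, rather than the whole table.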
In that regard, we demonstrate that further optimization of the critical data is possible for a particular class of DP problems, where the degree of dependency of a subproblem is small and fixed at each iteration. We discuss this with the 0/1 knapsack problem as a case study and propose an ABFT scheme where, instead of replicating the critical data, we replicate a bit-vector flag in a peer process's memory, which is later used to rebuild the lost data of a failed process. Theoretical and experimental results demonstrate that our proposed methods perform significantly better than conventional CP/R in terms of fault tolerance, recovery, and storage overheads, in the presence of both single and multiple simultaneous failures.
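The bit-vector idea can be sketched as follows, under my assumption (not spelled out in this abstract) that each bit records whether item i was taken at a given capacity. With the previous row surviving elsewhere, the replicated bits rebuild a lost row without re-running the max comparisons, and a bit costs far less storage than the value it replaces.

```python
def knapsack_row(prev, weight, value):
    """Compute DP row i from row i-1 for the 0/1 knapsack,
    recording one bit per capacity: 1 if item i was taken."""
    n = len(prev)
    row, bits = [0] * n, [0] * n
    for w in range(n):
        take = prev[w - weight] + value if w >= weight else -1
        if take > prev[w]:
            row[w], bits[w] = take, 1
        else:
            row[w] = prev[w]
    return row, bits

def rebuild_row(prev, bits, weight, value):
    # Recovery: the replicated bit-vector plus the surviving
    # previous row reconstructs the lost row deterministically.
    return [prev[w - weight] + value if b else prev[w]
            for w, b in enumerate(bits)]

prev = [0, 0, 3, 3, 3]            # row for items so far, capacities 0..4
row, bits = knapsack_row(prev, weight=2, value=4)
assert rebuild_row(prev, bits, weight=2, value=4) == row
print(row, bits)                  # [0, 0, 4, 4, 7] [0, 0, 1, 1, 1]
```

This is a sketch of the recovery principle only; the thesis's scheme additionally handles the distribution of rows and bit-vectors across processes.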
Divisions:  Concordia University > Gina Cody School of Engineering and Computer Science > Computer Science and Software Engineering 

Item Type:  Thesis (PhD) 
Authors:  Kabir, Upama 
Institution:  Concordia University 
Degree Name:  Ph.D. 
Program:  Computer Science 
Date:  October 2017 
Thesis Supervisor(s):  Goswami, Dhrubojyoti 
Keywords:  HPC, ABFT, Fault Tolerance, Parallel and Distributed Systems, MPI, Parallel Search, Checkpointing, Parallel Dynamic Programming 
ID Code:  983343 
Deposited By:  UPAMA KABIR 
Deposited On:  05 Jun 2018 14:49 
Last Modified:  05 Jun 2018 14:49 