- Original Paper
- Open access
Reliable management of checkpointing and application data in opportunistic grids
Journal of the Brazilian Computer Society volume 16, pages 177–190 (2010)
Abstract
Opportunistic computational grids use idle processor cycles from shared machines to execute long-running parallel applications. Besides computational power, these applications may also consume and generate large amounts of data, which requires an efficient data storage and management infrastructure. In this article, we present an integrated middleware infrastructure that uses not only the idle processor cycles but also the unused disk space of shared machines. Our middleware stores application data reliably across the shared machines in a redundant, fault-tolerant way. A checkpointing-based mechanism monitors the execution of parallel applications, periodically saves checkpoints in the shared machines, and, when nodes fail, supports application migration across heterogeneous grid nodes. We evaluate the feasibility of our middleware using experiments and simulations. Our evaluation shows that the proposed middleware substantially improves the reliability of grid data management while imposing a low performance overhead.
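The checkpoint-and-restart idea summarized in the abstract can be illustrated with a toy sketch. It is not the middleware described in the article: the `ReplicatedStore` class, the `run` function, and the key name `"ckpt"` are hypothetical, the shared machines are simulated with in-memory dictionaries, and plain full replication is assumed for redundancy, which may differ from the scheme the authors actually use.

```python
import pickle


class ReplicatedStore:
    """Hypothetical stand-in for disk space on several shared machines."""

    def __init__(self, n_nodes):
        self.nodes = [dict() for _ in range(n_nodes)]
        self.alive = [True] * n_nodes

    def put(self, key, blob):
        # Assumption: write a full copy to every node that is currently up.
        for node, up in zip(self.nodes, self.alive):
            if up:
                node[key] = blob

    def get(self, key):
        # Read from the first surviving node that holds the key.
        for node, up in zip(self.nodes, self.alive):
            if up and key in node:
                return node[key]
        raise KeyError(key)

    def fail(self, index):
        # Simulate a shared machine leaving the grid.
        self.alive[index] = False


def run(store, total_steps, every, crash_at=None):
    """Sum 0..total_steps-1, saving a checkpoint every `every` steps."""
    try:
        state = pickle.loads(store.get("ckpt"))  # resume if a checkpoint exists
    except KeyError:
        state = {"step": 0, "acc": 0}            # otherwise start from scratch
    while state["step"] < total_steps:
        if crash_at is not None and state["step"] == crash_at:
            raise RuntimeError("simulated node failure")
        state["acc"] += state["step"]
        state["step"] += 1
        if state["step"] % every == 0:
            store.put("ckpt", pickle.dumps(state))
    return state["acc"]
```

In this sketch, a run that crashes partway through can be restarted with the same store, even after one storage node has failed, and it resumes from the most recent checkpoint rather than from step zero.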
Additional information
This research was supported by CNPq/Brazil, grants #481147/2007-1 and #550895/2007-8.
Rights and permissions
Open Access: This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
de Camargo, R.Y., Castor, F. & Kon, F. Reliable management of checkpointing and application data in opportunistic grids. J Braz Comput Soc 16, 177–190 (2010). https://doi.org/10.1007/s13173-010-0016-0
DOI: https://doi.org/10.1007/s13173-010-0016-0