Reliable management of checkpointing and application data in opportunistic grids

de Camargo, Raphael Y.; Castor, Fernando; Kon, Fabio

doi:10.1007/s13173-010-0016-0

Original Paper
Open access
Published: 28 July 2010

Journal of the Brazilian Computer Society, volume 16, pages 177–190 (2010)

Abstract

Opportunistic computational grids use idle processor cycles from shared machines to execute long-running parallel applications. Besides computational power, these applications may also consume and generate large amounts of data, requiring an efficient data storage and management infrastructure. In this article, we present an integrated middleware infrastructure that uses not only the idle processor cycles but also the unused disk space of shared machines. Our middleware stores application data on the shared machines in a distributed, redundant, and fault-tolerant way. A checkpointing-based mechanism monitors the execution of parallel applications, saves periodic checkpoints on the shared machines, and, in case of node failures, supports application migration across heterogeneous grid nodes. We evaluate the feasibility of our middleware through experiments and simulations. Our evaluation shows that the proposed middleware brings important improvements in the reliability of grid data management while imposing a low performance overhead.

Author information

Authors and Affiliations

Center for Mathematics, Computation and Cognition, Federal University of ABC (UFABC), R. Catequese, 242, Santo André/SP, 09090-400, Brazil
Raphael Y. de Camargo
Informatics Center, Federal University of Pernambuco (UFPE), Recife/PE, Brazil
Fernando Castor
Department of Computer Science, University of São Paulo (USP), São Paulo/SP, Brazil
Fabio Kon

Corresponding author

Correspondence to Raphael Y. de Camargo.

Additional information

This research was supported by CNPq/Brazil, grants #481147/2007-1 and #550895/2007-8.

Rights and permissions

Open Access. This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

de Camargo, R.Y., Castor, F. & Kon, F. Reliable management of checkpointing and application data in opportunistic grids. J Braz Comput Soc 16, 177–190 (2010). https://doi.org/10.1007/s13173-010-0016-0

Received: 22 January 2010
Accepted: 17 June 2010
Published: 28 July 2010
Issue Date: September 2010
DOI: https://doi.org/10.1007/s13173-010-0016-0