Reliable management of checkpointing and application data in opportunistic grids

Abstract

Opportunistic computational grids use idle processor cycles from shared machines to execute long-running parallel applications. Besides computational power, these applications may also consume and generate large amounts of data, which requires an efficient data storage and management infrastructure. In this article, we present an integrated middleware infrastructure that uses not only the idle processor cycles but also the unused disk space of shared machines. The middleware stores application data on the shared machines in a distributed, redundant, and fault-tolerant way. A checkpointing-based mechanism monitors the execution of parallel applications, saves periodic checkpoints on the shared machines, and, in case of node failures, supports application migration across heterogeneous grid nodes. We evaluate the feasibility of the middleware through experiments and simulations. The evaluation shows that the proposed middleware substantially improves the reliability of grid data management while imposing a low performance overhead.

Author information

Corresponding author

Correspondence to Raphael Y. de Camargo.

Additional information

This research was supported by CNPq/Brazil, grants #481147/2007-1 and #550895/2007-8.

Rights and permissions

Open Access: This article is distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

de Camargo, R.Y., Castor, F. & Kon, F. Reliable management of checkpointing and application data in opportunistic grids. J Braz Comput Soc 16, 177–190 (2010). https://doi.org/10.1007/s13173-010-0016-0

  • DOI: https://doi.org/10.1007/s13173-010-0016-0
