Computer Communications and Networks

Fault-Tolerance Techniques for High-Performance Computing

Editors: Thomas Herault, Yves Robert (Eds.)

  • The first complete overview of this increasingly important field
  • Presents a unique, rigorous approach based on the design of analytical models to predict performance
  • Provides a coherent collection of valuable insights from internationally renowned experts

Buy this book

eBook $99.00
price for USA (gross)
  • ISBN 978-3-319-20943-2
  • Digitally watermarked, DRM-free
  • Included format: PDF, EPUB
  • eBooks can be used on all reading devices
  • Immediate eBook download after purchase
Hardcover $129.00
price for USA
  • ISBN 978-3-319-20942-5
  • Free shipping for individuals worldwide
  • Usually dispatched within 3 to 5 business days.
Softcover $129.00
price for USA
  • To place your order: customers within the U.S. and Canada, please contact Customer Service at 1-800-777-4643; customers in Latin America, please contact us at +1-212-460-1500 (weekdays 8:30am – 5:30pm ET).
  • Due: November 4, 2016
  • ISBN 978-3-319-35560-3
  • Free shipping for individuals worldwide
About this book

This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). The text opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with application-specific techniques such as algorithm-based fault tolerance (ABFT). Emphasis is placed on analytical performance models. A review of general-purpose techniques follows, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.

Features:

  • Provides a survey of resilience methods and performance models
  • Examines the various sources of errors and faults in large-scale systems
  • Reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI
  • Investigates different approaches to replication
  • Discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems

Table of contents (5 chapters)

  • Fault Tolerance Techniques for High-Performance Computing

    Dongarra, Jack (et al.)

    Pages 3-85

  • Errors and Faults

    Gainaru, Ana (et al.)

    Pages 89-144

  • Fault-Tolerant MPI

    Bouteiller, Aurélien

    Pages 145-228

  • Using Replication for Resilience on Exascale Systems

    Casanova, Henri (et al.)

    Pages 229-278

  • Energy-Aware Checkpointing Strategies

    Aupy, Guillaume (et al.)

    Pages 279-317

Bibliographic Information

Book Title
Fault-Tolerance Techniques for High-Performance Computing
Editors
  • Thomas Herault
  • Yves Robert
Series Title
Computer Communications and Networks
Copyright
2015
Publisher
Springer International Publishing
Copyright Holder
Springer International Publishing Switzerland
eBook ISBN
978-3-319-20943-2
DOI
10.1007/978-3-319-20943-2
Hardcover ISBN
978-3-319-20942-5
Softcover ISBN
978-3-319-35560-3
Series ISSN
1617-7975
Edition Number
1
Number of Pages
IX, 320
Number of Illustrations and Tables
113 b/w illustrations
Topics