Fault-Tolerance Techniques for High-Performance Computing

  • Book
  • © 2015

Overview

  • The first complete overview of this increasingly important field
  • Presents a unique, rigorous approach based on the design of analytical models to predict performance
  • Provides a coherent collection of insights from internationally renowned experts
  • Includes supplementary material: sn.pub/extras

Part of the book series: Computer Communications and Networks (CCN)

Table of contents (5 chapters)

  1. General Overview

  2. Technical Contributions

About this book

This timely text presents a comprehensive overview of fault-tolerance techniques for high-performance computing (HPC). It opens with a detailed introduction to checkpoint protocols and scheduling algorithms, prediction, replication, and silent error detection and correction, together with application-specific techniques such as algorithm-based fault tolerance (ABFT), with an emphasis on analytical performance models. A review of general-purpose techniques follows, including several checkpoint and rollback recovery protocols. Relevant execution scenarios are also evaluated and compared through quantitative models.

Features:

  • Provides a survey of resilience methods and performance models
  • Examines the various sources of errors and faults in large-scale systems
  • Reviews the spectrum of techniques that can be applied to design a fault-tolerant MPI
  • Investigates different approaches to replication
  • Discusses the challenge of energy consumption of fault-tolerance methods in extreme-scale systems

Editors and Affiliations

  • University of Tennessee, Knoxville, USA

    Thomas Herault

  • Ecole Normale Supérieure de Lyon, Lyon, France

    Yves Robert

Bibliographic Information

  • Book Title: Fault-Tolerance Techniques for High-Performance Computing

  • Editors: Thomas Herault, Yves Robert

  • Series Title: Computer Communications and Networks

  • DOI: https://doi.org/10.1007/978-3-319-20943-2

  • Publisher: Springer Cham

  • eBook Packages: Computer Science, Computer Science (R0)

  • Copyright Information: Springer International Publishing Switzerland 2015

  • Hardcover ISBN: 978-3-319-20942-5, Published: 15 July 2015

  • Softcover ISBN: 978-3-319-35560-3, Published: 15 October 2016

  • eBook ISBN: 978-3-319-20943-2, Published: 01 July 2015

  • Series ISSN: 1617-7975

  • Series E-ISSN: 2197-8433

  • Edition Number: 1

  • Number of Pages: IX, 320

  • Number of Illustrations: 113 b/w illustrations

  • Topics: System Performance and Evaluation, Performance and Reliability, Numeric Computing
