High performance checksum computation for fault-tolerant MPI over InfiniBand
DENIS, Alexandre
Efficient runtime systems for parallel architectures [RUNTIME]
Laboratoire Bordelais de Recherche en Informatique [LaBRI]
Language
en
Conference paper
This item is published in
The 19th European MPI Users' Group Meeting (EuroMPI 2012), Vienna, 2012-09-24, vol. 7490
Springer
English abstract
As the number of nodes in clusters increases, so does the probability of failures and unusual events. In this paper, we present checksum mechanisms to detect data corruption. We study the impact of checksums on network communication performance and propose a mechanism to amortize their cost on InfiniBand. We have implemented our mechanisms in the NEWMADELEINE communication library. Our evaluation shows that our mechanisms to ensure message integrity do not noticeably impact application performance, which is an improvement over state-of-the-art MPI implementations.
English keywords
NewMadeleine
MadMPI
MPI
Source
Imported from HAL