MA, Teng; BOSILCA, George; BOUTEILLER, Aurélien; GOGLIN, Brice; SQUYRES, Jeffrey; DONGARRA, Jack

MA, Teng
Innovative Computing Laboratory [Knoxville] [ICL]

BOSILCA, George
Innovative Computing Laboratory [Knoxville] [ICL]

BOUTEILLER, Aurélien
Innovative Computing Laboratory [Knoxville] [ICL]

Report

This document was published in

2010-12, p. 11

Abstract

Deeper memory hierarchies, NUMA architectures and network-style interconnects are widely used in modern many-core CPU design to achieve performance scalability. As a leading intra-node programming model, Message Passing Interface (MPI) implementations must exploit these architectures to provide reliable performance portability. These new architectures not only require specialized MPI point-to-point messaging protocols, they also require carefully designed and tuned algorithms for MPI collective operations. Multiple issues must be taken into account: 1) minimizing the number of copies required, 2) minimizing traffic to "remote" NUMA memory, and 3) carefully avoiding memory bottlenecks for "rooted" collective operations. In this paper, we present a kernel-assisted intra-node collective module addressing those three issues on many-core systems. A kernel-level inter-process memory copy module, called KNEM, is used by a novel Open MPI collective module to implement several improved strategies based on decreasing the number of intermediate memory copies and improving locality to reduce both the pressure on the memory banks and the cache pollution. The collective topology is mapped onto the NUMA topology to minimize cross traffic on inter-socket links. Experiments illustrate that the KNEM-enabled Open MPI collective module can achieve up to a threefold speedup on synthetic benchmarks, resulting in a 12% improvement for a parallel graph shortest path discovery application.
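To make the idea of mapping a collective topology onto the NUMA topology concrete, here is a minimal toy model. It is not the paper's algorithm and uses neither KNEM nor Open MPI: it only counts how many transfers of a broadcast cross a NUMA boundary under two schedules. The machine layout (8 ranks, 2 per NUMA node) and all function names are hypothetical, chosen purely for illustration.

```python
from collections import defaultdict


def flat_broadcast(root, ranks):
    """Flat schedule: the root sends directly to every other rank."""
    return [(root, r) for r in ranks if r != root]


def numa_aware_broadcast(root, ranks, numa_of):
    """Two-level schedule: the root sends once to a leader in each remote
    NUMA node, and each leader forwards to the ranks in its own node."""
    by_node = defaultdict(list)
    for r in ranks:
        by_node[numa_of[r]].append(r)

    edges = []
    for node, members in by_node.items():
        # The root leads its own node; elsewhere pick the lowest rank.
        leader = root if node == numa_of[root] else min(members)
        if leader != root:
            edges.append((root, leader))
        edges.extend((leader, r) for r in members if r != leader)
    return edges


def cross_node_transfers(edges, numa_of):
    """Count transfers whose source and destination sit on different nodes."""
    return sum(1 for src, dst in edges if numa_of[src] != numa_of[dst])


# Hypothetical machine: 8 ranks, 4 NUMA nodes, 2 ranks per node.
ranks = list(range(8))
numa_of = {r: r // 2 for r in ranks}

flat = flat_broadcast(0, ranks)
tree = numa_aware_broadcast(0, ranks, numa_of)

print(cross_node_transfers(flat, numa_of))  # 6
print(cross_node_transfers(tree, numa_of))  # 3
```

Both schedules perform the same number of transfers (seven), but in this model the NUMA-aware one moves the data across a node boundary once per remote node instead of once per remote rank, and spreads the forwarding work away from the root's memory.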

Keywords

MPI

Multicore

Shared memory

NUMA

Kernel

Collective communication

Kernel Assisted Collective Intra-node Communication Among Multicore and Manycore CPUs
