AGULLO, Emmanuel; AUGONNET, Cédric; DONGARRA, Jack; FAVERGE, Mathieu; LTAIEF, Hatem; THIBAULT, Samuel; TOMOV, Stanimire

doi:10.1109/IPDPS.2011.90

hal.structure.identifier	High-End Parallel Algorithms for Challenging Numerical Simulations [HiePACS]
hal.structure.identifier	Laboratoire Bordelais de Recherche en Informatique [LaBRI]
dc.contributor.author	AGULLO, Emmanuel
hal.structure.identifier	Laboratoire Bordelais de Recherche en Informatique [LaBRI]
hal.structure.identifier	Efficient runtime systems for parallel architectures [RUNTIME]
dc.contributor.author	AUGONNET, Cédric
hal.structure.identifier	Innovative Computing Laboratory [Knoxville] [ICL]
dc.contributor.author	DONGARRA, Jack
hal.structure.identifier	Innovative Computing Laboratory [Knoxville] [ICL]
dc.contributor.author	FAVERGE, Mathieu
hal.structure.identifier	Innovative Computing Laboratory [Knoxville] [ICL]
dc.contributor.author	LTAIEF, Hatem
hal.structure.identifier	Laboratoire Bordelais de Recherche en Informatique [LaBRI]
hal.structure.identifier	Efficient runtime systems for parallel architectures [RUNTIME]
dc.contributor.author	THIBAULT, Samuel
hal.structure.identifier	Innovative Computing Laboratory [Knoxville] [ICL]
dc.contributor.author	TOMOV, Stanimire
dc.date.accessioned	2024-04-15T09:48:13Z
dc.date.available	2024-04-15T09:48:13Z
dc.date.issued	2011-05
dc.date.conference	2011-05-16
dc.identifier.uri	https://oskar-bordeaux.fr/handle/20.500.12278/198163
dc.description.abstractEn	One of the major trends in the design of exascale architectures is the use of multicore nodes enhanced with GPU accelerators. Exploiting all resources of a hybrid accelerators-based node at their maximum potential is thus a fundamental step towards exascale computing. In this article, we present the design of a highly efficient QR factorization for such a node. Our method is in three steps. The first step consists of expressing the QR factorization as a sequence of tasks of well chosen granularity that will aim at being executed on a CPU core or a GPU. We show that we can efficiently adapt high-level algorithms from the literature that were initially designed for homogeneous multicore architectures. The second step consists of designing the kernels that implement each individual task. We use CPU kernels from previous work and present new kernels for GPUs that complement kernels already available in the MAGMA library. We show the impact on performance of these GPU kernels. In particular, we present the benefits of new hybrid CPU/GPU kernels. The last step consists of scheduling these tasks on the computational units. We present two alternative approaches, respectively based on static and dynamic scheduling. In the case of static scheduling, we exploit the a priori knowledge of the schedule to perform successive optimizations leading to very high performance. We, however, highlight the lack of portability of this approach and its limitations to relatively simple algorithms on relatively homogeneous nodes. Alternatively, by relying on an efficient runtime system, StarPU, in charge of ensuring data availability and coherency, we can schedule more complex algorithms on complex heterogeneous nodes with much higher productivity. In this latter case, we show that we can achieve high performance in a portable way thanks to a fine interaction between the application and the runtime system.
We demonstrate that the obtained performance is very close to the theoretical upper bounds that we obtained using Linear Programming.
dc.language.iso	en
dc.title.en	QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators
dc.type	Communication dans un congrès
dc.identifier.doi	10.1109/IPDPS.2011.90
dc.subject.hal	Informatique [cs]/Calcul parallèle, distribué et partagé [cs.DC]
bordeaux.hal.laboratories	Laboratoire Bordelais de Recherche en Informatique (LaBRI) - UMR 5800	*
bordeaux.institution	Université de Bordeaux
bordeaux.institution	Bordeaux INP
bordeaux.institution	CNRS
bordeaux.conference.title	25th IEEE International Parallel & Distributed Processing Symposium
bordeaux.country	US
bordeaux.conference.city	Anchorage
bordeaux.peerReviewed	oui
hal.identifier	inria-00547614
hal.version	1
hal.invited	non
hal.proceedings	oui
hal.conference.end	2011-05-20
hal.popular	non
hal.audience	Internationale
hal.origin.link	https://hal.archives-ouvertes.fr//inria-00547614v1
bordeaux.COinS	ctx_ver=Z39.88-2004&rft_val_fmt=info:ofi/fmt:kev:mtx:journal&rft.date=2011-05&rft.au=AGULLO,%20Emmanuel&AUGONNET,%20C%C3%A9dric&DONGARRA,%20Jack&FAVERGE,%20Mathieu&LTAIEF,%20Hatem&rft.genre=unknown

Files in this item

Files	Size	Format	View
There are no files associated with this item.

This item appears in the following Collection(s)

Laboratoire Bordelais de Recherche en Informatique (LaBRI) - UMR 5800
