Using Static Allocation Algorithms for Matrix Matrix Multiplication on Multicores and GPUs
EYRAUD-DUBOIS, Lionel
Reformulations based algorithms for Combinatorial Optimization [Realopt]
Laboratoire Bordelais de Recherche en Informatique [LaBRI]
Reformulations based algorithms for Combinatorial Optimization [Realopt]
Laboratoire Bordelais de Recherche en Informatique [LaBRI]
LAMBERT, Thomas
Reformulations based algorithms for Combinatorial Optimization [Realopt]
School of Computer Science [Manchester]
Reformulations based algorithms for Combinatorial Optimization [Realopt]
School of Computer Science [Manchester]
EYRAUD-DUBOIS, Lionel
Reformulations based algorithms for Combinatorial Optimization [Realopt]
Laboratoire Bordelais de Recherche en Informatique [LaBRI]
Reformulations based algorithms for Combinatorial Optimization [Realopt]
Laboratoire Bordelais de Recherche en Informatique [LaBRI]
LAMBERT, Thomas
Reformulations based algorithms for Combinatorial Optimization [Realopt]
School of Computer Science [Manchester]
< Leer menos
Reformulations based algorithms for Combinatorial Optimization [Realopt]
School of Computer Science [Manchester]
Idioma
en
Communication dans un congrès
Este ítem está publicado en
ICPP 2018 - 47th International Conference on Parallel Processing, 2018-08-13, Eugene, OR.
Resumen en inglés
We consider the problem of data allocation when performing matrix multiplication on a heterogeneous node, with multicores and GPUs. Classical (cyclic) allocations designed for homogeneous settings are not appropriate, but ...Leer más >
We consider the problem of data allocation when performing matrix multiplication on a heterogeneous node, with multicores and GPUs. Classical (cyclic) allocations designed for homogeneous settings are not appropriate, but the advent of task-based runtime systems makes it possible to use more general allocations. Previous theoretical work has proposed square and cube partitioning algorithms aimed at minimizing data movement for matrix multiplication. We propose techniques to adapt these continuous square partitionings to allocating discrete tiles of a matrix, and strategies to adapt the static allocation at run-time. We use these techniques in an implementation of Matrix Multiplication based on the StarPU runtime system, and we show through extensive experiments that this implementation allows to consistently obtain a lower communication volume while improving slightly the execution time, compared to standard state-of-the-art dynamic strategies.< Leer menos
Palabras clave en inglés
Mathematics of computing → Solvers
Computing methodologies → Linear algebra algorithms
Parallel algorithms
General and reference → Performance
Proyecto ANR
Solveurs pour architectures hétérogènes utilisant des supports d'exécution - ANR-13-MONU-0007
Orígen
Importado de HalCentros de investigación