Autotuning GEMM Kernels for the Fermi GPU

Jakub Kurzak, Member, IEEE, Stanimire Tomov, Member, IEEE, and Jack Dongarra, Life Fellow, IEEE

Abstract—in recent years, the use of graphics chips has been recognized as a viable way of accelerating scientific and engineering applications, even more so since the introduction of the Fermi architecture by NVIDIA, with features essential to numerical computing, such as fast double precision arithmetic and memory protected with error correction codes. Being the crucial component of numerical software packages, such as LAPACK and ScaLAPACK, the general dense matrix multiplication routine is one of the more important workloads to be implemented on these devices. This article presents a methodology for producing matrix multiplication kernels tuned for a specific architecture, through a canonical process of heuristic autotuning, based on generation of multiple code variants and selecting the fastest ones through benchmarking. The key contribution of this work is in the method for generating the search space; specifically, pruning it to a manageable size. Performance numbers match or exceed other available implementations.

Index Terms—graphics processing unit, matrix multiplication, code generation, automatic tuning, GEMM, BLAS, CUDA.

1 INTRODUCTION

Graphics Processing Units (GPUs) maintain a strong lead over more traditional multicore CPUs in peak floating-point performance and memory bandwidth [31], which also translates to higher power efficiency. Hybrid accelerator-based systems have also been identified as likely candidates to deliver Exascale performance in the future [12], [22], [39]. Today, many key scientific and engineering applications rely on GPUs to deliver performance in excess of what standard multicores are capable of providing [24]. Due to its computational intensity and algorithmic regularity, dense linear algebra is a perfect candidate for GPU acceleration, and matrix multiplication is the canonical GPU programming example [31].

The hardware target of this article is the NVIDIA Fermi GPU (GF100 architecture) GPUs [33], [35], [32], the first line of GPUs with essential high performance computing features, such as high performance in double precision arithmetic and memory with Error Correction Code (ECC) protection. This device is usually programmed with NVIDIA’s Compute Unified Device Architecture (CUDA) [31]. The OpenCL standard [21] could be used as an alternative, but currently its available implementations are known to lag behind CUDA in performance [14].

The workload implemented here is general matrix multiplication, referred to as GEMM, following the Basic Linear Algebra Subroutines (BLAS) standard [6]. The GEMM routine is a building block of software packages such as LAPACK [3] and ScaLAPACK [8], absolutely essential to their performance, and can also be used as the basis for implementing all other Level-3 BLAS routines [20].

Not without significance is the fact that GEMM is also critical to the performance of the High Performance Linpack Benchmark (HPL) [13], used to rate the systems on the Top500 list of the fastest (disclosed) computers in the world. In June 2011, the top spot was captured by the Tianhe-1A supercomputer in China, a hybrid system based on Intel Xeon processors and NVIDIA Fermi GPUs. However, in November 2011, the Tianhe-1A supercomputer was pushed to the second place by the K computer in Japan, which does not use GPUs.

This work addresses the development of BLAS-compliant GEMM, with support for all parameters specified by the standard. Different variants of GEMM, with respect to the floating-point precision (single/double) and the type of arithmetic (real/complex), are referred to by their BLAS names (SGEMM, DGEMM, CGEMM, ZGEMM). Column-major matrix layout is used here, following the “legacy” BLAS interface and the convention of LAPACK and ScaLAPACK.

The software is being developed as a component of the Matrix Algebra for GPUs and Multicore Architectures (MAGMA) project [2]. It is the authors’ intention to use it as an initial proof-of-concept prototype for a framework, Automatic Stencil TuneR for Accelerators (ASTRA).

2 CUDA BASICS

In November 2006, NVIDIA introduced the Compute Unified Device Architecture (CUDA™), a general purpose parallel computing architecture, with a new parallel programming model and instruction set architecture, that leverages the parallel compute engine in NVIDIA GPUs to solve complex computational problems.

At its core are three key abstractions: a hierarchy of thread groups, shared memories, and barrier synchronization, that are exposed to the programmer as a set of language extensions. They guide the programmer to
partition the problem into coarse sub-problems that can be solved independently in parallel by blocks of threads, and each sub-problem into finer pieces that can be solved cooperatively in parallel by all threads within the block. CUDA C extends C by allowing the programmer to define C functions, called kernels, that, when called, are executed N times in parallel by N different CUDA threads.

The CUDA architecture is built around a scalable array of multithreaded Streaming Multiprocessors (SMs). When a CUDA program on the host CPU invokes a kernel grid, the blocks of the grid are enumerated and distributed to multiprocessors with available execution capacity. The threads of a thread block execute concurrently on one multiprocessor, and multiple thread blocks can execute concurrently on one multiprocessor. As thread blocks terminate, new blocks are launched on the vacated multiprocessors.

A multiprocessor is designed to execute hundreds of threads concurrently. To manage such a large amount of threads, it employs a unique architecture called Single-Instruction, Multiple-Thread (SIMT). The instructions are pipelined to leverage instruction-level parallelism within a single thread, as well as thread-level parallelism extensively with simultaneous hardware multithreading. However, unlike CPU cores, they are issued in order and there is no branch prediction and no speculative execution.

The multiprocessor creates, manages, schedules, and executes threads in groups of 32 parallel threads called warps. Individual threads composing a warp start together at the same program address, but they have their own instruction address counter and register state and are therefore free to branch and execute independently. The term warp originates from weaving, the first parallel thread technology. When a multiprocessor is given one or more thread blocks to execute, it partitions them into warps that get scheduled by a warp scheduler for execution.

3 Motivation
Initially, this work was motivated by the observation that, while CUBLAS and MAGMA single precision GEMM achieved much higher performance in complex arithmetic than in real arithmetic, double precision did not. The higher performance was due to the higher computational intensity of complex arithmetic and should have manifested itself equally in both single precision and double precision. Since this was not the case, a clear performance improvement opportunity presented itself. At the same time, there are important applications where complex double precision GEMM is essential [5].

The main motivation for this work, however, was the delivery of optimized GEMM GPU kernels, produced automatically through a robust process of code generation and autotuning. Until now, the GEMM kernels in MAGMA were produced through exhaustive experimentation, rather than a systematic autotuning process. With the new Kepler and Maxwell architectures planned for 2011 and 2013 respectively, as disclosed in NVIDIA’s roadmap, a much more sustainable process is in high demand.

It is of significance that the high level abstraction of CUDA maps very well to the hardware architectures of NVIDIA GPUs. Programmers do not have to resort to lower level abstractions, such as the Parallel Thread Execution (PTX) [34] (the pseudo-assembly of CUDA), for the development of fast GEMM kernels for NVIDIA cards. At the same time, CUDA GEMM codes proved not to be “performance-portable”, as was shown by efforts of porting kernels for the GT200 (Tesla) architecture to the GF100 (Fermi) architecture. This combination of factors makes NVIDIA GPUs attractive targets for autotuning efforts.

Autotuning is an attractive option for GPU code development. First of all, many important architectural details are proprietary knowledge, undisclosed to the public, such as the mechanism for scheduling of blocks within the device and scheduling of warps within the multiprocessor. Second, low-level programming constructs are inaccessible, i.e., there is no publicly available assembler for NVIDIA GPUs. The lowest accessible level is the one of PTX. Yet the strongest motivation for autotuning on GPUs is probably the complexity of these massively parallel, massively hardware-multithreaded devices.

For conventional architectures it has been shown that hand-tuning has the capability to outperform autotuning [17], [23], and also, tuning parameters can be determined analytically [44], [23]. Nonetheless, autotuned libraries are in wide use due to their ability to quickly adapt to new platforms. The autotuning approach presented here is envisioned to have similar benefits as new generations of GPUs are developed.

4 Related Work
The list of prominent autotuning software projects includes: Automatically Tuned Linear Algebra Software (ATLAS) [43], and its predecessor Portable High Performance ANSI C (PHiPAC) [7], Optimized Sparse Kernel Interface (OSKI) [42], Fastest Fourier Transform in the West (FFTW) [16] and SPIRAL [37] (code generation for digital signal processing transforms). All these projects address autotuning for standard processors (not accelerators).

Jiang and Snir [19] developed an ATLAS-like autotuning system for matrix multiplication on Nvidia GPUs before CUDA was available. As such, pre-CUDA efforts are often referred to as General Purpose GPU (GPGPU) programming. The BrookGPU language was used to express a parametrized kernel with 458,752 possible instantiations. The authors used ad-hoc pruning based on arbitrary problem-specific choices, and exploited orthogonality of parameters during the actual search.

Barrachina et al. [4] carried out a preliminary study of Level 3 CUDA BLAS (CUBLAS) using the GeForce 8800...
Ultra card (G80 architecture). Performance of SGEMM, SSYRK and STRSM routines was reported. Subsequently, three optimization techniques were applied, which did not involve any modifications to the CUDA source code: padding of input matrices, implementation of SSYRK and STRSM on top of SGEMM, and splitting the work between the GPU and a CPU. Altogether, the application of these techniques produced substantial performance improvements with minimal programming effort and no coding in CUDA.

Early work on tuning GEMMs in CUDA for NVIDIA GPUs targeted the previous generation of GPUs, of the GT200 architecture, such as the popular GTX 280. Pioneering work was done by Volkov and Demmel [41]. Similar efforts followed in the MAGMA project [25]. The introduction of the NVIDIA Fermi architecture triggered the development of MAGMA GEMM kernels tuned for that architecture [29], [28]. Although tuning was an important part of this work, it was accomplished through exhaustive experimentation rather than a systematic autotuning effort.

One important development in MAGMA was the implementation of complex GEMM routines by expressing the complex matrix multiplication through three real matrix multiplications and five real matrix additions [15], which results in up to 25% decrease in the number of floating-point operations [29]. However, Higham observes that this method has a fundamental numerical weakness, since the “imaginary part may be contaminated by relative errors much larger than those for conventional multiplication” [18]. Although, Higham also notes that “if the errors are measured relative to $||A|| ||B||$ [...], then they are just as small as for conventional multiplication” [18]. The method simply employs the SGEMM and DGEMM routines for an implementation of the CGEMM and ZGEMM routines with a reduced number of floating-point operations and different numerical properties. Since it does not involve implementation of any new kernels, it will not be further discussed here.

An important approach to the development of optimized GEMM routines is code generation through compiler transformations. Rudy et al. [38] presented the CUDA-ChiLL source-to-source compiler transformation and code generation framework, which transforms sequential loop nests to high-performance GPU code, based on a polyhedral transformation system CHILL [9]. Autotuning was used to explore a small parameter space (tiling in multiples of 16, up to 128). Fermi SGEMM $A \times B$ kernel was produced with performance slightly lower than CUBLAS, due to not using texture caches (which has been remedied since then, according to the authors).

Cui et al. [11] presented a similar system built using the Open64 compiler [36] and the WRaP-IT / URUK / URGenT polyhedral toolchain [10]. Here the authors started with optimized MAGMA/CUBLAS Fermi SGEMM kernels ($A \times B$, $A^T \times B$, $A \times B^T$, $A^T \times B^T$) and used automatic code transformations to extrapolate the SGEMM performance to the other three Level 3 BLAS kernels (STRMM, STRSM, SSYMM) with all combinations of inputs covered (left/right, lower/upper). Indeed, performance very close to SGEMM was reported for all the other kernels, greatly outperforming CUBLAS.

Two recent articles describe low-level development of GPU matrix multiplication, without any use of autotuning methodology. Tan et al. [40] describe a joint work by the Institute of Computing Technology, Chinese Academy of Science and NVIDIA on the the development of the double precision (DGEMM) $A \times B$ kernel to be used in the *High Performance Linpack* (HPL) benchmark for the Tianhe-1A and Nebulea Chinese supercomputers. The process is based on meticulous analysis of the hardware and instruction scheduling in assembly using NVIDIA’s assembler for Fermi (not publicly available). Impressive performance of 362 GFlop/s is reported, which corresponds to 70% efficiency.

Similar work has been done by Nakasato [27] who developed GEMM kernels for the Cypress GPU from ATI. Single and double precision $A \times B$ and $A^T \times B$ kernels were developed in real arithmetic (using row-major layout). An astounding performance of 2 TFlop/s in single precision and 470 GFlop/s in double precision was shown. The kernels were coded using AMD Intermediate Language (IL), an assembly-like language for the AMD IL virtual instruction set architecture [1].

5 Original Contribution

One contribution of this work is the introduction of a universal code stencil for producing all variants of the GEMM routine included in the BLAS standard. This universal code supports: real and complex arithmetic, single and double precision, transposed, non-transposed and conjugate transposed layout of input matrices. The code also supports memory access with and without using texture caches, with texture reads implemented as both 1D texture reads and 2D texture reads.

The main contribution of this work is in the search space generator, specifically in the mechanism for pruning the search space. Especially important is the fact that the size of the search space can easily be controlled and adjusted to a smaller size for quicker searches or to a bigger size for more exhaustive searches. At the same time, the parameters controlling the size of the search space are intuitive to anyone with basic understanding of the Single Instruction Multiple Threads (SIMT) GPU programming model.

Finally, the desired products of this work are GEMM kernels for the NVIDIA Fermi architecture that match or exceed existing CUBLAS kernels and previous MAGMA kernels in all cases, with significant improvement in the case of the complex double precision kernel (ZGEMM).

6 Solution

6.1 Hardware Target

A number of articles are available with the details of the Fermi architecture [33], [35], [32]. Here, the most
The primary tool in optimizing matrix multiplication is loop tiling. Tiling replaces one loop counter with the step of one and the outer loop counter with the step equal to the tiling factor. In the case of matrix multiplication, tiling replaces the three loops of Algorithm 1 with the six loops of Algorithm 2. Tiling improves locality of reference by exploiting the fact that matrix multiplication involves $O(n^3)$ floating-point operations over $O(n^2)$ data items, which is referred to as the surface to volume effect.

**Algorithm 1** Canonical form of matrix multiplication.

```
1: for $m = 0$ to $M$ do
2:    for $n = 0$ to $N$ do
3:        for $k = 0$ to $K$ do
4:            $C_{m,n} += \alpha A_{m,k} \times B_{k,n}$
5:        end for
6:    end for
7: end for
```

**Algorithm 2** Tiling of matrix multiplication.

```
1: for $\tilde{m} = 0$ to $M$ step $M_{tile}$ do
2:    for $\tilde{n} = 0$ to $N$ step $N_{tile}$ do
3:        for $\tilde{k} = 0$ to $K_{tile}$ do
4:            for $m = 0$ to $M_{tile}$ do
5:                for $n = 0$ to $N_{tile}$ do
6:                    for $k = 0$ to $K_{tile}$ do
7:                        $C_{\tilde{m}+m,\tilde{n}+n} += \alpha A_{\tilde{m}+m,\tilde{k}+k} \times B_{\tilde{k}+k,\tilde{n}+n}$
8:                    end for
9:                end for
10:            end for
11:        end for
12:    end for
13: end for
```

Scaling $C$ by $\beta$ is skipped for clarity.

Finally, the technique of loop unrolling is applied, which replaces the three innermost loops with a single block of straight-line code (a single basic block), as shown by Algorithm 3. The purpose of unrolling is twofold: to reduce the penalty of looping (the overhead of incrementing loop counters, advancing data pointers and branching), and to increase instruction-level parallelism by creating sequences of independent instructions, which can fill out the processor’s pipeline.

This optimization scheme is universal for almost any computer architecture, including “standard” superscalar processors with cache memories, GPU accelerators, and other unconventional architectures, such as the IBM Cell B. E. processor [23]. Tiling, also referred to as blocking, is often applied at multiple levels, e.g., L2 cache, L1 cache, registers file, etc. Unrolling an outer loop, and then fusing together copies of the inner loop, sometimes called unroll and jam, is usually applied until the point where the register file is exhausted. Unrolling the inner loop beyond that point, but without reordering instructions, is used to further reduce the overhead of looping.

6.2 GEMM Basics

The BLAS standard defines the general matrix multiplication operation as $C = \alpha A \times B + \beta C$, where $C$, $A$ and $B$ are matrices of sizes $m \times n$, $m \times k$ and $k \times n$, respectively and $\alpha$ and $\beta$ are scalars. In canonical form, matrix multiplication is represented by three nested loops (Algorithm 1) 1.

The primary tool in optimizing matrix multiplication is the technique of loop tiling. Tiling replaces one loop with two loops: the inner loop incrementing the loop counter with the step of one and the outer loop counter with the step equal to the tiling factor. In the case of matrix multiplication, tiling replaces the three loops of Algorithm 1 with the six loops of Algorithm 2. Tiling improves locality of reference by exploiting the fact that matrix multiplication involves $O(n^3)$ floating-point operations over $O(n^2)$ data items, which is referred to as the surface to volume effect.

Finally, the technique of loop unrolling is applied, which replaces the three innermost loops with a single block of straight-line code (a single basic block), as shown by Algorithm 3. The purpose of unrolling is twofold: to reduce the penalty of looping (the overhead of incrementing loop counters, advancing data pointers and branching), and to increase instruction-level parallelism by creating sequences of independent instructions, which can fill out the processor’s pipeline.

This optimization scheme is universal for almost any computer architecture, including “standard” superscalar processors with cache memories, GPU accelerators, and other unconventional architectures, such as the IBM Cell B. E. processor [23]. Tiling, also referred to as blocking, is often applied at multiple levels, e.g., L2 cache, L1 cache, registers file, etc. Unrolling an outer loop, and then fusing together copies of the inner loop, sometimes called unroll and jam, is usually applied until the point where the register file is exhausted. Unrolling the inner loop beyond that point, but without reordering instructions, is used to further reduce the overhead of looping.

1. In this paper, “for” loops are understood to go from the starting value up to, but not including, the ending value. For example, “for $i = A$ to $B$” is understood to be the same as the C code “for ($i = A$; $i < B$; $i++$)”.

1. In this paper, “for” loops are understood to go from the starting value up to, but not including, the ending value. For example, “for $i = A$ to $B$” is understood to be the same as the C code “for ($i = A$; $i < B$; $i++$)”.

Finally, the technique of loop unrolling is applied, which replaces the three innermost loops with a single block of straight-line code (a single basic block), as shown by Algorithm 3. The purpose of unrolling is twofold: to reduce the penalty of looping (the overhead of incrementing loop counters, advancing data pointers and branching), and to increase instruction-level parallelism by creating sequences of independent instructions, which can fill out the processor’s pipeline.

This optimization scheme is universal for almost any computer architecture, including “standard” superscalar processors with cache memories, GPU accelerators, and other unconventional architectures, such as the IBM Cell B. E. processor [23]. Tiling, also referred to as blocking, is often applied at multiple levels, e.g., L2 cache, L1 cache, registers file, etc. Unrolling an outer loop, and then fusing together copies of the inner loop, sometimes called unroll and jam, is usually applied until the point where the register file is exhausted. Unrolling the inner loop beyond that point, but without reordering instructions, is used to further reduce the overhead of looping.
Algorithm 3 Unrolling of matrix multiplication.

1: for \( \tilde{m} = 0 \) to \( M \) step \( M_{tile} \) do
2: for \( \tilde{n} = 0 \) to \( N \) step \( N_{tile} \) do
3: for \( \tilde{k} = 0 \) to \( K \) step \( K_{tile} \) do
4: instruction
5: instruction
6: instruction
7: ...
8: end for
9: end for
10: end for

6.3 Universal GEMM Stencil

6.3.1 General Structure

A GPU is a data-parallel device with the barrier being the only mechanism for synchronization. Therefore, parallelization relies on identifying independent work. Parallelization at the device level is shown on Figure 1. Matrix multiplication of the general form \( C = C + A \times B \) of size \( M_{dev} \times N_{dev} \times K_{dev} \) is parallelized by spanning matrix \( C \) with a two dimensional grid of tiles. Each tile is processed by one thread block. Each thread block passes through a \( M_{blk} \times K_{dev} \) stripe of \( A \) and a \( K_{dev} \times N_{blk} \) stripe of \( B \) and produces the final result for a \( M_{blk} \times N_{blk} \) tile of \( C \). In one iteration of the outermost loop a thread block produces the partial result of a \( M_{blk} \times N_{blk} \times K_{blk} \) matrix multiplication. (While \( K_{dev} \) is the loop boundary for the outermost loop, \( K_{blk} \) is the tiling factor for that loop.) The tile of \( C \) is read from the device memory and kept in registers throughout the duration of the thread block’s operation. \( M_{blk} \times K_{blk} \) stripe of \( A \) and \( K_{blk} \times N_{blk} \) stripe of \( B \) are placed in shared memory for each iteration of the outermost loop.

![Fig. 1. GEMM at the device level.](image1)

![Fig. 2. GEMM at the block level.](image2)

Two comments about the use of shared memory are in place here. One important detail is that the shared memory is allocated in a skewed fashion, i.e. an array of size \( M \times N \) is declared as \( M \times N + 1 \), which is the usual “trick” to eliminate bank conflicts whether a warp accesses the matrix by rows or by columns. Skewing is required for matrix \( A \) if it is transposed, and always required for matrix \( B \). Here, for simplicity, skewing is always applied to both matrices.

Another comment concerns the use of the shared memory in general. For some GPU architectures the shared memory can be bypassed altogether for one of the input matrices [41], [27]. This turns out not to be the case on the Fermi, where the use of shared memory is required to mitigate strided access to the device memory and redundant reads from the device memory by multiple threads in the same block. Here, matrices \( A \) and \( B \) are always placed in the shared memory, which also simplifies the coding of different transposed/non-transposed scenarios.
6.3.2 Pipelined Loop

The last important concept in optimizing matrix multiplication is the classic technique of software pipelining (Figure 4), also referred to as double buffering. The objective of software pipelining is to increase the instruction level parallelism by overlapping arithmetic operations from even iterations with memory operations from odd iterations. The concept is especially applicable to dual-issue architectures with two pipelines, one devoted to arithmetic and the other to memory accesses, which can issue instructions in the same cycle. An iconic dual-issue architecture was the IBM Cell B. E. processor.

The Fermi chip is also a dual-issue architecture and double buffering is applicable to the GEMM loops. One difference from the canonical structure of Figure 4 stems from the data-parallel nature of the device. Since all the work is partitioned to completely independent sets, there is no need for storing the results until the very end. Because of that, computations of even iterations are only overlapped with reads of odd iterations, but not stores, which are done at the very end.

Figure 5 shows the flow of data through the memory system of the Fermi GPU. Normally, data is read to registers and written back to the memory through L1 and L2 caches. Alternatively, read-only data can be read through texture caches. Texture caches do not have any fundamental performance advantage over L1 and L2 caches. However, passing read-only data through texture caches leaves more space in L1 and L2 for read/write data, register spills, etc. For GEMMs it always pays off to use texture caches to access the A and B input matrices. An added advantage is the use of texture clamping to avoid cleanup codes.

Data can also be transferred between the registers and the shared memory. The purpose of the shared memory is to enable communication between threads. It is a fallacy to think of the shared memory as a cache / scratchpad memory / local store. The GPU philosophy is to hide memory latency through massive

6.3.2 Pipelined Loop

SIMT, not through the use of a scratchpad memory. If one treats the GPU as a vector device, where a warp is the vector, then the shared memory can be thought of as a vector register file, enabling shuffles / permutations of vector elements. In the context of matrix multiplication, the shared memory allows one to deal efficiently with transposed access.

Algorithm 4 shows the pseudocode for the generic GEMM stencil. The code follows the classic pipelined loop scheme with a prologue and epilogue. In the steady state, the loop body loads elements of A and B from shared memory to registers and computes their product, while at the same time loading another batch of elements of A and B to a separate set of registers. The loading from the device memory to the “odd” registers happens in lines 6 an 7. The loading from the shared memory to the “even” registers and the computation happens in the loop between lines 8 and 12. The values in the “odd” registers are passed to the shared memory in lines 14 and 15, to be used in the upcoming iteration

Fig. 3. GEMM at the thread level.

Fig. 4. Software pipelining.

Fig. 5. Data paths in the Fermi GPU. L1 and L2 are caches, T1 and T2 are texture caches.
of the loop. This last part has to be protected with the
\_syncthreads() barriers, to avoid the data hazard on
access to the shared memory. One of these barriers can
be eliminated at the cost of doubling the shared mem-
ory usage. This, however, decreases occupancy, which
inevitably decreases performance. The factors alpha and
beta are applied when storing the results in the device
memory (line 23). This pipelining scheme, developed for
the MAGMA project by Nath et al. [29] has shown itself
to be fastest in practice.

Algorithm 4 Generic GEMM stencil pseudocode.

```c
1: C\_regs \leftarrow 0
2: A_0 \text{dev} \Rightarrow \text{shmem}
3: B_0 \text{dev} \Rightarrow \text{shmem}
4: \_syncthreads();
5: for K = 0 to K_{\text{dev}} - K_{\text{blk}} step K_{\text{blk}} do
6: A_{\text{odd dev}} \Rightarrow \text{regs}
7: B_{\text{odd dev}} \Rightarrow \text{regs}
8: for k = 0 to K_{\text{blk}} step 1 do
9: A_{\text{even}[k]} \text{shmem} \Rightarrow \text{regs}
10: B_{\text{even}[k]} \text{shmem} \Rightarrow \text{regs}
11: C = C + A_{\text{even}[k]} \times B_{\text{even}[k]}
12: end for
13: \_syncthreads();
14: A_{\text{odd regs}} \Rightarrow \text{shmem}
15: B_{\text{odd regs}} \Rightarrow \text{shmem}
16: \_syncthreads();
17: end for
18: for k = 0 to K_{\text{blk}} step 1 do
19: A_{\text{dev}[k]} \text{shmem} \Rightarrow \text{regs}
20: B_{\text{dev}[k]} \text{shmem} \Rightarrow \text{regs}
21: C = C + A_{\text{even}[k]} \times B_{\text{even}[k]}
22: end for
23: C_{\text{dev}} = alpha \times C_{\text{regs}} + beta \times C_{\text{dev}}
regs - registers, shmem - shared memory, dev - device memory
even - even iteration, odd - odd iteration
```

6.3.3 Parametrization

The code is generalized to handle: double and single
precision, real and complex arithmetic and transposition
(and conjugation) of A and B. It also allows for reading
the device memory with or without the use of texture
 caches. If texture caches are used, matrices A and B
can be accessed either as 1D textures or 2D textures.
Further on, only the use of 1D textures is discussed,
since this is the fastest performing scenario. All options
are controlled using the C preprocessor’s macro defini-
tions (#define) and the C language type definitions
typedef). Altogether the code can be compiled into 78
different variants. This does not result in a code
bloat, because for the most part, different options are
orthogonal. The entire stencil is roughly 500 lines long.
Such small size is also due to the fact that unrolling
is left entirely to the compiler. Only \#pragma unroll
directives are used. All loops are unrolled, except for
the outermost loop (line 5). Such aggressive unrolling is
a common practice on GPUs. The volume of resulting
code very rarely prevents the compiler from unrolling
it.

Different precisions (single/double) are handled by
macros with type definitions. Complex arithmetic is han-
dled by inline functions defined in the CUBLAS library
(cuCadd(), cuCmul(), cuCfma(), etc.), which are
 cast to additions and multiplications for real arith-
metic. So is conjugation of an input matrix if it is
conjugate-transposed. Different ways of accessing the
device memory (texture caches or no texture caches)
are implemented through conditional compilation of the
address translation blocks. Transposition of A and B is
handled when loading from device memory to registers
and shared memory (lines 2, 3; 6, 7 and 14, 15). The in-
nermost loops performing the actual computation (lines
8-12) are oblivious to the layout of the input matrices.

Finally, the size of work for a thread block is
parametrized (M_{\text{blk}}, N_{\text{blk}}, K_{\text{blk}}), as well as the shape
of the thread grid (blockDim.x, blockDim.y). It can also
be observed that the thread grid can be reshaped for
reading of A and B as long as each of the three shapes
perfectly overlay the corresponding matrix (Figure 6). Therfore the values M_{\text{dim}A}, N_{\text{dim}A}, M_{\text{dim}B}, N_{\text{dim}B}
are also the stencil’s parameters (subject to preprocessor
correctness checks). Also, for consistency, blockDim.x
and blockDim.y will be referred to, from now on, as
M_{\text{dim}} and N_{\text{dim}}.

![Reshaping the thread block for reading A and B.](image_url)

6.4 Search Space Generator

The search space generator is a brute-force machinery
that runs through all possible values of parameters
M_{\text{blk}}, N_{\text{blk}}, K_{\text{blk}}, M_{\text{dim}}, N_{\text{dim}}, M_{\text{dim}A}, N_{\text{dim}A}, M_{\text{dim}B},
N_{\text{dim}B} and rejects the combinations that produce in-
valid code and the combinations that do not meet
certain performance guidelines, e.g., minimum occu-
pancy requirement. To start with, the 9-dimensional
parameter space is enormous. Constraints come from
a few different sources. Here, the following categories
of constraints are identified: queryable hardware constraints, non-queryable hardware constraints, hard implementation constraints and soft implementation constraints.

6.4.1 Hardware and Implementation Constraints

Query hardware constraints are hardware constraints which can be queried at runtime using calls to the CUDA runtime library (specifically the cuDeviceGetAttribute() function). Non-queryable hardware constraints are hardware constraints which cannot be queried like that, but are tied to the GPU compute capability and defined in CUDA documentation (e.g. “NVIDIA CUDA C Programming Guide” [31], Appendix G). Hard implementation constraints are constraints that would make the implementation invalid if violated, and soft implementation constraints are constraints that would make the implementation perform poorly if violated, but not make it invalid.

The following device parameters are queried:

- \( WARP\_SIZE \)
- \( MAX\_THREADS\_PER\_BLOCK \)
- \( MAX\_REGISTERS\_PER\_BLOCK \)
- \( MAX\_SHARED\_MEMORY\_PER\_BLOCK \)

Then the compute capability is checked using the cuDeviceComputeCapability() function and the following parameters are set using a table lookup:

- \( MAX\_WARP\_SM \)
- \( MAX\_BLOCKS\_PER\_SM \)

Here, a few simplifying assumptions are made, that seem to hold for all compute capabilities so far. It is assumed that the maximum number of threads per multiprocessor is defined by the maximum number of warps:

\[
MAX\_THREADS\_PER\_SM = \frac{MAX\_WARPS\_PER\_SM \times WARP\_SIZE}{\text{Warp Size}}
\]

the number of 32-bit registers per multiprocessor equals the maximum number of registers per block, i.e.

\[
MAX\_REGS\_PER\_SM = \frac{MAX\_REGISTERS\_PER\_BLOCK}{\text{Block Size}}
\]

and the amount of shared memory per multiprocessor equals the maximum amount of shared memory per block, i.e.

\[
MAX\_SHMEM\_PER\_SM = \frac{MAX\_SHARED\_MEMORY\_PER\_BLOCK}{\text{Block Size}}
\]

6.4.2 Performance Guidelines

Performance guidelines are provided to the generator as input and allow for adjusting the amount of generated combinations. Three such guidelines are used here: minimum occupancy, minimum number of blocks per multiprocessor and minimum register reuse. Minimum occupancy defines the minimum number of threads, per multiprocessor, that the kernel is required to launch. The SIMT computation model of the GPU relies on a massive number of simultaneously active threads to deliver performance, so it is reasonable to specify that a kernel should allow for, e.g., the minimum of 512 threads (out of 1536) to be active at the same time in a multiprocessor. This constraint eliminates kernels that consume resources, such as register and shared memory, too aggressively. Similarly, the minimum number of blocks per multiprocessor requirement eliminates kernels that consume resources too heavily to allow for at least the given number of blocks to reside in one multiprocessor. Finally, the register reuse requirement forces a given number of floating-point operations to be performed per single memory operation. This constraint eliminates kernels that move data too much and do not compute enough. To be precise, the value is a floating-point ratio of FMAs to loads in the innermost loop of the kernel (Algorithm 4, lines 8-12).

6.4.3 Generation and Pruning

Algorithm 5 shows the nested loops of the search space generator. The two outermost loops iterate over the sizes of the thread grid. The three innermost loops iterate over the sizes of the block’s working space. The steps for \( M_{blk} \) and \( N_{blk} \) are \( M_{dim} \) and \( N_{dim} \), respectively (the thread grid has to overlay a tile of \( C \)). \( K_{blk} \) does not have to be constrained in any way. (The step is 1.)

\[
\begin{align*}
\text{for } N_{dim} &= 1 \text{ to } N_{dim\_MAX} \text{ step 1 do} \\
\text{for } M_{dim} &= 1 \text{ to } M_{dim\_MAX} \text{ step 1 do} \\
&\quad \forall K_{blk} = 1 \text{ to } K_{blk\_MAX} \text{ step 1 do} \\
&\quad \forall N_{blk} = 1 \text{ to } N_{blk\_MAX} \text{ step 1 do} \\
&\quad \forall M_{blk} = 1 \text{ to } M_{blk\_MAX} \text{ step 1 do} \\
&\quad \text{if parameters meet constraints then} \\
&\quad \quad \text{generate all variants} \\
&\quad \quad \quad (M_{dimA}, N_{dimA}, M_{dimB}, N_{dimB}) \\
&\quad \quad \quad \text{such that:} \\
&\quad \quad \quad M_{dimA} \times N_{dimA} = M_{dim} \times N_{dim} \\
&\quad \quad \quad \frac{M_{dimA}}{M_{dim}} = \frac{N_{dimA}}{N_{dim}} \\
&\quad \quad \quad \frac{M_{dimB}}{M_{dim}} = \frac{N_{dimB}}{N_{dim}} \\
&\quad \quad \quad \text{end if} \\
&\quad \text{end for} \\
&\quad \text{end for} \\
&\quad \text{end for} \\
&\quad \text{end for} \\
&\quad \text{end for} \\
&\quad \text{end for}
\end{align*}
\]

In principle, the loops upper boundaries could be set to some device parameters, e.g., the upper boundaries for the two outermost loops could be set to \( MAX\_BLOCK\_DIM\_X \) and
MAX_BLOCK_DIM_Y. Here the boundaries are set to 256 for \( M_{\text{dimMAX}} \), \( N_{\text{dimMAX}} \), \( M_{\text{blkMAX}} \) and \( N_{\text{blkMAX}} \), and to 64 for \( K_{\text{blkMAX}} \). The choice was made experimentally, such that no combinations are missed because of the loop boundaries being too low. I.e., increasing the boundaries does not produce any more valid combinations. (All such combinations are eliminated by the constraints discussed further.) At the same time, the running time of the generator is kept short (on the order of seconds).

Algorithm 6 shows the set of constraints enforced inside the nested loops of Algorithm 5. It is divided into four sections. The first section enforces a mixed set of hardware and implementation (hard and soft) constraints. The second section enforces the minimum occupancy performance guideline, based on the amount of available shared memory. The third section enforces the minimum occupancy performance guideline, based on the number of available registers. Finally, the fourth section enforces the minimum register reuse performance guideline. Next, each block is discussed in detail.

The first block applies a set of straightforward checks. Line one verifies the thread grid does not exceed the maximum number of threads. Line two checks if the thread grid is divisible into warps. Line three checks if the thread grid can be used (regardless of its shape) to read a stripe of \( A \), without any threads being idle. Similarly, line four checks if the thread grid can be used (regardless of its shape) to read a stripe of \( B \), without any threads being idle.

The second block enforces the minimum occupancy based on shared memory consumption. First, the amount of shared memory required by the kernel is calculated. This equals the amount of shared memory required to store a stripe of \( A \) and a stripe of \( B \) (Algorithm 4, lines 2, 3 and 14, 15). Then, the number of possible thread blocks per multiprocessor is calculated and filtered though the hardware maximum. Next, the number of warps is calculated and also filtered through the hardware maximum. Finally, the number of possible blocks per multiprocessor and the number of possible threads per multiprocessor are recalculated and checked against the corresponding performance guidelines. Admittedly, some checks are redundant.

The third block performs similar checks with respect to the register consumption. First, the number of registers required by the kernel is calculated. This equals the number of registers to “prefetch” odd iteration \( A \) and \( B \) (Algorithm 4, lines 6, 7), stream in even iteration \( A \) and \( B \) (lines 9, 10) and accumulate the results in \( C \) (line 11). What follows closely resembles the preceding block. The number of possible thread blocks per multiprocessor is calculated and filtered through the hardware maximum. Next, the number of warps is calculated and also filtered through the hardware maximum. Finally, the number of possible blocks per multiprocessor and the number of possible threads per multiprocessor are recalculated and checked against the corresponding performance guidelines. It has to be pointed out that the approach is heuristic. When compiled, the code will use more registers. (Registers will be used for local variables, loop counters, etc.)

It has to be pointed out that formulas for resource usage are only approximations. Specifically, the formula for register usage specifies only the minimum number of registers necessary for the double-buffered loop to function efficiently. It is almost guaranteed to be the lower bound on the actual register usage of the compiled kernel. This is in perfect alignment with the autotuning and pruning philosophy. The idea is to only eliminate kernels, which are certain to perform poorly. Some kernels with the actual register usage beyond what was anticipated will make it to the benchmarking phase, but will perform poorly, and not be picked as optimum. The point is that the elimination process is permissive rather than restrictive.

The last block simply calculates the ratio of loads to FMAs in the innermost loops (Algorithm 4, lines 9-11) and checks it against the performance guideline. The conditional takes into account the different ratio of memory operations to computation for real arithmetic and complex arithmetic. (Complex arithmetic is twice as compute intensive as real arithmetic.)

### 7 RESULTS AND DISCUSSION

#### 7.1 Generation Results

Table 1 shows the performance guidelines applied. The values were chosen experimentally to produce the number of combinations for each of the 16 versions of the GEMM to be on the order of hundreds. Notably, the minimum occupancy of 512 threads per multiprocessor (0.33) was always used and the minimum number of blocks per multiprocessor of two. The only parameter that varied was the register reuse, ranging from two to five. (Integer values were used, although in principle, the ratio is a floating-point number.)

<table>
<thead>
<tr>
<th>kernel type</th>
<th>min. occupancy</th>
<th>min. reg. reuse</th>
<th>min. no. blocks</th>
</tr>
</thead>
<tbody>
<tr>
<td>SGEMM</td>
<td>512</td>
<td>3.0</td>
<td>2</td>
</tr>
<tr>
<td>CGEMM</td>
<td>512</td>
<td>5.0</td>
<td>2</td>
</tr>
<tr>
<td>DGEMM</td>
<td>512</td>
<td>2.0</td>
<td>2</td>
</tr>
<tr>
<td>ZGEMM</td>
<td>512</td>
<td>2.0</td>
<td>2</td>
</tr>
</tbody>
</table>

*minimum number of threads

*minimum ratio of load instructions to fused multiply-add instructions in the innermost loop

*minimum number of thread blocks per multiprocessor

Figure 7 shows the number of combinations produced for each of the 16 versions of the GEMM when the guidelines from Table 1 are applied. The number varies from slightly below one hundred to slightly above four hundred. It should be noted that the choice of performance guidelines and the number of combinations...
produced is an arbitrary decision, which trades off the range of the search sweep with the time required to perform the sweep.

The pruning of the search space is a powerful and necessary mechanism here. For instance, with the performance guidelines taken away (and only hardware and implementation constraints applied) the generator will produce slightly more than one million combinations of SGEMM, for ZGEMM. Three runs were made for each case and the maximum performance taken. This dimensions of 10,000 for SGEMM, 8,000 for CGEMM and DGEMM and 6,000 for ZGEMM. Three runs were made for each case and the maximum performance taken. This was more of a precaution than an actual need, since performance fluctuation was virtually inexistent. With roughly three thousand cases to run, the process takes one day on a single GPU.

The timing runs confirm that the generator with the pruning mechanism could not be a selection tool on its own. As was already mentioned, using strict performance guidelines does not result in converging on the fastest case. Although, under the constraints used, all

### 7.2 Selection Results

With all combinations generated, the next steps are runs, performance measurements and selection of the fastest kernels. A few words about the hardware/software setup are in place here. The process of autotuning was conducted and the final performance results were produced on an NVIDIA Tesla S2050 system, with the Fermi GPU containing 14 multiprocessors and clocked at 1.147 GHz. CUDA SDK 4.0 release candidate 11 was used, the newest version at the time of the experiments. Square matrices $A$, $B$ and $C$ were used in all cases and problem sizes were chosen such that all data would occupy 1 GB of the GPU memory. This results in the dimensions of\(10,000\) for SGEMM, \(8,000\) for CGEMM and DGEMM and \(6,000\) for ZGEMM. Three runs were made for each case and the maximum performance taken. This was more of a precaution than an actual need, since performance fluctuation was virtually inexistent. With roughly three thousand cases to run, the process takes one day on a single GPU.

The timing runs confirm that the generator with the pruning mechanism could not be a selection tool on its own. As was already mentioned, using strict performance guidelines does not result in converging on the fastest case. Although, under the constraints used, all

---

**Algorithm 6** Search space generator constraints.

```plaintext
Require: \(M_{dim} \times N_{dim} \leq MAX\_THREADS\_PER\_BLOCK\)
Require: \((M_{dim} \times N_{dim})/WARP\_SIZE == 0\)
Require: \((M_{blk} \times K_{blk})/(M_{dim} \times N_{dim}) == 0\)
Require: \((K_{blk} \times N_{blk})/(M_{dim} \times N_{dim}) == 0\)

\[\text{shmem\_per\_block} = ((M_{blk} + 1) \times K + (K + 1) \times N) \times \text{sizeof(type)}\]
\[\text{blocks\_per\_sm} = \min(MAX\_SHMEM\_PER\_SM/shmem\_per\_block, MAX\_BLOCKS\_PER\_SM)\]
\[\text{warp\_per\_block} = (M_{dim} \times N_{dim})/WARP\_SIZE\]
\[\text{warp\_per\_sm} = \min(\text{blocks\_per\_sm} \times warps\_per\_block, MAX\_WARPS\_PER\_SM)\]
\[\text{blocks\_per\_sm} = \text{warps\_per\_sm}/warps\_per\_block\]

**Require:** \(\text{blocks\_per\_sm} \geq MIN\_BLOCKS\_PER\_SM\)

**Require:** \(\text{threads\_per\_sm} = M_{dim} \times N_{dim} \times \text{blocks\_per\_sm}\)

**Require:** \(\text{threads\_per\_sm} \geq MIN\_THREADS\_PER\_SM\)

\[\text{regs\_per\_thread} = (M_{thr} \times N_{thr}) + (M_{thr} + N_{thr})\]
\[\text{regs\_per\_block} = \text{regs\_per\_thread} \times (M_{dim} \times N_{dim})\]
\[\text{blocks\_per\_sm} = \min(MAX\_REGS\_PER\_SM/\text{regs\_per\_block}, MAX\_BLOCKS\_PER\_SM)\]
\[\text{warp\_per\_block} = (M_{dim} \times N_{dim})/WARP\_SIZE\]
\[\text{warp\_per\_sm} = \min(\text{blocks\_per\_sm} \times warps\_per\_block, MAX\_WARPS\_PER\_SM)\]
\[\text{blocks\_per\_sm} = \text{warps\_per\_sm}/warps\_per\_block\]

**Require:** \(\text{blocks\_per\_sm} \geq MIN\_BLOCKS\_PER\_SM\)

**Require:** \(\text{threads\_per\_sm} = M_{dim} \times N_{dim} \times \text{blocks\_per\_sm}\)

**Require:** \(\text{threads\_per\_sm} \geq MIN\_THREADS\_PER\_SM\)

if real arithmetic then
\[\text{regs\_re相} = (M_{thr} \times N_{thr})/(M_{thr} + N_{thr})\]
else (complex arithmetic)
\[\text{regs\_re相} = (4 \times M_{thr} \times N_{thr})/(2 \times (M_{thr} + N_{thr}))\]
end if

**Require:** \(\text{regs\_re相} \geq MIN\_REGS\_REUSE\)
tested kernels are good candidates for fast kernels, their performance can vary wildly. Here, ZGEMM showed the smallest performance variation. The slowest of all ZGEMM kernels ran at 180 Gflop/s, which is slightly more than half of the speed of the fastest, running at 340 Gflop/s. At the same time, the slowest SGEMM kernel ran at 64 Gflop/s which is less than 10% of the speed of the fastest one, running at 662 Gflop/s.

Table 2 shows the final selection of the fastest kernels. It needs to be pointed out that for each case there was a large number of kernels with performance very close to the fastest one (sometimes a couple of kernels within one percent). Here, simply the fastest one in each case is reported. The table shows a comparison against CUBLAS and MAGMA (whichever was faster for each case). Small improvements can be seen in almost all cases. Significant improvement can be observed for ZGEMM. While for CUBLAS and MAGMA ZGEMM runs only as fast as DGEMM, the new ZGEMM runs substantially faster. The autotuning process revealed ZGEMM kernels that successfully take advantage of its higher computational intensity versus DGEMM. One distinct feature of the ZGEMM kernels is that, unlike for all other cases, the tiles of the C matrix are not square. This could be one reason that someone coding the kernels by hand would not explore the case.

Figure 8 shows the autotuning sweep for the $A \times B$ ZGEMM (NoTrans, NoTrans). The flat line at 306 Gflop/s shows the performance of the equivalent CUBLAS kernel. The autotuning run identifies 25 variants with higher performance. Some of them tile the C matrix in square tiles ($16 \times 16, 24 \times 24, 32 \times 32$). Most of them, however, tile the C matrix in rectangular tiles with dimensions taking values of: $8, 16, 24, 32, 48$ and $64$. The two spikes on the left side of the chart identify the fastest variants, one corresponding to tiling of $24 \times 16$, the other corresponding to tiling of $16 \times 24$. The former is faster by a fraction of a Gflop/s.

Fig. 7. Number of variants generated for each kernel type under the constraints listed in Table 1.

One reason for concern could be the fact that the timing part of the process was performed for specific (large) sizes. One could speculate that the kernels are tuned specifically for these sizes and perform suboptimally for other sizes. Just to make sure that this is not the case, the performance for the chosen kernels was measured across all matrix sizes. Figure 9 shows the results. Square matrices are used with sizes corresponding to the tiling factor, 96 for SGEMM, 64 for CGEMM and DGEMM and 48 for ZGEMM. Here it was considered pointless to time cases with partially filled border tiles. Generally, the impact on performance is negative, but negligible. (Very efficient method of dealing with such scenarios was designed by Nath et al. [29].) Figure 9 shows clearly that the kernels perform consistently across all problem sizes, rising quickly to asymptotic performance, with the usual jitter at the beginning (more prominent for the more bandwidth-limited cases of single precision SGEMM and CGEMM).

8 CONCLUSIONS

It is the author’s belief that this work provides strong evidence that autotuning is a crucial component in GPU code development. The essential component in this process is the capability of generating and effectively pruning the search space. For matrix multiplication, pruning turned out to be straightforward with the use of hardware and implementation constraints and constraints referred to as performance guidelines, such as
minimum required occupancy. The choice of constraints allows for trading the thoroughness of the search with its duration.

The process can be generalized to other types of workloads, including more complex kernels and more bandwidth-bound kernels. In principle, this should be the case as long as the code can be parameterized and its properties, such as demand for registers and shared memory, are expressed as functions of the parameters.

It came as a surprise that, this late into the process of BLAS development for the Fermi architecture, an autotuning process managed to prove superior to hand-tuned codes. Although significant, the performance improvements were not dramatic. Hopefully, the system will show its power with the appearance of new architectures and also as a platform for the development of more complex kernels.

9 Future Plans

There are areas where the usefulness of the system is immediately applicable. The generation process revealed a huge number of kernels with much smaller tiling factors, performing nearly as good as kernels with larger tiling factors. The kernels with smaller tiles can readily replace the other kernels for smaller matrix sizes, where the use of large tiles limits the amount of parallelism, preventing the device from achieving good performance. For instance, for the DGEMM $A \times B$ operation, the fastest kernel uses tiles of $64 \times 64$ and asymptotically achieves the performance of 300 Gflop/s. The autotuning process revealed a kernel that uses tiles of $32 \times 32$ and asymptotically achieves the performance of 286 Gflop/s. For small matrix sizes the use of the latter kernel will quadruple parallelism at the loss of 5% of asymptotic kernel performance. Depending on the problem size, this can result in a huge performance gain.

Another opportunity presents itself where one dimension of the operation is significantly smaller than the other. MAGMA is a great example here. For instance, in the right-looking LU factorization, the GEMM is called with the dimension $K = 64$, which is much smaller than $M$ and $N$. This causes the default GEMM kernel to only achieve 253 Gflop/s instead of the asymptotic 300 Gflop/s. When tuned for this shape, a kernel was found by the autotuner that delivers 268 Gflop/s. If $K$ is further reduced to 32, the default kernel’s performance drops to 204 Gflop/s, while the autotuner is capable of finding a kernel that delivers 242 Gflop/s for that case.

Acknowledgments

This work was supported by DOE grant #DE-SC0003852, “Architecture-aware Algorithms for Scalable Perfor-
mance and Resilience on Heterogeneous Architectures”, DOE grant #DE-SC0004983, “Matrix Algebra for GPU and Multicore Architectures (MAGMA) for Large Peta-scale Systems”, Georgia Institute of Technology subcontract RA241-G1 funded by NSF grant OCI-0910735, “Keeneland: National Institute for Experimental Computing”.

The authors would like to thank David Luebke, Steven Parker and Massimiliano Fatica for their insightful comments about the Fermi architecture.

SOFTWARE

Ultimately the software will be distributed as part of the MAGMA project (http://icl.cs.utk.edu/magma/). Initial prototype snapshots will be posted on the authors’ websites (http://icl.cs.utk.edu/people/). All code will be released under the modified BSD license.

REFERENCES


Jakub Kurzak received the MSc degree in Electrical and Computer Engineering from Wrocław University, Poland, and the PhD degree in Computer Science from the University of Houston. He is a Research Director in the Innovative Computing Laboratory in the Department of Electrical Engineering and Computer Science at the University of Tennessee, Knoxville. His research interests include parallel algorithms, specifically in the area of numerical linear algebra, and also parallel programming models and performance optimization for parallel architectures multicore processors and GPU accelerators.

Stanimire Tomov received a M.S. degree in Computer Science from Sofia University, Bulgaria, and Ph.D. in Mathematics from Texas A&M University. He is a Research Director in ICL and Adjunct Assistant Professor in the EECG at UTK. Tomov’s research interests are in parallel algorithms, numerical analysis, and high-performance scientific computing (HPC). Currently, his work is concentrated on the development of numerical linear algebra software for emerging architectures for HPC.

Jack Dongarra holds an appointment at the University of Tennessee, Oak Ridge National Laboratory, and the University of Manchester. He specializes in numerical algorithms in linear algebra, parallel computing, use of advanced-computer architectures, programming methodology, and tools for parallel computers. He was awarded the IEEE Sid Fernbach Award in 2004 for his contributions in the application of high performance computers using innovative approaches; in 2008 he was the recipient of the first IEEE Medal of Excellence in Scalable Computing; in 2010 he was the first recipient of the SIAM Special Interest Group on Supercomputing’s award for Career Achievement; and in 2011 he was the recipient of the IEEE IPDPS 2011 Charles Babbage Award. He is a Fellow of the AAAS, ACM, IEEE, and SIAM and a member of the National Academy of Engineering.