# Kokkos Notes (Part 1)

I recently needed to learn Kokkos, a C++ parallel computing framework, for work. Below are some notes I took while going through the Kokkos Tutorials.

These notes contain some code snippets copied directly from the slides (untested), and they omit a lot of background I am already familiar with, such as OpenMP, CUDA, and tensor storage order. When in doubt, refer to the official tutorials.

## 1. Concepts for Data Parallelism

Consider a simple dot product loop:

```cpp
int res = 0;
for (int i = 0; i < n; i++)
    res += a[i] * b[i];
```

- **Pattern**: the structure of the computation; here `for` is the pattern. Commonly used patterns: for, reduction, scan, task-graph.
- **Execution Policy**: how computations are executed (range, load balancing); here `i = 0; i < n; i++` is the execution policy.
- **Computational Body**: the code which performs each **unit of work**; here `res += a[i] * b[i]` is the computational body.

Kokkos maps **work** to execution resources:

- An **iteration range** identifies a total amount of work
- An **iteration index** identifies a particular unit of work
- Each iteration of a computational body is a **unit of work**

Computational bodies are given to Kokkos as **functors** or **lambdas** (compiler-generated functors). Functor example:

```cpp
struct my_functor_name {
    // Data members that can be seen by this functor

    // Functor constructor
    my_functor_name(/* params */) {}

    KOKKOS_INLINE_FUNCTION
    void operator()(/* params */) const {
        // Computations to be performed
    }
};
```

A **lambda** can see all the variables in the current scope; it is just a standard C++11 lambda. Here are two examples (they can only run on the CPU, because the bodies dereference raw host pointers):

```cpp
int n = ...;
double *a = (double*) malloc(sizeof(double) * n);
double *b = (double*) malloc(sizeof(double) * n);

// AXPY vector scaling and adding
double alpha = 8.9;
Kokkos::parallel_for(    // Pattern: for loop
    "axpy",              // Label for profiling
    n,                   // Iteration range: [0, n-1]
    // Computational body, a lambda function;
    // i is the iteration index
    KOKKOS_LAMBDA(const int &i) { b[i] += alpha * a[i]; }
);

// Dot product
double dot_res = 0.0;
Kokkos::parallel_reduce( // Pattern: reduction
    "dot_prod",          // Label for profiling
    n,                   // Iteration range: [0, n-1]
    // Computational body; i is the iteration index,
    // lsum is the thread-local reduction variable.
    // KOKKOS_LAMBDA captures by value, not by reference
    KOKKOS_LAMBDA(const int &i, double &lsum) { lsum += a[i] * b[i]; },
    dot_res              // Final reduction result
);
```

Depending on your CUDA version, `KOKKOS_LAMBDA` is defined as `[=] __device__` or `[=] __host__ __device__`. Without CUDA it is simply `[=]`.

## 2. Views

**View** is a lightweight C++ class holding a pointer to an array plus metadata specifying *where* and *how* a **multidimensional array** is stored.

Views behave like pointers: copy them into functors by value. Copy construction and assignment are *shallow*, and reference counting is used for automatic deallocation.

The number of dimensions (rank) is fixed at compile time. Dimension sizes can be set at compile time or at runtime; runtime-sized dimensions must come first. Example:

```cpp
int N0 = ...;           // Runtime dimension size
constexpr int N1 = ...; // Compile-time dimension size
Kokkos::View<double **>    mat_a("mat_a_label", N0, N1);
Kokkos::View<double *[N1]> mat_b("mat_b_label", N0);

// Get the sizes of dimensions
assert(mat_a.extent(0) == N0);
assert(mat_b.extent(1) == N1);

// Get the raw data pointer and the label
assert(mat_a.data() != NULL);
assert(mat_a.label() == "mat_a_label");
```

Resizing:

```cpp
// Allocate a view with 100x50x4 elements
Kokkos::View<int**[4]> a("a", 100, 50);
// Resize a to 200x50x4 elements; the original allocation is freed
Kokkos::resize(a, 200, 50);
// Create a second view b viewing the same data as a
Kokkos::View<int**[4]> b = a;
// Resize a again to 300x60x4 elements; b still views the 200x50x4 data
Kokkos::resize(a, 300, 60);
```

Access elements via the `(idx0, idx1, …)` operator, e.g. `mat_a(6, 4)`.

Data layout:

- `LayoutLeft`: left indices have smaller strides ("column-major"); the default on GPUs
- `LayoutRight`: right indices have smaller strides ("row-major"); the default on CPUs
- Other data layouts are also available

The stride of a dimension is the distance in memory (measured in elements of the data type) between two entries whose indices differ by 1 in that dimension only. Because Kokkos may pad allocations for alignment, a stride can be larger than the product of the extents of the faster dimensions. Strides can be queried:

```cpp
// Query the stride of each dimension of the rank-3 view a
size_t a_strides[3];
a.stride(a_strides);
```

## 3. Spaces

**Spaces** control where parallel bodies are executed (execution space) and where view data resides (memory space). Examples:

```cpp
// Allocate 2 views in CUDA device memory
int n = ...;
Kokkos::View<float*, Kokkos::CudaSpace> vec_a("vec_a", n);
Kokkos::View<float*, Kokkos::CudaSpace> vec_b("vec_b", n);

float alpha = 5.4;
// Run on CUDA: vec_b(i) += alpha * vec_a(i)
Kokkos::parallel_for(
    "saxpy",
    // Execution policy: the Cuda execution space, range [0, n-1]
    Kokkos::RangePolicy<Kokkos::Cuda>(0, n),
    KOKKOS_LAMBDA(const int &i) { vec_b(i) += alpha * vec_a(i); }
);
```

Available spaces include `HostSpace`, `CudaSpace`, `CudaUVMSpace`, `HBWSpace`, `ROCmSpace`, and others.

**Deep copy** copies the data from one view to another; the two views must have the same memory layout and strides. To copy between a host view and a device view, use a `HostMirror`. Example:

```cpp
// This gives a compile error: the default memory layouts differ
// (LayoutLeft on CudaSpace, LayoutRight on HostSpace)
Kokkos::View<int **, Kokkos::CudaSpace> a("a", 10, 20);
Kokkos::View<int **, Kokkos::HostSpace> b("b", 10, 20);
Kokkos::deep_copy(a, b);  // a is the destination, b is the source

// This might give a runtime error: strides might differ (e.g. padding)
Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::CudaSpace> c("c", 10, 20);
Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::HostSpace> d("d", 10, 20);
Kokkos::deep_copy(c, d);

// This is the safe way: use a HostMirror
Kokkos::View<int **, Kokkos::CudaSpace> e("e", 10, 20);
Kokkos::View<int **, Kokkos::CudaSpace>::HostMirror eh = Kokkos::create_mirror(e);
// The eh view always resides on the host.
// Initialize matrix eh on the host, then:
Kokkos::deep_copy(e, eh);  // Copy from eh to e
// Computation using matrix e on the GPU, then:
Kokkos::deep_copy(eh, e);  // Copy from e back to eh
```

## 4. Reduction

Many reducers are available: `Sum`, `Prod`, `Min`, `Max`, and others. Multiple reducers over multiple data types can be used simultaneously (since version 3.2). Example:

```cpp
int neg_cnt = 0;
float sum = 0.0, max = a(0);
Kokkos::parallel_reduce(
    "sum_and_neg_count", n,
    KOKKOS_LAMBDA(const int &i, float &sum_, float &max_, int &neg_cnt_) {
        float a_i = a(i);
        sum_ += a_i;
        if (a_i > max_) max_ = a_i;
        if (a_i < 0) neg_cnt_++;
    },
    Kokkos::Sum<float>(sum),
    Kokkos::Max<float>(max),
    Kokkos::Sum<int>(neg_cnt)
);
```

Reductions with an array of results:

```cpp
struct col_sum {
    // The reduction result is an array of float; note the C++ notation
    // for an array typedef. "float value_type[]" is an array of unknown
    // bound: as a function parameter it decays to "float *", but it is
    // not the same type as "float *"
    typedef float value_type[];
    typedef Kokkos::View<float**>::size_type size_type;

    // For array reductions, Kokkos requires a public member named
    // value_count holding the length of the result array
    size_type value_count;
    Kokkos::View<float**> mat;

    col_sum(const Kokkos::View<float**> &mat_)
        : value_count(mat_.extent(1)), mat(mat_) {}

    // value_type here is already an array (reference-like) type,
    // so we don't pass it in by reference
    KOKKOS_INLINE_FUNCTION
    void operator()(const size_type i, value_type sum) const {
        for (size_type j = 0; j < value_count; j++)
            sum[j] += mat(i, j);
    }

    // "Join" intermediate results from different threads
    KOKKOS_INLINE_FUNCTION
    void join(volatile value_type dst, const volatile value_type src) const {
        for (size_type j = 0; j < value_count; j++)
            dst[j] += src[j];
    }

    // Tell each thread how to initialize its reduction result
    KOKKOS_INLINE_FUNCTION
    void init(value_type sum) const {
        for (size_type j = 0; j < value_count; j++)
            sum[j] = 0.0;
    }
};

const int nrow = 1000, ncol = 100;
Kokkos::View<float**> mat("mat", nrow, ncol);
// Fill mat ...
col_sum cs(mat);
float *col_sum_res = (float*) malloc(sizeof(float) * ncol);
Kokkos::parallel_reduce(mat.extent(0), cs, col_sum_res);
```