# Kokkos Notes (Part 1)

## 1. Concepts for Data Parallelism

• Pattern: the structure of the computation. In the serial dot-product loop `for (i = 0; i < n; i++) res += a[i] * b[i];`, the `for` is the pattern. Commonly used patterns are for, reduction, scan, and task-graph.
• Execution policy: how the computation is executed (iteration range, load balancing). In the same loop, `i = 0; i < n; i++` is the execution policy.
• Computational body: the code that performs each unit of work. In the same loop, `res += a[i] * b[i]` is the computational body.
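The three concepts can be pointed out on a plain serial dot product. This is ordinary C++ rather than Kokkos code, and the function name `dot_serial` is chosen here only for illustration.

```cpp
#include <cstddef>
#include <vector>

// Serial dot product, annotated with the three data-parallel concepts.
// Assumes b has at least as many entries as a.
double dot_serial(const std::vector<double>& a, const std::vector<double>& b) {
  double res = 0.0;
  const std::size_t n = a.size();
  // Pattern: `for`.
  // Execution policy: i = 0; i < n; i++ (the range and the order of traversal).
  for (std::size_t i = 0; i < n; i++) {
    // Computational body: one unit of work per index i.
    res += a[i] * b[i];
  }
  return res;
}
```

Calling `dot_serial({1, 2, 3}, {4, 5, 6})` walks three units of work and accumulates 4 + 10 + 18.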

Kokkos maps work to execution resources:

• An iteration range identifies a total amount of work
• An iteration index identifies a particular unit of work
• Each iteration of a computational body is a unit of work

Computational bodies are given to Kokkos as functors or as lambdas (compiler-generated functors). Functor example:

A lambda can capture the variables visible in the enclosing scope, exactly as an ordinary C++11 lambda does. Here are two examples (they can only run on the CPU):

`KOKKOS_LAMBDA` expands to `[=] __device__` or `[=] __host__ __device__`, depending on the CUDA version. Without CUDA, it is simply `[=]`.

## 2. Views

A View is a lightweight C++ class that holds a pointer to the array data plus metadata describing where and how a multidimensional array is stored.

Views behave like pointers, so copy them into the functor by value. Copy construction and assignment are shallow: the copy refers to the same data. Reference counting deallocates the data automatically once the last View referring to it is destroyed.

The number of dimensions (rank) is fixed at compile time. The size of each dimension can be set at compile time or at run time; run-time dimensions must come first. Example:

Resizing:

Access elements with the `(idx1, idx2, …)` operator, for example `mat_a(6, 4)`.

Data layout:

• LayoutLeft: the leftmost index has the smallest stride ("column-major"); default on GPUs
• LayoutRight: the rightmost index has the smallest stride ("row-major"); default on CPUs
• Other layouts, such as LayoutStride

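The stride of a dimension is the distance in memory, counted in elements of the View's data type, between two entries whose indices differ by 1 in that dimension only. Because Kokkos may pad dimensions for alignment, a stride can be larger than the product of the extents of the faster-varying dimensions; for example, in a LayoutLeft View the stride of dimension 1 is at least the extent of dimension 0.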
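The relation between layout, strides, and memory offset can be worked out in plain C++. This sketch assumes no padding, so each stride is exactly the product of the faster-varying extents; the function names are made up here and are not Kokkos API.

```cpp
#include <cstddef>
#include <vector>

// LayoutLeft-style strides: dimension 0 varies fastest.
std::vector<std::size_t> strides_left(const std::vector<std::size_t>& extents) {
  std::vector<std::size_t> s(extents.size());
  std::size_t acc = 1;
  for (std::size_t d = 0; d < extents.size(); ++d) {
    s[d] = acc;
    acc *= extents[d];
  }
  return s;
}

// LayoutRight-style strides: the last dimension varies fastest.
std::vector<std::size_t> strides_right(const std::vector<std::size_t>& extents) {
  std::vector<std::size_t> s(extents.size());
  std::size_t acc = 1;
  for (std::size_t d = extents.size(); d-- > 0;) {
    s[d] = acc;
    acc *= extents[d];
  }
  return s;
}

// Offset of an entry, in elements, from the start of the allocation.
std::size_t offset(const std::vector<std::size_t>& idx,
                   const std::vector<std::size_t>& strides) {
  std::size_t off = 0;
  for (std::size_t d = 0; d < idx.size(); ++d) off += idx[d] * strides[d];
  return off;
}
```

For extents 2 x 3 x 4, the left layout gives strides (1, 2, 6) and the right layout gives (12, 4, 1), so entry (1, 0, 2) sits at offset 13 in the first case and 14 in the second.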

## 3. Spaces

Spaces control where parallel bodies are executed (execution space) and where view data resides (memory space). Examples:

Available memory spaces include HostSpace, CudaSpace, CudaUVMSpace, HBWSpace, ROCmSpace, and others.

Deep copy: `Kokkos::deep_copy` copies the data from one View to another. The two Views must have the same memory layout and strides. Use a HostMirror to copy between a host View and a device View. Example:

## 4. Reduction

Many reducers are available: Sum, Prod, Min, Max, and others. A single `parallel_reduce` can use several reducers, with different data types, at the same time (version 3.2 and later). Example:

Reductions with an array of results: