Pattern: structure of the computation, here for is the pattern. Commonly used patterns: for, reduction, scan, task-graph
Execution Policy: how computations are executed (range, load-balancing), here i = 0; i < n; i++ is the execution policy
Computational Body: code which performs each unit of work, here res += a[i] * b[i] is the computational body
Kokkos maps work to execution resources:
An iteration range identifies a total amount of work
An iteration index identifies a particular unit of work
Each iteration of a computational body is a unit of work
Computational bodies are given to Kokkos as functors or lambdas (compiler generated functors). Functor example:
1
2
3
4
5
6
7
8
9
10
11
12
13
structmy_functor_name
{
// Data members that can be seen by this function
// Functor constructor
my_functor_name(<params>) {}
KOKKOS_INLINE_FUNCTION
voidoperator()(<params>)const
{
// Computations to be performed
}
}
A lambda can see all the variables in the current scope. It’s the same as C++11 lambda. Here are two examples (can only run on CPU):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
int n = ...;
double *a = (double*) malloc(sizeof(double) * n);
double *b = (double*) malloc(sizeof(double) * n);
// AXPY vector scaling and adding
double alpha = 8.9;
Kokkos::parallel_for( // Pattern: for loop
"axpy", // Label for profiling
n, // Iteration range: [0, n-1]
// Computational body, a lambda function
// i is the iteration index
KOKKOS_LAMBDA(constint &i)
{
b[i] += alpha * a[i];
}
);
// Dot product
double dot_res = 0.0;
Kokkos::parallel_reduce( // Pattern: reduction
"dot_prod", // Label for profiling
n, // Iteration range: [0, n-1]
// Computational body, a lambda function
// i is the iteration index, lsum is the reference to the output
// KOKKOS_LAMBDA captures values instead of reference
KOKKOS_LAMBDA(constint &i, double& lsum)
{
lsum += a[i] * b[i];
},
dot_res // Returning value of the computational body
);
KOKKOS_LAMBDA will be defined to [=] __device__ or [=] __host__ __device__, depending on your CUDA version. Without CUDA it is simply [=].
2. Views
View is a lightweight C++ class with a pointer to array and some metadata specifying where and how a multidimensional array is stored.
Views are like pointers, copy them in the functor. Copy construction and assignment are shallow. Reference counting is used for automatic deallocation.
Number of dimensions (rank) is fixed at compile time. Sizes of dimensions can be set at compile-time or runtime, runtime-sized dimensions must come first. Example:
assert(mat_a.extent(0) == N0); // or assert(mat_a.exten_0() == N0);
assert(mat_b.extent(1) == N1); // or assert(mat_b.exten_1() == N1);
// Get the raw data pointer and label
assert(a.data() != NULL);
assert(b.label() == "A");
Resizing:
1
2
3
4
5
6
7
8
// Allocate a view with 100x50x4 elements
Kokkos::View<int**[4]> a("a", 100, 50);
// Resize a to 200x50x4 elements; the original allocation is freed
Kokkos::resize(a, 200, 50);
// Create a second view b viewing the same data as a
Kokkos::View<int**[4]> b = a;
// Resize a again to 300x60x4 elements; b is still 200x50x4
Kokkos::resize(a, 300, 60);
Access elements via “(idx1, idx2, …)” operator. For example: mat_a(6, 4). Data layout:
LayoutLeft: left indices have smaller strides, “column-major”, default on GPU
LayoutRight: right indices have smaller strides, “row-major”, default on CPU
Other data layouts
The stride on each dimension indicates how far apart in memory (number of current data type elements) two array entries are whose indices only differ by 1 on this dimension. The stride on each dimension is not smaller than the size of each dimension.
1
2
size_t a_strides[3];
a.strides(a_strides);
3. Spaces
Spaces control where parallel bodies are executed (execution space) and where view data resides (memory space). Examples:
Available spaces: HostSpace, CudaSpace, CudaUVMSpace, HBWSpace, ROCmSpace, and other
Deep copy: copy the data from one view to another view, two views must have the same memory layout and strides. You can use a HostMirror to copy between host view and device view. Example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// This will give a compiler error since memory layout is different
Kokkos::View<int **, Kokkos::CudaSpace>::HostMirror eh = create_mirror(e); // eh view is always on host
// Initialize eh matrix on host
Kokkos::deep_copy(e, eh); // Copy from eh to e
// Calculation using matrix e on GPU
Kokkos::deep_copy(eh, e); // Copy from e to eh
4. Reduction
Many reducers are available: Sum, Prod, Min, Max, and other. Can use multiple reducers for multiple data types simultaneously (after version 3.2), example:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
int neg_cnt = 0;
float sum = 0.0, max = a(0);
Kokkos::parallel_reduce(
"sum_and_neg_count",
n,
KOKKOS_LAMBDA(constint &i, float &sum_, float &max_, int &neg_cnt_)
{
float a_i = a(i);
sum_ += a_i;
if (a_i > max_) max_ = a_i;
if (a_i < 0) neg_cnt_++;
},
Kokkos::Sum<float>(sum),
Kokkos::Max<float>(max),
Kokkos::Sum<int>(neg_cnt)
);
Reductions with an array of results:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
structcol_sum
{
// In this case, the reduction result is an array of float
// Note the C++ notation for an array typedef (???)
typedeffloat value_type[];
// Is it the same as "typedef float* value_type;" ?