Kokkos笔记(一)

最近因工作需要学习了一下 Kokkos 这个 C++ 并行计算框架。以下是我看 Kokkos Tutorials 时记的一些笔记。

笔记中包含部分从课件中直接抄来(未经测试)的代码片段,并且省略了很多我已经熟悉的相关知识,比如 OpenMP, CUDA, 张量存储顺序等。如有疑问,请以官方教程为准。

1. Concepts for Data Parallelism

1
2
3
int res = 0;
for (int i = 0; i < n; i++)
    res += a[i] * b[i];
  • Pattern: structure of the computation, here for is the pattern. Commonly used patterns: for, reduction, scan, task-graph
  • Execution Policy: how computations are executed (range, load-balancing), here i = 0; i < n; i++ is the execution policy
  • Computational Body: code which performs each unit of work, here res += a[i] * b[i] is the computational body

Kokkos maps work to execution resources:

  • An iteration range identifies a total amount of work
  • An iteration index identifies a particular unit of work
  • Each iteration of a computational body is a unit of work

Computational bodies are given to Kokkos as functors or lambdas (compiler generated functors). Functor example:

1
2
3
4
5
6
7
8
9
10
11
12
13
struct my_functor_name
{
    // Data members that can be seen by this function
    
    // Functor constructor
    my_functor_name(<params>) {}
    
    KOKKOS_INLINE_FUNCTION
    void operator()(<params>) const 
    {
        // Computations to be performed
    }
}

A lambda can see all the variables in the current scope. It’s the same as C++11 lambda. Here are two examples (can only run on CPU):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
int n = ...;
double *a = (double*) malloc(sizeof(double) * n);
double *b = (double*) malloc(sizeof(double) * n);

// AXPY vector scaling and adding
double alpha = 8.9;
Kokkos::parallel_for(   // Pattern: for loop
    "axpy",             // Label for profiling
    n,                  // Iteration range: [0, n-1]
    // Computational body, a lambda function
    // i is the iteration index
    KOKKOS_LAMBDA(const int &i) 
    {
        b[i] += alpha * a[i];
    }
);

// Dot product
double dot_res = 0.0;
Kokkos::parallel_reduce(  // Pattern: reduction
    "dot_prod",           // Label for profiling
    n,                    // Iteration range: [0, n-1]
    // Computational body, a lambda function
    // i is the iteration index, lsum is the reference to the output
    // KOKKOS_LAMBDA captures values instead of reference
    KOKKOS_LAMBDA(const int &i, double& lsum)  
    {
        lsum += a[i] * b[i];
    }, 
    dot_res    // Returning value of the computational body
);

KOKKOS_LAMBDA will be defined to [=] __device__ or [=] __host__ __device__, depending on your CUDA version. Without CUDA it is simply [=].

2. Views

View is a lightweight C++ class with a pointer to array and some metadata specifying where and how a multidimensional array is stored.

Views are like pointers, copy them in the functor. Copy construction and assignment are shallow. Reference counting is used for automatic deallocation.

Number of dimensions (rank) is fixed at compile time. Sizes of dimensions can be set at compile-time or runtime, runtime-sized dimensions must come first. Example:

1
2
3
4
5
6
7
8
9
10
int N0 = ...;
int N1 = ...;
Kokkos::View<double **> mat_a("mat_a_label", N0, N1);
Kokkos::View<double *[N1]> mat_b("mat_b_label", N0);
// Get the sizes of dimensions
assert(mat_a.extent(0) == N0);   // or assert(mat_a.exten_0() == N0);
assert(mat_b.extent(1) == N1);   // or assert(mat_b.exten_1() == N1);
// Get the raw data pointer and label
assert(a.data() != NULL);
assert(b.label() == "A");

Resizing:

1
2
3
4
5
6
7
8
// Allocate a view with 100x50x4 elements
Kokkos::View<int**[4]> a("a", 100, 50);
// Resize a to 200x50x4 elements; the original allocation is freed
Kokkos::resize(a, 200, 50);
// Create a second view b viewing the same data as a
Kokkos::View<int**[4]> b = a;
// Resize a again to 300x60x4 elements; b is still 200x50x4
Kokkos::resize(a, 300, 60);

Access elements via “(idx1, idx2, …)” operator. For example: mat_a(6, 4).
Data layout:

  • LayoutLeft: left indices have smaller strides, “column-major”, default on GPU
  • LayoutRight: right indices have smaller strides, “row-major”, default on CPU
  • Other data layouts

The stride on each dimension indicates how far apart in memory (number of current data type elements) two array entries are whose indices only differ by 1 on this dimension. The stride on each dimension is not smaller than the size of each dimension.

1
2
size_t a_strides[3];
a.strides(a_strides);

3. Spaces

Spaces control where parallel bodies are executed (execution space) and where view data resides (memory space). Examples:

1
2
3
4
5
6
7
8
9
10
11
// Allocate 2 views in CUDA device memory
int n = ...;
Kokkos::View<float*, Kokkos::CudaSpace> vec_a("vec_a", n);
Kokkos::View<float*, Kokkos::CudaSpace> vec_b("vec_b", n);
float alpha = 5.4;
// Run on CUDA: vec_b(i) += alpha * vec_a(i)
parallel_for(
    "saxpy", 
    RangePolicy<Kokkos::CudaSpace>(0, n),  // Execute on Cuda, range [0, n-1]
    KOKKOS_LAMBDA (const int &i) { vec_b(i) += alpha * vec_a(i); }
);

Available spaces: HostSpace, CudaSpace, CudaUVMSpace, HBWSpace, ROCmSpace, and other

Deep copy: copy the data from one view to another view, two views must have the same memory layout and strides. You can use a HostMirror to copy between host view and device view. Example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
// This will give a compiler error since memory layout is different
Kokkos::View<int **, Kokkos::CudaSpace> a("a", 10, 20);
Kokkos::View<int **, Kokkos::HostSpace> b("a", 10, 20);
Kokkos::deep_copy(a, b);  // A is destination, B is source
// This might give a runtime error since strides might be different
Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::CudaSpace> c("c", 10, 20);
Kokkos::View<int **, Kokkos::LayoutLeft, Kokkos::HostSpace> d("d", 10, 20);
Kokkos::deep_copy(c, d);
// This is the safe way
Kokkos::View<int **, Kokkos::CudaSpace> e("e", 10, 20);
Kokkos::View<int **, Kokkos::CudaSpace>::HostMirror eh = create_mirror(e);  // eh view is always on host
// Initialize eh matrix on host 
Kokkos::deep_copy(e, eh);  // Copy from eh to e
// Calculation using matrix e on GPU
Kokkos::deep_copy(eh, e);  // Copy from e to eh

4. Reduction

Many reducers are available: Sum, Prod, Min, Max, and other. Can use multiple reducers for multiple data types simultaneously (after version 3.2), example:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
int neg_cnt = 0;
float sum = 0.0, max = a(0);
Kokkos::parallel_reduce(
    "sum_and_neg_count", 
    n,
    KOKKOS_LAMBDA(const int &i, float &sum_, float &max_, int &neg_cnt_)
    {
        float a_i = a(i);
        sum_ += a_i;
        if (a_i > max_) max_ = a_i;
        if (a_i < 0) neg_cnt_++;
    },
    Kokkos::Sum<float>(sum),
    Kokkos::Max<float>(max),
    Kokkos::Sum<int>(neg_cnt)
);

Reductions with an array of results:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
struct col_sum
{
    // In this case, the reduction result is an array of float
    // Note the C++ notation for an array typedef (???)
    typedef float value_type[];
    // Is it the same as "typedef float* value_type;" ?

    typedef Kokkos::View<float**>::size_type size_type;

    size_type n_col;
    Kokkos::View<float**> mat;

    col_sum(const Kokkos::View<float**> &mat_) n_col(mat_.extent_1()), mat(mat_) {}

    // value_type here is already a "reference" type, 
    // so we don't pass it in by reference
    KOKKOS_INLINE_FUNCTION
    void operator()(const size_type i, value_type sum) const
    {
        for (size_type j = 0; j < n_col; j++) sum[j] += x(i, j);
    }

    // "Join" intermediate results from different threads
    KOKKOS_INLINE_FUNCTION 
    void join(volatile value_type dst, const volatile value_type src) const
    {
        for (size_type j = 0; j < n_col; j++) dst[j] += src[j];
    }

    // Tell each thread how to initialize its reduction result
    KOKKOS_INLINE_FUNCTION 
    void init(value_type sum) const
    {
        for (size_type j = 0; j < n_col; j++) sum[j] = 0.0;
    }
};

const int nrow = 1000, ncol = 100;
Kokkos::View<float**> mat("mat", nrow, ncol);
// Fill mat
col_sum cs(mat);
float *col_sum_res = (float*) malloc(sizeof(float) * ncol);
Kokkos::parallel_reduce(mat.extent_0(), cs, col_sum_res);