Kokkos Notes (Part 2)

Part 2 of my notes on the Kokkos Tutorials.

5. MDRangePolicy

MDRangePolicy parallelizes tightly nested loops of 2 to 6 dimensions.

Tiling: pass a third set of parameters, MDRangePolicy<Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}, {ts_x, ts_y, ts_z}), where ts_x, ts_y, and ts_z are the tile sizes of the x, y, and z loops. On GPUs, each tile is handled by a single thread block.

Default iteration patterns match the default memory layouts. The iteration pattern between tiles (IterateOuter) and within tiles (IterateInner) can be changed with Kokkos::Rank<ndim, IterateOuter, IterateInner>.

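WorkTag: enables multiple operators in one functor. The tag type is given to the policy as a template argument and selects the operator() overload whose first parameter is that tag.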

6. Subview

A subview is a slice of a view, selected dimension by dimension, that points to the same data. It can be constructed on the host or within a kernel. The idea is similar to the “colon” notation in MATLAB, Fortran, and Python.

Subview can take three types of slice arguments:

• Index: a scalar, only the given index in that dimension will remain
• Kokkos::pair: a half-open range of indices
• Kokkos::ALL: the entire range

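For example, subview(tensor, 2, make_pair(4, 10), ALL) selects the same slice as MATLAB's tensor(3, 5:10, :), because MATLAB indices are 1-based with inclusive ranges. The Frobenius norm of that subview therefore corresponds to the MATLAB code norm(tensor(3, 5:10, :), 'fro').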

7. Thread Safety and Atomic Operations

Example: histogram

A View can carry memory traits such as Atomic, RandomAccess, Restrict, and Unmanaged, among others.

ScatterView: transparently switches between atomic operations and data replication (every thread owns a copy), depending on the backend.

8. Hierarchical Parallelism

Thread team: a collection of threads that are guaranteed to execute concurrently and can synchronize (similar to a thread block in CUDA).

Here are some important properties:

Hierarchical parallelism using TeamPolicy: total work = number of teams * size of teams

Nested parallel pattern: a parallel pattern is launched inside another parallel pattern, with the inner one distributing work across the threads of a team. Matrix-vector multiplication is a typical example.

Third-level parallelism: ThreadVectorRange distributes work across the vector lanes of a single thread. The tutorial does not provide a detailed example.