# Kokkos Notes (Part 2)

Part 2 of my notes on the Kokkos Tutorials.

## 5. MDRangePolicy

`MDRangePolicy` parallelizes tightly nested loops of 2 to 6 dimensions. Example:

```cpp
float sum = 0.0;
for (int x = nx_s; x < nx_e; x++)
  for (int y = ny_s; y < ny_e; y++)
    for (int z = nz_s; z < nz_e; z++)
      sum += f[x][y][z];

// Note: when applying MDRangePolicy, cannot use multiple reducers
Kokkos::parallel_reduce(
  "Sum_over_cuboid",
  Kokkos::MDRangePolicy<Kokkos::Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}),
  KOKKOS_LAMBDA(int i, int j, int k, float &lsum) { lsum += f(i, j, k); },
  Kokkos::Sum<float>(sum)
);
```

Tiling is easy: just add a third set of parameters, `MDRangePolicy<Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}, {ts_x, ts_y, ts_z})`, where `ts_x`/`ts_y`/`ts_z` are the tile sizes of the x/y/z loops. On GPUs, each tile is handled by a single thread block.

The default iteration pattern matches the default memory layout. The iteration patterns between tiles (`IterateOuter`) and within tiles (`IterateInner`) can be changed via `Kokkos::Rank<ndim, IterateOuter, IterateInner>`.

**WorkTag**: enables multiple operators in one functor:

```cpp
struct foo {
  struct Tag1 {};
  struct Tag2 {};

  KOKKOS_FUNCTION void operator()(Tag1, int i) const { /* ... */ }
  KOKKOS_FUNCTION void operator()(Tag2, int i) const { /* ... */ }

  void run_both(int N) {
    Kokkos::parallel_for(Kokkos::RangePolicy<Tag1>(0, N), *this);
    Kokkos::parallel_for(Kokkos::RangePolicy<Tag2>(0, N), *this);
  }
};
```

## 6. Subview

A subview is a slice along each dimension of a view; it points to the same data and can be constructed on the host or within a kernel. It is similar to the "colon" notation provided by MATLAB, Fortran, and Python.

A subview can take three types of slice arguments:

- An index (a scalar): only the given index in that dimension remains, and that dimension is dropped
- `Kokkos::pair`: a half-open range of indices
- `Kokkos::ALL`: the entire range

For example, the following code is equivalent to the MATLAB code `norm(tensor(3, 5:10, :), 'fro')`:

```cpp
Kokkos::View<double***> tensor("tensor", N0, N1, N2);
auto sv = Kokkos::subview(tensor, 2, Kokkos::make_pair(4, 10), Kokkos::ALL);

double fnorm = 0.0;
Kokkos::parallel_reduce(
  "fro_norm",
  Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {6, N2}),
  KOKKOS_LAMBDA(int j, int k, double &lsum) {
    double sv_jk = sv(j, k);
    lsum += sv_jk * sv_jk;
  },
  fnorm
);
fnorm = sqrt(fnorm);
```

## 7. Thread Safety and Atomic Operations

Example: building a histogram, where multiple threads may increment the same bucket concurrently:

```cpp
Kokkos::View<int*> histogram("histogram", num_bucket);
Kokkos::parallel_for(num_bucket, KOKKOS_LAMBDA(const int &i) { histogram(i) = 0; });
Kokkos::parallel_for(
  "histogram", n,
  KOKKOS_LAMBDA(const int &i) {
    const dtype value = ...;
    const int bucket_index = calc_bucket_index(value);
    Kokkos::atomic_add(&histogram(bucket_index), 1);
  }
);
```

**View** can have memory traits, including `Atomic`, `RandomAccess`, `Restrict`, `Unmanaged`, and others. Two examples:

```cpp
Kokkos::View<int*> a("a", 100);
Kokkos::View<int*, Kokkos::MemoryTraits<Kokkos::Atomic>> a_atomic = a;
a_atomic(1) += 1;  // This access will do an atomic addition

const size_t N0 = ...;
Kokkos::View<int*> a_nonconst("a", N0);  // Allocate nonconst View
// Assign to const, RandomAccess View (useful on GPU)
Kokkos::View<const int*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> a_ra = a_nonconst;
```

**ScatterView**: transparently switches between atomic updates and data replication (every thread owns a copy), depending on which is faster for the backend.

```cpp
void compute_forces(
    View<real3*> x, View<real3*> f,
    ScatterView<real3*> scatter_f,
    View<int**> neighs, Interaction force
) {
  scatter_f.reset();
  int N = x.extent(0);
  int num_neighs = neighs.extent(1);
  Kokkos::parallel_for(
    "ForceCompute", N,
    KOKKOS_LAMBDA(const int i) {
      auto f_a = scatter_f.access();
      for (int j = 0; j < num_neighs; j++) {
        real3 df = force.compute(x(i), x(neighs(i, j)));
        f_a(i) += df;               // Only += and -= operators are available
        f_a(neighs(i, j)) -= df;
      }
    }
  );
  Kokkos::Experimental::contribute(f, scatter_f);
}
```

## 8. Hierarchical Parallelism

**Thread team**: a collection of threads which are guaranteed to be **executing concurrently** and **can synchronize** (similar to a thread block in CUDA).

Here are some important properties:

```cpp
using member_type = Kokkos::TeamPolicy<>::member_type;

void operator()(const member_type &team_member) const {
  // How many teams are there, and which team am I on?
  const unsigned int num_team = team_member.league_size();
  const unsigned int team_idx = team_member.league_rank();
  // How many threads are in the team, and which thread am I?
  const unsigned int num_thread = team_member.team_size();
  const unsigned int thread_idx = team_member.team_rank();
}
```

Hierarchical parallelism using TeamPolicy: total work = number of teams * size of teams

```cpp
parallel_operation(
  "label",
  TeamPolicy<execution_space>(number_of_teams, team_size),
  functor
  // Something else
);
```

Nested parallel pattern: launch parallel operations inside a parallel execution. Here is an example of matrix-vector multiplication:

```cpp
const int m = ...;
const int n = ...;
Kokkos::View<float**> A("A", m, n);
Kokkos::View<float*> x("x", n);
Kokkos::View<float*> b("b", m);
// Fill A, x and set b to all 0

Kokkos::parallel_for(
  "GEMV",
  TeamPolicy<exec_space>(m, Kokkos::AUTO),  // Kokkos::AUTO can be replaced by an explicit team_size
  KOKKOS_LAMBDA(const member_type &team_member) {
    int row = team_member.league_rank();
    float row_dot_sum = 0.0f;
    Kokkos::parallel_reduce(
      // Inner lambda:
      // (1) policy is always a TeamThreadRange
      // (2) may capture by reference, but capture by value is recommended
      // (3) cannot use KOKKOS_LAMBDA
      Kokkos::TeamThreadRange(team_member, n),
      [=](const int col, float &lsum) { lsum += A(row, col) * x(col); },
      row_dot_sum
    );
    if (team_member.team_rank() == 0) b(row) = row_dot_sum;
  }
);
```

Third-level parallelism: `ThreadVectorRange`, but the tutorial does not provide a detailed example.