Kokkos Notes (Part 2)

Part 2 of my notes on the Kokkos Tutorials.

5. MDRangePolicy

MDRangePolicy parallelizes tightly nested loops of 2 to 6 dimensions (similar to OpenMP's collapse clause). Example:

// OpenMP version
float sum = 0.0f;
#pragma omp parallel for collapse(3) reduction(+:sum)
for (int x = nx_s; x < nx_e; x++)
    for (int y = ny_s; y < ny_e; y++)
        for (int z = nz_s; z < nz_e; z++)
            sum += f[x][y][z];

// Kokkos version (f is a rank-3 View here)
// Note: when using MDRangePolicy, multiple reducers cannot be used
Kokkos::parallel_reduce(
    "Sum_over_cuboid",
    MDRangePolicy<Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}),
    KOKKOS_LAMBDA(int i, int j, int k, float &lsum) { lsum += f(i, j, k); },
    Kokkos::Sum<float>(sum)
);

Tiling is easy: just add a third set of parameters, MDRangePolicy<Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}, {ts_x, ts_y, ts_z}), where ts_{x/y/z} is the tile size along the x/y/z loop. On GPUs, a tile is handled by a single thread block.

The default iteration pattern matches the default memory layout. The iteration order between tiles (IterateOuter) and within tiles (IterateInner) can be changed via Kokkos::Rank<ndim, IterateOuter, IterateInner>, where both directions are Kokkos::Iterate values; see the sketch below.
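
As a minimal sketch (reusing f, sum, and the loop bounds from the example above; the 4x4x4 tile size and the chosen iteration orders are just illustrative), tiling and the per-tile iteration order can be combined like this:

using Kokkos::Iterate;
Kokkos::parallel_reduce(
    "Sum_over_cuboid_tiled",
    // 4x4x4 tiles; Iterate::Left between tiles, Iterate::Right within each tile
    MDRangePolicy<Rank<3, Iterate::Left, Iterate::Right>>(
        {nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}, {4, 4, 4}),
    KOKKOS_LAMBDA(int i, int j, int k, float &lsum) { lsum += f(i, j, k); },
    Kokkos::Sum<float>(sum)
);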

WorkTag: enables multiple operators in one functor:

struct foo
{
    struct Tag1{}; struct Tag2{};
    KOKKOS_FUNCTION void operator()(Tag1, int i) const {...}
    KOKKOS_FUNCTION void operator()(Tag2, int i) const {...}
    void run_both(int n)
    {
        parallel_for(RangePolicy<Tag1>(0, n), *this);
        parallel_for(RangePolicy<Tag2>(0, n), *this);
    }
};

6. Subview

A subview is a slice along each dimension of a View and refers to the same underlying data. It can be constructed on the host or within a kernel, and behaves like the "colon" notation provided by MATLAB, Fortran, and Python.

Subview can take three types of slice arguments:

  • Index: a scalar; only that index along the dimension is taken, and the dimension is dropped from the resulting subview
  • Kokkos::pair: a half-open range of indices
  • Kokkos::ALL: the entire range

For example, the following code is equivalent to MATLAB code norm(tensor(3, 5:10, :), 'fro'):

Kokkos::View<double***> tensor("tensor", N0, N1, N2);
// Take index {2} x range [4, 10) x full range --> a 6 x N2 rank-2 subview
auto sv = Kokkos::subview(tensor, 2, Kokkos::make_pair(4, 10), Kokkos::ALL);
double fnorm = 0.0;
Kokkos::parallel_reduce(
    "fro_norm",
    MDRangePolicy<Rank<2>>({0, 0}, {6, N2}),
    KOKKOS_LAMBDA(int j, int k, double &lsum) {
        double sv_jk = sv(j, k);
        lsum += sv_jk * sv_jk;
    },
    fnorm
);
fnorm = std::sqrt(fnorm);

7. Thread Safety and Atomic Operations

Example: histogram

Kokkos::View<int*> histogram("histogram", num_bucket);
Kokkos::parallel_for(num_bucket, KOKKOS_LAMBDA(const int &i) { histogram(i) = 0; });
Kokkos::parallel_for(
    "histogram",
    n,
    KOKKOS_LAMBDA(const int &i)
    {
        const dtype value = ...;
        const int bucket_index = calc_bucket_index(value);
        Kokkos::atomic_add(&histogram(bucket_index), 1);
    }
);

Views can have memory traits, including Atomic, RandomAccess, Restrict, Unmanaged, and others. Two examples:

Kokkos::View<int*> a("a", 100);
Kokkos::View<int*, Kokkos::MemoryTraits<Kokkos::Atomic>> a_atomic = a;
a_atomic(1) += 1; // This access will do an atomic addition

const size_t N0 = ...;
Kokkos::View<int*> a_nonconst("a", N0); // Allocate nonconst View
// Assign to const, RandomAccess View (useful on GPU)
Kokkos::View<const int*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> a_ra = a_nonconst;
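
The examples above only show Atomic and RandomAccess. As a hedged sketch of the Unmanaged trait (the names here are illustrative), a View can wrap memory that Kokkos did not allocate, e.g. a std::vector on the host:

std::vector<double> raw(N0); // memory owned by the vector, not by Kokkos
Kokkos::View<double*, Kokkos::HostSpace,
             Kokkos::MemoryTraits<Kokkos::Unmanaged>> a_unmanaged(raw.data(), N0);
a_unmanaged(0) = 1.0; // reads and writes go straight to raw.data()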

ScatterView transparently switches between atomic operations and data replication (every thread owns a private copy that is combined afterwards), picking whichever suits the backend.

void compute_forces(
    View<real3*> x, View<real3*> f, ScatterView<real3*> scatter_f,
    View<int**> neighs, Interaction force
)
{
    scatter_f.reset();
    int N = x.extent(0);
    int num_neighs = neighs.extent(1);
    Kokkos::parallel_for(
        "ForceCompute", N,
        KOKKOS_LAMBDA(const int i)
        {
            auto f_a = scatter_f.access();
            for (int j = 0; j < num_neighs; j++)
            {
                real3 df = force.compute(x(i), x(neighs(i, j)));
                f_a(i) += df;            // Only += and -= operators are available
                f_a(neighs(i, j)) -= df;
            }
        }
    );
    Kokkos::Experimental::contribute(f, scatter_f);
}
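
A hedged usage sketch (not from the tutorial): ScatterView lives in the Kokkos::Experimental namespace (header Kokkos_ScatterView.hpp) and can be constructed directly from the destination View before calling the function above; here x, f, neighs, and force are assumed to be set up elsewhere.

#include <Kokkos_ScatterView.hpp>

// Build the ScatterView once from the force View, then reuse it for every call
Kokkos::Experimental::ScatterView<real3*> scatter_f(f);
compute_forces(x, f, scatter_f, neighs, force);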

8. Hierarchical Parallelism

Thread team: a collection of threads which are guaranteed to be executing concurrently and can synchronize (similar to a thread block in CUDA).

Here are some important properties:

using member_type = Kokkos::TeamPolicy<>::member_type;
void operator()(const member_type &team_member) const
{
    // How many teams are there and which team am I on
    const unsigned int num_team = team_member.league_size();
    const unsigned int team_idx = team_member.league_rank();
    // How many threads are in the team and which thread am I on
    const unsigned int num_thread = team_member.team_size();
    const unsigned int thread_idx = team_member.team_rank();
}
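
Threads in the same team can synchronize with team_member.team_barrier().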

Hierarchical parallelism using TeamPolicy: total work = number of teams (league size) * team size

parallel_operation(
    "label",
    TeamPolicy<execution_space>(number_of_teams, team_size),
    functor,
    // Something else
)

Nested parallel pattern: launch parallel operations inside a parallel execution. Here is an example of a matrix-vector multiplication:

const int m = ...;
const int n = ...;
Kokkos::View<float**> A("A", m, n);
Kokkos::View<float*> x("x", n);
Kokkos::View<float*> b("b", m);
// Fill A, x and set b to all 0
Kokkos::parallel_for(
    "GEMV",
    TeamPolicy<exec_space>(m, Kokkos::AUTO), // Kokkos::AUTO can be replaced by an explicit team_size
    KOKKOS_LAMBDA(const member_type &team_member)
    {
        int row = team_member.league_rank();
        float row_dot_sum = 0.0f;
        Kokkos::parallel_reduce(
            // Inner lambda:
            // (1) policy is always a TeamThreadRange
            // (2) may capture by reference, but capture by value is recommended
            // (3) cannot use KOKKOS_LAMBDA
            Kokkos::TeamThreadRange(team_member, n),
            [=] (const int col, float &lsum)
            {
                lsum += A(row, col) * x(col);
            },
            row_dot_sum
        );
        if (team_member.team_rank() == 0) b(row) = row_dot_sum;
    }
);

Third level parallelism: ThreadVectorRange. The tutorial does not provide a detailed example, but a hedged sketch follows below.
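
A hedged sketch (not from the tutorial), reusing A, x, b, m, n, exec_space, and member_type from the GEMV example above: each team handles a block of rows, TeamThreadRange splits the rows of the block among the team's threads, and ThreadVectorRange lets the vector lanes of each thread reduce over the columns. rows_per_team and num_teams are illustrative choices.

const int rows_per_team = 32; // illustrative block size
const int num_teams = (m + rows_per_team - 1) / rows_per_team;
Kokkos::parallel_for(
    "GEMV_3level",
    TeamPolicy<exec_space>(num_teams, Kokkos::AUTO, Kokkos::AUTO), // league size, team size, vector length
    KOKKOS_LAMBDA(const member_type &team_member)
    {
        const int first_row = team_member.league_rank() * rows_per_team;
        // Second level: threads of the team split the rows of this block
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team_member, rows_per_team),
            [=] (const int r)
            {
                const int row = first_row + r;
                if (row >= m) return;
                float row_dot_sum = 0.0f;
                // Third level: the vector lanes of this thread reduce over the columns
                Kokkos::parallel_reduce(
                    Kokkos::ThreadVectorRange(team_member, n),
                    [=] (const int col, float &lsum) { lsum += A(row, col) * x(col); },
                    row_dot_sum
                );
                // Only one vector lane writes the result back
                Kokkos::single(Kokkos::PerThread(team_member), [&] () { b(row) = row_dot_sum; });
            }
        );
    }
);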