Kokkos Notes (Part 2)

Part 2 of my notes on the Kokkos Tutorials.

5. MDRangePolicy

MDRangePolicy parallelizes tightly nested loops of 2 to 6 dimensions, similar to OpenMP's collapse clause. Example:

// OpenMP version
float sum = 0.0;
#pragma omp parallel for collapse(3) reduction(+:sum)
for (int x = nx_s; x < nx_e; x++)
    for (int y = ny_s; y < ny_e; y++)
        for (int z = nz_s; z < nz_e; z++)
            sum += f[x][y][z];

// Kokkos version. Note: with MDRangePolicy, multiple reducers cannot be used
Kokkos::parallel_reduce(
    "Sum_over_cuboid",
    Kokkos::MDRangePolicy<Kokkos::Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}),
    KOKKOS_LAMBDA(int i, int j, int k, float &lsum) { lsum += f(i, j, k); },
    Kokkos::Sum<float>(sum)
);

Tiling is easy: just add a third initializer list, MDRangePolicy<Rank<3>>({nx_s, ny_s, nz_s}, {nx_e, ny_e, nz_e}, {ts_x, ts_y, ts_z}), where ts_{x/y/z} are the tile sizes of the x/y/z loops. On GPUs each tile is handled by a single thread block.

The default iteration pattern matches the default memory layout. The iteration order over tiles (IterateOuter) and within a tile (IterateInner) can be changed via Kokkos::Rank<ndim, IterateOuter, IterateInner>, as sketched below.
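
A minimal sketch of specifying both the iteration directions and the tile sizes (the bounds nx/ny/nz and the tile size 4 are made-up placeholders, and f is assumed to be the rank-3 view of floats from the example above):

// Outer direction = order in which tiles are visited,
// inner direction = order of iterations inside a tile;
// Kokkos::Iterate::Default follows the memory-space default.
using Policy3D = Kokkos::MDRangePolicy<
    Kokkos::Rank<3, Kokkos::Iterate::Left, Kokkos::Iterate::Right>>;
Policy3D policy({0, 0, 0}, {nx, ny, nz}, {4, 4, 4});  // third list = tile sizes
Kokkos::parallel_for("init", policy,
    KOKKOS_LAMBDA(int i, int j, int k) { f(i, j, k) = 0.0f; });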

WorkTag: enables multiple operators in one functor:

struct foo
{
    struct Tag1{};  struct Tag2{};
    KOKKOS_FUNCTION void operator()(Tag1, int i) const {...}
    KOKKOS_FUNCTION void operator()(Tag2, int i) const {...}
    void run_both(int n)
    {
        Kokkos::parallel_for(Kokkos::RangePolicy<Tag1>(0, n), *this);
        Kokkos::parallel_for(Kokkos::RangePolicy<Tag2>(0, n), *this);
    }
};

6. Subview

A subview is a slice of a view (one slice argument per dimension) that points to the same data; it can be constructed on the host or within a kernel. It is similar to the “colon” notation provided by MATLAB, Fortran, and Python.

Subview accepts three types of slice arguments:

  • Index: a scalar; that dimension is fixed to the given index and dropped from the subview's rank
  • Kokkos::pair: a half-open range of indices [first, second)
  • Kokkos::ALL: the entire range of that dimension

For example, the following code is equivalent to MATLAB code norm(tensor(3, 5:10, :), 'fro'):

Kokkos::View<double***> tensor("tensor", N0, N1, N2);
// MATLAB is 1-based with closed ranges: 3 -> index 2, 5:10 -> make_pair(4, 10)
auto sv = Kokkos::subview(tensor, 2, Kokkos::make_pair(4, 10), Kokkos::ALL);
double fnorm = 0.0;
Kokkos::parallel_reduce(
    "fro_norm",
    Kokkos::MDRangePolicy<Kokkos::Rank<2>>({0, 0}, {6, N2}),  // sv has extents 6 x N2
    KOKKOS_LAMBDA(int j, int k, double &lsum) {
        double sv_jk = sv(j, k);
        lsum += sv_jk * sv_jk;
    },
    fnorm
);
fnorm = std::sqrt(fnorm);

7. Thread Safety and Atomic Operations

Example: histogram

Kokkos::View<int*> histogram("histogram", num_bucket);
Kokkos::parallel_for(num_bucket, KOKKOS_LAMBDA(const int &i) {histogram(i) = 0;});
Kokkos::parallel_for(
    "histogram",
    n, 
    KOKKOS_LAMBDA(const int &i)
    {
        const dtype value = ...;
        const int bucket_index = calc_bucket_index(value);
        Kokkos::atomic_add(&histogram(bucket_index), 1);
    }
);

Views can have memory traits, including Atomic, RandomAccess, Restrict, and Unmanaged. Two examples:

Kokkos::View<int*> a("a", 100);
Kokkos::View<int*, Kokkos::MemoryTraits<Kokkos::Atomic>> a_atomic = a;
a_atomic(1) += 1; // This access will do an atomic addition

const size_t N0 = ...;
Kokkos::View<int*> a_nonconst("a", N0); // Allocate nonconst View
// Assign to a const, RandomAccess View (on GPUs this may be served by the read-only cache)
Kokkos::View<const int*, Kokkos::MemoryTraits<Kokkos::RandomAccess>> a_ra = a_nonconst;
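
The two examples above cover Atomic and RandomAccess; here is a minimal sketch of the Unmanaged trait, reusing N0 from above and assuming a host pointer raw that is made up for illustration:

// Wrap an existing host allocation in an unmanaged View: Kokkos neither
// allocates, reference-counts, nor frees this memory; the caller keeps ownership.
double *raw = new double[N0];
Kokkos::View<double*, Kokkos::HostSpace,
             Kokkos::MemoryTraits<Kokkos::Unmanaged>> a_um(raw, N0);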

ScatterView: transparently switches between atomic updates and data replication (every thread owns a private copy) depending on the backend; by default it replicates on CPUs and uses atomics on GPUs.

void compute_forces(
    View<real3*> x, View<real3*> f, ScatterView<real3*> scatter_f,
    View<int**> neighs, Interaction force
)
{
    scatter_f.reset();
    int N = x.extent(0);
    int num_neighs = neighs.extent(1);
    Kokkos::parallel_for(
        "ForceCompute", N,
        KOKKOS_LAMBDA(const int i)
        {
            auto f_a = scatter_f.access();
            for (int j = 0; j < num_neighs; j++)
            {
                real3 df = force.compute(x(i), x(neighs(i, j)));
                f_a(i)            += df;   // Only += and -= operators are available
                f_a(neighs(i, j)) -= df;   // Apply the opposite force to the neighbor
            }
        }
    );
    Kokkos::Experimental::contribute(f, scatter_f);
}
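
A possible call site, assuming the views x, f, neighs and the Interaction object force already exist: the ScatterView is constructed from the destination view f, and the contribute() call inside compute_forces folds the per-thread copies (or the atomic buffer) back into f.

Kokkos::Experimental::ScatterView<real3*> scatter_f(f);  // created from the destination view f
compute_forces(x, f, scatter_f, neighs, force);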

8. Hierarchical Parallelism

Thread team: a collection of threads that are guaranteed to execute concurrently and can synchronize (similar to a thread block in CUDA).

Some important properties, queried through the team member handle:

using member_type = Kokkos::TeamPolicy<>::member_type;
KOKKOS_FUNCTION void operator()(const member_type &team_member) const
{
    // How many teams are there and which team am I on
    const unsigned int num_team   = team_member.league_size();
    const unsigned int team_idx   = team_member.league_rank();
    // How many threads are in the team and which thread am I on
    const unsigned int num_thread = team_member.team_size();
    const unsigned int thread_idx = team_member.team_rank();
}

Hierarchical parallelism using TeamPolicy: total work = number of teams * size of teams

parallel_operation(
    "label", 
    TeamPolicy<execution_space>(number_of_teams, team_size),
    functor,
    // Something else
)

Nested parallel patterns: launch parallel operations inside another parallel execution. Here is an example of matrix-vector multiplication:

const int m = ...;
const int n = ...;
Kokkos::View<float**> A("A", m, n);
Kokkos::View<float*>  x("x", n);
Kokkos::View<float*>  b("b", m);
// Fill A, x and set b to all 0
Kokkos::parallel_for(
    "GEMV",
    Kokkos::TeamPolicy<exec_space>(m, Kokkos::AUTO),  // Kokkos::AUTO can be replaced by an explicit team size
    KOKKOS_LAMBDA(const member_type &team_member)
    {
        int row = team_member.league_rank();
        float row_dot_sum = 0.0f;
        Kokkos::parallel_reduce(
            // Inner lambda:
            // (1) the policy is always a TeamThreadRange
            // (2) may capture by reference, but capture by value is recommended
            // (3) cannot use KOKKOS_LAMBDA
            Kokkos::TeamThreadRange(team_member, n),
            [=] (const int col, float &lsum)
            {
                lsum += A(row, col) * x(col);
            },
            row_dot_sum
        );
        if (team_member.team_rank() == 0) b(row) = row_dot_sum;
    }
);

A third level of parallelism is available via ThreadVectorRange; the tutorial does not provide a detailed example, but a rough sketch follows.
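
Below is a minimal sketch (not from the tutorial) of all three levels together, initializing a rank-3 view; the sizes N, M, K and the vector length 4 are arbitrary placeholders, and member_type is the alias defined earlier in this section:

Kokkos::View<float***> v("v", N, M, K);
Kokkos::parallel_for(
    "three_level_init",
    Kokkos::TeamPolicy<>(N, Kokkos::AUTO, 4),  // league size, team size, vector length
    KOKKOS_LAMBDA(const member_type &team_member)
    {
        const int i = team_member.league_rank();
        // Second level: threads of the team split the j loop
        Kokkos::parallel_for(
            Kokkos::TeamThreadRange(team_member, M),
            [=] (const int j)
            {
                // Third level: vector lanes of the calling thread split the k loop
                Kokkos::parallel_for(
                    Kokkos::ThreadVectorRange(team_member, K),
                    [=] (const int k) { v(i, j, k) = float(i + j + k); });
            });
    });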