Kokkos笔记(三)

Kokkos Tutorials 笔记第三部分。

9. SIMD

Kokkos 的 SIMD 看起来不太好理解,远不如 #pragma omp simd, GPU thread block, 以及 intrinsic functions 来得好理解。我不在乎这一部分,按下不提。

10. Scratch Memory

Manually managed cache. Some of the use scenarios:

  • Algorithm requires temporary workspace of size W
  • Threads in a group need to access the same data

How to use:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
TeamPolicy<exec_space> policy(num_team, team_size);
// Define a scratch memory view type
using scratch_pad_view = View<double*, ExecutionSpace::scratch_memory_space>;
// Compute how much scratch memory (in bytes) is needed for each team
size_t team_spm_msize = scratch_pad_view::shmem_size(team_vec_size);
// Tell the policy how much scratch memory is needed
int level = 0; // 0 is faster but smaller, 1 is larger but slower
Kokkos::parallel_for(
policy.set_scratch_size(level, PerTeam(team_spm_msize)), // Can also use PerThread
KOKKOS_LAMBDA(cinst member_type &team_member) const
{
// Create a view from the pre-existing scratch memory
scratch_pad_view spm(team_member.team_scratch(level), team_vec_size);
// Populate the scratch memory using TeamVectorRange
Kokkos::parallel_for(
TeamVectorRange(team_member, team_vec_size),
KOKKOS_LAMBDA(const int i) { spm(i) = src_view(i, ...); }
);
// Make sure all threads have populated the scratch memory
team_member.team_barrier();
// Then we can start calculations
}
);

11. Unique Token

Used to identify a (group of) computing resource, similar to the thread ID in OpenMP. Example usage:

1
2
3
4
5
6
7
8
9
10
11
UniqueToken<exec_space> token; // Default size: exec_space().concurrency()
int num_uniq_id = token.size();
Kokkos::parallel_for(
"lable", n,
KOKKOS_LAMBDA(const int i) const
{
int id = token.acquire();
// Using id value
token.release(id);
}
);

Can also use a unique token for a team:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
int team_size = ...;
// Actual number of teams in-flight
int num_active_team = exec_space().concurrency() / team_size;
UniqueToken<exec_space> token(num_active_team * 1.2);
Kokkos::parallel_for(
"label", TeamPolicy<exec_space>(n, team_size),
KOKKOS_LAMBDA(const team_t &team) // what is team_t?
{
int id;
// Acquire an id and broadcast it with a single thread
Kokkos::single(
PerTeam(team),
[&](int &lid) { lid = token.acquire(); },
id
);
// Other operations
team.team_barrier();
// Release the id
Kokkos::single(PerTeam(team), [&]() {token.release(id)});
}
)

12. Asynchronicity and Streams

  • Most operations in Kokkos are non-blocking, the caller returns before the operation is finished
  • Each unique instance of an execution space has its own FIFO ordered execution queue
  • Each execution space type has a default instance, not all spaces support having multiple instances
  • Execution policies & deep_copy can take an instance as the first argument
  • Use Kokkos::fense() to wait for all completion, use instance specific fence to wait on specified instance, e.g.
1
2
3
using device = Kokkos::DefaultExecutionSpace;
device dev1(...), dev2(...); // Execution space instances
dev1.fence(); // Wait on dev1 instance only
  • Most host backends are blocking dispatches (except HPX) but do not rely on blocking behavior
  • Reductions to scalars are blocking, reductions to views are non-blocking
  • 2-argument deep copy is fully blocking, deep_copy with space argument is non-blocking
  • Deallocation of views implies fence, but do not rely on deallocation fence

13. Using Kokkos with MPI

基本上和 MPI + CUDA 类似,需要先把数据倒腾回来 host 上进行打包然后再收发。按下不提。

14. Kokkos Remote Space

Kokkos has its Partitioned Global Address Space (PGAS) framework. Usage:

1
2
3
4
5
6
7
8
9
10
11
12
13
using RemoteSpace_t = Kokkos::Experimental::SHMEMSpace;
// Allocate a remote view: num_proc processes, each process has
// local_size elements of data type T
Kokkos::View<T**, RemoteSpace_t> glb_a("glb_a", num_proc, local_size);
// Access global memory, only support put and get
T val0 = 19.24, val1;
glb_a(0, 3) = val0; // Write val0 to view glb_a on proc 0 at offset 3
val1 = glb_a(2, 0); // Read val1 from view glb_a on proc 2 at offset 0
// Fence
RemoteSpace_t().fence();
// Copy data to other memory space
Kokkos::View<T**, Kokkos::HostSpace_t> host_a("host_a", 1, local_size);
Kokkos::Experimental::deep_copy(host_a, glb_a);