Measuring OpenMP Fork-Join Cost

A couple of days ago, while discussing a problem with a friend, I was reminded of a question I had run into on several occasions before: when is the speedup from OpenMP multithreading large enough to outweigh the overhead of spawning the threads?

The overhead of OpenMP multithreading comes mainly from forking and joining threads each time a parallel region is entered, which includes distributing work to the threads and waiting for them to finish their computation. Some compilers implement OpenMP with a thread pool: the pool is created the first time a parallel region is entered and is not destroyed until the program ends. The official Visual C++ documentation says the thread pool takes about 16 ms to start up:

Assuming an x64, single core, dual processor the threadpool takes about 16ms to startup. After that though there is very little cost for the threadpool.
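To make the fork/join points concrete, here is a minimal sketch (a made-up example, not the benchmark program below): every entry into a #pragma omp parallel region is one fork (thread creation, or a wake-up when a pool already exists) plus one join at the implicit barrier at the end of the region.

#include <stdio.h>
#include <omp.h>

int main(void)
{
    // First parallel region: thread (or pool) creation happens here.
    #pragma omp parallel
    printf("hello from thread %d\n", omp_get_thread_num());
    // Implicit join: all workers synchronize at a barrier, then go idle.

    // Second parallel region: no new threads are created, but the idle
    // workers still have to be woken up and joined again -- this repeated
    // per-region cost is what the benchmark below tries to measure.
    #pragma omp parallel
    printf("hello again from thread %d\n", omp_get_thread_num());

    return 0;
}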

The Intel C compiler also appears to use a thread pool, but the only documentation I could find is from 2011 and is not hosted on Intel's own website:

Intel OpenMP implementation uses thread pools. A pool of worker threads is created at the first parallel region. These threads exist for the duration of program execution. More threads may be added automatically if requested by the program. The threads are not destroyed until the last parallel region is executed.

As for GCC, some people say it has a similar thread-pool-like mechanism:

The GCC OpenMP run-time libgomp implements thread teams on POSIX systems by something akin to a thread pool.

This all sounds reassuring, as if we never need to worry about a large overhead each time we enter an OpenMP parallel region. In practice, however, we did run into exactly that problem. So let's just measure it and find out.

Test code:

#include <stdio.h>
#include <string.h>
#include <stdlib.h>
#include <x86intrin.h>
#include <omp.h>

// LEN should be a multiple of 16 so each thread's segment (LEN * sizeof(int) bytes)
// covers whole 64-byte cache lines and false sharing is avoided
#define LEN   1024
#define NTEST 1000

int main(int argc, char **argv)
{
    int nthreads = 0;
    if (argc >= 2) nthreads = atoi(argv[1]);
    if (nthreads < 1 || nthreads > omp_get_max_threads()) nthreads = omp_get_max_threads();
    printf("Number of OpenMP threads : %d\n", nthreads);

    int *a = (int*) _mm_malloc(sizeof(int) * LEN * nthreads, 64);
    int *b = (int*) _mm_malloc(sizeof(int) * LEN * nthreads, 64);
    // Use the first-touch policy to initialize these arrays.
    // Maybe we don't need this, since the segment used by each thread will be kept in L1 cache.
    #pragma omp parallel for schedule(static) num_threads(nthreads)
    for (int i = 0; i < LEN * nthreads; i++)
    {
        a[i] = i + 114;
        b[i] = i + 514;
    }

    double st, et, ut0, ut1;

    // Single-thread test: each iteration updates LEN elements
    ut0 = 0.0;
    for (int itest = 0; itest < NTEST; itest++)
    {
        st = omp_get_wtime();
        #pragma vector
        for (int i = 0; i < LEN; i++) a[i] += b[i];
        et = omp_get_wtime();
        ut0 += et - st;
        // Perturb b so the compiler cannot treat the timed loop as loop-invariant work
        b[itest % LEN] -= 1;
    }
    printf("%d single thread jobs, used time = %.4lf (ms)\n", NTEST, ut0 * 1000.0);
    ut0 /= (double) NTEST;

    // Multithreading test: LEN * nthreads elements, so each thread still updates
    // LEN elements; the per-iteration time difference from the single-thread test
    // approximates the fork-join overhead
    ut1 = 0.0;
    for (int itest = 0; itest < NTEST; itest++)
    {
        st = omp_get_wtime();
        #pragma omp parallel for schedule(static) num_threads(nthreads)
        #pragma vector
        for (int i = 0; i < LEN * nthreads; i++) a[i] += b[i];
        et = omp_get_wtime();
        ut1 += et - st;
        b[itest % LEN] -= 1;
    }
    printf("%d multi-thread jobs, used time = %.4lf (ms)\n", NTEST, ut1 * 1000.0);
    ut1 /= (double) NTEST;

    printf("OpenMP thread fork average time = %.4lf (ms)\n", 1000.0 * (ut1 - ut0));

    _mm_free(a);
    _mm_free(b);
    return 0;
}

Test script:

icc -v
gcc -v
icc -qopenmp -O3 -xHost -std=gnu99 omp_fork_test.c -o omp_fork_test.icc
gcc -fopenmp -O3 -march=native -std=gnu99 omp_fork_test.c -o omp_fork_test.gcc
export OMP_PLACES=threads
export OMP_PROC_BIND=true
for nt in 2 4 8 16 32 64 128 256; do
    ./omp_fork_test.icc $nt
    ./omp_fork_test.gcc $nt
done

Test Platforms and Results

All results below are in milliseconds (ms).

Sandy Bridge

  • Intel Xeon E5-2670: 8 cores * 2 hyper-threads @ 2.6 GHz

  • Ubuntu 16.04.3 LTS x64, GCC 5.4.0, ICC 17.0.4

Threads   2        4        8        16
ICC       0.0016   0.0017   0.0014   0.0021
GCC       0.0009   0.0015   0.0018   0.0034

Ivy Bridge

  • Intel Xeon E5-1620: 4 cores * 2 hyper-threads @ 2.6 GHz

  • Ubuntu 18.04.1 LTS x64, GCC 7.3.0, ICC 17.0.4

Threads   2        4        8
ICC       0.0009   0.0011   0.0031
GCC       0.0008   0.0017   0.0056

Haswell

  • Intel Xeon E5-2698v3: 2 sockets * 16 cores * 2 hyper-threads @ 2.3 GHz

  • SUSE Linux Enterprise Server 12 SP3 x64, GCC 7.3.0, ICC 17.0.3

Threads   2        4        8        16       32       64
ICC       0.0015   0.0017   0.0019   0.0025   0.0028   0.0033
GCC       0.0012   0.0017   0.0023   0.0035   0.0062   0.0145

Knights Landing

  • Intel Xeon Phi 7210 (Knights Landing): 64 cores * 4 hyper-threads @ 1.3 GHz

  • CentOS 7.1 x64, GCC 4.8.5, ICC 17.0.4

Threads   2        4        8        16       32       64       128      256
ICC       0.0033   0.0032   0.0039   0.0048   0.0066   0.0072   0.0097   0.0149
GCC       0.0021   0.0029   0.0041   0.0063   0.0116   0.0198   0.0429   0.1387

Skylake

  • Intel Xeon Platinum 8160: 2 sockets * 24 cores * 2 hyper-threads @ 2.1 GHz

  • CentOS 7.3 x64, GCC 5.4.0, ICC 17.0.4

Threads   2        4        8        16       32       48       96
ICC       0.0014   0.0016   0.0020   0.0025   0.0029   0.0033   0.0040
GCC       0.0026   0.0030   0.0037   0.0068   0.0103   0.0145   0.0286

Conclusion

The results show that, with both the Intel compiler and GCC, there is still overhead every time an OpenMP parallel region is entered: simply increase NTEST in the code and the total execution time of the multithreaded runs grows noticeably, while the time a single thread needs for the same per-iteration amount of work barely changes. People online have also mentioned that OMP_WAIT_POLICY and KMP_BLOCKTIME can be used to control how long OpenMP threads wait for the next task after finishing before going to sleep. In my tests the default behaved like OMP_WAIT_POLICY=passive, and setting OMP_WAIT_POLICY=active made the test program slower. It therefore looks like these compilers do implement a thread pool or something similar, but waking the sleeping threads on each entry into a parallel region still carries a measurable overhead.
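For reference, both knobs are plain environment variables. A sketch of how one might rerun the test with an active wait policy (the 200 ms block time is only an illustrative value for the Intel runtime):

export OMP_WAIT_POLICY=active   # OpenMP standard: workers keep spinning between regions
export KMP_BLOCKTIME=200        # Intel runtime only: spin for 200 ms before sleeping
./omp_fork_test.icc 16
./omp_fork_test.gcc 16          # libgomp ignores KMP_BLOCKTIME but honors OMP_WAIT_POLICY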

Looking at the numbers, the average cost per parallel-region entry grows with the number of threads for both ICC and GCC, but it grows more slowly for ICC. The overhead seems to depend little on the CPU microarchitecture and more on the CPU clock frequency. Overall, its magnitude is modest, nowhere near the 16 ms quoted above.
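As a rough back-of-the-envelope estimate (assuming the loop itself parallelizes perfectly), parallelizing only pays off when T - T/p exceeds the fork-join overhead, i.e. when the serial loop time T is greater than overhead * p / (p - 1). Taking the ~0.0025 ms measured for ICC at 16 threads on Haswell, a loop needs to take at least roughly 0.0027 ms serially, a few microseconds, before wrapping it in a parallel region starts to help; anything shorter is better left single-threaded.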