I was in charge of the GPU part of the ASC20 challenge, though our showing in the final was not great. Here are some loose notes.
Implementation overview
- Quantum circuit optimizer
  - Merge consecutive gates whose targetQubit is the same or smaller than 12
  - Fuse probability calculations
  - Express all other gates with two generic gates
- Multi-GPU implementation based on CUDA-aware MPI
  - Overlap computation and communication with 15 streams and 16 device-memory buffers
- Hand-written device-memory allocator (see the sketch after this list)
  - Pre-allocate 92.5% of device memory as a stack
  - Allocate at run time by simply moving the stack-top pointer
  - Saves 200+ GB compared with the original multi-process QuEST implementation when running 34 qubits
- Better GPU utilization
  - Use shared memory for gates with targetQubit < 12
  - Use constant memory for the merged gate parameters (roughly 600 can be stored)
  - Use cuBLAS instead of the original vector-reduction and dot-product kernels
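The device-memory allocator mentioned above is essentially a stack ("bump") allocator over one big pre-reserved buffer. Below is a minimal sketch of the idea, assuming the simulator's temporary buffers have stack-like lifetimes; the names (GpuStackAllocator, etc.) are made up for illustration and this is not the actual competition code.

```cpp
#include <cuda_runtime.h>
#include <cassert>

// Minimal sketch of a stack ("bump") allocator over one pre-reserved device
// buffer. Names are illustrative; this is not the actual competition code.
struct GpuStackAllocator {
    char *base = nullptr;  // start of the pre-reserved device pool
    size_t capacity = 0;   // total bytes reserved
    size_t top = 0;        // current stack-top offset

    // Reserve a fixed fraction of the currently free device memory up front.
    void init(double fraction = 0.925) {
        size_t freeBytes = 0, totalBytes = 0;
        cudaMemGetInfo(&freeBytes, &totalBytes);
        capacity = static_cast<size_t>(freeBytes * fraction);
        cudaMalloc(&base, capacity);
    }

    // Allocation just moves the stack-top pointer (256-byte aligned here).
    void *alloc(size_t bytes) {
        size_t aligned = (bytes + 255) & ~size_t(255);
        assert(top + aligned <= capacity && "device stack overflow");
        void *p = base + top;
        top += aligned;
        return p;
    }

    // Buffers are released in LIFO order by rolling the stack top back.
    size_t mark() const { return top; }
    void release(size_t savedMark) { top = savedMark; }

    void destroy() {
        cudaFree(base);
        base = nullptr;
        capacity = top = 0;
    }
};
```

Under that assumption, every allocation and release is just pointer arithmetic, avoiding the latency and fragmentation of repeated cudaMalloc/cudaFree calls.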
Final-round results
GPU(s) | 1 | 2 | 4 | 8 |
---|---|---|---|---|
main_HamExp | 15.130841s | 13.011804s | 18.062945s | 15.508143s |
main_InvQFT | x | x | x | 33.724309s |
random | 8.241649s | 10.510134s | 13.1650710s | 9.797779s |
GHZ | 1.148370s | 1.057206s | 1.641490s | 1.313547s |
GHZ_QFT_N(N=29) | 0.560380s | 0.545665s | 0.840117s | 0.653872s |
We got hammered by Tsinghua and Beihang: their main_InvQFT case took 9s and 22s respectively on four nodes with eight GPUs… Their device-memory allocation was better than mine; I only got two of the cases through, while they finished almost all of them. I chatted with a Beihang student and asked what they do when device memory runs out; the answer was: spill it back to host memory… I also asked the Tsinghua team captain; they simply restructured how QuEST distributes the state vector. A HUST student said Tsinghua switched the communication to all-to-all, which makes a lot of sense in hindsight, because in my implementation many pairs of processes never communicate at all, so spreading the traffic evenly as an AlltoAll could indeed be more efficient.
When optimizing I feel I keep getting stuck squeezing fractions of a second out of small details, and I still badly lack the ability to restructure the whole piece of software at the algorithmic level.
The Xinhai Revolution has failed.
Runtime environment
Our final-round platform, 「星河一号」 (Xinghe-1), consisted of four Inspur 5280M6 servers with two NVIDIA A100 PCIe GPUs per node. We found that P2P communication between the two GPUs inside a node was far slower than communication between two GPUs on different nodes, and measurements with NVIDIA's CUDA P2P sample backed this up. The slowdown was far larger than the gap between PCIe 4.0 and NVLink, so we suspected that intra-node P2P was actually going through host memory. We therefore re-cabled the machine so that both GPUs hung off the same CPU, which doubled the speed but still left it well below the cross-node rate.
Accordingly, at run time we assigned processes to nodes using the mapping 12213443, so that the two processes within a node would, as far as the algorithm allows, not need to communicate with each other. Even so, our code's eight-GPU performance on the final-round machines was still slightly lower than its single-GPU performance.
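For reference, whether direct P2P access is even possible between two GPUs can be checked with the standard CUDA runtime API; a minimal sketch follows (the actual bandwidth numbers in our test came from NVIDIA's P2P sample, not from this snippet). When peer access is unavailable, peer copies are staged through host memory, which would match the slowdown we saw inside a node.

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Minimal sketch: check and enable peer-to-peer access between GPU 0 and GPU 1.
int main() {
    int can01 = 0, can10 = 0;
    cudaDeviceCanAccessPeer(&can01, 0, 1);
    cudaDeviceCanAccessPeer(&can10, 1, 0);
    printf("P2P 0->1: %d, 1->0: %d\n", can01, can10);

    if (can01) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // flags must be 0
    }
    if (can10) {
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }
    // With peer access enabled, cudaMemcpyPeer / CUDA-aware MPI can take the
    // direct path instead of staging through host memory.
    return 0;
}
```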
$ nvidia-smi
Tue May 11 12:08:26 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 455.32.00 Driver Version: 455.32.00 CUDA Version: 11.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 A100-PCIE-40GB On | 00000000:CA:00.0 Off | 0 |
| N/A 30C P0 36W / 250W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
| 1 A100-PCIE-40GB On | 00000000:E3:00.0 Off | 0 |
| N/A 31C P0 36W / 250W | 0MiB / 40536MiB | 0% Default |
| | | Disabled |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
GPU0 GPU1 mlx5_0 CPU Affinity NUMA Affinity
GPU0 X NODE NODE 32-63 1
GPU1 NODE X NODE 32-63 1
mlx5_0 NODE NODE X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
PIX = Connection traversing at most a single PCIe bridge
NV# = Connection traversing a bonded set of # NVLinks
Preliminary-round results
GPU(s) | 1 | 2 | 4 |
---|---|---|---|
random | x | 9.62s | 7.38s |
GHZ | x | 0.98s | 0.79s |
GHZ_QFT_N(N=29) | 0.75s | 0.47s | 0.38s |
In fact we held back when submitting our preliminary-round results (to avoid being over-hyped by the organizers): we submitted only last year's code with multi-node multi-GPU support and no other optimization. The random and GHZ cases took 8.53s and 7.45s respectively. Clearly we were not the only ones doing this, and others held back even more…
Runtime environment
The preliminary round ran on the V100 nodes of the TH-2K supercomputing center. Each node has four V100 16GB GPUs wired together with NVLink in a rather awkward topology: the number of NVLinks between different pairs of GPUs is not the same.
$ nvidia-smi
Wed May 5 18:58:36 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67 Driver Version: 418.67 CUDA Version: 10.1 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
|===============================+======================+======================|
| 0 Tesla V100-SXM2... Off | 00000000:8A:00.0 Off | 0 |
| N/A 39C P0 56W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 1 Tesla V100-SXM2... Off | 00000000:8B:00.0 Off | 0 |
| N/A 36C P0 61W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 2 Tesla V100-SXM2... Off | 00000000:B3:00.0 Off | 0 |
| N/A 36C P0 62W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
| 3 Tesla V100-SXM2... Off | 00000000:B4:00.0 Off | 0 |
| N/A 39C P0 60W / 300W | 10MiB / 16130MiB | 0% Default |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: GPU Memory |
| GPU PID Type Process name Usage |
|=============================================================================|
| No running processes found |
+-----------------------------------------------------------------------------+
$ nvidia-smi topo -m
GPU0 GPU1 GPU2 GPU3 mlx5_0 mlx5_1 mlx5_2 mlx5_3 CPU Affinity
GPU0 X NV2 NV1 NODE PIX PIX NODE NODE 14-27
GPU1 NV2 X NODE NV2 PIX PIX NODE NODE 14-27
GPU2 NV1 NODE X NV2 NODE NODE PIX PIX 14-27
GPU3 NODE NV2 NV2 X NODE NODE PIX PIX 14-27
mlx5_0 PIX PIX NODE NODE X PIX NODE NODE
mlx5_1 PIX PIX NODE NODE PIX X NODE NODE
mlx5_2 NODE NODE PIX PIX NODE NODE X PIX
mlx5_3 NODE NODE PIX PIX NODE NODE PIX X
Legend:
X = Self
SYS = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
PHB = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
PXB = Connection traversing multiple PCIe switches (without traversing the PCIe Host Bridge)
PIX = Connection traversing a single PCIe switch
NV# = Connection traversing a bonded set of # NVLinks
Easter egg
After changing only two lines of code in statevec_pauliYKernel (see statevec_pauliYKernel_WuK below), the random case got 0.8s faster on a single V100 32GB. Unfortunately, in the later refactored code I expressed all quantum gates with two generic gates, so this little easter egg never got used.
// nvcc pauliYKernel.cu -ptx -arch=compute_70 -code=sm_70
typedef double qreal;
typedef struct
{
char *buffer; // generated QASM string
int bufferSize; // maximum number of chars before overflow
int bufferFill; // number of chars currently in buffer
int isLogging; // whether gates are being added to buffer
} QASMLogger;
/** Represents an array of complex numbers grouped into an array of
* real components and an array of coressponding complex components.
*
* @ingroup type
* @author Ania Brown
*/
typedef struct ComplexArray
{
qreal *real;
qreal *imag;
} ComplexArray;
/// \endcond
/** Represents a system of qubits.
* Qubits are zero-based
*
* @ingroup type
* @author Ania Brown
* @author Tyson Jones (density matrix)
*/
typedef struct Qureg
{
//! Whether this instance is a density-state representation
int isDensityMatrix;
//! The number of qubits represented in either the state-vector or density matrix
int numQubitsRepresented;
//! Number of qubits in the state-vector - this is double the number represented for mixed states
int numQubitsInStateVec;
//! Number of probability amplitudes held in stateVec by this process
//! In the non-MPI version, this is the total number of amplitudes
long long int numAmpsPerChunk;
//! Total number of amplitudes, which are possibly distributed among machines
long long int numAmpsTotal;
//! The position of the chunk of the state vector held by this process in the full state vector
int chunkId;
//! Number of chunks the state vector is broken up into -- the number of MPI processes used
int numChunks;
//! Computational state amplitudes - a subset thereof in the MPI version
ComplexArray stateVec;
//! Temporary storage for a chunk of the state vector received from another process in the MPI version
ComplexArray pairStateVec;
ComplexArray pairDeviceStateVec;
//! Storage for wavefunction amplitudes in the GPU version
ComplexArray deviceStateVec;
//! Storage for reduction of probabilities on GPU
qreal *firstLevelReduction, *secondLevelReduction;
int *lastPairRank;
void *deviceStateVecRealHandle, *deviceStateVecImagHandle, *pairDeviceStateVecRealHandle, *pairDeviceStateVecImagHandle;
//! Storage for generated QASM output
QASMLogger *qasmLog;
} Qureg;
__global__ void statevec_pauliYKernel(Qureg qureg, const int targetQubit, const int conjFac)
{
long long int sizeHalfBlock = 1LL << targetQubit;
long long int sizeBlock = 2LL * sizeHalfBlock;
long long int numTasks = qureg.numAmpsPerChunk >> 1;
long long int thisTask = blockIdx.x * blockDim.x + threadIdx.x;
if (thisTask >= numTasks)
return;
long long int thisBlock = thisTask / sizeHalfBlock;
long long int indexUp = thisBlock * sizeBlock + thisTask % sizeHalfBlock;
long long int indexLo = indexUp + sizeHalfBlock;
qreal stateRealUp, stateImagUp;
qreal *stateVecReal = qureg.deviceStateVec.real;
qreal *stateVecImag = qureg.deviceStateVec.imag;
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
// update under +-{{0, -i}, {i, 0}}
stateVecReal[indexUp] = conjFac * stateVecImag[indexLo];
stateVecImag[indexUp] = conjFac * -stateVecReal[indexLo];
stateVecReal[indexLo] = conjFac * -stateImagUp;
stateVecImag[indexLo] = conjFac * stateRealUp;
}
__global__ void statevec_pauliYKernel_WuK(Qureg qureg, const int targetQubit, const int conjFac)
{
long long int sizeHalfBlock = 1LL << targetQubit;
long long int sizeBlock = 2LL * sizeHalfBlock;
long long int numTasks = qureg.numAmpsPerChunk >> 1;
long long int thisTask = blockIdx.x * blockDim.x + threadIdx.x;
if (thisTask >= numTasks)
return;
long long int thisBlock = thisTask / sizeHalfBlock;
long long int indexUp = thisBlock * sizeBlock + thisTask % sizeHalfBlock;
long long int indexLo = indexUp + sizeHalfBlock;
qreal stateRealUp, stateImagUp;
qreal *stateVecReal = qureg.deviceStateVec.real;
qreal *stateVecImag = qureg.deviceStateVec.imag;
stateRealUp = stateVecReal[indexUp];
stateImagUp = stateVecImag[indexUp];
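// The only change relative to statevec_pauliYKernel: load the indexLo values
// into registers first, so that all global reads happen before any global write.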
const qreal
stateRealLo = stateVecReal[indexLo],
stateImagLo = stateVecImag[indexLo];
// update under +-{{0, -i}, {i, 0}}
stateVecReal[indexUp] = conjFac * stateImagLo;
stateVecImag[indexUp] = conjFac * -stateRealLo;
stateVecReal[indexLo] = conjFac * -stateImagUp;
stateVecImag[indexLo] = conjFac * stateRealUp;
}
Running in a terminal
nvcc pauliYKernel.cu -ptx -arch=compute_70 -code=sm_70
produces the corresponding PTX file:
//
// Generated by NVIDIA NVVM Compiler
//
// Compiler Build ID: CL-26907403
// Cuda compilation tools, release 10.1, V10.1.243
// Based on LLVM 3.4svn
//
.version 6.4
.target sm_70
.address_size 64
// .globl _Z21statevec_pauliYKernel5Quregii
.visible .entry _Z21statevec_pauliYKernel5Quregii(
.param .align 8 .b8 _Z21statevec_pauliYKernel5Quregii_param_0[168],
.param .u32 _Z21statevec_pauliYKernel5Quregii_param_1,
.param .u32 _Z21statevec_pauliYKernel5Quregii_param_2
)
{
.reg .pred %p<3>;
.reg .b32 %r<11>;
.reg .f64 %fd<12>;
.reg .b64 %rd<29>;
mov.b64 %rd8, _Z21statevec_pauliYKernel5Quregii_param_0;
ld.param.u32 %r2, [_Z21statevec_pauliYKernel5Quregii_param_1];
ld.param.u32 %r1, [_Z21statevec_pauliYKernel5Quregii_param_2];
mov.u64 %rd3, %rd8;
cvt.u64.u32 %rd1, %r2;
mov.u64 %rd9, 1;
shl.b64 %rd2, %rd9, %r2;
ld.param.u64 %rd10, [_Z21statevec_pauliYKernel5Quregii_param_0+16];
shr.s64 %rd11, %rd10, 1;
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r6, %r3, %r4, %r5;
cvt.u64.u32 %rd4, %r6;
setp.ge.s64 %p1, %rd4, %rd11;
@%p1 bra BB0_5;
and.b64 %rd12, %rd2, -4294967296;
setp.eq.s64 %p2, %rd12, 0;
@%p2 bra BB0_3;
rem.s64 %rd28, %rd4, %rd2;
bra.uni BB0_4;
BB0_3:
cvt.u32.u64 %r7, %rd2;
cvt.u32.u64 %r8, %rd4;
rem.u32 %r9, %r8, %r7;
cvt.u64.u32 %rd28, %r9;
BB0_4:
cvt.u32.u64 %r10, %rd1;
shr.u64 %rd13, %rd4, %r10;
mul.lo.s64 %rd14, %rd2, %rd13;
shl.b64 %rd15, %rd14, 1;
add.s64 %rd16, %rd28, %rd15;
add.s64 %rd17, %rd16, %rd2;
ld.param.u64 %rd18, [%rd3+88];
cvta.to.global.u64 %rd19, %rd18;
ld.param.u64 %rd20, [%rd3+96];
cvta.to.global.u64 %rd21, %rd20;
shl.b64 %rd22, %rd16, 3;
add.s64 %rd23, %rd19, %rd22;
ld.global.f64 %fd1, [%rd23];
add.s64 %rd24, %rd21, %rd22;
ld.global.f64 %fd2, [%rd24];
shl.b64 %rd25, %rd17, 3;
add.s64 %rd26, %rd21, %rd25;
ld.global.f64 %fd3, [%rd26];
cvt.rn.f64.s32 %fd4, %r1;
mul.f64 %fd5, %fd4, %fd3;
st.global.f64 [%rd23], %fd5;
add.s64 %rd27, %rd19, %rd25;
ld.global.f64 %fd6, [%rd27];
mul.f64 %fd7, %fd4, %fd6;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd27], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd26], %fd11;
BB0_5:
ret;
}
// .globl _Z25statevec_pauliYKernel_WuK5Quregii
.visible .entry _Z25statevec_pauliYKernel_WuK5Quregii(
.param .align 8 .b8 _Z25statevec_pauliYKernel_WuK5Quregii_param_0[168],
.param .u32 _Z25statevec_pauliYKernel_WuK5Quregii_param_1,
.param .u32 _Z25statevec_pauliYKernel_WuK5Quregii_param_2
)
{
.reg .pred %p<3>;
.reg .b32 %r<11>;
.reg .f64 %fd<12>;
.reg .b64 %rd<29>;
mov.b64 %rd8, _Z25statevec_pauliYKernel_WuK5Quregii_param_0;
ld.param.u32 %r2, [_Z25statevec_pauliYKernel_WuK5Quregii_param_1];
ld.param.u32 %r1, [_Z25statevec_pauliYKernel_WuK5Quregii_param_2];
mov.u64 %rd3, %rd8;
cvt.u64.u32 %rd1, %r2;
mov.u64 %rd9, 1;
shl.b64 %rd2, %rd9, %r2;
ld.param.u64 %rd10, [_Z25statevec_pauliYKernel_WuK5Quregii_param_0+16];
shr.s64 %rd11, %rd10, 1;
mov.u32 %r3, %ntid.x;
mov.u32 %r4, %ctaid.x;
mov.u32 %r5, %tid.x;
mad.lo.s32 %r6, %r3, %r4, %r5;
cvt.u64.u32 %rd4, %r6;
setp.ge.s64 %p1, %rd4, %rd11;
@%p1 bra BB1_5;
and.b64 %rd12, %rd2, -4294967296;
setp.eq.s64 %p2, %rd12, 0;
@%p2 bra BB1_3;
rem.s64 %rd28, %rd4, %rd2;
bra.uni BB1_4;
BB1_3:
cvt.u32.u64 %r7, %rd2;
cvt.u32.u64 %r8, %rd4;
rem.u32 %r9, %r8, %r7;
cvt.u64.u32 %rd28, %r9;
BB1_4:
cvt.u32.u64 %r10, %rd1;
shr.u64 %rd13, %rd4, %r10;
mul.lo.s64 %rd14, %rd2, %rd13;
shl.b64 %rd15, %rd14, 1;
add.s64 %rd16, %rd28, %rd15;
add.s64 %rd17, %rd16, %rd2;
ld.param.u64 %rd18, [%rd3+88];
cvta.to.global.u64 %rd19, %rd18;
ld.param.u64 %rd20, [%rd3+96];
cvta.to.global.u64 %rd21, %rd20;
shl.b64 %rd22, %rd16, 3;
add.s64 %rd23, %rd19, %rd22;
ld.global.f64 %fd1, [%rd23];
add.s64 %rd24, %rd21, %rd22;
ld.global.f64 %fd2, [%rd24];
shl.b64 %rd25, %rd17, 3;
add.s64 %rd26, %rd19, %rd25;
ld.global.f64 %fd3, [%rd26];
add.s64 %rd27, %rd21, %rd25;
cvt.rn.f64.s32 %fd4, %r1;
ld.global.f64 %fd5, [%rd27];
mul.f64 %fd6, %fd4, %fd5;
st.global.f64 [%rd23], %fd6;
mul.f64 %fd7, %fd4, %fd3;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd26], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd27], %fd11;
BB1_5:
ret;
}
You can see that after this naive modification, part of the memory-access code changes from
add.s64 %rd26, %rd21, %rd25;
ld.global.f64 %fd3, [%rd26];
cvt.rn.f64.s32 %fd4, %r1;
mul.f64 %fd5, %fd4, %fd3;
st.global.f64 [%rd23], %fd5;
add.s64 %rd27, %rd19, %rd25;
ld.global.f64 %fd6, [%rd27];
mul.f64 %fd7, %fd4, %fd6;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd27], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd26], %fd11;
to
add.s64 %rd26, %rd19, %rd25;
ld.global.f64 %fd3, [%rd26];
add.s64 %rd27, %rd21, %rd25;
cvt.rn.f64.s32 %fd4, %r1;
ld.global.f64 %fd5, [%rd27];
mul.f64 %fd6, %fd4, %fd5;
st.global.f64 [%rd23], %fd6;
mul.f64 %fd7, %fd4, %fd3;
neg.f64 %fd8, %fd7;
st.global.f64 [%rd24], %fd8;
mul.f64 %fd9, %fd4, %fd2;
neg.f64 %fd10, %fd9;
st.global.f64 [%rd26], %fd10;
mul.f64 %fd11, %fd4, %fd1;
st.global.f64 [%rd27], %fd11;
As you can see, the kernel's order of reads and writes to global memory has changed: originally the ld.global and st.global instructions were interleaved to some extent, whereas after the adjustment all the reads come before all the writes, which makes it easier to coalesce memory accesses at run time. The effectiveness of the optimization can be verified with nvprof. In case 1, statevec_pauliYKernel has 43 Calls, with Avg/Min/Max run times of 63.251ms/57.993ms/84.206ms, and its total run time accounts for 14.91% of case 1's total time (2.71978s/18.688107s). After the modification, the Avg/Min/Max become 44.101ms/42.064ms/53.005ms, and the total accounts for 10.87% of case 1's total time (1.89634s/17.922671s). In other words, changing just two lines of the original CUDA code brought the single-GPU code a speedup of nearly 1s, which is quite effective.
The way this issue was found is also worth noting. In the nvprof output, this kernel's Avg (63.251ms) was far larger than that of statevec_pauliXKernel (43.516ms), which does the same amount of work, and its Max was far larger than its Avg, indicating large run-to-run variance and therefore room for optimization. Comparing statevec_pauliYKernel with statevec_pauliXKernel line by line, almost the only difference was that, to save two variables, the original author had placed the memory accesses directly inside the arithmetic expressions; after adjusting this, the optimization indeed paid off. Similar tweaks later squeezed another 0.4s out of other places. This also suggests a good habit when writing CUDA code: avoid mixing reads and writes of global memory in the same expression, and keep reads from and writes to global memory separated, each grouped together.