计算精度问题¶

如果计算精度误差比较大，那么可能为安培架构的GPU引入的TF32数值类型以及Torch等框架会自动启用TF32计算造成的。TF32可简单理解为FP16的精度，FP32的表示范围，带来了更强的性能但是可能更差的精度。

该问题可参考Torch官方文档：文档

一般来说TF32够用，但是如果权重值有比较大的异常数值（一般没有）时会出现较大误差。下面给一个简单结果对比：

code:

import torch

A = torch.tensor([[113.2017, 7.4502, 39.3118],
                  [-99.4285, 13.2169, 85.9321],
                  [194.0693, -4282.2979, 58.0138]]).float().cuda()
B = torch.tensor([[0.8673, -0.4966, 0.0337],
                  [0.0377, -0.0019, -0.9993],
                  [0.4963, 0.8680, 0.0171]]).float().cuda()

gpu = A @ B
cpu = A.cpu() @ B.cpu()
print('gpu:\n', gpu)
print('cpu:\n', cpu)
print('gpu-cpu:\n', gpu.cpu() - cpu)
print("-" * 10)

A = torch.rand(3, 3).float().cuda()
B = torch.rand(3, 3).float().cuda()
print("A:\n", A)
print("B:\n", B)

gpu = A @ B
cpu = A.cpu() @ B.cpu()
print('gpu:\n', gpu)
print('cpu:\n', cpu)
print('gpu-cpu:\n', gpu.cpu() - cpu)

output:

gpu:
 tensor([[ 1.1795e+02, -2.2091e+01, -2.9597e+00],
        [-4.3079e+01,  1.2396e+02, -1.5093e+01],
        [ 3.5670e+01, -3.7907e+01,  4.2894e+03]], device='cuda:0')
cpu:
 tensor([[ 1.1797e+02, -2.2107e+01, -2.9579e+00],
        [-4.3088e+01,  1.2394e+02, -1.5089e+01],
        [ 3.5666e+01, -3.7882e+01,  4.2868e+03]])
gpu-cpu:
 tensor([[-2.3331e-02,  1.6153e-02, -1.8351e-03],
        [ 9.2430e-03,  2.1469e-02, -3.5658e-03],
        [ 3.8834e-03, -2.4605e-02,  2.6079e+00]])
----------
A:
 tensor([[0.2938, 0.5557, 0.5823],
        [0.7572, 0.8567, 0.8239],
        [0.1630, 0.3278, 0.0526]], device='cuda:0')
B:
 tensor([[0.6398, 0.1599, 0.5362],
        [0.6011, 0.3908, 0.5424],
        [0.5615, 0.7290, 0.6213]], device='cuda:0')
gpu:
 tensor([[0.8490, 0.6888, 0.8207],
        [1.4620, 1.0566, 1.3825],
        [0.3308, 0.1926, 0.2979]], device='cuda:0')
cpu:
 tensor([[0.8489, 0.6886, 0.8207],
        [1.4620, 1.0564, 1.3825],
        [0.3308, 0.1925, 0.2979]])
gpu-cpu:
 tensor([[ 4.5955e-05,  2.0498e-04, -6.3181e-06],
        [ 8.4162e-05,  1.1504e-04, -3.2902e-05],
        [ 1.8775e-06,  6.3717e-05,  2.8968e-05]])

从上面的结果可以发现第一组A和B计算出的结果，GPU对比CPU的误差比较大，主要原因在于这个A矩阵中有比较大的数字(绝对值)，而第二组随机初始化的A和B，GPU对比CPU的误差小很多。

如何避免上述误差？可以禁止使用TF32计算。方法为：

torch.backends.cuda.matmul.allow_tf32 = False  # 禁止矩阵乘法使用tf32
torch.backends.cudnn.allow_tf32 = False        # 禁止卷积使用tf32

code:

import torch
torch.backends.cuda.matmul.allow_tf32 = False
torch.backends.cudnn.allow_tf32 = False

A = torch.tensor([[113.2017, 7.4502, 39.3118],
                  [-99.4285, 13.2169, 85.9321],
                  [194.0693, -4282.2979, 58.0138]]).float().cuda()
B = torch.tensor([[0.8673, -0.4966, 0.0337],
                  [0.0377, -0.0019, -0.9993],
                  [0.4963, 0.8680, 0.0171]]).float().cuda()

gpu = A @ B
cpu = A.cpu() @ B.cpu()
print('gpu:\n', gpu)
print('cpu:\n', cpu)
print('gpu-cpu:\n', gpu.cpu() - cpu)
print("-" * 10)

A = torch.rand(3, 3).float().cuda()
B = torch.rand(3, 3).float().cuda()
print("A:\n", A)
print("B:\n", B)

gpu = A @ B
cpu = A.cpu() @ B.cpu()
print('gpu:\n', gpu)
print('cpu:\n', cpu)
print('gpu-cpu:\n', gpu.cpu() - cpu)

output:

gpu:
 tensor([[ 1.1797e+02, -2.2107e+01, -2.9579e+00],
        [-4.3088e+01,  1.2394e+02, -1.5089e+01],
        [ 3.5666e+01, -3.7882e+01,  4.2868e+03]], device='cuda:0')
cpu:
 tensor([[ 1.1797e+02, -2.2107e+01, -2.9579e+00],
        [-4.3088e+01,  1.2394e+02, -1.5089e+01],
        [ 3.5666e+01, -3.7882e+01,  4.2868e+03]])
gpu-cpu:
 tensor([[0.0000e+00, 0.0000e+00, 2.3842e-07],
        [0.0000e+00, 0.0000e+00, 0.0000e+00],
        [7.6294e-06, 0.0000e+00, 0.0000e+00]])
----------
A:
 tensor([[0.3775, 0.7031, 0.2857],
        [0.7453, 0.2000, 0.9838],
        [0.3098, 0.7035, 0.4328]], device='cuda:0')
B:
 tensor([[0.6860, 0.6289, 0.9266],
        [0.6632, 0.1984, 0.4418],
        [0.4027, 0.1074, 0.3741]], device='cuda:0')
gpu:
 tensor([[0.8404, 0.4076, 0.7673],
        [1.0401, 0.6141, 1.1470],
        [0.8534, 0.3809, 0.7598]], device='cuda:0')
cpu:
 tensor([[0.8404, 0.4076, 0.7673],
        [1.0401, 0.6141, 1.1470],
        [0.8534, 0.3809, 0.7598]])
gpu-cpu:
 tensor([[ 0.0000e+00,  0.0000e+00,  0.0000e+00],
        [-1.1921e-07,  0.0000e+00,  0.0000e+00],
        [ 5.9605e-08, -2.9802e-08,  0.0000e+00]])