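2. Quantization-Aware Training (QAT)

2.1. Basic Concepts

1. What is the plugin?

A: Horizon Plugin Pytorch (hereafter "the Plugin") is a QAT training tool that Horizon designed with reference to PyTorch's official quantization interfaces and approach. For related documentation, see Chapter 7 of the toolchain user manual.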

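2.2. Training Environment

1. The Docker container cannot use Nvidia resources

A: Docker may have been started without the nvidia options. We recommend starting it in the following form: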

nvidia-docker run -it --shm-size="15g" \
    -v `pwd`: openexplorer/ai_toolchain_centos_7_j5:v$version$ /bin/bash

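2. Cannot import hat in the Docker container

A: If running the examples in the GPU Docker container reports a GPU RuntimeError, or loading hat fails with errors such as ModuleNotFoundError: No module named 'hat' or Segmentation fault (core dumped), check the following two points:

Start the GPU Docker container with the run_docker.sh script provided in the OE package;
Run the following script inside the container to check whether torch and CUDA work. If CUDA cannot be called, use nvidia-smi to check whether the driver version meets the requirement: starting with v1.1.60, the OE package upgraded its torch environment to 1.13.0+cu116, and the corresponding driver requirements are listed in section 4.1.2.1 of the user manual.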

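import time

import torch

print(torch.__version__)
print(torch.cuda.is_available())  # must print True for the GPU part below

a = torch.randn(10000, 1000)
b = torch.randn(1000, 2000)

# Time one matrix multiplication on the CPU.
t0 = time.time()
c = torch.matmul(a, b)
t1 = time.time()
print(a.device, t1 - t0, c.norm(2))

# Move the operands to the GPU; this raises an error if CUDA is unusable.
device = torch.device('cuda')
a = a.to(device)
b = b.to(device)

# First GPU run: includes one-time CUDA initialization overhead.
t0 = time.time()
c = torch.matmul(a, b)
torch.cuda.synchronize()  # wait for the asynchronous kernel to finish
t1 = time.time()
print(a.device, t1 - t0, c.norm(2))

# Second GPU run: reflects the steady-state speed.
t0 = time.time()
c = torch.matmul(a, b)
torch.cuda.synchronize()
t1 = time.time()
print(a.device, t1 - t0, c.norm(2))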

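2.3. QAT Training

1. Why does accuracy degrade after multi-machine training is enabled?

A: With multi-machine training enabled, the effective batch size grows several-fold, so hyperparameters such as the LR need to be adjusted accordingly to compensate.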
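As a rough illustration of rebalancing the learning rate, the sketch below applies the linear scaling rule. This rule is a common heuristic and an assumption here, not something the Horizon documentation prescribes; the function name `scale_lr` and all numbers are hypothetical.

```python
def scale_lr(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Scale the learning rate in proportion to the global batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Hypothetical setup: one machine with 8 GPUs x 32 samples, then two machines.
single_machine_batch = 8 * 32
two_machine_batch = 2 * 8 * 32
new_lr = scale_lr(0.1, single_machine_batch, two_machine_batch)
print(new_lr)
```

Treat the scaled value as a starting point only; warm-up length and weight decay may need retuning as well.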

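2. Does the Qconfig require user intervention?

A: The Qconfig provided by Horizon defines how activations and weights are quantized; it currently supports quantization algorithms such as FakeQuantize, LSQ, and PACT. If none of them gives good results, users can also define their own, but this may cause problems such as QAT accuracy issues or a model that the toolchain cannot convert in later stages, so we recommend contacting Horizon before making changes.

3. How do I export the ONNX model at each stage?

A: Refer to the following code: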

import logging

import torch
from horizon_plugin_pytorch.utils import onnx_helper as horizon_onnx_helper

# qat_net, quantized_model, and device are assumed to be defined by the
# preceding training script.

# [Optional] Export the float or QAT net to ONNX
# --------------------------------------------------------------------
logging.info("Export qat model to ONNX...")
data = torch.rand((1, 3, 228, 228), device=device)
horizon_onnx_helper.export_to_onnx(qat_net, data, "resnet_qat.onnx")

# [Optional] Export quantized_model to ONNX
# --------------------------------------------------------------------
horizon_onnx_helper.export_quantized_onnx(
    quantized_model, data, "resnet_quantized.onnx"
)

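4. Why do NaN values appear during QAT training?

A: Many factors can cause this. Check the following:

Check whether the input data contains NaN;
Check whether the float model has converged. An unconverged float model can fluctuate heavily in response to even small quantization errors;
Check whether calibration is enabled. Enabling it is recommended because it gives the model better initial coefficients;
Check whether the training strategy is suitable. An unsuitable strategy can also produce NaN values, for example a learning rate that is too large (mitigate this by lowering the learning rate or applying gradient clipping). By default the QAT training strategy matches the float one; if float training uses a setup such as OneCycle that affects the LR settings, we recommend replacing it with SGD.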
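The input check and the gradient clipping mentioned above can be sketched with stock PyTorch calls. The model, data, and `max_norm=1.0` below are toy placeholders chosen for illustration, not values recommended by Horizon.

```python
import torch
import torch.nn as nn


def has_nan(t: torch.Tensor) -> bool:
    """Return True if the tensor contains at least one NaN."""
    return bool(torch.isnan(t).any())


# Toy stand-ins for the real QAT model and data (hypothetical).
model = nn.Linear(4, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
inputs = torch.randn(8, 4)
targets = torch.randn(8, 2)

# 1) Reject batches that already contain NaN before they reach the model.
assert not has_nan(inputs), "input batch contains NaN"

# 2) One training step with gradient clipping to limit the update size.
optimizer.zero_grad()
loss = nn.functional.mse_loss(model(inputs), targets)
loss.backward()
total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

# 3) Flag the step if the loss itself has become NaN.
loss_is_nan = has_nan(loss.detach())
print(float(total_norm), loss_is_nan)
```

Logging `total_norm` over time is a cheap way to see whether gradients blow up shortly before the first NaN appears.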

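5. Configuring int16 nodes or high-precision output nodes has no effect

A: This is probably caused by a misconfigured module_name. The module_name field accepts only strings; configuring it by index is not supported.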
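One way to look up the string name of a layer is PyTorch's `named_modules()`. The sketch below uses a toy `nn.Sequential` as a stand-in; that these dotted names are what `module_name` expects is an assumption, and how the names are passed into the int16 or high-precision configuration is not shown.

```python
import torch.nn as nn

# Toy model standing in for the real network (hypothetical structure).
model = nn.Sequential(
    nn.Conv2d(3, 8, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Sequential(nn.Conv2d(8, 8, kernel_size=3, padding=1)),
)

# named_modules() yields (name, module) pairs; the name is a dotted string
# path such as "2.0", not an integer index such as model[2][0].
conv_names = [
    name for name, module in model.named_modules()
    if isinstance(module, nn.Conv2d)
]
print(conv_names)
```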

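6. How can I check whether high-precision output is enabled for a layer?

A: Print the layer in question from qat_model and check whether it has (activation_post_process): FakeQuantize. If it does not, the layer uses high-precision output. For example, an int32 high-precision conv prints as follows: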

(1): ConvModule2d(
  (0): Conv2d(
    64, 3, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
    (weight_fake_quant): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),
      quant_min=-128, quant_max=127, dtype=qint8, qscheme=torch.per_channel_symmetric, ch_axis=0,
      scale=tensor([1., 1., 1.]), zero_point=tensor([0, 0, 0])
      (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
  )
)

An int8 low-precision conv prints as follows:

(0): ConvModule2d(
  (0): ConvReLU2d(
    64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1)
    (weight_fake_quant): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),
      quant_min=-128, quant_max=127, dtype=qint8, qscheme=torch.per_channel_symmetric, ch_axis=0,
      scale=tensor([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                    1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                    1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
                    1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]),
      zero_point=tensor([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
                         0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
      (activation_post_process): MovingAveragePerChannelMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
    (activation_post_process): FakeQuantize(
      fake_quant_enabled=tensor([1], dtype=torch.uint8), observer_enabled=tensor([1], dtype=torch.uint8),
      quant_min=-128, quant_max=127, dtype=qint8, qscheme=torch.per_tensor_symmetric, ch_axis=-1,
      scale=tensor([1.]), zero_point=tensor([0])
      (activation_post_process): MovingAverageMinMaxObserver(min_val=tensor([]), max_val=tensor([]))
    )
  )
)
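The same check can be done programmatically instead of reading the printout. This is a minimal sketch under the assumption that a high-precision layer simply has no `activation_post_process` child of its own; the layers below are plain `nn.Conv2d` stand-ins with `nn.Identity` mimicking the attribute, not real Horizon QAT modules.

```python
import torch.nn as nn


def has_output_fake_quant(module: nn.Module) -> bool:
    """True if the layer carries its own output `activation_post_process`."""
    return getattr(module, "activation_post_process", None) is not None


# Stand-ins for QAT layers (hypothetical): a real qat_model would hold
# FakeQuantize instances here; nn.Identity only mimics the attribute.
low_precision_conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)
low_precision_conv.activation_post_process = nn.Identity()
high_precision_conv = nn.Conv2d(64, 3, kernel_size=3, padding=1)

qat_like = nn.ModuleDict({"body": low_precision_conv, "head": high_precision_conv})
report = {
    name: has_output_fake_quant(layer) for name, layer in qat_like.items()
}
print(report)
```

Only the attribute on the layer itself counts; the `activation_post_process` nested inside `weight_fake_quant` is the weight observer and appears in both printouts.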

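7. Can fake-quantization nodes be inserted into auxiliary branches?

A: We recommend inserting fake-quantization nodes only into the part of the model that is deployed on the board. Because QAT trains the model globally, auxiliary branches make training harder, and if the data distribution at an auxiliary branch differs greatly from the other branches, the accuracy risk increases further. We recommend removing them.

8. How do I rewrite the Horizon gridsample operator as the stock torch implementation?

A: The gridsample operator in horizon_plugin_pytorch is not the stock implementation: its grid input (input 2) holds absolute coordinates of type int16, whereas the stock torch version takes normalized float32 coordinates in the range [-1, 1].

Therefore, after importing the grid_sample operator from torch.nn.functional, you can normalize the grid as follows: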

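import torch


def norm_grid(x, grid):
    """Convert a Horizon-style grid to the normalized grid torch expects.

    x is the (N, C, H, W) input feature map; grid has shape (N, h, w, 2).
    """
    n = grid.size(0)
    h = grid.size(1)
    w = grid.size(2)

    # Row index of every output location, expanded to (n, h, w).
    base_coord_y = (
        torch.arange(h, dtype=grid.dtype, device=grid.device)
        .unsqueeze(-1)
        .unsqueeze(0)
        .expand(n, h, w)
    )

    # Column index of every output location, expanded to (n, h, w).
    base_coord_x = (
        torch.arange(w, dtype=grid.dtype, device=grid.device)
        .unsqueeze(0)
        .unsqueeze(0)
        .expand(n, h, w)
    )

    # Add the base coordinates, then map pixel units to [-1, 1].
    absolute_grid_x = grid[:, :, :, 0] + base_coord_x
    absolute_grid_y = grid[:, :, :, 1] + base_coord_y
    norm_grid_x = absolute_grid_x * 2 / (x.size(3) - 1) - 1
    norm_grid_y = absolute_grid_y * 2 / (x.size(2) - 1) - 1
    normalized_grid = torch.stack((norm_grid_x, norm_grid_y), dim=-1)

    return normalized_grid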
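A quick sanity check of the conversion: because the function adds a per-pixel base coordinate to the grid, an all-zero grid should make every output pixel sample its own location. The sketch repeats the function in condensed form so that it runs on its own. Passing `align_corners=True` is my reading of the division by `size - 1`, not something the source states.

```python
import torch
import torch.nn.functional as F


def norm_grid(x, grid):
    """Condensed copy of the conversion above: grid values -> [-1, 1] coords."""
    n, h, w = grid.size(0), grid.size(1), grid.size(2)
    ys = torch.arange(h, dtype=grid.dtype, device=grid.device).view(1, h, 1).expand(n, h, w)
    xs = torch.arange(w, dtype=grid.dtype, device=grid.device).view(1, 1, w).expand(n, h, w)
    norm_x = (grid[..., 0] + xs) * 2 / (x.size(3) - 1) - 1
    norm_y = (grid[..., 1] + ys) * 2 / (x.size(2) - 1) - 1
    return torch.stack((norm_x, norm_y), dim=-1)


x = torch.arange(2 * 3 * 4 * 5, dtype=torch.float32).reshape(2, 3, 4, 5)

# All-zero grid: every output pixel samples its own location.
grid = torch.zeros(2, 4, 5, 2)
out = F.grid_sample(x, norm_grid(x, grid), mode="bilinear", align_corners=True)
print(torch.allclose(out, x, atol=1e-4))
```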

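9. Why are the weights not updated during QAT training after the calibration result is loaded?

A: Check the following in order:

Whether prepare_qat is called before the optimizer is defined. prepare_qat fuses operators, which changes the model structure;
Whether fake_quant_enabled and observer_enabled are 1;
Whether the training attribute of the module is True.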
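The first and third checks can be automated. In the sketch below, replacing a layer by hand stands in for the structure change that prepare_qat makes (an assumption for illustration; no Horizon API is called), and the helper name `optimizer_tracks_model` is hypothetical.

```python
import torch
import torch.nn as nn


def optimizer_tracks_model(model: nn.Module, optimizer: torch.optim.Optimizer) -> bool:
    """True if every trainable model parameter is held by the optimizer."""
    tracked = {id(p) for group in optimizer.param_groups for p in group["params"]}
    return all(id(p) in tracked for p in model.parameters() if p.requires_grad)


model = nn.Sequential(nn.Conv2d(3, 8, 3), nn.ReLU())
stale_optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

# Stand-in for a structure change such as operator fusion (hypothetical):
# the first layer is replaced by a new module with new parameters.
model[0] = nn.Conv2d(3, 8, 3)

# An optimizer built after the change sees the new parameters.
fresh_optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

stale_ok = optimizer_tracks_model(model, stale_optimizer)
fresh_ok = optimizer_tracks_model(model, fresh_optimizer)
print(stale_ok, fresh_ok, model.training)
```

An optimizer that fails this check keeps updating tensors the model no longer uses, which looks exactly like weights that never change.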

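10. How do I handle +, -, *, / operations on float constants that prepare_fx does not support?

A: When the model is converted to a static graph, operations on plain floats cannot be recorded by fx, which triggers errors such as add being unsupported or a float not being converted to a qtensor. To fix this, change the constant into a tensor and quantize that tensor (insert a quantstub); the symbolic operator can then be replaced with a FloatFunctional call. A reference example follows.

Before: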

# init
def __init__(self, epsilon=1e-4):
    self.epsilon = epsilon  # plain Python float
    self.p5_w1 = nn.Parameter(torch.ones(2, dtype=torch.float32), requires_grad=True)
    self.act = nn.ReLU()

# forward
def forward(self, x):
    x = self.quant_stub(x)
    p5_w1 = self.quant_stub(self.p5_w1)
    p5_w1 = self.act(p5_w1)
    p1 = torch.sum(p5_w1.unsqueeze(1), dim=0, keepdim=True)
    cc_p5 = p1 + self.epsilon  # "+" with a float constant: not supported

After:

# init
def __init__(self):
    self.p5_w1 = nn.Parameter(torch.ones(2, dtype=torch.float32), requires_grad=True)
    self.act = nn.ReLU()
    self.epsilon = torch.Tensor([1e-4])  # constant stored as a tensor
    self.add_fu = torch.nn.quantized.FloatFunctional()

# forward
def forward(self, x):
    x = self.quant_stub(x)
    p5_w1 = self.quant_stub(self.p5_w1)
    epsilon = self.quant_stub(self.epsilon)  # quantize the constant as well
    p5_w1 = self.act(p5_w1)
    p1 = torch.sum(p5_w1.unsqueeze(1), dim=0, keepdim=True)
    cc_p5 = self.add_fu.add(p1, epsilon)  # "+" replaced by FloatFunctional.add

2.4. Common Failures

1. Cannot find the extension library(_C.so)

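A: This mainly happens when horizon_plugin_pytorch installs successfully but fails to import. Solutions:

Make sure the plugin version matches the CUDA version;
In python3, locate the installation path of horizon-plugin-pytorch and check whether that directory contains a .so file. Several versions of horizon-plugin-pytorch may be installed at the same time; uninstall the extras and keep only the version you need.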
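Both checks can be scripted with the standard library alone. The helper names below are hypothetical, and the sketch only inspects what is visible on the current `sys.path`; it returns empty lists when the package is not installed.

```python
import importlib.util
from importlib import metadata
from pathlib import Path


def find_extension_libs(package: str) -> list:
    """Return the .so files found under an importable package's directory."""
    spec = importlib.util.find_spec(package)
    if spec is None or not spec.submodule_search_locations:
        return []
    libs = []
    for location in spec.submodule_search_locations:
        libs.extend(sorted(str(p) for p in Path(location).rglob("*.so")))
    return libs


def installed_versions(dist_name: str) -> list:
    """List every installed version of a distribution visible on sys.path."""
    normalized = dist_name.lower().replace("_", "-")
    versions = []
    for dist in metadata.distributions():
        try:
            name = dist.metadata.get("Name") or ""
        except Exception:  # skip distributions with unreadable metadata
            continue
        if name.lower().replace("_", "-") == normalized:
            versions.append(dist.version)
    return versions


print(find_extension_libs("horizon_plugin_pytorch"))
print(installed_versions("horizon-plugin-pytorch"))
```

More than one entry in the second list suggests duplicate installations; an empty first list suggests the compiled extension is missing from the package directory.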

2. RuntimeError: Cannot load custom ops. Please rebuild the horizon_plugin_pytorch.

A: Check that the local CUDA environment is set up correctly, for example that its paths and version are as expected.
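A small report of CUDA-related paths can speed up this check. The sketch reads `CUDA_HOME` and `LD_LIBRARY_PATH`, which are conventional variables on Linux; whether the plugin itself consults them is an assumption, and the function name `cuda_env_report` is hypothetical.

```python
import os
import shutil


def cuda_env_report() -> dict:
    """Collect CUDA-related paths from the current shell environment."""
    ld_paths = os.environ.get("LD_LIBRARY_PATH", "").split(os.pathsep)
    return {
        "CUDA_HOME": os.environ.get("CUDA_HOME"),
        "nvcc": shutil.which("nvcc"),
        "nvidia_smi": shutil.which("nvidia-smi"),
        "cuda_lib_dirs": [p for p in ld_paths if "cuda" in p.lower()],
    }


for key, value in cuda_env_report().items():
    print(f"{key}: {value}")
```

A `None` entry means the variable is unset or the tool is not on `PATH`; compare the reported toolkit location with the CUDA version your torch build expects.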

3. RuntimeError: Only Tensors created explicitly by the user (graph leaves) support the deepcopy protocol at the moment

A: This mainly happens when the prepare_calibration/qat stage cannot run normally, usually because the model contains non-leaf tensors. Set inplace to True for prepare_calibration/qat.
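The underlying PyTorch behavior can be reproduced without the plugin: `copy.deepcopy` works on leaf tensors but raises this RuntimeError on a tensor produced by an operation. That a non-in-place prepare step deep-copies the model is an inference from the recommended fix, not something the source states.

```python
import copy

import torch

leaf = torch.ones(2, requires_grad=True)  # created by the user: a graph leaf
non_leaf = leaf * 2                       # result of an op: not a leaf

copied_leaf = copy.deepcopy(leaf)         # works

try:
    copy.deepcopy(non_leaf)
    deepcopy_failed = False
except RuntimeError as err:
    deepcopy_failed = True
    print(err)

# Detaching yields a leaf tensor again, which can be copied.
copied_detached = copy.deepcopy(non_leaf.detach())
print(non_leaf.is_leaf, copied_detached.is_leaf)
```

If a model stores such computed tensors as attributes, detaching them is an alternative to inplace=True, at the cost of cutting them out of the autograd graph.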

4. TypeError: when calling function <built-in method conv2d of type object at >

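A: This mainly happens when QAT training cannot run after prepare_qat. A likely cause is a custom module defined through inheritance, which is then not converted into a QAT module, as on line 7 of the following Conv1d example: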

class Conv1d(nn.Conv2d):
    def __init__(self, in_channels, ...):
        super().__init__(in_channels, ...)
    def forward(self, x):
        x = torch.unsqueeze(x, -2)
        x = super().forward(x)
        x = torch.squeeze(x, -2)    # the error is introduced here
        return x

We recommend calling conv2d through a submodule instead, for example:

class Conv1d(nn.Module):
    def __init__(self, in_channels, ...):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, ...)
    def forward(self, x):
        x = torch.unsqueeze(x, -2)
        x = self.conv(x)
        x = torch.squeeze(x, -2)
        return x

5. torch.multiprocessing.spawn.ProcessExitedException: process 0 terminated with signal SIGKILL

A: This is probably caused by leftover Python processes from a multi-threaded run that were not completely killed.

6. HBDK model check fail

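A: This is a compilation error, possibly because the hardware does not support an operation or a limit was exceeded. Set the environment variable with export HNN_MODEL_EXPORT=1, recompile the model, and send the generated hbir model file to the Horizon technical team for further analysis.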

7. hbdk: no output tensor found. empty model!

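A: This is a compilation error: the trace produced no output. Check that the result is returned correctly, and whether the logic contains a conditional branch that leads to no output.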

8. hbdk: torch.jit.trace fail. Please make sure the model is traceable.

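A: This is a compilation error with several possible causes. Check the following two points:

Print the model to see whether it contains post-processing (loss, target…); if it does, remove the post-processing part;
Check whether any layer lacks a qconfig. If so, add set_qconfig to the class that implements it, and confirm that the set_qconfig in structure is actually called.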

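9. AttributeError: 'NoneType' object has no attribute 'numel'

A: This error mainly occurs while fake-quantization nodes are being inserted, when the input scale of an operator is None. One possible cause is that an op follows the Dequant inserted after the output conv, giving a structure like conv+dequant+conv; another is that a conv configured for high-precision output is followed by other operators. Check that the dequant operator and the high-precision output configuration are used correctly.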

10. symbolically traced variables cannot be used as inputs to control flow

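A: This error is caused by dynamic control flow such as if statements or loops in fx mode. fx mode currently supports only static control flow, so avoid dynamic statements such as if, for, and assert in forward.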
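The difference between dynamic and static control flow can be seen with stock `torch.fx.symbolic_trace`, used here as a stand-in for the plugin's fx mode (an assumption). A branch on tensor values fails to trace, while a branch on a plain Python attribute is resolved at trace time. Both module classes are hypothetical examples.

```python
import torch
import torch.fx
import torch.nn as nn


class DynamicBranch(nn.Module):
    def forward(self, x):
        # The condition depends on tensor values, unknown while tracing.
        if x.sum() > 0:
            return x * 2
        return x


class StaticBranch(nn.Module):
    def __init__(self, double: bool = True):
        super().__init__()
        self.double = double  # plain Python value fixed at construction

    def forward(self, x):
        # The condition is a constant, so tracing records a single path.
        if self.double:
            return x * 2
        return x


try:
    torch.fx.symbolic_trace(DynamicBranch())
    dynamic_traceable = True
except torch.fx.proxy.TraceError as err:
    dynamic_traceable = False
    print(err)

traced = torch.fx.symbolic_trace(StaticBranch())
print(traced(torch.ones(2)))
```

When the choice really depends on data, an elementwise selection such as `torch.where` keeps both paths in the graph, provided the target toolchain supports that operator.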

11. 'unsupported node', %data5.1 : Tensor = aten::select(%1253, %1753, %1753)

A: This type of error occurs mainly because index operations are not supported. Replace them with slicing, for example input_points[:1] instead of input_points[0].
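Note that the two forms return different shapes: an integer index drops the dimension, while a slice keeps it with size 1. The toy tensor below shows the difference; using `reshape` to restore the indexed shape is only an illustration, and whether the toolchain supports it at that point in the graph is not confirmed here.

```python
import torch

input_points = torch.arange(12, dtype=torch.float32).reshape(3, 4)

indexed = input_points[0]   # integer index: drops the first dimension
sliced = input_points[:1]   # slice: keeps the first dimension with size 1

print(indexed.shape, sliced.shape)

# If later code expects the indexed shape, remove the extra dimension again,
# for example with reshape (shown here on the toy tensor).
restored = sliced.reshape(input_points.shape[1:])
print(torch.equal(restored, indexed))
```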

12. NotImplementedError: function <method 'neg' of 'torch._C._TensorBase' objects> is not implemented for QTensor.

A: This error can occur during the Calibration stage in fx mode because fx mode does not support computations of the form (-x). Change (-x) to (-1)*(x).

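13. NotImplementedError: function <function Tensor.__rsub__ at 0x7f5a7cdiee50> is not implemented for QTensor.

A: This error can occur during the Calibration stage in fx mode. In fx mode, the operator-replacement logic skips automatic replacement when the minuend of a subtraction is a constant, so the subtraction must be rewritten as an addition, for example (1-x) as (x+(-1))*(-1).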
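Both rewrites, (-x) as (-1)*(x) and (1-x) as (x+(-1))*(-1), are algebraically equivalent to the original expressions. The plain-Python check below confirms this on a few sample values; the function names are hypothetical, and quantized tensors may still differ slightly because each rewritten step is quantized separately.

```python
def rewritten_neg(x: float) -> float:
    """(-x) written as a multiplication by -1."""
    return (-1) * x


def rewritten_one_minus(x: float) -> float:
    """(1 - x) written as an addition followed by a multiplication."""
    return (x + (-1)) * (-1)


samples = [-2.5, -1.0, 0.0, 0.25, 1.0, 3.0]
neg_matches = all(rewritten_neg(x) == -x for x in samples)
sub_matches = all(rewritten_one_minus(x) == 1 - x for x in samples)
print(neg_matches, sub_matches)
```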