
YOLOv5 Quantization Workflow and Deployment

Posted by: Horizon Developers (地平線開發者) | Date: 2024-12-25 | Source: Engineer


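01 Technical Background

YOLOv5 is an efficient object detection algorithm that performs particularly well in real-time detection tasks. It uses detection heads at three different scales to handle large, medium, and small objects. Each head covers three tasks: bounding-box regression, class prediction, and confidence prediction. Each head also applies three anchors at every pixel of its feature map, which helps the algorithm predict object boundaries more accurately.

YOLOv5 comes in several model sizes (YOLOv5n, YOLOv5s, YOLOv5m, YOLOv5l, YOLOv5x) to suit different task types and hardware platforms. Using a YOLOv5n model trained on a color sorter dataset as an example, this article describes how to quantize and compile the model with PTQ (post-training quantization) and how to deploy the full pipeline on the board in C++.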

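02 Model Input and Output

The YOLOv5n model used in this example differs from the public version in two ways with respect to its input and output:

1. The input resolution is set to 384x2048, so the output resolutions become 48x256, 24x128, and 12x64.

2. The number of classes is set to 17, so the number of channels in each output tensor becomes (17+4+1)x3=66.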
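The two changes above can be checked with a few lines of arithmetic. The following is a minimal sketch, assuming the strides 8, 16, and 32 that appear in the post-processing code later in this article; the constant and function names are illustrative and not part of the toolchain.

```python
# Illustrative check of the output shapes; all names here are hypothetical.
INPUT_H, INPUT_W = 384, 2048
STRIDES = (8, 16, 32)   # strides of the three detection heads
NUM_CLASSES = 17
NUM_ANCHORS = 3         # anchors per pixel in each head


def head_shape(stride):
    """Return (height, width, channels) of the head with the given stride."""
    channels = (NUM_CLASSES + 4 + 1) * NUM_ANCHORS
    return INPUT_H // stride, INPUT_W // stride, channels


shapes = [head_shape(s) for s in STRIDES]
for stride, shape in zip(STRIDES, shapes):
    print(stride, shape)
```

Each anchor predicts 17 class scores, 4 box values, and 1 confidence value, which is where the factor 22 per anchor comes from.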

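The input and output details of the ONNX model exported from PyTorch are shown in the figure below:

In addition, to reduce overall latency, the sigmoid computation at the tail of the model is moved into post-processing.

03 Toolchain Environment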

horizon-nn 1.1.0 
horizon_tc_ui 1.24.3 
hbdk 3.49.15

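04 PTQ Quantization and Compilation Workflow

4.1 Preparing Calibration Data

First, place 100 images from the color sorter dataset, like the one shown above, in the seed100 folder. The 02_preprocess.sh script from horizon_model_convert_sample can then be used to generate the calibration data.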

02_preprocess.sh

python3 ../../../data_preprocess.py \
  --src_dir ./seed100 \
  --dst_dir ./calibration_data_rgb_f32 \
  --pic_ext .rgb \
  --read_mode opencv \
  --saved_data_type float32

preprocess.py

def calibration_transformers():
    transformers = [
        PadResizeTransformer(target_size=(384, 2048)),
        HWC2CHWTransformer(),
        BGR2RGBTransformer(data_format="CHW"),
    ]
    return transformers

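The calibration data only needs to be resized to the model's input size and converted to CHW layout and RGB channel order. In other words, every operation except normalization must match the data preprocessing used when training the floating-point model. Normalization can instead be placed in the model's preprocessing node, where it is accelerated.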
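To make the layout and channel-order change concrete, here is a minimal pure-Python sketch on a tiny image. It is not the implementation of HWC2CHWTransformer or BGR2RGBTransformer; the function name and the nested-list representation are assumptions made for illustration only.

```python
def bgr_hwc_to_rgb_chw(image):
    """Convert an H x W x 3 BGR nested list to a 3 x H x W RGB nested list."""
    height, width = len(image), len(image[0])
    # The input channel order is B, G, R, so RGB plane c reads index 2 - c.
    return [
        [[image[y][x][2 - c] for x in range(width)] for y in range(height)]
        for c in range(3)
    ]


# A tiny 1 x 2 image: one pure-blue pixel and one pure-red pixel (BGR order).
tiny_bgr = [[[255, 0, 0], [0, 0, 255]]]
tiny_rgb_chw = bgr_hwc_to_rgb_chw(tiny_bgr)
print(tiny_rgb_chw)
```

The result has one plane per channel (R, G, B), each of size H x W, which is the NCHW-style layout the floating-point model was trained with, minus the batch dimension.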

4.2 Configuring the YAML File

model_parameters:
  onnx_model: 'yolov5n.onnx'
  march: 'bayes-e'
  working_dir: 'model_output'
  output_model_file_prefix: 'yolov5n'
input_parameters:
  input_type_rt: 'nv12'
  input_type_train: 'rgb'
  input_layout_train: 'NCHW'
  norm_type: 'data_scale'
  scale_value: 0.003921568627451
calibration_parameters:
  cal_data_dir: './calibration_data_rgb_f32'
  cal_data_type: 'float32'
  calibration_type: 'default'
compiler_parameters:
  optimize_level: 'O3'

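input_type_rt is the data type the model receives at deployment time. Since the video pipeline usually delivers nv12, this item is set to nv12.

input_type_train is the data type used when training the floating-point model; rgb is used here.

input_layout_train is the data layout used when training the floating-point model; NCHW is used here.

norm_type and scale_value are set according to the normalization parameters used when training the floating-point model; here the scale is configured as 1/255.

With this configuration, the on-board model carries its own preprocessing node, which converts nv12 data to rgb and applies normalization. This node can be converted into an equivalent convolution, so it is accelerated on the BPU and preprocessing time drops significantly.

We strongly recommend this configuration when compiling models for image tasks. The on-board model can then take nv12 input directly, and we also provide C++ code for reading a bgr image on the board and converting it to nv12 for your reference.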
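As a quick sanity check on the normalization setting, the sketch below confirms that the scale_value in the YAML file is 1/255 to within floating-point precision. It assumes that data_scale means multiplying each pixel value by the scale; the helper name is hypothetical.

```python
import math

SCALE_VALUE = 0.003921568627451  # scale_value from the YAML file above


def data_scale(pixel, scale=SCALE_VALUE):
    """Normalize one 8-bit pixel value by multiplying it by the scale."""
    return pixel * scale


print(math.isclose(SCALE_VALUE, 1 / 255, rel_tol=1e-12))
print(data_scale(255))
```

With this scale, 8-bit pixel values in 0..255 map to roughly 0..1, which should match the normalization used during training.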

4.3 Compiling the On-Board Model

hb_mapper makertbin --config ./yolov5n_config.yaml --model-type onnx

Running the command above compiles the bin model used for on-board deployment.

=============================================================================
Output      Cosine Similarity  L1 Distance  L2 Distance  Chebyshev Distance  
-----------------------------------------------------------------------------
output      0.996914           0.234755     0.000420     5.957216            
613         0.997750           0.232995     0.000744     8.833645            
615         0.995946           0.281512     0.001877     4.717240

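The compilation log shows that, for all three output heads of the yolov5n model, the cosine similarity before and after quantization is greater than 0.99, which meets the accuracy requirement.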
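For readers unfamiliar with the metric, the following is a minimal sketch of how cosine similarity between a floating-point output and a quantized output can be computed. The sample values are made up for illustration and are not taken from the model; the function name is hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up example values standing in for flattened output tensors.
float_out = [0.18, 0.49, 0.34, 0.27, -10.98]
quant_out = [0.20, 0.47, 0.35, 0.25, -11.00]
similarity = cosine_similarity(float_out, quant_out)
print(similarity > 0.99)
```

A value close to 1 means the two outputs point in almost the same direction, that is, quantization changed the output very little.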

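4.4 Consistency Check Between the ONNX and bin Models (Optional)

The PTQ workflow generates yolov5n_quantized_model.onnx and yolov5n.bin. The former is the quantized ONNX model and the latter is the on-board model. These two models normally have exactly the same accuracy, which can be verified as follows.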

yolov5n_quantized_model.onnx

import cv2 
import numpy as np 
from PIL import Image 
from horizon_tc_ui import HB_ONNXRuntime                               

def bgr2nv12(image): 
    image = image.astype(np.uint8) 
    height, width = image.shape[0], image.shape[1] 
    yuv420p = cv2.cvtColor(image, cv2.COLOR_BGR2YUV_I420).reshape((height * width * 3 // 2, )) 
    y = yuv420p[:height * width] 
    uv_planar = yuv420p[height * width:].reshape((2, height * width // 4)) 
    uv_packed = uv_planar.transpose((1, 0)).reshape((height * width // 2, )) 
    nv12 = np.zeros_like(yuv420p) 
    nv12[:height * width] = y 
    nv12[height * width:] = uv_packed 
    return nv12 
 
def nv12Toyuv444(nv12, target_size): 
    height = target_size[0] 
    width = target_size[1] 
    nv12_data = nv12.flatten() 
    yuv444 = np.empty([height, width, 3], dtype=np.uint8) 
    yuv444[:, :, 0] = nv12_data[:width * height].reshape(height, width) 
    u = nv12_data[width * height::2].reshape(height // 2, width // 2) 
    yuv444[:, :, 1] = Image.fromarray(u).resize((width, height),resample=0) 
    v = nv12_data[width * height + 1::2].reshape(height // 2, width // 2) 
    yuv444[:, :, 2] = Image.fromarray(v).resize((width, height),resample=0) 
    return yuv444 

def preprocess(input_name):
    bgr_input = cv2.imread("seed.jpg")
    nv12_input = bgr2nv12(bgr_input)
    nv12_input.tofile("seed_nv12.bin") 
    yuv444 = nv12Toyuv444(nv12_input, (384,2048))
    yuv444 = yuv444[np.newaxis,:,:,:]
    yuv444_128 = (yuv444-128).astype(np.int8)
    return yuv444_128

def main(): 
    sess = HB_ONNXRuntime(model_file="./yolov5n_quantized_model.onnx")
    input_names = [input.name for input in sess.get_inputs()]
    output_names = [output.name for output in sess.get_outputs()]
    feed_dict = dict()
    for input_name in input_names:
        feed_dict[input_name] = preprocess(input_name)
    output = sess.run(output_names, feed_dict)     
    print(output[0][0][0][0])
        
if __name__ == '__main__':
    main()

After the original image is read, it is converted to nv12 format and saved, then converted to yuv444_128 format and fed to the model for inference.
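The nv12 repacking done by bgr2nv12 above can be illustrated on a tiny buffer. This is a minimal pure-Python sketch, assuming a flat I420 buffer (Y plane, then U plane, then V plane) as input; the function name and sample values are illustrative.

```python
def i420_to_nv12(i420, height, width):
    """Repack a flat I420 buffer (Y, then U plane, then V plane) as NV12."""
    y_size = height * width
    uv_size = y_size // 4
    y = i420[:y_size]
    u = i420[y_size:y_size + uv_size]
    v = i420[y_size + uv_size:y_size + 2 * uv_size]
    # NV12 keeps the Y plane and interleaves the chroma samples as U, V, U, V...
    uv = [sample for pair in zip(u, v) for sample in pair]
    return y + uv


# 2 x 4 image: 8 Y samples, then 2 U samples, then 2 V samples.
i420 = [10, 11, 12, 13, 14, 15, 16, 17, 100, 101, 200, 201]
nv12 = i420_to_nv12(i420, 2, 4)
print(nv12)
```

The total size stays at height * width * 3 / 2 bytes; only the chroma layout changes from two separate planes to one interleaved plane.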

The output of print(output[0][0][0][0]) is as follows:

[  0.18080421   0.4917729    0.34173843   0.26877916 -10.983349
  -3.8538744   -1.8031031   -2.2803051   -1.5579813   -1.8910917
  -3.7208636   -2.4970834   -2.8638227   -3.5894732   -3.338331
......

yolov5n.bin

hrt_model_exec infer --model-file yolov5n.bin --input-file seed_nv12.bin --enable_dump true --dump_format txt

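Here the nv12 data saved in the previous step is used as the input to the bin model, and the outputs are dumped. The data of the first output branch is as follows: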

0.180804208 
0.491772890 
0.341738433 
0.268779159 
-10.983348846 
-3.853874445 
-1.803103089 
-2.280305147 
-1.557981253 
-1.891091704 
-3.720863581 
-2.497083426 
-2.863822699 
-3.589473248 
-3.338330984 
......

As shown, yolov5n_quantized_model.onnx and yolov5n.bin produce the same outputs.
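Rather than comparing the two dumps by eye, the values can be compared with a tolerance. The sketch below uses the first five values printed above for each model; the helper name and the tolerance are assumptions chosen for illustration.

```python
import math

# First five values of the first output branch, copied from the two dumps above.
onnx_values = [0.18080421, 0.4917729, 0.34173843, 0.26877916, -10.983349]
bin_values = [0.180804208, 0.491772890, 0.341738433, 0.268779159, -10.983348846]


def all_close(a, b, rel_tol=1e-6):
    """True if both lists have the same length and match element-wise."""
    return len(a) == len(b) and all(
        math.isclose(x, y, rel_tol=rel_tol) for x, y in zip(a, b)
    )


print(all_close(onnx_values, bin_values))
```

The remaining differences are in the last printed digits only, which is consistent with the two tools printing the same float values at different precisions.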

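05 Runtime Deployment Workflow

In the algorithm toolchain delivery package, the ai benchmark sample contains C++ source code for the complete pipeline: image reading, preprocessing, inference, and post-processing. However, the ai benchmark code is tightly coupled, has a fairly steep learning curve, and is not easy to embed in a user's own project. We therefore provide a simplified C++ version modified from the horizon_runtime_sample example. It consists of only one header file and one C++ source file, and users only need to replace the original 00_quick_start sample to build and run it.

This C++ demo covers the following steps for a single frame: reading the image (bgr->nv12), model inference (including preprocessing), post-processing, and printing the results.

5.1 Header File

Most of this header comes from code/include/base/perception_common.h in ai benchmark. It defines argmax and the timing types, as well as the structs used for object detection.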

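#include <algorithm>
#include <chrono>
#include <cstdint>
#include <iomanip>
#include <iostream>
#include <iterator>
#include <vector>

typedef std::chrono::steady_clock::time_point Time;
// Assumed representation: a 64-bit count of microseconds, inferred from the name.
typedef std::chrono::duration<uint64_t, std::micro> Micro;

// Returns the index of the largest element in [first, last).
template <typename ForwardIterator>
inline size_t argmax(ForwardIterator first, ForwardIterator last) {
  return std::distance(first, std::max_element(first, last));
}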

typedef struct Bbox {
  float xmin{0.0};
  float ymin{0.0};
  float xmax{0.0};
  float ymax{0.0};
  Bbox() {}
  Bbox(float xmin, float ymin, float xmax, float ymax)
      : xmin(xmin), ymin(ymin), xmax(xmax), ymax(ymax) {}
  friend std::ostream &operator<<(std::ostream &os, const Bbox &bbox) {
    const auto precision = os.precision();
    const auto flags = os.flags();
    os << "[" << std::fixed << std::setprecision(6) << bbox.xmin << ","
       << bbox.ymin << "," << bbox.xmax << "," << bbox.ymax << "]";
    os.flags(flags);
    os.precision(precision);
    return os;
  }
  ~Bbox() {}
} Bbox;

typedef struct Detection {
  int id{0};
  float score{0.0};
  Bbox bbox;
  const char *class_name{nullptr};
  Detection() {}
  Detection(int id, float score, Bbox bbox)
      : id(id), score(score), bbox(bbox) {}
  Detection(int id, float score, Bbox bbox, const char *class_name)
      : id(id), score(score), bbox(bbox), class_name(class_name) {}
  friend bool operator>(const Detection &lhs, const Detection &rhs) {
    return (lhs.score > rhs.score);
  }
  friend std::ostream &operator<<(std::ostream &os, const Detection &det) {
    const auto precision = os.precision();
    const auto flags = os.flags();
    os << "{"
       << R"("bbox")"
       << ":" << det.bbox << ","
       << R"("prob")"
       << ":" << std::fixed << std::setprecision(6) << det.score << ","
       << R"("label")"
       << ":" << det.id << ","
       << R"("class_name")"
       << ":\"" << det.class_name << "\"}";
    os.flags(flags);
    os.precision(precision);
    return os;
  }
  ~Detection() {}
} Detection;

struct Perception {
  std::vector<Detection> det;
  enum {
    DET = (1 << 0),
  } type;
  friend std::ostream &operator<<(std::ostream &os, Perception &perception) {
    os << "[";
    if (perception.type == Perception::DET) {
      auto &detection = perception.det;
      for (int i = 0; i < detection.size(); i++) {
        if (i != 0) {
          os << ",";
        }
        os << detection[i];
      }
    } 
    os << "]";
    return os;
  }
};

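5.2 Source Code

For readability, this source code defines several parameters as global variables; in a real application, please avoid using too many globals. Comments have been added at the appropriate places in the code.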

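#include <algorithm>
#include <cmath>
#include <cstring>
#include <functional>
#include <iostream>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>
#include "dnn/hb_dnn.h"
#include "opencv2/core/mat.hpp"
#include "opencv2/imgcodecs.hpp"
#include "opencv2/imgproc.hpp"
#include "head.h"

// Path to the on-board model
auto modelFileName = "yolov5n.bin";
// Path to the single test image
std::string imagePath = "seed.jpg";
// Width of the test image
int image_width = 2048;
// Height of the test image
int image_height = 384;
// Confidence threshold
float score_threshold = 0.2;
// Number of classes
int num_classes = 17;
// Number of values predicted per anchor (classes + 4 box values + 1 confidence)
int num_pred = num_classes + 4 + 1;
// Top-k limit for NMS
int nms_top_k = 5000;
// IoU threshold for NMS
float nms_iou_threshold = 0.5;

// Allocate input and output memory for model inference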
void prepare_tensor(int input_count, 
                   int output_count, 
                   hbDNNTensor *input_tensor,
                   hbDNNTensor *output_tensor,
                   hbDNNHandle_t dnn_handle) {
  hbDNNTensor *input = input_tensor;
  for (int i = 0; i < input_count; i++) {
    hbDNNGetInputTensorProperties(&input[i].properties, dnn_handle, i);
    int input_memSize = input[i].properties.alignedByteSize;
    hbSysAllocCachedMem(&input[i].sysMem[0], input_memSize);
    input[i].properties.alignedShape = input[i].properties.validShape;
  }
  hbDNNTensor *output = output_tensor;
  for (int i = 0; i < output_count; i++) {
    hbDNNGetOutputTensorProperties(&output[i].properties, dnn_handle, i);
    int output_memSize = output[i].properties.alignedByteSize;
    hbSysAllocCachedMem(&output[i].sysMem[0], output_memSize);
  }
}

// Read a bgr image, convert it to nv12, and store it in the input memory
void read_image_2_tensor_as_nv12(std::string imagePath,
                                 hbDNNTensor *input_tensor) {
  hbDNNTensor *input = input_tensor;
  hbDNNTensorProperties Properties = input->properties;
  int input_h = Properties.validShape.dimensionSize[2];
  int input_w = Properties.validShape.dimensionSize[3];
  cv::Mat bgr_mat = cv::imread(imagePath, cv::IMREAD_COLOR);
  cv::Mat yuv_mat;
  cv::cvtColor(bgr_mat, yuv_mat, cv::COLOR_BGR2YUV_I420);
  uint8_t *nv12_data = yuv_mat.ptr<uint8_t>();
  auto input_data = input->sysMem[0].virAddr;
  int32_t y_size = input_h * input_w;
  memcpy(reinterpret_cast<uint8_t *>(input_data), nv12_data, y_size);
  int32_t uv_height = input_h / 2;
  int32_t uv_width = input_w / 2;
  uint8_t *nv12 = reinterpret_cast<uint8_t *>(input_data) + y_size;
  uint8_t *u_data = nv12_data + y_size;
  uint8_t *v_data = u_data + uv_height * uv_width;
  for (int32_t i = 0; i < uv_width * uv_height; i++) {
    if (u_data && v_data) {
      *nv12++ = *u_data++;
      *nv12++ = *v_data++;
    }
  }
}

// Core post-processing (without NMS): first-pass filtering of candidate boxes
void process_tensor_core(hbDNNTensor *tensor,
                       int layer,
                       std::vector<Detection> &dets){
  hbSysFlushMem(&(tensor->sysMem[0]), HB_SYS_MEM_CACHE_INVALIDATE);
  int height, width, stride;
  std::vector<std::pair<double, double>> anchors;
  if(layer == 0){
    height = 48; width = 256; stride = 8; anchors = {{10, 13}, {16, 30}, {33, 23}};
  } else if (layer == 1){
    height = 24; width = 128; stride = 16; anchors = {{30, 61}, {62, 45}, {59, 119}};
  } else if (layer == 2){
    height = 12; width = 64; stride = 32; anchors = {{116, 90}, {156, 198}, {373, 326}};
  }
  int anchor_num = anchors.size();
  auto *data = reinterpret_cast<float *>(tensor->sysMem[0].virAddr);
  for (uint32_t h = 0; h < height; h++) {
    for (uint32_t w = 0; w < width; w++) {
      for (int k = 0; k < anchor_num; k++) {        
        double anchor_x = anchors[k].first;
        double anchor_y = anchors[k].second;
        float *cur_data = data + k * num_pred;
        float objness = cur_data[4];
        if (objness < score_threshold)
            continue;
        int id = argmax(cur_data + 5, cur_data + 5 + num_classes);
        // The detection head contains no sigmoid operator; sigmoid is applied here in post-processing
        double x1 = 1 / (1 + std::exp(-objness)) * 1; 
        double x2 = 1 / (1 + std::exp(-cur_data[id + 5]));
        double confidence = x1 * x2;
        if (confidence < score_threshold)
          continue;
        float center_x = cur_data[0];
        float center_y = cur_data[1];
        float scale_x = cur_data[2];
        float scale_y = cur_data[3];
        double box_center_x =
            ((1.0 / (1.0 + std::exp(-center_x))) * 2 - 0.5 + w) * stride;
        double box_center_y =
            ((1.0 / (1.0 + std::exp(-center_y))) * 2 - 0.5 + h) * stride;             
        double box_scale_x =
            std::pow((1.0 / (1.0 + std::exp(-scale_x))) * 2, 2) * anchor_x;
        double box_scale_y =
            std::pow((1.0 / (1.0 + std::exp(-scale_y))) * 2, 2) * anchor_y;
        double xmin = (box_center_x - box_scale_x / 2.0);
        double ymin = (box_center_y - box_scale_y / 2.0);
        double xmax = (box_center_x + box_scale_x / 2.0);
        double ymax = (box_center_y + box_scale_y / 2.0);          
        double xmin_org = xmin; 
        double xmax_org = xmax; 
        double ymin_org = ymin;
        double ymax_org = ymax;
        if (xmax_org <= 0 || ymax_org <= 0)
          continue;
        if (xmin_org > xmax_org || ymin_org > ymax_org)
          continue;
        xmin_org = std::max(xmin_org, 0.0);
        xmax_org = std::min(xmax_org, image_width - 1.0);
        ymin_org = std::max(ymin_org, 0.0);
        ymax_org = std::min(ymax_org, image_height - 1.0);
        Bbox bbox(xmin_org, ymin_org, xmax_org, ymax_org);
        dets.emplace_back((int)id, confidence, bbox);
      }
      data = data + num_pred * anchors.size();
    }
  }
}

// NMS: select the final detection boxes
void yolo5_nms(std::vector<Detection> &input,
               std::vector<Detection> &result,
               bool suppress) {
  std::stable_sort(input.begin(), input.end(), std::greater<Detection>());
  std::vector<bool> skip(input.size(), false);
  std::vector<float> areas;
  areas.reserve(input.size());
  for (size_t i = 0; i < input.size(); i++) {
    float width = input[i].bbox.xmax - input[i].bbox.xmin;
    float height = input[i].bbox.ymax - input[i].bbox.ymin;
    areas.push_back(width * height);
  }
  int count = 0;
  for (size_t i = 0; count < nms_top_k && i < skip.size(); i++) {
    if (skip[i]) {
      continue;
    }
    skip[i] = true;
    ++count;
    for (size_t j = i + 1; j < skip.size(); ++j) {
      if (skip[j]) {
        continue;
      }
      if (suppress == false) {
        if (input[i].id != input[j].id) {
          continue;
        }
      }
      float xx1 = std::max(input[i].bbox.xmin, input[j].bbox.xmin);
      float yy1 = std::max(input[i].bbox.ymin, input[j].bbox.ymin);
      float xx2 = std::min(input[i].bbox.xmax, input[j].bbox.xmax);
      float yy2 = std::min(input[i].bbox.ymax, input[j].bbox.ymax);
      if (xx2 > xx1 && yy2 > yy1) {
        float area_intersection = (xx2 - xx1) * (yy2 - yy1);
        float iou_ratio =
            area_intersection / (areas[j] + areas[i] - area_intersection);
        if (iou_ratio > nms_iou_threshold) {
          skip[j] = true;
        }
      }
    }
    result.push_back(input[i]); 
    // Print the confidence and position of each final detection box
    std::cout << "score " << input[i].score;
    std::cout << " xmin " << input[i].bbox.xmin;
    std::cout << " ymin " << input[i].bbox.ymin;
    std::cout << " xmax " << input[i].bbox.xmax;
    std::cout << " ymax " << input[i].bbox.ymax << std::endl; 
  }
}

// Speed up post-processing with multiple threads
std::mutex dets_mutex;
void process_tensor_thread(hbDNNTensor *tensor, int layer, std::vector<Detection> &dets){
  std::vector<Detection> local_dets;
  process_tensor_core(tensor, layer, local_dets);
  std::lock_guard<std::mutex> lock(dets_mutex);
  dets.insert(dets.end(), local_dets.begin(), local_dets.end());
}

void post_process(std::vector<hbDNNTensor> &tensors,
                  Perception *perception){
  perception->type = Perception::DET;
  std::vector<Detection> dets;
  std::vector<std::thread> threads;
  for (int i = 0; i < tensors.size(); ++i) {
    threads.emplace_back([&tensors, i, &dets](){
      process_tensor_thread(&tensors[i], i, dets);
    });
  }
  for (auto &thread : threads) 
    thread.join();
  yolo5_nms(dets, perception->det, false);
}


int main(int argc, char **argv) {
  // Initialize the model
  hbPackedDNNHandle_t packed_dnn_handle;
  hbDNNHandle_t dnn_handle;
  const char **model_name_list;
  int model_count = 0;
  hbDNNInitializeFromFiles(&packed_dnn_handle, &modelFileName, 1);
  hbDNNGetModelNameList(&model_name_list, &model_count, packed_dnn_handle);
  hbDNNGetModelHandle(&dnn_handle, packed_dnn_handle, model_name_list[0]);
  std::cout<< "yolov5 demo begin!" << std::endl;
  std::cout<< "load model success" << std::endl;
  // ... (the remainder of main() is not shown)
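The box decoding in process_tensor_core and the overlap test in yolo5_nms can be restated compactly. The following Python sketch mirrors the arithmetic of the C++ code above for a single prediction; it is an illustration only, and the function names are hypothetical rather than part of the demo.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(raw, w, h, stride, anchor):
    """Decode raw (tx, ty, tw, th) at grid cell (w, h) into (xmin, ymin, xmax, ymax)."""
    tx, ty, tw, th = raw
    center_x = (sigmoid(tx) * 2 - 0.5 + w) * stride
    center_y = (sigmoid(ty) * 2 - 0.5 + h) * stride
    box_w = (sigmoid(tw) * 2) ** 2 * anchor[0]
    box_h = (sigmoid(th) * 2) ** 2 * anchor[1]
    return (center_x - box_w / 2, center_y - box_h / 2,
            center_x + box_w / 2, center_y + box_h / 2)


def iou(a, b):
    """Intersection over union of two (xmin, ymin, xmax, ymax) boxes."""
    xx1, yy1 = max(a[0], b[0]), max(a[1], b[1])
    xx2, yy2 = min(a[2], b[2]), min(a[3], b[3])
    if xx2 <= xx1 or yy2 <= yy1:
        return 0.0
    inter = (xx2 - xx1) * (yy2 - yy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


# All-zero raw values give sigmoid(0) = 0.5, so the box is centered on the
# grid cell and has exactly the anchor size.
box = decode_box((0.0, 0.0, 0.0, 0.0), w=3, h=2, stride=8, anchor=(10, 13))
print(box)
print(iou(box, box))
```

In the C++ code, a box j is suppressed when its IoU with a higher-scoring box i of the same class exceeds nms_iou_threshold.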

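5.3 Running the Demo

Place the header file and source file in horizon_runtime_sample/code/00_quick_start/src and run build_x5.sh to build the project. Then copy the horizon_runtime_sample/x5 folder to /userdata on the development board, put the on-board model, the test image, and any other required files in /userdata/x5/script/00_quick_start/, and write the on-board run script: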

bin=../aarch64/bin/run_mobileNetV1_224x224
lib=../aarch64/lib

export LD_LIBRARY_PATH=${lib}:${LD_LIBRARY_PATH}
export BMEM_CACHEABLE=true

${bin}

The output is as follows:

yolov5 demo begin!
load model success
prepare intput and output tensor success
read image to tensor as nv12 success
model infer time: 7.763 ms
model infer success
score 0.365574 xmin 1448.69 ymin 148.4 xmax 1518.55 ymax 278.487
postprocess time: 1.376 ms
postprocess success
release resources success
yolov5 demo end!

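The input image for this inference is shown in the figure below:

As shown, the inference program successfully detected 1 melon seed and reported the correct coordinates.

5.4 Notes on Model Inference Time

Note that when the application infers the first frame, loading the inference framework adds extra latency, so the model inference time measured by this program is higher than the true value.

The accurate model inference time should be taken from measurements with the hrt_model_exec tool. Reference commands:

hrt_model_exec perf --model-file ./yolov5n.bin --thread-num 1  # single-thread, single-frame latency; check latency
hrt_model_exec perf --model-file ./yolov5n.bin --thread-num 8  # multi-thread peak throughput; check FPS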
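If you still want a rough number from your own application, a common approach is to discard the first few calls before timing. The following is a minimal sketch of that idea; the function names are hypothetical and fake_infer merely stands in for one real inference call.

```python
import time


def mean_latency_ms(run_once, warmup=1, iterations=10):
    """Average latency of run_once in ms, ignoring the first `warmup` calls."""
    for _ in range(warmup):
        run_once()  # absorbs one-time costs such as framework loading
    start = time.perf_counter()
    for _ in range(iterations):
        run_once()
    return (time.perf_counter() - start) * 1000.0 / iterations


calls = []


def fake_infer():
    # Hypothetical stand-in for one inference call.
    calls.append(len(calls))


latency = mean_latency_ms(fake_infer, warmup=2, iterations=5)
print(len(calls))
```

This only removes the first-frame overhead from the average; hrt_model_exec remains the reference for reported latency and FPS.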


Keywords: algorithm, autonomous driving, algorithm toolchain, Horizon (地平線), Journey 5 (征程5)
