
1.Label tool

  1. Yolo_Label
  • Works in a Windows environment (other environments not tested yet)

  • Select the target object's class (e.g. cat or dog) and draw a bounding box around it

2.Perf

What is Perf?

Perf (perf_events) is the Linux kernel's built-in profiling tool; it samples hardware and software events to show where a program spends its time.

How to use it

perf record ./a.out && perf annotate

  • perf record: run the target executable given after the command and record a profile (saved to perf.data by default)
  • perf annotate: view the result

Results

  • The left column is the percentage of total execution time spent on each assembly instruction; the right side is the corresponding assembly code
  • You can see that cf: addl $0x1,-0x20020(%rbp) accounts for 27.70% of total execution time, making it the most expensive instruction; program optimization should start here.

(The example above comes from the sample program in the basic introduction; follow the link for the detailed discussion.)

Installing Perf

  • perf script

  • Use this shell script to install perf and related packages; run perf list to verify the installation succeeded

  • Problem encountered: the installation left the environment in a strange state, as described in the three points below

    1. Previously, running trtexec directly was enough to build a TensorRT engine for acceleration; now the full path /usr/src/tensorrt/bin/trtexec is needed for the executable to be found
    2. The nvprof error message changed: originally it looked like the .trt file could not be found (later I realized my path was simply wrong); after running the script above, the error became the "nvprof cannot find the cupti library" message shown below
    3. The default jetson_release command used to show system software version information, but now it returns nothing

3.nvprof

  • Hit a problem where the cupti library cannot be found

The nvprof demo from the forum runs fine, so I suspect the original issue was a mistyped command that could not find the executable; nvprof itself is not the problem.

  • nvprof demo on jetson nano
    ======== Warning: The path to cupti library might not be set in LD_LIBRARY_PATH. By default, it is installed in /usr/local/<cuda-toolkit>/extras/CUPTI/lib64 or /usr/local/<cuda-toolkit>/targets/<arch>/lib.
    ======== Warning: No CUDA application was profiled, exiting
    ======== Error: Application received signal 132
    This link says the sudo command can sometimes affect environment variables; below is the error message after removing sudo from the command
    ==9520== NVPROF is profiling process 9520, command: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
    ==9520== Profiling application: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Profiling result:
    No kernels were profiled.

Running nvprof

The output is as follows

  • Command
    ! cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul ;ls && make
    ! sudo /usr/local/cuda-10.2/bin/nvprof /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ! cd ~/Desktop/demo;
  • Output
    Makefile matrixMul matrixMul.cu matrixMul.o NsightEclipse.xml readme.txt
    make: Nothing to be done for 'all'.
    [Matrix Multiply Using CUDA] - Starting...
    ==16367== NVPROF is profiling process 16367, command: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
    GPU Device 0: "Maxwell" with compute capability 5.3

    MatrixA(320,320), MatrixB(640,320)
    Computing result using CUDA Kernel...
    done
    Performance= 24.44 GFlop/s, Time= 5.363 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
    Checking computed result for correctness: Result = PASS

    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    ==16367== Profiling application: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Profiling result:
    Type Time(%) Time Calls Avg Min Max Name
    GPU activities: 99.98% 1.52216s 301 5.0570ms 4.4092ms 15.839ms void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
    0.02% 233.19us 2 116.59us 79.743us 153.44us [CUDA memcpy HtoD]
    0.01% 86.462us 1 86.462us 86.462us 86.462us [CUDA memcpy DtoH]
    API calls: 73.21% 1.59586s 1 1.59586s 1.59586s 1.59586s cudaEventSynchronize
    25.02% 545.39ms 3 181.80ms 742.86us 543.81ms cudaMalloc
    0.97% 21.223ms 2 10.612ms 21.197us 21.202ms cudaStreamSynchronize
    0.64% 13.917ms 301 46.235us 34.114us 439.42us cudaLaunchKernel
    0.11% 2.4702ms 3 823.41us 282.50us 1.7598ms cudaMemcpyAsync
    ...
    0.00% 4.1140us 2 2.0570us 1.3540us 2.7600us cuDeviceGet
    0.00% 2.7090us 1 2.7090us 2.7090us 2.7090us cuDeviceGetName
    0.00% 1.5100us 1 1.5100us 1.5100us 1.5100us cudaGetDeviceCount
    0.00% 938ns 1 938ns 938ns 938ns cuDeviceGetUuid

Related reading (2022/04/23)

Paper

  1. Increasing FPS for single board computers and embedded computers in 2021 (Jetson nano and YOVOv4-tiny)
  2. Object Detection in Thermal Spectrum for Advanced Driver-Assistance Systems (ADAS)
  3. TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices
  4. Experimental implementation of a neural network optical channel equalizer in restricted hardware using pruning and quantization
  5. Dynamic Transformer for Efficient Machine Translation on Embedded Devices
  6. Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms

CSDN

  1. searching result : jetson nano performance
  2. YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection (translation)
  3. YOLO Nano: a highly compact You Only Look Once convolutional neural network for object detection
  4. Paper translation: YOLO Nano
  5. NVIDIA Deep Learning study notes
  6. TensorRT usage tips
  7. TSM: Temporal Shift Module for Efficient Video Understanding
  8. Clearing the fog around compute power: the upper limits of different GPUs' compute capability

    NVIDIA developer blog

  9. Boosting Application Performance with GPU Memory Prefetching
  10. TensorFlow Performance Logging Plugin nvtx-plugins-tf Goes Public
  11. Nsight Systems Exposes New GPU Optimization Opportunities
  12. CUDA 8 Features Revealed
  13. Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof

    Label tool

  14. Yolo_Label

tensorRT + yolov4 (2022/03/07)

  • Goals:
  • Understand how to use GPU acceleration (TensorRT)
  • Learn how to train a model

repo
video tutorial

Chinese tutorial

Hands-on results

  1. Need to buy a webcam
  2. The SD card is too small
  3. Walked through the official demo workflow; still need to get more familiar with it on my own

Start training my own model

Cave Education's tutorial series on the Jetson Nano

Meeting notes

  1. trace code
  2. Buy two webcams
  3. Prepare a HackMD report for the next meeting

trace code(2022/03/21)

  • Goals:
  • Buy 2 webcams
  • trace code

Run through the .weights to .trt workflow once with a pre-trained model

1.Training

YOLOv4 hands-on tutorial (DarkNet)

Learn how to train with DarkNet;
this time, however, the hands-on run uses a pre-trained model

2.weight -> ONNX -> tensorRT

i. Weights to ONNX

Implemented the workflow on Google Colab

Stuck at converting ONNX to TRT:
I ran into problems installing TensorRT on Google Colab, but based on the demo I ran on the Jetson Nano last week, TensorRT comes pre-installed there, so this problem should not occur on the Jetson Nano.

Method 1: convert the weights file to an ONNX file
  • Route: convert .weights to .pth, then use PyTorch's functions to convert to ONNX


from tool import darknet2pytorch
import torch

# load weights from darknet format
model = darknet2pytorch.Darknet('path/to/cfg/yolov4-416.cfg', inference=True)
model.load_weights('path/to/weights/yolov4-416.weights')

# save weights to pytorch format
torch.save(model.state_dict(), 'path/to/save/yolov4-pytorch.pth')

# reload weights from pytorch format
model_pt = darknet2pytorch.Darknet('path/to/cfg/yolov4-416.cfg', inference=True)
model_pt.load_state_dict(torch.load('path/to/save/yolov4-pytorch.pth'))

Method 2: convert the .pth file to an ONNX file

ref: PyTorch docs

  • Use the torch.onnx.export function
    torch.onnx.export(model, args, f, export_params=True,
    verbose=False, training=False,
    input_names=None, output_names=None,
    aten=False, export_raw_ir=False,
    operator_export_type=None,
    opset_version=None,
    _retain_param_name=True,
    do_constant_folding=False,
    example_outputs=None,
    strip_doc_string=True,
    dynamic_axes=None,
    keep_initializers_as_inputs=None)
  • Official example

    Convert the pre-trained AlexNet model to an ONNX file

import torch
import torchvision

dummy_input = torch.randn(1, 3, 224, 224, device='cuda')
model = torchvision.models.alexnet(pretrained=True).cuda()

input_names = [ "actual_input_1" ] + [ "learned_%d" % i for i in range(16) ]
output_names = [ "output1" ]

torch.onnx.export(model, dummy_input, "alexnet.onnx", verbose=True, input_names=input_names, output_names=output_names)
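
Putting Method 1 and Method 2 together, a minimal sketch (not from the tutorial; the 416x416 input size, opset version, and file paths are my own assumptions) for exporting the darknet2pytorch model to ONNX and sanity-checking the result:

import torch
import onnx
from tool import darknet2pytorch

# Load the Darknet-format YOLOv4 model as in Method 1
model = darknet2pytorch.Darknet('path/to/cfg/yolov4-416.cfg', inference=True)
model.load_weights('path/to/weights/yolov4-416.weights')
model.eval()

# Dummy input matching the assumed 416x416 network input
dummy_input = torch.randn(1, 3, 416, 416)

# Export to ONNX (opset 11 assumed)
torch.onnx.export(model, dummy_input, 'yolov4-416.onnx',
                  export_params=True, opset_version=11,
                  input_names=['input'], output_names=['output'])

# Verify that the exported file is a well-formed ONNX model
onnx.checker.check_model(onnx.load('yolov4-416.onnx'))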

ii. ONNX to TRT (not yet tried on Google Colab)

  • builder
    # Builder: import the model into TensorRT and build the TensorRT engine.
    def build_engine(onnx_path, shape=[1, 224, 224, 3]):

        with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:

            # 256 MiB: the maximum workspace available to any layer of the model
            builder.max_workspace_size = (256 << 20)
            # Use fp16 precision (fp32_mode -> False)
            builder.fp16_mode = True

            # Parse the ONNX model
            with open(onnx_path, 'rb') as model:
                parser.parse(model.read())

            engine = builder.build_cuda_engine(network)

            return engine

  • main function

    if __name__ == "__main__":

        onnx_path = '/content/yolov4_1_3_608_608_static.onnx'
        trt_path = '/content/yolov4_1_3_608_608_static.trt'
        input_shape = [1, 224, 224, 3]

        build_trt = timer('Parser ONNX & Build TensorRT Engine')
        engine = build_engine(onnx_path, input_shape)
        build_trt.end()

        save_trt = timer('Save TensorRT Engine')
        save_engine(engine, trt_path)
        save_trt.end()
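
    The main function above also uses a timer helper and a save_engine function that are not shown here; a minimal sketch of save_engine, assuming the engine only needs to be serialized and written to disk, could be:

    def save_engine(engine, trt_path):
        # Serialize the built TensorRT engine and write it to a file for later reuse
        with open(trt_path, 'wb') as f:
            f.write(engine.serialize())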

3.Inference and visualize the results

Haven't looked at this part yet QQ

Conclusions and issues

Last week I ran this workflow on Google Colab; this week, after installing OpenCV, PyTorch, and other packages on the Jetson Nano, I tried running the Colab code there.
However, in the DarkNet to ONNX step the onnx library cannot be found (it has already been installed); I am not yet sure of the cause.
I have installed Archiconda and will redo this in a virtual environment later, while continuing to read the tensorrt_demos code and environment setup. Installing those packages on the Jetson Nano is rather troublesome, though, so I should find or write a script to do it first.

Finish .weights to .trt(2022/03/28)

  • Solved the problem of being unable to import onnx -> exporting .onnx now works
  • Used trtexec to convert .onnx to .trt
    After the conversion, running demo_trt.py could not draw bounding boxes around the objects
    (solved on 2022/3/28)

Error messages frequently encountered while converting .weights to .onnx

They cause the program to abort

>  - Can't parse 'pt2'. Sequence item with index 0 has a wrong type
> - Can't parse 'rec'. Expected sequence length 4, got 2
> - Can't parse 'rec'. Expected sequence length 4, got 2

Possible cause:
cv2.rectangle(img, (x1, y1), (x2, y2), rgb, 3)
requires x1, y1, x2, y2 to be of type int (see the small example below)
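
A minimal illustration (the image and coordinate values here are made up): casting the box coordinates to int before calling cv2.rectangle avoids the "Can't parse 'pt1'/'pt2'" errors.

import cv2
import numpy as np

img = np.zeros((416, 416, 3), dtype=np.uint8)
x1, y1, x2, y2 = 10.5, 20.3, 200.7, 180.2  # float coordinates, e.g. from model output

# Passing floats raises the parse errors above; cast each coordinate to int first
cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 3)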

Solved:
refer to this link and make the analogous changes

Try to change following lines in core/utils.py as defined below:

Line 152 -> c1, c2 = (int(coor[1]), int(coor[0])), (int(coor[3]), int(coor[2]))
Line 159 -> cv2.rectangle(image, c1, (int(np.float32(c3[0])), int(np.float32(c3[1]))), bbox_color, -1)
Line 161 -> cv2.putText(image, bbox_mess, (c1[0], int(np.float32(c1[1] - 2))), cv2.FONT_HERSHEY_SIMPLEX,
fontScale, (0, 0, 0), bbox_thick // 2, lineType=cv2.LINE_AA)

Method for converting .onnx to .trt (works)

trtexec --onnx=resnet50/model.onnx --saveEngine=resnet_engine.trt

Next steps: start training

  • Labeling tool

  • Work through a few Kaggle object-detection exercises

  • Read up on the YOLO theory

  • Ask what kind of deliverable the project is roughly expected to produce

Requirements

Ugh, my Mac is the only machine I can install pygraphviz on

  1. Install pygraphviz and requests
  2. Connect to the server (heroku or ngrok)
  3. Install pipenv and related packages

Program screenshot

Steps

  1. Create get_file.py and enter the following code
    import requests

    SERVER_IP = "0.0.0.0"  # IP
    API_SERVER = "http://" + SERVER_IP + ":8000"  # port
    DOWNLOAD_IMAGE_API = "/show-fsm"

    try:
        downloadImageInfoResponse = requests.get(
            API_SERVER + DOWNLOAD_IMAGE_API)

        if downloadImageInfoResponse.status_code == 200:
            with open('img.jpg', 'wb') as getFile:
                getFile.write(downloadImageInfoResponse.content)
    except Exception as err:
        print('Other error occurred: %s' % err)
  2. In a terminal, run pipenv run app.py (a sketch of what app.py might contain is shown after this list)
  3. Open another terminal and run pipenv run get_file.py
  4. You then get the generated FSM image

    reference

    【 Python 】Sending images with send_file in Flask
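
For context, a minimal sketch of what the app.py server side might look like, assuming a Flask app that draws the FSM with pygraphviz and returns it with send_file (the route matches /show-fsm used by get_file.py above; the graph contents and file names are made up for illustration):

from flask import Flask, send_file
import pygraphviz as pgv

app = Flask(__name__)

@app.route('/show-fsm')
def show_fsm():
    # Build a tiny example FSM graph with pygraphviz (illustrative only)
    graph = pgv.AGraph(directed=True)
    graph.add_edge('idle', 'running')
    graph.add_edge('running', 'idle')
    graph.layout(prog='dot')
    graph.draw('fsm.png')

    # Return the rendered image to the client
    return send_file('fsm.png', mimetype='image/png')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)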

Create a new blog post

hexo new 'Post Name'

Deploy on GitHub Pages

Clear the static files generated last time

hexo cl

Generate the latest static pages

hexo g

run local server

hexo s

deploy on github

hexo d

881~~~