
1.Label tool

  1. Yolo_Label
  • Works in a Windows environment (other environments not tested yet)

  • Select the target object's class (e.g. cat or dog) and draw a bounding box around it

2.Perf

What is Perf?

Perf (perf_events) is the Linux kernel's built-in profiling tool; it samples hardware and software events to show where a program spends its time.

How to use it

perf record ./a.out && perf annotate

  • perf record: run the target executable given after the command and record a profile (saved to perf.data by default)
  • perf annotate: view the result

Results

  • The left column is the percentage of total execution time spent on each assembly instruction; the right side is the corresponding assembly code
  • You can see that cf: addl $0x1,-0x20020(%rbp) accounts for 27.70% of total execution time, making it the most expensive instruction; program optimization should start here.

(The example above comes from the sample program in the basic introduction; follow the link for the detailed discussion.)

Installing Perf

  • perf script

  • Use this shell script to install perf and related packages; run perf list to verify the installation succeeded

  • Problem encountered: the installation left the environment in a strange state, as described in the three points below

    1. Previously, running trtexec directly was enough to build a TensorRT engine for acceleration; now the full path /usr/src/tensorrt/bin/trtexec is needed for the executable to be found
    2. The nvprof error message changed: originally it looked like the .trt file could not be found (later I realized my path was simply wrong); after running the script above, the error became the "nvprof cannot find the cupti library" message shown below
    3. The default jetson_release command used to show system software version information, but now it returns nothing

3.nvprof

  • Hit a problem where the cupti library cannot be found

The nvprof demo from the forum runs fine, so I suspect the original issue was a mistyped command that could not find the executable; nvprof itself is not the problem.

  • nvprof demo on jetson nano
    ======== Warning: The path to cupti library might not be set in LD_LIBRARY_PATH. By default, it is installed in /usr/local/<cuda-toolkit>/extras/CUPTI/lib64 or /usr/local/<cuda-toolkit>/targets/<arch>/lib.
    ======== Warning: No CUDA application was profiled, exiting
    ======== Error: Application received signal 132
    This link says the sudo command can sometimes affect environment variables; below is the error message after removing sudo from the command
    ==9520== NVPROF is profiling process 9520, command: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
    ==9520== Profiling application: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Profiling result:
    No kernels were profiled.

Running nvprof

The output is as follows

  • Command
    ! cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul ;ls && make
    ! sudo /usr/local/cuda-10.2/bin/nvprof /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ! cd ~/Desktop/demo;
  • Output
    Makefile matrixMul matrixMul.cu matrixMul.o NsightEclipse.xml readme.txt
    make: Nothing to be done for 'all'.
    [Matrix Multiply Using CUDA] - Starting...
    ==16367== NVPROF is profiling process 16367, command: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
    GPU Device 0: "Maxwell" with compute capability 5.3

    MatrixA(320,320), MatrixB(640,320)
    Computing result using CUDA Kernel...
    done
    Performance= 24.44 GFlop/s, Time= 5.363 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
    Checking computed result for correctness: Result = PASS

    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    ==16367== Profiling application: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Profiling result:
    Type Time(%) Time Calls Avg Min Max Name
    GPU activities: 99.98% 1.52216s 301 5.0570ms 4.4092ms 15.839ms void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
    0.02% 233.19us 2 116.59us 79.743us 153.44us [CUDA memcpy HtoD]
    0.01% 86.462us 1 86.462us 86.462us 86.462us [CUDA memcpy DtoH]
    API calls: 73.21% 1.59586s 1 1.59586s 1.59586s 1.59586s cudaEventSynchronize
    25.02% 545.39ms 3 181.80ms 742.86us 543.81ms cudaMalloc
    0.97% 21.223ms 2 10.612ms 21.197us 21.202ms cudaStreamSynchronize
    0.64% 13.917ms 301 46.235us 34.114us 439.42us cudaLaunchKernel
    0.11% 2.4702ms 3 823.41us 282.50us 1.7598ms cudaMemcpyAsync
    ...
    0.00% 4.1140us 2 2.0570us 1.3540us 2.7600us cuDeviceGet
    0.00% 2.7090us 1 2.7090us 2.7090us 2.7090us cuDeviceGetName
    0.00% 1.5100us 1 1.5100us 1.5100us 1.5100us cudaGetDeviceCount
    0.00% 938ns 1 938ns 938ns 938ns cuDeviceGetUuid

Related reading (2022/04/23)

Paper

  1. Increasing FPS for single board computers and embedded computers in 2021 (Jetson nano and YOVOv4-tiny)
  2. Object Detection in Thermal Spectrum for Advanced Driver-Assistance Systems (ADAS)
  3. TSM: Temporal Shift Module for Efficient and Scalable Video Understanding on Edge Devices
  4. Experimental implementation of a neural network optical channel equalizer in restricted hardware using pruning and quantization
  5. Dynamic Transformer for Efficient Machine Translation on Embedded Devices
  6. Incremental Training and Group Convolution Pruning for Runtime DNN Performance Scaling on Heterogeneous Embedded Platforms

CSDN

  1. searching result : jetson nano performance
  2. YOLO Nano: a Highly Compact You Only Look Once Convolutional Neural Network for Object Detection (translation)
  3. YOLO Nano: a highly compact You Only Look Once convolutional neural network for object detection
  4. Paper translation: YOLO Nano
  5. NVIDIA Deep Learning study notes
  6. TensorRT usage tips
  7. TSM: Temporal Shift Module for Efficient Video Understanding
  8. Clearing the fog around compute power: the upper limits of different GPUs' compute capability

    NVIDIA developer blog

  9. Boosting Application Performance with GPU Memory Prefetching
  10. TensorFlow Performance Logging Plugin nvtx-plugins-tf Goes Public
  11. Nsight Systems Exposes New GPU Optimization Opportunities
  12. CUDA 8 Features Revealed
  13. Transitioning to Nsight Systems from NVIDIA Visual Profiler / nvprof

    Label tool

  14. Yolo_Label

tensorRT + yolov4 (2022/03/07)

  • Goals:
  • Understand how to use GPU acceleration (TensorRT)
  • Learn how to train a model

repo
video tutorial

Chinese tutorial

Hands-on results

  1. Need to buy a webcam
  2. The SD card is too small
  3. Walked through the official demo workflow; still need to get more familiar with it on my own

Start training my own model

Cave Education's tutorial series on the Jetson Nano

Meeting notes

  1. trace code
  2. Buy two webcams
  3. Prepare a HackMD report for the next meeting

trace code(2022/03/21)

  • Goals:
  • Buy 2 webcams
  • trace code

Run through the .weights to .trt workflow once with a pre-trained model

1.Training

YOLOv4 hands-on tutorial (DarkNet)

Learn how to train with DarkNet;
this time, however, the hands-on run uses a pre-trained model

2.weight -> ONNX -> tensorRT

i. Weights to ONNX

Implemented the workflow on Google Colab

Stuck at converting ONNX to TRT:
I ran into problems installing TensorRT on Google Colab, but based on the demo I ran on the Jetson Nano last week, TensorRT comes pre-installed there, so this problem should not occur on the Jetson Nano.

Method 1: convert the weights file to an ONNX file
  • Route: convert .weights to .pth, then use PyTorch's functions to convert to ONNX


from tool import darknet2pytorch
import torch

# load weights from darknet format
model = darknet2pytorch.Darknet('path/to/cfg/yolov4-416.cfg', inference=True)
model.load_weights('path/to/weights/yolov4-416.weights')

# save weights to pytorch format
torch.save(model.state_dict(), 'path/to/save/yolov4-pytorch.pth')

# reload weights from pytorch format
model_pt = darknet2pytorch.Darknet('path/to/cfg/yolov4-416.cfg', inference=True)
model_pt.load_state_dict(torch.load('path/to/save/yolov4-pytorch.pth'))

Method 2: convert the .pth file to an ONNX file

ref: PyTorch docs

  • Use the torch.onnx.export function
    torch.onnx.export(model, args, f, export_params=True,
    verbose=False, training=False,
    input_names=None, output_names=None,
    aten=False, export_raw_ir=False,
    operator_export_type=None,
    opset_version=None,
    _retain_param_name=True,
    do_constant_folding=False,
    example_outputs=None,
    strip_doc_string=True,
    dynamic_axes=None,
    keep_initializers_as_inputs=None)
  • Official example

    Convert the pre-trained AlexNet model to an ONNX file

import torch
import torchvision

dummy_input = torch.randn(1, 3, 224, 224, device='cuda')
model = torchvision.models.alexnet(pretrained=True).cuda()

input_names = [ "actual_input_1" ] + [ "learned_%d" % i for i in range(16) ]
output_names = [ "output1" ]

torch.onnx.export(model, dummy_input, "alexnet.onnx", verbose=True, input_names=input_names, output_names=output_names)
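
Putting Method 1 and Method 2 together, a minimal sketch (not from the tutorial; the 416x416 input size, opset version, and file paths are my own assumptions) for exporting the darknet2pytorch model to ONNX and sanity-checking the result:

import torch
import onnx
from tool import darknet2pytorch

# Load the Darknet-format YOLOv4 model as in Method 1
model = darknet2pytorch.Darknet('path/to/cfg/yolov4-416.cfg', inference=True)
model.load_weights('path/to/weights/yolov4-416.weights')
model.eval()

# Dummy input matching the assumed 416x416 network input
dummy_input = torch.randn(1, 3, 416, 416)

# Export to ONNX (opset 11 assumed)
torch.onnx.export(model, dummy_input, 'yolov4-416.onnx',
                  export_params=True, opset_version=11,
                  input_names=['input'], output_names=['output'])

# Verify that the exported file is a well-formed ONNX model
onnx.checker.check_model(onnx.load('yolov4-416.onnx'))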

ii. ONNX to TRT (not yet tried on Google Colab)

  • builder
    # Builder: import the model into TensorRT and build the TensorRT engine.
    def build_engine(onnx_path, shape=[1, 224, 224, 3]):

        with trt.Builder(TRT_LOGGER) as builder, builder.create_network(1) as network, trt.OnnxParser(network, TRT_LOGGER) as parser:

            # 256 MiB: the maximum workspace available to any layer of the model
            builder.max_workspace_size = (256 << 20)
            # Use fp16 precision (fp32_mode -> False)
            builder.fp16_mode = True

            # Parse the ONNX model
            with open(onnx_path, 'rb') as model:
                parser.parse(model.read())

            engine = builder.build_cuda_engine(network)

            return engine

  • main function

    if __name__ == "__main__":

        onnx_path = '/content/yolov4_1_3_608_608_static.onnx'
        trt_path = '/content/yolov4_1_3_608_608_static.trt'
        input_shape = [1, 224, 224, 3]

        build_trt = timer('Parser ONNX & Build TensorRT Engine')
        engine = build_engine(onnx_path, input_shape)
        build_trt.end()

        save_trt = timer('Save TensorRT Engine')
        save_engine(engine, trt_path)
        save_trt.end()
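
    The main function above also uses a timer helper and a save_engine function that are not shown here; a minimal sketch of save_engine, assuming the engine only needs to be serialized and written to disk, could be:

    def save_engine(engine, trt_path):
        # Serialize the built TensorRT engine and write it to a file for later reuse
        with open(trt_path, 'wb') as f:
            f.write(engine.serialize())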

3.Inference and visualize the results

Haven't looked at this part yet QQ

Conclusions and issues

Last week I ran this workflow on Google Colab; this week, after installing OpenCV, PyTorch, and other packages on the Jetson Nano, I tried running the Colab code there.
However, in the DarkNet to ONNX step the onnx library cannot be found (it has already been installed); I am not yet sure of the cause.
I have installed Archiconda and will redo this in a virtual environment later, while continuing to read the tensorrt_demos code and environment setup. Installing those packages on the Jetson Nano is rather troublesome, though, so I should find or write a script to do it first.

Finish .weights to .trt(2022/03/28)

  • Solved the problem of being unable to import onnx -> exporting .onnx now works
  • Used trtexec to convert .onnx to .trt
    After the conversion, running demo_trt.py could not draw bounding boxes around the objects
    (solved on 2022/3/28)

Error messages frequently encountered while converting .weights to .onnx

They cause the program to abort

>  - Can't parse 'pt2'. Sequence item with index 0 has a wrong type
> - Can't parse 'rec'. Expected sequence length 4, got 2
> - Can't parse 'rec'. Expected sequence length 4, got 2

Possible cause:
cv2.rectangle(img, (x1, y1), (x2, y2), rgb, 3)
requires x1, y1, x2, y2 to be of type int (see the small example below)
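
A minimal illustration (the image and coordinate values here are made up): casting the box coordinates to int before calling cv2.rectangle avoids the "Can't parse 'pt1'/'pt2'" errors.

import cv2
import numpy as np

img = np.zeros((416, 416, 3), dtype=np.uint8)
x1, y1, x2, y2 = 10.5, 20.3, 200.7, 180.2  # float coordinates, e.g. from model output

# Passing floats raises the parse errors above; cast each coordinate to int first
cv2.rectangle(img, (int(x1), int(y1)), (int(x2), int(y2)), (0, 255, 0), 3)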

Solved:
refer to this link and make the analogous changes

Try to change following lines in core/utils.py as defined below:

Line 152 -> c1, c2 = (int(coor[1]), int(coor[0])), (int(coor[3]), int(coor[2]))
Line 159 -> cv2.rectangle(image, c1, (int(np.float32(c3[0])), int(np.float32(c3[1]))), bbox_color, -1)
Line 161 -> cv2.putText(image, bbox_mess, (c1[0], int(np.float32(c1[1] - 2))), cv2.FONT_HERSHEY_SIMPLEX,
fontScale, (0, 0, 0), bbox_thick // 2, lineType=cv2.LINE_AA)

Method for converting .onnx to .trt (works)

trtexec --onnx=resnet50/model.onnx --saveEngine=resnet_engine.trt

Next steps: start training

  • Labeling tool

  • Work through a few Kaggle object-detection exercises

  • Read up on the YOLO theory

  • Ask what kind of deliverable the project is roughly expected to produce

Requirements

Ugh, my Mac is the only machine I can install pygraphviz on

  1. Install pygraphviz and requests
  2. Connect to the server (heroku or ngrok)
  3. Install pipenv and related packages

Program screenshot

Steps

  1. Create get_file.py and enter the following code
    import requests

    SERVER_IP = "0.0.0.0"  # IP
    API_SERVER = "http://" + SERVER_IP + ":8000"  # port
    DOWNLOAD_IMAGE_API = "/show-fsm"

    try:
        downloadImageInfoResponse = requests.get(
            API_SERVER + DOWNLOAD_IMAGE_API)

        if downloadImageInfoResponse.status_code == 200:
            with open('img.jpg', 'wb') as getFile:
                getFile.write(downloadImageInfoResponse.content)
    except Exception as err:
        print('Other error occurred: %s' % err)
  2. In a terminal, run pipenv run app.py (a sketch of what app.py might contain is shown after this list)
  3. Open another terminal and run pipenv run get_file.py
  4. You then get the generated FSM image

    reference

    【 Python 】Sending images with send_file in Flask
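
For context, a minimal sketch of what the app.py server side might look like, assuming a Flask app that draws the FSM with pygraphviz and returns it with send_file (the route matches /show-fsm used by get_file.py above; the graph contents and file names are made up for illustration):

from flask import Flask, send_file
import pygraphviz as pgv

app = Flask(__name__)

@app.route('/show-fsm')
def show_fsm():
    # Build a tiny example FSM graph with pygraphviz (illustrative only)
    graph = pgv.AGraph(directed=True)
    graph.add_edge('idle', 'running')
    graph.add_edge('running', 'idle')
    graph.layout(prog='dot')
    graph.draw('fsm.png')

    # Return the rendered image to the client
    return send_file('fsm.png', mimetype='image/png')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)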

Create a new blog post

hexo new 'Post Name'

Deploy on GitHub Pages

Clear the static files generated last time

hexo cl

Generate the latest static pages

hexo g

run local server

hexo s

deploy on github

hexo d

881~~~