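1. Label tool
- Works on Windows (other environments have not been tested yet).
- Select the class of the target object (e.g. cat or dog) and draw a box around it.
When finished, a .txt file is generated; it contains the class and the point coordinates of each boxed object.
Format description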
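As a rough sketch of reading such a file back: the exact layout depends on the label tool, so the code below only assumes that each line is a class index followed by whitespace-separated coordinates. That layout, and the names `Box` and `parse_label_text`, are assumptions for illustration and are not taken from the tool itself.

```python
from typing import List, NamedTuple, Tuple


class Box(NamedTuple):
    class_id: int              # index of the class (e.g. 0 = cat, 1 = dog)
    coords: Tuple[float, ...]  # the point coordinates stored after the class


def parse_label_text(text: str) -> List[Box]:
    """Parse label lines of the ASSUMED form '<class> <coord> <coord> ...'."""
    boxes = []
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        boxes.append(Box(int(parts[0]), tuple(float(p) for p in parts[1:])))
    return boxes


# Hypothetical label content, not real output from the tool
example = "0 0.5 0.5 0.2 0.3\n1 0.1 0.2 0.3 0.4\n"
for box in parse_label_text(example):
    print(box.class_id, box.coords)
```

If the tool writes class names instead of indices, `int(parts[0])` would need to become a name lookup.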
2. Perf
What is Perf?
Usage
perf record a.out && perf annotate
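perf record
: followed by the target executable; records the program's execution and saves the profile (to perf.data by default)
perf annotate
: shows the results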
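Results
- The left column is each assembly instruction's share of the total execution time; the right column is the corresponding assembly.
- We can observe that
cf: addl $0x1,-0x20020(%rbp)
accounts for 27.70% of the total execution time, making it the most expensive instruction. Program optimization can start from here.
(The above comes from the program example in the basic introduction; follow the link for a detailed discussion of the topic.)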
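Picking the hottest instruction out of a long annotate listing can be automated. The sketch below assumes each annotated line looks like `<percent> : <offset>: <instruction>`; the real layout of perf annotate output varies between versions, so the regular expression is an assumption. Only the `cf:` line is from the result above; the other sample line is hypothetical.

```python
import re
from typing import Iterable, Optional, Tuple

# ASSUMED line layout: "<percent> : <hex offset>: <instruction>"
LINE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*[:│|]?\s*([0-9a-f]+):\s+(.*\S)\s*$")


def hottest_instruction(lines: Iterable[str]) -> Optional[Tuple[float, str, str]]:
    """Return (percent, offset, instruction) for the line with the highest percent."""
    best = None
    for line in lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # lines without a percentage (labels, source text) are skipped
        entry = (float(match.group(1)), match.group(2), match.group(3))
        if best is None or entry[0] > best[0]:
            best = entry
    return best


sample = [
    "  3.10 :   c8:   mov    %eax,%edx",              # hypothetical line
    " 27.70 :   cf:   addl $0x1,-0x20020(%rbp)",      # the line quoted above
]
print(hottest_instruction(sample))
```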
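Installing Perf
Use this shell script to install perf and the related packages, then run
perf list
to check that the installation succeeded.
Problems encountered: after the installation, the environment configuration was left in an odd state, as in the following three points: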
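- Previously, running trtexec was enough to build a TensorRT engine for acceleration; now the full path /usr/src/tensorrt/bin/trtexec is needed for the executable to be found.
- The error message from nvprof changed. Originally it looked like the .trt file could not be found (I later found out that I had written the path wrong); after running the script above, the error became that nvprof cannot find the cupti library.
- The default jetson_release command, which shows the system software version information, no longer returns anything.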
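For the first point, a script that calls trtexec can be made robust against this kind of PATH change. This is a minimal sketch: it looks on PATH first and falls back to the full path mentioned above. The function name `find_tool` is hypothetical, and the fallback location is just the one from these notes.

```python
import os
import shutil
from typing import Optional

# Full path from the notes above; may differ on other installations
TRTEXEC_FALLBACK = "/usr/src/tensorrt/bin/trtexec"


def find_tool(name: str, fallback: str) -> Optional[str]:
    """Return the tool's path from PATH, else the fallback if it is executable."""
    found = shutil.which(name)
    if found:
        return found
    if os.path.isfile(fallback) and os.access(fallback, os.X_OK):
        return fallback
    return None


trtexec = find_tool("trtexec", TRTEXEC_FALLBACK)
if trtexec is None:
    print("trtexec not found on PATH or at", TRTEXEC_FALLBACK)
else:
    print("using", trtexec)
```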
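3. nvprof
- Ran into the problem that the cupti library could not be found.
The nvprof demo from the forum runs fine, so my guess is that the original command was wrong and the executable could not be found; nvprof itself is not the problem.
- nvprof demo on jetson nano: this link says that sudo can sometimes affect environment variables. Below are the error messages after removing sudo from the command.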
======== Warning: The path to cupti library might not be set in LD_LIBRARY_PATH. By default, it is installed in /usr/local/<cuda-toolkit>/extras/CUPTI/lib64 or /usr/local/<cuda-toolkit>/targets/<arch>/lib.
======== Warning: No CUDA application was profiled, exiting
======== Error: Application received signal 132

==9520== NVPROF is profiling process 9520, command: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
==9520== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
==9520== Profiling application: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
==9520== Profiling result:
No kernels were profiled.
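The cupti warning points at LD_LIBRARY_PATH, and sudo often resets environment variables, so one thing worth checking is whether the CUPTI directory is on that path and whether it survives sudo. The sketch below is only a helper for that check. The directory `/usr/local/cuda-10.2/extras/CUPTI/lib64` is an assumption (the `<cuda-toolkit>` placeholder filled in with the cuda-10.2 install used in these notes), and passing the variable through `sudo env` is one possible workaround, not a confirmed fix for this board.

```python
import os
from typing import List

# ASSUMED location: <cuda-toolkit> from the warning replaced by cuda-10.2
CUPTI_DIR = "/usr/local/cuda-10.2/extras/CUPTI/lib64"


def missing_dirs(ld_library_path: str, wanted: List[str]) -> List[str]:
    """Return the wanted directories that are not listed in ld_library_path."""
    present = {os.path.normpath(p) for p in ld_library_path.split(":") if p}
    return [d for d in wanted if os.path.normpath(d) not in present]


def with_dirs(ld_library_path: str, wanted: List[str]) -> str:
    """Return ld_library_path with any missing wanted directories appended."""
    parts = [p for p in ld_library_path.split(":") if p]
    return ":".join(parts + missing_dirs(ld_library_path, wanted))


def sudo_command(ld_library_path: str, argv: List[str]) -> List[str]:
    """Build an argv that sets LD_LIBRARY_PATH explicitly for the sudo'd program."""
    return ["sudo", "env", f"LD_LIBRARY_PATH={ld_library_path}", *argv]


current = os.environ.get("LD_LIBRARY_PATH", "")
print("missing:", missing_dirs(current, [CUPTI_DIR]))
print(" ".join(sudo_command(with_dirs(current, [CUPTI_DIR]), ["nvprof", "./app"])))
```

Here `./app` is a placeholder for the program being profiled.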
Running nvprof
The results are as follows:
- Commands
! cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul ; ls && make
! sudo /usr/local/cuda-10.2/bin/nvprof /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
! cd ~/Desktop/demo;
- Output
Makefile matrixMul matrixMul.cu matrixMul.o NsightEclipse.xml readme.txt
make: Nothing to be done for 'all'.
[Matrix Multiply Using CUDA] - Starting...
==16367== NVPROF is profiling process 16367, command: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
==16367== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
GPU Device 0: "Maxwell" with compute capability 5.3
MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 24.44 GFlop/s, Time= 5.363 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: Result = PASS
NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
==16367== Profiling application: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
==16367== Profiling result:
            Type  Time(%)      Time     Calls       Avg       Min       Max  Name
 GPU activities:   99.98%  1.52216s       301  5.0570ms  4.4092ms  15.839ms  void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
                    0.02%  233.19us         2  116.59us  79.743us  153.44us  [CUDA memcpy HtoD]
                    0.01%  86.462us         1  86.462us  86.462us  86.462us  [CUDA memcpy DtoH]
      API calls:   73.21%  1.59586s         1  1.59586s  1.59586s  1.59586s  cudaEventSynchronize
                   25.02%  545.39ms         3  181.80ms  742.86us  543.81ms  cudaMalloc
                    0.97%  21.223ms         2  10.612ms  21.197us  21.202ms  cudaStreamSynchronize
                    0.64%  13.917ms       301  46.235us  34.114us  439.42us  cudaLaunchKernel
                    0.11%  2.4702ms         3  823.41us  282.50us  1.7598ms  cudaMemcpyAsync
...
                    0.00%  4.1140us         2  2.0570us  1.3540us  2.7600us  cuDeviceGet
                    0.00%  2.7090us         1  2.7090us  2.7090us  2.7090us  cuDeviceGetName
                    0.00%  1.5100us         1  1.5100us  1.5100us  1.5100us  cudaGetDeviceCount
                    0.00%     938ns         1     938ns     938ns     938ns  cuDeviceGetUuid
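The numbers in this output can be cross-checked by hand. The sketch below recomputes the reported operation count, GFlop/s, kernel total time and GPU time share from values copied out of the output above. It assumes the usual count of 2 floating-point operations (one multiply, one add) per inner-product term, and reads MatrixA(320,320), MatrixB(640,320) as (width, height).

```python
def matmul_ops(rows_a: int, cols_a: int, cols_b: int) -> int:
    """Floating-point operations for A(rows_a x cols_a) times B(cols_a x cols_b)."""
    return 2 * rows_a * cols_a * cols_b


# A is 320 x 320; B is 320 rows x 640 columns
ops = matmul_ops(320, 320, 640)

# "Time= 5.363 msec" from the Performance line
gflops = ops / (5.363e-3) / 1e9

# 301 kernel calls at an average of 5.0570 ms each
kernel_total_ms = 301 * 5.0570

# Share of GPU time spent in the kernel versus the two memcpy entries (seconds)
kernel_s, htod_s, dtoh_s = 1.52216, 233.19e-6, 86.462e-6
kernel_share = 100 * kernel_s / (kernel_s + htod_s + dtoh_s)

print(ops, round(gflops, 2), round(kernel_total_ms, 2), round(kernel_share, 2))
```

The results agree with the reported Size= 131072000 Ops, 24.44 GFlop/s, 1.52216s and 99.98%, so almost all GPU time is in the MatrixMulCUDA kernel and the memory copies are negligible for this sample.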