
Perf & nvprof

1. Label tool

  1. Yolo_Label
  • Works in a Windows environment! (Other environments have not been tested yet.)

  • Select the target object's class (e.g. cat or dog) and draw a bounding box around it
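  As far as I know, Yolo_Label saves the annotations in the standard YOLO txt format, one file per image; the file name and numbers below are made up for illustration:

    # each line: <class_id> <x_center> <y_center> <width> <height>, normalized to [0,1]
    $ cat dog_001.txt
    1 0.512 0.436 0.310 0.420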

2. Perf

What is Perf?

Perf (perf_events) is the profiling tool built into the Linux kernel: it samples hardware and software performance events to show where a program spends its time.

Usage

perf record ./a.out && perf annotate

  • perf record: append the target executable to record its execution; the profile data is saved (to perf.data by default). A minimal workflow sketch follows this list.
  • perf annotate: view the result as annotated assembly
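A minimal sketch of the record-and-annotate workflow above, assuming a small C program compiled locally (file and program names are placeholders):

    # compile with debug info so perf annotate can map instructions back to source lines
    gcc -g -O2 main.c -o a.out
    # sample the program; the profile is written to perf.data by default
    perf record ./a.out
    # per-instruction breakdown of where the time went
    perf annotate
    # optional: per-function summary
    perf report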

Results

  • The left column is the percentage of total execution time spent on each assembly instruction; the right side is the corresponding assembly code
  • We can see that cf: addl $0x1,-0x20020(%rbp) takes 27.70% of the total execution time, making it the most time-consuming instruction, so program optimization should start here.

(The above comes from the example program in the basic introduction; follow the link for the topics discussed in detail.)

Installing Perf

  • perf install script

  • Use this shell script to install perf and the related packages; run perf list to check that the installation succeeded

  • Issue encountered: the installation left the environment configuration in a strange state, as described in the three points below (a sanity-check sketch follows this list)

    1. Previously, running trtexec alone was enough to build a TensorRT engine for acceleration; now the full path /usr/src/tensorrt/bin/trtexec is required for the executable to be found
    2. The nvprof error changed: originally it looked like the .trt file could not be found (which later turned out to be a mistake in my path); after running the script above, the error became the CUPTI-library-not-found message shown below
    3. The jetson_release command, which by default shows the system software version information, now returns nothing
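  A sketch of sanity checks and possible workarounds for the points above; paths assume a stock JetPack layout, and the jetson-stats reinstall is an assumption on my part:

    # confirm perf itself still works after the install script
    which perf && perf --version
    perf list | head
    # trtexec now has to be called by its full path; putting that directory on PATH is one workaround
    ls /usr/src/tensorrt/bin/trtexec
    export PATH=$PATH:/usr/src/tensorrt/bin
    # jetson_release comes from the jetson-stats package; reinstalling it may restore the output
    sudo -H pip3 install -U jetson-stats && jetson_release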

3. nvprof

  • Ran into the problem of the CUPTI library not being found

Running the nvprof demo from the forum works, so my guess is that the original command was simply wrong and the executable could not be found; nvprof itself is fine

  • nvprof demo on jetson nano
    ======== Warning: The path to cupti library might not be set in LD_LIBRARY_PATH. By default, it is installed in /usr/local/<cuda-toolkit>/extras/CUPTI/lib64 or /usr/local/<cuda-toolkit>/targets/<arch>/lib.
    ======== Warning: No CUDA application was profiled, exiting
    ======== Error: Application received signal 132
    This link says the sudo command can sometimes affect environment variables; below is the error message after removing sudo from the command (a sketch of possible workarounds follows this output)
    ==9520== NVPROF is profiling process 9520, command: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
    ==9520== Profiling application: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Profiling result:
    No kernels were profiled.
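A sketch of possible workarounds for the two errors above. The CUPTI path is taken from the warning text and assumes CUDA 10.2; on Jetson the targets/<arch>/lib variant may be the correct one, and the path to demo_trt.py is shortened here:

    # 1) CUPTI library not found: put its directory on LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH
    # 2) ERR_NVGPUCTRPERM: the simplest fix is to profile as root while keeping the environment
    sudo -E /usr/local/cuda-10.2/bin/nvprof python3 demo_trt.py data/dog.jpg 416 416
    # non-root profiling can be enabled by following https://developer.nvidia.com/ERR_NVGPUCTRPERM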

Running nvprof

The command and its output are shown below

  • Command
    ! cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul ;ls && make
    ! sudo /usr/local/cuda-10.2/bin/nvprof /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ! cd ~/Desktop/demo;
  • Output
    Makefile matrixMul matrixMul.cu matrixMul.o NsightEclipse.xml readme.txt
    make: Nothing to be done for 'all'.
    [Matrix Multiply Using CUDA] - Starting...
    ==16367== NVPROF is profiling process 16367, command: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
    GPU Device 0: "Maxwell" with compute capability 5.3

    MatrixA(320,320), MatrixB(640,320)
    Computing result using CUDA Kernel...
    done
    Performance= 24.44 GFlop/s, Time= 5.363 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
    Checking computed result for correctness: Result = PASS

    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    ==16367== Profiling application: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Profiling result:
    Type Time(%) Time Calls Avg Min Max Name
    GPU activities: 99.98% 1.52216s 301 5.0570ms 4.4092ms 15.839ms void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
    0.02% 233.19us 2 116.59us 79.743us 153.44us [CUDA memcpy HtoD]
    0.01% 86.462us 1 86.462us 86.462us 86.462us [CUDA memcpy DtoH]
    API calls: 73.21% 1.59586s 1 1.59586s 1.59586s 1.59586s cudaEventSynchronize
    25.02% 545.39ms 3 181.80ms 742.86us 543.81ms cudaMalloc
    0.97% 21.223ms 2 10.612ms 21.197us 21.202ms cudaStreamSynchronize
    0.64% 13.917ms 301 46.235us 34.114us 439.42us cudaLaunchKernel
    0.11% 2.4702ms 3 823.41us 282.50us 1.7598ms cudaMemcpyAsync
    ...
    0.00% 4.1140us 2 2.0570us 1.3540us 2.7600us cuDeviceGet
    0.00% 2.7090us 1 2.7090us 2.7090us 2.7090us cuDeviceGetName
    0.00% 1.5100us 1 1.5100us 1.5100us 1.5100us cudaGetDeviceCount
    0.00% 938ns 1 938ns 938ns 938ns cuDeviceGetUuid
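The table above aggregates kernel statistics and CUDA API calls; if a per-launch timeline is needed instead, nvprof's --print-gpu-trace option shows one line per kernel launch and memcpy (same binary and sample as above):

    # per-launch trace instead of the aggregated summary
    sudo /usr/local/cuda-10.2/bin/nvprof --print-gpu-trace /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul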