
Perf & nvprof

1. Label tool

  1. Yolo_Label
  • Works in a Windows environment! (Other environments have not been tested yet.)

  • Select the target object's class (e.g. cat or dog) and draw a bounding box around it
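  As far as I know, Yolo_Label saves the annotations in the standard YOLO txt format, one file per image; the file name and numbers below are made up for illustration:

    # each line: <class_id> <x_center> <y_center> <width> <height>, normalized to [0,1]
    $ cat dog_001.txt
    1 0.512 0.436 0.310 0.420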

2. Perf

What is Perf?

Perf (perf_events) is the profiling tool built into the Linux kernel: it samples hardware and software performance events to show where a program spends its time.

Usage

perf record ./a.out && perf annotate

  • perf record: append the target executable to record its execution; the profile data is saved (to perf.data by default). A minimal workflow sketch follows this list.
  • perf annotate: view the result as annotated assembly
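A minimal sketch of the record-and-annotate workflow above, assuming a small C program compiled locally (file and program names are placeholders):

    # compile with debug info so perf annotate can map instructions back to source lines
    gcc -g -O2 main.c -o a.out
    # sample the program; the profile is written to perf.data by default
    perf record ./a.out
    # per-instruction breakdown of where the time went
    perf annotate
    # optional: per-function summary
    perf report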

Results

  • The left column is the percentage of total execution time spent on each assembly instruction; the right side is the corresponding assembly code
  • We can see that cf: addl $0x1,-0x20020(%rbp) takes 27.70% of the total execution time, making it the most time-consuming instruction, so program optimization should start here.

(The above comes from the example program in the basic introduction; follow the link for the topics discussed in detail.)

Installing Perf

  • perf install script

  • Use this shell script to install perf and the related packages; run perf list to check that the installation succeeded

  • Issue encountered: the installation left the environment configuration in a strange state, as described in the three points below (a sanity-check sketch follows this list)

    1. Previously, running trtexec alone was enough to build a TensorRT engine for acceleration; now the full path /usr/src/tensorrt/bin/trtexec is required for the executable to be found
    2. The nvprof error changed: originally it looked like the .trt file could not be found (which later turned out to be a mistake in my path); after running the script above, the error became the CUPTI-library-not-found message shown below
    3. The jetson_release command, which by default shows the system software version information, now returns nothing
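  A sketch of sanity checks and possible workarounds for the points above; paths assume a stock JetPack layout, and the jetson-stats reinstall is an assumption on my part:

    # confirm perf itself still works after the install script
    which perf && perf --version
    perf list | head
    # trtexec now has to be called by its full path; putting that directory on PATH is one workaround
    ls /usr/src/tensorrt/bin/trtexec
    export PATH=$PATH:/usr/src/tensorrt/bin
    # jetson_release comes from the jetson-stats package; reinstalling it may restore the output
    sudo -H pip3 install -U jetson-stats && jetson_release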

3. nvprof

  • Ran into the problem of the CUPTI library not being found

Running the nvprof demo from the forum works, so my guess is that the original command was simply wrong and the executable could not be found; nvprof itself is fine

  • nvprof demo on jetson nano
    ======== Warning: The path to cupti library might not be set in LD_LIBRARY_PATH. By default, it is installed in /usr/local/<cuda-toolkit>/extras/CUPTI/lib64 or /usr/local/<cuda-toolkit>/targets/<arch>/lib.
    ======== Warning: No CUDA application was profiled, exiting
    ======== Error: Application received signal 132
    This link says the sudo command can sometimes affect environment variables; below is the error message after removing sudo from the command (a sketch of possible workarounds follows this output)
    ==9520== NVPROF is profiling process 9520, command: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Warning: ERR_NVGPUCTRPERM - The user does not have permission to profile on the target device. See the following link for instructions to enable permissions and get more information: https://developer.nvidia.com/ERR_NVGPUCTRPERM
    ==9520== Profiling application: python3 /home/f64081169/Desktop/demo/pytorch-YOLOv4/demo_trt.py /home/f64081169/Desktop/demo/pytorch-YOLOv4/data/dog.jpg 416 416
    ==9520== Profiling result:
    No kernels were profiled.
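A sketch of possible workarounds for the two errors above. The CUPTI path is taken from the warning text and assumes CUDA 10.2; on Jetson the targets/<arch>/lib variant may be the correct one, and the path to demo_trt.py is shortened here:

    # 1) CUPTI library not found: put its directory on LD_LIBRARY_PATH
    export LD_LIBRARY_PATH=/usr/local/cuda-10.2/extras/CUPTI/lib64:$LD_LIBRARY_PATH
    # 2) ERR_NVGPUCTRPERM: the simplest fix is to profile as root while keeping the environment
    sudo -E /usr/local/cuda-10.2/bin/nvprof python3 demo_trt.py data/dog.jpg 416 416
    # non-root profiling can be enabled by following https://developer.nvidia.com/ERR_NVGPUCTRPERM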

Running nvprof

The command and its output are shown below

  • Command
    ! cd NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul ;ls && make
    ! sudo /usr/local/cuda-10.2/bin/nvprof /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ! cd ~/Desktop/demo;
  • Output
    Makefile matrixMul matrixMul.cu matrixMul.o NsightEclipse.xml readme.txt
    make: Nothing to be done for 'all'.
    [Matrix Multiply Using CUDA] - Starting...
    ==16367== NVPROF is profiling process 16367, command: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Warning: Unified Memory Profiling is not supported on the underlying platform. System requirements for unified memory can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-requirements
    GPU Device 0: "Maxwell" with compute capability 5.3

    MatrixA(320,320), MatrixB(640,320)
    Computing result using CUDA Kernel...
    done
    Performance= 24.44 GFlop/s, Time= 5.363 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
    Checking computed result for correctness: Result = PASS

    NOTE: The CUDA Samples are not meant for performance measurements. Results may vary when GPU Boost is enabled.
    ==16367== Profiling application: /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul
    ==16367== Profiling result:
    Type Time(%) Time Calls Avg Min Max Name
    GPU activities: 99.98% 1.52216s 301 5.0570ms 4.4092ms 15.839ms void MatrixMulCUDA<int=32>(float*, float*, float*, int, int)
    0.02% 233.19us 2 116.59us 79.743us 153.44us [CUDA memcpy HtoD]
    0.01% 86.462us 1 86.462us 86.462us 86.462us [CUDA memcpy DtoH]
    API calls: 73.21% 1.59586s 1 1.59586s 1.59586s 1.59586s cudaEventSynchronize
    25.02% 545.39ms 3 181.80ms 742.86us 543.81ms cudaMalloc
    0.97% 21.223ms 2 10.612ms 21.197us 21.202ms cudaStreamSynchronize
    0.64% 13.917ms 301 46.235us 34.114us 439.42us cudaLaunchKernel
    0.11% 2.4702ms 3 823.41us 282.50us 1.7598ms cudaMemcpyAsync
    ...
    0.00% 4.1140us 2 2.0570us 1.3540us 2.7600us cuDeviceGet
    0.00% 2.7090us 1 2.7090us 2.7090us 2.7090us cuDeviceGetName
    0.00% 1.5100us 1 1.5100us 1.5100us 1.5100us cudaGetDeviceCount
    0.00% 938ns 1 938ns 938ns 938ns cuDeviceGetUuid
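The table above aggregates kernel statistics and CUDA API calls; if a per-launch timeline is needed instead, nvprof's --print-gpu-trace option shows one line per kernel launch and memcpy (same binary and sample as above):

    # per-launch trace instead of the aggregated summary
    sudo /usr/local/cuda-10.2/bin/nvprof --print-gpu-trace /home/f64081169/Desktop/demo/NVIDIA_CUDA-10.2_Samples/0_Simple/matrixMul/matrixMul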