Simple Model Profiler Helper
The profiler measures your model's speed and latency on your hardware for a given input_shape.
Get started
Import dependencies:
from neetbox.torch.arch import cnn
from neetbox.torch.profile import profile
Build a basic network:
model = cnn.ResBlock(
    inplanes=64, outplanes=128, kernel_size=3, stride=2, residual=True, dilation=2
).cuda()
model.eval()
output:
ResBlock(
  (conv1): Conv2d(64, 128, kernel_size=(3, 3), stride=(2, 2), padding=(2, 2), dilation=(2, 2))
  (bn1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu_inplace): ReLU(inplace=True)
  (conv2): Conv2d(128, 128, kernel_size=(3, 3), stride=(1, 1), padding=(2, 2), dilation=(2, 2))
  (bn2): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (residual): Sequential(
    (0): Conv2d(64, 128, kernel_size=(1, 1), stride=(2, 2), dilation=(2, 2))
    (1): BatchNorm2d(128, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  )
)
Anything that inherits from torch.nn.Module should work with neetbox.torch.profile.profile.
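For instance, a minimal custom module (hypothetical, defined here only for illustration) can be profiled the same way:

from torch import nn
from neetbox.torch.profile import profile

class TinyNet(nn.Module):  # hypothetical module, just to show any nn.Module works
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.conv(x))

profile(TinyNet().eval(), input_shape=(1, 3, 224, 224), speedtest=10)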
Run profile:
profile(model, input_shape=(1, 64, 1280, 720), speedtest=100)
output:
2023-03-15-16:49:50 > NEETBOX > running speedtest...
2023-03-15-16:49:50 > NEETBOX > start warm up
100%|██████████| 10/10 [00:03<00:00, 2.65it/s]
2023-03-15-16:49:54 > NEETBOX > warm up done
100%|██████████| 102/102 [00:00<00:00, 133.97it/s]
2023-03-15-16:49:55 > NEETBOX > average 0.007300891876220703s per run with out put size torch.Size([1, 128, 640, 360])
2023-03-15-16:49:55 > NEETBOX > min inference time: 0.0s, Max inference time: 0.030241966247558594s
2023-03-15-16:49:55 > NEETBOX > That is 136.96956713700143 frames per second
2023-03-15-16:49:55 > NEETBOX > ====================================================
2023-03-15-16:49:55 > NEETBOX > running CUDA side synchronous speedtest...
100%|██████████| 102/102 [00:02<00:00, 39.46it/s]
2023-03-15-16:49:58 > NEETBOX > average 0.015801547095179558s per run with out put size torch.Size([1, 128, 640, 360])
2023-03-15-16:49:58 > NEETBOX > min inference time: 0.014497632160782814s, max inference time: 0.01753702387213707s
2023-03-15-16:49:58 > NEETBOX > That is 63.284942626953125 frames per second
[WARNING] 2023-03-15-16:49:58 > NEETBOX > Seems your model has an imbalanced performance peek on cuda side test and python side test. Consider raising speedtest loop times (currently 100 +2) to have a stable result.
[DEBUG] 2023-03-15-16:49:58 > NEETBOX > Note that the CUDA side synchronous speedtest is more reliable since you are using a GPU.
2023-03-15-16:49:58 > NEETBOX > ====================================================
2023-03-15-16:49:58 > NEETBOX > model profiling...
[INFO] Register count_convNd() for <class 'torch.nn.modules.conv.Conv2d'>.
[INFO] Register count_normalization() for <class 'torch.nn.modules.batchnorm.BatchNorm2d'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.activation.ReLU'>.
[INFO] Register zero_ops() for <class 'torch.nn.modules.container.Sequential'>.
2023-03-15-16:49:58 > NEETBOX > Model FLOPs = 53.2021248G, params = 0.230528M
Read the result
The output above shows that:
- Your model runs at about 137 FPS, spending about 0.0073 seconds on average per inference in the python-side test, which does not wait for the GPU to finish.
- Your model runs at about 63.28 FPS, spending about 0.0158 seconds on average per inference in the CUDA-side synchronous test.
- Your model has 0.230528M parameters and runs 53.2021248 GFLOPs per inference with the given input / input size.
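The FPS figures are simply the reciprocal of the average per-run latency; a quick sanity check with the numbers above:

# FPS = 1 / average per-run latency
print(1 / 0.007300891876220703)  # ~136.97, the python-side test
print(1 / 0.015801547095179558)  # ~63.28, the CUDA-side synchronous test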
If the output warns about an imbalanced performance peak between the CUDA-side synchronous test and the non-synchronous one, take the synchronous result as the final result.
CPU or GPU
The tests run on whichever device your model currently resides on.
Run tests on CPU:
model.cpu()
profile(model, input_shape=(1, 64, 1280, 720), speedtest=100)
Run tests on GPU:
model.cuda()
profile(model, input_shape=(1, 64, 1280, 720), speedtest=100)
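If you are not sure whether a GPU is available at runtime, a small guard (plain PyTorch, nothing neetbox-specific) keeps the same call working on both devices:

import torch

# fall back to CPU when no CUDA device is available
device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)
profile(model, input_shape=(1, 64, 1280, 720), speedtest=100)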
Specify the input / input shape
You can either give an input_shape or pass a specific_input; when both are given, specific_input takes priority. If your model takes an ordinary torch.Tensor as input, either option works.
For example:
import torch

model.cuda()
profile(model, input_shape=(1, 64, 1280, 720), speedtest=100)
profile(model, specific_input=torch.rand(1, 64, 1280, 720).cuda(), speedtest=100)
These two calls are effectively the same.
However, if your model does not take an ordinary torch.Tensor as input, you must pass a specific_input.
The specific_input must be on the same device as the model.
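For example, a model whose forward takes a dict instead of a bare Tensor (the model below is hypothetical, and this sketch assumes profile feeds specific_input straight into forward):

import torch
from torch import nn
from neetbox.torch.profile import profile

class DictInputNet(nn.Module):  # hypothetical model with a non-Tensor input
    def forward(self, batch):
        return batch["image"] * 2

model = DictInputNet().cuda()
batch = {"image": torch.rand(1, 3, 224, 224).cuda()}  # same device as the model
profile(model, specific_input=batch, speedtest=10)  # assumes dict inputs are passed through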
Choose how profile actually profiles
If your code looks like:
model.cuda()
profile(model, input_shape=(1, 64, 1280, 720), speedtest=100)
Then profile measures the model's FLOPs and params with a (1, 64, 1280, 720) shaped input Tensor, and also runs inference 100 times for the speed test.
If your code looks like:
model.cuda()
profile(model, input_shape=(1, 64, 1280, 720), speedtest=False)
Then profile only measures the model's FLOPs and params with a (1, 64, 1280, 720) shaped input Tensor; no speed test is run.
If your code looks like:
model.cuda()
profile(model, input_shape=(1, 64, 1280, 720), speedtest=10, profiling=False)
Then profile only runs the speed test, inferencing the model 10 times with a (1, 64, 1280, 720) shaped input Tensor; FLOPs and params profiling is skipped.