VirtusLab Group Company
This is the second post about Triton Inference Server. In the first one, I described its features and use cases. If you missed it and some of the concepts in this text are unclear, I encourage you to read it first. You can find it here.
This blog post describes tips and tricks, presents a cheat sheet of useful commands, and provides code samples for benchmarking different model configurations for computer vision models. It is more technical than the first one; by reading it you will learn specific commands for optimizing and configuring models served by Triton.
Note: all commands below use containers in version 22.04 or 22.11, but you can choose whichever version works for you. These containers depend on specific CUDA driver versions, so the versions used in the commands below might not work on your machine.
Important note: you have to pass the absolute path of the current directory and mount it at the same path in the container filesystem. Otherwise, Triton won't be able to find the models in the model repository.
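A launch command satisfying this requirement might look like the following sketch (the image tag and paths are illustrative; adjust them to your setup):

```shell
# Mount the model repository at the SAME absolute path inside the
# container as on the host, so the paths in Triton's config resolve.
docker run --rm --gpus all \
  -p 8000:8000 -p 8001:8001 -p 8002:8002 \
  -v "$(pwd)"/model-repository:"$(pwd)"/model-repository \
  nvcr.io/nvidia/tritonserver:22.11-py3 \
  tritonserver --model-repository="$(pwd)"/model-repository
```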
Define model response cache
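For the 22.x releases, the cache is enabled per model in config.pbtxt and sized globally on the server. A minimal sketch:

```
# config.pbtxt -- opt this model into the response cache
response_cache {
  enable: true
}
```

The server itself must also allocate the cache, e.g. with `tritonserver --response-cache-byte-size=268435456` in the 22.x releases (newer releases replaced this flag with `--cache-config`).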
It's very important to use the same version of the TensorRT container as the tritonserver container due to version validation. In other words, Triton won't serve a model exported with a different version of TensorRT.
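An export along these lines, run inside the matching TensorRT container, might look like this (file names are placeholders):

```shell
# Convert an ONNX model to a TensorRT engine with FP16 enabled;
# the tensorrt container version must match the tritonserver version.
docker run --rm --gpus all -v "$(pwd)":"$(pwd)" -w "$(pwd)" \
  nvcr.io/nvidia/tensorrt:22.11-py3 \
  trtexec --onnx=model.onnx --saveEngine=model.plan --fp16
```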
This optimization gave up to a 2x speedup in latency and throughput on MobileNetV3 compared to the FP32 ONNX model.
This optimization gave up to a 2-3x speedup in latency and throughput on MobileNetV3 compared to the FP32 ONNX model.
This optimization gave up to a 2x speedup in latency and throughput on MobileNetV3 compared to the same model in single-batch mode.
Around 20% increased throughput and 16% reduced inference time were observed for the model with the explicit batch size compared to the model with dynamic batch size.
You can define input and output names to gain better control over parameter names during inference. Otherwise, names based on the model's layer names will be used.
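A sketch of such a config.pbtxt fragment (names and shapes are illustrative):

```
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "output"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```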
You can take a look at the PyTorch example to gain a better understanding of the process.
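For instance, naming the tensors during an ONNX export from PyTorch might look like this sketch (the model choice and opset are illustrative, not the exact code from the linked example):

```python
import torch
import torchvision

# Export MobileNetV3 to ONNX with explicit tensor names and a dynamic
# batch dimension; adjust the dummy input shape to your model.
model = torchvision.models.mobilenet_v3_large(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)
torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["output"],
    dynamic_axes={"input": {0: "batch"}, "output": {0: "batch"}},
    opset_version=13,
)
```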
Define max_batch_size of the model to a reasonable value greater than or equal to 1. If the model does not support dynamic batches, for example a model exported to TensorRT with an explicit batch size equal to 1, you should set this value to 1.
If max_batch_size is larger than 1, then the first dimension in dims is by default always -1. If you define dims: [3, 224, 224], Triton will prepend -1 to the list for you.
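Putting both rules together, a config fragment might look like this (values are illustrative):

```
# Batch dimension is implicit: with max_batch_size > 0, Triton treats
# the effective input shape as [-1, 3, 224, 224]
max_batch_size: 8
input [
  {
    name: "input"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
```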
If you use a Docker container to perform model analysis and tune Triton configuration parameters, remember to mount the volume with the model repository inside the container at the same path as on the host machine. Otherwise, perf_analyzer and model_analyzer will struggle to find the models.
You can check the documentation and all parameters of model_analyzer here.
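A typical invocation might look like this sketch (the model name and paths are placeholders):

```shell
# Profile a model's configurations; output paths must be writable
model-analyzer profile \
  --model-repository "$(pwd)"/model-repository \
  --profile-models mobilenetv2 \
  --output-model-repository-path "$(pwd)"/output-models
```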
It's problematic to analyze models that operate on images if you can't send random data. In this case, you can prepare a file with example data to be sent in a request. It has a predefined format, with the data as a flat list of values and the shape of the desired tensor. In the example below, I want to send an image of shape (640, 480, 3), but in the "content" field I have to specify the image as a flat list of values, in this case of shape (921600,).
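Such a file can be generated with a short script; a minimal sketch, assuming the input tensor is called "INPUT" in config.pbtxt:

```python
import json
import random

# perf_analyzer input-data format: the image is flattened into a single
# list under "content", with the original shape given separately.
shape = (640, 480, 3)
flat = [random.random() for _ in range(shape[0] * shape[1] * shape[2])]

payload = {
    "data": [
        {
            "INPUT": {  # placeholder tensor name from config.pbtxt
                "content": flat,
                "shape": list(shape),
            }
        }
    ]
}

with open("input_data.json", "w") as f:
    json.dump(payload, f)

print(len(flat))  # 921600
```

The file is then passed to perf_analyzer via the `--input-data input_data.json` flag.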
For example, the deployment of a ViT model as a Python model looks like this:
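A minimal sketch of such a model.py (the checkpoint and tensor names are illustrative, and the module only runs inside Triton's Python backend):

```python
import numpy as np
import torch
import triton_python_backend_utils as pb_utils
from transformers import ViTFeatureExtractor, ViTForImageClassification


class TritonPythonModel:
    """Sketch of a ViT model served via Triton's Python backend."""

    def initialize(self, args):
        # Obtain the model and feature extractor from Hugging Face
        self.extractor = ViTFeatureExtractor.from_pretrained(
            "google/vit-base-patch16-224"
        )
        self.model = ViTForImageClassification.from_pretrained(
            "google/vit-base-patch16-224"
        )
        self.model.eval()

    def execute(self, requests):
        responses = []
        for request in requests:
            # "image" must match the input name declared in config.pbtxt
            image = pb_utils.get_input_tensor_by_name(request, "image").as_numpy()
            inputs = self.extractor(images=list(image), return_tensors="pt")
            with torch.no_grad():
                logits = self.model(**inputs).logits.numpy()
            out = pb_utils.Tensor("logits", logits.astype(np.float32))
            responses.append(pb_utils.InferenceResponse(output_tensors=[out]))
        return responses
```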
The initialize and execute functions are required by Triton. In initialize, the model and feature extractor are obtained from Hugging Face; execute does some preprocessing, calls the model, and returns the results.
And its config.pbtxt:
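A matching config.pbtxt might look like this sketch (names and shapes are illustrative):

```
name: "vit"
backend: "python"
max_batch_size: 8
input [
  {
    name: "image"
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]
  }
]
output [
  {
    name: "logits"
    data_type: TYPE_FP32
    dims: [ 1000 ]
  }
]
```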
And the directory looks like this:
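Roughly, the layout follows the standard model-repository convention (the model name is a placeholder):

```
model-repository/
└── vit/
    ├── 1/
    │   └── model.py
    └── config.pbtxt
```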
Below, I present the ViT model deployed as an ensemble model with separate pre- and post-processing.
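A sketch of the ensemble's own config.pbtxt, with placeholder model and tensor names, wiring the three steps together:

```
name: "ensemble_vit"
platform: "ensemble"
max_batch_size: 8
input [ { name: "raw_image" data_type: TYPE_UINT8 dims: [ -1 ] } ]
output [ { name: "label" data_type: TYPE_STRING dims: [ 1 ] } ]
ensemble_scheduling {
  step [
    {
      model_name: "preprocessing"
      model_version: -1
      input_map { key: "raw_image" value: "raw_image" }
      output_map { key: "image" value: "preprocessed" }
    },
    {
      model_name: "vit"
      model_version: -1
      input_map { key: "image" value: "preprocessed" }
      output_map { key: "logits" value: "logits" }
    },
    {
      model_name: "postprocessing"
      model_version: -1
      input_map { key: "logits" value: "logits" }
      output_map { key: "label" value: "label" }
    }
  ]
}
```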
The directory structure looks like this:
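A plausible layout, assuming the core model is the ONNX export and the pre- and post-processing are Python models:

```
model-repository/
├── ensemble_vit/
│   ├── 1/              # empty version directory
│   └── config.pbtxt
├── preprocessing/
│   ├── 1/
│   │   └── model.py
│   └── config.pbtxt
├── vit/
│   ├── 1/
│   │   └── model.onnx
│   └── config.pbtxt
└── postprocessing/
    ├── 1/
    │   └── model.py
    └── config.pbtxt
```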
Take a look at a comparison of those two deployment configurations. You can clearly see that optimizing the core model speeds up the whole inference pipeline by up to around 30%.
I've selected MobileNetV2 for the benchmark. Below you can find two tables with results for different batch sizes.
As one could expect, TorchScript without any optimizations is the worst in this comparison. From the table above, we can conclude that increasing batch_size with dynamic batching translates into increased throughput. Model conversion to FP16 and INT8 gives a noticeable speedup, but it may reduce model accuracy. What is interesting is that TensorRT FP16 has higher throughput than TensorRT FP16 optimized: the first model was exported as a half-precision model, whereas the second one was exported as a full-precision model and configured in Triton to use FP16. From the chart above, we can conclude that if you want to use a half-precision model, it's better to export it in that form rather than rely on Triton's conversion. As always, results will differ on different hardware, and you should test all configurations on your machine before deployment.
This post presented a cheat sheet of useful Triton commands and configurations that you can refer back to when working with this tool. It provided code and benchmarks for different model configurations for computer vision, and walked through optimizing a complex model by separating the pre- and post-processing code from the model inference itself. I hope you learned something new today, and I encourage you to play around with Triton. I have a feeling you might like each other 🙂