airpack.deploy.trt

This package provides functionality to optimize a neural network using NVIDIA’s TensorRT framework and perform inference using optimized models.

Optimized models are saved in the .plan format, an internal, platform-specific data format for TensorRT. Because TensorRT optimization works by running many variations of the network on the target hardware, it must be executed on the same platform that will be used for inference, i.e., on the AIR-T for final deployment.

The basic workflow is as follows:

  1. Save your trained model to an ONNX file

  2. Optimize the model using TensorRT with the onnx2plan() function.

  3. Create a TrtInferFromPlan object for your optimized model and use it to perform inference.
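
As a minimal sketch of step 2 (the filename and input length below are illustrative and must match how your model was exported; step 3 is shown in the TrtInferFromPlan documentation below):

    from airpack.deploy import trt

    # Optimize the ONNX model. The resulting .plan file is written next to
    # the ONNX file and its path is returned as a pathlib.Path.
    plan_file = trt.onnx2plan("saved_model.onnx", input_len=4096)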

Module Contents

airpack.deploy.trt.onnx2plan(onnx_file, input_node_name='input', input_port_name='', input_len=4096, fp16_mode=True, max_workspace_size=1073741824, max_batch_size=128, verbose=False)

Optimize the provided ONNX model using TensorRT and save the result.

The optimized model will have a .plan extension and be saved in the same folder as the input ONNX model.

Parameters
  • onnx_file (Union[str, os.PathLike]) – Filename of the ONNX model to optimize

  • input_node_name (str) – Name of the ONNX model’s input layer

  • input_port_name (str) –

  • input_len (int) – Length of the ONNX model’s input layer, determined when the model was created

  • fp16_mode (bool) – Try to use reduced precision (float16) layers if performance would improve

  • max_workspace_size (int) – Maximum scratch memory that the TensorRT optimizer may use, defaults to 1GB. The default value can be used in most situations and may only need to be reduced if using very low-end GPU hardware

  • max_batch_size (int) – The maximum batch size to optimize for. When running inference using the optimized model, the chosen batch size must be less than or equal to the maximum specified here

  • verbose (bool) – Print extra information about the optimized model

Return type

pathlib.Path
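
For example (the filename, node name, and input length are illustrative and must match the model being optimized):

    import pathlib
    from airpack.deploy import trt

    onnx_file = pathlib.Path("output/saved_model.onnx")  # illustrative path
    plan_file = trt.onnx2plan(
        onnx_file,
        input_node_name="input",  # must match the ONNX model's input layer name
        input_len=4096,           # must match the model's input length
        fp16_mode=True,           # allow float16 layers where they are faster
        max_batch_size=128,       # inference batch size may not exceed this
        verbose=True,
    )
    print(f"Optimized model written to {plan_file}")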

airpack.deploy.trt.uff2plan(uff_file, input_node_name='input/IteratorGetNext', input_len=4096, fp16_mode=True, max_workspace_size=1073741824, max_batch_size=128, verbose=False)

Optimize the provided UFF (TensorFlow 1.x) model using TensorRT and save the result.

The optimized model will have a .plan extension and be saved in the same folder as the input model.

Parameters
  • uff_file (Union[str, os.PathLike]) – Filename of the UFF model to optimize

  • input_node_name (str) – Name of the UFF model’s input layer

  • input_len (int) – Length of the UFF model’s input layer, determined when the model was created

  • fp16_mode (bool) – Try to use reduced precision (float16) layers if performance would improve

  • max_workspace_size (int) – Maximum scratch memory that the TensorRT optimizer may use, defaults to 1GB. The default value can be used in most situations and may only need to be reduced if using very low-end GPU hardware

  • max_batch_size (int) – The maximum batch size to optimize for. When running inference using the optimized model, the chosen batch size must be less than or equal to the maximum specified here

  • verbose (bool) – Print extra information about the optimized model
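
For example (the filename is illustrative; the input node name shown is the function's default and corresponds to a typical TensorFlow 1.x input iterator):

    from airpack.deploy import trt

    # Optimize a TensorFlow 1.x model that was frozen and converted to UFF.
    trt.uff2plan(
        "output/saved_model.uff",
        input_node_name="input/IteratorGetNext",
        input_len=4096,
    )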

airpack.deploy.trt.plan_bench(plan_file_name, cplx_samples, batch_size=128, num_inferences=100, input_dtype=np.float32)

Benchmarks a model that has been pre-optimized using the TensorRT framework.

This function uses settings for the CUDA context and memory buffers that are optimized for NVIDIA Jetson modules and may not be optimal for desktops.

Note

The batch_size selected for benchmarking must be less than or equal to the max_batch_size value that was specified when creating the .plan file. Additionally, to maximize performance, power-of-two values for batch_size are recommended.

Note

To accurately benchmark the result of TensorRT optimization, this benchmark should be run on the same computer that generated the .plan file.

Parameters
  • plan_file_name (Union[str, os.PathLike]) – TensorRT optimized model file (.plan format)

  • cplx_samples (int) – Input length of the neural network, in complex samples; this is half of the model's input_length, since the network operates on real values

  • batch_size (int) – How many sets of cplx_samples inputs are batched together in a single inference call

  • num_inferences (Optional[int]) – Number of iterations to execute inference between measurements of inference throughput (if None, then run forever)

  • input_dtype (numpy.number) – Data type of a single value (an individual I or Q value, not a complete complex (I, Q) sample); use either numpy.int16 or numpy.float32 here

Return type

None
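
For example, benchmarking a model built with a real-valued input length of 4096, which therefore consumes 2048 complex samples per inference window (the filename and values are illustrative):

    import numpy as np
    from airpack.deploy import trt

    trt.plan_bench(
        "output/saved_model.plan",
        cplx_samples=2048,       # half of the model's real-valued input length
        batch_size=128,          # must not exceed the .plan file's max_batch_size
        num_inferences=100,
        input_dtype=np.float32,  # dtype of an individual I or Q value
    )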

class airpack.deploy.trt.TrtInferFromPlan(plan_file, batch_size, input_buffer, verbose=True)

Wrapper class for TensorRT inference using a pre-optimized .plan file.

Since it is expensive to create these inference objects, they should be created once at the start of your program and then re-used for multiple inference calls.

The buffer containing data for inference is provided when this inference object is created and is re-used for every inference call. The intended usage is to repeatedly copy data from the radio into this buffer and then call the feed_forward() method to run inference.

After calling feed_forward(), the inference results will be available as the MappedBuffer object output_buff.

Note

Only device-mapped memory buffers are supported.

Parameters
  • plan_file (path_like) – TensorRT .plan file containing the optimized model

  • batch_size (int) – Batch size for a single inference execution

  • input_buffer (MappedBuffer) – Buffer containing data for inference of size input_length x batch_size, where input_length is the length of input to the model, determined when the neural network was created

  • verbose (bool) – Print verbose information about the loaded network, defaults to True

Variables
  • input_buff (MappedBuffer) – Input buffer for use in inference

  • output_buff (MappedBuffer) – Output buffer containing inference results
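
For example (a minimal sketch; the plan filename is illustrative, and the MappedBuffer helper is assumed to live in a companion module whose exact path and constructor may differ in your installation):

    import numpy as np
    from airpack.deploy import trt, trt_utils  # trt_utils location is an assumption

    input_len = 4096   # real-valued input length of the model (illustrative)
    batch_size = 128   # must not exceed the max_batch_size used to build the .plan

    # Allocate a device-mapped input buffer large enough for one full batch.
    # The MappedBuffer constructor shown here (number of elements, dtype) is an
    # assumption and should be checked against your installation.
    input_buff = trt_utils.MappedBuffer(input_len * batch_size, np.float32)

    # Create the inference object once and re-use it for every inference call.
    dnn = trt.TrtInferFromPlan("output/saved_model.plan", batch_size, input_buff)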

feed_forward(self)

Forward propagates the input buffer through the neural network to run inference.

Call this method each time samples from the radio are read into input_buff. Results will be available afterwards in output_buff.

Return type

None
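
A sketch of the resulting receive-and-infer loop, continuing the construction example above (read_samples_from_radio() is a placeholder for whatever code reads one batch of samples from your receiver, and the .host view of the mapped buffers is an assumption about the MappedBuffer interface):

    while True:
        # Copy one batch worth of samples into the device-mapped input buffer.
        dnn.input_buff.host[:] = read_samples_from_radio()  # placeholder helper

        # Forward propagate the buffer through the network; the results are
        # then available in the output buffer.
        dnn.feed_forward()
        result = dnn.output_buff.host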