AI Inference¶
The AIR-T with AirStack Core is primarily used to deploy trained AI models and perform inference with them. This can be accomplished through a number of workflows, all of which involve training a model offline and then deploying it in AirStack Core.
As an example, referring back to the AirStack Core Application Flow Diagram, an application developer may want to "Perform Inference" using a model that was trained offline. Here are the recommended workflows for executing the model inference:
- Inference Using AI Framework Directly (e.g., PyTorch)
- Inference using ONNX Runtime
- Inference using TensorRT
- Inference using NVIDIA Triton
AirStack Core is a flexible tool that supports a wide range of other workflows that are not documented here. If you have a workflow you are using and are comfortable with, it is likely possible on AirStack Core.
The following sections cover the four recommended workflows listed above.
Inference Using AI Framework Directly¶
This section covers AI Toolboxes that can be installed within the AirStack Core environment to perform inference directly on the hardware.
AirStack Core supports directly performing inference with most toolboxes and workflows. While each of these approaches may have different implementation specifics, all of them may be integrated into the AirStack Core inference workflow. Here is a list of some of the most common AI Toolboxes.
| Toolbox | Usage | Documentation | Tutorial Link |
|---|---|---|---|
| PyTorch | Dynamic framework for research and production | Coming Soon1 | |
| TensorFlow | Scalable AI platform for production | Coming Soon1 | |
| ONNX Runtime | Cross-platform inference engine for ONNX models | Coming Soon1 | |
| TensorRT | High-performance AI inference optimizer | | |
| Triton2 | Multi-framework inference server for AI | | |
Note that MATLAB does not support ARM architectures, so it cannot run directly on AirStack Core. Currently, we provide a tutorial for performing training on a separate computer and deploying the trained model in AirStack Core, found here.
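As a point of reference, a minimal inference sketch using PyTorch directly on the AIR-T might look like the following. The model file name, input buffer size, and use of TorchScript are assumptions for illustration, not part of the official examples.

```python
import torch

# Select the AIR-T's GPU when CUDA is available; otherwise fall back to CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load a model that was trained offline and exported as a TorchScript archive
# (hypothetical file name).
model = torch.jit.load("model.pt", map_location=device)
model.eval()

# Placeholder input buffer; in practice this would hold received RF samples.
samples = torch.randn(1, 2048, device=device)

with torch.no_grad():
    prediction = model(samples)

print(prediction.cpu().numpy())
```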
Inference using ONNX Runtime¶
Open Neural Network Exchange (ONNX) is an open-source format designed to enable interoperability between different deep learning frameworks. It allows developers to train models in tools like PyTorch or TensorFlow and export them for inference in platforms such as ONNX Runtime, TensorRT, or other optimized engines. ONNX simplifies model deployment across diverse hardware and software environments, making AI development more flexible and portable.
flowchart LR
Train["**Train Model**<br>(AI Toolboxes)"] ==>
Export["**Export Model<br>(ONNX File)**"] ==>
Deploy["**Perform Inference**<br>(ONNX Runtime)"]
subgraph **AirStack Core**
Deploy
end
Models saved in the ONNX format may be optimized using TensorRT (next section); however, users may want to run the model directly using ONNX Runtime. ONNX Runtime is a high-performance inference engine built to run ONNX models efficiently across a wide range of hardware platforms. Developed by Microsoft, it supports CPU, GPU, and specialized accelerators, and is optimized for both cloud and edge deployment. With features like model quantization, graph optimization, and hardware-specific execution providers (including NVIDIA TensorRT, Intel OpenVINO, and others), ONNX Runtime ensures fast, scalable, and production-ready AI inference.
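As a sketch of this approach, the following assumes a trained network exported as model.onnx (hypothetical name) and the onnxruntime Python package with GPU support installed. The execution provider list lets ONNX Runtime use TensorRT or CUDA when available and fall back to the CPU otherwise.

```python
import numpy as np
import onnxruntime as ort

# Prefer TensorRT, then CUDA, then CPU execution providers.
providers = [
    "TensorrtExecutionProvider",
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]
session = ort.InferenceSession("model.onnx", providers=providers)

# Placeholder input buffer; shape and dtype must match the exported model.
input_name = session.get_inputs()[0].name
samples = np.random.randn(1, 2048).astype(np.float32)

outputs = session.run(None, {input_name: samples})
print(outputs[0])
```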
Inference using TensorRT¶
TensorRT is NVIDIA's high-performance inference engine optimized for deploying AI models at the edge with NVIDIA Jetson platforms. It delivers ultra-low latency and high throughput by optimizing neural networks for Jetson's GPU and DLA accelerators. Supporting FP32, FP16, and INT8 precision, TensorRT enables efficient, real-time AI performance in power-constrained environments such as the AIR-T with AirStack Core.
The workflow for creating a deep learning application using TensorRT consists of three phases: training, optimization, and deployment. These steps are illustrated in the figure below and covered in this section.
flowchart LR
Train["**Train Model**<br>(AI Toolboxes)"] ==>
Export["**Export Model**<br>(ONNX File)"] ==>
Optimize["**Optimize Model**<br>(TensorRT)"]
subgraph **AirStack Core**
Optimize ==>
Deploy["**Perform Inference**<br>(TensorRT)"]
end
When training a model for optimization and execution using TensorRT, make sure that the layers being used are supported by your version of TensorRT. To determine what version of TensorRT is installed on your version of AirStack Core, open a terminal and run:
$ dpkg -l | grep TensorRT
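If the TensorRT Python bindings are installed in your environment, the version can also be checked from Python:

```python
import tensorrt as trt

# Prints the installed TensorRT version string.
print(trt.__version__)
```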
The supported layers for your version of TensorRT may be found in the TensorRT SDK Documentation under the TensorRT Support Matrix section.
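Once the model has been exported to ONNX, the optimization step can be performed with the trtexec command-line tool or the TensorRT Python API. The sketch below assumes the TensorRT 8.x Python API and a hypothetical model.onnx file; it parses the ONNX model, enables FP16 where supported, and serializes the optimized engine to disk.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)


def build_engine(onnx_path: str, engine_path: str) -> None:
    """Parse an ONNX model and serialize an optimized TensorRT engine."""
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, TRT_LOGGER)

    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
            raise RuntimeError("ONNX parse failed:\n" + "\n".join(errors))

    config = builder.create_builder_config()
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # reduced precision for the Jetson GPU/DLA

    engine_bytes = builder.build_serialized_network(network, config)
    with open(engine_path, "wb") as f:
        f.write(engine_bytes)


if __name__ == "__main__":
    build_engine("model.onnx", "model.plan")  # hypothetical file names
```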
AirPack is an add-on software package (not included with AirStack Core) that provides source code for the complete training-to-deployment workflow described in this section.
TensorRT Source Code Examples¶
Deepwave provides a source code toolbox to demonstrate the recommended training-to-inference workflow for deploying a neural network in AirStack Core using TensorRT:
AIR-T Deep Learning Inference Examples.
The toolbox demonstrates how to create a simple neural network for the following AI Toolboxes:
Installation of these packages for training is made easy by the included .yml file, which creates a conda environment. For inference execution, all Python packages and dependencies are pre-installed on AIR-Ts running AirStack Core 0.3+.
In the example code, the neural network model accepts an arbitrary-length input buffer and has a single output node that calculates the average of the instantaneous power across each batch of the input buffer. While this is not a typical neural network model, it is an excellent example of how customers may deploy their trained models on the AIR-T for inference.
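As a rough illustration of the concept (not the actual source from the Deepwave toolbox), such a model could be expressed in PyTorch as follows, assuming an interleaved I/Q float32 input buffer:

```python
import torch


class AvgPowerNet(torch.nn.Module):
    """Toy model: mean instantaneous power of an interleaved I/Q buffer."""

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x has shape (batch, 2 * N) with interleaved I/Q samples.
        i = x[:, 0::2]
        q = x[:, 1::2]
        power = i * i + q * q                   # instantaneous power per sample
        return power.mean(dim=1, keepdim=True)  # single output node per batch item


if __name__ == "__main__":
    buffer = torch.randn(4, 4096)  # 4 buffers of 2048 complex samples each
    print(AvgPowerNet()(buffer))
```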
Inference using NVIDIA Triton¶
NVIDIA Triton Inference Server is an open-source platform designed to simplify the deployment of AI models across cloud, data center, edge, and embedded environments. It supports a wide range of frameworks including TensorFlow, PyTorch, TensorRT, and ONNX, enabling developers to serve models using a unified interface. With advanced features like dynamic batching, model ensembling, and concurrent model execution, Triton ensures high-performance and scalable AI inference.
Triton supports multiple backends, each corresponding to a specific machine learning framework or execution environment. These include TensorFlow, PyTorch, ONNX Runtime, TensorRT, OpenVINO, and even a Python backend for custom logic. Each backend is implemented as a modular plugin, allowing Triton to run models from different frameworks simultaneously. This backend flexibility makes it easy to deploy heterogeneous AI workloads on a single platform.
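As a brief sketch of how an application might query a model served by Triton, the following uses the tritonclient HTTP API; the model name and tensor names are hypothetical and must match the model's config.pbtxt in the Triton model repository.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server reachable on the default HTTP port.
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input buffer; name, shape, and dtype must match the served model.
samples = np.random.randn(1, 2048).astype(np.float32)
infer_input = httpclient.InferInput("INPUT__0", list(samples.shape), "FP32")
infer_input.set_data_from_numpy(samples)

result = client.infer(model_name="example_model", inputs=[infer_input])
print(result.as_numpy("OUTPUT__0"))
```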
On AirStack Core, Triton may be used to deploy and execute AI models for real-time signal processing. A complete tutorial for running NVIDIA Triton in AirStack Core may be found here: