AI Inference

The AIR-T with AirStack Core is primarily used to deploy trained AI models and perform inference with them. This may be done using a number of workflows, each of which involves training a model offline and then deploying it in AirStack Core.

As an example, referring back to the AirStack Core Application Flow Diagram, an application developer may want to "Perform Inference" using a model that was trained offline. The recommended workflows for executing model inference are:

  1. Inference Using AI Framework Directly (e.g., PyTorch)
  2. Inference using ONNX Runtime
  3. Inference using TensorRT
  4. Inference using NVIDIA Triton

AirStack Core is a flexible tool that supports a wide range of other workflows not documented here. If you already have a workflow you are comfortable with, it is likely possible on AirStack Core.

The following sections cover the four recommended workflows listed above.

Inference Using AI Framework Directly

This section covers AI Toolboxes that are installable within the AirStack Core environment to perform inference directly on the hardware.

AirStack Core supports directly performing inference with most toolboxes and workflows. While each of these approaches may have different implementation specifics, all of them may be integrated into the AirStack Core inference workflow. Here is a list of some of the most common AI Toolboxes.

| Toolbox | Usage | Documentation | Tutorial Link |
| --- | --- | --- | --- |
| PyTorch | Dynamic framework for research and production | 🔗 | Coming Soon¹ |
| TensorFlow | Scalable AI platform for production | 🔗 | Coming Soon¹ |
| ONNX Runtime | Cross-platform inference engine for ONNX models | 🔗 | Coming Soon¹ |
| TensorRT | High-performance AI inference optimizer | 🔗 | 🔗 |
| Triton² | Multi-framework inference server for AI | 🔗 | 🔗 |

Note that MATLAB does not support ARM architectures, therefore it cannot run directly on AirStack Core. Currently, we provide a tutorial for performing training on a separate computer and deploying the trained model in AirStack Core, found here.
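
As a concrete illustration of the direct-inference workflow, the minimal sketch below loads a trained PyTorch model from a TorchScript file and runs it on a buffer of samples. The file name `model.pt`, the input shape, and the use of the GPU are hypothetical placeholders; substitute your own trained model and data, and a CUDA-enabled PyTorch build is assumed.

```python
import torch

# Hypothetical TorchScript model exported during offline training
model = torch.jit.load("model.pt", map_location="cuda")
model.eval()

# Placeholder input buffer: 4 buffers of 4096 floats each
buf = torch.randn(4, 4096, device="cuda")

with torch.no_grad():  # disable autograd for inference
    out = model(buf)

print(out.cpu().numpy())
```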

Inference using ONNX Runtime

Open Neural Network Exchange (ONNX) is an open-source format designed to enable interoperability between different deep learning frameworks. It allows developers to train models in tools like PyTorch or TensorFlow and export them for inference in platforms such as ONNX Runtime, TensorRT, or other optimized engines. ONNX simplifies model deployment across diverse hardware and software environments, making AI development more flexible and portable.

Deployment Workflow for ONNX Runtime
flowchart LR
  Train["**Train Model**<br>(AI Toolboxes)"] ==>
  Export["**Export Model<br>(ONNX File)**"] ==>
  Deploy["**Perform Inference**<br>(ONNX Runtime)"]
  subgraph **AirStack Core**
    Deploy   
  end
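
The "Export Model" step in the diagram above is typically a single call in the training framework. The sketch below shows one way to do it from PyTorch with `torch.onnx.export`; the `TinyNet` model, tensor shapes, and file name are hypothetical stand-ins for your own trained network.

```python
import torch

class TinyNet(torch.nn.Module):
    """Hypothetical stand-in for a trained model."""
    def __init__(self):
        super().__init__()
        self.fc = torch.nn.Linear(4096, 2)

    def forward(self, x):
        return self.fc(x)

model = TinyNet().eval()
dummy = torch.randn(1, 4096)  # example input used to trace the graph

torch.onnx.export(
    model, dummy, "model.onnx",                 # file read by ONNX Runtime / TensorRT
    input_names=["input"], output_names=["output"],
    dynamic_axes={"input": {0: "batch"}},       # allow variable batch size
)
```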

Models saved in the ONNX format may be optimized using TensorRT (next section); however, users may want to run the model directly using ONNX Runtime. ONNX Runtime is a high-performance inference engine built to run ONNX models efficiently across a wide range of hardware platforms. Developed by Microsoft, it supports CPUs, GPUs, and specialized accelerators, and is optimized for both cloud and edge deployment. With features like model quantization, graph optimization, and hardware-specific execution providers (including NVIDIA TensorRT, Intel OpenVINO, and others), ONNX Runtime ensures fast, scalable, and production-ready AI inference.
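
A minimal ONNX Runtime inference sketch is shown below. It assumes a GPU-enabled `onnxruntime` build and a model exported with tensors named `input` and `output` (as in the export sketch above); the provider list falls back to the CPU if the CUDA provider is unavailable, and a `TensorrtExecutionProvider` may also be listed when the TensorRT execution provider is installed.

```python
import numpy as np
import onnxruntime as ort

# Prefer the GPU execution provider when available, fall back to CPU
providers = ["CUDAExecutionProvider", "CPUExecutionProvider"]
session = ort.InferenceSession("model.onnx", providers=providers)

# Placeholder input buffer matching the exported model's shape
buf = np.random.randn(1, 4096).astype(np.float32)

outputs = session.run(["output"], {"input": buf})
print(outputs[0])
```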

Inference using TensorRT

TensorRT is NVIDIA’s high-performance inference engine optimized for deploying AI models at the edge with NVIDIA Jetson platforms. It delivers ultra-low latency and high throughput by optimizing neural networks for Jetson’s GPU and DLA accelerators. Supporting FP32, FP16, and INT8 precision, TensorRT enables efficient, real-time AI performance in power-constrained environments such as the AIR-T with AirStack Core.

The workflow for creating a deep learning application using TensorRT consists of three phases: training, optimization, and deployment. These steps are illustrated in the figure below and covered in this section.


Deployment Workflow for TensorRT
flowchart LR
  Train["**Train Model**<br>(AI Toolboxes)"] ==>
  Export["**Export Model**<br>(ONNX File)"] ==>
  Optimize["**Optimize Model**<br>(TensorRT)"]
  subgraph **AirStack Core**
    Optimize ==>
    Deploy["**Perform Inference**<br>(TensorRT)"]
  end

When training a model for optimization and execution using TensorRT, make sure that the layers being used are supported by your version of TensorRT. To determine what version of TensorRT is installed on your version of AirStack Core, open a terminal and run:

$ dpkg -l | grep TensorRT
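
Alternatively, if the TensorRT Python bindings are installed, the version can be queried directly from Python:

```python
import tensorrt

# Report the installed TensorRT version
print(tensorrt.__version__)
```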

The supported layers for your version of TensorRT may be found in the TensorRT SDK Documentation under the TensorRT Support Matrix section.
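
As a rough sketch of the "Optimize Model" step, the code below uses the TensorRT Python API (TensorRT 8.x style; exact calls vary between releases) to parse an ONNX file and serialize an optimized engine. The file names are placeholders, and FP16 is only enabled when the hardware reports support for it.

```python
import tensorrt as trt

TRT_LOGGER = trt.Logger(trt.Logger.WARNING)

def build_engine(onnx_path="model.onnx", engine_path="model.plan"):
    builder = trt.Builder(TRT_LOGGER)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, TRT_LOGGER)

    # Parse the ONNX model exported by the training framework
    with open(onnx_path, "rb") as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError("Failed to parse ONNX model")

    config = builder.create_builder_config()
    if builder.platform_has_fast_fp16:
        config.set_flag(trt.BuilderFlag.FP16)  # reduced precision for speed

    # Build and save the serialized engine for later deployment
    engine = builder.build_serialized_network(network, config)
    if engine is None:
        raise RuntimeError("Engine build failed")
    with open(engine_path, "wb") as f:
        f.write(engine)

build_engine()
```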

AirPack is an add-on software package (not included with AirStack Core) that provides source code for the complete training-to-deployment workflow described in this section.

TensorRT Source Code Examples

Deepwave provides a source code toolbox to demonstrate the recommended training to inference workflow for deploying a neural network in AirStack Core using TensorRT:

AIR-T Deep Learning Inference Examples.

The toolbox demonstrates how to create a simple neural network for the following AI Toolboxes:

Installation of these packages for training is made easy by the included .yml file, which creates a conda environment. For inference execution, all Python packages and dependencies are pre-installed on AIR-Ts running AirStack Core 0.3+.

In the example code, the neural network accepts an arbitrary-length input buffer and has a single output node that calculates the average of the instantaneous power across each batch of the input buffer. While this is not a typical neural network model, it is an excellent example of how customers may deploy their trained models on the AIR-T for inference.
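
For reference, a network with that behavior can be written in a few lines. The sketch below is an illustrative PyTorch version (not the code from the toolbox itself) and assumes the input buffer contains interleaved I/Q samples.

```python
import torch

class AvgPowerNet(torch.nn.Module):
    """Illustrative model: average instantaneous power per batch of an
    interleaved I/Q buffer of arbitrary length (shape: batch x 2N)."""
    def forward(self, x):
        i, q = x[..., 0::2], x[..., 1::2]  # de-interleave I and Q samples
        return (i * i + q * q).mean(dim=-1, keepdim=True)

model = AvgPowerNet().eval()
buf = torch.randn(4, 4096)  # 4 buffers of 2048 complex samples each
print(model(buf))           # one average-power value per buffer
```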

Inference using NVIDIA Triton

NVIDIA Triton Inference Server is an open-source platform designed to simplify the deployment of AI models across cloud, data center, edge, and embedded environments. It supports a wide range of frameworks including TensorFlow, PyTorch, TensorRT, and ONNX, enabling developers to serve models using a unified interface. With advanced features like dynamic batching, model ensembling, and concurrent model execution, Triton ensures high-performance and scalable AI inference.

Triton supports multiple backends, each corresponding to a specific machine learning framework or execution environment. These include TensorFlow, PyTorch, ONNX Runtime, TensorRT, OpenVINO, and even a Python backend for custom logic. Each backend is implemented as a modular plugin, allowing Triton to run models from different frameworks simultaneously. This backend flexibility makes it easy to deploy heterogeneous AI workloads on a single platform.
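
Once a model is placed in a Triton model repository and the server is running, inference requests can be made over HTTP or gRPC with the `tritonclient` package. The sketch below assumes a server running on its default HTTP port and a hypothetical model named `my_model` with tensors named `input` and `output`; adjust these to match your model configuration.

```python
import numpy as np
import tritonclient.http as httpclient

# Connect to a Triton server running on the default HTTP port
client = httpclient.InferenceServerClient(url="localhost:8000")

# Placeholder input buffer; shape and dtype must match the model config
data = np.random.randn(1, 4096).astype(np.float32)
infer_input = httpclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

result = client.infer(
    model_name="my_model",
    inputs=[infer_input],
    outputs=[httpclient.InferRequestedOutput("output")],
)
print(result.as_numpy("output"))
```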

On AirStack Core, Triton may be used to deploy and execute AI models for real-time signal processing. A complete tutorial for running NVIDIA Triton in AirStack Core may be found here:

πŸ”— NVIDIA Triton Tutorial on AirStack Core


  1. Supported for directly performing inference using AirStack Core. Currently, we provide a tutorial for performing training on a separate computer and deploying the trained model in AirStack Core, found here

  2. Note that Triton supports running models in the most common frameworks as shown here