NVIDIA TensorRT RTX Execution Provider

The NVIDIA TensorRT RTX Execution Provider (EP) is an inference deployment solution designed specifically for NVIDIA RTX GPUs. It is optimized for client-centric use cases.

TensorRT RTX EP provides the following benefits:

  • Small package footprint: Optimized resource usage on end-user systems at just under 200 MB.
  • Faster model compile and load times: Leverages just-in-time compilation techniques to build RTX hardware-optimized engines on end-user devices in seconds.
  • Portability: Seamlessly use cached models across multiple RTX GPUs.

The TensorRT RTX EP leverages NVIDIA’s new deep learning inference engine, TensorRT for RTX, to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX EP with ONNX Runtime.

Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures.

For a full compatibility and support matrix, please refer to this page.

Install

Currently, the TensorRT RTX EP must be built from source. Support for installation from package managers such as PyPI and NuGet is coming soon. See the WinML install section for WinML-related installation instructions.

Build from source

Information on how to build the TensorRT RTX EP from source can be found here.

Usage

C/C++

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "SampleApp");
Ort::SessionOptions session_options;
// Register the TensorRT RTX EP with default provider options.
session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, {});
Ort::Session session(env, model_path, session_options);

Python

Register the TensorRT RTX EP by specifying it in the providers argument when creating an InferenceSession.

import onnxruntime as ort
session = ort.InferenceSession(model_path, providers=['NvTensorRTRTXExecutionProvider'])

Features

CUDA Graph

CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured at once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from this blog.

Usage

CUDA Graph can be enabled by setting a provider option. By default, ONNX Runtime uses a graph annotation ID of 0 and starts capturing graphs. You can control the annotation ID at run time by setting the run option gpu_graph_id; a gpu_graph_id of -1 indicates that no graph will be captured or replayed for that specific run (see the Python example below).

Python

import onnxruntime as ort

trt_rtx_provider_options = {'enable_cuda_graph': True}
providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)]
session = ort.InferenceSession("model.onnx", providers=providers)
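
Continuing from the session created above, the gpu_graph_id run option described earlier can be set per run. This is a minimal sketch; the input name "input" and its data are assumptions and must match your model:

run_options = ort.RunOptions()
# -1: do not capture or replay a CUDA Graph for this particular run
run_options.add_run_config_entry("gpu_graph_id", "-1")
outputs = session.run(None, {"input": input_data}, run_options)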

C/C++

const auto& api = Ort::GetApi();
Ort::SessionOptions session_options;
// Enable CUDA Graph capture via the corresponding provider option key.
const char* keys[]   = {onnxruntime::nv::provider_option_names::kCudaGraphEnable};
const char* values[] = {"1"};
OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1);
// Check and release the returned status in production code (e.g., by wrapping it in Ort::Status).
Ort::Session session(env, model_path, session_options);

ONNXRuntime Perf Test

onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx"

Effectively Using CUDA Graphs

CUDA Graph can be beneficial when execution patterns are static and involve many small GPU kernels. This feature helps reduce CPU overhead and improve GPU utilization, particularly for static execution plans run more than twice.

Avoid enabling CUDA Graph or proceed with caution if:

  • Input shapes or device bindings frequently change.
  • The control flow is conditional and data-dependent.

EP context model

EP context nodes wrap precompiled, execution-provider-specific optimized formats. They allow a standard ONNX model to be compiled once, so that every subsequent load of the same unchanged model is as fast as possible.

TensorRT RTX splits compilation into two distinct phases:

  • Ahead-of-Time (AOT): The ONNX model is compiled into an optimized binary blob, and stored as an EP context model.
  • Just-in-Time (JIT): At inference time, the EP context model is loaded and TensorRT RTX dynamically compiles the binary blob (engine) to optimize it for the exact GPU hardware being used.

Generating EP Context Models

ONNX Runtime 1.22 introduced dedicated Compile APIs to simplify the generation of EP context models:

// AOT phase: session_options must already have the TensorRT RTX EP appended (see Usage above).
Ort::ModelCompilationOptions compile_options(env, session_options);
compile_options.SetInputModelPath(input_model_path);
compile_options.SetOutputModelPath(compile_model_path);

Ort::Status status = Ort::CompileModel(env, compile_options);

After successful generation, the EP context model can be directly loaded for inference:

// JIT phase
Ort::Session session(env, compile_model_path, session_options);

This leads to a considerable reduction in session creation time, improving the overall user experience.

The JIT time can be further reduced by using a runtime cache. When a runtime cache directory is configured, a per-model cache is created inside it; this cache stores the compiled CUDA kernels and reduces session load time. Learn more about the process here.

For a practical example of usage for EP context, please refer to:

There are two other ways to quickly generate an EP context model:

ONNXRuntime Perf Test

onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" "/path/to/model.onnx"

Python Script

python tools/python/compile_ep_context_model.py -i "path/to/model.onnx" -o "/path/to/model_ctx.onnx"

NVIDIA recommended settings

  • For models larger than 2 GB, set embed_mode = 0 in the model compilation options. If the binary blob is embedded within the EP context model, generation fails for models larger than 2 GB due to protobuf limitations:
    Ort::ModelCompilationOptions compile_options(env, session_options);
    compile_options.SetEpContextEmbedMode(0);
    

Runtime cache

Runtime caches help reduce JIT compilation time. When a user compiles an EP context and loads the resulting model for the first time, the system generates specialized CUDA kernels for the GPU. By setting the provider option "nv_runtime_cache_path" to a directory, a cache is created for each TensorRT RTX engine in an EP context node. On subsequent loads, this cache allows the system to quickly deserialize precompiled kernels instead of compiling them again. This is especially helpful for large models with many different operators, such as SD 1.5, which includes a mix of Conv and MatMul operations. The cache only contains compiled kernels. No information about the model’s graph structure or weights is stored.
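
A minimal Python sketch of enabling the runtime cache via the nv_runtime_cache_path provider option; the cache directory path and the EP context model filename below are assumptions:

import onnxruntime as ort

provider_options = {'nv_runtime_cache_path': './rtx_runtime_cache'}  # hypothetical cache directory
session = ort.InferenceSession(
    "model_ctx.onnx",  # previously compiled EP context model
    providers=[('NvTensorRTRTXExecutionProvider', provider_options)]
)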

Execution Provider Options

The TensorRT RTX EP provides the following user-configurable options via the Execution Provider Options:

| Parameter | Type | Description | Default |
|---|---|---|---|
| device_id | int | GPU device identifier | 0 |
| user_compute_stream | str | Specify the compute stream to run the GPU workload on | "" |
| nv_max_workspace_size | int | Maximum TensorRT engine workspace (bytes) | 0 (auto) |
| nv_max_shared_mem_size | int | Maximum shared memory size available to the TensorRT engine (bytes) | 0 (auto) |
| nv_dump_subgraphs | bool | Enable subgraph dumping for debugging | false |
| nv_detailed_build_log | bool | Enable detailed build logging | false |
| enable_cuda_graph | bool | Enable CUDA Graph to reduce inference overhead; helpful for smaller models | false |
| profile_min_shapes | str | Comma-separated list of input tensor shapes for the minimum optimization profile. Format: "input1:dim1xdim2x...,input2:dim1xdim2x..." | "" (auto) |
| profile_max_shapes | str | Comma-separated list of input tensor shapes for the maximum optimization profile. Format: "input1:dim1xdim2x...,input2:dim1xdim2x..." | "" (auto) |
| profile_opt_shapes | str | Comma-separated list of input tensor shapes for the optimal optimization profile. Format: "input1:dim1xdim2x...,input2:dim1xdim2x..." | "" (auto) |
| nv_multi_profile_enable | bool | Enable support for multiple optimization profiles in the TensorRT engine; allows dynamic input shapes for different inference requests | false |
| nv_use_external_data_initializer | bool | Use an external data initializer for model weights; useful for large EP context models with external data files | false |
| nv_runtime_cache_path | str | Path to store the runtime cache; setting this enables faster model loading by caching JIT-compiled kernels for each TensorRT RTX engine | "" (disabled) |

Python API example:

import onnxruntime as ort

model_path = '/path/to/model'

# note: for bool type options in the Python API, set them as False/True
# 'user_compute_stream' expects the CUDA stream handle as a string,
# e.g. str(torch.cuda.current_stream().cuda_stream) when using PyTorch
provider_options = {
  'device_id': 0,
  'nv_dump_subgraphs': False,
  'nv_detailed_build_log': True,
  'user_compute_stream': stream_handle
}

session_options = ort.SessionOptions()
session = ort.InferenceSession(model_path, sess_options=session_options, providers=[('NvTensorRTRTXExecutionProvider', provider_options)])

C++ API example:

Ort::SessionOptions session_options;

// define a CUDA stream
cudaStream_t cuda_stream;
cudaStreamCreate(&cuda_stream);

// pass the stream handle to the EP as a decimal string
char stream_handle[32];
sprintf_s(stream_handle, "%llu", (uint64_t)cuda_stream);

std::unordered_map<std::string, std::string> provider_options;
provider_options[onnxruntime::nv::provider_option_names::kDeviceId] = "1";
provider_options[onnxruntime::nv::provider_option_names::kUserComputeStream] = stream_handle;

session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, provider_options);

NOTE: For bool type options, assign True/False in Python, or the strings "1"/"0" in C++.

Profile shape options

  • Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided.
    • By default TensorRT RTX engines support dynamic shapes. For additional performance improvements, you can specify one or multiple explicit ranges of shapes.
    • The format of the profile shapes is input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...
      • These three flags must be provided in order to enable explicit profile shapes.
    • Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor.
    • See the TensorRT for RTX documentation on optimization profiles for more details; a Python example follows below.
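
A minimal Python sketch of passing explicit profile shapes as provider options, using the format described above; the input names and dimensions are assumptions and must match your model:

import onnxruntime as ort

provider_options = {
    'profile_min_shapes': 'input_ids:1x128,attention_mask:1x128',  # hypothetical inputs/dims
    'profile_opt_shapes': 'input_ids:4x256,attention_mask:4x256',
    'profile_max_shapes': 'input_ids:8x512,attention_mask:8x512',
}
session = ort.InferenceSession("model.onnx",
                               providers=[('NvTensorRTRTXExecutionProvider', provider_options)])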

Performance test

When using onnxruntime_perf_test, use the flag -e nvtensorrtrtx.

Plugins Support

TensorRT RTX doesn’t support plugins.