NVIDIA TensorRT RTX Execution Provider

The NVIDIA TensorRT RTX Execution Provider (EP) is an inference deployment solution designed specifically for NVIDIA RTX GPUs. It is optimized for client-centric use cases.

TensorRT RTX EP provides the following benefits:

  • Small package footprint: Optimized resource usage on end-user systems at just under 200 MB.
  • Faster model compile and load times: Leverages just-in-time compilation techniques to build RTX hardware-optimized engines on end-user devices in seconds.
  • Portability: Seamlessly use cached models across multiple RTX GPUs.

The TensorRT RTX EP leverages NVIDIA’s new deep learning inference engine, TensorRT for RTX, to accelerate ONNX models on RTX GPUs. Microsoft and NVIDIA collaborated closely to integrate the TensorRT RTX EP with ONNX Runtime.

Currently, TensorRT RTX supports RTX GPUs based on Ampere and later architectures.

For a full compatibility and support matrix, please refer to this page.

Install

Currently, the TensorRT RTX EP must be built from source. Support for installation from package managers such as PyPI and NuGet is coming soon. See the WinML install section for WinML-related installation instructions.

Build from source

Information on how to build the TensorRT RTX EP from source can be found here.

Usage

C/C++

#include <onnxruntime_cxx_api.h>

Ort::Env env(ORT_LOGGING_LEVEL_WARNING, "SampleApp");
Ort::SessionOptions session_options;
// Register the TensorRT RTX EP with default provider options.
session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, {});
Ort::Session session(env, model_path, session_options);

Python

Register the TensorRT RTX EP by specifying it in the providers argument when creating an InferenceSession.

import onnxruntime as ort
session = ort.InferenceSession(model_path, providers=['NvTensorRTRTXExecutionProvider'])

Features

CUDA Graph

CUDA Graph is a representation of a sequence of GPU operations, such as kernel launches and memory copies, captured from a CUDA stream. Instead of the CPU launching each operation individually, the entire sequence is captured at once and then replayed on the GPU. This process significantly reduces CPU overhead and improves GPU utilization. Find out more details about CUDA Graphs from this blog.

Usage

CUDA Graph can be enabled by setting a provider option. By default, ONNX Runtime uses a graph annotation ID of 0 and starts capturing graphs. You can control the annotation ID at run time by setting the run option gpu_graph_id; a gpu_graph_id of -1 indicates that no graph will be captured or replayed for that specific run (see the Python example below).

Python

import onnxruntime as ort

trt_rtx_provider_options = {'enable_cuda_graph': True}
providers = [('NvTensorRTRTXExecutionProvider', trt_rtx_provider_options)]
session = ort.InferenceSession("model.onnx", providers=providers)
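
Continuing from the session created above, the gpu_graph_id run option described earlier can be set per run. This is a minimal sketch; the input name "input" and its data are assumptions and must match your model:

run_options = ort.RunOptions()
# -1: do not capture or replay a CUDA Graph for this particular run
run_options.add_run_config_entry("gpu_graph_id", "-1")
outputs = session.run(None, {"input": input_data}, run_options)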

C/C++

const auto& api = Ort::GetApi();
Ort::SessionOptions session_options;
// Enable CUDA Graph capture via the corresponding provider option key.
const char* keys[]   = {onnxruntime::nv::provider_option_names::kCudaGraphEnable};
const char* values[] = {"1"};
OrtStatus* status = api.SessionOptionsAppendExecutionProvider(session_options, onnxruntime::kNvTensorRTRTXExecutionProvider, keys, values, 1);
// Check and release the returned status in production code (e.g., by wrapping it in Ort::Status).
Ort::Session session(env, model_path, session_options);

ONNXRuntime Perf Test

onnxruntime_perf_test.exe -I -t 5 -e nvtensorrtrtx -i "enable_cuda_graph|1" "model.onnx"

Effectively Using CUDA Graphs

CUDA Graph can be beneficial when execution patterns are static and involve many small GPU kernels. This feature helps reduce CPU overhead and improve GPU utilization, particularly for static execution plans run more than twice.

Avoid enabling CUDA Graph or proceed with caution if:

  • Input shapes or device bindings frequently change.
  • The control flow is conditional and data-dependent.

EP context model

EP context nodes wrap precompiled, execution-provider-specific optimized formats. They allow a standard ONNX model to be compiled once, so that every subsequent load of the same unchanged model is as fast as possible.

TensorRT RTX splits compilation into two distinct phases:

  • Ahead-of-Time (AOT): The ONNX model is compiled into an optimized binary blob, and stored as an EP context model.
  • Just-in-Time (JIT): At inference time, the EP context model is loaded and TensorRT RTX dynamically compiles the binary blob (engine) to optimize it for the exact GPU hardware being used.

Generating EP Context Models

ONNX Runtime 1.22 introduced dedicated Compile APIs to simplify the generation of EP context models:

// AOT phase: session_options must already have the TensorRT RTX EP appended (see Usage above).
Ort::ModelCompilationOptions compile_options(env, session_options);
compile_options.SetInputModelPath(input_model_path);
compile_options.SetOutputModelPath(compile_model_path);

Ort::Status status = Ort::CompileModel(env, compile_options);

After successful generation, the EP context model can be directly loaded for inference:

// JIT phase
Ort::Session session(env, compile_model_path, session_options);

This leads to a considerable reduction in session creation time, improving the overall user experience.

The JIT time can be further reduced by using a runtime cache. When a runtime cache directory is configured, a per-model cache is created inside it; this cache stores the compiled CUDA kernels and reduces session load time. Learn more about the process here.

For a practical example of usage for EP context, please refer to:

There are two other ways to quickly generate an EP context model:

ONNXRuntime Perf Test

onnxruntime_perf_test.exe -e nvtensorrtrtx -I -r 1 --compile_ep_context --compile_model_path "/path/to/model_ctx.onnx" "/path/to/model.onnx"

Python Script

python tools/python/compile_ep_context_model.py -i "path/to/model.onnx" -o "/path/to/model_ctx.onnx"

NVIDIA recommended settings

  • For models larger than 2 GB, set embed_mode = 0 in the model compilation options. If the binary blob is embedded within the EP context model, generation fails for models larger than 2 GB due to protobuf limitations:
    Ort::ModelCompilationOptions compile_options(env, session_options);
    compile_options.SetEpContextEmbedMode(0);
    

Runtime cache

Runtime caches help reduce JIT compilation time. When a user compiles an EP context and loads the resulting model for the first time, the system generates specialized CUDA kernels for the GPU. By setting the provider option "nv_runtime_cache_path" to a directory, a cache is created for each TensorRT RTX engine in an EP context node. On subsequent loads, this cache allows the system to quickly deserialize precompiled kernels instead of compiling them again. This is especially helpful for large models with many different operators, such as SD 1.5, which includes a mix of Conv and MatMul operations. The cache only contains compiled kernels. No information about the model’s graph structure or weights is stored.
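
A minimal Python sketch of enabling the runtime cache via the nv_runtime_cache_path provider option; the cache directory path and the EP context model filename below are assumptions:

import onnxruntime as ort

provider_options = {'nv_runtime_cache_path': './rtx_runtime_cache'}  # hypothetical cache directory
session = ort.InferenceSession(
    "model_ctx.onnx",  # previously compiled EP context model
    providers=[('NvTensorRTRTXExecutionProvider', provider_options)]
)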

Execution Provider Options

The TensorRT RTX EP provides the following user-configurable options via the Execution Provider Options:

| Parameter | Type | Description | Default |
|---|---|---|---|
| device_id | int | GPU device identifier | 0 |
| user_compute_stream | str | Specify the compute stream to run the GPU workload on | "" |
| nv_max_workspace_size | int | Maximum TensorRT engine workspace (bytes) | 0 (auto) |
| nv_max_shared_mem_size | int | Maximum shared memory size available to the TensorRT engine (bytes) | 0 (auto) |
| nv_dump_subgraphs | bool | Enable subgraph dumping for debugging | false |
| nv_detailed_build_log | bool | Enable detailed build logging | false |
| enable_cuda_graph | bool | Enable CUDA Graph to reduce inference overhead; helpful for smaller models | false |
| profile_min_shapes | str | Comma-separated list of input tensor shapes for the minimum optimization profile. Format: "input1:dim1xdim2x...,input2:dim1xdim2x..." | "" (auto) |
| profile_max_shapes | str | Comma-separated list of input tensor shapes for the maximum optimization profile. Format: "input1:dim1xdim2x...,input2:dim1xdim2x..." | "" (auto) |
| profile_opt_shapes | str | Comma-separated list of input tensor shapes for the optimal optimization profile. Format: "input1:dim1xdim2x...,input2:dim1xdim2x..." | "" (auto) |
| nv_multi_profile_enable | bool | Enable support for multiple optimization profiles in the TensorRT engine; allows dynamic input shapes for different inference requests | false |
| nv_use_external_data_initializer | bool | Use an external data initializer for model weights; useful for large EP context models with external data files | false |
| nv_runtime_cache_path | str | Path to store the runtime cache; setting this enables faster model loading by caching JIT-compiled kernels for each TensorRT RTX engine | "" (disabled) |

Python API example:

import onnxruntime as ort

model_path = '/path/to/model'

# note: for bool type options in the Python API, set them as False/True
# 'user_compute_stream' expects the CUDA stream handle as a string,
# e.g. str(torch.cuda.current_stream().cuda_stream) when using PyTorch
provider_options = {
  'device_id': 0,
  'nv_dump_subgraphs': False,
  'nv_detailed_build_log': True,
  'user_compute_stream': stream_handle
}

session_options = ort.SessionOptions()
session = ort.InferenceSession(model_path, sess_options=session_options, providers=[('NvTensorRTRTXExecutionProvider', provider_options)])

C++ API example:

Ort::SessionOptions session_options;

// define a CUDA stream
cudaStream_t cuda_stream;
cudaStreamCreate(&cuda_stream);

// pass the stream handle to the EP as a decimal string
char stream_handle[32];
sprintf_s(stream_handle, "%llu", (uint64_t)cuda_stream);

std::unordered_map<std::string, std::string> provider_options;
provider_options[onnxruntime::nv::provider_option_names::kDeviceId] = "1";
provider_options[onnxruntime::nv::provider_option_names::kUserComputeStream] = stream_handle;

session_options.AppendExecutionProvider(onnxruntime::kNvTensorRTRTXExecutionProvider, provider_options);

NOTE: For bool type options, assign True/False in Python, or the strings "1"/"0" in C++.

Profile shape options

  • Description: build with explicit dynamic shapes using a profile with the min/max/opt shapes provided.
    • By default TensorRT RTX engines support dynamic shapes. For additional performance improvements, you can specify one or multiple explicit ranges of shapes.
    • The format of the profile shapes is input_tensor_1:dim_1xdim_2x...,input_tensor_2:dim_3xdim_4x...,...
      • These three flags must be provided in order to enable explicit profile shapes.
    • Note that multiple TensorRT RTX profiles can be enabled by passing multiple shapes for the same input tensor.
    • See the TensorRT for RTX documentation on optimization profiles for more details; a Python example follows below.
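
A minimal Python sketch of passing explicit profile shapes as provider options, using the format described above; the input names and dimensions are assumptions and must match your model:

import onnxruntime as ort

provider_options = {
    'profile_min_shapes': 'input_ids:1x128,attention_mask:1x128',  # hypothetical inputs/dims
    'profile_opt_shapes': 'input_ids:4x256,attention_mask:4x256',
    'profile_max_shapes': 'input_ids:8x512,attention_mask:8x512',
}
session = ort.InferenceSession("model.onnx",
                               providers=[('NvTensorRTRTXExecutionProvider', provider_options)])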

Performance test

When using onnxruntime_perf_test, use the flag -e nvtensorrtrtx.

Plugins Support

TensorRT RTX doesn’t support plugins.