Enhanced AI Efficiency through NVIDIA TensorRT 10.0’s Weight-Stripped Engines

  • NVIDIA introduces TensorRT 10.0, enhancing AI application deployment.
  • Weight-stripped engines focus solely on execution code, reducing shipment sizes by over 95%.
  • These engines minimize weight duplication, achieving over 95% compression for CNNs and LLMs.
  • Building and deploying weight-stripped engines ensures consistent performance and compatibility with next-generation GPUs.
  • Integration with ONNX Runtime simplifies deployment across diverse hardware.

Main AI News:

NVIDIA recently announced TensorRT 10.0, a significant upgrade to its inference library. The release introduces weight-stripped engines, aimed at streamlining AI application deployment. As detailed on the NVIDIA Technical Blog, these engines cut engine shipment sizes by more than 95% by retaining only the execution code.

Delving into Weight-Stripped Engines

The weight-stripped engines introduced in TensorRT 10.0 contain the execution code (CUDA kernels) while omitting nearly all of the network weights, which makes them far smaller than conventional engines. During the build, TensorRT strips every weight that is not strictly required, retaining only the small set of weights essential for optimal performance. Weight-stripped builds support ONNX models and other network definitions, and the stripped weights can be supplied later through refitting, without rebuilding the engine. The result is fast deserialization and full inference performance.

Advantages of Weight Stripping

Traditional TensorRT engines include all network weights, so shipping multiple hardware-specific engines meant shipping the same weights over and over, which inflated application binaries. Weight-stripped engines address this by eliminating weight duplication, achieving compression rates exceeding 95% for convolutional neural networks (CNNs) and large language models (LLMs). As a result, more AI functionality can be packed into an application without inflating its size. These engines also remain compatible across TensorRT minor version updates and run on a lean runtime footprint of approximately 40 MB.

Development and Deployment of Weight-Stripped Engines

Building a weight-stripped engine uses the real weights for optimization decisions, which ensures consistent performance when the engine is later refitted with those same weights. During the build, TensorRT streamlines computation by folding static nodes and applying fusion optimizations. In addition, TensorRT Cloud, currently in early access for select partners, can generate weight-stripped engines directly from ONNX models.
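A minimal sketch of what the build step can look like, assuming the TensorRT 10.0 Python bindings; the STRIP_PLAN and REFIT_IDENTICAL builder flags come from the TensorRT 10.0 API, and the file paths are placeholders.

```python
# Sketch: build a weight-stripped engine from an ONNX model (TensorRT 10.0).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
builder = trt.Builder(logger)
network = builder.create_network(0)
parser = trt.OnnxParser(network, logger)

# "model.onnx" is a placeholder path for the weighted ONNX model.
with open("model.onnx", "rb") as f:
    if not parser.parse(f.read()):
        for i in range(parser.num_errors):
            print(parser.get_error(i))
        raise RuntimeError("failed to parse ONNX model")

config = builder.create_builder_config()
# STRIP_PLAN drops the weights from the serialized plan; the build still
# uses the real weights for its optimization decisions.
config.set_flag(trt.BuilderFlag.STRIP_PLAN)
# REFIT_IDENTICAL declares the engine will only be refitted with the
# identical build-time weights, preserving full performance.
config.set_flag(trt.BuilderFlag.REFIT_IDENTICAL)

plan = builder.build_serialized_network(network, config)
with open("model_stripped.plan", "wb") as f:
    f.write(plan)
```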

Deploying these engines is straightforward. On the end-user device, applications can refit a weight-stripped engine with weights taken directly from the ONNX file within seconds. Once serialized, a refitted engine retains the fast deserialization TensorRT is known for, without paying the refit cost again. The lean TensorRT 10.0 runtime (~40 MB) supports this process and keeps applications compatible with next-generation GPUs without requiring application updates.
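A similarly hedged sketch of the on-device refit, again assuming the TensorRT 10.0 Python bindings; OnnxParserRefitter is the TensorRT 10.0 ONNX parser refitter, and the paths are placeholders.

```python
# Sketch: refit a weight-stripped engine on the end-user device from the
# original ONNX file (TensorRT 10.0).
import tensorrt as trt

logger = trt.Logger(trt.Logger.WARNING)
runtime = trt.Runtime(logger)

with open("model_stripped.plan", "rb") as f:
    engine = runtime.deserialize_cuda_engine(f.read())

refitter = trt.Refitter(engine, logger)
parser_refitter = trt.OnnxParserRefitter(refitter, logger)

# Pull the stripped weights back in from the ONNX model shipped or
# downloaded alongside the engine.
if not parser_refitter.refit_from_file("model.onnx"):
    raise RuntimeError("failed to load weights from ONNX file")
if not refitter.refit_cuda_engine():
    raise RuntimeError("refit failed")

# Re-serialize once so later launches deserialize at full speed with no
# recurring refit cost.
with open("model_refitted.plan", "wb") as f:
    f.write(engine.serialize())
```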

A recent case study on an NVIDIA GeForce RTX 4090 GPU showed a compression rate exceeding 99% with SDXL. The original NVIDIA blog post includes a table comparing compression rates across models.

Support for weight-stripped TensorRT-LLM engines is slated for an upcoming release, with internal builds already showing substantial compression gains across a range of LLMs.

Challenges and Future Prospects

At present, the weight-stripped functionality in TensorRT 10.0 only supports refitting with the identical build-time weights, a restriction that preserves maximum performance. Users cannot yet make layer-level decisions about which weights to strip, a limitation that may be addressed in future releases.

Seamless Integration with ONNX Runtime

TensorRT 10.0’s weight-stripped functionality is integrated into ONNX Runtime (ORT) starting with ORT 1.18.1. The integration exposes the same capability through the ORT APIs, reducing shipment sizes while covering a diverse range of customer hardware. Using EP context nodes, the ORT integration embeds serialized TensorRT engines within an ONNX model, removing the need for builder resources at session creation and significantly cutting setup time.
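As a rough sketch, enabling this through the ORT TensorRT execution provider might look like the following, assuming onnxruntime-gpu 1.18.1 or later; the provider option names follow the ORT TensorRT EP documentation, and the paths are placeholders.

```python
# Sketch: weight-stripped engines via ONNX Runtime's TensorRT EP (ORT >= 1.18.1).
import onnxruntime as ort

provider_options = {
    # Build and consume weight-stripped engines inside the TensorRT EP.
    "trt_weight_stripped_engine_enable": True,
    # Folder holding the original weighted ONNX model used for refitting.
    "trt_onnx_model_folder_path": "./models",
    # Dump the compiled engine as an EP context node embedded in an ONNX
    # model, so later sessions skip the builder entirely.
    "trt_dump_ep_context_model": True,
    "trt_ep_context_file_path": "./models/model_ctx.onnx",
}

session = ort.InferenceSession(
    "./models/model.onnx",
    providers=[("TensorrtExecutionProvider", provider_options)],
)
```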

Conclusion:

The weight-stripped engines in NVIDIA TensorRT 10.0 mark a significant advancement in AI deployment, offering a streamlined workflow and substantial compression benefits. This innovation enables more efficient and compact AI application deployment, catering to the evolving needs of diverse hardware ecosystems.

Source