Nvidia’s New TensorRT-LLM Software Pushes Limits of AI Chip Performance
Nvidia has released an open-source software suite that can multiply the inference speed of large language models (LLMs) running on Nvidia graphics processing units (GPUs). The new TensorRT-LLM software improves LLM efficiency and inference throughput, which the tech giant sees as a way of encouraging further enterprise adoption of LLMs.
As LLMs grow more powerful, their massive size makes inference expensive and difficult to deploy. TensorRT-LLM leverages Nvidia’s GPUs and compilers to radically improve LLM speed and usability. The TensorRT-based platform minimizes coding requirements and offloads performance optimizations to the software. The company’s TensorRT deep learning compiler and other techniques allow the LLMs to run across multiple GPUs without any code changes. Nvidia teamed up with several major LLM developers for the project, including Meta, Databricks, and Grammarly, integrating multiple model options into the new software library.
“Models are increasing in complexity and as they get smarter, they get bigger, which is natural, but as they expand beyond the scope of a single GPU and have to run across multiple GPUs, that becomes a problem,” said Ian Buck, vice president of hyperscale and high-performance computing at Nvidia, in a press briefing. “Compared to the original A100 performance we were experiencing just last year, the combination of Hopper plus the TensorRT-LLM software has improved LLM inference performance on large language models by eight times.”
For text summarization, TensorRT-LLM quadrupled throughput on the GPT-J 6B model on new H100 GPUs. Meta’s Llama 2 model ran 4.6 times faster than it did on A100 GPUs. Additionally, the software supports “in-flight batching” to dynamically manage variable inference loads. Rather than waiting for an entire batch to finish, it slots new requests into the batch as others complete. Nvidia said this can double throughput on real-world workloads.
“In-flight batching allows work to enter the GPU and exit the GPU independent of other tasks,” Buck said. “With TensorRT-LLM and in-flight batching, work can enter and leave the batch independently and asynchronously to keep the GPU 100% occupied.”
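To make the idea concrete, here is a toy simulation that compares the two scheduling styles. It is only a sketch under simplifying assumptions, not TensorRT-LLM's actual scheduler: every request is reduced to the number of decode steps it needs, each step costs the same regardless of how full the batch is, and the function names `static_batching_steps` and `in_flight_batching_steps` are hypothetical names chosen for this illustration.

```python
from collections import deque


def static_batching_steps(lengths, batch_size):
    """Decode steps needed when each batch must fully finish before the next starts."""
    total = 0
    for i in range(0, len(lengths), batch_size):
        # The whole batch is held until its longest request completes.
        total += max(lengths[i:i + batch_size])
    return total


def in_flight_batching_steps(lengths, batch_size):
    """Decode steps needed when a finished request's slot is refilled right away."""
    queue = deque(lengths)  # requests waiting, as remaining decode steps
    active = []             # remaining decode steps of requests in the batch
    steps = 0
    while queue or active:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one step left finish now; the others advance by one.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


# Hypothetical workload: output lengths (in decode steps) of eight requests.
workload = [2, 8, 3, 8, 2, 2, 3, 2]
print("static:   ", static_batching_steps(workload, batch_size=2))
print("in-flight:", in_flight_batching_steps(workload, batch_size=2))
```

With this mixed workload, the in-flight variant finishes in fewer steps because short requests no longer hold a slot idle while a long one completes. When all requests have the same length, the two approaches take the same number of steps, which is why the benefit depends on how variable the real workload is.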
Nvidia began showcasing its generative AI cloud services earlier this year. Since then, it has rapidly scaled up its operations, including working with Hugging Face to develop Training Cluster as a Service, a tool for streamlining enterprise LLM creation. Nvidia sees open-sourcing TensorRT-LLM as a way of providing a unified solution for training and deployment. It gives researchers and companies an on-ramp to otherwise complex LLMs. Nvidia wants to lower the technical hurdles to leveraging LLMs. If successful, TensorRT-LLM could help democratize access to technology some view as too costly and exclusive. Early access to TensorRT-LLM is now available on GitHub and Nvidia NGC. General release is expected soon as part of the Nvidia NeMo AI framework.