Nvidia is teaming up with a list of tech partners on a game-changing piece of software set to double the performance of its flagship H100 Tensor Core GPU.
The open source TensorRT-LLM update, which is set for release in the coming weeks, enables H100 GPUs to outperform the A100 by eight times, where H100s without the update outperform the A100 by only four times. The figures come from tests on GPT-J 6B, a model used to summarize articles from the CNN and Daily Mail dataset.
When tested on Meta’s Llama 2 LLM, the TensorRT-LLM-powered H100s outperformed the A100s by 4.6 times – versus 2.6 times before the update.
Nvidia H100s faster than ever
The versatile and dynamic nature of large language model (LLM) workloads can make requests difficult to batch and execute in parallel, meaning that some requests finish much earlier than others.
To address this, Nvidia and its partners equipped TensorRT-LLM with a more powerful scheduling technique called in-flight batching, which takes advantage of the fact that text generation can be broken down into multiple smaller subtasks.
Simply put, instead of waiting for an entire batch of requests to finish before starting the next one, the system evicts completed sequences from the batch immediately and slots in new requests while others are still in flight.
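To make the idea concrete, here is a minimal sketch of a continuous-batching loop in plain Python. It illustrates the scheduling concept only, not TensorRT-LLM's actual implementation; the Request class and step counts are hypothetical.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class Request:
    """Hypothetical request that needs `remaining` more generation steps."""
    rid: int
    remaining: int

def inflight_batching(pending: deque, max_batch: int = 4) -> None:
    """Toy in-flight (continuous) batching loop.

    Each iteration is one generation step across the active batch.
    Finished requests are evicted immediately and their slots refilled
    from the queue, so short requests never wait on long ones.
    """
    active, step = [], 0
    while pending or active:
        # Refill any free batch slots before the next step.
        while pending and len(active) < max_batch:
            active.append(pending.popleft())
        # One decoding step: every active request emits one token.
        for req in active:
            req.remaining -= 1
        step += 1
        for req in [r for r in active if r.remaining == 0]:
            print(f"step {step}: request {req.rid} finished, slot freed")
        active = [r for r in active if r.remaining > 0]

inflight_batching(deque(Request(i, n) for i, n in enumerate([2, 9, 3, 5, 4, 1])))
```

A static scheduler would run the first four requests for nine full steps before admitting the last two; here, request 4 starts as soon as request 0 finishes.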
TensorRT-LLM pairs the TensorRT deep learning compiler with optimized kernels, pre-processing and post-processing steps, and multi-GPU and multi-node communication primitives.
The result? Groundbreaking performance on Nvidia’s GPUs, paving the way for new large language model experimentation and quick customization.
The software uses tensor parallelism, in which individual weight matrices are split across devices, allowing efficient inference at scale: each model runs in parallel across multiple GPUs and multiple servers.
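As a rough illustration of the weight-matrix split (a NumPy sketch under assumed shapes, with the two “devices” simulated in one process), each device multiplies the input by its local shard of the columns and a gather reassembles the full output:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((1, 512))       # one token's activations
W = rng.standard_normal((512, 2048))    # full weight matrix of a linear layer

# Column-parallel split: each "device" holds half the columns of W.
W_dev0, W_dev1 = np.split(W, 2, axis=1)

# Each device multiplies the same input by its local shard...
y_dev0 = x @ W_dev0
y_dev1 = x @ W_dev1

# ...and an all-gather (a concatenation here) reassembles the output.
y = np.concatenate([y_dev0, y_dev1], axis=1)

assert np.allclose(y, x @ W)            # matches the single-device result
```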
TensorRT-LLM includes fully optimized, ready-to-run versions of Llama 2, GPT-2 and GPT-3, as well as Falcon, Mosaic MPT, BLOOM and dozens of other popular LLMs, all accessible through the Python API.
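Access through the Python API can look something like the following sketch, based on TensorRT-LLM's high-level LLM interface; the exact class names, parameters and model identifier are assumptions that may differ between releases, so check the repository's examples for the version you install.

```python
from tensorrt_llm import LLM, SamplingParams

# Load one of the pre-optimized models; this identifier is illustrative.
llm = LLM(model="meta-llama/Llama-2-7b-hf")

params = SamplingParams(max_tokens=64, temperature=0.8)
outputs = llm.generate(["Summarize the following article: ..."], params)

for out in outputs:
    print(out.outputs[0].text)
```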
The update is available in early access and will soon be integrated into the Nvidia NeMo framework, which is part of Nvidia AI Enterprise. Researchers can access it through the NeMo framework, the NGC portal, or the source repository on GitHub.