Hopper: NVIDIA’s new GPU architecture
GTC, which began Monday and runs through Thursday, features more than 900 sessions. More than 200,000 developers, researchers, and data scientists from over 50 countries have registered for the event. At his GTC 2022 keynote, NVIDIA founder and CEO Jensen Huang announced a wealth of news in data center and high-performance computing, AI, design collaboration and digital twins, networking, automotive, robotics, and healthcare.
Huang’s framing was that “companies are processing, refining their data, making AI software … becoming intelligence manufacturers.” If the goal is to transform data centers into ‘AI Factories,’ as NVIDIA puts it, then placing transformers at the heart of this makes sense.
The centerpiece of the announcements was the new Hopper GPU architecture, which NVIDIA dubs “the next generation of accelerated computing.” Named for Grace Hopper, a pioneering U.S. computer scientist, the new architecture succeeds the NVIDIA Ampere architecture, launched two years ago. The company also announced its first Hopper-based GPU, the NVIDIA H100. NVIDIA claims that Hopper brings an order-of-magnitude performance leap over its predecessor, a feat it attributes to six breakthrough innovations. Let’s go through them, noting briefly how they compare to the competition.
First, manufacturing. Built with 80 billion transistors using a cutting-edge TSMC 4N process designed for NVIDIA’s accelerated compute needs, the H100 features major advances to accelerate AI, HPC, memory bandwidth, interconnect, and communication, including nearly 5 terabytes per second of external connectivity. On the manufacturing level, upstarts such as Cerebras and Graphcore have also been pushing the boundaries of what’s possible.
Second, Multi-Instance GPU (MIG). MIG technology allows a single GPU to be partitioned into seven smaller, fully isolated instances to handle different types of jobs. The Hopper architecture extends MIG capabilities by up to 7x over the previous generation by offering secure multi-tenant configurations in cloud environments across each GPU instance. Run:AI, an NVIDIA partner, offers something similar as a software layer, under the name of fractional GPU sharing.
Third, confidential computing. NVIDIA claims the H100 is the world’s first accelerator with confidential computing capabilities to protect AI models and customer data while they are being processed. Customers can also apply confidential computing to federated learning in privacy-sensitive industries like healthcare and financial services, as well as on shared cloud infrastructure. This is not a feature we have seen elsewhere.
Fourth, fourth-generation NVIDIA NVLink. To accelerate the largest AI models, NVLink combines with a new external NVLink Switch to extend NVLink as a scale-up network beyond the server, connecting up to 256 H100 GPUs at 9x higher bandwidth than the previous generation, which used NVIDIA HDR Quantum InfiniBand. Again, this is NVIDIA-specific, although competitors often leverage their own specialized infrastructure to connect their hardware, too.
Fifth, DPX instructions to accelerate dynamic programming. Dynamic programming is both a mathematical optimization method and a computer programming method, originally developed in the 1950s. In mathematical optimization, dynamic programming usually refers to simplifying a decision by breaking it down into a sequence of decision steps over time. In programming terms, it is mainly an optimization over plain recursion: overlapping subproblems are solved once and their results reused, as in the sketch below.
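To make the idea concrete, here is a minimal Python sketch of dynamic programming in action, computing the edit distance between two sequences by filling in a table of subproblems rather than recursing naively. It runs on a CPU and is only meant to illustrate the class of computation DPX instructions are designed to accelerate, such as Smith-Waterman sequence alignment in genomics; it does not use the instructions themselves.

```python
# Minimal illustration of dynamic programming: Levenshtein edit distance.
# Each cell of the table reuses previously computed subproblems instead of
# recomputing them recursively, which is the general pattern that DPX
# instructions are built to accelerate in hardware.

def edit_distance(a: str, b: str) -> int:
    # dp[i][j] = minimum number of edits to turn a[:i] into b[:j]
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i                            # delete all of a[:i]
    for j in range(len(b) + 1):
        dp[0][j] = j                            # insert all of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[len(a)][len(b)]

print(edit_distance("GATTACA", "GCATGCU"))  # prints 4
```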
NVIDIA notes that dynamic programming is used in a broad range of algorithms, including route optimization and genomics, and that DPX instructions can speed up execution by up to 40x compared with CPUs and up to 7x compared with previous-generation GPUs. We are not aware of a direct equivalent in the competition, although many AI chip upstarts also leverage parallelism.
The sixth innovation is the one we deem the most important: a new Transformer Engine. As NVIDIA notes, transformers are the standard model choice for natural language processing, and one of the most important deep learning models ever invented. The H100 accelerator’s Transformer Engine is built to speed these networks up by as much as 6x versus the previous generation without losing accuracy. This deserves further analysis.
The Transformer Engine at the heart of Hopper
Looking at the headline for the new Transformer Engine at the heart of NVIDIA’s H100, we were reminded of Intel architect Raja M. Koduri’s remarks to ZDNet’s Tiernan Ray. Koduri noted that the acceleration of matrix multiplications is now an essential measure of the performance and efficiency of chips, which means that every chip will be a neural net processor.
Koduri was spot on, of course. Besides Intel’s own efforts, this is what has been driving a new generation of AI chip designs from an array of upstarts. Seeing NVIDIA refer to a Transformer Engine made us wonder whether the company had radically redesigned its GPUs. GPUs were not originally designed for AI workloads, after all; they just happened to be good at them, and NVIDIA had the foresight and acumen to build an ecosystem around them.
Going deeper into NVIDIA’s own analysis of the Hopper architecture, however, the notion of a radical redesign is dispelled. While Hopper does introduce a new streaming multiprocessor (SM) with many performance and efficiency improvements, that is as far as it goes. That is not surprising, given the sheer weight of the ecosystem built around NVIDIA GPUs and the massive updates and potential incompatibilities a radical redesign would entail.
Breaking down the improvements in Hopper, memory seems to be a big part of it. As Facebook’s product manager for PyTorch, the popular machine learning training library, told ZDNet, “Models keep getting bigger and bigger, they are really, really big, and really expensive to train.” The biggest models these days often cannot be stored entirely in the memory circuits that accompany a GPU. Hopper comes with memory that is faster, larger, and shared among SMs.
Another boost comes from NVIDIA’s new fourth-generation Tensor Cores, which are up to 6x faster chip-to-chip compared to the A100. Tensor Cores are precisely what is used for matrix multiplications. In the H100, a new FP8 data type delivers compute that is four times faster than the previous generation’s 16-bit floating-point options. On equivalent data types, there is still a 2x speedup.
As for the so-called new Transformer Engine, it turns out this is the term NVIDIA uses to refer to “a combination of software and custom NVIDIA Hopper Tensor Core technology designed specifically to accelerate transformer model training and inference.” NVIDIA notes that the Transformer Engine intelligently manages and dynamically chooses between FP8 and 16-bit calculations, automatically handling re-casting and scaling between FP8 and 16-bit in each layer, delivering up to 9x faster AI training and up to 30x faster AI inference on large language models compared to the prior-generation A100. The sketch below illustrates the general idea of per-layer precision management at the software level.
So while this is not a radical redesign, the combination of performance and efficiency improvements results in a 6x speedup compared to Ampere, as NVIDIA’s technical blog elaborates. NVIDIA’s focus on improving performance for transformer models is not at all misplaced. Transformer models are the backbone of language models widely used today, such as BERT and GPT-3. Initially developed for natural language processing use cases, their versatility is increasingly being applied to computer vision, drug discovery, and more, as we have been documenting in our State of AI coverage. According to a metric shared by NVIDIA, 70% of published AI research in the last two years is based on transformers.
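As an illustration of the concept, here is a minimal PyTorch sketch using automatic mixed precision between FP32 and FP16. Hopper’s Transformer Engine performs an analogous dance between FP8 and 16-bit formats inside NVIDIA’s hardware and libraries; this snippet is only a software-level analogy, and the model dimensions and data are arbitrary placeholders.

```python
# Rough software-level analogue of per-layer precision management:
# PyTorch automatic mixed precision runs eligible ops in FP16 and uses
# dynamic loss scaling to preserve accuracy. Hopper's Transformer Engine
# applies a similar idea between FP8 and 16-bit formats in hardware;
# this is only an illustration of the concept, not the Hopper mechanism.
import torch
import torch.nn as nn

model = nn.TransformerEncoderLayer(d_model=512, nhead=8).cuda()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaling

x = torch.randn(32, 16, 512, device="cuda")   # (sequence, batch, features), random placeholder data

with torch.cuda.amp.autocast():                # ops run in FP16 where safe, FP32 otherwise
    loss = model(x).pow(2).mean()              # dummy loss, for illustration only

scaler.scale(loss).backward()                  # scale the loss to avoid FP16 underflow
scaler.step(optimizer)
scaler.update()
```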
The software side of things: good news for Apache Spark users
But what about the software side of things? In previous GTC announcements, software stack updates were a key part of the news. At this event, while NVIDIA-tuned heuristics that dynamically choose between FP8 and FP16 calculations are a key part of the new Transformer Engine internally, updates to the external-facing software stack seem less prominent in comparison.
NVIDIA’s Triton Inference Server and NeMo Megatron framework for training large language models are getting updates. So are Riva, Merlin, and Maxine: a speech AI SDK that includes pre-trained models, an end-to-end recommender AI framework, and an audio and video quality enhancement SDK, respectively. As NVIDIA highlighted, these are used by the likes of AT&T, Microsoft, and Snapchat.
There are also 60 SDK updates for NVIDIA’s CUDA-X libraries. NVIDIA chose to highlight emerging areas such as accelerating quantum circuit simulation (cuQuantum general availability) and 6G physical-layer research (Sionna general availability).
However, for most users, the good news is probably the update to the RAPIDS Accelerator for Apache Spark, which speeds up processing by over 3x with no code changes (see the configuration sketch at the end of this piece). While this was not exactly prominent in NVIDIA’s announcements, we think it should be. An overnight 3x speedup without code changes, with 80 percent of the Fortune 500 using Apache Spark in production, is no small news. It’s not the first time NVIDIA has shown Apache Spark users some love, either.
Overall, NVIDIA seems to be maintaining its momentum. While the competition is fierce, with the head start NVIDIA has managed to create, radical redesigns may not really be called for.
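For Spark users wondering what “no code changes” means in practice, the RAPIDS Accelerator is enabled through configuration rather than application code. The sketch below is illustrative only: the jar path and data path are placeholders, and the exact settings for a given cluster are spelled out in NVIDIA’s RAPIDS Accelerator documentation.

```python
# Sketch: enabling the RAPIDS Accelerator for Apache Spark via configuration.
# The DataFrame code at the bottom is ordinary, unchanged Spark code; eligible
# operations are routed to the GPU by the plugin. Paths are placeholders.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("rapids-accelerated-job")
    .config("spark.plugins", "com.nvidia.spark.SQLPlugin")   # RAPIDS Accelerator plugin
    .config("spark.rapids.sql.enabled", "true")              # enable GPU execution of SQL/DataFrame ops
    .config("spark.jars", "/path/to/rapids-4-spark.jar")     # placeholder path to the plugin jar
    .getOrCreate()
)

# Existing Spark code, unchanged.
df = spark.read.parquet("/data/events.parquet")              # placeholder dataset
df.groupBy("user_id").count().orderBy("count", ascending=False).show()
```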