Table of Contents
Introduction
When it comes to AI accelerators, NVIDIA's hardware, most recently the Blackwell GPUs, enjoys a competitive advantage due to hardware innovation and the widespread adoption and optimization of CUDA, a parallel computing platform and programming model. CUDA’s dominance has led to the extensive optimization of popular resources and tools, such as PyTorch, Triton, and Hugging Face, for NVIDIA hardware. This ecosystem of optimized software and hardware makes NVIDIA a favourable choice for developers and researchers alike, creating a self-reinforcing cycle of popularity and performance. The compatibility and performance advantages of NVIDIA's hardware and CUDA have established a formidable moat, making their products a preferred option in the market for AI and GPU-accelerated computing.
That being said, there are certain workloads where using AMD GPUs is cost-effective. Additionally, we would like to note that when it comes to GPU cloud servers, using an AMD MI300x is currently cheaper than using an NVIDIA H100.
The goal of this article is to be a good resource for those just getting into using AMD GPUs for high performance computing AI applications. As a result, we will be discussing the CDNA architecture to better understand the hardware and the ROCm software stack to better understand the programmability of AMD GPUs. While very interesting, we will not be covering the RDNA architecture (used for gaming applications and optimized for frames per second), the XDNA architecture (used for personal computing AI applications), or the upcoming UDNA architecture (which unifies RDNA and CDNA).
CDNA
CDNA is a compute-optimized GPU architecture, optimized for FLOPs per second. There have been several iterations featured in different AMD Instinct™ Series.
| CDNA | CDNA 2 | CDNA 3 | CDNA 4 | |
|---|---|---|---|---|
| Process Technology | 7nm FinFET | 6nm FinFET | 5nm + 6nm FinFET | 3nm + 6nm FinFET |
| Transistors | 25.6 Billion | Up to 58 Billion | Up to 146 Billion | Up to 185 Billion |
| CUs/Matrix Cores | 120/440 | Up to 220/880 | Up to 304 /1216 | 256 / 1024 |
| Memory Type | 32GB HBM2 | Up to 128GB HBM2E | Up to 256GB HBM3 / HBM3E | 288 GB HBM3E |
| Memory Bandwidth (Peak) | 1.2 TB/s | Up to 3.2 TB/s | Up to 6 TB/s | 8 TB/s |
| AMD Infinity Cache™ | N/A | N/A | 256 MB | 256MB |
| GPU Coherency | N/A | Cache | Cache and HBM | Cache and HBM |
| Data Type Support | INT4, INT8, BF16, FP16, FP32, FP64 | INT4, INT8, BF16, FP16, FP32, FP64 | INT8, FP8, BF16, FP16, TF32, FP32, FP64 (Sparsity support) | INT4, FP4, FP6, INT8, FP8, BF16, FP16, TF32*, FP32, FP64 (Sparsity support) |
| Products | AMD Instinct™ MI100 Series | AMD Instinct™ MI200 Series | AMD Instinct™ MI300 Series | AMD Instinct™ MI350 Series |
*TF32 is supported by software emulation.
(Table adopted from Source)
ROCm Software Stack
ROCm is an open-source software stack for programming AMD GPUs. ROCm includes the HIP (Heterogeneous-Compute Interface for Portability) programming model, which allows developers to write code that can run on both AMD and NVIDIA GPUs with minimal changes.
Developers can program AMD GPUs using several approaches: HIP for CUDA-like programming, OpenCL for cross-platform development, or OpenMP for directive-based parallel programming.
Inference with AMD
For inference tasks, AMD has collaborated with top serving frameworks like vLLM and SGLang to develop highly optimized containers. These containers are prepared for large-scale deployment of generative AI for inference, including Day 0 support for the most widely used generative AI models. vLLM is highly recommended as a versatile, general-purpose solution, with AMD providing support through bi-weekly stable releases and weekly development updates. For agentic workloads, Deepseek, and other specific applications, SGLang is the preferred choice, supported by weekly stable releases.
Beyond just the serving frameworks, AMD also optimizes leading models such as the Llama family, Gemma 3, Deepseek, and the Qwen family with Day 0 support. This ensures that the ecosystem can easily integrate the latest models in the rapidly evolving AI landscape.
Conclusion
While NVIDIA's CUDA ecosystem dominates AI, AMD's CDNA architecture and ROCm software are emerging as a strong alternative, especially for cost-effective workloads. AMD is actively collaborating with key inference frameworks like vLLM and SGLang, and optimizing leading generative AI models. This commitment to compute optimization and open-source software makes AMD an increasingly attractive option for high-performance AI, diversifying the landscape beyond NVIDIA.