AMD 101: Accelerating AI Workloads with AMD

Introduction

When it comes to AI accelerators, NVIDIA's hardware, most recently the Blackwell GPUs, enjoys a competitive advantage due to hardware innovation and the widespread adoption and optimization of CUDA, a parallel computing platform and programming model. CUDA’s dominance has led to the extensive optimization of popular resources and tools, such as PyTorch, Triton, and Hugging Face, for NVIDIA hardware. This ecosystem of optimized software and hardware makes NVIDIA a favourable choice for developers and researchers alike, creating a self-reinforcing cycle of popularity and performance. The compatibility and performance advantages of NVIDIA's hardware and CUDA have established a formidable moat, making their products a preferred option in the market for AI and GPU-accelerated computing.

That being said, there are certain workloads where using AMD GPUs is cost-effective. Additionally, we would like to note that when it comes to GPU cloud servers, using an AMD MI300x is currently cheaper than using an NVIDIA H100.

The goal of this article is to be a good resource for those just getting into using AMD GPUs for high performance computing AI applications. As a result, we will be discussing the CDNA architecture to better understand the hardware and the ROCm software stack to better understand the programmability of AMD GPUs. While very interesting, we will not be covering the RDNA architecture (used for gaming applications and optimized for frames per second), the XDNA architecture (used for personal computing AI applications), or the upcoming UDNA architecture (which unifies RDNA and CDNA).

CDNA

CDNA is a compute-optimized GPU architecture, optimized for FLOPs per second. There have been several iterations featured in different AMD Instinct™ Series.

	CDNA	CDNA 2	CDNA 3	CDNA 4
Process Technology	7nm FinFET	6nm FinFET	5nm + 6nm FinFET	3nm + 6nm FinFET
Transistors	25.6 Billion	Up to 58 Billion	Up to 146 Billion	Up to 185 Billion
CUs/Matrix Cores	120/440	Up to 220/880	Up to 304 /1216	256 / 1024
Memory Type	32GB HBM2	Up to 128GB HBM2E	Up to 256GB HBM3 / HBM3E	288 GB HBM3E
Memory Bandwidth (Peak)	1.2 TB/s	Up to 3.2 TB/s	Up to 6 TB/s	8 TB/s
AMD Infinity Cache™	N/A	N/A	256 MB	256MB
GPU Coherency	N/A	Cache	Cache and HBM	Cache and HBM
Data Type Support	INT4, INT8, BF16, FP16, FP32, FP64	INT4, INT8, BF16, FP16, FP32, FP64	INT8, FP8, BF16, FP16, TF32, FP32, FP64 (Sparsity support)	INT4, FP4, FP6, INT8, FP8, BF16, FP16, TF32*, FP32, FP64 (Sparsity support)
Products	AMD Instinct™ MI100 Series	AMD Instinct™ MI200 Series	AMD Instinct™ MI300 Series	AMD Instinct™ MI350 Series

*TF32 is supported by software emulation.

(Table adopted from Source)

ROCm Software Stack

ROCm is an open-source software stack for programming AMD GPUs. ROCm includes the HIP (Heterogeneous-Compute Interface for Portability) programming model, which allows developers to write code that can run on both AMD and NVIDIA GPUs with minimal changes.

Developers can program AMD GPUs using several approaches: HIP for CUDA-like programming, OpenCL for cross-platform development, or OpenMP for directive-based parallel programming.

Inference with AMD

For inference tasks, AMD has collaborated with top serving frameworks like vLLM and SGLang to develop highly optimized containers. These containers are prepared for large-scale deployment of generative AI for inference, including Day 0 support for the most widely used generative AI models. vLLM is highly recommended as a versatile, general-purpose solution, with AMD providing support through bi-weekly stable releases and weekly development updates. For agentic workloads, Deepseek, and other specific applications, SGLang is the preferred choice, supported by weekly stable releases.

Beyond just the serving frameworks, AMD also optimizes leading models such as the Llama family, Gemma 3, Deepseek, and the Qwen family with Day 0 support. This ensures that the ecosystem can easily integrate the latest models in the rapidly evolving AI landscape.

Conclusion

While NVIDIA's CUDA ecosystem dominates AI, AMD's CDNA architecture and ROCm software are emerging as a strong alternative, especially for cost-effective workloads. AMD is actively collaborating with key inference frameworks like vLLM and SGLang, and optimizing leading generative AI models. This commitment to compute optimization and open-source software makes AMD an increasingly attractive option for high-performance AI, diversifying the landscape beyond NVIDIA.

AMD 101: Accelerating AI Workloads with AMD

Table of Contents

Introduction

CDNA

ROCm Software Stack

Inference with AMD

Conclusion

Links

Newsletter

Contact