How to Profile Application Performance with perf on RHEL 9

December 1, 2025

The perf tool is the Linux kernel’s performance analysis framework. It can sample CPU events, trace system calls, read hardware performance counters, and capture the stack data used to build flame graphs, all without any changes to the application under test. On RHEL 9, perf integrates tightly with the kernel’s BPF and tracepoint subsystems, which makes it one of the most powerful profiling tools available. Understanding where an application spends its CPU time is the first step toward meaningful optimization, and perf provides both high-level summaries and deep call-graph detail. This tutorial covers the full workflow from installation through flame graph generation.

Prerequisites

RHEL 9 system with root or sudo access
An application or process to profile
Debuginfo packages for the application (recommended for symbol resolution)
Perl and the FlameGraph scripts for flame graph generation (optional)

Step 1 — Installing perf on RHEL 9

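The perf tool is distributed in the perf package on RHEL 9. Optionally install the matching kernel debuginfo package as well; it comes from the debuginfo repositories, which must be enabled first, and gives perf full kernel symbol and annotation data for stack traces.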

sudo dnf install -y perf

# Install kernel debuginfo for full kernel symbol resolution
sudo dnf install -y kernel-debuginfo-$(uname -r)

# Verify installation
perf --version

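Allow unprivileged users to profile kernel code as well as their own processes, and expose kernel symbol addresses (optional, for development systems only). Settings made with sysctl -w last only until the next reboot: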

sudo sysctl -w kernel.perf_event_paranoid=1
sudo sysctl -w kernel.kptr_restrict=0
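Before changing these values it helps to know what the current level allows. The sketch below is a minimal helper, not part of perf: the function name describe_paranoid is made up for illustration, and the descriptions paraphrase the kernel's sysctl documentation for users without CAP_PERFMON (the upstream default level is 2).

```shell
# Hypothetical helper: summarize what a perf_event_paranoid level
# permits for unprivileged users (those without CAP_PERFMON).
describe_paranoid() {
  if [ "$1" -ge 2 ]; then
    echo "user-space profiling only"
  elif [ "$1" -ge 1 ]; then
    echo "user-space and kernel profiling"
  elif [ "$1" -ge 0 ]; then
    echo "user-space and kernel profiling, plus CPU-wide events"
  else
    echo "almost all events, including raw tracepoints"
  fi
}

# Report the level currently in effect, if the kernel exposes it
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
  level=$(cat /proc/sys/kernel/perf_event_paranoid)
  echo "perf_event_paranoid=$level: $(describe_paranoid "$level")"
fi
```

To keep a setting across reboots, place it in a file under /etc/sysctl.d/ instead of relying on sysctl -w.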

Step 2 — Recording CPU Events with perf record

The perf record subcommand samples a running process at a configurable frequency and writes a binary perf.data file. The -g flag enables call-graph (stack trace) capture, which is essential for flame graphs and understanding the full call path leading to hot functions.

# Attach to an existing process by PID (replace 12345 with actual PID)
sudo perf record -g -p 12345 -- sleep 30

# Profile a command from start to finish with call graphs
sudo perf record -g -- ./myapp --some-args

# Record at a higher frequency for finer-grained data (recent perf versions default to 4000 Hz)
sudo perf record -F 10000 -g -p 12345 -- sleep 30

When the recording finishes, perf.data is written to the current directory. Higher sampling frequencies give finer-grained data but add profiling overhead and produce larger files.
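To judge how large a recording will be, a rough estimate is one sample per tick for every CPU the target keeps busy. The helper below is a sketch under that assumption; expected_samples is a made-up name, and real counts come out lower whenever the process is idle or the kernel throttles the sampling rate.

```shell
# Hypothetical helper: rough upper bound on the number of samples.
# Arguments: frequency in Hz, duration in seconds, busy CPUs (default 1).
expected_samples() {
  echo $(( $1 * $2 * ${3:-1} ))
}

expected_samples 99 30 1     # prints 2970
expected_samples 4000 30 4   # prints 480000
```

A few thousand samples are usually enough to see the dominant call paths, which is why low frequencies such as 99 Hz are common for longer recordings.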

Step 3 — Analyzing the Report with perf report

The perf report command opens an interactive terminal UI (TUI) against the captured perf.data file. The display shows functions sorted by their percentage of total CPU samples, along with the overhead attributed to each call path.

sudo perf report

# Non-interactive flat output — useful for scripting or CI
sudo perf report --stdio

# Filter to a specific shared library or binary
sudo perf report --stdio --dsos=/usr/lib64/libc.so.6

In the TUI, press Enter on a function to expand its call graph. Press a to annotate with disassembly, and q to quit. The Children column shows the cumulative overhead including all callees; the Self column shows time spent exclusively in that function.
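For scripting, the --stdio output can be filtered down to the entries above a chosen threshold. This is a minimal sketch: hot_symbols is a hypothetical helper, and it assumes each entry line begins with a percentage column, as in the usual --stdio layout. Header lines and call-graph lines are skipped because their first field is not a bare percentage.

```shell
# Hypothetical helper: keep only report entries whose first column
# (a percentage such as "12.34%") is at or above the given minimum.
# Usage: perf report --stdio ... | hot_symbols MIN_PERCENT
hot_symbols() {
  awk -v min="${1:-1}" '$1 ~ /^[0-9.]+%$/ {
    pct = $1
    sub(/%$/, "", pct)            # strip the trailing percent sign
    if (pct + 0 >= min + 0) print
  }'
}
```

One way to use it is sudo perf report --stdio --no-children | hot_symbols 5, which lists entries with at least 5% self overhead. Without --no-children the first column is the Children value, so the filter applies to cumulative overhead instead.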

Step 4 — Live CPU Usage with perf top

perf top provides a live, continuously updating view of the hottest functions system-wide, similar to the top command but at the function level. This is useful for quickly identifying which function is consuming CPU on a busy production server without having to record a trace first.

# System-wide live profiling
sudo perf top

# Filter to a specific process
sudo perf top -p 12345

# Show call graphs in the live view
sudo perf top -g

Step 5 — Counting Hardware Events with perf stat

Hardware performance counters expose low-level CPU metrics including instructions per cycle (IPC), cache miss rates, and branch misprediction rates. Poor IPC or high cache miss rates often indicate memory access patterns that benefit from data structure reorganization or NUMA-aware allocation.

# Run an application and print hardware counter summary
sudo perf stat ./myapp

# Specify additional events: LLC cache misses and branch mispredictions
sudo perf stat -e cache-misses,cache-references,branch-misses,branch-instructions ./myapp

# Attach to a running process for 10 seconds
sudo perf stat -p 12345 -- sleep 10

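By default the output reports task clock, cycles, instructions, IPC (shown as "insn per cycle"), branches, and branch misses; cache events appear when requested with -e. An IPC well below 1.0 often suggests that the CPU is stalling, commonly on memory access, rather than being limited by computation. Treat this as a rule of thumb, since typical IPC varies by workload and CPU model.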
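The derived metrics are simple ratios of the raw counters, so they are easy to recompute when comparing runs. The helpers below are an illustrative sketch; the names ipc and miss_rate are made up here, and the arguments are plain counter values without thousands separators.

```shell
# Hypothetical helper: instructions per cycle.
# Arguments: instruction count, cycle count.
ipc() {
  awk -v insn="$1" -v cyc="$2" 'BEGIN {
    if (cyc + 0 == 0) { print "n/a"; exit }
    printf "%.2f\n", insn / cyc
  }'
}

# Hypothetical helper: miss rate as a percentage.
# Arguments: miss count, reference count (e.g. cache-misses, cache-references).
miss_rate() {
  awk -v miss="$1" -v ref="$2" 'BEGIN {
    if (ref + 0 == 0) { print "n/a"; exit }
    printf "%.1f%%\n", 100 * miss / ref
  }'
}

ipc 3000000 2000000    # prints 1.50
miss_rate 50 1000      # prints 5.0%
```

The same miss_rate function works for branch-misses against branch-instructions.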

Step 6 — Generating Flame Graphs

Flame graphs provide an intuitive visualization of where CPU time is spent across the entire call stack. Brendan Gregg’s FlameGraph scripts convert perf output into an interactive SVG. Install the scripts and generate a flame graph as follows:

# Clone the FlameGraph repository
sudo git clone https://github.com/brendangregg/FlameGraph /opt/FlameGraph

# Record with call graphs
sudo perf record -F 99 -g -p 12345 -- sleep 30

# Export stack traces to folded format
sudo perf script | /opt/FlameGraph/stackcollapse-perf.pl > out.folded

# Generate the SVG flame graph
/opt/FlameGraph/flamegraph.pl out.folded > flamegraph.svg

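Open flamegraph.svg in a browser. Each horizontal bar represents a function in a sampled call stack; its width is proportional to the number of samples in which it appeared, and therefore to CPU time. Clicking a bar zooms into that stack frame. Look for wide bars near the top of a stack: they mark functions that were themselves running on the CPU, while the wide bars beneath them show the call path that led there.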
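The folded file is also useful without the SVG. Each line holds one stack, with frames separated by semicolons and followed by a sample count, so summing the counts by last frame gives the time spent in each function itself. The sketch below assumes that format; self_time is a hypothetical helper, not one of the FlameGraph scripts.

```shell
# Hypothetical helper: rank functions by self time from a folded stack file.
# Arguments: folded file (default out.folded), number of rows (default 10).
self_time() {
  awk '{
    count = $NF                     # last field is the sample count
    stack = $0
    sub(/ [0-9]+$/, "", stack)      # drop the count, keep the frames
    n = split(stack, frames, ";")
    self[frames[n]] += count        # credit samples to the leaf frame
    total += count
  }
  END {
    for (f in self) printf "%.1f%% %s\n", 100 * self[f] / total, f
  }' "${1:-out.folded}" | sort -rn | head -n "${2:-10}"
}
```

Running self_time out.folded 5 prints the five functions with the largest share of samples, which should match the widest bars along the top edge of the flame graph.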

Conclusion

You have used perf on RHEL 9 to record CPU profiles, analyze call graphs, monitor live function-level CPU usage, inspect hardware performance counters, and generate flame graphs. These techniques form a systematic approach to application performance analysis: start with perf top for quick identification, use perf stat to characterize the bottleneck type (CPU-bound vs. memory-bound), then use perf record and flame graphs to pinpoint the exact call paths responsible.

Next steps: How to Tune Linux Kernel Parameters with sysctl on RHEL 9, How to Trace System Calls with strace and ltrace on RHEL 9, and How to Use eBPF Tools for Production Tracing on RHEL 9.