perf is a Linux profiling tool built on the kernel's perf_events subsystem. It can sample CPU performance counters, trace system calls, and collect the stack samples used to build flame graphs that show where an application spends its time. On RHEL 8, perf is packaged in the standard repositories and reads the CPU's hardware performance monitoring units (PMUs) through the kernel. Whether you are diagnosing slow throughput, excessive cache misses, or unexpected system call overhead, perf provides the data needed to pinpoint the root cause without modifying application source code. This tutorial covers everything from installation through flame graph generation and system call tracing.
Prerequisites
- A RHEL 8 system with root or sudo access
- A compiled application binary or process to profile (examples use a generic ./myapp placeholder)
- Basic familiarity with C or application compilation, which helps when interpreting symbols
- The FlameGraph scripts from Brendan Gregg (cloned via git)
Step 1 — Install perf and Required Tools
The perf package is built from the kernel source, so its version tracks the kernel package version. Install it along with git, which is needed to clone the FlameGraph scripts.
sudo dnf install -y perf git
# Verify installation and kernel match
perf --version
uname -r
# Install FlameGraph scripts
cd /opt
sudo git clone https://github.com/brendangregg/FlameGraph.git
sudo chmod +x /opt/FlameGraph/*.pl
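The version check above can also be scripted by comparing the base version in both strings. This is only a minimal sketch: it assumes `perf --version` prints a line beginning with `perf version` followed by the version, and the helper name `same_base_version` and the sample strings are hypothetical, not part of perf.

```shell
#!/usr/bin/env bash
# same_base_version PERF_VERSION_LINE KERNEL_RELEASE
# Succeeds when both strings contain the same leading X.Y.Z base version.
same_base_version() {
    local perf_base kernel_base
    # Pull the first X.Y.Z triple out of each string.
    perf_base=$(printf '%s\n' "$1" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
    kernel_base=$(printf '%s\n' "$2" | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)
    [ -n "$perf_base" ] && [ "$perf_base" = "$kernel_base" ]
}

# Example with made-up strings; on a real system pass
# "$(perf --version)" and "$(uname -r)" instead.
if same_base_version "perf version 4.18.0-553.el8.x86_64" "4.18.0-553.el8.x86_64"; then
    echo "perf and kernel base versions match"
fi
```

A mismatch is not necessarily an error, but it is worth noting before interpreting results.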
Step 2 — Measure CPU Counters with perf stat
perf stat runs a command and prints a summary of hardware performance counters when it exits. Because it counts events instead of sampling them, overhead is minimal, which generally makes it safe to run against production binaries. Focus on CPU cycles, cache misses, and branch mispredictions as the first indicators of inefficiency.
# Basic counter summary for a command
perf stat ./myapp
# Specify counters explicitly for detail
perf stat -e cycles,instructions,cache-misses,cache-references,branch-misses,branch-instructions ./myapp
# Profile a running process by PID for 30 seconds
perf stat -p $(pgrep myapp) -- sleep 30
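In the output, look at instructions per cycle (IPC), which perf stat prints as "insn per cycle". As a rough guide, values below 1.0 suggest the CPU is frequently stalled, often waiting on memory. A cache-miss rate (cache-misses / cache-references) above roughly 5% suggests poor data locality, and a high branch-miss rate (branch-misses / branch-instructions) points to unpredictable conditional branches that hurt the CPU pipeline. These thresholds are rules of thumb and vary by workload and CPU.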
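These ratios can be computed from machine-readable output rather than read off by eye. The following is a minimal sketch, assuming the CSV layout that `perf stat -x,` emits (counter value in field 1, event name in field 3). The helper name `stat_ratios` and the sample numbers are made up for illustration.

```shell
#!/usr/bin/env bash
# stat_ratios: read `perf stat -x,` CSV lines on stdin and print IPC,
# cache-miss rate, and branch-miss rate.
stat_ratios() {
    awk -F, '
        # Field 1 is the count, field 3 the event name (possibly with a
        # modifier suffix such as ":u", which is stripped here).
        # Adding 0 turns non-numeric values like "<not supported>" into 0.
        { name = $3; sub(/:.*/, "", name); count[name] = $1 + 0 }
        END {
            if (count["cycles"] > 0)
                printf "ipc %.2f\n", count["instructions"] / count["cycles"]
            if (count["cache-references"] > 0)
                printf "cache-miss-rate %.1f%%\n", 100 * count["cache-misses"] / count["cache-references"]
            if (count["branch-instructions"] > 0)
                printf "branch-miss-rate %.1f%%\n", 100 * count["branch-misses"] / count["branch-instructions"]
        }'
}

# Made-up sample data in the assumed CSV layout.
sample='1000000,,cycles,500000,100.00,,
800000,,instructions,500000,100.00,0.80,insn per cycle
4000,,cache-misses,500000,100.00,,
50000,,cache-references,500000,100.00,,
3000,,branch-misses,500000,100.00,,
200000,,branch-instructions,500000,100.00,,'
printf '%s\n' "$sample" | stat_ratios
```

perf stat writes its counters to stderr, so on a real system one option is to save them with `-o`, for example `perf stat -x, -o /tmp/stat.csv -e cycles,instructions ./myapp`, and then run `stat_ratios < /tmp/stat.csv`.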
Step 3 — Record and Analyze Call Graph Samples
perf record collects time-based samples of the call stack, writing them to a perf.data file. The -g flag enables call graph (stack trace) collection, which is required for flame graphs and for understanding which callers are responsible for hot functions.
# Record call graphs at 99 Hz for the duration of myapp
perf record -g -F 99 ./myapp
# Record a running process by PID for 60 seconds
perf record -g -F 99 -p $(pgrep myapp) -- sleep 60
# Open the interactive report
perf report
# Non-interactive text output, sorted by overhead
perf report --stdio | head -60
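Inside perf report, use the arrow keys to navigate and press Enter to expand a symbol's call chain. When the data was recorded with -g, the report shows two percentages: Children is the share of samples in which the function appeared anywhere on the stack, and Self is the share in which the function itself was executing on the CPU. As a rule of thumb, functions with more than about 5% Self overhead are good optimization candidates.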
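The 5% rule of thumb can be applied to the text report with a small filter. This is a sketch under stated assumptions: it expects the output of `perf report --stdio --no-children`, where each entry line starts with the overhead percentage and the symbol follows a marker such as `[.]` or `[k]`. The helper name `hot_symbols` and the sample report lines are hypothetical.

```shell
#!/usr/bin/env bash
# hot_symbols [MIN_PERCENT]: read `perf report --stdio --no-children` text
# on stdin and print "overhead symbol" for entries at or above the threshold.
hot_symbols() {
    awk -v min="${1:-5}" '
        # Entry lines start with a percentage; headers start with "#" and
        # call chain lines start with "|" or "-", so they are skipped.
        $1 ~ /^[0-9.]+%$/ {
            pct = $1
            sub(/%/, "", pct)
            if (pct + 0 < min + 0) next
            # The symbol follows the "[.]" or "[k]" marker.
            if (match($0, /\[.\] /)) print $1, substr($0, RSTART + RLENGTH)
        }'
}

# Made-up report lines in the assumed layout.
report='# Overhead  Command  Shared Object      Symbol
    62.50%  myapp    myapp              [.] compute_hash
    21.30%  myapp    libc-2.28.so       [.] __memcpy_avx_unaligned
     3.10%  myapp    [kernel.kallsyms]  [k] copy_user_generic'
printf '%s\n' "$report" | hot_symbols 5
```

On a real system the input would come from something like `perf report --stdio --no-children | hot_symbols 5`.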
Step 4 — Live Profiling with perf top
perf top works like top but shows the hottest kernel and user-space functions in real time across all running processes. It is useful for quickly identifying which process or function is responsible for a CPU spike without preparing a workload in advance.
# System-wide live profile (requires root)
sudo perf top
# Limit to a single process
sudo perf top -p $(pgrep myapp)
# Sort by specific event
sudo perf top -e cache-misses
# Show annotated assembly for a function
# Press 'a' on a function in perf top, or annotate a perf.data file from perf record:
perf annotate --symbol=function_name
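The examples in this tutorial pass `$(pgrep myapp)` to `-p`, which works when exactly one process matches. pgrep prints one PID per line when several processes match, while perf's `-p` option takes a comma-separated list. A minimal sketch of a helper that bridges the two follows; the name `join_pids` is hypothetical and the PIDs are made up.

```shell
#!/usr/bin/env bash
# join_pids: turn newline-separated PIDs on stdin into one comma-separated list.
join_pids() {
    paste -sd, -
}

printf '1201\n1202\n1207\n' | join_pids   # prints 1201,1202,1207
```

It could then be used as `sudo perf top -p "$(pgrep myapp | join_pids)"`. pgrep's own `-d,` option produces the same list without a helper.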
Step 5 — Generate Flame Graphs
Flame graphs visualize the full call stack depth and the relative CPU time spent in each code path as a horizontal stacked bar chart. They make it immediately obvious which call chains are consuming the most time and are far easier to scan than raw perf report output.
# Record with DWARF unwinding, which gives accurate stacks even when the binary lacks frame pointers
perf record --call-graph dwarf -F 99 -p $(pgrep myapp) -- sleep 60
# Convert samples to folded stack format
perf script | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/stacks.folded
# Generate SVG flame graph
/opt/FlameGraph/flamegraph.pl /tmp/stacks.folded > /tmp/flamegraph.svg
# Confirm the SVG was written, then open it in a browser
ls -lh /tmp/flamegraph.svg
Open /tmp/flamegraph.svg in a web browser. The x-axis represents the total sample population, ordered alphabetically rather than by time, and the y-axis shows the call depth. Wide bars at any level indicate functions that frequently appear in call stacks. Click on any frame to zoom in on that subtree.
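Bar widths come directly from the sample counts in the folded file, where each line is a semicolon-separated stack followed by a count. To illustrate, here is a minimal sketch that totals samples per leaf function, meaning the function that was actually on the CPU. The helper name `top_leaves` and the sample stacks are made up.

```shell
#!/usr/bin/env bash
# top_leaves: read folded stacks ("frame1;frame2;... count") on stdin and
# print the total sample count per leaf function, highest first.
top_leaves() {
    awk '
        {
            samples = $NF                     # last field is the sample count
            stack = $0
            sub(/ [0-9]+$/, "", stack)        # drop the trailing count
            n = split(stack, frames, ";")
            total[frames[n]] += samples       # the last frame is the leaf
        }
        END { for (f in total) print total[f], f }
    ' | sort -k1,1nr -k2
}

# Made-up folded stacks in the format stackcollapse-perf.pl writes.
printf '%s\n' \
    'myapp;main;parse_input 12' \
    'myapp;main;compute;hash_block 70' \
    'myapp;main;report;hash_block 8' \
    'myapp;main;compute 10' | top_leaves
```

With this sample input, hash_block accounts for 78 of the 100 samples across two different call paths, which a flame graph would show as two separate wide bars at the top of their stacks.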
Step 6 — Trace System Calls with perf trace
perf trace is similar to strace but uses kernel tracepoints instead of ptrace, resulting in much lower overhead. It is ideal for profiling applications where strace would introduce unacceptable slowdown.
# Trace all syscalls for a process
sudo perf trace -p $(pgrep myapp)
# Summarize syscall counts and latency (like strace -c)
sudo perf trace --summary -p $(pgrep myapp) -- sleep 30
# Trace specific syscalls only
sudo perf trace -e read,write,epoll_wait -p $(pgrep myapp)
Conclusion
You have used perf on RHEL 8 to measure hardware counters with perf stat, collect and inspect call graph samples with perf record and perf report, monitor live CPU usage with perf top, generate actionable flame graphs, and trace system calls with low overhead using perf trace. Start every profiling session with perf stat to identify the class of bottleneck (CPU-bound, memory-bound, or I/O-bound), then drill down with perf record and flame graphs to find the exact code path to optimize.
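As a recap, the workflow can be sketched as a dry-run helper that prints the commands for one process without executing anything. The function name `profile_plan` and the PID are hypothetical; the paths are the ones used earlier in this tutorial.

```shell
#!/usr/bin/env bash
# profile_plan PID [SECONDS]: print the command sequence from this tutorial
# for one process. Nothing is executed; the output is meant to be reviewed.
profile_plan() {
    local pid=$1 secs=${2:-30}
    printf 'perf stat -p %s -- sleep %s\n' "$pid" "$secs"
    printf 'perf record -g -F 99 -p %s -- sleep %s\n' "$pid" "$secs"
    printf 'perf script | /opt/FlameGraph/stackcollapse-perf.pl > /tmp/stacks.folded\n'
    printf '/opt/FlameGraph/flamegraph.pl /tmp/stacks.folded > /tmp/flamegraph.svg\n'
}

# Made-up PID for illustration.
profile_plan 4242 30
```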
Next steps: How to Configure Huge Pages for Database Performance on RHEL 8, How to Tune Linux Kernel Parameters with sysctl on RHEL 8, and How to Use eBPF and bpftrace for Kernel Tracing on RHEL 8.