How to Profile Application Performance with perf on RHEL 7
Performance bottlenecks in production applications are notoriously difficult to diagnose with traditional monitoring tools alone. CPU usage metrics tell you that a system is busy but rarely where the work is happening inside your application. The perf tool, backed by the perf_events subsystem built into the Linux kernel, provides hardware-assisted profiling at the function and instruction level. On RHEL 7, perf uses CPU performance monitoring units (PMUs) and kernel tracepoints to capture call graphs, count hardware events, trace system calls, and identify hot code paths, typically with low overhead at modest sampling rates. This guide covers installation, common profiling workflows, and how to interpret and visualize the output, including flamegraph generation with Brendan Gregg’s scripts.
Prerequisites
- RHEL 7 system with root or sudo access
- The target application to profile (a running process or a command to benchmark)
- Debug symbols installed for your application where possible (greatly improves readability)
- Internet access or a local yum mirror to install packages
- Git installed if you plan to generate flamegraphs:
yum install -y git
Step 1: Installing perf on RHEL 7
The perf tool ships in the perf package on RHEL 7. The kernel debuginfo packages are additionally needed to resolve kernel symbols in stack traces. Install perf first:
yum install -y perf
Verify the installation:
perf version
# perf version 3.10.0-xxx.el7.x86_64
For kernel symbol resolution, install the matching debuginfo package. First, enable the debuginfo repository:
vi /etc/yum.repos.d/rhel-debuginfo.repo
[rhel-7-server-debuginfo]
name=Red Hat Enterprise Linux 7 Server (Debug RPMs)
baseurl=https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/debug/os/
enabled=1
gpgcheck=1
yum install -y kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -m)-$(uname -r)
For user-space applications, install their debuginfo packages. For example, for Nginx:
yum install -y nginx-debuginfo
Step 2: Counting Hardware Events with perf stat
perf stat runs a command and reports CPU hardware performance counters when it exits. It is the fastest way to get a high-level view of an application’s CPU efficiency, cache behavior, and branch prediction accuracy.
# Profile a specific command
perf stat ls -lR /usr/share/doc
# Profile with additional events
perf stat -e cycles,instructions,cache-misses,branch-misses,context-switches ls -lR /usr/share/doc
Example output:
 Performance counter stats for 'ls -lR /usr/share/doc':

     1,234,567  cycles
     2,345,678  instructions      #  1.90 insns per cycle
        45,678  cache-misses
         3,456  branch-misses     #  0.42% of all branches
            45  context-switches

   0.123456 seconds time elapsed
Attach to a running process by PID:
perf stat -p 12345 sleep 10
This collects counters for process 12345 for 10 seconds. A low instructions-per-cycle (IPC) ratio (below 1.0) suggests memory-bound behavior; a high branch-miss rate suggests poor branch prediction in tight loops.
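The IPC figure can also be recomputed from the raw counts, which helps when comparing several runs in a script. The following is a minimal sketch, assuming the human-readable layout of the example output above (count first, event name second). The helper name perf_stat_ipc is hypothetical, and event names that carry suffixes such as :u are not handled.

```shell
# perf_stat_ipc: hypothetical helper, reads perf stat text output on stdin
# and prints instructions per cycle. Assumes lines of the form
# "<count> <event>" with thousands separators, as in the example above.
perf_stat_ipc() {
    awk '
        { gsub(/,/, "", $1) }                      # 1,234,567 becomes 1234567
        $2 == "cycles"       { cycles = $1 + 0 }
        $2 == "instructions" { insns  = $1 + 0 }
        END {
            if (cycles > 0) printf "%.2f\n", insns / cycles
            else            print "n/a"
        }
    '
}

# Demo with the sample counts from this guide (no perf run needed)
printf '%s\n' \
    "1,234,567 cycles" \
    "2,345,678 instructions # 1.90 insns per cycle" |
    perf_stat_ipc        # prints 1.90
```

With a real run, perf stat writes its report to stderr, so a pipeline such as `perf stat -e cycles,instructions ls 2>&1 >/dev/null | perf_stat_ipc` would feed the helper.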
Step 3: Recording CPU Profiles with perf record
perf record samples the CPU at regular intervals and records the instruction pointer and call stack to a perf.data file. This is the primary workflow for finding hot code paths.
# Basic sampling — attach to a running PID for 30 seconds
perf record -p 12345 -g sleep 30
# Profile a command from start to finish with call graphs
perf record -g -- /usr/bin/myapp --config /etc/myapp.conf
# Use frequency-based sampling (1000 samples/sec)
perf record -F 1000 -g -p 12345 sleep 30
# Record with DWARF call graphs (most accurate, larger file)
perf record -g --call-graph dwarf -p 12345 sleep 30
The -g flag enables call graph collection. Without it, you only see where the CPU is at the time of sampling, not how execution arrived there. The resulting perf.data file is written to the current directory.
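To get a feel for what the -F values above imply, the number of samples can be estimated as frequency × duration × busy CPUs. This is a back-of-the-envelope sketch with a hypothetical helper name. It gives an upper bound, since a process-attached profile only collects samples while the target is actually running on a CPU.

```shell
# expected_samples FREQ_HZ SECONDS BUSY_CPUS
# Upper bound on samples: one sample per tick per CPU the target keeps busy.
expected_samples() {
    echo $(( $1 * $2 * $3 ))
}

expected_samples 1000 30 1   # prints 30000 (-F 1000 for 30 s, one busy thread)
expected_samples 99 60 4     # prints 23760 (-F 99 for 60 s, four busy threads)
```

A mostly idle process will yield far fewer samples than this estimate, which is itself a useful signal that the time is being spent off-CPU.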
Check the file size and basic stats:
ls -lh perf.data
perf report --stdio --header | head -30
Step 4: Analyzing Profiles with perf report
perf report opens an interactive text-based interface (or produces stdout output) to browse the recorded profile.
# Interactive mode
perf report
# Output to stdout for scripting
perf report --stdio
# Show a flat profile sorted by self CPU time (no callee accumulation)
perf report --stdio --no-children --sort=symbol
# Limit to a specific binary
perf report --stdio --dsos=/usr/bin/myapp
In interactive mode, use the arrow keys to navigate, Enter to expand a symbol’s call chain, and q to quit. When call graphs were recorded, the Children column shows cumulative time including callees, while Self shows only the time spent in the function itself. Focus on high Self values first, since these are where CPU time is actually consumed.
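The relationship between the two columns can be illustrated on a toy profile. The sketch below is not how perf computes them internally. It assumes call stacks in the semicolon-separated "folded" form (root first, sample count last), and the helper name, function names, and counts are all made up.

```shell
# self_children: hypothetical helper. Reads folded stacks
# ("root;...;leaf COUNT") and prints Children and Self percentages.
self_children() {
    awk '
    {
        count = $NF
        stack = $0
        sub(/ [0-9]+$/, "", stack)          # drop the trailing sample count
        n = split(stack, frames, ";")
        self[frames[n]] += count            # Self: only the leaf frame
        split("", seen)                     # reset the per-stack dedup set
        for (i = 1; i <= n; i++) {
            if (!(frames[i] in seen)) {     # Children: every frame on the stack
                seen[frames[i]] = 1
                children[frames[i]] += count
            }
        }
        total += count
    }
    END {
        for (f in children)
            printf "%s children=%.1f%% self=%.1f%%\n",
                   f, 100 * children[f] / total, 100 * self[f] / total
    }' | LC_ALL=C sort
}

# Made-up profile: 100 samples across three stacks
self_children <<'EOF'
main;parse;read_line 30
main;parse 10
main;render 60
EOF
# main children=100.0% self=0.0%
# parse children=40.0% self=10.0%
# read_line children=30.0% self=30.0%
# render children=60.0% self=60.0%
```

Here main has the largest Children value but no Self time at all, so the functions worth optimizing are render and read_line.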
To annotate a specific function with source-level detail (requires debuginfo):
perf annotate --stdio -s my_hot_function
Step 5: Real-Time Profiling with perf top
perf top is the profiling counterpart to top or htop: it shows the hottest functions across the entire system in real time without writing a perf.data file.
# System-wide real-time profiling
perf top
# Focus on a specific PID
perf top -p 12345
# Show call graphs in real time
perf top -g -p 12345
# Increase sampling frequency for short spikes
perf top -F 4000
The display refreshes every two seconds by default (adjustable with -d) and shows each function’s contribution to CPU time as a percentage. Kernel functions appear with a [kernel] label. This is particularly useful during live load tests to quickly identify whether a bottleneck is in user space, kernel space, or a shared library.
Step 6: Tracing System Calls with perf trace
perf trace provides strace-like system call tracing with lower overhead, making it suitable for profiling production systems where strace would be too intrusive.
# Trace all syscalls for a PID
perf trace -p 12345
# Trace for a fixed duration
perf trace -p 12345 sleep 10
# Show only specific syscalls
perf trace -e read,write,epoll_wait -p 12345
# Count syscalls by type (summary mode)
perf trace --summary -p 12345 sleep 10
Summary output looks like:
Summary of events:
myapp (12345), 1000 events, 100.0%
syscall            calls     total       min       avg       max
--------------- -------- --------- --------- --------- ---------
epoll_wait           500   4.500 s   0.100ms   9.000ms  50.000ms
read                 300   0.150 s   0.001ms   0.500ms   5.000ms
write                200   0.050 s   0.001ms   0.250ms   2.000ms
High time in epoll_wait is normal for event-driven servers; unexpectedly high time in read or write may indicate I/O saturation or inefficient buffer sizes.
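To see at a glance how the total time splits across syscalls, the summary can be reduced to percentages. This is only a sketch: it assumes the column layout of the sample summary above (name, calls, total in seconds followed by a separate "s" field), and real perf trace output differs between versions. The helper name syscall_time_share is hypothetical.

```shell
# syscall_time_share: hypothetical helper. Reads a summary in the layout
# shown above and prints each syscall's share of the summed total time.
syscall_time_share() {
    awk '
    $2 ~ /^[0-9]+$/ && $4 == "s" {     # data rows: integer call count, "s" unit
        name[++n] = $1
        secs[n] = $3
        total += $3
    }
    END {
        for (i = 1; i <= n; i++)
            printf "%s %.1f%%\n", name[i], 100 * secs[i] / total
    }'
}

# Demo with the sample rows from this guide
syscall_time_share <<'EOF'
syscall calls total min avg max
epoll_wait 500 4.500 s 0.100ms 9.000ms 50.000ms
read 300 0.150 s 0.001ms 0.500ms 5.000ms
write 200 0.050 s 0.001ms 0.250ms 2.000ms
EOF
# epoll_wait 95.7%
# read 3.2%
# write 1.1%
```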
Step 7: Generating Flamegraphs with Brendan Gregg’s Scripts
Flamegraphs transform perf call graph data into an interactive SVG where the width of each frame represents its contribution to CPU time. They are the most effective visualization for quickly pinpointing bottlenecks in deep call stacks.
Clone the FlameGraph repository:
cd /opt
git clone https://github.com/brendangregg/FlameGraph.git
Record a profile, fold the call stacks, and render the SVG:
perf record -F 99 -g -p 12345 sleep 60
perf script | /opt/FlameGraph/stackcollapse-perf.pl > out.folded
/opt/FlameGraph/flamegraph.pl out.folded > flamegraph.svg
Transfer flamegraph.svg to your workstation and open it in a browser. Clicking on any frame zooms into that call path, and hovering shows the exact percentage. Look for wide, flat plateaus along the top edge of the graph: these are the functions that were on-CPU most often.
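The folding step is easier to reason about with a small example. The sketch below is a heavily simplified stand-in for stackcollapse-perf.pl, not a replacement for it: it assumes each sample starts with an unindented header line whose first field is the command name, followed by indented "address symbol (dso)" frame lines listed leaf first. The helper name fold_stacks and the sample data are made up, and symbols or command names containing spaces are not handled.

```shell
# fold_stacks: hypothetical, simplified folding of perf script style output
# into "comm;root;...;leaf COUNT" lines.
fold_stacks() {
    awk '
    function flush(   i, key) {
        if (n == 0) return
        key = comm
        for (i = n; i >= 1; i--) key = key ";" frames[i]   # reverse: root first
        folded[key]++
        n = 0
    }
    /^[^ \t]/ { flush(); comm = $1; next }                 # sample header line
    /^[ \t]+[0-9a-f]+ / {                                  # one stack frame
        sym = $2
        sub(/\+0x[0-9a-f]+$/, "", sym)                     # strip the +0x offset
        frames[++n] = sym
        next
    }
    /^$/ { flush() }                                       # blank line ends a sample
    END  { flush(); for (k in folded) print k, folded[k] }
    ' | LC_ALL=C sort
}

# Made-up input: three samples, two of them with the same stack
fold_stacks <<'EOF'
myapp 12345 100.000001: 10101010 cycles:
        401136 parse_line+0x16 (/usr/bin/myapp)
        401200 main+0x40 (/usr/bin/myapp)

myapp 12345 100.010001: 10101010 cycles:
        401136 parse_line+0x20 (/usr/bin/myapp)
        401200 main+0x44 (/usr/bin/myapp)

myapp 12345 100.020001: 10101010 cycles:
        401300 render+0x8 (/usr/bin/myapp)
        401200 main+0x50 (/usr/bin/myapp)
EOF
# myapp;main;parse_line 2
# myapp;main;render 1
```

Each output line becomes one tower in the flamegraph, and the trailing count determines its width.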
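For a rough off-CPU view (stacks that block on I/O or locks), record scheduler switch events instead of CPU samples. Frame widths in this graph reflect how often each stack was switched out, not how long it stayed blocked: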
perf record -e sched:sched_switch -ag sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl |
/opt/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Flamegraph" > offcpu.svg
Conclusion
The perf toolchain on RHEL 7 provides a complete, kernel-integrated profiling stack that spans hardware counters, CPU sampling, system call tracing, and call graph visualization. Starting with perf stat for a quick efficiency overview, progressing to perf record and perf report for call-graph analysis, and finishing with flamegraphs for visual presentation gives you a structured methodology that works equally well for optimizing a slow database query handler or diagnosing a CPU regression after a kernel upgrade. Because sampling at modest frequencies adds little overhead, perf is generally suitable for production systems during controlled observation windows, which makes it an indispensable tool in any RHEL 7 performance engineering toolkit.