How to Profile Application Performance with perf on RHEL 7

Performance bottlenecks in production applications are difficult to diagnose with traditional monitoring tools alone. CPU usage metrics tell you that a system is busy but rarely where the work is happening inside your application. The perf tool, backed by the perf_events subsystem in the Linux kernel, provides hardware-assisted profiling at the function and instruction level. On RHEL 7, perf uses CPU performance monitoring units (PMUs) to capture call graphs, count hardware events, trace system calls, and identify hot code paths, typically with low overhead. This guide covers installation, common profiling workflows, and how to interpret and visualize the output, including flamegraph generation with Brendan Gregg’s scripts.

Prerequisites

RHEL 7 system with root or sudo access
The target application to profile (a running process or a command to benchmark)
Debug symbols installed for your application where possible (greatly improves readability)
Internet access or a local yum mirror to install packages
Git installed if you plan to generate flamegraphs: yum install -y git

Step 1: Installing perf on RHEL 7

The perf tool ships in the perf package on RHEL 7. The kernel debuginfo packages, installed separately further down, improve kernel symbol resolution in stack traces and enable source-level annotation. Install perf first:

yum install -y perf

Verify the installation:

perf version
# perf version 3.10.0-xxx.el7.x86_64

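For kernel symbol resolution, install the debuginfo package that matches the running kernel. First, enable the debuginfo repository. On a system registered with subscription-manager, the usual way is subscription-manager repos --enable=rhel-7-server-debug-rpms, which also takes care of the entitlement certificates the Red Hat CDN requires. Alternatively, define the repository in a repo file: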

vi /etc/yum.repos.d/rhel-debuginfo.repo

[rhel-7-server-debuginfo]
name=Red Hat Enterprise Linux 7 Server (Debug RPMs)
baseurl=https://cdn.redhat.com/content/dist/rhel/server/7/$releasever/$basearch/debug/os/
enabled=1
gpgcheck=1

yum install -y kernel-debuginfo-$(uname -r) kernel-debuginfo-common-$(uname -m)-$(uname -r)
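The package names in the command above are derived from the running kernel's release and architecture. As a minimal sketch, the helper below (kernel_debuginfo_packages is a hypothetical name, not part of perf or yum) prints both names so they can be reviewed before installing. The release string in the usage comment is only an example value.

```shell
# Print the two kernel debuginfo package names for a kernel release/arch.
# Defaults to the running kernel when called without arguments.
kernel_debuginfo_packages() {
    release=${1:-$(uname -r)}
    arch=${2:-$(uname -m)}
    printf 'kernel-debuginfo-%s\n' "$release"
    printf 'kernel-debuginfo-common-%s-%s\n' "$arch" "$release"
}

# Usage (example release string; run without arguments for the live kernel):
#   kernel_debuginfo_packages 3.10.0-1160.el7.x86_64 x86_64
#   yum install -y $(kernel_debuginfo_packages)
```

Printing the names first makes it easier to spot a mismatch between the running kernel and the packages actually available in the repository.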

For user-space applications, install their debuginfo packages. For example, for Nginx:

yum install -y nginx-debuginfo

Step 2: Counting Hardware Events with perf stat

perf stat runs a command and reports CPU hardware performance counters when it exits. It is the fastest way to get a high-level view of an application’s CPU efficiency, cache behavior, and branch prediction accuracy.

# Profile a specific command
perf stat ls -lR /usr/share/doc

# Profile with additional events
perf stat -e cycles,instructions,cache-misses,branch-misses,context-switches ls -lR /usr/share/doc

Example output:

 Performance counter stats for 'ls -lR /usr/share/doc':

      1,234,567      cycles
      2,345,678      instructions              #    1.90  insns per cycle
         45,678      cache-misses
          3,456      branch-misses             #    0.42% of all branches
             45      context-switches

       0.123456 seconds time elapsed

Attach to a running process by PID:

perf stat -p 12345 sleep 10

This collects counters for process 12345 for 10 seconds. A low instructions-per-cycle (IPC) ratio (below roughly 1.0) often means the CPU is stalling, most commonly on memory access; a high branch-miss rate points to branches the CPU cannot predict well, which hurts most in tight loops.
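The IPC figure can also be computed from machine-readable output, which is handy in scripts. The sketch below assumes perf stat was run with -x, (CSV output), that the counter value is the first field, and that the event name appears in a later field; the exact column layout varies between perf versions, so treat the parsing as an assumption. ipc_from_csv is a hypothetical helper name.

```shell
# Compute instructions per cycle from `perf stat -x,` CSV read on stdin.
# Assumes: field 1 is the count, and some later field holds the event name.
ipc_from_csv() {
    awk -F, '
        {
            for (i = 2; i <= NF; i++) {
                if ($i ~ /^(cpu-)?cycles(:|$)/) cycles = $1
                if ($i ~ /^instructions(:|$)/)  insns  = $1
            }
        }
        END {
            # "+ 0" forces a numeric test, so "<not counted>" is treated as 0
            if (cycles + 0 > 0) {
                printf "%.2f\n", insns / cycles
            } else {
                print "ipc_from_csv: no cycle count found" > "/dev/stderr"
                exit 1
            }
        }'
}

# Usage (perf stat writes its report to the file given with -o):
#   perf stat -x, -o stat.csv -e cycles,instructions -- ls -lR /usr/share/doc
#   ipc_from_csv < stat.csv
```

With the cycle and instruction counts from the example output above, this prints 1.90.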

Step 3: Recording CPU Profiles with perf record

perf record samples the CPU at regular intervals and records the instruction pointer and call stack to a perf.data file. This is the primary workflow for finding hot code paths.

# Basic sampling — attach to a running PID for 30 seconds
perf record -p 12345 -g sleep 30

# Profile a command from start to finish with call graphs
perf record -g -- /usr/bin/myapp --config /etc/myapp.conf

# Use frequency-based sampling (1000 samples/sec)
perf record -F 1000 -g -p 12345 sleep 30

# Record with DWARF-based unwinding (works without frame pointers, larger file)
perf record --call-graph dwarf -p 12345 sleep 30

The -g flag enables call graph collection. Without it, you only see where the CPU is at the time of sampling, not how execution arrived there. The resulting perf.data file is written to the current directory.

Check the file size and basic stats:

ls -lh perf.data
perf report --stdio --header | head -30
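When several processes are profiled in one session, it helps to write each profile to its own file instead of overwriting perf.data. The wrapper below is a minimal sketch under that assumption: record_profile is a hypothetical name, and the 99 Hz frequency and the perf-PID.data naming are illustrative choices, not requirements.

```shell
# Record a call-graph profile of a running PID into a named file.
# Arguments: PID, duration in seconds (default 30), output file (optional).
record_profile() {
    pid=$1
    seconds=${2:-30}
    outfile=${3:-perf-$pid.data}

    # kill -0 sends no signal; it only checks that the PID exists
    # and that we are allowed to signal it.
    if ! kill -0 "$pid" 2>/dev/null; then
        echo "record_profile: no such process (or no permission): $pid" >&2
        return 1
    fi

    perf record -F 99 -g -o "$outfile" -p "$pid" -- sleep "$seconds"
}

# Usage:
#   record_profile 12345 30 myapp.data
#   perf report -i myapp.data --stdio
```

The -o option of perf record and the -i option of perf report select the data file, so multiple recordings can sit side by side in the same directory.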

Step 4: Analyzing Profiles with perf report

perf report opens an interactive text-based interface (or produces stdout output) to browse the recorded profile.

# Interactive mode
perf report

# Output to stdout for scripting
perf report --stdio

# Show a flat profile sorted by self CPU time (no call chains)
perf report --stdio --no-children -g none --sort=symbol

# Limit to a specific binary
perf report --stdio --dsos=/usr/bin/myapp

In interactive mode, use arrow keys to navigate, Enter to expand a symbol’s call chain, and q to quit. The Children column shows cumulative time including callees; Self shows only the time spent in that function itself. Focus on high Self values first, since those functions are where CPU time is actually spent.
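The "high Self values first" rule can be applied to the --stdio output in a script. The sketch below assumes a report produced with --no-children -g none --sort=symbol, so that each data line starts with a percentage and ends with the symbol name. hot_symbols is a hypothetical helper, and symbol names containing spaces (some C++ signatures) would be truncated to their last word.

```shell
# Read `perf report --stdio` text on stdin and print "overhead symbol"
# for every entry whose self overhead is at least THRESHOLD percent.
hot_symbols() {
    threshold=${1:-5}
    awk -v t="$threshold" '
        /^#/ || NF == 0 { next }           # skip header comments and blanks
        $1 ~ /^[0-9.]+%$/ {
            pct = $1
            sub(/%/, "", pct)
            if (pct + 0 >= t + 0) print $1, $NF
        }'
}

# Usage:
#   perf report --stdio --no-children -g none --sort=symbol | hot_symbols 5
```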

To annotate a specific function with source-level detail (requires debuginfo):

perf annotate --stdio -s my_hot_function

Step 5: Real-Time Profiling with perf top

perf top is the profiling counterpart to top or htop: it shows the hottest functions across the entire system in real time without writing a perf.data file.

# System-wide real-time profiling
perf top

# Focus on a specific PID
perf top -p 12345

# Show call graphs in real time
perf top -g -p 12345

# Increase sampling frequency for short spikes
perf top -F 4000

The output refreshes every two seconds by default (adjustable with -d) and shows each function’s contribution to CPU time as a percentage. Kernel functions appear with a [kernel] label. This is particularly useful during live load tests to quickly identify whether a bottleneck is in user space, kernel space, or a shared library.

Step 6: Tracing System Calls with perf trace

perf trace provides strace-like system call tracing with lower overhead, making it suitable for profiling production systems where strace would be too intrusive.

# Trace all syscalls for a PID
perf trace -p 12345

# Trace for a fixed duration
perf trace -p 12345 sleep 10

# Show only specific syscalls
perf trace -e read,write,epoll_wait -p 12345

# Count syscalls by type (summary mode)
perf trace --summary -p 12345 sleep 10

Summary output looks roughly like this (illustrative; the exact columns and units vary between perf versions):

 Summary of events:

 myapp (12345), 1000 events, 100.0%

   syscall            calls    total       min       avg       max
   --------------- -------- --------- --------- --------- ---------
   epoll_wait           500   4.500 s   0.100ms   9.000ms  50.000ms
   read                 300   0.150 s   0.001ms   0.500ms   5.000ms
   write                200   0.050 s   0.001ms   0.250ms   2.000ms

High time in epoll_wait is normal for event-driven servers; unexpectedly high time in read or write may indicate I/O saturation or inefficient buffer sizes.
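To see at a glance which syscall dominates, the totals can be turned into percentages. The sketch below parses the illustrative layout shown above (name, call count, total, then the unit "s"); real perf trace output may use different columns and units, so the field positions are an assumption that would need adjusting. syscall_time_share is a hypothetical helper name.

```shell
# Read a syscall summary on stdin and print each syscall's share of the
# summed total time. Assumes fields: name, calls, total, "s", ...
syscall_time_share() {
    awk '
        $2 ~ /^[0-9]+$/ && $4 == "s" {
            n++
            name[n] = $1
            secs[n] = $3
            total += $3
        }
        END {
            # Indexed loop keeps the rows in their original order
            for (i = 1; i <= n; i++)
                printf "%s %.1f%%\n", name[i], 100 * secs[i] / total
        }'
}

# Usage:
#   perf trace --summary -p 12345 sleep 10 2>&1 | syscall_time_share
```

On the sample summary above, epoll_wait accounts for about 95.7% of the traced time, which matches the expectation for an idle event loop.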

Step 7: Generating Flamegraphs with Brendan Gregg’s Scripts

Flamegraphs transform perf call graph data into an interactive SVG where the width of each frame represents its contribution to CPU time. They are the most effective visualization for quickly pinpointing bottlenecks in deep call stacks.

Clone the FlameGraph repository:

cd /opt
git clone https://github.com/brendangregg/FlameGraph.git

Record a profile with call graphs, then fold the stacks and render the SVG:

perf record -F 99 -g -p 12345 sleep 60
perf script | /opt/FlameGraph/stackcollapse-perf.pl > out.folded
/opt/FlameGraph/flamegraph.pl out.folded > flamegraph.svg

Transfer flamegraph.svg to your workstation and open it in a browser. Clicking on any frame zooms into that call path, and hovering shows its sample count and percentage. Look for wide plateaus along the top edge of the graph; these are the functions that were on-CPU most often when samples were taken.
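The intermediate out.folded file is plain text, one line per unique stack in the form frame1;frame2;...;leaf followed by a sample count, so it can also be summarized without a browser. As a minimal sketch, the hypothetical helper below sums the samples by leaf frame, which is a text approximation of reading the plateaus off the graph.

```shell
# Sum sample counts per leaf frame from folded stacks (file args or stdin)
# and print "count frame", highest count first.
top_leaves() {
    awk '
        {
            count = $NF
            stack = $0
            sub(/ [0-9]+$/, "", stack)     # drop the trailing sample count
            n = split(stack, frames, ";")
            self[frames[n]] += count       # last frame is the on-CPU function
        }
        END { for (f in self) print self[f], f }' "$@" | sort -rn
}

# Usage:
#   top_leaves out.folded | head -20
```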

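For a rough off-CPU view (where threads block on I/O or locks), record scheduler switch events instead. This counts context switches per stack rather than measuring blocked time, so frame widths show how often each code path was switched out, not how long it waited: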

perf record -e sched:sched_switch -ag sleep 30
perf script | /opt/FlameGraph/stackcollapse-perf.pl | 
  /opt/FlameGraph/flamegraph.pl --color=io --title="Off-CPU Flamegraph" > offcpu.svg

Conclusion

The perf toolchain on RHEL 7 provides a kernel-integrated profiling stack that spans hardware counters, CPU sampling, system call tracing, and call graph visualization. A structured methodology follows from the steps above: start with perf stat for a quick efficiency overview, move to perf record and perf report for call-graph analysis, and finish with flamegraphs for visual presentation. The same sequence works whether you are optimizing a slow database query handler or diagnosing a CPU regression after a kernel upgrade. Because perf relies on hardware counters and sampling, its overhead is usually low enough to run on production systems during controlled observation windows, which makes it a core tool for performance engineering on RHEL 7.