How to Profile a Go Application: The Complete pprof Guide
How to use Go’s pprof toolchain to profile CPU, memory, goroutines, block, and mutex: what each profile type tells you, when to use it, how to collect profiles safely in production, and how to act on what you find.
Most Go services run unexamined for months or years. Engineers add infrastructure when things get slow, argue about whether to scale vertically or horizontally, and occasionally rewrite services that were never actually the bottleneck. Knowing how to profile a Go application avoids all of that — and Go ships with one of the best profiling toolchains of any production language.
This guide covers the full picture: all five profile types built into pprof, how to collect them without harming a live service, how to read what you get, and the specific patterns to look for in each one. By the end, you’ll have a repeatable methodology for diagnosing any Go service.
What Is pprof?
pprof is Go’s built-in profiling system. It ships in the standard library — no third-party tools, no agent to install. You import one package, and you get HTTP endpoints that produce profiles on demand from a live, running service.
A profile is a sampled snapshot of what your program is doing over a window of time (CPU) or at a point in time (memory, goroutines). The go tool pprof command reads these profiles and gives you an interactive shell, a call graph, a flame graph, and an annotated source view.
There are five profile types worth knowing. Each answers a different question about your service:
| Profile | Question it answers |
|---|---|
| CPU | Where is the CPU spending its time? |
| Heap / Allocs | Where is memory being allocated? |
| Goroutine | What are all my goroutines doing right now? |
| Block | Where are goroutines waiting (channels, I/O, locks)? |
| Mutex | Which goroutines are holding contended locks? |
Understanding which profile type to start with — based on your symptom — is the most important skill in Go performance debugging.
Before You Profile: The Setup That Actually Matters
Expose the pprof Endpoints
The fastest path to profiling a running Go service is the net/http/pprof package. Import it for its side effects and it registers handlers on your default mux:
import _ "net/http/pprof"
That gives you:
| Endpoint | What it collects |
|---|---|
| /debug/pprof/profile?seconds=30 | 30-second CPU profile |
| /debug/pprof/heap | Current heap snapshot |
| /debug/pprof/goroutine | Stack trace of every live goroutine |
| /debug/pprof/block | Stack traces of goroutine blocking events |
| /debug/pprof/mutex | Stack traces of mutex contention |
| /debug/pprof/allocs | Allocation profile (same data as heap, different default sample index) |
| /debug/pprof/threadcreate | OS thread creation events |
| /debug/pprof/trace?seconds=5 | Full runtime execution trace |
Critical note: don’t expose these endpoints on your public-facing port. In production, either bind the debug server to a separate port that’s only accessible internally, or protect these routes behind authentication. The goroutine dump alone can expose implementation details you don’t want public.
// Separate debug server — the right way
go func() {
	log.Println(http.ListenAndServe("localhost:6060", nil))
}()
Build With Optimization, Not Without
A common mistake is disabling compiler optimizations for profiling. Don’t. Profile your production binary. The inlining and optimization decisions the compiler makes are part of what you’re measuring. If you disable inlining (-gcflags='-l'), you’ll profile a different program than the one causing your production incident.
The one exception: when you’re specifically doing escape analysis work and need to read the compiler’s output. But that’s analysis, not profiling.
Understand Sampling Rates Before You Touch Them
Go’s CPU profiler uses OS signals — it sends SIGPROF 100 times per second and captures the current goroutine’s stack trace at each signal. This means short-running functions (sub-millisecond) are captured proportionally but not absolutely. If you see a function at 5% of CPU samples, it’s consuming approximately 5% of one CPU core’s time over the profiling window.
The memory profiler has a different model: it samples one allocation per runtime.MemProfileRate bytes allocated (default: 512KB). Allocations smaller than this threshold may be underrepresented. You can lower this to catch more:
runtime.MemProfileRate = 1 // Profile every allocation (expensive)
runtime.MemProfileRate = 4096 // Profile every 4KB allocated (reasonable for debugging)
Don’t lower this in production without understanding the overhead. At MemProfileRate = 1, the allocation profiler itself becomes a performance problem.
CPU Profiles: Finding Where Your Service Spends Its Time
Collecting a CPU Profile
The most reliable collection method in production:
# Capture a 30-second CPU profile
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
# Open it
go tool pprof cpu.prof
Or, directly:
go tool pprof http://localhost:6060/debug/pprof/profile?seconds=30
Thirty seconds is the minimum window worth analyzing for most services. Under ten seconds, the sampling statistics get noisy. For services with periodic spikes, try to capture during a realistic load period — profiling during low traffic gives you an accurate picture of the wrong workload.
Reading the CPU Profile
Inside the pprof interactive shell:
(pprof) top10
This shows the top ten functions by flat time — time the CPU was executing that specific function, not time spent in functions it called. Compare that to cumulative time, which includes callees:
(pprof) top10 -cum
The difference matters. A function with high flat time but low cumulative time is doing expensive work itself. A function with low flat time but high cumulative time is a hotspot because of what it calls — it’s the common ancestor of expensive work.
The flame graph is the fastest way to see the whole picture:
go tool pprof -http=:8080 cpu.prof
Open your browser at localhost:8080 and select Flame Graph from the view menu. Width equals CPU time. The widest bars are where the time is going. Focus on the widest bars in the middle of the graph — deep callstacks with wide bars mean your hot path is calling through many layers before hitting the expensive work.
For a detailed walkthrough of reading flame graphs, see How to Read a Go CPU Flame Graph.
CPU Profile Patterns to Act On
Pattern 1: runtime.mallocgc in your top 10. This means you’re allocating enough to make allocation itself a bottleneck. Look at what’s calling mallocgc and move those allocations off the heap — sync.Pool, pre-allocated slices, struct reuse.
Pattern 2: runtime.gcBgMarkWorker consuming 10-25% of CPU. Your GC is working hard. Either your allocation rate is high (address the allocation sites) or your heap is sized wrong. See the Go GC tuning guide for GOGC and GOMEMLIMIT options.
Pattern 3: syscall.Read or syscall.Write dominating. You’re I/O bound, not CPU bound. A CPU profile won’t tell you what to fix here — switch to the block profile.
Pattern 4: encoding/json.Marshal or encoding/json.Unmarshal in your top 5. Standard library JSON is reflection-based. At high throughput, switching to goccy/go-json or sonic can cut this dramatically. Benchmark both before switching.
Pattern 5: Surprisingly little time in your actual application code. This can mean the service is I/O bound rather than CPU bound (switch to the block profile), or that your hot path was inlined and time is being attributed to the caller. Run go build -gcflags='-m' on the relevant packages to see what’s being inlined, then look at the caller.
Memory Profiles: Finding Where Your Heap Is Going
Go gives you two distinct memory profiles. Understanding the difference is important before you touch either.
Heap Profile vs. Allocs Profile
The heap profile (/debug/pprof/heap) shows objects that are currently live on the heap, weighted by size. It answers: what is taking up memory right now?
The allocs profile (/debug/pprof/allocs) shows where allocations occurred over the profile window, regardless of whether those objects are still alive. It answers: what is generating GC pressure by allocating frequently?
If you have a memory leak, start with the heap profile. If you have high GC overhead without a leak, start with allocs.
# Heap profile — what's alive
curl -o heap.prof http://localhost:6060/debug/pprof/heap
go tool pprof heap.prof
# Allocs profile — what's been allocated
curl -o allocs.prof http://localhost:6060/debug/pprof/allocs
go tool pprof allocs.prof
Inside the Heap Profile
The heap profile has four sample types, selectable with -sample_index:
go tool pprof -sample_index=inuse_space heap.prof # Bytes currently in use
go tool pprof -sample_index=inuse_objects heap.prof # Object count in use
go tool pprof -sample_index=alloc_space heap.prof # All bytes allocated (includes collected)
go tool pprof -sample_index=alloc_objects heap.prof # All objects allocated
inuse_space is the right starting point for diagnosing high memory. alloc_objects is best for understanding which code paths produce the most GC work.
Comparing Two Heap Profiles
This is a powerful technique for diagnosing memory leaks. Capture a profile, wait, capture another, compare:
# Capture baseline
curl -o heap1.prof http://localhost:6060/debug/pprof/heap
# Wait 5 minutes under load
sleep 300
# Capture comparison
curl -o heap2.prof http://localhost:6060/debug/pprof/heap
# Diff them
go tool pprof -base heap1.prof heap2.prof
(pprof) top10
The diff shows you what grew. Positive numbers mean more memory in the second profile. If you see the same call site appearing across multiple diffs, that’s your leak site.
Memory Profile Patterns to Act On
Pattern 1: bytes.makeSlice or strings.(*Builder).grow in your top allocators. You’re growing buffers repeatedly because you started them at the wrong size. Pre-size slices with make([]T, 0, expectedSize) when you know the approximate count upfront.
Pattern 2: Large []byte allocations from your HTTP handlers. If you’re reading request bodies into []byte and then passing them around, consider streaming the decode directly rather than buffering the full body.
Pattern 3: Many small allocations from interface conversions. When you store a concrete value in an interface, Go may allocate on the heap to hold it. If this appears in your hot path, passing concrete types rather than interfaces can eliminate the allocation.
Pattern 4: The profile is dominated by runtime.(*mcache).refill. Your memory profile rate may be too coarse, so the profiler is only catching large allocations and this runtime bookkeeping drowns out the signal. Temporarily lower runtime.MemProfileRate to catch smaller allocations.
Goroutine Profiles: Understanding Your Concurrency Model
What the Goroutine Profile Shows
The goroutine profile is a snapshot of every goroutine currently alive, grouped by stack trace. It’s cheap to collect, since it’s just a read of the runtime’s internal state, but it does trigger a brief stop-the-world pause proportional to the number of goroutines.
# Full goroutine dump
curl http://localhost:6060/debug/pprof/goroutine?debug=2
# Or the pprof-format version
curl -o goroutine.prof http://localhost:6060/debug/pprof/goroutine
go tool pprof goroutine.prof
The ?debug=2 format gives you a raw text dump — much more readable for high-goroutine-count services because it shows the actual goroutine state alongside each stack trace.
Goroutine States
When you read the raw dump, each goroutine has a state. The states you should pay attention to:
| State | Meaning | Action |
|---|---|---|
| running | Executing on an OS thread | Normal |
| sleep | In a time.Sleep call | Normal if expected |
| chan receive | Blocked waiting for a channel | Could be a leak if count is high |
| chan send | Blocked trying to send | Likely a leak; the receiver is gone |
| select | Waiting in a select | Normal for service goroutines |
| IO wait | Waiting for network/disk I/O | Normal under load |
| semacquire | Waiting for a mutex or semaphore | Could indicate contention |
| syscall | In a blocking syscall | More than a few hundred is a problem |
Goroutine Leak Detection
A goroutine leak is when goroutines accumulate over time because they never exit. The classic scenario: a goroutine is blocked on a channel receive, the producer is gone, and no one ever closes the channel.
The detection pattern:
# Check goroutine count via the text endpoint (the first line reports the total)
curl -s "http://localhost:6060/debug/pprof/goroutine?debug=1" | head -1
Or from code:
fmt.Println("Goroutines:", runtime.NumGoroutine())
Add this to your metrics pipeline. A goroutine count that trends upward with time (not just with load) is a leak. To find it, capture two goroutine profiles with debug=2 and compare what stacks are accumulating.
The most common leak sources:
// Classic leak: goroutine blocks on channel, context is cancelled, nobody drains
go func() {
	result := <-results // blocks forever if no one sends and no one cancels
	process(result)
}()

// Fix: always respect context cancellation
go func() {
	select {
	case result := <-results:
		process(result)
	case <-ctx.Done():
		return
	}
}()
The goleak library is excellent for catching leaks in tests before they reach production:
func TestMyHandler(t *testing.T) {
	defer goleak.VerifyNone(t)
	// ... test code
}
For a complete treatment of goroutine leaks — detection, common patterns, and fixes — see How to Detect and Fix a Goroutine Leak in Go.
Block Profiles: Finding Where Your Goroutines Wait
What Block Profiling Measures
The CPU profiler can’t see time a goroutine spends blocked — waiting on a channel, waiting for a mutex, waiting for an I/O call to return. That’s what the block profile is for.
The block profiler tracks two things: the number of blocking events, and the cumulative duration of those blocks. High duration with few events means a few operations are taking a long time. High event count with moderate duration means your code is blocking frequently on fast operations — which adds up.
Enabling Block Profiling
Block profiling is disabled by default because it adds a small overhead for every blocking event. Enable it explicitly:
runtime.SetBlockProfileRate(1)   // Record every blocking event (highest overhead)
runtime.SetBlockProfileRate(100) // Sample: roughly one event per 100ns of blocking (reasonable)
The rate parameter is in nanoseconds. A rate of 1 records every blocking event. At higher rates, events that block for at least that many nanoseconds are always recorded, and shorter events are sampled proportionally, which filters out short blocks and reduces overhead.
For a production debugging session:
runtime.SetBlockProfileRate(1000) // Record blocks >= 1 microsecond
Then collect and reset:
curl -o block.prof http://localhost:6060/debug/pprof/block
go tool pprof block.prof
(pprof) top10
Block Profile Patterns to Act On
Pattern 1: sync.(*RWMutex).RLock or sync.(*Mutex).Lock dominating. You have lock contention. Before reaching for a lock-free solution, check whether the lock is protecting shared state that could be sharded, or whether the critical section is doing more work than it needs to.
Pattern 2: Channel operations (chansend, chanrecv) near the top. Either your channel buffer is too small for the producer’s rate, or your consumer is slower than expected. Profile the consumer to see why.
Pattern 3: net/http.(*persistConn).roundTrip blocking. Your outbound HTTP client is stalling waiting for a connection from the pool. Either your pool is too small, keep-alive is misconfigured, or the downstream service is slow. Check http.Transport.MaxIdleConnsPerHost.
Pattern 4: database/sql.(*DB).conn blocking. Your database connection pool is exhausted. Use db.Stats() to check WaitCount and WaitDuration in your metrics.
Pattern 5: time.Sleep or time.After dominating. This is usually expected behavior — background loops, retry delays. But if it’s in a hot path, you may be sleeping unnecessarily.
Mutex Profiles: Finding Lock Contention
Mutex vs. Block Profile
There’s overlap here that confuses people. The block profile captures all goroutine blocking events, including mutex waits. The mutex profile specifically focuses on the holders of contended mutexes — which goroutines are holding locks that other goroutines are waiting for.
Use the block profile to find that contention exists. Use the mutex profile to find who is holding the contended lock.
Enabling Mutex Profiling
Like block profiling, mutex profiling is disabled by default:
runtime.SetMutexProfileFraction(1) // Profile every mutex event
runtime.SetMutexProfileFraction(10) // Profile 1/10 of mutex events
Then collect:
curl -o mutex.prof http://localhost:6060/debug/pprof/mutex
go tool pprof mutex.prof
(pprof) top10
Mutex Profile Patterns to Act On
Pattern 1: Your own application mutex in the top results. Check the critical section size. Lock, do the minimum necessary work, unlock. Don’t do I/O or complex computation while holding a lock.
Pattern 2: sync.(*Map).Store or sync.(*Map).Load. sync.Map is optimized for a specific access pattern: mostly reads with infrequent writes. If you’re writing frequently, a map protected by a sync.RWMutex may perform better. Benchmark both.
Pattern 3: runtime.(*gcControllerState).findRunnableGCWorker. The GC’s worker scheduling is contending on a lock. This usually means your GC is running very frequently — the underlying cause is allocation rate, not the mutex itself.
A Repeatable Production Profiling Workflow
Here’s the sequence that works for diagnosing a Go service under a performance complaint — whether that’s high latency, high CPU, or memory growth.
Step 1: Establish the Symptom Precisely
Before you collect a single profile, write down the exact symptom with numbers:
- “P99 latency is 850ms, up from 40ms two weeks ago”
- “Memory grows by ~200MB per hour and doesn’t plateau”
- “CPU is pegged at 95% with 2,000 RPS, and we’re dropping requests”
This gives you a benchmark to measure against and helps you choose the right profile type first.
Step 2: Match Symptom to Profile Type
| Symptom | Start With |
|---|---|
| High CPU | CPU profile |
| High memory / growing memory | Heap profile (compare two, spaced apart) |
| High allocation rate / GC pressure | Allocs profile |
| Goroutine count growing | Goroutine profile (debug=2) |
| Latency spikes without CPU spike | Block profile |
| Contention-related latency | Mutex profile |
| Mysterious latency with no obvious cause | Runtime trace (/debug/pprof/trace) |
Step 3: Collect Under Load
Profiles are only meaningful if the service is doing realistic work during collection. Collect during normal traffic hours. If you need to profile against a staging environment, drive realistic load with a load test first, then capture.
# Run load test in background, then capture
hey -n 100000 -c 200 http://your-service/endpoint &
sleep 5
curl -o cpu.prof "http://localhost:6060/debug/pprof/profile?seconds=30"
wait
Step 4: Identify the Top 3 Hotspots
Don’t try to fix everything. In every profile, there’s a power-law distribution — the top 3 call sites are responsible for the majority of the impact. Fix those first, measure again, and repeat.
go tool pprof -http=:8080 cpu.prof
# Focus on the widest bars in the flame graph
# Resist the urge to fix the 10th item on the list
Step 5: Make One Change, Measure Again
The worst profiling mistake is making five changes at once. You can’t attribute the improvement. Change one thing, redeploy (or benchmark locally), and capture the same profile type again to compare.
Step 6: Document What You Found and What Changed
Write down the before-after numbers. This is how you build a track record and communicate value upward. “We reduced CPU by 34% by eliminating three allocation sites in the hot path” is a sentence your eng manager can understand.
Tools and Flags Reference
Common pprof Commands
# Interactive shell with a profile
go tool pprof profile.prof
# Web UI (most useful)
go tool pprof -http=:8080 profile.prof
# Key interactive commands
(pprof) top10 # Top 10 by flat time
(pprof) top10 -cum # Top 10 by cumulative time
(pprof) list functionName # Annotated source for a function
(pprof) web # Open SVG in browser
(pprof) png # Save PNG
(pprof) tree # Text tree view
# Focus/ignore filters
(pprof) top -focus=main # Only show frames matching "main"
(pprof) top -ignore=runtime # Exclude runtime frames
Profiling From Code (Tests and Benchmarks)
For more controlled profiling during development:
import (
	"os"
	"runtime/pprof"
)

// CPU profile around the code you care about
f, _ := os.Create("cpu.prof")
pprof.StartCPUProfile(f)
defer pprof.StopCPUProfile()

// Heap profile: snapshot at a point you choose
hf, _ := os.Create("heap.prof")
pprof.WriteHeapProfile(hf)
hf.Close()
For benchmarks, Go’s testing package has built-in integration:
go test -bench=. -cpuprofile=cpu.prof -memprofile=mem.prof ./...
go tool pprof cpu.prof
Using the pprof Web UI
The web UI at localhost:8080 (from -http=:8080) has several views worth knowing:
- Top: Sortable table of functions by flat or cumulative cost
- Graph: Call graph with edges showing cost flow
- Flame Graph: Stacked chart showing the full call stack (best for finding hot paths)
- Source: Annotated source view (when you have source available)
- Peek: Shows callers and callees of a selected function
The flame graph is the right starting point for most investigations. The source view is invaluable once you’ve narrowed to a specific function.
What to Do When the Profile Doesn’t Show the Problem
Sometimes you collect a CPU profile and the top 10 results look completely reasonable. The problem is elsewhere. Here’s where to look:
The work is I/O bound. CPU profiling won’t show this. Switch to the block profile to see where goroutines are waiting.
The problem is intermittent. A 30-second CPU profile averages over everything in that window. If your latency spikes last 500ms every 60 seconds, the CPU profile may not capture them prominently. Use the runtime trace (/debug/pprof/trace) which records every scheduler event at microsecond resolution.
The bottleneck is external. If your service is spending most of its time waiting for a database or downstream API, your profiler correctly shows little CPU work — because there isn’t any. This is a distributed tracing problem, not a profiling problem.
The problem is resource contention at the OS level. NUMA effects, CPU throttling in containers, noisy neighbors in the cloud. None of these will show up clearly in pprof. Check dmesg, cgroup throttling metrics, and CPU steal time in your cloud provider’s metrics.
Frequently Asked Questions
Does enabling pprof endpoints slow down my production service?
The endpoints themselves have no cost until they’re called. Collecting a CPU profile adds roughly 1–2% overhead for the duration of the capture window (default 30 seconds). The goroutine dump causes a brief stop-the-world pause proportional to goroutine count — usually under 1ms for services with fewer than 10,000 goroutines. Block and mutex profiling add continuous overhead and should be disabled by default, enabled only during active debugging sessions.
What’s the difference between go tool pprof and go tool trace?
pprof uses statistical sampling — it captures goroutine stacks at fixed intervals (100 Hz for CPU). It gives you an aggregate view of where time was spent. go tool trace is an event recorder — it captures every scheduler event, goroutine creation, GC pause, and system call at microsecond resolution, producing a timeline. pprof is better for identifying hotspots. trace is better for diagnosing scheduler delays, GC pauses, and latency spikes that are too short to appear in a sampled profile. See go tool pprof vs go tool trace for a full comparison.
How long should I run a CPU profile?
30 seconds minimum for most services. Longer (60–120s) if your workload has natural periodicity — for example, if your service does something expensive every 30 seconds, a 30-second window may catch it at the start or miss it entirely. Capture during representative load. A CPU profile taken at 3am when traffic is minimal tells you what the service does at 3am, not what it does at peak.
Can I profile a Go service that’s already slow without making it worse?
Yes. The CPU profiler adds roughly 1–2% CPU overhead. The heap profile is nearly free (it reads existing runtime data). The goroutine dump is free but causes a sub-millisecond STW pause. If your service is pegged at 99% CPU, adding 1–2% overhead will have no visible effect on users but will give you the data you need to fix the problem.
Should I run continuous profiling or collect profiles on demand?
On-demand collection is sufficient for investigating known incidents. Continuous profiling (tools like Pyroscope, Grafana Pyroscope, or Polar Signals) is valuable for two scenarios: (1) catching gradual regressions that accumulate over weeks without triggering alerts, and (2) comparing profiles across service instances or deployments when behavior differs. See setting up continuous profiling with Pyroscope for implementation details.
pprof shows normal-looking results but my service is slow. What now?
Start with the symptom-to-profile mapping table above. If CPU looks clean, move to the block profile — the service may be I/O bound. If block looks clean, check your goroutine count over time (leak?). If everything looks clean, use go tool trace for a 5-second window — it will show you GC pauses and scheduler delays that sampling profiles miss. If trace also looks clean, the bottleneck is external (database, downstream service, network) and you need distributed tracing, not profiling.
Next Steps
If you’ve worked through a profile and found something worth fixing, these posts go deeper on the specific techniques:
- How to Read a Go CPU Flame Graph — what the axes mean, how to find the hot path, and the most common misreadings
- go tool pprof vs go tool trace: When to Use Each — when sampling isn’t enough and you need the full event timeline
- How to Expose pprof Safely in Production — the right port separation and auth patterns for live services
- Setting Up Continuous Profiling with Pyroscope — catch gradual regressions before they become incidents
- Go Memory Optimization: Allocations, sync.Pool, and GC Tuning — the full toolkit once your heap profile shows you where to look
Profiling an unexamined service almost always finds something actionable. The question is usually not whether there’s a problem — it’s which problem is worth fixing first.
Need help applying this to your stack? Book a free 30-minute infrastructure review — we’ll look at what your profiles are telling you and identify the highest-leverage optimization targets for your specific service.