go tool pprof vs go tool trace: Which One Do You Actually Need?


Cristian Curteanu · March 7, 2026 · 11 min read

Two tools, two completely different questions. One answers where CPU time went. The other answers why your goroutines weren’t running when they should have been.


The most common Go profiling mistake is reaching for pprof when the problem is a scheduler or GC issue — and staring at a clean flame graph that tells you nothing. The second most common mistake is reaching for go tool trace when the problem is a slow function, and generating a 400MB file that’s impossible to read.

These tools are complements, not alternatives. Understanding which question each one answers is what makes Go performance debugging fast.

This post covers the mechanics of both, the specific symptoms that direct you to each, and the practical workflow for collecting and reading trace output — including the file size trap that catches everyone the first time.


What pprof Actually Measures

pprof is a statistical sampling profiler. It doesn’t record every function call — it periodically interrupts your program and captures the current goroutine’s stack trace.

For CPU profiling, Go’s runtime arranges for SIGPROF to be delivered 100 times per second to each running thread and records the stack of whatever is executing when the signal lands. Over a 30-second window, a single busy core produces ~3,000 samples. The flame graph aggregates those samples — a function that appears in 15% of samples was on-CPU for roughly 15% of the sampled time.

This sampling model has a direct implication: pprof cannot see time your goroutines aren’t running. If a goroutine is blocked waiting for a mutex, a channel, or a database response, it’s off the CPU. pprof doesn’t sample it. The time disappears.

This is not a bug. It’s the correct behavior for a CPU profiler. The tool is measuring what the CPU is doing. When goroutines are blocked, the CPU is doing something else (or idling).

What pprof measures well:

  • CPU hotspots — functions consuming disproportionate CPU time
  • Heap allocations — which code paths allocate the most memory
  • Goroutine stack traces — a point-in-time snapshot of all running goroutines
  • Lock contention — aggregate view of mutex and channel blocking (block/mutex profiles)
  • Allocation pressure — allocs profile shows what’s generating GC work

What pprof misses:

  • Why a goroutine was waiting for 200ms when your handler was supposed to return in under 5ms
  • Whether your latency spike coincides with a GC pause
  • Which goroutine was starved because the scheduler deprioritized it
  • The precise sequence of events that caused a cascade

What go tool trace Actually Measures

go tool trace is an event recorder. Every significant runtime event — goroutine creation and destruction, OS thread blocking, GC phases, network I/O, syscalls, and scheduler preemptions — is written to a buffer at the moment it happens, with a timestamp at microsecond resolution.

The result is a timeline, not a statistical aggregate. You can see:

  • Exactly when goroutine G23 was created, when it started running, when it blocked on a channel, and when it was unblocked
  • The exact start and end of every GC phase (mark start, concurrent mark, mark termination, sweep)
  • Which logical processor (P) was running which goroutine at every moment
  • How long network I/O actually blocked before the runtime’s network poller woke the goroutine

This answers questions pprof structurally cannot:

  • “My P99 is 80ms but my handler does 2ms of CPU work. Where did the other 78ms go?”
  • “We have GC pauses, but gctrace shows them under 1ms. Why is latency spiking?”
  • “We’re running on 8 cores but only 2 seem busy at a time. Is this a GOMAXPROCS issue?”

What trace misses:

  • Which function within a goroutine is slow (it records scheduling events with stacks at event boundaries, not a time-weighted CPU breakdown)
  • Aggregate view of where CPU time went across all goroutines
  • Anything below goroutine granularity

The Diagnostic Decision

Use this mapping when you’re trying to decide:

| Symptom | Tool | Why |
|---|---|---|
| CPU is high, you need to find what | pprof CPU profile | Sampling shows which functions use CPU |
| Memory is growing | pprof heap profile (diffed) | Shows what’s alive and where it was allocated |
| Allocation rate is high / GC overhead | pprof allocs profile | Shows what’s generating GC pressure |
| P99 latency spikes but CPU looks normal | go tool trace | Goroutines may be blocked or delayed by the scheduler |
| Latency spikes correlate with GC | go tool trace | Shows exact GC phase timing at microsecond resolution |
| Goroutine count is high (possible leak) | pprof goroutine profile | Snapshot of all goroutines grouped by stack |
| Goroutines aren’t being scheduled promptly | go tool trace | Shows scheduler events and P utilization |
| “It’s slow sometimes but the profile looks clean” | go tool trace | Intermittent issues need event recording, not sampling |
| Lock contention exists, you need to find who holds it | pprof mutex profile | Shows which goroutines hold contended locks |

The short version: pprof tells you what the CPU was doing. Trace tells you what it wasn’t doing, and why.


Collecting a Trace

If you already have pprof endpoints exposed (import _ "net/http/pprof"), trace collection is one line:

curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=5"
go tool trace trace.out

This opens a browser UI at a local port with the full event timeline.

From code, for controlled collection (benchmarks, tests):

import (
    "os"
    "runtime/trace"
)

f, err := os.Create("trace.out")
if err != nil {
    panic(err) // handle the error properly in real code
}
defer f.Close()
if err := trace.Start(f); err != nil {
    panic(err)
}
defer trace.Stop()
// ... the code you want traced

And in tests — target one package at a time, since with ./... each package’s test binary overwrites the trace file:

go test -trace=trace.out -run TestMyHandler .
go tool trace trace.out

The File Size Trap

Trace files grow fast. A busy service generating 100,000 goroutine events per second will produce tens of megabytes of trace data per second. Five seconds of trace on a high-throughput service is often 200–500MB.

The go tool trace UI loads the entire file into memory before rendering. Very large traces either crash the browser tab or become impossible to navigate.

Practical limits:

  • 2 seconds for high-throughput services (>10k req/s)
  • 5 seconds for most services
  • 10–30 seconds only for services with rare, infrequent events you’re trying to catch

If you need to catch a rare event over a longer window, annotate the code path you care about with trace.NewTask and trace.StartRegion. Annotations don’t shrink the trace, but tasks and regions are surfaced in the UI’s user-defined task view, so the span you care about is easy to find in a large file.

ctx, task := trace.NewTask(ctx, "my-operation")
defer task.End()

region := trace.StartRegion(ctx, "expensive-sub-step")
// ... work
region.End()

Overhead During Collection

The trace recorder adds overhead to your running service. Unlike CPU profiling with pprof (~1–2%), tracing has historically added 15–30% latency overhead on services with high goroutine churn, because every goroutine lifecycle event is written to a buffer. The tracer was substantially reworked in Go 1.21, bringing typical overhead down to a few percent — but measure on your own workload before assuming either number.

Collect traces during representative load, not peak load. The overhead itself can mask scheduler problems you’re trying to diagnose.


Reading the Trace UI

go tool trace opens a browser UI with several views. The right view depends on what you’re investigating.

The Timeline View

The main view. Each row is a logical processor (P) — there are GOMAXPROCS of them — with color-coded goroutine execution shown on each. GC phases appear as colored overlays across all Ps.

What to look for:

Idle Ps — if you have 8 Ps but only 2-3 show activity most of the time, your goroutines are blocked waiting for something (I/O, channels, locks). This is not a GOMAXPROCS problem — adding more Ps won’t help if goroutines aren’t runnable.

GC stop-the-world phases — appear as thin colored bars spanning all Ps simultaneously. These are the actual STW pauses. If they’re sub-millisecond, GC isn’t your latency problem. If they’re 5–20ms, they are.

Goroutine bursts — a sudden vertical thickening of the P rows indicates many goroutines becoming runnable at once. If this correlates with your latency spike, you have a thundering herd or a large fan-out that’s queuing goroutines faster than they can be scheduled.

Goroutine Analysis View

Click Goroutine analysis in the menu. This groups all goroutines by their creation stack and shows aggregate statistics:

| Column | What it means |
|---|---|
| Goroutines | How many goroutines were created with this stack |
| Total execution time | Actual time on CPU |
| Network wait | Time blocked on network I/O |
| Sync block | Time blocked on channels or mutexes |
| Syscall block | Time in blocking syscalls |
| Scheduler wait | Time runnable but not scheduled |

Scheduler wait is the number that isn’t visible anywhere in pprof. If your HTTP handlers show large scheduler wait times, goroutines are sitting in the run queue waiting for a P to pick them up. This is genuine scheduler contention — often caused by GOMAXPROCS being too low, or by long-running goroutines not yielding.

Minimum Mutator Utilization (MMU)

Click Minimum mutator utilization in the menu. This shows the worst-case fraction of time your goroutines were actually running (as opposed to being paused for GC) over sliding windows of different durations.

If MMU at 1ms is 0.7 (70%), it means that in the worst 1ms window in your trace, GC consumed 30% of your runtime. For latency-sensitive services, you want MMU to be above 0.9 at your P99 latency target duration.


A Practical Example: Diagnosing a P99 Latency Spike

Here’s the workflow when pprof shows nothing but latency is high:

Step 1 — Confirm the symptom is real under the trace window.

Start a load test, wait for baseline, then capture:

hey -n 100000 -c 50 http://your-service/endpoint &
sleep 10  # let baseline establish
curl -o trace.out "http://localhost:6060/debug/pprof/trace?seconds=3"
wait

Step 2 — Open the timeline and look for GC phases.

go tool trace trace.out

If you see STW phases wider than 1ms, GC is the culprit. Check /gc/pauses:seconds from runtime/metrics and tune with GOGC/GOMEMLIMIT (see Go GC tuning guide).

Step 3 — Check the Goroutine Analysis view for scheduler wait.

Sort by Scheduler wait. If your request handlers have large scheduler wait, goroutines are runnable but not running. Check:

  • Is GOMAXPROCS set to something lower than the number of available CPUs?
  • Are there long-running goroutines that hold a P without yielding?
  • Is the service spawning so many goroutines per request that the scheduler queues them?

Step 4 — Check Sync block time.

Large sync block time in handlers that shouldn’t be waiting indicates lock contention or channel blocking. The trace will show you when the block happened and which goroutine unblocked it — this causality is impossible to get from pprof alone.


Using Both Together

The right workflow is not “pprof or trace” — it’s pprof first, trace when pprof is insufficient.

1. Collect a 30-second CPU profile with pprof
   → Shows a clear hotspot? Fix it, re-measure.
   → Top 10 looks reasonable? Move to step 2.

2. Check goroutine count trend
   → Trending up? Goroutine leak — see pprof goroutine profile.
   → Stable? Move to step 3.

3. Collect a 3-second trace
   → Open Goroutine Analysis, sort by Scheduler wait
   → Open timeline, look for STW GC pauses
   → Open MMU, check GC pressure at your latency target

This sequence covers >90% of Go production performance investigations. The trace step takes 3 seconds to collect and 2 minutes to read. It’s not expensive — the reason most engineers skip it is unfamiliarity with the UI, not cost.


Frequently Asked Questions

Does go tool trace replace OpenTelemetry distributed tracing?

No. They operate at completely different layers. go tool trace records runtime-internal events — goroutine scheduling, GC, syscalls — within a single process. OpenTelemetry distributed tracing records application-level spans across service boundaries. For diagnosing why a specific service is slow, use go tool trace. For diagnosing why a request is slow across multiple services, use distributed tracing. They’re complementary; use both.

Can I use go tool trace in production continuously?

No. The overhead (15–30% latency impact on Go versions before the 1.21 tracer rework, plus large file sizes) makes continuous trace collection impractical for most services. For continuous visibility into scheduler and GC behavior, use runtime/metrics — export /gc/pauses:seconds, /sched/latencies:seconds, and /gc/cycles/total:gc-cycles to your metrics system. Run go tool trace only during active debugging sessions.

The trace UI is hard to read on busy services. Any tips?

Narrow the time window (2 seconds maximum for busy services), use the search/filter in Goroutine Analysis to focus on a specific goroutine type, and use trace.NewTask and trace.StartRegion in your code to annotate the specific operation you’re debugging. Annotated regions appear highlighted in the timeline, making it easy to find the span you care about among thousands.

My trace shows large “Sync block” time but pprof block profile shows nothing obvious. Why?

The block profile measures aggregate blocking time, sampled. It can miss short blocks that happen very frequently but resolve quickly. The trace records every block event, so a function that blocks for 500μs thousands of times per second will be visible in the trace but underrepresented in the block profile (which has a configurable sampling rate). Call runtime.SetBlockProfileRate(1) to record every blocking event, then compare.

How do I capture a trace around a specific event rather than a fixed time window?

Use runtime/trace programmatically instead of the pprof HTTP endpoint. Start the trace before the operation, stop it after. For production, you can build a “trigger” into your service: when a request exceeds a latency threshold, write a trace of the last N seconds using a ring buffer. The felixge/fgprof library does something similar for CPU profiles. There’s no standard library support for ring-buffer tracing, but it’s buildable with trace.Start/trace.Stop and a temporary file.