
Fix Netty HTTP span lifecycle for chunked/streaming responses#1

Open
gtukmachev wants to merge 2 commits into master from gsc-577-fix-ktor-streaming-instrumentation

Conversation

@gtukmachev
Collaborator

Problem

Applications using chunked HTTP responses (e.g. Ktor's respondOutputStream) report near-zero latency in APM — the span closes when response headers are sent, not when the stream finishes.

// Ktor — respondOutputStream sends headers immediately, streams body async
call.respondOutputStream {
    repeat(5) { writeSomeChunk(); delay(1000) }  // 5s total — DD reports ~0ms
}

This affects any Netty-backed server that sends HttpResponse + multiple HttpContent + LastHttpContent separately (i.e. chunked transfer encoding).
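The gap between header-send time and stream-end time can be sketched with a plain-Java timeline (no Netty or Ktor involved; `SpanTimeline` and the shortened chunk delays are illustrative only):

```java
// Hypothetical timeline illustrating the bug: the old handler finished the
// span when headers were written, so the reported duration missed the
// entire body-streaming time.
public class SpanTimeline {

    // Returns {durationIfFinishedAtHeaders, durationIfFinishedAtLastContent} in ms.
    static long[] simulate() throws InterruptedException {
        long start = System.nanoTime();

        // Headers are flushed immediately -> old behavior closed the span here.
        long headersSent = System.nanoTime();

        // Body streamed afterwards in 5 chunks (delays shortened for the demo).
        for (int i = 0; i < 5; i++) {
            Thread.sleep(20); // stand-in for writeSomeChunk(); delay(1000)
        }
        long lastContentSent = System.nanoTime();

        return new long[] {
            (headersSent - start) / 1_000_000,
            (lastContentSent - start) / 1_000_000,
        };
    }

    public static void main(String[] args) throws InterruptedException {
        long[] d = simulate();
        System.out.println("span finished at HttpResponse:    ~" + d[0] + "ms");
        System.out.println("span finished at LastHttpContent: ~" + d[1] + "ms");
    }
}
```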


Root Cause

Bug 1 — span always closed on HttpResponse, no LastHttpContent handling

The original handler had a single dispatch branch. Every HttpResponse (headers-only or full) immediately finished the span. LastHttpContent — the actual end of a chunked stream — was never inspected; it fell through to ctx.write(msg, prm) silently.

// Before: only one branch, span always closed at header-send time
if (span == null || !(msg instanceof HttpResponse)) {
    ctx.write(msg, prm);   // LastHttpContent passes through here — span untouched
    return;
}
// ... span.finish() always called here, even for chunked headers

The fix adds explicit routing for all four Netty message types. FullHttpResponse must be checked first because it extends both LastHttpContent and HttpResponse; without that ordering it would be caught by the wrong branch.

// After: three explicit branches; everything else (chunks, unrelated messages) falls through
if (msg instanceof FullHttpResponse) { handleFullHttpResponse(...); return; }  // finish immediately
if (msg instanceof HttpResponse)     { handleHttpResponse(...);     return; }  // headers only — don't finish
if (msg instanceof LastHttpContent)  { handleLastHttpContent(...);  return; }  // finish here
ctx.write(msg, prm);  // intermediate HttpContent chunks + unrelated messages — pass through
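Because `FullHttpResponse` sits under both parent interfaces, the branch order above is load-bearing. A minimal sketch with mock types (not the real Netty interfaces) shows the dispatch:

```java
// Mock stand-ins for the Netty 4.1 hierarchy: FullHttpResponse implements
// both HttpResponse and LastHttpContent, so it must be matched before
// either of the narrower branches.
interface HttpResponse {}
interface LastHttpContent {}
class FullHttpResponse implements HttpResponse, LastHttpContent {}

public class RoutingOrder {
    static String route(Object msg) {
        if (msg instanceof FullHttpResponse) return "finish span immediately";
        if (msg instanceof HttpResponse)     return "headers only, defer finish";
        if (msg instanceof LastHttpContent)  return "finish span here";
        return "pass through";
    }

    public static void main(String[] args) {
        System.out.println(route(new FullHttpResponse()));    // full response
        System.out.println(route(new HttpResponse() {}));     // chunked headers
        System.out.println(route(new LastHttpContent() {}));  // end of stream
        System.out.println(route("intermediate chunk"));      // anything else
    }
}
```

Swapping the first two branches would send every `FullHttpResponse` down the headers-only path and leak its span.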

Bug 2 — keep-alive race condition

Under HTTP keep-alive, Netty's event loop can process channelRead for the next request (overwriting CONTEXT_ATTRIBUTE_KEY with the new span) before the pending write task for the previous response's LastHttpContent runs. Result: handleLastHttpContent finishes the new request's span with ~1-chunk duration.

Fix: a dedicated STREAMING_CONTEXT_KEY channel attribute, set when chunked headers are sent and read (then cleared) by LastHttpContent — immune to overwrite by the next request.

// AttributeKeys.java
public static final AttributeKey<Context> STREAMING_CONTEXT_KEY =
    attributeKey("datadog.server.streaming.context");

// Set when chunked headers go out
ctx.channel().attr(STREAMING_CONTEXT_KEY).set(storedContext);

// Read in LastHttpContent — safe from keep-alive overwrite
Context streamingCtx = ctx.channel().attr(STREAMING_CONTEXT_KEY).getAndRemove();
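The race can be replayed step by step with a plain map standing in for the channel's attribute store (`KeepAliveRace`, `CONTEXT_KEY`, and `STREAMING_KEY` are illustrative names, not the agent's real identifiers):

```java
import java.util.HashMap;
import java.util.Map;

// Plain-Java replay of the keep-alive race, with per-channel attributes
// modeled as a map. CONTEXT_KEY is overwritten by the next request's
// channelRead, but STREAMING_KEY is written only when chunked headers go
// out, so the pending LastHttpContent task still finds the right context.
public class KeepAliveRace {

    // Returns {spanFoundViaContextKey, spanFoundViaStreamingKey}.
    static String[] replay() {
        Map<String, String> channelAttrs = new HashMap<>();

        channelAttrs.put("CONTEXT_KEY", "span-1");    // request #1: channelRead
        channelAttrs.put("STREAMING_KEY", "span-1");  // response #1: chunked headers sent

        // Keep-alive: channelRead for request #2 runs BEFORE the write task
        // for response #1's LastHttpContent, clobbering the shared key.
        channelAttrs.put("CONTEXT_KEY", "span-2");

        // Response #1's LastHttpContent write task finally runs.
        return new String[] {
            channelAttrs.get("CONTEXT_KEY"),      // "span-2": would finish the wrong span
            channelAttrs.remove("STREAMING_KEY"), // "span-1": correct, cleared for next stream
        };
    }

    public static void main(String[] args) {
        String[] r = replay();
        System.out.println("via CONTEXT_KEY (bug): " + r[0]);
        System.out.println("via STREAMING_KEY (fix): " + r[1]);
    }
}
```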

Files Changed

  • netty-4.1/.../HttpServerResponseTracingHandler.java — routing by message type, streaming context key
  • netty-common/.../AttributeKeys.java — added STREAMING_CONTEXT_KEY

Verification

Concurrent load test (48 requests, 8 threads): streaming/slow → ~5020ms, streaming/fast → ~61ms in DataDog APM. No outliers.


Error capturing for streaming responses

The agent cannot change the HTTP status code once headers are sent. However, application code can mark the span as an error before LastHttpContent closes it, and the agent will preserve those tags.

The window is: after the exception is thrown inside the streaming lambda, before the framework closes the OutputStream (which writes LastHttpContent and triggers span.finish()). Mark the span during that catch block and it will be captured correctly.

Required pattern (application side)

Add io.opentracing:opentracing-api and io.opentracing:opentracing-util as dependencies (the DD agent auto-registers as the GlobalTracer). Then wrap the streaming body in a try/catch:

// build.gradle.kts
implementation("io.opentracing:opentracing-api:0.33.0")
implementation("io.opentracing:opentracing-util:0.33.0")

// application code
import io.opentracing.util.GlobalTracer

call.respondOutputStream {
    try {
        // ... write chunks ...
        if (someErrorCondition) {
            throw RuntimeException("stream failed mid-way")
        }
    } catch (e: Exception) {
        // Mark the active DD span as error BEFORE LastHttpContent closes it.
        // HTTP status code cannot be changed (headers already sent), but the
        // span will be tagged as an error and appear correctly in APM.
        GlobalTracer.get().activeSpan()?.let { span ->
            span.setTag("error", true)
            span.setTag("error.type",    e::class.java.name)
            span.setTag("error.message", e.message ?: "unknown")
        }
        throw e   // re-throw so the framework closes the stream
    }
}

Why this works

The DD agent instruments Kotlin coroutines, so GlobalTracer.get().activeSpan() returns the HTTP server span even inside respondOutputStream's lambda on Dispatchers.IO. The catch block runs before Ktor closes the OutputStream and emits LastHttpContent, giving the agent time to record the error before span.finish() is called.

Verified result (DataDog APM)

| Endpoint | Duration | Status | error.type | error.message |
|---|---|---|---|---|
| POST /testing/streaming/slow (error) | ~5025ms | error | java.lang.RuntimeException | Streaming slow: error after all chunks sent |
| POST /testing/streaming/slow (success) | ~5024ms | ok | | |
| POST /testing/streaming/fast (error) | ~67ms | error | java.lang.RuntimeException | Streaming fast: error after all chunks sent |
| POST /testing/streaming/fast (success) | ~64ms | ok | | |

Full streaming duration is preserved for both success and error cases.


Note: This is a demo PR showing the fix applied to the GoodNotes org fork.
The same fix was submitted upstream: DataDog#10656

@gtukmachev gtukmachev force-pushed the gsc-577-fix-ktor-streaming-instrumentation branch from 14b968a to ed004a0 Compare May 13, 2026 21:59
@github-actions

github-actions Bot commented May 13, 2026

Hi! 👋 Thanks for your pull request! 🎉

To help us review it, please make sure to:

  • Add at least one type, and one component or instrumentation label to the pull request

If you need help, please check our contributing guidelines.

@gtukmachev gtukmachev force-pushed the gsc-577-fix-ktor-streaming-instrumentation branch 2 times, most recently from c06a783 to bca88d1 Compare May 13, 2026 22:23
New JUnit 5 test class NettyChunkedResponseTest with a real Netty server
that writes chunked responses manually (HttpResponse + HttpContent* +
LastHttpContent), exercising the code path that HttpObjectAggregator-based
tests never reach.

Four test cases:
- chunkedResponseSpanIncludesFullStreamDuration: span covers full stream
  time (~1s for 5 chunks x 200ms), not just header-send time (~0ms)
- fullResponseStillFinishesSpanImmediately: FullHttpResponse regression
- keepAliveSequentialChunkedRequestsGetCorrectSpans: STREAMING_CONTEXT_KEY
  lifecycle across back-to-back keep-alive requests
- connectionDropDuringChunkedResponseFinishesSpan: span finished with
  error when client disconnects mid-stream

HttpServerResponseTracingHandler: route by message type (FullHttpResponse,
HttpResponse, LastHttpContent) instead of finishing every span on
HttpResponse. FullHttpResponse finishes immediately; HttpResponse defers
to LastHttpContent via STREAMING_CONTEXT_KEY to avoid keep-alive race.
WebSocket upgrades and bodyless responses (204, 205, 304) finish
immediately since they never produce LastHttpContent.

HttpServerRequestTracingHandler: channelInactive now checks
STREAMING_CONTEXT_KEY and finishes leaked spans when channel closes
mid-stream.

AttributeKeys: added STREAMING_CONTEXT_KEY for chunked response context.
@gtukmachev gtukmachev force-pushed the gsc-577-fix-ktor-streaming-instrumentation branch from bca88d1 to 77cc414 Compare May 14, 2026 09:54