Fix Netty HTTP span lifecycle for chunked/streaming responses#1
Open
gtukmachev wants to merge 2 commits into
Open
Fix Netty HTTP span lifecycle for chunked/streaming responses#1gtukmachev wants to merge 2 commits into
gtukmachev wants to merge 2 commits into
Conversation
14b968a to
ed004a0
Compare
|
Hi! 👋 Thanks for your pull request! 🎉 To help us review it, please make sure to:
If you need help, please check our contributing guidelines. |
c06a783 to
bca88d1
Compare
New JUnit 5 test class NettyChunkedResponseTest with a real Netty server that writes chunked responses manually (HttpResponse + HttpContent* + LastHttpContent), exercising the code path that HttpObjectAggregator-based tests never reach. Four test cases: - chunkedResponseSpanIncludesFullStreamDuration: span covers full stream time (~1s for 5 chunks x 200ms), not just header-send time (~0ms) - fullResponseStillFinishesSpanImmediately: FullHttpResponse regression - keepAliveSequentialChunkedRequestsGetCorrectSpans: STREAMING_CONTEXT_KEY lifecycle across back-to-back keep-alive requests - connectionDropDuringChunkedResponseFinishesSpan: span finished with error when client disconnects mid-stream
HttpServerResponseTracingHandler: route by message type (FullHttpResponse, HttpResponse, LastHttpContent) instead of finishing every span on HttpResponse. FullHttpResponse finishes immediately; HttpResponse defers to LastHttpContent via STREAMING_CONTEXT_KEY to avoid keep-alive race. WebSocket upgrades and bodyless responses (204, 205, 304) finish immediately since they never produce LastHttpContent. HttpServerRequestTracingHandler: channelInactive now checks STREAMING_CONTEXT_KEY and finishes leaked spans when channel closes mid-stream. AttributeKeys: added STREAMING_CONTEXT_KEY for chunked response context.
bca88d1 to
77cc414
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
Applications using chunked HTTP responses (e.g. Ktor's
respondOutputStream) report near-zero latency in APM — the span closes when response headers are sent, not when the stream finishes.This affects any Netty-backed server that sends
HttpResponse+ multipleHttpContent+LastHttpContentseparately (i.e. chunked transfer encoding).Root Cause
Bug 1 — span always closed on
HttpResponse, noLastHttpContenthandlingThe original handler had a single dispatch branch. Every
HttpResponse(headers-only or full) immediately finished the span.LastHttpContent— the actual end of a chunked stream — was never inspected; it fell through toctx.write(msg, prm)silently.The fix adds explicit routing for all four Netty message types.
FullHttpResponsemust be checked first because it extends bothLastHttpContentandHttpResponse; without that ordering it would be caught by the wrong branch.Bug 2 — keep-alive race condition
Under HTTP keep-alive, Netty's event loop can process
channelReadfor the next request (overwritingCONTEXT_ATTRIBUTE_KEYwith the new span) before the pending write task for the previous response'sLastHttpContentruns. Result:handleLastHttpContentfinishes the new request's span with ~1-chunk duration.Fix: a dedicated
STREAMING_CONTEXT_KEYchannel attribute, set when chunked headers are sent and read (then cleared) byLastHttpContent— immune to overwrite by the next request.Files Changed
netty-4.1/.../HttpServerResponseTracingHandler.java— routing by message type, streaming context keynetty-common/.../AttributeKeys.java— addedSTREAMING_CONTEXT_KEYVerification
Concurrent load test (48 requests, 8 threads):
streaming/slow→ ~5020ms,streaming/fast→ ~61ms in DataDog APM. No outliers.Error capturing for streaming responses
The agent cannot change the HTTP status code once headers are sent. However, application code can mark the span as an error before
LastHttpContentcloses it, and the agent will preserve those tags.The window is: after the exception is thrown inside the streaming lambda, before the framework closes the
OutputStream(which writesLastHttpContentand triggersspan.finish()). Mark the span during that catch block and it will be captured correctly.Required pattern (application side)
Add
io.opentracing:opentracing-apiandio.opentracing:opentracing-utilas dependencies (the DD agent auto-registers as theGlobalTracer). Then wrap the streaming body in atry/catch:Why this works
The DD agent instruments Kotlin coroutines, so
GlobalTracer.get().activeSpan()returns the HTTP server span even insiderespondOutputStream's lambda onDispatchers.IO. The catch block runs before Ktor closes theOutputStreamand emitsLastHttpContent, giving the agent time to record the error beforespan.finish()is called.Verified result (DataDog APM)
POST /testing/streaming/slow(error)java.lang.RuntimeExceptionStreaming slow: error after all chunks sentPOST /testing/streaming/slow(success)POST /testing/streaming/fast(error)java.lang.RuntimeExceptionStreaming fast: error after all chunks sentPOST /testing/streaming/fast(success)Full streaming duration is preserved for both success and error cases.