Add distributed tracing (OpenTelemetry) support #266

Open
torosent wants to merge 23 commits into main from feature/distributed-tracing-otel
torosent (Member) commented Mar 3, 2026

Issue describing the changes in this PR

Adds distributed tracing support to the Java SDK using OpenTelemetry, aligned with the .NET SDK tracing conventions in microsoft/durabletask-dotnet.

The SDK now automatically propagates W3C Trace Context (traceparent/tracestate) from client → orchestrations → activities → sub-orchestrations, and creates OTel spans around activity and orchestration execution on the worker side.
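To make the propagation step concrete, here is a minimal sketch of how a W3C `traceparent` value is laid out and how a child hop keeps the trace ID while substituting its own span ID. The class `TraceParentSketch` and its methods are hypothetical illustrations, not the SDK's `TracingHelper` API, and only the version `00` layout is handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative only: formats and parses a version-00 W3C traceparent value. */
final class TraceParentSketch {
    // Layout: version "00", 32-hex trace ID, 16-hex span ID, 2-hex flags.
    private static final Pattern TRACEPARENT =
            Pattern.compile("^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$");

    private TraceParentSketch() {}

    static String format(String traceId, String spanId, boolean sampled) {
        return "00-" + traceId + "-" + spanId + "-" + (sampled ? "01" : "00");
    }

    /** Returns {traceId, spanId, flags}, or null if the value does not match the layout. */
    static String[] parse(String header) {
        if (header == null) {
            return null;
        }
        Matcher m = TRACEPARENT.matcher(header);
        if (!m.matches()) {
            return null;
        }
        return new String[] {m.group(1), m.group(2), m.group(3)};
    }

    /** A child hop keeps the trace ID and flags and swaps in its own span ID. */
    static String childOf(String parentHeader, String childSpanId) {
        String[] parts = parse(parentHeader);
        if (parts == null) {
            return null;
        }
        return "00-" + parts[0] + "-" + childSpanId + "-" + parts[2];
    }
}
```

Every hop from client to orchestration to activity shares the same trace ID, which is what lets a backend such as Jaeger group the spans into a single trace.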

Changes

Core SDK (client/):

  • TracingHelper.java (new) — Utility class with getCurrentTraceContext(), extractTraceContext(), startSpan(), startSpanWithStartTime(), endSpan(), emitRetroactiveClientSpan(), emitTimerSpan(), emitEventRaisedFromWorkerSpan(), emitEventRaisedFromClientSpan(). Uses Microsoft.DurableTask as the tracer name (matching .NET ActivitySource).
  • DurableTaskGrpcClient.java — Auto-creates create_orchestration span in scheduleNewOrchestrationInstance(); added event span on raiseEvent()
  • TaskOrchestrationExecutor.java — Reads parentTraceContext from ExecutionStartedEvent, emits retroactive Client spans at task completion/failure (with scheduling-to-completion duration), emits timer spans with creation-to-fired duration, emits event spans on sendEvent()
  • DurableTaskGrpcWorker.java — Wraps activity execution in Server spans; emits orchestration span on completion with ExecutionStartedEvent timestamp as start time for full lifecycle coverage
  • OrchestrationRunner.java — Adds orchestration span for Azure Functions execution path

Span types (aligned with .NET SDK TraceActivityConstants.cs):

| Span Type | Kind | When | Name Pattern |
| --- | --- | --- | --- |
| create_orchestration | Internal | Client schedules orchestration | `create_orchestration:<name>` |
| orchestration | Server | Orchestration completes/terminates | `orchestration:<name>` |
| activity | Client+Server | Retroactive client at completion + server during execution | `activity:<name>` |
| timer | Internal | Timer fires (spans creation→fired duration) | `orchestration:<name>:timer` |
| event / orchestration_event | Producer | Event raised from worker or client | `orchestration_event:<eventName>` |
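The name patterns in the table can be sketched as simple string builders. `SpanNameSketch` is a hypothetical helper written for illustration; the PR does not show how `TracingHelper` actually assembles names.

```java
/** Illustrative only: builds span names following the patterns in the table above. */
final class SpanNameSketch {
    private SpanNameSketch() {}

    static String createOrchestration(String name) {
        return "create_orchestration:" + name;
    }

    static String orchestration(String name) {
        return "orchestration:" + name;
    }

    static String activity(String name) {
        return "activity:" + name;
    }

    // Timer spans are named after the owning orchestration, not the timer itself.
    static String timer(String orchestrationName) {
        return "orchestration:" + orchestrationName + ":timer";
    }

    static String event(String eventName) {
        return "orchestration_event:" + eventName;
    }
}
```

For the FanOutFanIn sample this yields names such as `orchestration:FanOutFanIn:timer` and `activity:GetWeather`, matching the span names quoted in the screenshots section.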

Span attributes (aligned with .NET SDK Schema.cs):

  • durabletask.type: "orchestration", "activity", "create_orchestration", "timer", "event"
  • durabletask.task.name — orchestration/activity/event name
  • durabletask.task.instance_id — orchestration instance ID
  • durabletask.task.task_id — activity/timer task ID
  • durabletask.fire_at — timer fire time (ISO-8601)
  • durabletask.event.target_instance_id — target instance for raised events
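As a rough sketch of how these keys fit together on a single span, the snippet below assembles attribute maps for an activity span and a timer span. The attribute keys come from the list above; the class `SpanAttributesSketch`, its method signatures, and the choice of which keys a timer span carries are assumptions made for illustration.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative only: assembles attribute maps using the keys listed above. */
final class SpanAttributesSketch {
    static final String TYPE = "durabletask.type";
    static final String TASK_NAME = "durabletask.task.name";
    static final String INSTANCE_ID = "durabletask.task.instance_id";
    static final String TASK_ID = "durabletask.task.task_id";
    static final String FIRE_AT = "durabletask.fire_at";

    private SpanAttributesSketch() {}

    static Map<String, String> forActivity(String name, String instanceId, int taskId) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put(TYPE, "activity");
        attrs.put(TASK_NAME, name);
        attrs.put(INSTANCE_ID, instanceId);
        attrs.put(TASK_ID, Integer.toString(taskId));
        return attrs;
    }

    // Assumption: a timer span carries type, instance ID, task ID, and fire time.
    static Map<String, String> forTimer(String instanceId, int taskId, Instant fireAt) {
        Map<String, String> attrs = new LinkedHashMap<>();
        attrs.put(TYPE, "timer");
        attrs.put(INSTANCE_ID, instanceId);
        attrs.put(TASK_ID, Integer.toString(taskId));
        attrs.put(FIRE_AT, fireAt.toString()); // Instant.toString() is ISO-8601
        return attrs;
    }
}
```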

Span durations:

  • Orchestration span — Full lifecycle from ExecutionStartedEvent timestamp to completion
  • Activity client spans — Retroactive: scheduling timestamp to completion (matches .NET EmitTraceActivityForTaskCompleted)
  • Activity server spans — Actual execution duration on worker
  • Timer spans: TimerCreated timestamp to TimerFired processing time (matches .NET EmitTraceActivityForTimer)
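The retroactive durations above amount to pairing a timestamp recorded in history with the time at which the completion is processed. The sketch below shows that arithmetic with plain `java.time` types; `RetroactiveSpanSketch` is a hypothetical class, and the clamp against a negative duration is an added assumption rather than something the PR describes.

```java
import java.time.Duration;
import java.time.Instant;

/** Illustrative only: a span whose start time is taken from a recorded history event. */
final class RetroactiveSpanSketch {
    final String name;
    final Instant start;
    final Instant end;

    private RetroactiveSpanSketch(String name, Instant start, Instant end) {
        this.name = name;
        this.start = start;
        this.end = end;
    }

    /**
     * Builds the span at completion time, using the recorded scheduling (or
     * timer-created) timestamp as the start.
     */
    static RetroactiveSpanSketch emit(String name, Instant recordedStart, Instant completedAt) {
        // Assumption: clamp so that clock skew never produces a negative duration.
        Instant end = completedAt.isBefore(recordedStart) ? recordedStart : completedAt;
        return new RetroactiveSpanSketch(name, recordedStart, end);
    }

    Duration duration() {
        return Duration.between(start, end);
    }
}
```

With the OpenTelemetry API, the equivalent step would pass the recorded timestamp to the span builder's `setStartTimestamp()` before starting the span, as the later commit messages in this PR mention.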

Note: Java OTel doesn't support SetSpanId() like .NET's Activity.SetSpanId(), so child spans appear as siblings under create_orchestration rather than nested orchestration > client > server. All spans have meaningful durations.

Tests (21 tests):

  • TracingHelperTest.java — 19 tests covering all utility methods including retroactive client spans, timer spans with start time, event spans, round-trip context propagation, error recording, and SpanContext validation
  • TaskOrchestrationExecutorTest.java — 4 tests verifying trace context propagation to activities and sub-orchestrations

Samples:

  • TracingPattern.java — Fan-Out/Fan-In sample with 1s timer, 5× parallel GetWeather + CreateSummary. Uses DTS emulator + Jaeger OTLP exporter.
  • TracingChain.java — Azure Functions sample with HTTP trigger, chained activities, and sub-orchestration
  • samples/README.md — Documentation with run instructions and screenshots

Screenshots

Jaeger — Trace search showing FanOutFanIn trace (19 spans):
Jaeger trace list

Jaeger — Full trace detail with proper span durations:
create_orchestration:FanOutFanIn (108ms) → orchestration:FanOutFanIn (1.23s) + orchestration:FanOutFanIn:timer (965ms) + 5× activity:GetWeather client (184ms) + 5× server (25ms) + activity:CreateSummary client (8ms) + server (0.7ms)
Jaeger trace detail

Jaeger — Span detail showing attributes (aligned with .NET SDK schema):
Shows durabletask.type=activity, durabletask.task.name=GetWeather, durabletask.task.task_id=3, otel.scope.name=Microsoft.DurableTask, span.kind=client
Jaeger span detail

DTS Dashboard — FanOutFanIn orchestration completed (1.23s):
DTS Dashboard

Pull request checklist

  • My changes do not require documentation changes
    • Otherwise: Documentation issue linked to PR
  • My changes are added to the CHANGELOG.md
  • I have added all required tests (Unit tests, E2E tests)

Additional information

  • No new runtime dependencies — uses existing opentelemetry-api and opentelemetry-context
  • No breaking API changes — all additions are internal
  • Graceful degradation — OTel API returns no-op spans when no SDK is configured
  • Proto fields (parentTraceContext) already exist in the proto definition
  • Works for DTS, Durable Functions, and standalone Durable Task SDK
  • Verified end-to-end with DTS emulator and Jaeger
  • Tracing schema fully aligned with .NET SDK (Schema.cs, TraceActivityConstants.cs, TraceHelper.cs)

Add W3C Trace Context propagation throughout the SDK, enabling
end-to-end distributed tracing from client to orchestrations,
activities, and sub-orchestrations.

Core changes:
- TracingHelper.java: utility class for trace context capture,
  extraction, and span management
- DurableTaskGrpcClient: refactored to use TracingHelper
- TaskOrchestrationExecutor: reads parentTraceContext from
  ExecutionStartedEvent and propagates to ScheduleTaskAction
  and CreateSubOrchestrationAction
- DurableTaskGrpcWorker: wraps activity and orchestration
  execution in OTel spans with proper scope management
- OrchestrationRunner: adds orchestration span for Azure
  Functions execution path

Tests:
- TracingHelperTest: 12 tests covering all utility methods
- TaskOrchestrationExecutorTest: 3 new tests verifying trace
  context propagation to activities and sub-orchestrations

Samples:
- TracingPattern.java: standalone SDK sample with DTS emulator
  and Jaeger OTLP exporter
- TracingChain.java: Azure Functions sample with chained
  activities and sub-orchestration
- README.md with screenshots showing Jaeger traces and DTS
  dashboard

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@torosent torosent requested a review from a team as a code owner March 3, 2026 20:00
Copilot AI review requested due to automatic review settings March 3, 2026 20:00
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI (Contributor) left a comment

Pull request overview

Adds OpenTelemetry-based distributed tracing to the Durable Task Java SDK, including W3C Trace Context propagation through orchestration/activity execution and runnable samples/tests to validate and demonstrate the behavior.

Changes:

  • Introduces TracingHelper utilities for W3C Trace Context capture/extraction and span lifecycle.
  • Propagates parentTraceContext from ExecutionStartedEvent into activity and sub-orchestration scheduling actions.
  • Wraps orchestration/activity execution paths in OpenTelemetry spans and adds sample + unit tests.

Reviewed changes

Copilot reviewed 13 out of 16 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| client/src/main/java/com/microsoft/durabletask/TracingHelper.java | Adds helper methods for capturing/extracting W3C context and starting/ending spans. |
| client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcClient.java | Refactors orchestration scheduling trace capture to use TracingHelper. |
| client/src/main/java/com/microsoft/durabletask/TaskOrchestrationExecutor.java | Stores parentTraceContext from history and propagates it to activity/sub-orchestration actions. |
| client/src/main/java/com/microsoft/durabletask/DurableTaskGrpcWorker.java | Adds worker-side spans around orchestration and activity execution. |
| client/src/main/java/com/microsoft/durabletask/OrchestrationRunner.java | Adds orchestration span for the Azure Functions execution path. |
| client/src/test/java/com/microsoft/durabletask/TracingHelperTest.java | Adds unit coverage for trace context round-tripping and span error recording. |
| client/src/test/java/com/microsoft/durabletask/TaskOrchestrationExecutorTest.java | Adds tests to verify trace context propagation into scheduled actions. |
| client/build.gradle | Adds OpenTelemetry SDK/testing deps for unit tests. |
| samples/src/main/java/io/durabletask/samples/TracingPattern.java | Adds a runnable DTS+Jaeger tracing sample. |
| samples/build.gradle | Adds a Gradle run task and OpenTelemetry deps for the tracing sample. |
| samples/README.md | Documents how to run and view the tracing sample. |
| samples-azure-functions/src/main/java/com/functions/TracingChain.java | Adds an Azure Functions sample demonstrating trace propagation. |
| samples/images/dts-dashboard-completed.png | Adds documentation screenshot asset. |
| CHANGELOG.md | Notes the new tracing feature in Unreleased. |

torosent and others added 3 commits March 3, 2026 12:06
Add isValid() check after creating remote SpanContext to prevent
malformed trace contexts from propagating silently.
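The kind of check this commit describes can be sketched without the OpenTelemetry classes. Under the W3C Trace Context rules, a trace ID is 32 lowercase hex characters, a span ID is 16, and an all-zero value is invalid for either. `SpanContextValiditySketch` below is a hypothetical stand-in that applies those rules; it is not the SDK's code.

```java
/** Illustrative only: validity rules for W3C trace and span IDs. */
final class SpanContextValiditySketch {
    private SpanContextValiditySketch() {}

    static boolean isValid(String traceId, String spanId) {
        return isValidId(traceId, 32) && isValidId(spanId, 16);
    }

    private static boolean isValidId(String id, int expectedLength) {
        if (id == null || id.length() != expectedLength) {
            return false;
        }
        boolean sawNonZero = false;
        for (int i = 0; i < id.length(); i++) {
            char c = id.charAt(i);
            boolean lowerHex = (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f');
            if (!lowerHex) {
                return false;
            }
            if (c != '0') {
                sawNonZero = true;
            }
        }
        // An all-zero ID is reserved as "invalid" by the W3C specification.
        return sawNonZero;
    }
}
```

A caller would drop the extracted context when `isValid` returns false, so that a malformed parent never becomes the parent of a new span.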

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Tracer name: 'Microsoft.DurableTask' (matching .NET ActivitySource)
- Span kinds: Server for worker execution (matching .NET)
- Span naming: 'orchestration:<name>' and 'activity:<name>' (not instanceId)
- Add 'durabletask.type' attribute on all spans (matching .NET Schema.cs)
- Use shared constants for attribute keys and type values
- Extract orchestration name from ExecutionStartedEvent for span names

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Screenshots now reflect updated span naming (orchestration:<name>),
Server span kind, durabletask.type attribute, and
Microsoft.DurableTask tracer name. Added span detail screenshot
showing full attribute list.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
torosent (Member, Author) commented Mar 3, 2026

Re: bachuv's comment on Functions sample screenshots

I tested the Azure Functions sample locally with func start --java. The orchestration completes successfully:

TracingChain → TracingReverse → TracingCapitalize → TracingChildOrch → TracingAddSuffix
Status: Completed

However, Durable Functions distributed tracing V2 exports traces to Application Insights, not to OTLP/Jaeger. The extension requires APPLICATIONINSIGHTS_CONNECTION_STRING and the traces are generated by the .NET extension host, not the Java worker. The Java SDK's OrchestrationRunner creates OTel spans via the API, but they only become visible when an OTel SDK is configured in the Java worker (which Azure Functions doesn't do by default).

For Jaeger-style screenshots, we'd need either:

  1. An Application Insights instance (traces visible in Azure portal)
  2. A custom OTel exporter configured in the Functions Java worker

I'd suggest adding a note in the README explaining that Functions traces appear in Application Insights when deployed to Azure, and adding App Insights screenshots as a follow-up when we have a test Azure environment. Would that work?

Copilot AI (Contributor) commented Mar 3, 2026

@torosent I've opened a new pull request, #267, to work on those changes. Once the pull request is ready, I'll request review from you.

torosent and others added 6 commits March 3, 2026 12:26
- Replaced chaining sample with Fan-Out/Fan-In pattern (5× GetWeather + CreateSummary)
- Updated README.md to reflect FanOutFanIn span hierarchy
- Captured updated Jaeger screenshots showing parallel activity spans

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Added createClientSpan() to TracingHelper for orchestrator scheduling spans
- TaskOrchestrationExecutor creates Client-kind spans when scheduling activities
  and sub-orchestrations (only during non-replay to avoid duplicates)
- DurableTaskGrpcWorker creates orchestration Server span only for first execution
- Trace now shows 14 spans with Depth 3, matching .NET SDK exactly:
  create_orchestration (root) → orchestration (server) →
  activity (client) → activity (server) for each task
- Updated screenshots showing paired Client+Server span hierarchy
- Added createClientSpan test coverage (2 new tests, 21 total passing)
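The non-replay guard mentioned in this commit exists because an orchestrator function re-runs from the start of its history on every dispatch, so any span emitted unconditionally would appear once per replay. A minimal sketch of the guard, using a hypothetical `ReplayGuardSketch` class and a plain list in place of real spans:

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: emits a scheduling span once, skipping replayed history. */
final class ReplayGuardSketch {
    private final List<String> emitted = new ArrayList<>();

    /** Called each time orchestrator code schedules an activity, replayed or not. */
    void onActivityScheduled(String activityName, boolean isReplaying) {
        if (isReplaying) {
            // This scheduling decision was already made in an earlier dispatch.
            return;
        }
        emitted.add("activity:" + activityName);
    }

    List<String> emitted() {
        return emitted;
    }
}
```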

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Activity worker: track error separately, end span once in finally block
  (fixes double span.end() call on error path)
- OrchestrationRunner: close scope before ending span, add null check
  (fixes scope/span lifecycle ordering and potential NPE)
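The lifecycle ordering described in this commit can be sketched as follows: remember the error instead of ending the span in the catch block, close the scope first, then end the span exactly once in `finally`. `FakeSpan` and `Scope` are hypothetical stand-ins that record calls so the ordering is visible; they are not OpenTelemetry types.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: end a span exactly once, after its scope is closed. */
final class SpanLifecycleSketch {
    /** Hypothetical scope whose close() throws no checked exception. */
    interface Scope extends AutoCloseable {
        @Override
        void close();
    }

    /** Hypothetical span that records the calls made on it. */
    static final class FakeSpan {
        final List<String> calls = new ArrayList<>();

        Scope makeCurrent() {
            return () -> calls.add("scope.close");
        }

        void recordError(Throwable t) {
            calls.add("error:" + t.getMessage());
        }

        void end() {
            calls.add("span.end");
        }
    }

    private SpanLifecycleSketch() {}

    static void runActivity(FakeSpan span, Runnable body) {
        Throwable error = null;
        Scope scope = span.makeCurrent();
        try {
            body.run();
        } catch (RuntimeException e) {
            error = e; // remember the failure; do not end the span here
        } finally {
            scope.close(); // restore the previous context first
            if (error != null) {
                span.recordError(error);
            }
            span.end(); // single end() on both the success and error paths
        }
    }
}
```

This sketch swallows the exception to stay short; a real worker would still report the failure to the caller.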

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Timer span: emitted on TIMERFIRED in orchestrator (Internal kind,
  durabletask.fire_at attribute, name: orchestration:<name>:timer)
- Event span from worker: emitted on sendEvent in orchestrator
  (Producer kind, name: orchestration_event:<eventName>)
- Event span from client: emitted on raiseEvent in client
  (Producer kind, name: orchestration_event:<eventName>)
- Added 3 new tests (17 TracingHelper tests, 24 total)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Timer span now correctly uses parentTraceContext as parent, linking
  it to the orchestration trace instead of creating a separate trace
- Updated sample to include 1-second durable timer before fan-out
- Updated screenshots showing 15 spans with timer span in hierarchy
- Updated README with timer span documentation

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Move create_orchestration span into scheduleNewOrchestrationInstance()
  so SDK creates it automatically (bachuv feedback: span should not be
  in user code). TYPE_CREATE_ORCHESTRATION is now used by the client.
- Remove manual span creation from TracingPattern sample; simplify code
- Update all 4 screenshots with latest trace structure
- README span detail shows activity:GetWeather attributes

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andystaples (Contributor) commented Mar 3, 2026

Posting a Jaeger trace from a .NET orchestration, I see a few discrepancies from the screenshots posted in the PR description:
image

  1. The timer span is instantaneous in the Java screenshots, whereas in .NET it spans from the time it was scheduled to the time it fired.
  2. The same issue applies to the outer "client" spans for activities: in .NET they encompass the server span, but in Java they are instantaneous.
  3. The orchestrator span is also instantaneous; it should cover the full execution.

…vent timestamp

Use ExecutionStartedEvent timestamp as orchestration span start time instead of
OrchestrationTraceContext.spanStartTime (which is not populated by DTS).
Emit orchestration span only on completion/termination, with startTime from
the first ExecutionStartedEvent, so it visually wraps all child activity spans.

Added TracingHelper.startSpanWithStartTime() for creating spans with explicit
start timestamps.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
bachuv (Contributor) commented Mar 3, 2026

> Re: bachuv's comment on Functions sample screenshots
>
> The Azure Functions sample (TracingChain.java) requires the Azure Functions runtime (func start) with Application Insights or an OTel exporter configured in host.json, which makes it harder to capture standalone screenshots like the DTS emulator + Jaeger setup.
>
> @bachuv would you prefer:
>
>   1. Screenshots from running with Azure Functions Core Tools + Jaeger (requires local Functions runtime setup)
>   2. A note in the README explaining how to view traces in Application Insights when deployed to Azure
>
> Happy to add either approach as a follow-up.

I would prefer option 1 and I'm also happy to add these as a follow up item.

torosent and others added 2 commits March 3, 2026 14:13
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
Address PR #266 review feedback (comment #3993852725):

1. Timer spans now have proper duration from creation to fired time,
   using TimerCreated event timestamp as setStartTimestamp().

2. Activity/sub-orchestration client spans now have proper duration
   from scheduling to completion, created retroactively at completion
   time with TaskScheduled event timestamp as setStartTimestamp().

3. Removed instantaneous client span creation at scheduling time;
   propagate orchestration's parentTraceContext directly instead.

Note: Java OTel doesn't support SetSpanId() like .NET, so child spans
are siblings under create_orchestration rather than nested under the
orchestration span. All 15 spans have meaningful durations.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
torosent (Member, Author) commented Mar 3, 2026

Addressed all three issues from your feedback:

  1. Timer spans now have proper duration — spans from timer creation time to fired time using setStartTimestamp() from the TimerCreated event timestamp.
  2. Activity client spans now have proper duration — created retroactively at completion time with setStartTimestamp() from the TaskScheduled event timestamp, covering scheduling-to-completion.
  3. Orchestration span covers full lifecycle (fixed in previous commit).

Note on Java OTel limitation: Java OTel doesn't support SetSpanId() like .NET's Activity.SetSpanId(), so we can't make the retroactive client span take on the same span ID as the server span's parent. As a result, child spans are siblings under create_orchestration rather than nested orchestration > client > server. All 15 spans have meaningful durations.

Updated Jaeger screenshot:
jaeger-trace

torosent and others added 3 commits March 3, 2026 14:40
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updated all screenshots to reflect the latest span durations:
- Timer spans now show creation-to-fired duration (~965ms)
- Activity client spans show scheduling-to-completion duration (~184ms)
- Activity server spans show execution duration (~25ms)
- Orchestration span covers full lifecycle (~1.23s)
- 15 total spans in a clean, coherent trace

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updated TracingChain.java to use the same Fan-Out/Fan-In pattern as the
DTS sample (TracingPattern.java): 1s timer → 5× GetWeather → CreateSummary.
This ensures both samples are consistent and demonstrate the same tracing
capabilities.

Updated samples/README.md with Azure Functions section explaining that
Durable Functions tracing exports to Application Insights, not Jaeger.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
andystaples previously approved these changes Mar 3, 2026
- Remove unused createClientSpan() method (52 lines) and its 2 tests.
  Replaced by emitRetroactiveClientSpan() which creates spans with
  proper scheduling-to-completion duration.

- Extract emitClientSpanIfTracked() helper to eliminate 4× duplicated
  retroactive client span emission blocks in task/sub-orchestration
  completed/failed handlers.

- Extract storeSchedulingMetadata() helper to consolidate 2× duplicated
  scheduling metadata storage in handleTaskScheduled and
  handleSubOrchestrationCreated.

Net: -151 lines.
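The two helpers named in this commit suggest a simple pattern: store a task's name and scheduling time when it is scheduled, then look it up and emit a span when the task completes or fails. The sketch below reuses the helper names from the commit message, but the signatures, the `Scheduled` record-like class, and the string output are assumptions for illustration only.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only: track scheduling metadata, emit a client span on completion. */
final class SchedulingTrackerSketch {
    private static final class Scheduled {
        final String name;
        final Instant at;

        Scheduled(String name, Instant at) {
            this.name = name;
            this.at = at;
        }
    }

    private final Map<Integer, Scheduled> pending = new HashMap<>();
    final List<String> emitted = new ArrayList<>();

    /** Shared by the task-scheduled and sub-orchestration-created handlers. */
    void storeSchedulingMetadata(int taskId, String name, Instant scheduledAt) {
        pending.put(taskId, new Scheduled(name, scheduledAt));
    }

    /** Shared by the completed and failed handlers; a no-op for untracked task IDs. */
    void emitClientSpanIfTracked(int taskId, Instant completedAt) {
        Scheduled s = pending.remove(taskId);
        if (s == null) {
            return; // nothing was recorded, or the span was already emitted
        }
        long millis = Duration.between(s.at, completedAt).toMillis();
        emitted.add("activity:" + s.name + " " + millis + "ms");
    }
}
```

Removing the entry on lookup means a second completion event for the same task ID cannot produce a duplicate span.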

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…pans

Start orchestration span BEFORE executor runs so child spans (activities,
timers, client spans) are nested under it. Each dispatch creates its own
orchestration span, matching JS/dotnet behavior (multiple orchestration
spans per trace). Depth is now 3: create_orchestration → orchestration →
activity/timer spans.

Updated all Jaeger and DTS dashboard screenshots.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
torosent and others added 2 commits March 3, 2026 16:41
Co-authored-by: Copilot Autofix powered by AI <223894421+github-code-quality[bot]@users.noreply.github.com>
…ed orchestration

1. Changed create_orchestration span from INTERNAL (default) to PRODUCER,
   matching .NET SDK's StartActivityForNewOrchestration which uses
   ActivityKind.Producer.

2. Set ERROR status on orchestration span when the orchestration fails,
   matching .NET SDK's pattern of checking CompleteOrchestration action
   for FAILED status before disposing the span.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

In the new screenshot, it seems like the orchestration:FanOutFanIn span is showing up multiple times (for each replay?). If I'm looking at the correct orchestration sample code, I would expect the following spans:

  • create_orchestration:FanOutFanIn
  • orchestration:FanOutFanIn
  • timer
  • activity:GetWeather
  • activity:GetWeather
  • activity:GetWeather
  • activity:GetWeather
  • activity:GetWeather
  • activity:CreateSummary

Let me know if I'm missing something here and the screenshot is showing the expected number and types of spans.

5 participants