Observability for Java Microservices: ELK, OpenTelemetry & Distributed Tracing

When a single user request fans out across a dozen Java microservices, “check the logs” stops being a strategy. You need to ask, “show me everything that happened for this request, across every service, in order” — and get an answer in seconds. That is what observability delivers, built on three pillars: structured logs, metrics, and distributed traces. This deep dive shows how enterprises wire it up for Spring Boot using the Elastic (ELK) stack and OpenTelemetry.

TL;DR: Emit logs as JSON with a trace ID on every line, ship them to Elasticsearch and explore in Kibana, and instrument services with Micrometer Tracing (the Spring Boot 3 successor to Sleuth) bridged to OpenTelemetry, exporting spans to Jaeger, Tempo, or Zipkin. The trace ID is the join key that lets you pivot from a slow span to the exact log lines that produced it.


The three pillars, and why you need all of them

Logs answer “what exactly happened” — the detailed, high-cardinality record of events.
Metrics answer “how much / how often / how fast” — cheap, aggregatable time series for dashboards and alerts (rate, errors, duration).
Traces answer “where did the time go” — the end-to-end path of one request across services, with timing per hop.

Each is weak alone. A metric tells you latency spiked but not why; a trace shows the slow service but not the exception detail; a log has the exception but no context about the broader request. The power comes from correlating them — and the correlation key is the trace ID.

Structured logging with correlation IDs

The first step is to stop emitting free-text logs. Machine-parseable JSON, indexed in Elasticsearch, turns logs from a haystack into a queryable database. Use Logback with a JSON encoder, and put contextual fields in the MDC (Mapped Diagnostic Context) so every line carries the request’s identity.

<!-- logback-spring.xml -->
<configuration>
  <appender name="JSON" class="ch.qos.logback.core.ConsoleAppender">
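    <encoder class="net.logstash.logback.encoder.LogstashEncoder">
      <!-- By default the encoder emits every MDC entry; listing keys restricts output to them. -->
      <includeMdcKeyName>traceId</includeMdcKeyName>
      <includeMdcKeyName>spanId</includeMdcKeyName>
      <!-- Static field added to every line so logs can be filtered by service. -->
      <customFields>{"service":"orders"}</customFields>
    </encoder>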
  </appender>
  <root level="INFO">
    <appender-ref ref="JSON" />
  </root>
</configuration>
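
To make the MDC idea concrete, here is a toy model in plain Java of what "every line carries the request's identity" means: a per-thread context map whose entries are appended to each JSON event. This is a sketch of the concept only, not Logback's or SLF4J's implementation; the class name ContextLog and its methods are hypothetical.

```java
import java.time.Instant;
import java.util.LinkedHashMap;
import java.util.Map;

/** Toy model of MDC plus a JSON encoder; hypothetical names, not Logback's implementation. */
final class ContextLog {
    // Per-thread context map, the same idea as SLF4J's MDC.
    private static final ThreadLocal<Map<String, String>> CONTEXT =
            ThreadLocal.withInitial(LinkedHashMap::new);

    static void put(String key, String value) { CONTEXT.get().put(key, value); }

    // Must be called when the request ends, or pooled threads leak stale context.
    static void clear() { CONTEXT.remove(); }

    /** Renders one event as a single-line JSON object, appending every context field. */
    static String format(Instant timestamp, String level, String message) {
        StringBuilder json = new StringBuilder("{");
        appendField(json, "@timestamp", timestamp.toString());
        appendField(json, "level", level);
        appendField(json, "message", message);
        CONTEXT.get().forEach((key, value) -> appendField(json, key, value));
        return json.append('}').toString();
    }

    private static void appendField(StringBuilder json, String key, String value) {
        if (json.length() > 1) json.append(',');
        json.append('"').append(escape(key)).append("\":\"").append(escape(value)).append('"');
    }

    // Minimal escaping: backslash, quote, and control characters.
    private static String escape(String raw) {
        StringBuilder out = new StringBuilder();
        for (char c : raw.toCharArray()) {
            if (c == '"' || c == '\\') out.append('\\').append(c);
            else if (c < 0x20) out.append(String.format("\\u%04x", (int) c));
            else out.append(c);
        }
        return out.toString();
    }
}
```

The application code that logs never passes the trace ID explicitly; whatever was put into the context at the start of the request rides along on every line until the context is cleared.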

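You rarely set traceId by hand; the tracing library injects it into the MDC for you (next section). The result is log lines like this (abbreviated), where every event for a request shares one traceId. Fields outside the MDC allow-list, such as orderId here, are assumed to be attached per event as structured arguments (for example with StructuredArguments.kv from logstash-logback-encoder):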

{"@timestamp":"2026-06-19T14:02:11.503Z","level":"INFO",
 "logger_name":"com.acme.OrderService","message":"order placed",
 "service":"orders","traceId":"7d3a...","spanId":"a91f...","orderId":"O-8842"}

Distributed tracing: Micrometer Tracing + OpenTelemetry

In Spring Boot 2, tracing meant Spring Cloud Sleuth. In Spring Boot 3, Sleuth is replaced by Micrometer Tracing, a vendor-neutral facade that bridges to either OpenTelemetry or Brave and propagates context automatically. OpenTelemetry (OTel) is the CNCF standard for generating and exporting telemetry; pairing the two is the current enterprise default.

<dependency>
  <groupId>io.micrometer</groupId>
  <artifactId>micrometer-tracing-bridge-otel</artifactId>
</dependency>
<dependency>
  <groupId>io.opentelemetry</groupId>
  <artifactId>opentelemetry-exporter-otlp</artifactId>
</dependency>

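With those on the classpath (alongside spring-boot-starter-actuator, which hosts the tracing auto-configuration), Spring Boot instruments incoming HTTP requests and outgoing RestClient/WebClient calls made through the auto-configured builders; messaging clients such as Kafka or RabbitMQ typically need observation switched on in their own configuration. A trace represents the whole request; each unit of work is a span; and the trace context (W3C traceparent header) is propagated across service boundaries automatically, so service B’s spans attach to the same trace that started in service A.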
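
To show what actually crosses the wire, the following sketch parses and re-emits a W3C traceparent header by hand. In a real service the tracing library does this for you; the record name TraceParent and its methods are hypothetical, and the sketch handles only version 00 of the header format.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Parsed W3C traceparent header (version 00 only). Illustrative, not a library API. */
record TraceParent(String traceId, String parentSpanId, boolean sampled) {

    // version "00", 32 hex chars of trace ID, 16 hex chars of parent span ID, 2 hex chars of flags
    private static final Pattern FORMAT =
            Pattern.compile("00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})");

    /** Returns empty for malformed headers or all-zero IDs, which the spec treats as invalid. */
    static Optional<TraceParent> parse(String header) {
        if (header == null) return Optional.empty();
        Matcher m = FORMAT.matcher(header.trim());
        if (!m.matches()) return Optional.empty();
        String traceId = m.group(1);
        String spanId = m.group(2);
        if (traceId.equals("0".repeat(32)) || spanId.equals("0".repeat(16))) {
            return Optional.empty();
        }
        // Bit 0 of the flags byte is the "sampled" flag.
        boolean sampled = (Integer.parseInt(m.group(3), 16) & 0x01) != 0;
        return Optional.of(new TraceParent(traceId, spanId, sampled));
    }

    /** Header for an outgoing call: same trace ID, the caller's own span ID as the new parent. */
    String forChild(String callerSpanId) {
        return "00-" + traceId + "-" + callerSpanId + "-" + (sampled ? "01" : "00");
    }
}
```

The trace ID stays constant from hop to hop while the span ID changes, which is why service B's spans can be stitched under service A's.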

management:
  tracing:
    sampling:
      probability: 0.1        # sample 10% in high-traffic prod
  otlp:
    tracing:
      endpoint: http://otel-collector:4318/v1/traces

Sampling is a real decision: tracing 100% of requests is expensive at scale, so teams sample a percentage (or use tail-based sampling in the OTel Collector to keep all error/slow traces while sampling the rest). To enrich a trace with a custom span, you can annotate a method with Micrometer’s @Observed, which takes effect only when Spring AOP (spring-boot-starter-aop) and an ObservedAspect bean are present:

import io.micrometer.observation.annotation.Observed;

@Observed(name = "inventory.reserve")   // Micrometer Observation -> span + timer metric
public Reservation reserve(String sku, int qty) { /* ... */ }
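
The sampling trade-off described above can be modeled in a few lines. This is a sketch of the decision logic only: in practice head sampling is done by the tracing library and tail-based sampling by the OTel Collector, not by application code. The types TraceSummary and SamplingPolicy, and the slow-trace threshold, are assumptions made for illustration.

```java
/** Summary of a finished trace; hypothetical type for illustration. */
record TraceSummary(String traceId, boolean hasError, long durationMillis) {}

final class SamplingPolicy {
    private final double probability;        // share of ordinary traces to keep, 0.0 to 1.0
    private final long slowThresholdMillis;  // traces at or above this duration are always kept

    SamplingPolicy(double probability, long slowThresholdMillis) {
        this.probability = probability;
        this.slowThresholdMillis = slowThresholdMillis;
    }

    /** Tail-style decision: always keep errors and slow traces, sample the rest. */
    boolean keep(TraceSummary trace) {
        if (trace.hasError() || trace.durationMillis() >= slowThresholdMillis) return true;
        return sampledByRatio(trace.traceId());
    }

    /** Deterministic per trace ID, so repeated evaluations reach the same verdict. */
    boolean sampledByRatio(String traceId) {
        if (probability >= 1.0) return true;
        if (probability <= 0.0) return false;
        // Treat the low 64 bits of the 32-char hex trace ID as a uniform number.
        long low = Long.parseUnsignedLong(traceId.substring(traceId.length() - 16), 16);
        long threshold = (long) (probability * Long.MAX_VALUE);
        return (low & Long.MAX_VALUE) < threshold;
    }
}
```

Deriving the verdict from the trace ID rather than from a random number means a trace is either kept everywhere or dropped everywhere, so sampled traces are never missing spans from one service.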

The Elastic (ELK) stack

ELK is the classic log platform: Elasticsearch stores and indexes, Logstash transforms, and Kibana visualizes. Modern deployments usually ship logs with lightweight Beats (Filebeat) or the OTel Collector rather than running heavyweight Logstash everywhere — collecting the JSON your services already write to stdout.

# filebeat.yml: tail container stdout, send to Elasticsearch
filebeat.inputs:
  - type: container
    paths: ["/var/log/containers/*.log"]
processors:
  # Parse the application's JSON out of the "message" field into top-level fields.
  - decode_json_fields:
      fields: ["message"]
      target: ""
      overwrite_keys: true
output.elasticsearch:
  hosts: ["https://elasticsearch:9200"]
# Once decoded, traceId/service/level become first-class,
# filterable fields in Kibana.

In Kibana you can now filter to service: "orders" AND level: "ERROR", or paste a traceId and see every log line from every service for that one request, in timestamp order.
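
Conceptually, the trace ID lookup is a filter followed by a sort. The sketch below does the same thing over an in-memory list, purely to illustrate the operation Elasticsearch performs at scale; the LogEvent record and TraceLogs class are hypothetical, with field names borrowed from the JSON sample earlier.

```java
import java.time.Instant;
import java.util.Comparator;
import java.util.List;

/** One indexed log event; field names mirror the JSON sample, the type itself is hypothetical. */
record LogEvent(Instant timestamp, String service, String level, String traceId, String message) {}

final class TraceLogs {
    /** All events for one trace, across services, oldest first. */
    static List<LogEvent> forTrace(List<LogEvent> events, String traceId) {
        return events.stream()
                .filter(e -> traceId.equals(e.traceId()))           // join on the trace ID
                .sorted(Comparator.comparing(LogEvent::timestamp))  // reconstruct request order
                .toList();
    }
}
```

Note that ordering by timestamp across services is only as good as the clock synchronization between hosts, which is one reason traces (with explicit parent/child links) complement logs rather than replace them.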

Correlating logs and traces — the payoff

This is where it comes together. You spot a latency spike on a metrics dashboard, open the trace in Jaeger or Grafana Tempo, and see the time was spent in the inventory service’s database span. You copy that trace’s ID, paste it into Kibana, and instantly see the exact SQL warning and stack trace logged during that span. Metric → trace → log, joined by the trace ID, in under a minute. Without correlation, that investigation is hours of guesswork across disconnected systems.

Managed options on AWS and Azure

Most enterprises do not self-host this whole stack. The cloud-native equivalents:

Concern	AWS	Azure
Logs	CloudWatch Logs, or Amazon OpenSearch Service (managed OpenSearch, the Elasticsearch fork, with OpenSearch Dashboards in place of Kibana)	Azure Monitor Logs / Log Analytics (KQL)
Traces	AWS X-Ray	Application Insights (distributed tracing)
Metrics	CloudWatch Metrics	Azure Monitor Metrics
Ingest	OTel Collector / CloudWatch agent	OTel Collector / App Insights agent

Because OpenTelemetry is vendor-neutral, the smart move is to instrument with OTel once and point the exporter at whichever backend (self-hosted ELK + Jaeger, AWS X-Ray, Azure Monitor, Datadog) your org runs — switching backends becomes a config change, not a re-instrumentation project.

Operational guardrails

Cardinality kills. Don’t put unbounded values (user IDs, request IDs) into metric tags — that explodes time-series count and cost. Those belong in logs and traces.
Set retention and ILM. Index lifecycle management rolls hot logs to cheaper storage and deletes them on schedule; unbounded log retention is a budget incident.
Log at the right level, and scrub PII. DEBUG everywhere in prod is noise and risk; never log secrets, tokens, or personal data.
Make trace IDs visible to users/support. Returning the trace ID in an error response turns a support ticket into a one-paste investigation.
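
To see why unbounded tags hurt, it helps to run the arithmetic behind the cardinality guardrail. The sketch below is a rough worst-case estimate that assumes every tag combination actually occurs; the class name SeriesEstimate and the tag counts in the usage note are made up for illustration.

```java
import java.util.Map;

final class SeriesEstimate {
    /** Worst-case series count for one metric: the product of distinct values per tag. */
    static long worstCase(Map<String, Long> distinctValuesPerTag) {
        long series = 1;
        for (long distinct : distinctValuesPerTag.values()) {
            // multiplyExact throws on overflow instead of silently wrapping around
            series = Math.multiplyExact(series, distinct);
        }
        return series;
    }
}
```

With hypothetical counts of 12 services, 5 status classes, and 40 endpoints, the worst case is 2,400 series for a single metric. Adding a userId tag with a million distinct values multiplies that to 2.4 billion, which is the explosion the guardrail warns about.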

Takeaways

Observability for Java microservices is a pipeline: structured JSON logs carrying a trace ID, metrics for dashboards and alerts, and distributed traces from Micrometer Tracing + OpenTelemetry — all correlated so you can pivot freely between them. Build it on open standards (OTel) so you stay portable across ELK, Jaeger/Tempo, AWS X-Ray, and Azure Monitor, and you turn “it’s slow somewhere” into a precise, minutes-long diagnosis.

Frequently asked questions

What replaced Spring Cloud Sleuth for tracing?
In Spring Boot 3, Spring Cloud Sleuth was replaced by Micrometer Tracing, which bridges to OpenTelemetry or Brave and exports to backends like Jaeger, Tempo, or Zipkin. It auto-propagates trace and span IDs and integrates them into your logs via MDC.

How do you correlate logs with distributed traces?
Put the trace ID and span ID into every log line (Micrometer Tracing adds them to the MDC automatically), emit logs as JSON, and index them in Elasticsearch. You can then jump from a slow span in Jaeger/Tempo to the exact log lines for that trace ID in Kibana.
