Running Spring Boot on Kubernetes: Probes, Resources, Config & Autoscaling

Putting a Spring Boot container into Kubernetes is easy. Running it well is harder: it should restart only when truly broken, never receive traffic before it is ready, shut down without dropping requests, size its memory so the kernel does not kill it, and scale with load. This deep dive covers the production patterns: probes wired to Actuator, resource requests and limits, externalized config, graceful shutdown, and autoscaling.

TL;DR: Wire liveness/readiness/startup probes to Spring Boot Actuator’s health groups. Set memory requests and limits (and let the JVM size the heap from the limit); set CPU requests and be cautious with CPU limits. Externalize config via ConfigMaps and Secrets. Enable graceful shutdown so in-flight requests finish. Scale with an HPA on the right metric.

Probes: the three kinds and why each exists

Kubernetes uses probes to decide whether to restart a container and whether to route traffic to it. Conflating the two decisions is a common Spring-on-Kubernetes mistake.

Liveness: “is this process broken and stuck?” If it fails, Kubernetes restarts the container. Keep it cheap and dependency-free; if it checks a downstream database, a database blip restarts every replica and turns a small problem into an outage.
Readiness: “should this pod receive traffic right now?” If it fails, Kubernetes removes the pod from the Service endpoints but does not restart it. This is where you check that caches are warm and critical dependencies are reachable.
Startup: “has the app finished booting?” Liveness and readiness checks are held off until the startup probe succeeds, which matters for JVM apps with non-trivial startup time so a slow boot is not mistaken for a crash loop.

Spring Boot Actuator exposes these out of the box. Enable the probe health groups and the dedicated endpoints:

# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true        # /actuator/health/liveness and /readiness
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true

# deployment.yaml (excerpt)
startupProbe:
  httpGet: { path: /actuator/health/liveness, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5          # allows up to ~150s to boot
livenessProbe:
  httpGet: { path: /actuator/health/liveness, port: 8080 }
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /actuator/health/readiness, port: 8080 }
  periodSeconds: 10
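
To make readiness dependency-aware while keeping liveness cheap, the Actuator health groups can be composed explicitly. A minimal sketch, assuming the service has a JDBC DataSource so that the db health indicator exists; swap in whichever indicators match your actual critical dependencies:

```yaml
# application.yml (sketch; `db` assumes a configured DataSource)
management:
  endpoint:
    health:
      group:
        liveness:
          include: livenessState        # internal state only, no downstream calls
        readiness:
          include: readinessState,db    # also fail readiness when the database is unreachable
```

Be selective here: every indicator added to the readiness group can pull all replicas out of the Service at once if a shared dependency fails.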

Resources: requests, limits, and the OOM-kill trap

Requests are what the scheduler reserves for the pod; limits are the hard ceiling. For Java, memory is the dangerous one. Always set a memory limit, and make the JVM size its heap from that limit rather than guessing:

resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"        # exceed this and the kernel OOM-kills the container (exit 137)
# JVM side: max heap = 75% of the limit, leaving room for non-heap memory
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"

The classic failure: heap set to (or near) 100% of the memory limit, then thread stacks, metaspace, and direct buffers push total RSS over the limit and the pod is OOM-killed — appearing as a random crash. Leave headroom (≈25%).
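
Worked through for the 768Mi limit above, the numbers look like this. The NativeMemoryTracking flag is an optional addition, not part of the setup shown earlier, for measuring what the non-heap areas actually consume before trusting the 25% figure. It adds some overhead, so treat it as a diagnostic setting:

```yaml
# Illustrative arithmetic for the limit shown earlier:
#   memory limit                  768Mi
#   max heap (75% of the limit)   576Mi
#   left for everything else      192Mi  (metaspace, thread stacks, code cache, direct buffers)
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0 -XX:NativeMemoryTracking=summary"
# With tracking on, inspect the breakdown from inside the container
# (assumes the image ships a JDK with jcmd):
#   jcmd <pid> VM.native_memory summary
```

If the measured non-heap total approaches the 192Mi of headroom, lower the percentage or raise the limit.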

CPU limits are more nuanced. A CPU limit throttles the container when it hits the cap, which can spike latency badly for a bursty JVM. A widely used pattern is to set CPU requests (so scheduling and fair-share work) but omit CPU limits, while always keeping a memory limit. Whatever you choose, set requests — without them the scheduler is flying blind and nodes get overcommitted.

Externalized configuration: ConfigMaps and Secrets

Keep config out of the image (twelve-factor). Non-sensitive values go in a ConfigMap, secrets in a Secret. Ideally the Secret is backed by a real secret manager (Vault, AWS Secrets Manager, Azure Key Vault) and synced in via the Secrets Store CSI driver or External Secrets, so the source of truth is the secret manager and not a base64-encoded manifest. Inject them as environment variables that Spring picks up via relaxed binding:

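envFrom:
  - configMapRef: { name: orders-config }
env:
  - name: SPRING_DATASOURCE_PASSWORD
    valueFrom:
      secretKeyRef: { name: orders-secrets, key: db-password }
# Relaxed binding maps SPRING_DATASOURCE_PASSWORD -> spring.datasource.password.
# An arbitrary name such as DB_PASSWORD is not mapped automatically; reference it
# explicitly instead, e.g. spring.datasource.password: ${DB_PASSWORD}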
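
For reference, a ConfigMap that feeds envFrom might look like the sketch below. The keys and values are illustrative assumptions, not taken from a real service; what matters is that each key is the upper-case, underscore-separated form of the Spring property it should bind to:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: orders-config
data:
  # Relaxed binding: SPRING_DATASOURCE_URL -> spring.datasource.url
  SPRING_DATASOURCE_URL: "jdbc:postgresql://orders-db:5432/orders"   # hypothetical host and database
  SPRING_DATASOURCE_USERNAME: "orders"                               # hypothetical user
  # SERVER_TOMCAT_THREADS_MAX -> server.tomcat.threads.max
  SERVER_TOMCAT_THREADS_MAX: "100"
```

ConfigMap values are always strings, so numeric settings are quoted.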

Mounting config as files instead of env vars lets Spring Cloud Kubernetes hot-reload some properties on change, but for most services env injection is simpler and sufficient. Use Spring profiles (SPRING_PROFILES_ACTIVE=prod) so the same image behaves per environment.
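
If file-based config is preferred, one option that needs no extra library is Spring Boot's configtree import (available since Spring Boot 2.4), which reads each mounted file as one property. A sketch, assuming a hypothetical ConfigMap named orders-config-files whose keys are property names such as spring.datasource.url, and assuming /etc/config is a free mount path. Note that configtree on its own loads values at startup and does not hot-reload:

```yaml
# deployment.yaml (excerpt)
spec:
  template:
    spec:
      containers:
        - name: orders
          env:
            - name: SPRING_PROFILES_ACTIVE
              value: "prod"
            - name: SPRING_CONFIG_IMPORT          # binds to spring.config.import
              value: "optional:configtree:/etc/config/"
          volumeMounts:
            - name: config
              mountPath: /etc/config
              readOnly: true
      volumes:
        - name: config
          configMap:
            name: orders-config-files             # hypothetical; each key becomes a file
```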

Graceful shutdown: don’t drop in-flight requests

When Kubernetes terminates a pod (deploy, scale-down, node drain) it sends SIGTERM, waits up to the termination grace period, then sends SIGKILL. Without graceful shutdown the embedded server stops immediately on SIGTERM and in-flight requests fail. The fix has two halves. First, enable Spring Boot graceful shutdown (it is the default from Spring Boot 3.4 onward) so the server stops accepting new requests and lets active ones finish:

server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s

Second, and easy to miss, there is a race: Kubernetes removes the pod from Service endpoints and sends SIGTERM at roughly the same time, so a request can still be routed to a pod that has begun shutting down. The common remedy is a preStop hook that sleeps a few seconds, giving endpoint removal time to propagate before the app receives SIGTERM. Because the grace period clock also covers the preStop hook, set the grace period longer than the preStop sleep plus your shutdown timeout:

lifecycle:
  preStop:
    exec: { command: ["sh", "-c", "sleep 5"] }
terminationGracePeriodSeconds: 45
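
In a full Deployment these two settings live at different levels: the preStop hook belongs to the container, while terminationGracePeriodSeconds belongs to the pod spec. A sketch with the timing budget spelled out, assuming a container named orders and an image that includes a shell:

```yaml
# deployment.yaml (excerpt)
spec:
  template:
    spec:
      # Budget: 5s preStop sleep + 30s Spring shutdown phase = 35s, leaving ~10s of slack
      terminationGracePeriodSeconds: 45
      containers:
        - name: orders
          lifecycle:
            preStop:
              exec: { command: ["sh", "-c", "sleep 5"] }
```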

Autoscaling

The Horizontal Pod Autoscaler (HPA) adds and removes pods based on a metric. CPU utilization is the usual starting point, but for I/O-bound Java services, which spend their time waiting rather than burning CPU, it is often a poor signal. Custom or external metrics (requests per second, queue depth, Kafka consumer lag) via the Prometheus Adapter or KEDA usually track real load far better.

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: orders }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }   # percent of the CPU request
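
For an I/O-bound service, the same HPA can target a per-pod custom metric instead of CPU. A sketch, assuming a metrics adapter (for example the Prometheus Adapter) already exposes a per-pod metric through the custom metrics API; the metric name http_requests_per_second and the target of 50 are hypothetical placeholders:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: orders }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second    # hypothetical metric name
        target:
          type: AverageValue
          averageValue: "50"                # hypothetical: scale out above ~50 req/s per pod
```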

Two JVM-specific cautions: account for warmup (a freshly scaled pod is slow until the JIT warms up, so a readiness probe that passes only after a brief warmup helps) and don’t scale so aggressively that you thrash. Pair the HPA with a PodDisruptionBudget so voluntary disruptions (node upgrades) can’t take down too many replicas at once.
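
Both cautions map onto small pieces of configuration. A sketch, assuming the pods carry an app: orders label; the percentages and windows are illustrative starting points, not tuned values:

```yaml
# HPA excerpt: slow down scale-in to avoid thrashing
spec:
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300   # the Kubernetes default, shown explicitly
      policies:
        - type: Percent
          value: 50                     # remove at most half the pods per minute
          periodSeconds: 60
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders
spec:
  minAvailable: 2                       # at the minimum of 3 replicas, only one pod can be evicted at a time
  selector:
    matchLabels:
      app: orders                       # assumed pod label
```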

A few more production essentials

Run as non-root with a read-only root filesystem and a locked-down securityContext. This is basic container hardening.
One replica is not HA. Run at least two or three and spread them across nodes/zones with topology spread constraints.
Use workload identity (IRSA on EKS, managed identity on AKS) for cloud access instead of mounting cloud keys.
Layered jars keep image pulls fast and native images cut cold-start time (see our Spring Boot at scale guide).
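
The first two items translate into a pod spec roughly as follows. A sketch, assuming a container named orders, pods labeled app: orders, and an arbitrary non-root UID of 10001; the writable /tmp mount is there because embedded Tomcat creates a working directory under the JVM temp dir:

```yaml
# deployment.yaml (excerpt)
spec:
  template:
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001                       # arbitrary non-root UID (assumption)
        seccompProfile: { type: RuntimeDefault }
      topologySpreadConstraints:
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: ScheduleAnyway    # prefer spreading, but do not block scheduling
          labelSelector:
            matchLabels: { app: orders }       # assumed pod label
      containers:
        - name: orders
          securityContext:
            allowPrivilegeEscalation: false
            readOnlyRootFilesystem: true
            capabilities: { drop: ["ALL"] }
          volumeMounts:
            - { name: tmp, mountPath: /tmp }   # writable scratch space on a read-only root
      volumes:
        - name: tmp
          emptyDir: {}
```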

Takeaways

Running Spring Boot well on Kubernetes is a checklist of getting the platform contract right: probes mapped to the correct Actuator health groups (cheap liveness, dependency-aware readiness, generous startup), memory requests and limits with the heap sized from the limit, config externalized to ConfigMaps and Secrets, graceful shutdown plus a preStop delay to drain cleanly, and autoscaling on a metric that reflects real load. Nail those and your services restart only when they should, never serve traffic before they’re ready, and ride deploys and scale events without dropping a request.

Frequently asked questions

What is the difference between liveness and readiness probes?
A liveness probe tells Kubernetes whether to restart the container (it is broken and stuck); a readiness probe tells Kubernetes whether to send it traffic (it is up but maybe not ready, e.g. still warming caches or a dependency is down). Spring Boot Actuator exposes both at /actuator/health/liveness and /actuator/health/readiness.

Should I set CPU limits on Java pods in Kubernetes?
Always set requests. CPU limits are debated: a too-low limit causes CPU throttling that hurts latency, so many teams set CPU requests but no CPU limit (relying on requests for scheduling) while always setting a memory limit to prevent the node running out of memory.
