Putting a Spring Boot container into Kubernetes is easy. Running it well — so it restarts only when truly broken, never receives traffic before it’s ready, shuts down without dropping requests, sizes its memory so the kernel doesn’t kill it, and scales with load — is where most teams get it wrong. This deep dive covers the production patterns: probes wired to Actuator, resource requests and limits, externalized config, graceful shutdown, and autoscaling.
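Kubernetes uses probes to decide whether to restart a container (liveness) and whether to route traffic to it (readiness). A startup probe holds off both checks until the application has finished booting. Conflating liveness and readiness is the most common Spring-on-Kubernetes mistake.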
Spring Boot Actuator exposes these out of the box. Enable the probe health groups and the dedicated endpoints:
# application.yml
management:
  endpoint:
    health:
      probes:
        enabled: true   # /actuator/health/liveness and /readiness
  health:
    livenessstate:
      enabled: true
    readinessstate:
      enabled: true
# deployment.yaml (excerpt)
startupProbe:
  httpGet: { path: /actuator/health/liveness, port: 8080 }
  failureThreshold: 30
  periodSeconds: 5   # allows up to ~150s to boot
livenessProbe:
  httpGet: { path: /actuator/health/liveness, port: 8080 }
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /actuator/health/readiness, port: 8080 }
  periodSeconds: 10
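The readiness group can also be made dependency-aware while liveness stays cheap. The following is a minimal sketch, assuming the service has a configured JDBC DataSource so that Spring Boot's `db` health indicator exists; swap in whichever indicators your service actually depends on.

```yaml
# application.yml (sketch; `db` assumes a configured DataSource)
management:
  endpoint:
    health:
      group:
        liveness:
          include: livenessState        # keep liveness cheap: no external checks
        readiness:
          include: readinessState,db    # readiness also fails while the database is unreachable
```

Keeping external checks out of the liveness group matters: if a shared dependency goes down, pods should stop receiving traffic, not be restarted in a loop.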
Requests are what the scheduler reserves for the pod; limits are the hard ceiling. For Java, memory is the dangerous one. Always set a memory limit, and make the JVM size its heap from that limit rather than guessing:
resources:
  requests:
    memory: "512Mi"
    cpu: "250m"
  limits:
    memory: "768Mi"   # exceed this and the kernel OOM-kills the container (exit 137)
# JVM side: heap = 75% of the limit, leaving room for non-heap memory
env:
  - name: JAVA_TOOL_OPTIONS
    value: "-XX:MaxRAMPercentage=75.0"
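The classic failure: heap set to (or near) 100% of the memory limit, then thread stacks, metaspace, and direct buffers push total RSS over the limit and the pod is OOM-killed, which looks like a random crash. Leave headroom (≈25%): with the 768Mi limit above, MaxRAMPercentage=75.0 caps the heap at 576Mi and leaves 192Mi for everything else.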
CPU limits are more nuanced. A CPU limit throttles the container when it hits the cap, which can spike latency badly for a bursty JVM. A widely used pattern is to set CPU requests (so scheduling and fair-share work) but omit CPU limits, while always keeping a memory limit. Whatever you choose, set requests — without them the scheduler is flying blind and nodes get overcommitted.
Keep config out of the image (twelve-factor). Non-sensitive values go in a ConfigMap, secrets in a Secret (and ideally a real secret manager — Vault, AWS Secrets Manager, Azure Key Vault — synced in via the Secrets Store CSI driver or External Secrets, so secrets aren’t sitting base64-encoded in etcd). Inject them as environment variables that Spring picks up via relaxed binding:
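envFrom:
  - configMapRef: { name: orders-config }
env:
  # The variable name must match the property for relaxed binding to apply.
  - name: SPRING_DATASOURCE_PASSWORD
    valueFrom:
      secretKeyRef: { name: orders-secrets, key: db-password }
# Spring maps SPRING_DATASOURCE_PASSWORD -> spring.datasource.password automatically.
# An arbitrary name such as DB_PASSWORD would bind to db.password instead.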
Mounting config as files instead of env vars lets Spring Cloud Kubernetes hot-reload some properties on change, but for most services env injection is simpler and sufficient. Use Spring profiles (SPRING_PROFILES_ACTIVE=prod) so the same image behaves per environment.
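For comparison, file-based config can be sketched as below. This is an illustration under assumptions: the container name, volume name, and mount path are hypothetical, and it uses Spring Boot's own `configtree:` import (each mounted file name becomes a property key) rather than Spring Cloud Kubernetes. It does not reload properties by itself; reloading on change needs additional setup that is not shown here.

```yaml
# deployment.yaml (pod template excerpt; names and paths are hypothetical)
spec:
  containers:
    - name: orders
      env:
        - name: SPRING_PROFILES_ACTIVE
          value: "prod"                               # same image, per-environment behavior
        - name: SPRING_CONFIG_IMPORT
          value: "optional:configtree:/etc/config/"   # binds to spring.config.import
      volumeMounts:
        - name: config
          mountPath: /etc/config
          readOnly: true
  volumes:
    - name: config
      configMap:
        name: orders-config                           # each key becomes a file under /etc/config
```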
When Kubernetes terminates a pod (deploy, scale-down, node drain) it sends SIGTERM, waits the termination grace period, then SIGKILL. Naively, the app dies immediately and in-flight requests fail. The fix has two halves. First, enable Spring Boot graceful shutdown so it stops accepting new requests and lets active ones finish:
server:
  shutdown: graceful
spring:
  lifecycle:
    timeout-per-shutdown-phase: 30s
Second, and easy to miss, there is a race: Kubernetes removes the pod from Service endpoints and sends SIGTERM at roughly the same time, so a request can still be routed to a pod that has begun shutting down. The common remedy is a preStop hook that sleeps a few seconds, giving endpoint removal time to propagate before the app stops, and a grace period longer than the preStop sleep plus your shutdown timeout, because the grace period starts counting before the hook runs:
# container level
lifecycle:
  preStop:
    exec: { command: ["sh", "-c", "sleep 5"] }
# pod spec level: must exceed preStop sleep (5s) + shutdown timeout (30s)
terminationGracePeriodSeconds: 45
The Horizontal Pod Autoscaler (HPA) adds and removes pods based on a metric. CPU is the default, but for I/O-bound Java services — which spend their time waiting, not burning CPU — CPU is often a poor signal. Custom or external metrics (requests per second, queue depth, Kafka consumer lag) via the Prometheus Adapter or KEDA usually track real load far better.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: orders }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target: { type: Utilization, averageUtilization: 70 }
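Scaling on a per-pod custom metric uses the same `autoscaling/v2` schema with a `Pods` metric type. The sketch below assumes a metric named `http_requests_per_second` is already exposed through the custom metrics API (for example by the Prometheus Adapter); the metric name, the HPA name, and the target of 50 requests per second per pod are all hypothetical.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: orders-rps                          # hypothetical name
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: orders }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second    # assumed to exist in the custom metrics API
        target:
          type: AverageValue
          averageValue: "50"                # scale out when the per-pod average exceeds this
```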
Two JVM-specific cautions: account for warmup (a freshly scaled pod is slow until the JIT warms up, so a readiness probe that passes only once the app can serve traffic, plus a gradual ramp, helps) and don’t scale so aggressively that you thrash. Pair the HPA with a PodDisruptionBudget so voluntary disruptions (node upgrades) can’t take down too many replicas at once.
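Both cautions can be expressed in manifests. The sketch below shows an HPA `behavior` stanza that slows scale-down, followed by a PodDisruptionBudget. The specific numbers, the PDB name, and the `app: orders` label are assumptions to adjust for your workload.

```yaml
# HPA excerpt: goes under spec of the HorizontalPodAutoscaler
behavior:
  scaleDown:
    stabilizationWindowSeconds: 300   # require 5 minutes of lower load before removing pods
    policies:
      - type: Pods
        value: 1
        periodSeconds: 60             # remove at most one pod per minute
---
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: orders-pdb                    # hypothetical name
spec:
  minAvailable: 2                     # assumed; with 3 replicas, one can be evicted at a time
  selector:
    matchLabels:
      app: orders                     # assumed pod label
```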
securityContext: basic supply-chain hygiene.

Running Spring Boot well on Kubernetes is a checklist of getting the platform contract right: probes mapped to the correct Actuator health groups (cheap liveness, dependency-aware readiness, generous startup), memory requests and limits with the heap sized from the limit, config externalized to ConfigMaps and Secrets, graceful shutdown plus a preStop delay to drain cleanly, and autoscaling on a metric that reflects real load. Nail those and your services restart only when they should, never serve traffic before they’re ready, and ride deploys and scale events without dropping a request.
What is the difference between liveness and readiness probes?
A liveness probe tells Kubernetes whether to restart the container (it is broken and stuck); a readiness probe tells Kubernetes whether to send it traffic (it is up but maybe not ready, e.g. still warming caches or a dependency is down). Spring Boot Actuator exposes both at /actuator/health/liveness and /actuator/health/readiness.
Should I set CPU limits on Java pods in Kubernetes?
Always set requests. CPU limits are debated: a too-low limit causes CPU throttling that hurts latency, so many teams set CPU requests but no CPU limit (relying on requests for scheduling) while always setting a memory limit to prevent the node running out of memory.