Best Docker Practices for Production Apps

Bright SEO Tools in saas | Published: Apr 04, 2026 | Updated: Apr 04, 2026

Docker in production is fundamentally different from Docker in development. What works fine locally—large images, root users, default configurations—creates security vulnerabilities, performance problems, and operational headaches at scale. Production Docker requires deliberate choices about image construction, security hardening, resource management, and observability. The gap between casual containerization and production-ready containers is wider than most teams initially realize.

This guide covers essential production practices that prevent common failure modes: minimizing image size to reduce attack surface and deployment time, implementing proper security boundaries, handling secrets safely, configuring resource limits to prevent noisy neighbor problems, and structuring health checks for reliable orchestration. These aren't theoretical best practices—they're lessons learned from production incidents across thousands of containerized applications.

We'll examine image optimization, security hardening, configuration management, logging strategies, and deployment patterns that work at scale.

Building Minimal Production Images

Image size directly impacts deployment speed, storage costs, attack surface, and container startup time. A Node.js application built from the default node:18 image weighs 900+ MB. The same application on node:18-alpine drops to 180 MB. With multi-stage builds and careful dependency management, you can reach 50-80 MB for many applications.

Smaller images mean faster pulls across your cluster, lower storage costs in container registries, reduced network transfer during deployments, and fewer installed packages that could contain vulnerabilities. Every unnecessary megabyte is overhead you pay repeatedly across every deployment, every node, every region.

Using Alpine Linux as Base

Alpine Linux images are minimal by design—they use musl libc instead of glibc and a lightweight package manager. Most official images offer Alpine variants:

FROM node:18-alpine
FROM python:3.11-alpine
FROM golang:1.21-alpine

The tradeoff is compatibility. Some packages with native extensions fail to build or install on Alpine. Python is a common case: many prebuilt wheels target glibc, so on Alpine pip may fall back to compiling from source, which needs a compiler toolchain. If you encounter build failures, you have two options: install build dependencies in Alpine (apk add gcc musl-dev) or use Debian slim variants (python:3.11-slim), which are larger but more compatible.

Multi-Stage Builds for Production

Multi-stage builds separate build-time dependencies from runtime dependencies. You compile or bundle your application in one stage with all development tools, then copy only the compiled artifacts to a minimal final stage:

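FROM node:18-alpine AS builder

WORKDIR /app

# Install all dependencies; the build step typically needs devDependencies such as TypeScript
COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Drop devDependencies so only runtime packages are copied to the final stage
RUN npm prune --omit=dev

FROM node:18-alpine AS production

WORKDIR /app

COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist
COPY package*.json ./

# Run as the unprivileged user that ships with the official Node image
USER node

EXPOSE 3000

CMD ["node", "dist/server.js"]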

The builder stage installs dependencies and compiles TypeScript to JavaScript. The production stage copies only node_modules and compiled code—no source files, no dev dependencies, no build tools. The resulting image contains exactly what's needed to run the application and nothing more.

Distroless Images for Maximum Minimalism

Google's distroless images remove the entire OS layer except the runtime. They have no shell, no package manager, no utilities—just your language runtime and application:

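FROM node:18 AS builder

WORKDIR /app

# Install all dependencies; the build step typically needs devDependencies
COPY package*.json ./
RUN npm ci

COPY . .
RUN npm run build

# Remove devDependencies before node_modules is copied to the runtime stage
RUN npm prune --omit=dev

FROM gcr.io/distroless/nodejs18-debian11

WORKDIR /app

COPY --from=builder /app/node_modules ./node_modules
COPY --from=builder /app/dist ./dist

# The distroless Node.js image already uses node as its entrypoint
CMD ["dist/server.js"]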

Distroless images are incredibly small (Node.js distroless is ~70 MB) and secure (no shell means many attack vectors disappear). The downside is debugging difficulty—you can't docker exec into these containers to run commands because there's no shell. Use distroless when security is paramount and you have robust external debugging tools.

Pro Tip: For debugging distroless images during development, use debug variants like gcr.io/distroless/nodejs18-debian11:debug which include a busybox shell. Never use debug variants in production—they defeat the security benefits of distroless.

Security Hardening for Production Containers

Default Docker configurations prioritize ease of use over security. Production containers need explicit hardening to prevent privilege escalation, limit blast radius of compromises, and satisfy security compliance requirements.

Never Run as Root

Containers default to running as root (UID 0). If an attacker exploits a vulnerability in your application, they have root access inside the container. While container isolation limits what root can do, several container escape vulnerabilities have existed. Running as a non-root user dramatically reduces risk.

Create and use a non-privileged user in your Dockerfile:

FROM node:18-alpine

RUN addgroup -g 1001 appgroup && \
    adduser -D -u 1001 -G appgroup appuser

WORKDIR /app

COPY --chown=appuser:appgroup package*.json ./
RUN npm ci --omit=dev

COPY --chown=appuser:appgroup . .

USER appuser

EXPOSE 3000

CMD ["node", "server.js"]

The --chown flag in COPY instructions ensures files are owned by the application user, not root. The USER instruction switches to that user before CMD runs. Some images (like official Node images) include a node user you can use directly: USER node.
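As a second line of defense, the application can refuse to start when it finds itself running as root, which catches a missing USER instruction or a runtime override. The following is a minimal sketch, not part of the original setup: isRoot and assertNotRoot are hypothetical helper names, and process.getuid is only available on POSIX platforms.

```javascript
// Hypothetical startup guard: refuse to run when the process UID is 0 (root).
function isRoot(uid) {
  return uid === 0;
}

function assertNotRoot() {
  // process.getuid is undefined on Windows, so skip the check there.
  if (typeof process.getuid !== 'function') return;
  if (isRoot(process.getuid())) {
    throw new Error('Refusing to start as root; set USER in the Dockerfile');
  }
}
```

Calling assertNotRoot() as the first statement of the entry file makes the container exit immediately instead of serving traffic with root privileges.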

Read-Only Root Filesystem

Applications rarely need to write to their filesystem—they should write to volumes, external storage, or in-memory caches. Running with a read-only root filesystem prevents attackers from modifying binaries or dropping malicious files:

docker run --read-only --tmpfs /tmp myapp

In Kubernetes:

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1
        securityContext:
          readOnlyRootFilesystem: true
        volumeMounts:
        - name: tmp
          mountPath: /tmp
      volumes:
      - name: tmp
        emptyDir: {}

The /tmp mount provides writeable space for temporary files while the rest of the filesystem is read-only. Identify all paths your application writes to and mount them as emptyDir volumes.
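One way to find missing writable mounts early is to check the expected paths at startup. This is a sketch under assumptions: findUnwritable is a hypothetical helper, and the path list is a placeholder for whatever directories your application actually writes to.

```javascript
const fs = require('fs');
const os = require('os');

// Return the subset of `paths` that the current process cannot write to.
function findUnwritable(paths) {
  return paths.filter((p) => {
    try {
      fs.accessSync(p, fs.constants.W_OK);
      return false; // writable
    } catch {
      return true; // missing, read-only, or permission denied
    }
  });
}

// Hypothetical list: directories this application is assumed to write to.
const requiredWritable = [os.tmpdir()];
const blocked = findUnwritable(requiredWritable);
if (blocked.length > 0) {
  console.error(`Not writable (mount an emptyDir or tmpfs here): ${blocked.join(', ')}`);
}
```

Running this check with readOnlyRootFilesystem enabled in staging surfaces forgotten write paths as a clear log line rather than as a runtime error deep inside a request.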

Dropping Capabilities

Linux capabilities grant specific privileges to processes. Containers run with a default set of capabilities—more than most applications need. Drop all capabilities and add back only those required:

docker run --cap-drop=ALL --cap-add=NET_BIND_SERVICE myapp

In Kubernetes:

securityContext:
  capabilities:
    drop:
      - ALL
    add:
      - NET_BIND_SERVICE

The NET_BIND_SERVICE capability allows binding to ports below 1024. Most applications don't need any capabilities—run with --cap-drop=ALL and only add capabilities if the application fails.

Scanning Images for Vulnerabilities

Container images inherit vulnerabilities from base images and installed packages. Regular vulnerability scanning identifies CVEs before they reach production. Tools like Trivy, Grype, or Snyk integrate into CI pipelines:

docker run --rm -v /var/run/docker.sock:/var/run/docker.sock aquasec/trivy image myapp:v1.2.3

Trivy reports vulnerabilities with severity ratings. Block deployment if critical vulnerabilities exist:

trivy image --exit-code 1 --severity CRITICAL myapp:v1.2.3

This command exits with status 1 if critical vulnerabilities are found, failing your CI pipeline. Container registries like Google Artifact Registry, AWS ECR, and Docker Hub also provide automatic scanning.

Warning: Vulnerability scanning finds known CVEs, but a clean scan doesn't mean your image is secure. It won't detect misconfigurations, weak secrets, or application-level vulnerabilities. Use scanning as one layer of defense, not the only one.

Managing Secrets and Sensitive Configuration

Hardcoded secrets in images or environment variables expose credentials to anyone with access to the image or container runtime. Production applications need external secret management with rotation, auditing, and encryption at rest.

Never Bake Secrets into Images

Secrets in Dockerfile layers persist even if you delete them in later layers. The following pattern leaks the secret:

FROM node:18-alpine
COPY .env /app/.env
RUN some-command-using-secrets
RUN rm /app/.env

The .env file exists in the layer created by the COPY instruction. Anyone with access to the image can extract it. Never include secret files in COPY or ADD instructions. Use BuildKit secrets for build-time secrets:

FROM node:18-alpine

RUN --mount=type=secret,id=npmrc,target=/root/.npmrc \
    npm install private-package

Build with:

docker build --secret id=npmrc,src=$HOME/.npmrc .

The secret is mounted temporarily during the RUN instruction and doesn't persist in any layer.

Runtime Secret Injection

For runtime secrets, use Docker secrets (Swarm), Kubernetes secrets, or external secret managers like HashiCorp Vault, AWS Secrets Manager, or Google Secret Manager. Kubernetes secrets example:

apiVersion: v1
kind: Secret
metadata:
  name: app-secrets
type: Opaque
data:
  db-password: 
---
apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      containers:
      - name: app
        image: myapp:v1
        env:
        - name: DB_PASSWORD
          valueFrom:
            secretKeyRef:
              name: app-secrets
              key: db-password

Better yet, use the Secrets Store CSI driver to inject secrets from external vaults directly into pods as files, avoiding etcd storage entirely.

Secret Rotation

Secrets must rotate periodically. Design applications to reload configuration without restarts. For secrets stored as files, watch for file changes:

const fs = require('fs');

let dbPassword = fs.readFileSync('/run/secrets/db-password', 'utf8');

fs.watch('/run/secrets/db-password', () => {
  dbPassword = fs.readFileSync('/run/secrets/db-password', 'utf8');
  reconnectDatabase(dbPassword);
});

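This pattern relies on Kubernetes updating mounted secret files by atomically swapping a symlink to the new version, so the new content appears under the same path. Two caveats apply. Secrets injected as environment variables, as in the previous example, are not updated in a running container; only file mounts are. And fs.watch behaves differently across platforms and filesystems, so test the reload path in your own environment.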
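Because the update arrives as a symlink swap inside the volume directory, a watcher attached to the file path itself may stop receiving events after the first rotation. A more defensive variant watches the containing directory and compares content. This is a sketch; readSecret and watchSecret are hypothetical helper names.

```javascript
const fs = require('fs');
const path = require('path');

// Read a secret file and strip the trailing newline many tools append.
function readSecret(filePath) {
  return fs.readFileSync(filePath, 'utf8').trim();
}

// Watch the directory that contains the secret and call onChange(newValue)
// whenever the file's content differs from the last value seen.
function watchSecret(filePath, onChange) {
  let current = readSecret(filePath);
  const watcher = fs.watch(path.dirname(filePath), () => {
    let next;
    try {
      next = readSecret(filePath);
    } catch {
      return; // file briefly missing during the swap; wait for the next event
    }
    if (next !== current) {
      current = next;
      onChange(next);
    }
  });
  return watcher; // call watcher.close() during shutdown
}
```

Comparing the content before calling onChange avoids reconnecting on the duplicate events that a single rotation can generate.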

Resource Limits and Requests

Containers without resource limits can consume all available memory or CPU, affecting other workloads on the same host. In orchestrated environments like Kubernetes, resource limits determine scheduling and eviction behavior.

Setting Memory Limits

Memory limits prevent containers from using more than their allocated memory. Without limits, a memory leak or spike can exhaust the host's memory and trigger the kernel OOM killer, potentially affecting all containers on the node:

docker run -m 512m myapp

In Kubernetes:

resources:
  requests:
    memory: "256Mi"
  limits:
    memory: "512Mi"

Requests indicate how much memory the scheduler should guarantee. Limits define maximum usage—exceeding this kills the container. Set requests based on average usage and limits with headroom for spikes. Monitor actual usage in production and adjust accordingly.

CPU Limits and Throttling

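CPU limits work differently: they throttle rather than kill. A container limited to 1 CPU can use up to one core's worth of CPU time in each scheduling period; once that quota is spent, the kernel scheduler pauses it until the next period. The example below caps a container at 1.5 CPUs: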

docker run --cpus=1.5 myapp

In Kubernetes:

resources:
  requests:
    cpu: "500m"
  limits:
    cpu: "1000m"

CPU is measured in millicores—1000m equals one CPU core. Requests affect scheduling, limits affect throttling. Be cautious with CPU limits; aggressive throttling degrades performance unpredictably. Some teams use CPU requests without limits, allowing burst usage while maintaining baseline capacity.
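When comparing these manifest values with numbers from a metrics system, it helps to convert them to plain units. The sketch below is deliberately partial: cpuToCores and memoryToBytes are hypothetical helpers that handle millicore CPU values and binary memory suffixes only, not every quantity format Kubernetes accepts.

```javascript
// Convert a Kubernetes CPU quantity ("500m", "1.5", "2") to cores.
function cpuToCores(quantity) {
  const q = String(quantity).trim();
  if (q.endsWith('m')) return Number(q.slice(0, -1)) / 1000;
  return Number(q);
}

// Convert a memory quantity with a binary suffix ("256Mi", "1Gi") to bytes.
function memoryToBytes(quantity) {
  const match = /^(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?$/.exec(String(quantity).trim());
  if (!match) throw new Error(`Unsupported quantity: ${quantity}`);
  const factors = { Ki: 2 ** 10, Mi: 2 ** 20, Gi: 2 ** 30, Ti: 2 ** 40 };
  return Number(match[1]) * (match[2] ? factors[match[2]] : 1);
}

console.log(cpuToCores('500m'));     // 0.5
console.log(cpuToCores('1000m'));    // 1
console.log(memoryToBytes('256Mi')); // 268435456
console.log(memoryToBytes('512Mi')); // 536870912
```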

Key Insight: Exceeding a memory limit kills the container. Exceeding a CPU limit does not kill it, but throttles it, which shows up as latency rather than restarts. Rightsizing both requires production metrics. Start conservative, monitor actual usage, and tune based on data.

Health Checks and Readiness Probes

Orchestrators need to know when containers are healthy and ready to serve traffic. Without proper health checks, orchestrators route traffic to containers that are starting up, degraded, or completely broken.

Implementing Health Check Endpoints

Add a health check endpoint to your application that verifies critical dependencies:

app.get('/health', async (req, res) => {
  try {
    await database.ping();
    await redis.ping();
    res.status(200).json({ status: 'healthy' });
  } catch (error) {
    res.status(503).json({ status: 'unhealthy', error: error.message });
  }
});

A good health check validates actual application functionality, not just whether the process is running. It should be fast (sub-100ms) and lightweight to avoid impacting production traffic.

Docker Health Checks

Define health checks in your Dockerfile:

HEALTHCHECK --interval=30s --timeout=3s --start-period=40s --retries=3 \
  CMD node healthcheck.js

Docker runs the health check command every 30 seconds. If it fails three consecutive times, the container is marked unhealthy. The start period gives the application time to initialize before health checks affect status.
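The HEALTHCHECK instruction above runs a healthcheck.js file whose contents are not shown. One possible shape is sketched below. It assumes the image sets a HEALTHCHECK_URL environment variable (a name invented here, for example http://127.0.0.1:3000/health), and it uses a 2-second request timeout so the script finishes before the 3-second --timeout expires.

```javascript
// healthcheck.js (hypothetical contents; the original file is not shown)
const http = require('http');

// Exit code convention used by HEALTHCHECK: 0 = healthy, 1 = unhealthy.
function exitCodeFor(statusCode) {
  return statusCode >= 200 && statusCode < 300 ? 0 : 1;
}

// Request the endpoint and resolve with the exit code; never rejects.
function probe(url, timeoutMs) {
  return new Promise((resolve) => {
    const req = http.get(url, { timeout: timeoutMs }, (res) => {
      res.resume(); // drain the body so the socket can close
      resolve(exitCodeFor(res.statusCode));
    });
    // A socket timeout only emits an event; destroy the request explicitly,
    // which triggers the 'error' handler below.
    req.on('timeout', () => req.destroy(new Error('health check timed out')));
    req.on('error', () => resolve(1));
  });
}

// Assumption: the image sets HEALTHCHECK_URL; without it the script does nothing.
if (process.env.HEALTHCHECK_URL) {
  probe(process.env.HEALTHCHECK_URL, 2000).then((code) => process.exit(code));
}
```

Using Node for the probe avoids installing curl or wget in the image, which matters on minimal base images.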

Kubernetes Liveness and Readiness Probes

Kubernetes separates liveness (is the container alive?) from readiness (is it ready to serve traffic?):

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 60
  periodSeconds: 10
  timeoutSeconds: 3
  failureThreshold: 3

readinessProbe:
  httpGet:
    path: /ready
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 2

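Liveness probe failures restart the container. Readiness probe failures remove it from service endpoints until it recovers. The /ready endpoint should return 503 during startup or when the application can't serve traffic (database connection lost, warming caches). The /health endpoint used for liveness should indicate whether a restart would help, so fail it only when the container is in an unrecoverable state. A handler that pings the database and Redis, like the /health example earlier, fits readiness better than liveness, because restarting the container does not repair an external dependency.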

Logging and Observability

Production containers should emit structured logs to stdout/stderr for collection by log aggregators. File-based logging inside containers is fragile: logs written to the container filesystem are lost when the container is removed, and reading them requires exec access to the container.

Structured Logging to Stdout

Log JSON to stdout for parsing by log aggregation systems:

const logger = require('pino')({
  level: process.env.LOG_LEVEL || 'info',
  formatters: {
    level: (label) => {
      return { level: label };
    }
  },
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`
});

logger.info({ userId: req.user.id, action: 'login' }, 'User logged in');

JSON logs enable querying by structured fields in Elasticsearch, Loki, or CloudWatch Logs. Include correlation IDs to trace requests across services:

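const { v4: uuidv4 } = require('uuid');

app.use((req, res, next) => {
  // Reuse the caller's correlation ID when present, otherwise start a new one
  req.correlationId = req.headers['x-correlation-id'] || uuidv4();
  req.log = logger.child({ correlationId: req.correlationId });
  next();
});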

Application Metrics

Expose Prometheus metrics for monitoring:

const promClient = require('prom-client');

const httpRequestDuration = new promClient.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code']
});

app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpRequestDuration.labels(req.method, req.route?.path || 'unknown', res.statusCode).observe(duration);
  });
  next();
});

app.get('/metrics', async (req, res) => {
  res.set('Content-Type', promClient.register.contentType);
  res.end(await promClient.register.metrics());
});

Prometheus scrapes the /metrics endpoint periodically. Expose custom metrics for business logic—queue lengths, background job durations, cache hit rates—to gain visibility into application behavior beyond infrastructure metrics.

Distributed Tracing

For microservices, distributed tracing connects requests across services. Integrate OpenTelemetry:

const { NodeTracerProvider } = require('@opentelemetry/sdk-trace-node');
const { SimpleSpanProcessor } = require('@opentelemetry/sdk-trace-base');
const { JaegerExporter } = require('@opentelemetry/exporter-jaeger');

const provider = new NodeTracerProvider();
const exporter = new JaegerExporter({
  endpoint: process.env.JAEGER_ENDPOINT
});

provider.addSpanProcessor(new SimpleSpanProcessor(exporter));
provider.register();

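This registers a tracer provider that exports spans to Jaeger or your tracing backend. Spans for HTTP requests, database queries, and other operations are created automatically only once the matching instrumentation packages are registered (for example through @opentelemetry/auto-instrumentations-node); otherwise you create spans manually. SimpleSpanProcessor exports each span as it ends, which is convenient for debugging, while BatchSpanProcessor is the usual choice for production traffic.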

Optimizing Container Startup Time

Slow container startup delays deployments, extends downtime during rolling updates, and slows autoscaling response. Target sub-10-second startup for most applications.

Minimizing Initialization Work

Move expensive initialization out of the startup path. Lazy-load dependencies, defer non-critical setup, and avoid synchronous network calls during startup:

// Bad: blocks startup
const db = await connectDatabase();
const redis = await connectRedis();
const cache = await warmCache();

app.listen(3000);

// Better: connect asynchronously, fail health checks until ready
let db, redis;

(async () => {
  db = await connectDatabase();
  redis = await connectRedis();
  // Don't block on cache warming
  warmCache().catch(err => logger.warn({ err }, 'Cache warming failed'));
})().catch(err => {
  // Exit so the orchestrator restarts the container instead of leaving it never ready
  logger.error({ err }, 'Dependency initialization failed');
  process.exit(1);
});

app.get('/ready', (req, res) => {
  if (db && redis) {
    res.status(200).send('ready');
  } else {
    res.status(503).send('not ready');
  }
});

app.listen(3000);

This starts the HTTP server immediately while connections establish in the background. Readiness probes prevent traffic until dependencies are ready, but the container starts fast.

Image Layer Caching Strategy

Structure Dockerfiles to maximize layer reuse across builds. Frequently changing layers should appear late:

FROM node:18-alpine

WORKDIR /app

# Dependencies change rarely
COPY package*.json ./
RUN npm ci --omit=dev

# Code changes frequently
COPY . .

CMD ["node", "server.js"]

When you update application code but not dependencies, Docker reuses cached dependency layers, only rebuilding the COPY and subsequent layers. This reduces build time and speeds up image pulls since most layers are cached.

Handling Graceful Shutdown

Containers receive SIGTERM when stopping. Applications must handle this signal to shut down gracefully—finishing in-flight requests, closing database connections, and flushing logs before exiting.

Implementing Graceful Shutdown

const server = app.listen(3000);

process.on('SIGTERM', () => {
  logger.info('SIGTERM received, shutting down gracefully');

  server.close(() => {
    logger.info('HTTP server closed');

    database.close().then(() => {
      logger.info('Database connection closed');
      process.exit(0);
    }).catch(err => {
      logger.error({ err }, 'Error during shutdown');
      process.exit(1);
    });
  });

  // Force shutdown after 30 seconds
  setTimeout(() => {
    logger.error('Forced shutdown after timeout');
    process.exit(1);
  }, 30000);
});

The server.close() call stops accepting new connections but allows existing connections to complete. Database connections close cleanly. The timeout ensures shutdown completes even if something hangs.

In Kubernetes, set terminationGracePeriodSeconds to give your application time to shut down before SIGKILL:

spec:
  terminationGracePeriodSeconds: 60
  containers:
  - name: app
    image: myapp:v1

Warning: If your application doesn't handle SIGTERM, the runtime waits for the grace period (10 seconds by default for docker stop, 30 seconds by default for Kubernetes) and then sends SIGKILL, which forcibly kills the process. This can corrupt data, drop connections, or lose in-flight work. Always implement graceful shutdown.

Production Deployment Patterns

Immutable Infrastructure

Never update running containers. Always deploy new containers with updated images. This ensures consistent, reproducible deployments and simplifies rollback—just deploy the previous image version:

docker stop myapp-v1
docker run -d --name myapp-v2 myapp:v1.2.0

In Kubernetes, updates happen automatically through Deployments. Change the image tag, apply the manifest, and Kubernetes performs a rolling update.

Rolling Updates with Zero Downtime

Rolling updates replace containers gradually, maintaining availability. Kubernetes Deployments support this natively:

spec:
  replicas: 6
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 2

This maintains at least 5 replicas available (6 - maxUnavailable) while creating up to 8 during rollout (6 + maxSurge). Kubernetes creates new pods with the updated image, waits for readiness probes to pass, then terminates old pods. Tune these parameters based on your tolerance for reduced capacity versus faster rollouts.
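The same arithmetic can be checked for other settings, including percentage values, where Kubernetes rounds maxSurge up and maxUnavailable down. The helper names below are hypothetical; this is a sketch of the calculation, not of how Kubernetes implements it.

```javascript
// Resolve an absolute count or a percentage string against the replica count.
// Kubernetes rounds maxSurge percentages up and maxUnavailable percentages down.
function resolveCount(value, replicas, roundUp) {
  if (typeof value === 'string' && value.endsWith('%')) {
    const raw = (parseFloat(value) / 100) * replicas;
    return roundUp ? Math.ceil(raw) : Math.floor(raw);
  }
  return Number(value);
}

// Compute the availability floor and pod-count ceiling during a rollout.
function rolloutBounds(replicas, maxUnavailable, maxSurge) {
  return {
    minAvailable: replicas - resolveCount(maxUnavailable, replicas, false),
    maxTotal: replicas + resolveCount(maxSurge, replicas, true),
  };
}

console.log(rolloutBounds(6, 1, 2));          // { minAvailable: 5, maxTotal: 8 }
console.log(rolloutBounds(10, '25%', '25%')); // { minAvailable: 8, maxTotal: 13 }
```

The maxTotal value is also the capacity the cluster needs to have free during a rollout, which is worth checking against node headroom before raising maxSurge.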

Blue-Green Deployments

Blue-green deployments run two complete environments and switch traffic atomically. This enables instant rollback and eliminates rolling update risks:

# Deploy new version to "green" environment
kubectl apply -f deployment-green.yaml

# Test green deployment
kubectl port-forward deployment/myapp-green 8080:3000

# Switch service to green
kubectl patch service myapp -p '{"spec":{"selector":{"version":"green"}}}'

# Monitor, rollback if needed by switching back to blue
kubectl patch service myapp -p '{"spec":{"selector":{"version":"blue"}}}'

Blue-green requires double the resources during cutover but provides the safest deployment path for critical systems.

Registry Management and Image Versioning

Semantic Versioning for Images

Tag images with semantic versions, not latest. latest is ambiguous—you can't determine what version is running without inspecting the image:

docker build -t myapp:v1.2.3 .
docker tag myapp:v1.2.3 myapp:v1.2
docker tag myapp:v1.2.3 myapp:v1
docker push myapp:v1.2.3
docker push myapp:v1.2
docker push myapp:v1

This provides multiple specificity levels. Production manifests reference specific versions (myapp:v1.2.3). Less critical environments might use minor versions (myapp:v1.2) to automatically pick up patch releases.
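In a release script, the three tags can be derived from a single version string instead of being typed by hand. A minimal sketch follows; imageTags is a hypothetical helper, and it accepts only plain MAJOR.MINOR.PATCH versions.

```javascript
// Build the tag set for one release: full version, minor line, major line.
function imageTags(repository, version) {
  const match = /^v?(\d+)\.(\d+)\.(\d+)$/.exec(version);
  if (!match) throw new Error(`Not a MAJOR.MINOR.PATCH version: ${version}`);
  const [, major, minor, patch] = match;
  return [
    `${repository}:v${major}.${minor}.${patch}`,
    `${repository}:v${major}.${minor}`,
    `${repository}:v${major}`,
  ];
}

console.log(imageTags('myapp', '1.2.3'));
// [ 'myapp:v1.2.3', 'myapp:v1.2', 'myapp:v1' ]
```

Rejecting anything that is not a full version keeps floating names such as latest out of the release path.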

Private Registry Configuration

Production images should be in private registries. Configure Docker to authenticate:

docker login registry.example.com
docker build -t registry.example.com/myapp:v1.2.3 .
docker push registry.example.com/myapp:v1.2.3

In Kubernetes, store registry credentials as secrets:

kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com \
  --docker-username=user \
  --docker-password=pass

apiVersion: apps/v1
kind: Deployment
spec:
  template:
    spec:
      imagePullSecrets:
      - name: regcred
      containers:
      - name: app
        image: registry.example.com/myapp:v1.2.3

Monitoring Container Resource Usage

Runtime monitoring reveals whether resource limits are appropriate and identifies performance issues:

docker stats

This shows CPU, memory, network, and disk I/O for all running containers in real-time. For historical data, use Prometheus with cAdvisor or the Kubernetes metrics-server:

kubectl top pods
kubectl top nodes

Set up alerts for containers approaching resource limits, high restart rates, or excessive CPU throttling. These signals indicate rightsizing needs or application issues.

Frequently Asked Questions

Should I use Alpine or Debian-based images for production?

Alpine provides smaller images and reduced attack surface, making it preferable for production. However, compatibility issues with native dependencies can occur, especially for Python packages with C extensions. If you encounter build failures or runtime crashes on Alpine that don't occur on Debian, use Debian slim variants instead. The 50-100 MB size difference is negligible compared to deployment reliability. Test both in staging before committing to production.

How do I handle database migrations in containerized deployments?

Run migrations as init containers in Kubernetes or separate one-off tasks before deploying application containers. Never run migrations in the application container's startup script—if multiple replicas start simultaneously, they'll execute migrations concurrently, causing failures. Use job resources or init containers to ensure migrations complete before application pods start.

What's the right CPU/memory limit for my application?

Start by monitoring actual usage in production without limits. Set requests at the 50th percentile of usage and limits at the 95th percentile, adding 20-30% headroom. Iterate based on throttling metrics and OOM kills. There's no universal answer—limits depend on your application's behavior under load. Under-limiting causes performance degradation; over-limiting wastes resources.
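The percentile rule can be sketched as a small calculation over usage samples. Everything here is illustrative: the sample values are made up, the helper names are hypothetical, the percentile uses the nearest-rank method, and the 25% headroom is simply the midpoint of the 20-30% range.

```javascript
// Nearest-rank percentile of a list of samples (p in 0..100).
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

// Suggest a request (p50) and a limit (p95 plus headroom) from usage samples.
function suggestResources(samples, headroom = 0.25) {
  return {
    request: percentile(samples, 50),
    limit: Math.ceil(percentile(samples, 95) * (1 + headroom)),
  };
}

// Hypothetical memory samples in MiB, e.g. exported from a metrics system.
const memoryMiB = [180, 190, 200, 210, 220, 230, 240, 260, 300, 400];
console.log(suggestResources(memoryMiB)); // { request: 220, limit: 500 }
```

With only ten samples the 95th percentile is just the maximum, so in practice the input should cover at least several days of traffic, including peak periods.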

Should health checks be enabled during deployment?

Yes, but configure an appropriate initialDelaySeconds or a startupProbe to account for application startup time. If liveness probes start failing before your application has finished starting, Kubernetes restarts the container repeatedly and it never becomes ready. Startup probes (Kubernetes 1.16+) handle slow-starting applications better than a high initialDelaySeconds on liveness probes.

How do I debug production containers that are crashing?

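Start with the logs of the previous container instance: kubectl logs pod-name --previous. Then run kubectl describe pod pod-name to see the last exit code and reason (for example OOMKilled). CrashLoopBackOff is the delay Kubernetes applies between restarts, not a setting that keeps a container alive, so to inspect a crashing container interactively, temporarily override its command with something long-running or attach an ephemeral debug container with kubectl debug. Configure log aggregation to persist logs beyond the container lifecycle. For distroless images, add a debug variant build target that includes busybox for debugging.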

What's the best way to handle configuration that differs between environments?

Use environment variables injected at deployment time, not baked into images. Store environment-specific configuration in ConfigMaps or external configuration services. The same image should run in dev, staging, and production with only environment variables changing. This ensures testing matches production and simplifies deployments.

Should I run multiple processes in a single container?

Generally no. Containers should run one process to simplify restart behavior, resource accounting, and logging. If you need multiple processes, use sidecar containers in Kubernetes pods. The exception is process supervisors like s6-overlay for running essential supporting processes (like nginx + application server), but this adds complexity.

How often should I rebuild base images?

Rebuild regularly to incorporate security patches from base images. Automate weekly rebuilds in CI and scan for vulnerabilities. Even if your code hasn't changed, base images receive updates that address CVEs. Use Dependabot or Renovate to get notifications when base images update.

What's the performance impact of running containers vs bare metal?

Container overhead is minimal—typically 1-5% for CPU and memory. Networking can have slightly higher overhead depending on CNI implementation. Storage performance depends on volume drivers. For most applications, the performance difference is negligible compared to the operational benefits. Measure in your specific environment if performance is critical.

How do I ensure consistent behavior between development and production containers?

Use the same base images and multi-stage build targets. Development should use the production target with development-specific environment variables, not a completely different Dockerfile. Pin dependency versions to avoid drift. Regularly test production image builds locally. The closer dev and prod environments are, the fewer surprises you'll encounter in production.

Conclusion

Production-ready Docker images require deliberate optimization, security hardening, and operational considerations that go well beyond basic containerization. Minimal images reduce attack surface and deployment time. Non-root users and read-only filesystems limit damage from compromises. Proper resource limits prevent noisy neighbor problems. Health checks enable reliable orchestration. Structured logging and metrics provide observability.

These practices aren't optional nice-to-haves—they're essential for running containers reliably at scale. The difference between development Docker and production Docker is the difference between proof-of-concept and battle-tested systems. Start with security and observability fundamentals, measure actual resource usage to inform limits, and iterate based on production behavior.

Production Docker maturity is a journey, not a destination. Begin with the highest-impact practices—non-root users, vulnerability scanning, resource limits, health checks—and progressively adopt additional practices as your operational sophistication grows.

