Top DevOps Interview Questions and Answers

Bright SEO Tools in saas Published: Apr 04, 2026 | Updated: Apr 04, 2026

DevOps interviews test whether you can bridge development and operations in practice, not just theory. The gap between reading about CI/CD pipelines and debugging why your deployment failed at 2 AM is where most candidates struggle. Interviewers probe for hands-on experience with infrastructure automation, incident response under pressure, and the judgment to know when to automate versus when to ship manually.

This guide covers 50+ questions spanning Docker containerization, Kubernetes orchestration, CI/CD pipeline design, infrastructure as code, monitoring strategies, and incident management. Each answer explains not just the "what" but the "why it matters in production" context that distinguishes practitioners from tutorial followers. Questions are organized by technical domain with difficulty indicators to help you prioritize preparation based on the role level.

The structure progresses from foundational concepts through tool-specific implementation questions to scenario-based problem solving that mirrors real interview formats.

Container and Docker Questions

What is the difference between a Docker image and a container?

A Docker image is a read-only template containing application code, dependencies, and filesystem snapshots. A container is a running instance of that image with an isolated process namespace, network stack, and writable layer on top of the image layers. The distinction matters because images are immutable and shareable while containers are ephemeral with state that disappears when stopped unless explicitly persisted to volumes.

In practice, this means you build images once in CI and run them as containers across environments. The same image that passed QA tests runs in production unchanged. Containers add a writable layer for temporary data, but application logs or user uploads stored there vanish on container restart unless mounted to volumes or external storage.

How do you reduce Docker image size?

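Start with slim base images such as Alpine Linux instead of full Ubuntu or Debian distributions. An Alpine-based Node.js image is roughly 40MB versus roughly 350MB for the Ubuntu equivalent, though Alpine uses musl libc rather than glibc, so test native dependencies before switching. Use multi-stage builds to compile dependencies in one stage and copy only production artifacts to the final stage, leaving build tools behind. Remove package manager caches in the same RUN instruction that installs packages: rm -rf /var/lib/apt/lists/* after apt-get install, or apk add --no-cache on Alpine.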

Combine RUN commands to reduce layer count. Each RUN instruction creates a new layer, so RUN apt-get update && apt-get install -y package && apt-get clean is one layer while three separate RUN commands create three layers, even if the last one deletes files. Order Dockerfile instructions from least to most frequently changing to maximize layer cache hits during rebuilds.

Pro Tip: Use .dockerignore files to exclude node_modules, .git directories, and test files from the build context. A 500MB node_modules folder sent to Docker daemon on every build slows CI by 30-60 seconds even if those files aren't copied into the image.

Explain Docker networking modes

Bridge mode (default) connects containers to a private internal network with NAT to the host. Containers communicate via internal IPs and require port mapping (-p) to expose services externally. Host mode shares the host's network namespace directly, giving containers direct access to host ports but removing network isolation. None mode disables networking entirely for security-sensitive workloads.

Overlay networks enable multi-host communication in Docker Swarm by encapsulating container traffic in VXLAN tunnels between nodes. Kubernetes achieves the same result through CNI plugins, several of which also use VXLAN overlays. Macvlan mode assigns containers MAC addresses on the physical network, making them appear as physical devices to the network infrastructure. Most deployments use bridge mode for single-host development and overlay networks for orchestrated clusters.

What are Docker volumes versus bind mounts?

Volumes are managed by Docker (under /var/lib/docker/volumes on Linux) with a lifecycle independent of containers. They persist data across container restarts, enable sharing between containers, and work on all platforms including Windows. Bind mounts map a specific host path into the container, giving direct access to the host filesystem but tying the container to that host's directory layout and path syntax.

Use volumes for production databases and application state because Docker handles permissions and cross-platform compatibility. Use bind mounts for development when you need live code reloading, mounting your local source directory into the container so changes appear immediately without rebuilds. Bind mounts fail in orchestrated environments where pods might schedule on any node without that host path.

Kubernetes Architecture and Concepts

Explain the Kubernetes control plane components

The API server is the central management hub that processes REST requests, validates them, and updates etcd. All components communicate through the API server, never directly. Etcd is the distributed key-value store holding cluster state including configurations, secrets, and service registrations. The controller manager runs control loops that watch cluster state and make changes to match desired state, like restarting failed pods.

The scheduler assigns pods to nodes based on resource requirements, affinity rules, and node conditions. It runs a filtering phase to eliminate impossible placements followed by scoring to rank viable nodes. The kubelet is a node component rather than part of the control plane, but it usually belongs in this answer: it runs on every node, managing pod lifecycle and reporting node status. It watches the API server for pods assigned to its node and tells the container runtime (containerd or CRI-O; built-in Docker Engine support through dockershim was removed in Kubernetes 1.24) to start containers.

Warning: An etcd outage freezes cluster changes because the API server cannot read or persist state. Pods that are already running keep running, but nothing can be scheduled, scaled, or updated. Run etcd with 3 or 5 members for quorum tolerance. Two-member etcd provides no failure tolerance because losing one member loses quorum.
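The quorum arithmetic behind that warning can be checked with a few lines of Python. This is a minimal sketch of the majority rule only; the function names are illustrative and not part of any etcd client.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failure_tolerance(members: int) -> int:
    """How many members can be lost while a majority is still reachable."""
    return members - quorum(members)


for size in (1, 2, 3, 4, 5):
    print(f"{size} members: quorum={quorum(size)}, "
          f"tolerates {failure_tolerance(size)} failure(s)")
```

An even member count adds no tolerance over the odd count below it (4 members tolerate one failure, the same as 3), which is why 3 and 5 are the usual recommendations.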

What is the difference between Deployment, StatefulSet, and DaemonSet?

Deployments manage stateless replicated applications like web servers where pod identity doesn't matter. They support rolling updates, rollbacks, and scaling by adding identical pods. Pods get random names and can be replaced arbitrarily. StatefulSets provide stable network identities and persistent storage for stateful applications like databases where pod identity matters. Pods get predictable names (app-0, app-1) and persistent volume claims that follow the pod across rescheduling.

DaemonSets ensure one pod runs on each eligible node (every node by default, or a subset chosen with node selectors and tolerations), automatically adding pods to new nodes and cleaning them up when nodes leave the cluster. Use them for node-level services like log collectors, monitoring agents, or CNI plugins. A DaemonSet for Fluentd ensures every node ships logs to your logging backend without manual deployment to new nodes.

How does Kubernetes service discovery work?

Services create stable DNS names and virtual IPs for dynamic pod sets. When you create a service named "api" in the default namespace, Kubernetes DNS (CoreDNS) adds an A record at api.default.svc.cluster.local pointing to the service's ClusterIP. The kube-proxy running on each node watches service definitions and programs iptables rules (or IPVS in ipvs mode) to load balance traffic to backing pod IPs.

ClusterIP services are only reachable within the cluster. NodePort services expose a static port on every node's external IP, forwarding traffic to the ClusterIP. LoadBalancer services integrate with cloud provider APIs to provision external load balancers that forward to NodePorts. ExternalName services return CNAME records for external DNS names, enabling service abstraction for external databases.

Explain Kubernetes resource requests and limits

Requests define guaranteed resources the scheduler uses for placement decisions. A pod requesting 1 CPU and 2GB RAM only schedules on nodes with those resources available. Limits define maximum resources a container can consume before being throttled (CPU) or killed (memory). Requests affect scheduling, limits affect runtime behavior.

Setting requests without limits lets pods burst above their guarantee when node capacity allows. Setting limits without requests causes Kubernetes to default each request to the limit, which reserves the full limit on the node and can waste capacity. A container with a 100m CPU request and a 1 CPU limit can be scheduled onto a node with 100m unreserved and can use up to 1 CPU when other pods are idle. Exceeding memory limits triggers OOMKill, but exceeding CPU limits only throttles the process.

| Resource Type | Request Behavior | Limit Behavior | Exceeded Consequence |
| --- | --- | --- | --- |
| CPU | Guaranteed millicores | Maximum burst CPU | Throttled, not killed |
| Memory | Guaranteed RAM | Maximum RAM | OOMKilled immediately |
| Ephemeral Storage | Guaranteed disk | Maximum disk | Pod evicted |
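To make the split between scheduling and runtime concrete, here is a small Python sketch. It is a simplified model, not the real scheduler: the function names are hypothetical, only CPU is considered, and quantities are limited to the "100m" and "1" style used above.

```python
def parse_cpu_millis(quantity: str) -> int:
    """Convert a CPU quantity such as "100m", "1" or "0.5" to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def fits_on_node(request: str, allocatable: str, already_requested: str) -> bool:
    """Scheduling view: only requests are compared, never limits or live usage."""
    unreserved = parse_cpu_millis(allocatable) - parse_cpu_millis(already_requested)
    return parse_cpu_millis(request) <= unreserved


def burst_ceiling_millis(limit: str, idle_on_node: str) -> int:
    """Runtime view: a container can use idle CPU, but never more than its limit."""
    return min(parse_cpu_millis(limit), parse_cpu_millis(idle_on_node))


# A 100m request fits on a 4-CPU node that already has 3900m reserved.
print(fits_on_node("100m", allocatable="4", already_requested="3900m"))
# With 2.5 CPUs idle, a 1-CPU limit still caps the container at 1000m.
print(burst_ceiling_millis("1", idle_on_node="2500m"))
```

Working in integer millicores avoids floating-point rounding when comparing values such as 0.1 CPU.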

CI/CD Pipeline Design

What makes a good CI/CD pipeline?

A good pipeline gives fast feedback: developers see test results in under 10 minutes, not hours. Slow pipelines encourage batching changes, which makes failures harder to debug. It produces deterministic builds, meaning identical artifacts from the same source commit regardless of when or where they run. Non-deterministic builds and flaky tests destroy trust in the pipeline.

It integrates security scanning into the pipeline rather than bolting it on afterward: scan dependencies for CVEs, lint Dockerfiles for security misconfigurations, and run SAST tools before deployment. It also provides rollback capability that deploys the previous known-good version in one click without rebuilding. Store version tags or git commit SHAs with each deployment so rollbacks redeploy that exact artifact.

How do you handle secrets in CI/CD pipelines?

Never commit secrets to version control, even in private repositories. Every clone keeps the full history, so a leaked credential stays recoverable unless you rewrite history everywhere it was pushed. Treat any committed secret as compromised and rotate it. Use the CI platform's secret management feature (GitHub Actions secrets, GitLab CI/CD variables, Jenkins credentials) to inject secrets as environment variables at runtime. These are encrypted at rest and masked in logs, although masking generally matches exact values only, so avoid printing encoded or transformed secrets.
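On the application side, reading an injected secret can be as small as the sketch below. The helper and exception names are hypothetical, and the variable name in the usage note is only an example.

```python
import os


class MissingSecretError(RuntimeError):
    """Raised when an expected secret was not injected into the environment."""


def load_secret(name: str) -> str:
    """Read a secret from the environment and fail fast if it is absent."""
    value = os.environ.get(name)
    if not value:
        # Report the variable name only; the value must never reach logs.
        raise MissingSecretError(f"secret {name!r} is not set")
    return value
```

A deploy script would call load_secret("DB_PASSWORD") once at startup, so a misconfigured pipeline fails immediately instead of halfway through a release.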

For Kubernetes deployments, store secrets in external secret managers like HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault. Use tools like External Secrets Operator to sync secrets into Kubernetes namespaces at deploy time. Rotate secrets regularly with automated rotation scripts in CI. A database password rotated monthly limits exposure window if compromised.

Critical: Revoke cloud access keys and service account credentials when employees leave. Automated pipelines with long-lived credentials become backdoors if not audited. Use short-lived OIDC tokens instead of static credentials when your CI platform supports it.

Explain trunk-based development versus GitFlow

Trunk-based development commits frequently to a main branch with feature flags controlling incomplete features in production. Branches are short-lived (less than a day) to minimize merge conflicts. Releases deploy directly from main with tags marking release points. This enables rapid iteration but requires strong test coverage since broken code reaches main quickly.

GitFlow maintains separate develop and main branches with long-lived feature branches merged to develop via pull requests. Releases branch from develop, get tested, then merge to main with tags. Hotfixes branch from main for emergency production fixes. This provides more gating but slows feature velocity and creates complex merge scenarios when multiple features interact.

Trunk-based development works better for small teams shipping continuously while GitFlow suits larger teams with scheduled releases and extensive QA cycles. Most modern DevOps practices favor trunk-based development with automated testing replacing manual QA gates.

How do you implement blue-green deployments?

Maintain two identical production environments labeled blue and green. When deploying version 2, update the idle green environment while blue serves production traffic. Run smoke tests against green, then switch the load balancer or DNS to route traffic to green. Blue becomes the new idle environment. If issues appear, switch back to blue as an instant rollback.

This requires double infrastructure capacity since both environments run full-scale at cutover. In Kubernetes, use two separate Deployments with different labels and switch the Service selector to change which Deployment receives traffic. In cloud environments, use ALB target groups or Route 53 weighted routing with health checks to verify the new environment before cutover.
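The cutover logic itself is small. The following Python sketch models it with an in-memory router standing in for the load balancer or Service selector; the class and method names are illustrative assumptions, not a real deployment API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict


@dataclass
class BlueGreenRouter:
    """Tracks which environment receives traffic, like a Service selector."""
    versions: Dict[str, str] = field(
        default_factory=lambda: {"blue": "v1", "green": "v1"})
    active: str = "blue"

    @property
    def idle(self) -> str:
        return "green" if self.active == "blue" else "blue"

    def deploy(self, version: str, smoke_test: Callable[[str], bool]) -> bool:
        """Update the idle environment and cut over only if smoke tests pass."""
        target = self.idle
        self.versions[target] = version
        if not smoke_test(version):
            return False      # traffic never left the active environment
        self.active = target  # the cutover is a single pointer flip
        return True

    def rollback(self) -> None:
        """Send traffic back to the other environment, which still runs the old version."""
        self.active = self.idle


router = BlueGreenRouter()
router.deploy("v2", smoke_test=lambda version: True)
print(router.active, router.versions[router.active])
router.rollback()
print(router.active, router.versions[router.active])
```

Because the old environment is left untouched during the deploy, rollback needs no rebuild, which is the main reason the pattern is worth the extra capacity.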

Infrastructure as Code

What is the difference between declarative and imperative infrastructure?

Declarative IaC like Terraform defines the desired end state and the tool figures out how to achieve it. You declare "I want 3 EC2 instances with these properties" and Terraform compares current state to desired state, creating a plan to converge them. Running the same Terraform configuration multiple times produces the same result because it's idempotent.

Imperative approaches like shell scripts with AWS CLI commands specify exact steps to execute. A script that runs aws ec2 run-instances creates new instances every execution unless you add logic to check if they already exist. Imperative scripts require explicit error handling and state tracking while declarative tools handle that internally.
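The difference is easier to see in code. The sketch below imitates the declarative loop in plain Python: compare desired state with current state, build a plan, apply it. It is a toy model with hypothetical names and resource attributes and makes no claim about how Terraform is implemented.

```python
from typing import Dict, List, Tuple

Resources = Dict[str, dict]
Plan = List[Tuple[str, str]]


def make_plan(desired: Resources, current: Resources) -> Plan:
    """Compute the actions needed to move `current` to `desired`."""
    actions: Plan = []
    for name in sorted(desired):
        if name not in current:
            actions.append(("create", name))
        elif current[name] != desired[name]:
            actions.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            actions.append(("delete", name))
    return actions


def apply_plan(desired: Resources, current: Resources, actions: Plan) -> Resources:
    """Return the new state after executing the plan."""
    result = dict(current)
    for action, name in actions:
        if action == "delete":
            del result[name]
        else:
            result[name] = desired[name]
    return result


desired = {f"web-{i}": {"type": "t3.micro"} for i in (1, 2, 3)}
current = {"web-1": {"type": "t3.small"}, "old-worker": {"type": "t3.micro"}}

first = make_plan(desired, current)
print(first)
current = apply_plan(desired, current, first)
print(make_plan(desired, current))  # a second run finds nothing left to do
```

The empty second plan is the idempotency property: an imperative script that calls run-instances three times would instead create three more instances on every run.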

How does Terraform state management work?

Terraform stores a state file mapping configuration resources to real infrastructure IDs. When you run terraform apply, it reads the state file to see what currently exists, compares to your configuration, and generates a plan showing what will change. The state file contains resource IDs, attributes, and dependencies that Terraform needs to modify infrastructure.

Store state remotely in S3 with DynamoDB locking or Terraform Cloud to enable team collaboration. Local state files on developers' laptops cause drift when multiple people apply changes. Remote state with locking prevents concurrent modifications that corrupt state. Enable state file versioning in S3 to recover from accidental deletions or corruption.

Best Practice: Never manually edit Terraform state files. Use terraform state commands to move or remove resources safely. Manual state edits that mismatch real infrastructure cause Terraform to delete and recreate resources unexpectedly.

What are Terraform modules and when should you use them?

Modules are reusable Terraform configurations that encapsulate related resources with input variables and output values. A VPC module might create subnets, route tables, and NAT gateways with configurable CIDR blocks and availability zones. You call modules with different inputs to provision consistent infrastructure patterns without duplicating code.

Use modules when you need the same infrastructure pattern in multiple places (like VPCs per environment) or want to share infrastructure components across teams. Public modules from the Terraform Registry provide community-vetted implementations for common resources. Create private modules for organization-specific patterns like standard application deployment setups with load balancers, auto-scaling, and monitoring preconfigured.

How do you test infrastructure as code?

Start with static analysis using tools like tfsec for Terraform or cfn-lint for CloudFormation to catch security misconfigurations and syntax errors without deploying. Run integration tests with Terratest or Kitchen-Terraform that spin up real infrastructure in a test account, verify it behaves as expected, then destroy it. These catch configuration errors that static analysis misses.

Implement policy as code with Open Policy Agent or Sentinel to enforce organizational standards like "all S3 buckets must enable encryption" or "EC2 instances must use approved AMIs." These policies run in CI before terraform apply to prevent non-compliant infrastructure from reaching production. Test modules in isolated environments before promoting to production to avoid widespread issues.

Monitoring and Observability

What is the difference between monitoring and observability?

Monitoring tracks predefined metrics like CPU usage, request rate, and error counts with alerts when thresholds are crossed. You decide upfront what to measure and when to alert. This works for known failure modes but misses unexpected issues. Database connection pool exhaustion won't trigger alerts if you're only monitoring CPU and memory.

Observability instruments systems to answer arbitrary questions about behavior without predicting what to ask beforehand. Distributed tracing shows request flow across services to debug latency spikes. Structured logs with high cardinality attributes enable filtering by customer ID, feature flag, or deployment version to isolate issues affecting subsets of users. Metrics, logs, and traces together enable investigating unknown-unknowns.

Explain the four golden signals of monitoring

Latency measures how long requests take to complete, distinguishing between successful requests and errors since failed requests often complete faster by short-circuiting. Track percentiles (p50, p95, p99) rather than averages because averages hide outliers. A p99 latency of 5 seconds means 1% of requests take over 5 seconds even if average latency looks healthy.

Traffic measures demand on the system in requests per second, transactions per second, or concurrent users. Saturation tracks resource utilization like CPU, memory, disk I/O, and thread pool usage. High saturation (above 80%) signals approaching limits before hard failures occur. Errors count failed requests whether due to bugs, invalid input, or infrastructure failures. Track error rate as percentage of traffic since absolute error counts spike with traffic increases.

| Signal | What It Measures | Key Metrics | Alert Threshold |
| --- | --- | --- | --- |
| Latency | Request duration | p50, p95, p99 response time | p95 exceeds SLA |
| Traffic | System demand | Requests/sec, active users | 50% deviation from baseline |
| Errors | Failed requests | Error rate %, 5xx count | Error rate > 1% |
| Saturation | Resource utilization | CPU %, memory %, disk I/O | Utilization > 80% |
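The point about percentiles versus averages is worth seeing with numbers. This Python sketch uses the nearest-rank percentile definition and made-up latency samples; monitoring backends often use interpolated or histogram-based estimates instead, so treat it as an illustration rather than how any particular tool computes percentiles.

```python
import math
from statistics import mean
from typing import Sequence


def percentile(samples: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical data: 97 fast requests and 3 very slow ones, in milliseconds.
latencies_ms = [100] * 97 + [5000] * 3

print("mean:", mean(latencies_ms))
for pct in (50, 95, 99):
    print(f"p{pct}:", percentile(latencies_ms, pct))
```

With this data the mean and the p95 both look unremarkable, while the p99 exposes the 5-second tail that a few percent of users actually experience.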

How do you implement distributed tracing?

Instrument application code to generate trace spans for each operation with a unique trace ID propagated across service boundaries. When Service A calls Service B, it passes the trace ID in HTTP headers (like the W3C traceparent header or Zipkin's X-B3-TraceId) so both services record spans under the same trace. Spans capture start time, duration, tags (like HTTP method and status code), and logs for events during the span.

Use OpenTelemetry libraries to instrument code with automatic propagation and span management. Exporters send traces to backends like Jaeger, Zipkin, or commercial SaaS products like Honeycomb or Datadog APM. Sampling prevents overwhelming the tracing backend by only recording a percentage of traces. A common policy keeps 100% of errors and slow requests but only 1% of fast successful requests. Because the outcome is known only after the request finishes, that policy requires tail-based sampling, for example in the OpenTelemetry Collector.
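A sampling rule of that shape might look like the Python sketch below. It is a standalone illustration, not OpenTelemetry's sampler API: the function name, the 1-second slow threshold, and the hash-based bucketing are all assumptions made for the example.

```python
import hashlib


def keep_trace(trace_id: str, is_error: bool, duration_ms: float,
               slow_threshold_ms: float = 1000.0,
               baseline_rate: float = 0.01) -> bool:
    """Decide after a request completes whether its trace should be stored."""
    if is_error or duration_ms >= slow_threshold_ms:
        return True  # always keep the interesting traces
    # Hash the trace ID into [0, 1) so every service that sees the same
    # trace ID reaches the same decision without coordination.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < baseline_rate


kept = sum(keep_trace(f"trace-{i}", is_error=False, duration_ms=20)
           for i in range(20_000))
print(f"kept {kept} of 20000 fast successful traces")
```

Deriving the decision from the trace ID rather than a random number keeps traces complete: either every span of a trace is stored or none is.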

What metrics should you track for a Kubernetes cluster?

Node-level metrics include CPU and memory utilization to detect nodes approaching capacity and disk usage to catch logging or image storage issues before nodes fail. Pod-level metrics track restarts (high restart counts indicate crashlooping), pending pods (signals scheduling problems), and resource usage versus requests/limits to optimize resource allocation.

Control plane metrics monitor API server latency and error rates since control plane issues affect all cluster operations. Track etcd performance with fsync duration and leader changes because etcd is the single source of truth. Monitor kubelet and container runtime metrics for node health. Application metrics depend on workload but always include request rate, latency percentiles, and error rates for services.

Incident Response and Reliability

How do you handle a production outage?

Establish incident command immediately with a clear incident lead who coordinates response and communication. The incident lead doesn't fix the issue but manages the responders and communicates status. Assess blast radius first: how many users are affected and which functionality is down. Prioritize mitigation over root cause analysis during active outages.

Restore service by rolling back recent changes, failing over to backup systems, or scaling up resources depending on the incident. Once stable, conduct blameless postmortems within 48 hours while details are fresh. The postmortem identifies what happened, why detection or response failed, and specific action items to prevent recurrence. Track action items to completion rather than letting them languish.

Key Principle: During incidents, communicate proactively on a fixed schedule (every 15-30 minutes) even if there's no new information. Silence during outages causes stakeholders to interrupt responders asking for status updates.

What are SLIs, SLOs, and SLAs?

Service Level Indicators (SLIs) are specific quantitative measurements of service behavior like request latency, availability, or throughput. An SLI might be "percentage of requests completing in under 200ms" measured over a time window. Choose SLIs that directly affect user experience rather than internal metrics users don't care about.

Service Level Objectives (SLOs) set target values for SLIs representing the reliability you commit to providing. An SLO might be "99.9% of requests complete in under 200ms over a rolling 30-day window." SLOs define your error budget: if your SLO is 99.9% uptime, you have 43 minutes of downtime per month before breaching the SLO. Spend error budget on feature velocity; when exhausted, focus on reliability.

Service Level Agreements (SLAs) are contracts with users specifying consequences if you miss SLOs, like refunds or service credits. Set SLAs more permissive than internal SLOs to create a safety buffer. If your SLA is 99.5% but you target 99.9% internally, you have margin to handle unexpected issues without contractual penalties.
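The error budget arithmetic is simple enough to keep in a helper. This is a minimal Python sketch with hypothetical function names, treating the SLO as an availability percentage over a window measured in days.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


def budget_remaining_minutes(slo_percent: float, downtime_minutes: float,
                             window_days: int = 30) -> float:
    """Budget left after the downtime already recorded; negative means breached."""
    return error_budget_minutes(slo_percent, window_days) - downtime_minutes


print(f"99.9% SLO: {error_budget_minutes(99.9):.1f} minutes per 30 days")
print(f"99.5% SLA: {error_budget_minutes(99.5):.1f} minutes per 30 days")
print(f"left after a 30-minute outage: {budget_remaining_minutes(99.9, 30):.1f}")
```

The gap between the two figures (about 43 minutes against 216) is the safety buffer between the internal SLO and the contractual SLA.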

Explain chaos engineering principles

Chaos engineering proactively injects failures into production to verify systems handle adverse conditions gracefully. Start with a hypothesis like "if an availability zone fails, the system remains operational" then design experiments to test it. Run experiments in production because staging environments don't capture real system behavior under load with production data patterns.

Begin with small blast radius experiments affecting a tiny percentage of traffic, gradually increasing scope as confidence grows. Automate common experiments like random pod deletion, network latency injection, or resource exhaustion to run continuously. Tools like Chaos Mesh, Litmus, or Gremlin provide frameworks for defining and executing chaos experiments in Kubernetes.

How do you implement automated rollbacks?

Define health metrics that indicate deployment success like error rate, latency percentiles, and key business metrics. After deploying, monitor these metrics for 10-15 minutes looking for deviations from baseline. If error rate increases by more than 2x or p95 latency exceeds SLO thresholds, trigger automatic rollback to the previous version.

In Kubernetes, use progressive delivery tools like Argo Rollouts or Flagger that automate canary deployments with metric analysis. Configure them to query Prometheus for metrics and rollback automatically if metrics degrade. In AWS, use CodeDeploy with CloudWatch alarms to rollback on alarm triggers. Store the previous deployment artifact or tag so rollbacks deploy that exact version without rebuilding.
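The decision rule described above can be expressed as a small function. This Python sketch is only the comparison step, with hypothetical names; in practice the snapshots would come from a metrics query and tools like Argo Rollouts or Flagger apply their own analysis configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Snapshot:
    error_rate: float       # fraction of failed requests, 0.0 to 1.0
    p95_latency_ms: float


def should_roll_back(baseline: Snapshot, current: Snapshot,
                     p95_slo_ms: float, max_error_ratio: float = 2.0) -> bool:
    """Roll back if errors more than double or p95 latency breaches the SLO."""
    # Note: with a baseline error rate of zero, any error at all trips this check.
    error_regressed = current.error_rate > baseline.error_rate * max_error_ratio
    latency_breached = current.p95_latency_ms > p95_slo_ms
    return error_regressed or latency_breached


baseline = Snapshot(error_rate=0.01, p95_latency_ms=180)
print(should_roll_back(baseline, Snapshot(0.015, 190), p95_slo_ms=300))
print(should_roll_back(baseline, Snapshot(0.03, 190), p95_slo_ms=300))
```

Comparing against a baseline captured just before the deploy, rather than a fixed threshold, keeps the check meaningful for services whose normal error rate is not zero.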

Security and Compliance

How do you implement least privilege access in Kubernetes?

Use Role-Based Access Control (RBAC) to grant minimum necessary permissions. Create Roles defining specific allowed operations on specific resource types in a namespace. Bind Roles to a dedicated ServiceAccount per workload rather than adding permissions to the namespace's default ServiceAccount, which every pod uses unless told otherwise. A web application typically needs permission to read only its own ConfigMaps and Secrets in its namespace, not cluster-wide pod deletion.

Use Pod Security Standards to enforce security policies like prohibiting privileged containers, host network access, or running as root. The restricted policy enforces multiple protections preventing container escapes. Network Policies limit pod-to-pod communication so compromised pods can't access database pods directly unless explicitly allowed. Apply defense in depth with multiple controls rather than relying on any single mechanism.

What are container security scanning best practices?

Scan base images for CVEs before using them in Dockerfiles using tools like Trivy, Snyk, or commercial solutions. Scan during image build in CI to catch new vulnerabilities before pushing to registries. Configure container registries like Harbor, ECR, or Docker Hub to scan on push and, where the registry supports it (Harbor does), block pulls of images with critical vulnerabilities.

Rescan running images periodically since new CVEs are disclosed daily against previously safe images. Prioritize fixing vulnerabilities with available exploits over theoretical issues. A critical CVE in an unprivileged library with no network access matters less than a moderate CVE in an internet-facing service. Automate image rebuilds when base images are patched to inherit security fixes.

Warning: Scanning alone doesn't secure systems. Act on scan results by updating dependencies, rebuilding images, and redeploying applications. Scanning without remediation wastes effort and creates alert fatigue.

How do you manage secrets rotation?

Automate secret rotation using cloud provider features like AWS Secrets Manager automatic rotation or external tools like Vault's dynamic secrets. Configure applications to reload secrets periodically or watch for secret changes rather than reading once at startup. A database password rotated monthly but not reloaded by applications causes authentication failures until pods restart.

Rotate secrets on a schedule based on sensitivity: database credentials monthly, API keys quarterly, encryption keys annually. Rotate immediately when employees with access leave or after security incidents. Use dual-write periods where both old and new secrets work simultaneously to avoid downtime during rotation. Rotate the secret, deploy updated configuration, verify applications work, then revoke the old secret.
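The overlap window can be sketched from the verifying side as follows. This is an in-memory Python illustration with hypothetical class and method names; a real system would keep both versions in a secret manager rather than in process memory.

```python
import hmac
from typing import Optional


class RotatingSecret:
    """Accepts the current secret and, during rotation, the previous one too."""

    def __init__(self, current: str) -> None:
        self.current = current
        self.previous: Optional[str] = None

    def rotate(self, new_secret: str) -> None:
        """Start the overlap window: old and new secrets are both valid."""
        self.previous = self.current
        self.current = new_secret

    def revoke_previous(self) -> None:
        """End the overlap window once all clients use the new secret."""
        self.previous = None

    def is_valid(self, candidate: str) -> bool:
        accepted = [s for s in (self.current, self.previous) if s is not None]
        # compare_digest avoids leaking match position through timing.
        return any(hmac.compare_digest(candidate.encode(), s.encode())
                   for s in accepted)


secret = RotatingSecret("old-password")
secret.rotate("new-password")
print(secret.is_valid("old-password"), secret.is_valid("new-password"))
secret.revoke_previous()
print(secret.is_valid("old-password"), secret.is_valid("new-password"))
```

The order mirrors the steps in the text: rotate, let clients pick up the new value, verify, and only then revoke the old secret.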

Configuration Management

What is the difference between Ansible and Terraform?

Terraform provisions infrastructure like VPCs, instances, and load balancers using cloud provider APIs. It's declarative and maintains state to track resources. Ansible configures servers by executing tasks in order like installing packages, copying files, and starting services. It's procedural and agentless, using SSH to connect to servers and execute configuration tasks.

Use Terraform to create the infrastructure, then Ansible to configure applications on those servers. Terraform creates EC2 instances with networking and security groups. Ansible installs web servers, deploys application code, and configures monitoring agents. You can manage both infrastructure and configuration with either tool, but each is optimized for different concerns.

How do you handle configuration drift?

Run infrastructure-as-code tooling in a read-only detect mode periodically to identify resources that diverged from code definitions. Running terraform plan shows resources modified outside Terraform, and terraform plan -detailed-exitcode exits with code 2 when changes are pending, which makes drift easy to flag in a scheduled job. Tools like CloudQuery or Steampipe query cloud resources to compare actual state against defined policies. Configure alerts when drift exceeds thresholds to trigger investigation.

Prevent drift by restricting manual changes to production infrastructure. Require all changes through code reviews and automated pipelines. Use restrictive IAM permissions, or organization-level guardrails such as AWS service control policies, to block manual operations at the cloud API level. When drift occurs, either update code to match reality if the manual change was intentional, or apply code to revert infrastructure to the defined state.

What are Helm charts and when should you use them?

Helm charts are Kubernetes application packages containing YAML templates, default values, and metadata. They enable deploying complex applications with multiple resources (Deployments, Services, ConfigMaps, Ingresses) using a single command. Values files override template defaults to customize deployments for different environments without modifying charts.

Use Helm for applications deployed across multiple clusters or environments where you need different configurations. A web application Helm chart might deploy with 2 replicas in staging but 10 replicas in production using different values files. Public Helm charts from Artifact Hub provide community-maintained packages for common software like databases, monitoring tools, and ingress controllers.

Cloud Platform Specific Questions

How do you design for high availability in AWS?

Deploy resources across multiple Availability Zones to tolerate datacenter failures. Use Elastic Load Balancers to distribute traffic across instances in multiple AZs with health checks removing failed instances. Configure Auto Scaling Groups with instances in multiple AZs and policies to maintain desired capacity by launching replacements for failed instances.

Use Multi-AZ RDS deployments for automatic failover to a standby replica in a different AZ within minutes. Configure Route 53 health checks to detect regional failures and route traffic to healthy regions. Design stateless applications where sessions stored in ElastiCache or DynamoDB survive instance failures without losing user state. Avoid single points of failure like NAT Gateways by deploying one per AZ.

What AWS services help reduce costs?

Reserved Instances and Savings Plans provide discounts up to 72% for committed usage over one or three years. Spot Instances offer 50-90% discounts for interruptible workloads like batch processing, CI/CD runners, or stateless containers. Use Auto Scaling to shut down excess capacity during low traffic periods rather than running peak capacity 24/7.

Right-size instances by analyzing CloudWatch metrics to identify oversized resources. An instance running at 10% CPU wastes 90% of capacity costs. Use S3 lifecycle policies to transition infrequently accessed data to cheaper storage classes like S3 IA or Glacier. Enable S3 Intelligent-Tiering for automatic cost optimization. Delete unused resources like old snapshots, unattached volumes, and idle load balancers.

| Cost Optimization Strategy | Potential Savings | Implementation Effort | Risk Level |
| --- | --- | --- | --- |
| Reserved Instances | 40-72% | Low | Low (committed spend) |
| Spot Instances | 50-90% | Medium | Medium (interruptions) |
| Right-sizing | 20-40% | Medium | Low |
| Auto Scaling | 30-60% | High | Medium (scaling delays) |

How do you implement disaster recovery in the cloud?

Define Recovery Time Objective (RTO) as maximum acceptable downtime and Recovery Point Objective (RPO) as maximum acceptable data loss. These drive DR architecture choices. A 4-hour RTO with 1-hour RPO allows backup-based recovery while a 5-minute RTO requires active-active multi-region deployment.

For backup-and-restore DR (highest RTO/RPO), take regular backups and store in a different region. Test restoration procedures quarterly to verify backups work. For pilot-light DR, maintain minimal infrastructure in a secondary region like databases with replication enabled, scaling up when failover is needed. For warm standby, run scaled-down versions of all production services in a secondary region, scaling up during disasters. For active-active, run full production capacity in multiple regions with load balancing between them.

Performance Optimization

How do you troubleshoot high CPU usage in containers?

Use kubectl top pods to identify which pods consume excessive CPU. Exec into the container and run top to see which processes within the container use CPU. Check application logs for errors indicating inefficient code paths like infinite loops or excessive garbage collection. Profile the application using language-specific tools like Go's pprof or Java's JProfiler.

Examine recent deployments for code changes correlating with CPU spikes. Review application metrics to see if request volume increased proportionally with CPU or if efficiency decreased. Check whether CPU throttling is occurring by comparing usage from kubectl top pods against the limits shown in kubectl describe pod. A container that hits its CPU limit is throttled: usage appears pinned at the limit while the application actually runs slower.
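
Throttling can also be measured directly. On Linux, the container's cgroup exposes a cpu.stat file (typically /sys/fs/cgroup/cpu.stat on cgroup v2) with nr_periods and nr_throttled counters. The sketch below parses that format; the sample text is made up, and the helper name is hypothetical.

```python
def throttle_ratio(cpu_stat_text):
    """Fraction of CFS scheduling periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods


# Made-up sample in the cpu.stat key/value format.
sample = """nr_periods 1000
nr_throttled 250
throttled_usec 4200000"""
print(f"{throttle_ratio(sample):.0%} of periods were throttled")
```

A persistently high ratio suggests the CPU limit is too low for the workload, even when average usage looks moderate.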

What causes memory leaks in containerized applications?

Common causes include unclosed database connections, event listeners that are registered but never removed, caches without eviction policies that grow without bound, and references held longer than intended. Circular references are only a problem in reference-counted runtimes; tracing garbage collectors like those in Java and Node.js collect unreachable cycles. Node.js applications often leak through closures that capture large contexts. Java applications often leak through static references that keep objects reachable.

Detect memory leaks by monitoring container memory usage over time, looking for steady growth rather than fluctuations. Use heap dumps to compare memory snapshots before and after processing to identify objects that should be released but remain. Configure applications to expose memory metrics via Prometheus or custom endpoints showing heap size, GC frequency, and object counts.
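
Distinguishing steady growth from normal fluctuation can be sketched with a simple trend check. This is a rough heuristic, not a standard algorithm: it combines a least-squares slope with the fraction of samples that rose, and both thresholds are assumptions you would tune for your workload.

```python
def growth_slope(samples):
    """Least-squares slope of samples against their index (units per sample)."""
    n = len(samples)
    if n < 2:
        return 0.0
    mean_x = (n - 1) / 2
    mean_y = sum(samples) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(samples))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def looks_like_leak(samples_mb, min_slope_mb=1.0, min_rising_fraction=0.8):
    """Flag memory series that rise both steeply and consistently."""
    if len(samples_mb) < 2:
        return False
    rises = sum(1 for a, b in zip(samples_mb, samples_mb[1:]) if b > a)
    rising_fraction = rises / (len(samples_mb) - 1)
    return (growth_slope(samples_mb) >= min_slope_mb
            and rising_fraction >= min_rising_fraction)


# Hypothetical memory readings in MB taken at regular intervals.
steady_growth = [100, 105, 111, 116, 122, 127]
fluctuating = [100, 140, 95, 150, 98, 145]
print(looks_like_leak(steady_growth), looks_like_leak(fluctuating))
```

The same idea is usually expressed as an alerting rule over container memory metrics rather than as application code.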

How do you optimize database queries in production?

Enable slow query logs to identify queries taking longer than thresholds like 1 second. Analyze slow queries with EXPLAIN or EXPLAIN ANALYZE to see query execution plans, identifying missing indexes or inefficient joins. Add indexes on columns used in WHERE clauses, JOIN conditions, and ORDER BY clauses. Be cautious with indexes on high-write tables, since every index must be updated on each insert and update.
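
The effect of an index on a query plan is easy to demonstrate locally. The sketch below uses SQLite from Python's standard library as a stand-in for a production database, so the command is EXPLAIN QUERY PLAN rather than the EXPLAIN ANALYZE you would run on PostgreSQL or MySQL. The table and index names are hypothetical.

```python
import sqlite3


def query_plan(conn, sql, params=()):
    """Return SQLite's query plan for a statement as one readable string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    # The last column of each row is the human-readable plan detail.
    return " | ".join(row[-1] for row in rows)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer_id, total) VALUES (?, ?)",
    [(i % 100, i * 1.5) for i in range(1000)],
)

sql = "SELECT * FROM orders WHERE customer_id = ?"
plan_before = query_plan(conn, sql, (42,))  # full table scan expected
conn.execute("CREATE INDEX idx_orders_customer ON orders (customer_id)")
plan_after = query_plan(conn, sql, (42,))   # index lookup expected

print("before:", plan_before)
print("after: ", plan_after)
```

The exact plan wording varies by SQLite version, but the first plan should describe a scan of the table and the second a search using the new index.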

Cache frequently accessed data in Redis or Memcached to reduce database load. Configure query result caching with appropriate TTLs based on how stale data can be. Use read replicas to offload read queries from the primary database, directing analytical queries to replicas while writes go to the primary. Consider database connection pooling to reuse connections instead of establishing new connections for each request.
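
The TTL-based caching idea can be sketched in-process before reaching for Redis or Memcached. This is a simplified cache-aside illustration with a hypothetical class name; it ignores eviction by size and concurrent access, both of which a real cache handles for you.

```python
import time


class TTLCache:
    """Cache-aside helper: serve cached values until their TTL expires."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so tests can control time
        self._entries = {}           # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]          # fresh hit: skip the database
        value = loader(key)          # miss or stale: query the source
        self._entries[key] = (now + self._ttl, value)
        return value


# Demo with a fake clock and a loader that records each "database" call.
fake_now = [0.0]
db_calls = []


def load_user(user_id):
    db_calls.append(user_id)
    return {"id": user_id}


cache = TTLCache(ttl_seconds=30, clock=lambda: fake_now[0])
cache.get_or_load("u1", load_user)
cache.get_or_load("u1", load_user)   # served from cache
fake_now[0] = 31.0
cache.get_or_load("u1", load_user)   # TTL expired, reloads
print(len(db_calls))
```

The TTL is the knob that trades freshness for database load: a longer TTL means fewer queries but staler data.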

GitOps and Deployment Strategies

What is GitOps and how does it differ from traditional CI/CD?

GitOps uses Git as the single source of truth for declarative infrastructure and application definitions. Desired system state is stored in Git and operators like ArgoCD or Flux continuously reconcile cluster state to match the Git repository. Changes happen by committing to Git, which triggers automated deployment rather than running imperative deployment scripts.

Traditional CI/CD uses push-based deployment where CI pipelines push changes to environments using credentials stored in the CI system. GitOps uses pull-based deployment where operators running inside clusters watch Git for changes and apply them. This improves security because the CI system no longer needs credentials for production clusters and the cluster API does not have to be exposed to it. It also provides auditability since all changes have Git commits with authors and timestamps.
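
The reconciliation idea behind pull-based deployment can be sketched as a diff between desired and actual state. This is a toy model, not how ArgoCD or Flux are implemented: state is represented as plain dictionaries, and the resource names are hypothetical.

```python
def plan_reconcile(desired, actual):
    """Return the actions needed to make actual state match desired state."""
    actions = []
    for name, spec in sorted(desired.items()):
        if name not in actual:
            actions.append(("create", name))
        elif actual[name] != spec:
            actions.append(("update", name))
    for name in sorted(actual):
        if name not in desired:
            actions.append(("delete", name))
    return actions


# Desired state as it might be declared in Git (hypothetical resources).
desired = {
    "web": {"image": "web:2", "replicas": 3},
    "api": {"image": "api:1", "replicas": 2},
}
# Actual state as observed in the cluster.
actual = {
    "web": {"image": "web:1", "replicas": 3},
    "legacy": {"image": "legacy:9", "replicas": 1},
}
print(plan_reconcile(desired, actual))
```

An operator runs this kind of comparison in a loop, so manual changes to the cluster show up as drift and get reverted on the next pass.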

How do you implement canary deployments?

Deploy the new version alongside the current version, initially routing a small percentage of traffic (like 5%) to the canary while 95% goes to stable. Monitor canary metrics for 15-30 minutes comparing error rates, latency, and business metrics against the stable version. If metrics are healthy, gradually increase canary traffic to 10%, 25%, 50%, then 100% with validation at each step.

If canary metrics degrade beyond thresholds at any stage, automatically roll back by removing the canary and routing all traffic back to stable. In Kubernetes, use Argo Rollouts or Flagger to automate canary progression with metric analysis. Configure traffic splitting using service meshes like Istio or Linkerd for fine-grained traffic control. Without a mesh, approximate the split with an ingress controller that supports weighted routing, or by running stable and canary pods behind one Service and adjusting their replica ratio.
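
The progression and rollback logic can be sketched as a loop over traffic steps. This is a simplified model of what tools like Argo Rollouts automate: the error-rate callback stands in for a real metrics query, and the 1.5x tolerance is an assumed threshold.

```python
CANARY_STEPS = [5, 10, 25, 50, 100]  # percent of traffic sent to the canary


def run_canary(error_rate_at, stable_error_rate, max_ratio=1.5, steps=CANARY_STEPS):
    """Advance through traffic steps; roll back if the canary degrades.

    error_rate_at is a hypothetical callback returning the canary's
    observed error rate at a given traffic weight.
    """
    for weight in steps:
        canary_rate = error_rate_at(weight)
        if canary_rate > stable_error_rate * max_ratio:
            return ("rolled_back", weight)
    return ("promoted", steps[-1])


def healthy(weight):
    return 0.01


def degrades_under_load(weight):
    # Simulates a bug that only appears once traffic reaches 25%.
    return 0.01 if weight < 25 else 0.05


print(run_canary(healthy, stable_error_rate=0.01))
print(run_canary(degrades_under_load, stable_error_rate=0.01))
```

A real implementation would also wait for an observation window at each step and compare latency and business metrics, not just error rate.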

Best Practice: Test canary deployments with internal users or specific customer segments before exposing to all traffic. Use feature flags to enable new features for canary traffic only, isolating risk further than deployment-level canaries alone.

What are feature flags and how do they enable safer deployments?

Feature flags are runtime toggles that enable or disable code paths without redeploying. Wrap new features in flag checks: if (featureFlags.newCheckout) { /* new code */ } else { /* old code */ }. Deploy new code with flags disabled, verify deployment succeeds, then gradually enable flags for increasing user percentages. This decouples deployment from feature releases.

Use feature flags for gradual rollouts (5% of users, then 25%, then 100%), A/B testing different implementations, or kill switches to disable problematic features instantly without redeployment. Store flag configurations in databases or services like LaunchDarkly or Split.io with APIs to change flag values in real-time. This enables instant rollback of problematic features by flipping a flag rather than redeploying.
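
Percentage rollouts are commonly implemented by hashing a stable user identifier into a bucket. The sketch below shows one such scheme; it is not how LaunchDarkly or Split.io work internally, and the function names are hypothetical.

```python
import hashlib


def bucket(flag_name, user_id):
    """Map a user to a stable bucket in [0, 100) for a given flag."""
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag_name, user_id, rollout_percent):
    """True if the user's bucket falls inside the current rollout percentage."""
    return bucket(flag_name, user_id) < rollout_percent


users = [f"user-{i}" for i in range(200)]
at_5 = {u for u in users if is_enabled("newCheckout", u, 5)}
at_25 = {u for u in users if is_enabled("newCheckout", u, 25)}
# Users enabled at 5% stay enabled when the rollout widens to 25%.
print(at_5 <= at_25)
```

Including the flag name in the hash means each flag gets an independent slice of users, so the same people are not always first to see every new feature.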

Frequently Asked Questions

How do you handle configuration differences between environments?

Use environment-specific configuration files or values rather than conditional logic in code. Store non-sensitive configuration in ConfigMaps (Kubernetes) or managed configuration services (AWS Systems Manager Parameter Store, Azure App Configuration). Inject configurations as environment variables or mounted files that applications read at startup. Use different values files with Helm or different variable files with Terraform for environment-specific settings like replica counts, instance sizes, or API endpoints.
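
On the application side, reading injected configuration at startup might look like the sketch below. The variable names (API_ENDPOINT, WORKER_COUNT, DEBUG) and the example URLs are hypothetical; the point is that the code is identical across environments and only the injected values differ.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AppConfig:
    api_endpoint: str
    worker_count: int
    debug: bool


def load_config(env=os.environ):
    """Build config from environment variables, failing fast on missing ones."""
    return AppConfig(
        api_endpoint=env["API_ENDPOINT"],               # required, no default
        worker_count=int(env.get("WORKER_COUNT", "2")),
        debug=env.get("DEBUG", "false").lower() == "true",
    )


# Same code, different injected values per environment.
staging = load_config({"API_ENDPOINT": "https://api.staging.example.com", "DEBUG": "true"})
production = load_config({"API_ENDPOINT": "https://api.example.com", "WORKER_COUNT": "8"})
print(staging)
print(production)
```

Raising an error at startup for a missing required value is deliberate: a pod that fails immediately is easier to diagnose than one that misbehaves later.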

What is the purpose of a service mesh?

Service meshes like Istio, Linkerd, or Consul provide infrastructure-level traffic management, security, and observability for microservices without requiring application code changes. They inject sidecar proxies alongside application containers to intercept all network traffic. Proxies handle mutual TLS for service-to-service encryption, distributed tracing, traffic splitting for canary deployments, circuit breaking, retry logic, and detailed metrics collection.

How do you debug networking issues in Kubernetes?

Start by verifying basic connectivity with kubectl exec into a pod and using curl or wget to test service endpoints. Check if DNS resolution works by querying service names with nslookup. Examine NetworkPolicy resources that might block traffic between pods. Verify Service selectors match pod labels with kubectl describe service and kubectl get pods --show-labels. Check if endpoints exist for services with kubectl get endpoints.

What causes pods to be stuck in Pending state?

The most common cause is insufficient cluster resources, where pod requests exceed the unallocated capacity on every node. Describe the pod to see FailedScheduling events explaining why placement failed, and check allocated resources per node with kubectl describe node (kubectl top nodes shows actual usage, but the scheduler places pods based on requests). Other causes include pod affinity rules or node selectors that no existing node satisfies, PersistentVolumeClaims that cannot bind because no matching volume is available, and taints on nodes without corresponding tolerations on the pod.

How do you implement zero-trust security for microservices?

Require authentication and authorization for all service-to-service communication rather than trusting network boundaries. Use mutual TLS to verify service identity in both directions. Implement service identity with certificates or JWT tokens that encode what service is making requests. Define authorization policies specifying which services can call which endpoints rather than allowing all internal traffic. Service meshes automate much of this by managing certificates and enforcing policies transparently.

What is the difference between horizontal and vertical pod autoscaling?

Horizontal Pod Autoscaler (HPA) adds more pod replicas when CPU or custom metrics exceed thresholds. It scales a Deployment from 2 pods to 10 pods under load, then back down when load decreases. Vertical Pod Autoscaler (VPA) adjusts resource requests and limits on existing pods by restarting them with new values. It increases a pod's memory request from 512MB to 1GB if usage consistently approaches the limit. Use HPA for stateless applications and VPA for applications that don't scale horizontally well.
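
The core HPA calculation documented by Kubernetes is desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric). The sketch below applies that formula with min/max clamping; it leaves out details of the real controller such as the tolerance band and stabilization windows.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10):
    """Apply the basic HPA scaling formula, clamped to the allowed range."""
    raw = math.ceil(current_replicas * current_metric / target_metric)
    return max(min_replicas, min(max_replicas, raw))


# 2 pods averaging 90% CPU against a 50% target.
print(desired_replicas(2, current_metric=90, target_metric=50))
# 4 pods averaging 10% CPU against a 50% target, with a floor of 2.
print(desired_replicas(4, current_metric=10, target_metric=50, min_replicas=2))
```

Because the metric is usually expressed relative to resource requests, HPA only behaves sensibly when pods have accurate CPU requests set.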

How do you implement blue-green deployments in Kubernetes?

Create two identical Deployments labeled blue and green. The Service selector determines which Deployment receives traffic. To deploy version 2, update the green Deployment while blue serves traffic. Verify green pods are healthy, then update the Service selector to point to green pods. Traffic switches to the new version almost immediately. Keep the blue Deployment running for a while (30 minutes, for example) so you can roll back instantly by switching the Service selector back if issues arise.
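
The selector switch can be modeled in a few lines to show why cutover and rollback are each a single change. This is a toy simulation of label matching, not a Kubernetes client call; the pod names and the color label are hypothetical.

```python
def pods_for_service(selector, pods):
    """Return names of pods whose labels match every key in the selector."""
    return sorted(
        name for name, labels in pods.items()
        if all(labels.get(key) == value for key, value in selector.items())
    )


pods = {
    "web-blue-1": {"app": "web", "color": "blue"},
    "web-blue-2": {"app": "web", "color": "blue"},
    "web-green-1": {"app": "web", "color": "green"},
    "web-green-2": {"app": "web", "color": "green"},
}

service_selector = {"app": "web", "color": "blue"}
before_cutover = pods_for_service(service_selector, pods)

service_selector["color"] = "green"   # cutover: one selector change
after_cutover = pods_for_service(service_selector, pods)

service_selector["color"] = "blue"    # rollback: the same change reversed
after_rollback = pods_for_service(service_selector, pods)

print(before_cutover, after_cutover, after_rollback)
```

The cost of this simplicity is running two full copies of the application during the deployment window.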

What are init containers and when should you use them?

Init containers run to completion before application containers start, useful for setup tasks like downloading configuration, waiting for dependencies to be ready, or setting file permissions. A web app might use an init container to run database migrations before the app starts serving traffic. Init containers run sequentially in the order defined while application containers start in parallel. If an init container fails, Kubernetes restarts it until it succeeds, unless the pod's restartPolicy is Never, in which case the pod is marked failed.

How do you troubleshoot CrashLoopBackOff errors?

Check container logs with kubectl logs pod-name, adding --previous to see logs from the container instance that crashed. Describe the pod to see events and exit codes. Exit code 0 means a clean exit; non-zero indicates an error. Common causes include application crashes due to bugs, missing configuration or secrets, misconfigured liveness probes that kill healthy containers, memory limits set too low so the container is OOMKilled, or incorrect entrypoint commands that fail immediately.
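
Exit codes above 128 conventionally mean the process was killed by a signal, with the signal number equal to the code minus 128. The sketch below decodes them using Python's signal module on a POSIX system; the helper name is hypothetical.

```python
import signal


def describe_exit_code(code):
    """Translate a container exit code into a short human-readable hint."""
    if code == 0:
        return "clean exit"
    if code > 128:
        # Convention: 128 + N means the process died from signal N.
        try:
            name = signal.Signals(code - 128).name
        except ValueError:
            return f"exit code {code} (unrecognized signal {code - 128})"
        return f"terminated by {name}"
    return f"application error (exit code {code})"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```

Exit code 137 (SIGKILL) is what you typically see after an out-of-memory kill, and kubectl describe pod will show the reason as OOMKilled in that case. Exit code 143 (SIGTERM) usually indicates a normal shutdown request.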

What is the difference between readiness and liveness probes?

Liveness probes detect if a container is deadlocked and needs restarting. Kubernetes restarts containers that fail liveness probes. Configure liveness to check if the application process is responsive, typically with HTTP GET to a basic health endpoint. Readiness probes detect if a container is ready to receive traffic. Kubernetes removes pods from service endpoints when readiness probes fail but doesn't restart them. Use readiness for situations where the app temporarily cannot serve requests, such as warming up caches or waiting on a dependency. For applications that are slow to start, a startup probe holds off liveness and readiness checks until startup completes.

Conclusion

DevOps interviews assess practical experience with production systems under real constraints. Focus preparation on understanding tradeoffs rather than memorizing commands: why you choose eventual consistency over strong consistency in distributed systems, when to invest in automation versus manual processes, and how incident response procedures balance speed with thorough investigation. The questions that reveal depth ask about failure modes, edge cases, and lessons learned from mistakes.

Build hands-on experience by running Kubernetes clusters locally with kind or minikube, setting up CI/CD pipelines in GitHub Actions or GitLab CI, and deploying applications with monitoring and automated rollbacks. Failure experience is as valuable as success: debug crashlooping pods, investigate memory leaks, and recover from failed deployments. These experiences inform interview answers with specificity that distinguishes practitioners from those who only read documentation.

