Question 1

Walk me through what happens between a developer merging to main and the change running in production in a well-built pipeline.

Accepted Answer

Cover the stages in order: CI triggers on merge, builds a versioned immutable artifact (container image tagged with the SHA), runs the test gates, pushes to a registry with scanning, then deployment promotes that same artifact through environments rather than rebuilding. Describe the production rollout strategy (progressive delivery with health checks), the automated verification after deploy, and the rollback path. Mentioning that the artifact is built once and promoted, not rebuilt per environment, is a strong signal.
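A minimal sketch of the build-once, promote-many idea, assuming a three-stage environment order and a hypothetical registry name; the `Artifact` and `Pipeline` types are illustrative only and do not correspond to any particular CI system:

```python
from dataclasses import dataclass, field

# Hypothetical promotion order; real pipelines may have more or fewer stages.
ENVIRONMENTS = ["dev", "staging", "production"]


@dataclass(frozen=True)
class Artifact:
    image: str    # repository, e.g. registry.example.com/app (made-up name)
    git_sha: str  # commit the image was built from
    digest: str   # content digest reported by the registry after push

    @property
    def ref(self) -> str:
        # Tag for humans, digest for immutability.
        return f"{self.image}:{self.git_sha}@{self.digest}"


@dataclass
class Pipeline:
    deployed: dict = field(default_factory=dict)  # environment -> Artifact

    def promote(self, artifact: Artifact, env: str) -> None:
        """Deploy the already-built artifact to env; never rebuild."""
        index = ENVIRONMENTS.index(env)
        if index > 0:
            previous = ENVIRONMENTS[index - 1]
            if self.deployed.get(previous) != artifact:
                raise ValueError(f"{artifact.git_sha} has not passed {previous}")
        self.deployed[env] = artifact
```

The point of the sketch is that every environment ends up holding the identical reference, so what was tested in staging is byte-for-byte what runs in production.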

Question 2

A pod is in CrashLoopBackOff. How do you debug it?

Accepted Answer

Evidence first: kubectl describe pod for events and the last state (OOMKilled, exit code, failed probes), then kubectl logs --previous for the crashed container. Common causes in rough order: application error on startup, missing config or secret, failing liveness probe with too-tight timing, OOM from a low memory limit, or an image/architecture mismatch. Fix the identified cause rather than restarting and hoping. Mentioning exit codes (137 is 128 + 9, a SIGKILL, which usually means the container was OOM-killed) and probe misconfiguration signals real operational time with Kubernetes.
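The exit-code reasoning can be sketched as a small helper. It assumes you have already pulled the `lastState.terminated` object out of the pod status (for example from `kubectl get pod -o json`); the function name and the wording of the hints are made up for illustration:

```python
# Signals that show up most often in crashed containers.
SIGNALS = {9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}


def explain_termination(terminated: dict) -> str:
    """Turn a container's last terminated state into a first debugging hint."""
    code = terminated.get("exitCode")
    reason = terminated.get("reason", "")
    if reason == "OOMKilled":
        return "OOMKilled: compare the memory limit with actual usage"
    if code is None:
        return "no exit code recorded: check events for image or scheduling errors"
    if code > 128:
        # By shell convention, exit codes above 128 mean 128 + signal number.
        number = code - 128
        name = SIGNALS.get(number, f"signal {number}")
        return f"killed by {name} (exit {code}): check probes, limits, and node events"
    if code == 0:
        return "exited cleanly: the process finished instead of staying up"
    return f"application exited with code {code}: read the logs from the previous container"
```

For example, `explain_termination({"exitCode": 137, "reason": "Error"})` points at SIGKILL, which is the cue to look at memory limits and liveness probe kills before touching application code.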

Question 3

How do you manage Terraform state across a team and multiple environments?

Accepted Answer

Remote state in versioned, encrypted storage with locking so two applies cannot collide; state split per environment and per logical stack rather than one monolith, so blast radius and plan time stay bounded. Changes go through CI with plan output reviewed in the pull request, not applied from laptops. For drift: scheduled plan checks to detect it, import or targeted apply to reconcile it, and an honest acknowledgment that state surgery (moved or removed resources) happens and needs care.
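A scheduled drift check can be sketched around `terraform plan -detailed-exitcode`, which exits 0 when the plan is empty, 1 on error, and 2 when changes are pending. The function names and labels below are assumptions for illustration, and a pending change may be unapplied configuration rather than true drift, so treat the result as a prompt to look, not a verdict:

```python
import subprocess

# Exit codes documented for `terraform plan -detailed-exitcode`.
PLAN_OUTCOMES = {0: "in_sync", 1: "error", 2: "changes_pending"}


def classify_plan_exit(code: int) -> str:
    """Map the plan exit code to a label; unknown codes count as errors."""
    return PLAN_OUTCOMES.get(code, "error")


def check_stack(stack_dir: str) -> str:
    """Run a non-interactive plan in one stack directory and classify it."""
    result = subprocess.run(
        ["terraform", "plan", "-detailed-exitcode", "-input=false", "-no-color"],
        cwd=stack_dir,
        capture_output=True,
        text=True,
    )
    return classify_plan_exit(result.returncode)
```

Running `check_stack` once per environment and stack directory matches the split-state layout described above: each check stays small and a failure names the stack that needs attention.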

Question 4

Compare rolling, blue-green, and canary deployments. When would you pick each?

Accepted Answer

Rolling is the default: cheap, built into orchestrators, but mixes versions during rollout and rollback is slow. Blue-green gives instant cutover and rollback at the cost of double capacity, and works best when you need atomic switches. Canary sends a controlled traffic slice to the new version and verifies against metrics before promoting, which is the strongest risk reduction but requires real observability and traffic-splitting infrastructure. Mention the shared constraint: all three require backward-compatible database changes (expand-and-contract migrations), because old and new code run against the same schema and the schema itself cannot be blue-greened.
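The canary "verify against metrics before promoting" step can be sketched as a comparison of error rates. The thresholds here (minimum sample size, allowed ratio, absolute floor) are hypothetical defaults, not recommendations, and a real analysis would usually look at latency as well:

```python
def canary_verdict(
    baseline_errors: int,
    baseline_total: int,
    canary_errors: int,
    canary_total: int,
    min_requests: int = 500,   # assumed: below this, the sample is too small to judge
    max_ratio: float = 1.5,    # assumed: canary may be at most 1.5x the baseline rate
    floor: float = 0.001,      # assumed: always tolerate up to 0.1% errors
) -> str:
    """Return 'wait', 'promote', or 'rollback' for one evaluation window."""
    if canary_total < min_requests:
        return "wait"
    baseline_rate = baseline_errors / baseline_total if baseline_total else 0.0
    canary_rate = canary_errors / canary_total
    allowed = max(baseline_rate * max_ratio, floor)
    return "promote" if canary_rate <= allowed else "rollback"
```

The floor exists so that a near-perfect baseline does not make a single canary error look like a regression.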

Question 5

How would you design observability for a system moving from one monolith to twenty services?

Accepted Answer

Standardize instrumentation at the platform layer (OpenTelemetry SDKs or a service mesh) so teams do not each invent it: structured logs with correlation IDs, RED metrics (rate, errors, duration) per service, and distributed tracing, because tracing is what answers cross-service questions a monolith never had. Alert on user-facing symptoms via SLOs, not on every host metric, or twenty services will produce alert noise that drowns the on-call. Mention cost controls early: log sampling and trace sampling policies, because observability spend grows faster than service count if unmanaged.
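A minimal sketch of structured logs that carry a correlation ID, using only the standard library. In practice an OpenTelemetry SDK or shared platform library would supply this; the helper names and field names below are assumptions for illustration:

```python
import contextvars
import json
import time
import uuid

# Holds the ID for the request currently being handled.
correlation_id = contextvars.ContextVar("correlation_id", default=None)


def start_request(incoming_id=None) -> str:
    """Reuse the caller's ID if one was passed along, else mint a new one."""
    cid = incoming_id or uuid.uuid4().hex
    correlation_id.set(cid)
    return cid


def log_event(service: str, message: str, **fields) -> str:
    """Build one JSON log line; every line carries the correlation ID."""
    record = {
        "ts": time.time(),
        "service": service,
        "correlation_id": correlation_id.get(),
        "message": message,
        **fields,
    }
    return json.dumps(record, sort_keys=True)
```

Because each service reuses the incoming ID rather than minting its own, one search on `correlation_id` reconstructs a request's path across all twenty services.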

Question 6

How do you handle secrets across environments and CI?

Accepted Answer

Principles first: no secrets in code, images, or CI variables where avoidable; a dedicated manager (Vault, cloud-native secret stores) as the source of truth; per-environment scoping so staging credentials cannot touch production. For CI, prefer OIDC federation issuing short-lived cloud credentials per job over stored long-lived keys, which removes the rotation problem for the largest credential class. Cover rotation for what remains, audit logging on access, and the incident answer: how you would rotate everything if a repository leaked.

Question 7

The cloud bill has doubled in a year. How do you find and cut the waste?

Accepted Answer

Attribute before optimizing: tagging and cost-allocation reporting to find which teams and services grew, separating growth that tracks traffic from growth that does not. Then the usual suspects in order of effort-to-savings: idle and oversized compute (rightsizing from utilization data), storage that never gets cleaned (old snapshots, logs without lifecycle policies), unoptimized data transfer paths, then commitment coverage (Savings Plans, reserved capacity) once usage is efficient, because committing to waste locks it in. End with prevention: budgets and anomaly alerts per team so the next doubling gets caught at twenty percent.
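Separating growth that tracks traffic from growth that does not comes down to unit cost. The sketch below assumes you already have per-service cost and request counts for two periods (from tags and cost-allocation reports); the data shape and the 20% threshold are made up for illustration:

```python
def unit_cost_growth(before: dict, after: dict) -> float:
    """Relative change in cost per request between two periods."""
    unit_before = before["cost"] / before["requests"]
    unit_after = after["cost"] / after["requests"]
    return unit_after / unit_before - 1


def suspect_services(usage: dict, threshold: float = 0.2) -> list:
    """Services whose cost per request grew more than threshold, worst first."""
    growth = [
        (name, unit_cost_growth(periods["before"], periods["after"]))
        for name, periods in usage.items()
    ]
    flagged = [item for item in growth if item[1] > threshold]
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A service whose bill doubled while its traffic doubled shows zero unit-cost growth and drops out of the list; one whose bill tripled on flat traffic rises to the top.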

Question 8

CI takes 40 minutes on a monorepo and developers are complaining. How do you cut it down?

Accepted Answer

Profile the pipeline first to find where the time actually goes; teams routinely optimize the wrong stage. The usual levers: caching (dependencies, Docker layers, build outputs), parallelizing and sharding the test suite, and for a monorepo specifically, affected-only builds so a change to one package does not rebuild the world (build-graph tooling or path filters). Then look at runner sizing and queue time, which is often half the wall clock. State a target and re-measure; also note the cultural fix of moving slow suites to merge-queue or post-merge gates where appropriate.
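Affected-only builds reduce to a graph walk: find the packages that changed, then everything that depends on them. This sketch assumes each package lives in its own top-level directory and that the dependency map is already known; real build-graph tools derive both for you:

```python
from collections import deque


def affected_packages(changed_files: list, deps: dict) -> list:
    """deps maps each package to the packages it depends on."""
    # Assumption: the first path segment is the package name.
    changed = {path.split("/", 1)[0] for path in changed_files}
    changed = {pkg for pkg in changed if pkg in deps}

    # Invert the graph: for each package, who depends on it?
    dependents = {}
    for pkg, requires in deps.items():
        for dep in requires:
            dependents.setdefault(dep, set()).add(pkg)

    # Breadth-first walk from the changed packages through their dependents.
    affected = set(changed)
    queue = deque(changed)
    while queue:
        pkg = queue.popleft()
        for dependent in dependents.get(pkg, ()):
            if dependent not in affected:
                affected.add(dependent)
                queue.append(dependent)
    return sorted(affected)
```

With a graph where `api` depends on `lib-auth`, which depends on `lib-core`, a change under `lib-auth/` rebuilds `lib-auth` and `api` but leaves `lib-core` and unrelated packages alone.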

Question 9

What does container and supply chain security involve in practice?

Accepted Answer

Layers: minimal base images (distroless or slim) to shrink the attack surface, image scanning in CI with a policy on what blocks the build, no root processes and read-only filesystems where possible, and pinned, verified dependencies. Up the stack: signing images and verifying signatures at admission, generating SBOMs so you can answer "are we affected" in hours instead of weeks when the next major CVE lands, and short-lived registry credentials. Acknowledge the tradeoff honestly: a zero-critical-CVE policy with no exception process just trains teams to bypass the gate.

Question 10

Tell me about the worst production incident you have been part of.

Accepted Answer

Pick an incident where you were materially involved, not adjacent. Cover detection (how long until you knew, and would you have known without a customer report), diagnosis (the wrong hypotheses too, they make the story credible), mitigation versus root-cause fix as separate decisions, and communication during the incident. End with the postmortem: what systemic fix shipped, not just the patch. Telling it without blaming an individual is the signal most panels are explicitly listening for.

Question 11

Walk me through a migration you led: to Kubernetes, to IaC, or between CI systems.

Accepted Answer

Cover why the migration was worth its cost, the incremental path (which workloads moved first and why, usually stateless and low-risk before stateful and critical), how old and new coexisted mid-migration, and how you verified each step before the next. Include the social half: how you brought the teams along, what documentation or tooling reduced their cost to move. Name what went wrong and how you absorbed it; a migration story with zero setbacks reads as either luck or distance from the work.

Question 12

Tell me about a time you reduced on-call pain for your team.

Accepted Answer

Strong answers start with measurement: page volume, what fraction was actionable, which alerts fired most. Then the fixes in order: delete or tune the alerts that never represented user impact, convert threshold noise into SLO-based alerts, automate the runbooks that on-call performed by hand at 3am, and fix the top recurring root cause outright. Give the before-and-after in pages per week and what the team did with the recovered attention. The tell of real experience is talking about alert quality, not alert quantity.
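The measurement step can be sketched as a small report over exported page records. The record shape (`alert` name plus an `actionable` flag filled in during review) is an assumption; paging tools export different fields:

```python
from collections import Counter


def page_report(pages: list, top: int = 3) -> dict:
    """Summarize page volume, actionable fraction, and the noisiest alerts."""
    total = len(pages)
    actionable = sum(1 for page in pages if page["actionable"])
    # Rank only the pages nobody could act on; those are deletion candidates.
    noise = Counter(page["alert"] for page in pages if not page["actionable"])
    return {
        "total": total,
        "actionable_fraction": actionable / total if total else 0.0,
        "noisiest": noise.most_common(top),
    }
```

Running the same report before and after the cleanup gives the pages-per-week comparison the answer calls for.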

Question 13

You join a company where all infrastructure was built by hand in the cloud console. Where do you start?

Accepted Answer

Inventory before changing anything: what exists, what talks to what, where the credentials live, what has backups. Stabilize the riskiest gaps first (single points of failure, missing backups, shared admin credentials), because those kill you before drift does. Then bring infrastructure under code incrementally by importing what exists rather than rebuilding, starting with what changes most often, since that is where drift and mistakes concentrate. Set the rule that new infrastructure ships as code from day one so the hole stops getting deeper. Resist the big-bang rebuild; describe how you would sell the incremental path to leadership.

Question 14

Developers say the deploy process is too slow and have started bypassing it. What do you do?

Accepted Answer

Take the bypass as data: the official path lost on speed, so fix the speed before enforcing the rule. Talk to the teams bypassing it and measure where the time goes, then cut the wait (parallelization, caching, removing approval steps that no longer earn their delay). Close the bypass only once the paved road is genuinely faster, because enforcement without improvement just drives workarounds underground. Mention making the safe path the easy path as the operating principle; panels hiring for platform work listen for exactly that framing.

Question 15

A senior engineer asks for standing admin access to production to debug faster. How do you respond?

Accepted Answer

Acknowledge the need is legitimate even if the request is not: slow debugging is a real cost. Offer the alternatives that serve it: break-glass access that is time-boxed, logged, and alerting; better read paths (centralized logs, traces, profiling) so production shell access is rarely needed at all; and session recording where elevated access is unavoidable. Explain the reasoning rather than citing policy: standing admin credentials are the blast radius in nearly every serious breach postmortem. If the friction is frequent, treat that as a tooling gap to fix, not a queue of exceptions to grant.

Question 16

How do you balance shipping speed against reliability and security?

Accepted Answer

Frame it as explicit budgets rather than vibes: SLOs define how much unreliability the business has agreed to spend. When the error budget is healthy you ship aggressively; when it is burned you slow down and invest in stability. Security controls should be proportional to blast radius: heavy gates on payment and auth paths, lightweight ones on internal tools. Give a real example where you made the tradeoff in each direction, including once where you argued to ship despite risk and what bounded that risk.
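The budget arithmetic is worth being able to do on a whiteboard. This sketch assumes an availability SLO measured over a 30-day window; the policy thresholds (half the budget, zero) are hypothetical and would be agreed with the business in practice:

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given SLO."""
    return (1 - slo) * window_days * 24 * 60


def release_policy(slo: float, downtime_minutes: float, window_days: int = 30) -> str:
    """Pick a shipping posture from the fraction of budget still unspent."""
    budget = error_budget_minutes(slo, window_days)
    remaining_fraction = (budget - downtime_minutes) / budget
    if remaining_fraction > 0.5:
        return "ship"
    if remaining_fraction > 0:
        return "ship with caution"
    return "freeze"
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime, so 10 minutes spent leaves room to ship, while 50 minutes spent means the budget is gone and stability work comes first.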

Question 17

Why infrastructure and DevOps work rather than product engineering?

Accepted Answer

Anchor in what the work uniquely offers: your customers are engineers whose feedback is immediate and brutally honest, the leverage is multiplicative (a pipeline improvement compounds across every team that ships on it), and the problems span the whole stack from kernel to org chart. Be honest about the tradeoff, less direct contact with end users, and name where you still enjoy product work. Avoid "I like servers"; it undersells the systems thinking the role actually runs on.

Question 18

Do you have any questions for me?

Accepted Answer

Infrastructure-specific questions land best: what does the on-call load actually look like in pages per week, who owns reliability when product deadlines conflict with it, what fraction of infrastructure is under code today, how do postmortems work and do action items actually ship, and what is the team empowered to say no to. The answers tell you whether the company funds the platform or just expects it, which is the single biggest predictor of whether the job matches the posting.

DevOps Engineer Interview Questions (2026)

18 questions to prepare
