Question 1

Design an API rate limiter for a multi-tenant service.

Accepted Answer

Clarify the requirements first: per-tenant limits, burst tolerance, latency budget. Token bucket is the usual default — cheap, burst-friendly, easy to implement on Redis with atomic operations. Cover the distributed enforcement question (centralized Redis vs local buckets with sync) and the failure mode: if the limiter store is down, fail open or closed depending on what the API protects. Finish with the contract: 429s with Retry-After and rate-limit headers so clients can behave well.
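A minimal sketch of the token bucket described above, assuming a single process and in-memory state. The names TokenBucket and TenantLimiter are hypothetical, and the caller passes the clock value so the behavior is easy to follow. A Redis-backed version would need to do the refill and the decrement as one atomic operation, which this sketch does not show.

```python
import math


class TokenBucket:
    """Holds up to `capacity` tokens and refills at `refill_per_sec`."""

    def __init__(self, capacity, refill_per_sec, now=0.0):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.tokens = float(capacity)   # start full so bursts are allowed
        self.updated = now

    def allow(self, now, cost=1):
        """Return (allowed, retry_after_seconds)."""
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True, 0
        # Seconds until enough tokens have accumulated, rounded up.
        return False, math.ceil((cost - self.tokens) / self.refill_per_sec)


class TenantLimiter:
    """One bucket per tenant (hypothetical in-memory store)."""

    def __init__(self, capacity, refill_per_sec):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.buckets = {}

    def check(self, tenant_id, now):
        """Return (status_code, headers) for one request."""
        bucket = self.buckets.get(tenant_id)
        if bucket is None:
            bucket = TokenBucket(self.capacity, self.refill_per_sec, now)
            self.buckets[tenant_id] = bucket
        allowed, retry_after = bucket.allow(now)
        if allowed:
            return 200, {}
        return 429, {"Retry-After": str(retry_after)}
```

With a capacity of 2 and a refill of one token per second, a tenant can burst two requests, gets a 429 on the third, and is allowed again a second later. Other tenants are unaffected because each has its own bucket.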

Question 2

How would you design a system to process a few million webhook events per day reliably?

Accepted Answer

Accept fast and durably: validate the signature, write to a queue, return 200. Never do real work in the receiving request. Consumers process with retries and exponential backoff, a dead-letter queue for poison messages, and idempotent handling because delivery will be at-least-once. Address ordering honestly: offer per-key ordering via partitioning if it is required, and avoid promising global ordering. Mention monitoring queue depth and consumer lag as the primary health signals.
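The receive and consume steps above can be sketched as follows. This is an illustration under assumptions, not a production design: the in-memory deque, list, and set stand in for a durable queue, a dead-letter queue, and a dedup store; SECRET, MAX_ATTEMPTS, and the event "id" field are hypothetical; and the code is single-threaded, so the dedup check is not protected against concurrent consumers.

```python
import hashlib
import hmac
import json
from collections import deque

SECRET = b"demo-secret"   # hypothetical shared signing secret
MAX_ATTEMPTS = 3          # hypothetical retry limit
queue = deque()           # stand-in for a durable queue
dead_letters = []         # stand-in for a dead-letter queue
processed_ids = set()     # stand-in for a durable dedup store


def sign(body: bytes) -> str:
    return hmac.new(SECRET, body, hashlib.sha256).hexdigest()


def receive(body: bytes, signature: str) -> int:
    """Receiving request: verify, enqueue, acknowledge. No real work here."""
    if not hmac.compare_digest(sign(body).encode(), signature.encode()):
        return 401
    queue.append({"body": body, "attempts": 0})
    return 200


def backoff_seconds(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Exponential backoff: base * 2^attempt, capped at `cap`."""
    return min(cap, base * 2 ** attempt)


def consume_one(handler) -> None:
    """Process one message; requeue on failure, dead-letter after MAX_ATTEMPTS."""
    msg = queue.popleft()
    event = json.loads(msg["body"])
    if event["id"] in processed_ids:   # duplicate delivery: already handled
        return
    try:
        handler(event)
        processed_ids.add(event["id"])
    except Exception:
        msg["attempts"] += 1
        if msg["attempts"] >= MAX_ATTEMPTS:
            dead_letters.append(msg)
        else:
            # A real consumer would wait backoff_seconds(msg["attempts"]) first.
            queue.append(msg)
```

The same event delivered twice is enqueued twice but handled once, and a handler that always fails ends up in dead_letters after three attempts instead of blocking the queue.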

Question 3

A critical endpoint's p99 latency tripled overnight. Walk me through your investigation.

Accepted Answer

First question: what changed (deploys, traffic shape, data volume, dependency behavior)? Then bisect with traces: is the time in the app, the database, or a downstream call? p99-specific causes deserve early attention: GC pauses, connection-pool exhaustion, a slow query triggered by specific inputs, one bad node behind the balancer. State what you would do if users are impacted now (roll back the suspect deploy, shed load) versus the full diagnosis. Mentioning that p99 moved while p50 did not, and what that implies (a subset of requests got slow, not every request), is a strong signal.
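As a small illustration of the p50 versus p99 point and the "one bad node" cause, the sketch below computes nearest-rank percentiles overall and per node. The function names and the (node, latency_ms) sample shape are assumptions made for the example; real data would come from your tracing or metrics system.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    ordered = sorted(values)
    rank = max(1, (pct * len(ordered) + 99) // 100)   # integer ceil(pct * n / 100)
    return ordered[rank - 1]


def tail_by_node(samples):
    """samples: iterable of (node, latency_ms). Returns {node: (p50, p99)}."""
    by_node = {}
    for node, latency_ms in samples:
        by_node.setdefault(node, []).append(latency_ms)
    return {
        node: (percentile(values, 50), percentile(values, 99))
        for node, values in by_node.items()
    }
```

If one node serves a slow tail while the others are healthy, the overall p50 stays flat, the overall p99 jumps, and the per-node breakdown shows which node owns the tail.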

Question 4

When would you choose a relational database versus a document store versus a key-value store?

Accepted Answer

Start from access patterns and consistency needs, not the data's shape. Relational as the default: transactions, joins, constraints, and decades of operational tooling; most applications never outgrow it. Document stores when the schema genuinely varies per record and you read whole documents by key. Key-value when access is purely by key at high volume: sessions, caches, counters. Name the costs you accept when leaving relational: limited or more expensive cross-document transactions, application-enforced integrity, and harder ad-hoc queries.

Question 5

How do you run a schema migration on a large, live table without downtime?

Accepted Answer

The expand-and-contract pattern: add the new column or table without constraints that lock, dual-write from the application, backfill in batches checkpointed by ID range with throttling, verify counts and spot-check consistency, switch reads, then remove the old path in a later release. Mention the engine-specific traps you know: lock behavior of ALTER TABLE on your database, index builds, replication lag from backfill writes. The phased shape of the answer is what the interviewer is listening for.
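The backfill step can be sketched like this, using SQLite only so the example is self-contained. The users table, the name and display_name columns, and the backfill function are hypothetical; on a production engine the batch size, pause, and locking behavior would need tuning for that database.

```python
import sqlite3
import time


def backfill(conn, batch_size=1000, pause_seconds=0.0, start_after=0):
    """Copy name into display_name in ID-ordered batches.

    Returns the last ID processed, so a restarted run can pass it
    back in as start_after.
    """
    checkpoint = start_after
    while True:
        rows = conn.execute(
            "SELECT id FROM users WHERE id > ? ORDER BY id LIMIT ?",
            (checkpoint, batch_size),
        ).fetchall()
        if not rows:
            return checkpoint
        upper = rows[-1][0]
        with conn:  # one short transaction per batch
            conn.execute(
                "UPDATE users SET display_name = name "
                "WHERE id > ? AND id <= ? AND display_name IS NULL",
                (checkpoint, upper),
            )
        checkpoint = upper           # persist this durably in real use
        time.sleep(pause_seconds)    # throttle to limit load and replication lag
```

The display_name IS NULL guard keeps the backfill from overwriting rows that dual-writes have already populated, and the verification step afterward is a count of rows where the two columns disagree.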

Question 6

Explain idempotency and how you would implement it for a payment endpoint.

Accepted Answer

Clients send an idempotency key per logical operation; the server stores the key with the operation result and returns the stored result on replays instead of re-executing. Implementation details that show depth: the key check and the side effect must be atomic (unique constraint, not check-then-act), keys need a TTL policy, and an in-flight replay should block or return a conflict rather than racing. Note that idempotency at the API edge does not absolve internal consumers — every retried boundary needs its own story.
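A minimal sketch of the unique-constraint approach, assuming the side effect lives in the same database as the key so both can share one transaction. The table names and the charge function are hypothetical, SQLite is used only to keep the example runnable, and key expiry is left out. A real payment usually calls an external processor that cannot join the transaction, which is where an explicit in-flight state and recovery logic come in.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT)")
conn.execute("CREATE TABLE charges (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
conn.commit()


def charge(key, amount_cents):
    """Execute the charge once per key; replays return the stored result."""
    try:
        with conn:  # key claim and charge commit or roll back together
            # The primary key makes the claim atomic: a replay fails here.
            conn.execute("INSERT INTO idempotency_keys (key) VALUES (?)", (key,))
            cur = conn.execute(
                "INSERT INTO charges (amount_cents) VALUES (?)", (amount_cents,)
            )
            result = {"charge_id": cur.lastrowid, "amount_cents": amount_cents}
            conn.execute(
                "UPDATE idempotency_keys SET response = ? WHERE key = ?",
                (json.dumps(result), key),
            )
        return result
    except sqlite3.IntegrityError:
        row = conn.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?", (key,)
        ).fetchone()
        return json.loads(row[0])
```

Calling charge twice with the same key creates one row in charges and returns the same result both times; there is no separate "does this key exist?" check that a concurrent request could slip past.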

Question 7

How would you add caching to a read-heavy API, and what can go wrong?

Accepted Answer

Cache-aside with TTLs as the default, keyed carefully (tenant, locale, version). The failure catalog matters most: stale reads after writes (choose TTL tolerance or explicit invalidation, and invalidation is famously hard), thundering herd on expiry (request coalescing, jittered TTLs), the cache becoming load-bearing (if hit rate drops, can the database survive? size for that), and cached errors. State the discipline: cache only what has a defined staleness tolerance, and measure hit rate from day one.
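The sketch below shows cache-aside with jittered TTLs, per-key request coalescing, and hit-rate counters. It assumes a single process with an in-memory dictionary; the CacheAside class, its parameters, and the loader callback are hypothetical names. A shared cache such as Redis would need a different coalescing mechanism.

```python
import random
import threading
import time


class CacheAside:
    def __init__(self, loader, ttl=60.0, jitter=0.1, clock=time.monotonic):
        self.loader = loader      # fetches from the source of truth on a miss
        self.ttl = ttl
        self.jitter = jitter      # fraction of ttl to randomize, e.g. 0.1 = +/-10%
        self.clock = clock
        self.entries = {}         # key -> (value, expires_at)
        self.locks = {}           # key -> lock used to coalesce loads
        self.guard = threading.Lock()
        self.hits = 0
        self.misses = 0

    def _fresh(self, key):
        entry = self.entries.get(key)
        if entry is not None and entry[1] > self.clock():
            return entry
        return None

    def get(self, key):
        entry = self._fresh(key)
        if entry is None:
            with self.guard:
                lock = self.locks.setdefault(key, threading.Lock())
            with lock:  # only one caller per key runs the loader
                entry = self._fresh(key)   # another caller may have filled it
                if entry is None:
                    self.misses += 1
                    value = self.loader(key)
                    ttl = self.ttl * (1 + random.uniform(-self.jitter, self.jitter))
                    self.entries[key] = (value, self.clock() + ttl)
                    return value
        self.hits += 1
        return entry[0]

    def invalidate(self, key):
        """Explicit invalidation after a write."""
        self.entries.pop(key, None)
```

Keys would typically be tuples such as (tenant, locale, version), and hits / (hits + misses) gives the hit rate to track from the start.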

Question 8

What is the difference between optimistic and pessimistic locking, and when have you needed each?

Accepted Answer

Pessimistic takes the lock before the work — safe under heavy contention but adds latency and deadlock risk. Optimistic does the work and checks a version at commit, retrying on conflict — better throughput when conflicts are rare. The judgment: optimistic for typical user-facing CRUD where two writers on one row is unusual; pessimistic (or better, redesign) for hot rows like inventory counters or ledger balances. Strongest answers mention redesigning away from contention entirely — append-only ledgers, reservation patterns — rather than just picking a lock type.
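A minimal sketch of the optimistic side, using a version column and a conditional UPDATE. The documents table and the function names are hypothetical, and SQLite is used only so the example runs on its own; the same pattern applies to any engine that reports affected row counts.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE documents (id INTEGER PRIMARY KEY, title TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO documents (id, title, version) VALUES (1, 'draft', 1)")
conn.commit()


class Conflict(Exception):
    """Raised when the row changed after the caller read it."""


def read(doc_id):
    return conn.execute(
        "SELECT title, version FROM documents WHERE id = ?", (doc_id,)
    ).fetchone()


def write(doc_id, new_title, expected_version):
    """Update only if the row is still at the version the caller read."""
    with conn:
        cur = conn.execute(
            "UPDATE documents SET title = ?, version = version + 1 "
            "WHERE id = ? AND version = ?",
            (new_title, doc_id, expected_version),
        )
    if cur.rowcount != 1:
        raise Conflict(f"document {doc_id} is no longer at version {expected_version}")


def rename_with_retry(doc_id, make_title, max_attempts=3):
    """Re-read and retry on conflict, up to max_attempts."""
    for _ in range(max_attempts):
        title, version = read(doc_id)
        try:
            write(doc_id, make_title(title), version)
            return
        except Conflict:
            continue
    raise Conflict("gave up after repeated conflicts")
```

If two writers both read version 1, the first write succeeds and bumps the version to 2; the second matches zero rows and raises Conflict rather than silently overwriting. On a hot row the retry loop would spin often, which is the case where the answer above recommends pessimistic locking or a redesign.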

Question 9

How do you version a public API without breaking existing clients?

Accepted Answer

Additive changes by default — new optional fields and endpoints are free if clients are tolerant readers. For breaking changes: a versioning scheme (URL or header, pick one and be consistent), both versions running in parallel, usage telemetry per version per client so you know who has not migrated, proactive communication with deadlines, and only then removal. The honest point: the versioning mechanism is the easy part; the migration of reluctant clients is the actual work, and telemetry plus deadlines are what make it finish.

Question 10

Walk me through the most serious production incident you have been part of.

Accepted Answer

Structure it as: impact and detection, the mitigation path (including wrong turns — they add credibility), root cause, and what changed afterward. Be precise about your own role versus the team's. The strongest endings are systemic: the alert that now catches it earlier, the failure mode designed out, the runbook that did not exist before. Avoid stories where the lesson is "we were more careful after."

Question 11

Tell me about a time you made a significant architecture decision. How did you decide?

Accepted Answer

Pick a decision with a real alternative — queue vs synchronous call, split a service vs keep it monolithic, build vs adopt. Name the options, the evaluation criteria (operational load, team skills, reversibility, cost), and why the winner won. Crucially: what you predicted would be the cost of the choice, and what the cost actually turned out to be. A decision story with hindsight calibration reads as senior; one without reads as advocacy.

Question 12

Describe the database performance problem you are proudest of solving.

Accepted Answer

Walk the diagnostic chain: the symptom, how you isolated the query (slow-query log, pg_stat_statements or equivalent), what the plan showed (sequential scan, bad row estimate, lock waits), the fix, and the measured result. Fixes that show range: a covering or partial index, rewriting the query shape, denormalizing one read path, or fixing statistics. Mention what you did to keep it fixed — a plan-regression check, a dashboard, a budget.

Question 13

A downstream service your API depends on becomes slow, and your service starts timing out too. What do you do?

Accepted Answer

Immediate: confirm the dependency is the cause via traces, then protect your own service — tighten the timeout, apply a circuit breaker so requests fail fast instead of holding connections, and degrade gracefully (serve cached or partial data, queue the writes) if the product allows. Communicate to the dependency's owners and your own consumers. Afterward: make the protections permanent, because dependencies will be slow again. Candidates who let their service exhaust its own thread pool waiting politely have not operated under load.
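The fail-fast and graceful-degradation behavior can be sketched with a small circuit breaker. This is a single-threaded illustration under assumptions: the CircuitBreaker class, its thresholds, and the fallback callback are hypothetical, and a production breaker would also limit how many trial calls run while half-open.

```python
import time


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after   # seconds to stay open before a trial call
        self.clock = clock
        self.failures = 0
        self.opened_at = None            # None means the circuit is closed

    def is_open(self):
        return (
            self.opened_at is not None
            and self.clock() - self.opened_at < self.reset_after
        )

    def call(self, fn, fallback):
        if self.is_open():
            return fallback()            # fail fast: the dependency is not touched
        try:
            result = fn()                # closed, or a trial call after reset_after
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open, or reopen after a failed trial
            return fallback()
        self.failures = 0                # success closes the circuit
        self.opened_at = None
        return result
```

Here fn would be the downstream call wrapped in a tight timeout, and fallback would serve cached or partial data. Once the breaker is open, requests return immediately instead of holding threads and connections while waiting on the slow dependency.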

Question 14

Product asks for a feature that needs strongly consistent reads across two services that each own their data. How do you respond?

Accepted Answer

First interrogate the requirement: what does the user actually experience if the read is a second stale? Most "must be consistent" asks dissolve under that question. If real, present the options with costs: synchronous cross-service reads (couples availability), moving the data into one ownership boundary (the honest fix if the boundary was wrong), or an event-driven read model with bounded staleness. The skill being tested is naming the costs in product language, not the CAP theorem recital.

Question 15

You inherit a service with no tests, no docs, and an on-call rotation you join next week. What do you do first?

Accepted Answer

Work from the operational surface inward: dashboards and alerts first (what does healthy look like?), then the runbook (write one as you learn, since it does not exist), then trace the two or three highest-traffic request paths through the code. Read recent incidents and the last few months of merged PRs to find the bodies. First safety investments: a smoke test on the critical path and a verified rollback procedure — those make every later change survivable.

Question 16

How do you decide when code is good enough to ship versus needing more rigor?

Accepted Answer

Name the gradient explicitly: reversible, low-traffic changes ship with standard review and tests; anything touching money, data integrity, or auth gets design review, extra test coverage, staged rollout, and a rollback plan; schema and contract changes get the full migration treatment. Give one example of each end — something you shipped fast deliberately, and something you slowed down deliberately — and what told you which was which.

Question 17

Tell me about a time you disagreed with a teammate about a technical approach.

Accepted Answer

Pick a disagreement with technical substance. Describe how you made the disagreement concrete — a spike, a benchmark, a one-page writeup of the tradeoffs — rather than repeating positions in meetings. Say how it resolved, and if the call went against you, what committing looked like in practice. Bonus credibility: a case where the other person was right and what you took from it.

Question 18

Do you have any questions for me?

Accepted Answer

High-signal questions: what does the on-call rotation actually look like in pages per week; what is the deploy frequency and rollback story; how do design decisions get made and written down; what is the oldest piece of the system and what is the plan for it. The last one is revealing — every team has one, and whether they have a plan tells you how the team relates to its own debt.

Backend Engineer Interview Questions (2026)

18 questions to prepare
