Question 1

Design a pipeline that ingests events from an external API into the warehouse, reliably.

Accepted Answer

Cover incremental extraction (cursor or updated-at watermark, with overlap to catch late updates), landing raw data immutably before transformation (so reprocessing is always possible), idempotent loads (merge on a natural key, never blind append), and orchestration with retries, alerting, and a dead-letter path for malformed records. Address schema drift explicitly: tolerate additive change, alert on breaking change. Close with the operational story — what page fires when the API is down, and how a missed day backfills without manual surgery.

Question 2

Write a SQL query to deduplicate a table keeping the latest record per key, then make it incremental.

Accepted Answer

ROW_NUMBER() OVER (PARTITION BY key ORDER BY updated_at DESC) filtered to 1 is the canonical dedup. For the incremental version: process only records newer than the last high-water mark, merge into the target on the key, and handle the subtlety that a late-arriving older record must not overwrite a newer one already present (compare timestamps in the merge condition). Mentioning how you would test it — duplicate keys, late arrivals, ties on the timestamp — separates practitioners from syntax-knowers.

Question 3

How would you design the schema for a warehouse serving both BI dashboards and data scientists?

Accepted Answer

A layered architecture: raw/staging (source-shaped, immutable), an integration layer with conformed entities, and consumer marts — dimensional models (facts and dimensions) for BI, wider denormalized tables for data-science feature work. Cover grain as the first modeling decision for any fact table, slowly-changing dimensions where history matters, and the governance question: shared definitions for core metrics so the dashboard and the model do not disagree about what "active user" means. dbt as the implementation vehicle with tests and documented lineage.

Question 4

Batch versus streaming: how do you decide, and what does streaming actually cost?

Accepted Answer

Start from the consumer: what decision changes if the data is minutes fresh instead of hours? Most analytics use cases do not clear that bar, and micro-batch covers much of what remains. Streaming earns its cost for operational use cases — fraud, dispatch, live personalization. The costs to name: operational complexity (consumer lag, rebalancing, state management), harder correctness (exactly-once semantics, late and out-of-order events, windowing), harder testing and backfills, and an always-on infrastructure bill. Strong answers include a case where you chose batch and were right.

Question 5

A dashboard shows numbers that are wrong, and the CEO saw it. Walk me through your investigation.

Accepted Answer

First, triage: is it wrong now or was it always wrong, and how wrong — direction-changing or cosmetic? Communicate early: flag the dashboard as under investigation so decisions stop being made on it. Then walk the lineage backward from the metric: the BI layer definition, the mart model, the transformations, the source — diffing against a known-good date at each hop to bisect where correct became incorrect. Common culprits: a silent upstream schema change, a join that went one-to-many, timezone or late-data window bugs, a backfill collision. End with the prevention: the test or monitor that would have caught it, added as part of the fix.

Question 6

How do you handle late-arriving and out-of-order data in a pipeline that feeds daily metrics?

Accepted Answer

Distinguish event time from processing time explicitly, and define the policy: a lookback window that reprocesses recent partitions on every run (cheap, common), watermarks with a defined lateness tolerance in streaming, and a restatement policy for data later than the window. The consumer-facing half matters: metrics for recent days must be labeled provisional until the window closes, or finance will reconcile against numbers that still move. Naming that conversation — agreeing on the staleness/stability tradeoff with stakeholders — is the senior version of the answer.

Question 7

Your warehouse bill doubled this quarter. How do you find out why and what do you do?

Accepted Answer

Attribute first: query history and warehouse metering grouped by user, role, workload, and table to find where the growth actually is — usually a handful of culprits, not uniform growth. Common findings: a new scheduled job scanning a huge table unpartitioned, a BI tool refreshing dashboards hourly that nobody opens, a cross join someone shipped, clustering gone stale. Fixes follow the finding: partition pruning, materialization of expensive shared subqueries, refresh-frequency sanity, warehouse rightsizing and auto-suspend. Then the durable part: per-team cost attribution and budgets so the next doubling gets caught at 20%.

Question 8

What are data contracts, and how would you introduce them with a producer team that keeps breaking your pipelines?

Accepted Answer

Technical shape: explicit schemas in a registry, CI checks in the producer's repo that fail on breaking changes, versioning and deprecation windows for intentional change. The introduction is political: start with the producer team causing the most pain, bring incident receipts (hours lost, dashboards broken), and offer the trade — the contract costs them a CI check and buys them freedom from being paged when analytics breaks. Start advisory, then enforce once trust exists. Answers that jump straight to tooling without the negotiation read as not having done it.

Question 9

Walk me through the most complex pipeline you have built. What made it hard?

Accepted Answer

Choose a pipeline where the difficulty was real: volume, source unreliability, correctness requirements, or a gnarly migration — and name what specifically made it hard rather than narrating the happy path. Cover the design choices, the failure handling (late data, duplicates, partial loads), and the operational outcome over time: incident rate, SLA adherence, what broke anyway and how you hardened it. Self-awareness about what you over- or under-engineered reads as senior.

Question 10

Tell me about a data-quality incident you owned end-to-end.

Accepted Answer

Structure: how it was detected (a monitor, or worse, a human — say which honestly), how you scoped the damage (which tables, dashboards, models, and decisions consumed the bad data), the communication (who you told and when — early notification is the trust-preserving move), the fix and backfill, and the prevention that outlived the incident. The strongest closers quantify the detection improvement: "this class of breakage now alerts within 15 minutes instead of being found by an analyst days later."

Question 11

Describe a backfill or migration you ran over a large dataset. How did you make it safe?

Accepted Answer

Cover the mechanics: batched processing checkpointed by partition or ID range (resumable after failure), idempotent writes (re-running a batch is safe), rate limiting so the backfill does not starve production workloads, and dual-write or parallel-run strategies for migrations. Verification is the differentiator: row counts, checksums, or automated output diffing against the old system before cutover, not spot checks. Include the rollback story — what you would have done at 60% complete if the diff had failed.

Question 12

An executive asks why they should fund a data-quality initiative when dashboards "mostly work." How do you make the case?

Accepted Answer

Anchor in incidents, not principles: the hours analysts spent re-validating numbers last quarter, the decision made on a wrong dashboard, the ML model that trained on broken data. Convert to cost and risk, then propose a scoped start — monitors and tests on the ten tables the business actually runs on, not a platform rewrite — with a measurable target (incident count, time-to-detection). Offering the incremental version is what makes the answer credible; asking for a quality moonshot is what loses the room.

Question 13

Two teams define "active customer" differently and both metrics are in production. You are asked to fix it. What do you do?

Accepted Answer

First, document both definitions precisely and find where they diverge (trial users? a 30- vs 28-day window?) — often the teams have never seen the diff stated plainly. Convene the owners with the diff and the downstream consumers of each metric; usually one definition serves most uses and the other is a legitimately different metric that needs a different name. Encode the outcome: one certified definition in the semantic/metrics layer with lineage, the variant renamed, both documented. The durable fix is the process — new core metrics get review before they ship to two dashboards with one name.

Question 14

A data scientist needs a new feature pipeline for a model launching in two weeks, and your roadmap is full. How do you handle it?

Accepted Answer

Scope before answering: what does minimal viable look like — often a daily batch feature table covers a launch that was framed as needing streaming. Then negotiate visibly: present the tradeoff to whoever owns the roadmap rather than absorbing it silently, and time-box an interim solution with an explicit hardening follow-up so "temporary" does not become load-bearing. The anti-patterns being screened: heroically building it nights-and-weekends (unsustainable, invisible cost) and "file a ticket for next quarter" (the platform team everyone routes around).

Question 15

How do you decide what to test and monitor in a pipeline, given you cannot check everything?

Accepted Answer

Weight by blast radius and detectability: the tables feeding executive dashboards, financial reporting, and ML training get the strictest coverage; silent failure modes (a join quietly going one-to-many, volume drifting down) get monitors because no one will report them. Standard layers: schema and uniqueness tests at transformation, freshness and volume monitors at the platform level, anomaly detection on the business-critical metrics. Name the discipline of pruning noisy alerts — a monitor nobody trusts is worse than no monitor.

Question 16

How do you work with analysts and data scientists who consume your pipelines?

Accepted Answer

Describe the practices: SLAs stated in consumer terms (data ready by 7am, not "the job runs at 6"), proactive communication when something will be late or restated, documentation and lineage they can self-serve, and regular contact with the heaviest consumers so you learn what they are building before it breaks on you. A concrete story helps — a time consumer feedback changed a design, or a time you found out something you built was being misused and fixed the interface rather than the user.

Question 17

Do you have any questions for me?

Accepted Answer

High-signal questions: how do consumers discover and trust data today; what was the last serious data incident and what changed after it; how much of the team's week is toil (manual reruns, ad-hoc requests) versus building; who owns the warehouse bill; and where does the data team report — engineering, product, or finance — since that placement shapes everything about the role.

Data Engineer Interview Questions (2026)

17 questions to prepare

Behavioral (3)