Q: Your A/B test shows a 15% lift with p < 0.01 after two days. The PM wants to ship. What do you say?

A surprising early win is more often a problem than a victory: peeking inflates false positives, early samples skew toward heavy users, and a 15% lift on a mature product should trigger suspicion, not celebration. Check instrumentation first (a bug in the control arm is the most common cause of dramatic early lifts), confirm the pre-registered sample size, and let the test run to it. Offer the PM the honest framing: two more weeks is cheap insurance against shipping a measurement artifact and rolling it back publicly.

Q: A key business metric dropped 8% week-over-week. Walk me through your investigation.

First fork: real or measurement? Check tracking changes, pipeline failures, and definition changes before anything else, because a large share of metric drops are instrumentation. If real: decompose by segment (platform, geography, new vs returning, acquisition channel) to find whether the drop is broad or concentrated, then check the calendar for seasonality, holidays, a marketing pause, a competitor launch, a recent deploy. State the discipline explicitly: enumerate explanations and eliminate them with data, writing down what you ruled out, rather than chasing the first plausible story.

Q: How would you design the metrics for a new feature before it launches?

Start from the decision the metric must support: what result would make us keep, iterate, or kill this feature? Choose one primary metric tied to the user value (task completion or retained usage, not clicks), guardrails for what must not degrade (overall retention, performance, adjacent-feature usage), and diagnostics for understanding movement. Define the kill threshold before launch, because afterward every stakeholder will negotiate it. Mention novelty effects: plan to measure past the first-week spike.

Q: When can you not run an A/B test, and what do you do instead?

Cannot randomize when: effects spill between units (marketplaces, social features), the change is all-or-nothing (a rebrand, a pricing policy), sample is too small, or ethics forbid it. Alternatives in rough order of credibility: switchback tests for marketplace interference, geo experiments, difference-in-differences with a credible parallel-trends argument, synthetic control, regression discontinuity where a threshold exists. The senior point: name the assumptions each method leans on and how you would check them, and be willing to say "we will have directional evidence, not proof" out loud.

Q: Walk me through how you would build a model to predict customer churn, end to end.

Start where most candidates do not: define churn precisely (what window, what counts as gone) and ask what intervention the prediction feeds, because the model only matters if someone acts on it. Features from behavioral history with a strict time cutoff; name leakage as the thing you actively hunt (any feature computed after the prediction point invalidates everything). Gradient boosting as a strong baseline before anything deeper. Evaluate with temporal splits, not random ones, and report precision at the intervention capacity (the retention team can call 500 accounts, so precision@500). Close the loop: measure the intervention against a holdout, not the model against a test set.

Q: Explain p-values and confidence intervals to a PM who keeps misreading them.

Give the working translation: the p-value asks "if the feature truly did nothing, how surprising is this result?" Small means hard to explain by luck; it is not the probability the feature works. The confidence interval is more useful for decisions: the plausible range of the true effect, where a wide interval crossing zero means "we have not measured this yet," not "there is no effect." Strongest answers add the operational fix: lead readouts with effect sizes and intervals rather than p-values, so the conversation is about magnitude, not a significance ritual.

Q: Your model performs well offline but the online experiment shows no lift. What are the candidate explanations?

The catalog: offline metric does not proxy the business metric (better ranking by clicks, no change in purchases); training-serving skew (features computed differently in production); the model shifts behavior the offline eval cannot see (feedback loops, position effects); the experiment is underpowered for the realistic effect size; or the baseline was already capturing the signal. Diagnostic order: verify serving features match training features first, since skew is the most common and most fixable. The meta-answer that lands: this is why you build offline-online correlation checks before betting a roadmap on offline gains.

Q: How would you evaluate the quality of an LLM-powered feature, like an AI summary in a product?

Layer the evaluation: define failure modes first (hallucination, omission of critical content, tone) and build a rubric; use human labels on a sampled set to calibrate; scale with LLM-as-judge only after validating judge agreement against the human labels, because judges drift and have biases (verbosity, position). In production: behavioral metrics (edits, regenerations, abandonment, downstream task completion) as the real signal, plus an online experiment against the no-AI baseline for business impact. The discipline to emphasize: a fixed eval set rerun on every prompt or model change, because regressions are silent otherwise.

Q: Tell me about an analysis that changed a decision.

Pick a case where the organization did something different because of your work: a launch stopped, a roadmap reordered, a price changed. Structure: what was about to happen, what your analysis showed and how rigorous it was, how you communicated it (the writeup, the meeting, the resistance), and what happened. Influence stories with friction in them (a stakeholder who pushed back, evidence that had to be defended) read as more real than frictionless ones.

Q: Describe a time your analysis or model was wrong. What happened?

Pick a real error with consequences: a leakage bug that inflated a model, a test misread that shipped a dud, a forecast that missed badly. Cover how the error was discovered, what you did immediately (correcting the record matters more than the error), the root cause in your process, and the specific check you added so the class of error cannot recur. The strongest versions show the correction was public and the process change stuck.

Question 1

Your A/B test shows a 15% lift with p < 0.01 after two days. The PM wants to ship. What do you say?

Accepted Answer

A surprising early win is more often a problem than a victory: peeking inflates false positives, early samples skew toward heavy users, and a 15% lift on a mature product should trigger suspicion, not celebration. Check instrumentation first (a bug in the control arm is the most common cause of dramatic early lifts), confirm the pre-registered sample size, and let the test run to it. Offer the PM the honest framing: two more weeks is cheap insurance against shipping a measurement artifact and rolling it back publicly.

Question 2

A key business metric dropped 8% week-over-week. Walk me through your investigation.

Accepted Answer

First fork: real or measurement? Check tracking changes, pipeline failures, and definition changes before anything else, because a large share of metric drops are instrumentation. If real: decompose by segment (platform, geography, new vs returning, acquisition channel) to find whether the drop is broad or concentrated, then check the calendar for seasonality, holidays, a marketing pause, a competitor launch, a recent deploy. State the discipline explicitly: enumerate explanations and eliminate them with data, writing down what you ruled out, rather than chasing the first plausible story.

Question 3

How would you design the metrics for a new feature before it launches?

Accepted Answer

Start from the decision the metric must support: what result would make us keep, iterate, or kill this feature? Choose one primary metric tied to the user value (task completion or retained usage, not clicks), guardrails for what must not degrade (overall retention, performance, adjacent-feature usage), and diagnostics for understanding movement. Define the kill threshold before launch, because afterward every stakeholder will negotiate it. Mention novelty effects: plan to measure past the first-week spike.

Question 4

When can you not run an A/B test, and what do you do instead?

Accepted Answer

Cannot randomize when: effects spill between units (marketplaces, social features), the change is all-or-nothing (a rebrand, a pricing policy), sample is too small, or ethics forbid it. Alternatives in rough order of credibility: switchback tests for marketplace interference, geo experiments, difference-in-differences with a credible parallel-trends argument, synthetic control, regression discontinuity where a threshold exists. The senior point: name the assumptions each method leans on and how you would check them, and be willing to say "we will have directional evidence, not proof" out loud.

Question 5

Walk me through how you would build a model to predict customer churn, end to end.

Accepted Answer

Start where most candidates do not: define churn precisely (what window, what counts as gone) and ask what intervention the prediction feeds, because the model only matters if someone acts on it. Features from behavioral history with a strict time cutoff; name leakage as the thing you actively hunt (any feature computed after the prediction point invalidates everything). Gradient boosting as a strong baseline before anything deeper. Evaluate with temporal splits, not random ones, and report precision at the intervention capacity (the retention team can call 500 accounts, so precision@500). Close the loop: measure the intervention against a holdout, not the model against a test set.

Question 6

Explain p-values and confidence intervals to a PM who keeps misreading them.

Accepted Answer

Give the working translation: the p-value asks "if the feature truly did nothing, how surprising is this result?" Small means hard to explain by luck; it is not the probability the feature works. The confidence interval is more useful for decisions: the plausible range of the true effect, where a wide interval crossing zero means "we have not measured this yet," not "there is no effect." Strongest answers add the operational fix: lead readouts with effect sizes and intervals rather than p-values, so the conversation is about magnitude, not a significance ritual.

Question 7

Your model performs well offline but the online experiment shows no lift. What are the candidate explanations?

Accepted Answer

The catalog: offline metric does not proxy the business metric (better ranking by clicks, no change in purchases); training-serving skew (features computed differently in production); the model shifts behavior the offline eval cannot see (feedback loops, position effects); the experiment is underpowered for the realistic effect size; or the baseline was already capturing the signal. Diagnostic order: verify serving features match training features first, since skew is the most common and most fixable. The meta-answer that lands: this is why you build offline-online correlation checks before betting a roadmap on offline gains.

Question 8

How would you evaluate the quality of an LLM-powered feature, like an AI summary in a product?

Accepted Answer

Layer the evaluation: define failure modes first (hallucination, omission of critical content, tone) and build a rubric; use human labels on a sampled set to calibrate; scale with LLM-as-judge only after validating judge agreement against the human labels, because judges drift and have biases (verbosity, position). In production: behavioral metrics (edits, regenerations, abandonment, downstream task completion) as the real signal, plus an online experiment against the no-AI baseline for business impact. The discipline to emphasize: a fixed eval set rerun on every prompt or model change, because regressions are silent otherwise.

Question 9

Tell me about an analysis that changed a decision.

Accepted Answer

Pick a case where the organization did something different because of your work: a launch stopped, a roadmap reordered, a price changed. Structure: what was about to happen, what your analysis showed and how rigorous it was, how you communicated it (the writeup, the meeting, the resistance), and what happened. Influence stories with friction in them (a stakeholder who pushed back, evidence that had to be defended) read as more real than frictionless ones.

Question 10

Describe a time your analysis or model was wrong. What happened?

Accepted Answer

Pick a real error with consequences: a leakage bug that inflated a model, a test misread that shipped a dud, a forecast that missed badly. Cover how the error was discovered, what you did immediately (correcting the record matters more than the error), the root cause in your process, and the specific check you added so the class of error cannot recur. The strongest versions show the correction was public and the process change stuck.

Question 11

Walk me through the most technically complex thing you have shipped.

Accepted Answer

Choose work where the difficulty was inherent (interference in the experiment design, a ranking problem with feedback loops, a forecasting system with regime changes) and explain why the simpler approaches failed first; that ordering proves the complexity was earned, not performed. Cover the design, the validation, and how it behaved in production over time. End with what you would simplify now, which signals taste rather than attachment.

Question 12

A VP asks you for analysis to support a decision they have clearly already made. What do you do?

Accepted Answer

Run the honest analysis. If it supports the decision, fine. If not, present what the data shows directly to the VP first, framed as the analysis rather than a confrontation, with uncertainty stated plainly, and let them decide with accurate information. Decision rights are theirs; the integrity of the numbers is yours, and you do not sign analyses that say what they did not find. Practical notes that strengthen the answer: get the question in writing up front, and pre-agree what evidence would support versus undercut the decision before running anything.

Question 13

Two stakeholders want the same metric defined two different ways, and both definitions are defensible. How do you resolve it?

Accepted Answer

Surface the divergence precisely: compute both definitions on the same period and show where and why they differ; abstract debates collapse fast when the actual gap is on screen. Then drive to the decision each definition serves: often one fits company-level reporting and the other a team's operational view, and the fix is two named metrics rather than one winner. Encode the outcome in the metrics layer with documentation and lineage. The anti-pattern: letting both ship under one name and becoming the referee forever.

Question 14

You have two weeks before a launch decision and the rigorous analysis would take six. What do you deliver?

Accepted Answer

Triage by what changes the decision: identify the one or two questions whose answers could flip it, and put the two weeks there. Use the fast tools honestly, meaning historical data, a scrappy holdout, and bounds analysis ("even in the best case, this cannot pay back before Q4"), and label confidence levels explicitly in the readout. Crucially, state what the six-week version would add and recommend whether the decision should wait for it. Delivering a confident-sounding answer at two-week rigor without the caveats is the failure mode being screened.

Question 15

How do you decide when an analysis is good enough to ship versus needing more rigor?

Accepted Answer

Scale rigor to the decision's stakes and reversibility: a reversible product tweak gets a quick read; anything irreversible, expensive, or politically loaded gets the full treatment: sensitivity checks, a second pair of eyes, pre-registered criteria. Name your personal floor that never flexes (data-quality checks, looking at the raw distributions, stating uncertainty) and give one example of deliberately stopping early and one of deliberately going deeper, with the reasoning for each.

Question 16

How do you handle a stakeholder who keeps asking for analyses and ignoring the results?

Accepted Answer

Diagnose before escalating: ask what decision each request feeds. Often the honest answer reshapes or kills the request, and "what would you do differently if the answer is X vs Y?" is the highest-leverage question in the field. If results genuinely keep being ignored, raise it directly with the stakeholder once, framed as wanting the work to be useful, then triage their requests accordingly and reallocate effort to teams that act on evidence. Protecting the team's time is part of the job; martyrdom is not.

Question 17

Do you have any questions for me?

Accepted Answer

High-signal questions: what decision did the data science team change last quarter; what happens when an experiment contradicts a leader's plan, and has that occurred; who owns experiment review and metric definitions; what is the state of the data you would work on; and where does DS report. The answers separate companies where data science informs decisions from companies where it decorates them.

Data Scientist Interview Questions (2026)

17 questions to prepare

Behavioral (3)