What Is A/B Testing? A Practical Guide for Analysts

A/B testing compares a control experience with a changed experience so teams can estimate causal impact instead of relying on before-and-after guesses.

The basic idea

In an A/B test, users are randomly assigned to a control group or a treatment group. Because assignment is random, the groups are comparable on average, so a difference in outcomes can be attributed to the treatment with much more confidence than a simple before-and-after comparison.

For a data analyst, the goal is not just to calculate a p-value. The goal is to help the team decide whether a product, marketing, pricing, or onboarding change should be launched, rolled back, or tested again.

A/B testing versus before-and-after analysis

Before-and-after analysis compares performance before a change with performance after it. That is easy to run, but risky. Seasonality, campaigns, competitor actions, product outages, holidays, and traffic mix can all shift over the same period, and their effects cannot be separated from the effect of the change itself.

A/B testing compares groups during the same time window. Both groups are exposed to the same external conditions, and random assignment balances user differences between them on average, which makes the result more useful for causal decision-making.
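
One common way to implement random assignment is to hash a stable identifier, so the same user always sees the same version. The sketch below is illustrative only: the function name assign_group, the experiment name, and the 50/50 default split are assumptions, not part of any specific experimentation platform.

```python
import hashlib


def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to 'control' or 'treatment'.

    Hypothetical helper: hashing the experiment name together with the
    user id keeps assignments stable within one experiment and
    independent across experiments.
    """
    key = f"{experiment}:{user_id}".encode("utf-8")
    digest = hashlib.sha256(key).hexdigest()
    # Interpret the first 32 bits of the hash as a number in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    return "treatment" if bucket < treatment_share else "control"


# Example: count assignments for 10,000 made-up user ids.
counts = {"control": 0, "treatment": 0}
for i in range(10_000):
    counts[assign_group(f"user-{i}", "reward_page_v2")] += 1
print(counts)
```

Hashing rather than calling a random number generator on each visit means a returning user stays in the same group, which matters when the randomization unit is the user rather than the session.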

The experiment vocabulary

  • Control: the existing version or baseline experience.
  • Treatment: the changed version being tested.
  • Randomization unit: the entity assigned to a group, usually user, account, session, or device.
  • Primary metric: the main metric used for the decision.
  • Guardrail metric: a metric that should not get worse, such as refund rate, latency, complaints, or churn.
  • Minimum detectable effect: the smallest lift the test is sized to detect reliably, usually the smallest lift that would change the launch decision.

What analysts must define before launch

Every experiment needs a hypothesis, target users, split method, primary metric, guardrail metrics, expected effect size, sample size, and decision rule. These should be written before the experiment starts, because changing the rules after seeing early data makes the result much less trustworthy.

Hypothesis: clearer reward messaging increases deposit conversion.
Control: existing reward page.
Treatment: new reward page with clearer tiers.
Primary metric: deposit conversion rate.
Guardrail metric: complaint rate.
Decision rule: launch if conversion improves and guardrails are stable.
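
One way to keep a plan like this from drifting after launch is to store it as an immutable record. The following is a minimal sketch, assuming a hypothetical ExperimentPlan class whose fields simply mirror the example plan above; it is not a standard library or platform API.

```python
from dataclasses import dataclass, FrozenInstanceError
from typing import Tuple


@dataclass(frozen=True)
class ExperimentPlan:
    """Hypothetical record of an experiment plan, written before launch."""
    hypothesis: str
    control: str
    treatment: str
    primary_metric: str
    guardrail_metrics: Tuple[str, ...]
    decision_rule: str


plan = ExperimentPlan(
    hypothesis="Clearer reward messaging increases deposit conversion.",
    control="Existing reward page.",
    treatment="New reward page with clearer tiers.",
    primary_metric="deposit conversion rate",
    guardrail_metrics=("complaint rate",),
    decision_rule="Launch if conversion improves and guardrails are stable.",
)

# frozen=True makes accidental edits fail loudly instead of silently
# changing the rules after data has started to arrive.
try:
    plan.primary_metric = "revenue per visitor"
except FrozenInstanceError:
    print("Plan is locked; write a new plan instead of editing this one.")
```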

Common metrics

Primary metrics include conversion rate, activation rate, purchase rate, retention, revenue per visitor, or average order value. Guardrail metrics include refund rate, complaint rate, latency, cancellation rate, churn, or long-term retention.

Do not choose too many primary metrics. If every metric is primary, no metric is primary. Pick one main decision metric, then use secondary metrics to explain where the movement came from.

Sample size matters

If the sample size is too small, the test has low power and may fail to detect a real improvement. If the experiment stops as soon as a result looks good, the false-positive rate rises and the measured effect tends to be overstated. Practical sample-size planning uses baseline conversion rate, expected effect size (the minimum detectable effect), significance level, and power.
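
The four planning inputs can be combined with the standard normal-approximation formula for comparing two proportions. The sketch below assumes a two-sided test, equal group sizes, and an effect expressed as a relative lift over the baseline; the function name and the default values (5% significance, 80% power) are illustrative choices, not requirements.

```python
import math
from statistics import NormalDist


def sample_size_per_group(
    baseline_rate: float,
    relative_lift: float,
    alpha: float = 0.05,
    power: float = 0.80,
) -> int:
    """Approximate users needed in EACH group for a two-proportion test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    if not (0 < p1 < 1 and 0 < p2 < 1) or p1 == p2:
        raise ValueError("rates must be in (0, 1) and the lift must be non-zero")

    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)           # quantile for the target power

    # Sum of the two binomial variances (unpooled form).
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


# Example inputs (made up): 10% baseline, 10% relative lift worth detecting.
print(sample_size_per_group(baseline_rate=0.10, relative_lift=0.10))
```

With these example inputs the result is on the order of 15,000 users per group. Halving the lift roughly quadruples the requirement, which is why a realistic minimum detectable effect matters more than any other input.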

When not to run a test

Do not run an A/B test when traffic is too low, the change is legally required, the experience cannot be randomized, or the metric will take too long to observe. In those cases, use a phased rollout, cohort analysis, pre/post analysis with caveats, or qualitative research.

Common mistakes

  • Changing traffic allocation mid-test without documenting it.
  • Stopping the test early because one day looks promising.
  • Using sessions as the randomization unit when the same user can land in both groups across sessions.
  • Ignoring guardrail metrics after the primary metric improves.
  • Reporting relative lift without showing sample size and baseline rate.
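
To avoid the last mistake, a result summary can report the baseline rate and sample sizes alongside the lift. The sketch below is one possible shape for such a summary, assuming a conversion-style metric and a two-sided two-proportion z-test; the function name, field names, and example counts are made up for illustration.

```python
import math
from statistics import NormalDist


def summarize_test(
    control_conversions: int,
    control_users: int,
    treatment_conversions: int,
    treatment_users: int,
    alpha: float = 0.05,
) -> dict:
    """Summarize a conversion test with the context a reader needs."""
    if control_users <= 0 or treatment_users <= 0 or control_conversions <= 0:
        raise ValueError("need users in both groups and a non-zero baseline")

    p_c = control_conversions / control_users
    p_t = treatment_conversions / treatment_users
    abs_lift = p_t - p_c

    # Pooled standard error for the z-test of "no difference".
    pooled = (control_conversions + treatment_conversions) / (control_users + treatment_users)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / treatment_users))
    z = abs_lift / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval on the difference.
    se = math.sqrt(p_c * (1 - p_c) / control_users + p_t * (1 - p_t) / treatment_users)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)

    return {
        "control_users": control_users,
        "treatment_users": treatment_users,
        "baseline_rate": p_c,
        "treatment_rate": p_t,
        "absolute_lift": abs_lift,
        "relative_lift": abs_lift / p_c,
        "ci_absolute_lift": (abs_lift - z_crit * se, abs_lift + z_crit * se),
        "p_value": p_value,
    }


# Example with made-up counts.
for name, value in summarize_test(1_000, 10_000, 1_100, 10_000).items():
    print(f"{name}: {value}")
```

Showing the absolute lift and its interval next to the relative lift makes it obvious when a large percentage change rests on a small baseline or a small sample.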

Read next

After learning the basics, use the A/B test design template to plan an experiment and the analysis report guide to summarize results.