Formative vs. Summative Research: When to Use Each Method (And Why It Matters)

Formative research shapes a product while it's still being built. Summative research evaluates how it performs after it ships. Confusing the two is the most common reason research budgets get wasted on the wrong question at the wrong time.

The single most expensive mistake in user research is running the wrong type of study at the wrong stage of a project. Teams routinely commission a 40-person summative benchmark to evaluate an unfinished prototype, or a 5-person formative usability test to "prove" that a shipped product is performing well. Both fail — not because the methods are bad, but because they were designed for a different question.

Formative research and summative research are the two foundational modes of evaluating any design, product, or service. Understanding the distinction — and knowing which one your current question requires — is the difference between research that drives decisions and research that decorates them.

The Core Distinction

Formative research is research that forms the product. It is conducted during design and development to identify what is working, what is broken, and what to change next. The output is a list of specific issues and concrete recommendations.

Summative research is research that sums up the product. It is conducted after a design is largely complete to measure overall performance, often against a benchmark or competitor. The output is a number, a comparison, or a verdict.

The Nielsen Norman Group puts it cleanly: "Formative evaluations are used in an iterative process to make improvements before production. Summative evaluations are used to evaluate a shipped product in comparison to a benchmark." Get this framing right and most other decisions — sample size, methodology, metrics — follow naturally.

Side-by-Side Comparison

| Dimension | Formative | Summative |
| --- | --- | --- |
| When in lifecycle | Early to mid (during design and iteration) | Late (post-launch or pre-launch benchmark) |
| Primary question | What's wrong and what should we fix? | How well does it perform? |
| Approach | Mostly qualitative: observation, think-aloud, interviews | Mostly quantitative: task success rates, SUS, time-on-task |
| Typical sample size | 5–10 participants per iteration | 20–40+ participants for statistical confidence |
| Output | A prioritized list of issues + design recommendations | A score, a comparison, a pass/fail verdict |
| Frequency | Often: every sprint or design iteration | Rarely: pre-launch, post-launch, annual benchmark |
| Decision it informs | What to change in the next iteration | Whether to ship, whether you improved, how you compare |
| Failure mode if used wrong | Treats noisy small-sample numbers as a real benchmark | Spends a 40-participant budget on issues 5 users would surface |

When to Run Formative Research

Formative research is your default mode during active design and development. Use it when:

  • A prototype, mock-up, or early build needs to be vetted before you invest more engineering time
  • You are between design iterations and need to know what to change
  • You're running a discovery study where the goal is to surface unknown problems, not measure known ones
  • A specific feature, flow, or copy block is suspected of underperforming and you need to understand why
  • You're in continuous discovery — talking to users weekly to keep design decisions grounded

The canonical formative method is qualitative usability testing with 5 participants. Jakob Nielsen and Tom Landauer's 1993 mathematical model showed that 5 qualitative participants typically uncover around 85% of usability issues in an interface — assuming you run multiple rounds and fix what you find between them. Critically, the value of testing with 5 users only holds for qualitative formative work. Quantitative summative measurements need substantially more.
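The 85% figure falls out of a simple cumulative-discovery formula. A quick sketch, assuming the average per-user detection rate of about 31% that Nielsen and Landauer reported (the actual rate varies by interface):

```python
# Nielsen & Landauer (1993): share of usability problems found by n users,
# where L is the probability that a single user surfaces any given problem.
# L = 0.31 is the average detection rate reported in their study.
def problems_found(n, L=0.31):
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 15):
    print(f"{n:>2} users -> {problems_found(n):.0%} of problems")
# 5 users lands at ~84%, matching the often-quoted "around 85%"
```

Note how steeply the curve flattens: going from 5 to 15 users triples the cost for the last few percent, which is exactly why multiple small rounds beat one large one for formative work.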

Other common formative methods:

  • Think-aloud protocols — participants narrate their thoughts as they complete tasks
  • Cognitive walkthroughs — experts simulate user decision-making at each step
  • Heuristic evaluation — design audit against established usability principles
  • Diary studies — longitudinal observation during real use
  • Concept testing — early feedback on an idea before prototyping

All share the same underlying goal: figure out what is wrong while changing it is still cheap.

When to Run Summative Research

Summative research is your evaluation tool. Compared with formative work it is slower, more expensive, and more statistically rigorous. Use it when:

  • You need a defensible number to share with stakeholders or executives
  • You're comparing a new version against an old one ("did the redesign actually help?")
  • You're benchmarking against a competitor or industry standard
  • You're measuring whether a product meets a usability threshold before launch
  • You're running a regulatory or audit-grade evaluation

The canonical summative instrument is the System Usability Scale (SUS) — a 10-item questionnaire developed by John Brooke in 1986. SUS has been validated across more than 500 published studies and over 5,000 participants. The benchmark from that body of work: an average SUS score of 68 (SD 12.5). Scores above 68 are above average; below 68, below average.
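SUS scoring is mechanical but easy to get wrong by hand. A short sketch of the standard scoring procedure from Brooke's questionnaire, in which odd-numbered items are positively worded and even-numbered items negatively worded:

```python
# Standard SUS scoring (Brooke): responses are 1-5 Likert values for the
# 10 items. Odd items contribute (response - 1), even items (5 - response),
# so both wordings point the same direction before scaling.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly 10 items"
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # scales the 0-40 raw total onto 0-100

# Example: a fairly positive respondent
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # prints 77.5
```

A score like 77.5 sits above the 68-point mean from the published benchmark; the scale is not a percentage, so scores only mean anything relative to that distribution.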

SUS requires at least 20–30 participants to produce a statistically reliable score. The Nielsen Norman Group's guidance for quantitative usability studies is around 40 users, depending on the effect size you need to detect. This is the math that makes summative research expensive — and the math that makes it appropriate only when a precise number actually matters.
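Those sample sizes follow from ordinary margin-of-error arithmetic. A back-of-envelope sketch, assuming the published SD of 12.5 and a normal approximation:

```python
import math

# Rough sample size needed to estimate a mean SUS score within a given
# 95% CI half-width, using the published SD of 12.5 and a z of 1.96.
def n_for_margin(half_width, sd=12.5, z=1.96):
    return math.ceil((z * sd / half_width) ** 2)

print(n_for_margin(5))  # +/-5 SUS points -> 25 participants
print(n_for_margin(3))  # +/-3 SUS points -> 67 participants
```

Halving the error bar roughly quadruples the sample, which is why "just add a few more users" rarely rescues an underpowered benchmark.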

Other common summative methods:

  • Task success rate measurement at scale
  • Time-on-task benchmarking versus prior version or competitor
  • A/B tests comparing two designs in production
  • NPS, CSAT, CES for overall product perception (with proper sample sizes)
  • Large-scale surveys measuring satisfaction or attitude shifts

The "Five Users" Confusion

No principle in user research is more misquoted than "you only need 5 users." It is true for qualitative formative research only. If your goal is to find usability issues to fix, 5 users per iteration is well-supported by the data. If your goal is to measure anything quantitatively — task success rate, time, SUS score, NPS — 5 users will produce numbers with such wide confidence intervals that the result is statistically meaningless.

A 100% task success rate from 5 users has a 95% confidence interval that stretches from roughly 48% to 100%. That is not a benchmark. That is a guess with extra steps.
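The interval quoted above is the exact (Clopper–Pearson) binomial confidence interval, which has a closed-form lower bound when every trial succeeds. A quick check:

```python
# Exact (Clopper-Pearson) 95% CI for 5 successes out of 5 trials.
# When every trial succeeds, the lower bound has a closed form:
# lower = (alpha / 2) ** (1 / n), upper = 1.0
n, alpha = 5, 0.05
lower = (alpha / 2) ** (1 / n)
print(f"95% CI for 5/5 successes: [{lower:.0%}, 100%]")
# prints: 95% CI for 5/5 successes: [48%, 100%]
```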

The right mental model: formative research finds problems, summative research measures them. Finding needs few people. Measuring needs many.

How to Sequence Formative and Summative Together

In a healthy research program, formative and summative work in cycles:

  1. Discovery (formative). Qualitative interviews, contextual inquiry, opportunity mapping. 5–15 participants.
  2. Ideation + prototyping. Designers and PMs translate insight into options.
  3. Iterative testing (formative). Round 1 with 5 users → fix → Round 2 with 5 users → fix → Round 3. Repeat until the prototype stabilizes.
  4. Pre-launch benchmark (summative). SUS, task success, time-on-task with 30–40 participants. Establishes a baseline you can compare future versions against.
  5. Post-launch monitoring (summative). Periodic re-runs of the same instruments to track drift over time.
  6. Back to formative. When the benchmark dips or the team plans a major change, return to discovery.

The failure mode in immature research orgs is skipping straight to step 4 with no formative work — producing a precise score on a design full of issues that 5 users would have surfaced in week one.

How Koji Supports Both Modes

Most research tools force a choice: a survey platform optimizes for quantitative summative work; a usability testing platform optimizes for qualitative formative work. Running both modes means stitching together separate tools, recruitment funnels, and analysis workflows.

Koji is designed to run both modes in the same platform. For formative discovery, Koji's AI moderator runs adaptive, conversational interviews with 5–15 participants, probing for specific issues, surfacing unexpected friction, and producing a thematic analysis automatically. The structured questions can be edited or removed; the AI follows the brief. Methodology presets — exploratory, mom_test, jtbd, discovery — pre-configure the interview around classic formative frameworks.

For summative measurement, the same Koji study can scale to 100+ respondents, with structured questions of six types (open_ended, scale, single_choice, multiple_choice, ranking, yes_no) producing the kind of quantitative data SUS-style benchmarks require. Quality scoring (1–5 scale) flags low-effort responses automatically. The same study can produce both a qualitative theme summary and a quantitative benchmark, sidestepping the usual tool-fragmentation tax.
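As a purely hypothetical illustration of how the two modes share one study definition, a sketch follows; the field names below are invented for the example and are not Koji's actual API — only the preset and question-type names come from the description above:

```python
# Hypothetical sketch of a mixed-mode study definition. The structure and
# field names here are illustrative inventions, not Koji's real API; only
# the preset name and question types are taken from the product description.
study = {
    "preset": "jtbd",                # formative framing for the AI moderator
    "formative": {"target_n": 10},   # small-n adaptive interviews
    "summative": {
        "target_n": 100,             # same study scaled for benchmarking
        "questions": [
            {"type": "scale", "text": "How easy was the task? (1-5)"},
            {"type": "yes_no", "text": "Did you complete the task?"},
            {"type": "open_ended", "text": "What, if anything, got in your way?"},
        ],
    },
}
```

The point of the sketch is the shape, not the syntax: one artifact carries both the small-n qualitative round and the large-n quantitative wave.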

The operational benefit: teams using AI-assisted research report meaningfully faster time-to-insight, because the boundary between "formative interview round" and "summative benchmark wave" collapses into the same workflow. You stop choosing between depth and scale.

A Quick Decision Heuristic

Before commissioning any study, write down the exact decision the research will inform:

  • "Should we ship this redesign or wait?" → summative
  • "What's making people drop off in onboarding?" → formative
  • "Did our v2 actually improve over v1?" → summative
  • "Why does usage spike in week three then crash?" → formative
  • "How do we compare to our biggest competitor on usability?" → summative
  • "What jobs are users trying to get done?" → formative

If the decision needs a number, you're looking at summative. If it needs an explanation, you're looking at formative. Match the method to the question and the budget follows.

Sources

  • Nielsen Norman Group, Formative vs. Summative Evaluations
  • Nielsen Norman Group, Why 5 Participants Are Okay in a Qualitative Study, but Not in a Quantitative One
  • Brooke, J. (1986). SUS — A Quick and Dirty Usability Scale
  • MeasuringU, Measuring Usability with the System Usability Scale (SUS) — 500-study benchmark of mean 68 (SD 12.5)
  • Nielsen, J. & Landauer, T. (1993). A mathematical model of the finding of usability problems