Formative vs. Summative Research: When to Use Each Method (And Why It Matters)

Formative research shapes a product while it's still being built. Summative research evaluates how it performs after it ships. Confusing the two is the most common reason research budgets get wasted on the wrong question at the wrong time.

The single most expensive mistake in user research is running the wrong type of study at the wrong stage of a project. Teams routinely commission a 40-person summative benchmark to evaluate an unfinished prototype, or a 5-person formative usability test to "prove" that a shipped product is performing well. Both fail — not because the methods are bad, but because they were designed for a different question.

Formative research and summative research are the two foundational modes of evaluating any design, product, or service. Understanding the distinction — and knowing which one your current question requires — is the difference between research that drives decisions and research that decorates them.

The Core Distinction

Formative research is research that forms the product. It is conducted during design and development to identify what is working, what is broken, and what to change next. The output is a list of specific issues and concrete recommendations.

Summative research is research that sums up the product. It is conducted after a design is largely complete to measure overall performance, often against a benchmark or competitor. The output is a number, a comparison, or a verdict.

The Nielsen Norman Group puts it cleanly: "Formative evaluations are used in an iterative process to make improvements before production. Summative evaluations are used to evaluate a shipped product in comparison to a benchmark." Get this framing right and most other decisions — sample size, methodology, metrics — follow naturally.

Side-by-Side Comparison

| Dimension | Formative | Summative |
| --- | --- | --- |
| When in lifecycle | Early to mid (during design and iteration) | Late (post-launch or pre-launch benchmark) |
| Primary question | What's wrong and what should we fix? | How well does it perform? |
| Approach | Mostly qualitative: observation, think-aloud, interviews | Mostly quantitative: task success rates, SUS, time-on-task |
| Typical sample size | 5–10 participants per iteration | 20–40+ participants for statistical confidence |
| Output | A prioritized list of issues + design recommendations | A score, a comparison, a pass/fail verdict |
| Frequency | Often: every sprint or design iteration | Rarely: pre-launch, post-launch, annual benchmark |
| Decision it informs | What to change in the next iteration | Whether to ship, whether you improved, how you compare |
| Failure mode if used wrong | Treats noisy small-sample numbers as a real benchmark | Spends a 40-participant budget on issues 5 users would surface |

When to Run Formative Research

Formative research is your default mode during active design and development. Use it when:

  • A prototype, mock-up, or early build needs to be vetted before you invest more engineering time
  • You are between design iterations and need to know what to change
  • You're running a discovery study where the goal is to surface unknown problems, not measure known ones
  • A specific feature, flow, or copy block is suspected of underperforming and you need to understand why
  • You're in continuous discovery — talking to users weekly to keep design decisions grounded

The canonical formative method is qualitative usability testing with 5 participants. Jakob Nielsen and Tom Landauer's 1993 mathematical model showed that 5 qualitative participants typically uncover around 85% of usability issues in an interface — assuming you run multiple rounds and fix what you find between them. Critically, the value of testing with 5 users only holds for qualitative formative work. Quantitative summative measurements need substantially more.
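The 85% figure falls out of a simple cumulative-discovery formula. A quick sketch, assuming the average per-user detection rate of about 31% that Nielsen and Landauer reported (the actual rate varies by interface):

```python
# Nielsen & Landauer (1993): share of usability problems found by n users,
# where L is the probability that a single user surfaces any given problem.
# L = 0.31 is the average detection rate reported in their study.
def problems_found(n, L=0.31):
    return 1 - (1 - L) ** n

for n in (1, 3, 5, 15):
    print(f"{n:>2} users -> {problems_found(n):.0%} of problems")
# 5 users lands at ~84%, matching the often-quoted "around 85%"
```

Note how steeply the curve flattens: going from 5 to 15 users triples the cost for the last few percent, which is exactly why multiple small rounds beat one large one for formative work.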

Other common formative methods:

  • Think-aloud protocols — participants narrate their thoughts as they complete tasks
  • Cognitive walkthroughs — experts simulate user decision-making at each step
  • Heuristic evaluation — design audit against established usability principles
  • Diary studies — longitudinal observation during real use
  • Concept testing — early feedback on an idea before prototyping

All share the same underlying goal: figure out what is wrong while changing it is still cheap.

When to Run Summative Research

Summative research is your evaluation tool. Compared with formative work it is slower, more expensive, and more statistically rigorous. Use it when:

  • You need a defensible number to share with stakeholders or executives
  • You're comparing a new version against an old one ("did the redesign actually help?")
  • You're benchmarking against a competitor or industry standard
  • You're measuring whether a product meets a usability threshold before launch
  • You're running a regulatory or audit-grade evaluation

The canonical summative instrument is the System Usability Scale (SUS) — a 10-item questionnaire developed by John Brooke in 1986. SUS has been validated across more than 500 published studies and over 5,000 participants. The benchmark from that body of work: an average SUS score of 68 (SD 12.5). Scores above 68 are above average; below 68, below average.
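SUS scoring is mechanical but easy to get wrong by hand. A short sketch of the standard scoring procedure from Brooke's questionnaire, in which odd-numbered items are positively worded and even-numbered items negatively worded:

```python
# Standard SUS scoring (Brooke): responses are 1-5 Likert values for the
# 10 items. Odd items contribute (response - 1), even items (5 - response),
# so both wordings point the same direction before scaling.
def sus_score(responses):
    assert len(responses) == 10, "SUS has exactly 10 items"
    total = 0
    for i, r in enumerate(responses, start=1):
        total += (r - 1) if i % 2 == 1 else (5 - r)
    return total * 2.5  # scales the 0-40 raw total onto 0-100

# Example: a fairly positive respondent
print(sus_score([4, 2, 4, 1, 5, 2, 4, 2, 4, 3]))  # prints 77.5
```

A score like 77.5 sits above the 68-point mean from the published benchmark; the scale is not a percentage, so scores only mean anything relative to that distribution.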

SUS requires at least 20–30 participants to produce a statistically reliable score. The Nielsen Norman Group's guidance for quantitative usability studies is around 40 users, depending on the effect size you need to detect. This is the math that makes summative research expensive — and the math that makes it appropriate only when a precise number actually matters.
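Those sample sizes follow from ordinary margin-of-error arithmetic. A back-of-envelope sketch, assuming the published SD of 12.5 and a normal approximation:

```python
import math

# Rough sample size needed to estimate a mean SUS score within a given
# 95% CI half-width, using the published SD of 12.5 and a z of 1.96.
def n_for_margin(half_width, sd=12.5, z=1.96):
    return math.ceil((z * sd / half_width) ** 2)

print(n_for_margin(5))  # +/-5 SUS points -> 25 participants
print(n_for_margin(3))  # +/-3 SUS points -> 67 participants
```

Halving the error bar roughly quadruples the sample, which is why "just add a few more users" rarely rescues an underpowered benchmark.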

Other common summative methods:

  • Task success rate measurement at scale
  • Time-on-task benchmarking versus prior version or competitor
  • A/B tests comparing two designs in production
  • NPS, CSAT, CES for overall product perception (with proper sample sizes)
  • Large-scale surveys measuring satisfaction or attitude shifts

The "Five Users" Confusion

No principle in user research is more misquoted than "you only need 5 users." It is true for qualitative formative research only. If your goal is to find usability issues to fix, 5 users per iteration is well-supported by the data. If your goal is to measure anything quantitatively — task success rate, time, SUS score, NPS — 5 users will produce numbers with such wide confidence intervals that the result is statistically meaningless.

A 100% task success rate from 5 users has a 95% confidence interval that stretches from roughly 48% to 100%. That is not a benchmark. That is a guess with extra steps.
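The interval quoted above is the exact (Clopper–Pearson) binomial confidence interval, which has a closed-form lower bound when every trial succeeds. A quick check:

```python
# Exact (Clopper-Pearson) 95% CI for 5 successes out of 5 trials.
# When every trial succeeds, the lower bound has a closed form:
# lower = (alpha / 2) ** (1 / n), upper = 1.0
n, alpha = 5, 0.05
lower = (alpha / 2) ** (1 / n)
print(f"95% CI for 5/5 successes: [{lower:.0%}, 100%]")
# prints: 95% CI for 5/5 successes: [48%, 100%]
```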

The right mental model: formative research finds problems, summative research measures them. Finding needs few people. Measuring needs many.

How to Sequence Formative and Summative Together

In a healthy research program, formative and summative work in cycles:

  1. Discovery (formative). Qualitative interviews, contextual inquiry, opportunity mapping. 5–15 participants.
  2. Ideation + prototyping. Designers and PMs translate insight into options.
  3. Iterative testing (formative). Round 1 with 5 users → fix → Round 2 with 5 users → fix → Round 3. Repeat until the prototype stabilizes.
  4. Pre-launch benchmark (summative). SUS, task success, time-on-task with 30–40 participants. Establishes a baseline you can compare future versions against.
  5. Post-launch monitoring (summative). Periodic re-runs of the same instruments to track drift over time.
  6. Back to formative. When the benchmark dips or the team plans a major change, return to discovery.

The failure mode in immature research orgs is skipping straight to step 4 with no formative work — producing a precise score on a design full of issues that 5 users would have surfaced in week one.

How Koji Supports Both Modes

Most research tools force a choice: a survey platform optimizes for quantitative summative work; a usability testing platform optimizes for qualitative formative work. Running both modes means stitching together separate tools, recruitment funnels, and analysis workflows.

Koji is designed to run both modes in the same platform. For formative discovery, Koji's AI moderator runs adaptive, conversational interviews with 5–15 participants, probing for specific issues, surfacing unexpected friction, and producing a thematic analysis automatically. The structured questions can be edited or removed; the AI follows the brief. Methodology presets — exploratory, mom_test, jtbd, discovery — pre-configure the interview around classic formative frameworks.

For summative measurement, the same Koji study can scale to 100+ respondents, with structured questions of six types (open_ended, scale, single_choice, multiple_choice, ranking, yes_no) producing the kind of quantitative data SUS-style benchmarks require. Quality scoring (1–5 scale) flags low-effort responses automatically. The same study can produce both a qualitative theme summary and a quantitative benchmark, sidestepping the usual tool-fragmentation tax.
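As a purely hypothetical illustration of how the two modes share one study definition, a sketch follows; the field names below are invented for the example and are not Koji's actual API — only the preset and question-type names come from the description above:

```python
# Hypothetical sketch of a mixed-mode study definition. The structure and
# field names here are illustrative inventions, not Koji's real API; only
# the preset name and question types are taken from the product description.
study = {
    "preset": "jtbd",                # formative framing for the AI moderator
    "formative": {"target_n": 10},   # small-n adaptive interviews
    "summative": {
        "target_n": 100,             # same study scaled for benchmarking
        "questions": [
            {"type": "scale", "text": "How easy was the task? (1-5)"},
            {"type": "yes_no", "text": "Did you complete the task?"},
            {"type": "open_ended", "text": "What, if anything, got in your way?"},
        ],
    },
}
```

The point of the sketch is the shape, not the syntax: one artifact carries both the small-n qualitative round and the large-n quantitative wave.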

The operational benefit: teams using AI-assisted research report meaningfully faster time-to-insight, because the boundary between "formative interview round" and "summative benchmark wave" collapses into the same workflow. You stop choosing between depth and scale.

A Quick Decision Heuristic

Before commissioning any study, write down the exact decision the research will inform:

  • "Should we ship this redesign or wait?" → summative
  • "What's making people drop off in onboarding?" → formative
  • "Did our v2 actually improve over v1?" → summative
  • "Why does usage spike in week three then crash?" → formative
  • "How do we compare to our biggest competitor on usability?" → summative
  • "What jobs are users trying to get done?" → formative

If the decision needs a number, you're looking at summative. If it needs an explanation, you're looking at formative. Match the method to the question and the budget follows.

Sources

  • Nielsen Norman Group, Formative vs. Summative Evaluations
  • Nielsen Norman Group, Why 5 Participants Are Okay in a Qualitative Study, but Not in a Quantitative One
  • Brooke, J. (1986). SUS — A Quick and Dirty Usability Scale
  • MeasuringU, Measuring Usability with the System Usability Scale (SUS) — 500-study benchmark of mean 68 (SD 12.5)
  • Nielsen, J. & Landauer, T. (1993). A mathematical model of the finding of usability problems