{"site":{"name":"Koji","description":"AI-native customer research platform that helps teams conduct, analyze, and synthesize customer interviews at scale.","url":"https://www.koji.so","contentTypes":["blog","documentation"],"lastUpdated":"2026-05-14T12:47:25.612Z"},"content":[{"type":"documentation","id":"80a9d77c-44a3-4c1c-b159-aa44f0995055","slug":"formative-vs-summative-research","title":"Formative vs. Summative Research: When to Use Each Method (And Why It Matters)","url":"https://www.koji.so/docs/formative-vs-summative-research","summary":"Formative research shapes a product during design (5-10 participants, qualitative, weekly cadence) and outputs a list of issues. Summative research evaluates a shipped or near-shipped product (20-40+ participants, quantitative, rare cadence) and outputs a benchmark score. Confusing the two is the most common research-budget mistake. The \"5 users\" rule applies only to formative qualitative work — quantitative measurement requires 20-40+. SUS benchmark is mean 68 (SD 12.5) across 500 studies and 5,000 users. Koji runs both modes in the same platform: AI-moderated conversational interviews for formative discovery, structured questions and quality scoring for summative measurement.","content":"The single most expensive mistake in user research is running the wrong type of study at the wrong stage of a project. Teams routinely commission a 40-person summative benchmark to evaluate an unfinished prototype, or a 5-person formative usability test to \"prove\" that a shipped product is performing well. Both fail — not because the methods are bad, but because they were designed for a different question.\n\nFormative research and summative research are the two foundational modes of evaluating any design, product, or service. Understanding the distinction — and knowing which one your current question requires — is the difference between research that drives decisions and research that decorates them.\n\n## The Core Distinction\n\n**Formative research** is research that *forms* the product. It is conducted *during* design and development to identify what is working, what is broken, and what to change next. The output is a list of specific issues and concrete recommendations.\n\n**Summative research** is research that *sums up* the product. It is conducted *after* a design is largely complete to measure overall performance, often against a benchmark or competitor. The output is a number, a comparison, or a verdict.\n\nThe Nielsen Norman Group puts it cleanly: \"Formative evaluations are used in an iterative process to make improvements before production. 
Summative evaluations are used to evaluate a shipped product in comparison to a benchmark.\" Get this framing right and most other decisions — sample size, methodology, metrics — follow naturally.\n\n## Side-by-Side Comparison\n\n| Dimension | Formative | Summative |\n|-----------|-----------|-----------|\n| **When in lifecycle** | Early to mid (during design and iteration) | Late (post-launch or pre-launch benchmark) |\n| **Primary question** | *What's wrong and what should we fix?* | *How well does it perform?* |\n| **Approach** | Mostly qualitative — observation, think-aloud, interviews | Mostly quantitative — task success rates, SUS, time-on-task |\n| **Typical sample size** | 5–10 participants per iteration | 20–40+ participants for statistical confidence |\n| **Output** | A prioritized list of issues + design recommendations | A score, a comparison, a pass/fail verdict |\n| **Frequency** | Often — every sprint or design iteration | Rarely — pre-launch, post-launch, annual benchmark |\n| **Decision it informs** | What to change in the next iteration | Whether to ship, whether you improved, how you compare |\n| **Failure mode if used wrong** | Mistakes a noisy benchmark for a real signal | Spends 40-participant budget on issues 5 users would surface |\n\n## When to Run Formative Research\n\nFormative research is your default mode during active design and development. Use it when:\n\n- A prototype, mock-up, or early build needs to be vetted before you invest more engineering time\n- You are between design iterations and need to know what to change\n- You're running a discovery study where the goal is to surface unknown problems, not measure known ones\n- A specific feature, flow, or copy block is suspected of underperforming and you need to understand *why*\n- You're in continuous discovery — talking to users weekly to keep design decisions grounded\n\nThe canonical formative method is qualitative usability testing with 5 participants. Jakob Nielsen and Tom Landauer's 1993 mathematical model showed that 5 qualitative participants typically uncover around 85% of usability issues in an interface — assuming you run multiple rounds and fix what you find between them. Critically, the value of testing with 5 users *only* holds for qualitative formative work. Quantitative summative measurements need substantially more.\n\n
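To see where the 85% figure comes from, here is the discovery curve implied by the Nielsen-Landauer model, a minimal sketch assuming the cross-project average problem-discovery rate of 0.31 per participant that they reported (real projects vary widely around that average):\n\n```python\n# Nielsen-Landauer model: expected share of usability problems found\n# by n participants, if each participant independently surfaces a\n# given problem with probability L.\ndef proportion_found(n: int, L: float = 0.31) -> float:\n    # L = 0.31 is the cross-project average from Nielsen and Landauer\n    # (1993); treat it as illustrative, not a property of your product.\n    return 1 - (1 - L) ** n\n\nfor n in (1, 3, 5, 10, 15):\n    print(n, round(proportion_found(n), 2))\n# 1 -> 0.31, 3 -> 0.67, 5 -> 0.84, 10 -> 0.98, 15 -> 1.0\n# Five participants land near the familiar 85% mark, and returns\n# diminish fast, which is why three rounds of 5 (fixing issues\n# between rounds) beat one round of 15.\n```\n\nThe model also assumes participants are independent draws from your user population; five people from the same team will surface overlapping problems and undershoot the curve.\n\n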
Other common formative methods:\n- **Think-aloud protocols** — participants narrate their thoughts as they complete tasks\n- **Cognitive walkthroughs** — experts simulate user decision-making at each step\n- **Heuristic evaluation** — design audit against established usability principles\n- **Diary studies** — longitudinal observation during real use\n- **Concept testing** — early feedback on an idea before prototyping\n\nAll share the same underlying goal: figure out what is wrong while changing it is still cheap.\n\n## When to Run Summative Research\n\nSummative research is your evaluation tool. It is more expensive, slower, and more statistically demanding than formative work. Use it when:\n\n- You need a defensible number to share with stakeholders or executives\n- You're comparing a new version against an old one (\"did the redesign actually help?\")\n- You're benchmarking against a competitor or industry standard\n- You're measuring whether a product meets a usability threshold before launch\n- You're running a regulatory or audit-grade evaluation\n\nThe canonical summative instrument is the **System Usability Scale (SUS)** — a 10-item questionnaire developed by John Brooke in 1986. SUS has been validated across more than 500 published studies and over 5,000 participants. The benchmark from that body of work: an average SUS score of 68 (SD 12.5). Scores above 68 are above average; below 68, below average.\n\nSUS requires at least 20–30 participants to produce a statistically reliable score. The Nielsen Norman Group's recent guidance for quantitative user testing is around 40 users, depending on the effect size you're trying to detect. This is the math that makes summative research expensive — and the math that makes it appropriate only when a precise number actually matters.\n\n
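To ground those numbers, here is the standard SUS scoring arithmetic together with a sample-size estimate. The scoring rules are the published ones (odd items contribute the score minus 1, even items contribute 5 minus the score, and the raw sum is multiplied by 2.5); the sample-size line is the generic normal-approximation formula for estimating a mean, applied under the assumption that your population SD matches the 12.5-point benchmark SD:\n\n```python\nimport math\n\ndef sus_score(responses: list[int]) -> float:\n    # responses: the ten 1-5 Likert ratings, item 1 through item 10.\n    # Odd-numbered items are positively worded; even-numbered items\n    # are negatively worded, so their scale is reversed.\n    assert len(responses) == 10 and all(1 <= r <= 5 for r in responses)\n    odd = sum(r - 1 for r in responses[0::2])   # items 1, 3, 5, 7, 9\n    even = sum(5 - r for r in responses[1::2])  # items 2, 4, 6, 8, 10\n    return (odd + even) * 2.5  # rescales the 0-40 raw sum to 0-100\n\ndef respondents_needed(margin: float, sd: float = 12.5) -> int:\n    # Respondents needed to estimate a mean SUS score to within\n    # +/- margin points at 95% confidence (z = 1.96), assuming the\n    # benchmark SD of 12.5. A rough planning number, not a guarantee.\n    return math.ceil((1.96 * sd / margin) ** 2)\n\nprint(sus_score([4, 2, 5, 1, 4, 2, 5, 2, 4, 1]))  # 85.0\nprint(respondents_needed(5.0))   # 25, consistent with the 20-30 guidance\nprint(respondents_needed(2.5))   # 97: precision gets expensive fast\n```\n\nThe last two lines show the cost curve in miniature: each halving of the margin of error roughly quadruples the required sample, which is why summative precision is something you buy deliberately rather than by default.\n\n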
Other common summative methods:\n- **Task success rate measurement** at scale\n- **Time-on-task benchmarking** versus prior version or competitor\n- **A/B tests** comparing two designs in production\n- **NPS, CSAT, CES** for overall product perception (with proper sample sizes)\n- **Large-scale surveys** measuring satisfaction or attitude shifts\n\n## The \"Five Users\" Confusion\n\nNo principle in user research is more misquoted than \"you only need 5 users.\" It is true *for qualitative formative research only*. If your goal is to find usability issues to fix, 5 users per iteration is well-supported by the data. If your goal is to *measure* anything quantitatively — task success rate, time, SUS score, NPS — 5 users will produce numbers with such wide confidence intervals that the result is statistically meaningless.\n\nA 100% task success rate from 5 users has a 95% confidence interval that stretches from roughly 48% to 100%. That is not a benchmark. That is a guess with extra steps.\n\n
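That interval is easy to verify. When every participant succeeds, the exact (Clopper-Pearson) lower bound of the confidence interval has a closed form, so the sketch below needs no statistics library; the 48% figure falls straight out of the binomial math:\n\n```python\n# Exact (Clopper-Pearson) 95% interval for an observed 100% success\n# rate: with x = n successes the upper bound is 1.0, and the lower\n# bound is the p that solves p**n = alpha / 2.\ndef lower_bound_all_successes(n: int, alpha: float = 0.05) -> float:\n    return (alpha / 2) ** (1 / n)\n\nfor n in (5, 10, 20, 40):\n    print(n, round(lower_bound_all_successes(n), 2))\n# 5  -> 0.48  (a flawless run of 5 is consistent with a true rate near 48%)\n# 10 -> 0.69\n# 20 -> 0.83\n# 40 -> 0.91  (only at summative sample sizes does the interval tighten)\n```\n\nFive flawless runs tell you the design is probably not broken; they cannot tell you it performs well.\n\n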
The right mental model: **formative research finds problems, summative research measures them.** Finding needs few people. Measuring needs many.\n\n## How to Sequence Formative and Summative Together\n\nIn a healthy research program, formative and summative work in cycles:\n\n1. **Discovery (formative).** Qualitative interviews, contextual inquiry, opportunity mapping. 5–15 participants.\n2. **Ideation + prototyping.** Designers and PMs translate insight into options.\n3. **Iterative testing (formative).** Round 1 with 5 users → fix → Round 2 with 5 users → fix → Round 3. Repeat until the prototype stabilizes.\n4. **Pre-launch benchmark (summative).** SUS, task success, time-on-task with 30–40 participants. Establishes a baseline you can compare future versions against.\n5. **Post-launch monitoring (summative).** Periodic re-runs of the same instruments to track drift over time.\n6. **Back to formative.** When the benchmark dips or the team plans a major change, return to discovery.\n\nThe failure mode in immature research orgs is skipping straight to step 4 with no formative work — producing a precise score on a design full of issues that 5 users would have surfaced in week one.\n\n## How Koji Supports Both Modes\n\nMost research tools force a choice: a survey platform optimizes for quantitative summative work; a usability testing platform optimizes for qualitative formative work. Running both modes means stitching together separate tools, recruitment funnels, and analysis workflows.\n\nKoji is designed to run both modes in the same platform. For **formative discovery**, Koji's AI moderator runs adaptive, conversational interviews with 5–15 participants, probing for specific issues, surfacing unexpected friction, and producing a thematic analysis automatically. The structured questions can be edited or removed; the AI follows the brief. Methodology presets — *exploratory*, *mom_test*, *jtbd*, *discovery* — pre-configure the interview around classic formative frameworks.\n\nFor **summative measurement**, the same Koji study can scale to 100+ respondents, with [structured questions](/docs/structured-questions-guide) of six types (open_ended, scale, single_choice, multiple_choice, ranking, yes_no) producing the kind of quantitative data SUS-style benchmarks require. Quality scoring (1–5 scale) flags low-effort responses automatically. The same study can produce both a qualitative theme summary *and* a quantitative benchmark, sidestepping the usual tool-fragmentation tax.\n\nThe operational benefit: teams using AI-assisted research report meaningfully faster time-to-insight, because the boundary between \"formative interview round\" and \"summative benchmark wave\" collapses into the same workflow. You stop choosing between depth and scale.\n\n## A Quick Decision Heuristic\n\nBefore commissioning any study, write down the *exact* decision the research will inform:\n\n- *\"Should we ship this redesign or wait?\"* → summative\n- *\"What's making people drop off in onboarding?\"* → formative\n- *\"Did our v2 actually improve over v1?\"* → summative\n- *\"Why does usage spike in week three then crash?\"* → formative\n- *\"How do we compare to our biggest competitor on usability?\"* → summative\n- *\"What jobs are users trying to get done?\"* → formative\n\nIf the decision needs a number, you're looking at summative. If it needs an explanation, you're looking at formative. Match the method to the question and the budget follows.\n\n## Related Resources\n\n- [Structured Questions Guide](/docs/structured-questions-guide) — the six structured question types that make summative measurement possible inside an AI-moderated study\n- [Generative vs Evaluative Research](/docs/generative-vs-evaluative-research) — a related but distinct framing for the same lifecycle question\n- [Research Synthesis Guide](/docs/research-synthesis-guide) — how to convert formative findings into shareable insight\n- [How Many User Interviews](/docs/how-many-user-interviews) — practical guidance on sample size at each stage\n- [Continuous Discovery User Research](/docs/continuous-discovery-user-research) — embedding formative research into a weekly cadence\n- [UX Research Process](/docs/ux-research-process) — the end-to-end research lifecycle\n\n## Sources\n\n- Nielsen Norman Group, *Formative vs. Summative Evaluations*\n- Nielsen Norman Group, *Why 5 Participants Are Okay in a Qualitative Study, but Not in a Quantitative One*\n- Brooke, J. (1996). *SUS — A Quick and Dirty Usability Scale*\n- MeasuringU, *Measuring Usability with the System Usability Scale (SUS)* — 500-study benchmark of mean 68 (SD 12.5)\n- Nielsen, J. & Landauer, T. (1993). *A mathematical model of the finding of usability problems*","category":"Research Methods","lastModified":"2026-05-14T03:18:11.926719+00:00","metaTitle":"Formative vs. Summative Research: When to Use Each (2026 Guide)","metaDescription":"Formative research shapes a product during design with 5-10 participants. Summative research evaluates after launch with 30-40+. Learn when to use each, and why mixing them up wastes research budgets.","keywords":["formative research","summative research","formative vs summative","formative evaluation","summative evaluation","usability testing","SUS score","sample size usability","qualitative research","quantitative research"],"aiSummary":"Formative research shapes a product during design (5-10 participants, qualitative, weekly cadence) and outputs a list of issues. Summative research evaluates a shipped or near-shipped product (20-40+ participants, quantitative, rare cadence) and outputs a benchmark score. Confusing the two is the most common research-budget mistake. The \"5 users\" rule applies only to formative qualitative work — quantitative measurement requires 20-40+. SUS benchmark is mean 68 (SD 12.5) across 500 studies and 5,000 users. Koji runs both modes in the same platform: AI-moderated conversational interviews for formative discovery, structured questions and quality scoring for summative measurement.","aiPrerequisites":["ux-research-process"],"aiLearningOutcomes":["Distinguish formative from summative research by question, sample size, method, and output","Pick the right sample size for the question you're asking (5-10 for formative, 20-40+ for summative)","Sequence formative and summative research into a coherent research program","Recognize when \"5 users\" is the right answer and when it's wildly insufficient","Use SUS and similar instruments correctly for summative benchmarking"],"aiDifficulty":"intermediate","aiEstimatedTime":"12 min read"}],"pagination":{"total":1,"returned":1,"offset":0}}