Usability Metrics: Task Success Rate, Time on Task, and Error Rate Explained

The complete guide to the core usability metrics — task success rate, time on task, and error rate — including industry benchmarks, formulas, sample sizes, and how to capture them automatically with AI-moderated research.

What are the core usability metrics?

The three core usability metrics are task success rate (effectiveness), time on task (efficiency), and error rate (accuracy). Together they answer the central questions of a usability test: Can users finish the job? How long does it take them? And how many mistakes do they make along the way? Most other quantitative task-level measures, such as completion confidence, lostness, and task-level satisfaction, refine or complement these three.

The Nielsen Norman Group calls success rate "the simplest usability metric" precisely because it is the bottom line of usability: if users cannot complete what they came to do, nothing else about the interface matters. The strength of these three metrics is that they are objective and behavioral. Unlike attitudinal scores such as NPS or SUS, they record what users actually did, not what they later said they felt.

This guide defines each metric, gives you the formulas and the published industry benchmarks, explains how many participants you need, and shows how an AI-native platform like Koji captures all three automatically — turning a multi-day analysis grind into a real-time dashboard.

Metric 1: Task success rate (effectiveness)

Task success rate is the percentage of participants who complete a task successfully out of everyone who attempted it.

Task success rate = (number of successful attempts ÷ total attempts) × 100

If 17 of 20 participants successfully add an item to their cart and reach checkout, your success rate is 85%.
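The formula is simple enough to script. Here is a minimal sketch in Python; the function name `task_success_rate` is illustrative and not part of any particular tool.

```python
def task_success_rate(successes: int, attempts: int) -> float:
    """Return the percentage of attempts that ended in success."""
    if attempts <= 0:
        raise ValueError("attempts must be positive")
    if not 0 <= successes <= attempts:
        raise ValueError("successes must be between 0 and attempts")
    return successes / attempts * 100


# The cart example from the text: 17 of 20 participants succeeded.
rate = task_success_rate(17, 20)
print(f"{rate:.0f}%")  # prints 85%
```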

The benchmark: In an analysis of 1,189 tasks across 115 usability studies, MeasuringU founder Jeff Sauro found the average task completion rate is 78%. Most teams treat roughly 78–80% as the dividing line between "acceptable" and "needs work" for an important task — though success rate is highly sensitive to task difficulty, so the right target is always relative to the task and to your own historical baseline.

Binary vs. levels of success. The cleanest version is binary: a participant either completed the task or did not. But many teams record levels of success — full success, partial success (completed with significant struggle or workaround), and failure — because a binary view hides the difference between a user who breezed through and one who barely limped to the finish line. Partial successes are often where your richest design insights hide.

The trap: success rate alone is misleading. A task can show a 90% success rate while users take three minutes and make two errors getting there. That is why success rate must always be read alongside time and errors.

Metric 2: Time on task (efficiency)

Time on task measures how long it takes a participant to complete a task, usually reported as an average time in seconds (median or geometric mean) for successful attempts only. (Including failed attempts pollutes the number: a user who gave up after 10 seconds would otherwise look "efficient.")

Because time data is almost always skewed by a few very slow users, the geometric mean or the median is the statistically appropriate measure of center for small samples, not the arithmetic mean. Report a measure of spread too — the range or confidence interval — because an average of 45 seconds means something very different if the spread is 40–50 seconds versus 10–120 seconds.
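As a rough illustration, the sketch below summarizes task times with Python's standard `statistics` module. The timing data is invented for the example, and the variable names are assumptions, not output from any real study.

```python
from statistics import geometric_mean, median

# Hypothetical observations: (seconds on task, completed successfully?)
attempts = [
    (38.0, True),
    (42.0, True),
    (45.0, True),
    (51.0, True),
    (120.0, True),  # one slow participant skews the distribution
    (10.0, False),  # gave up quickly; excluded from the timing summary
]

# Only successful attempts count toward time on task.
successful_times = [seconds for seconds, succeeded in attempts if succeeded]

summary = {
    "n": len(successful_times),
    "median": median(successful_times),
    "geometric_mean": geometric_mean(successful_times),
    "arithmetic_mean": sum(successful_times) / len(successful_times),
    "range": (min(successful_times), max(successful_times)),
}

for name, value in summary.items():
    print(name, value)
```

With this made-up data, the single 120-second attempt pulls the arithmetic mean well above the median, while the geometric mean lands in between. That is the skew the median and geometric mean are meant to resist.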

How to use it: time on task is most powerful as a comparative metric — old design vs. new, your product vs. a competitor, or release over release. An absolute "good" time rarely exists in isolation; a 30-second task time is excellent for a complex configuration flow and terrible for a one-click action.

Metric 3: Error rate (accuracy)

An error is any unintended action, slip, mistake, or omission a user makes while attempting a task. Error rate is typically expressed as errors per task (the average number of errors across all attempts) or as a defect rate (the percentage of attempts containing at least one error).

Errors per task = total errors observed ÷ total attempts

The benchmark: across an analysis of 719 tasks using consumer and business software, Jeff Sauro found an average of 0.7 errors per task, with roughly two out of every three users making at least one error. Errors are far more common than most teams assume — which is exactly why counting them surfaces friction that success rate alone would never reveal.
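Both ways of expressing error rate can be computed from the same tally. The sketch below assumes you have recorded one error count per attempt; the counts are invented for illustration.

```python
# Hypothetical error counts, one entry per task attempt.
errors_per_attempt = [0, 1, 0, 2, 0, 1, 0, 0]

total_attempts = len(errors_per_attempt)

# Errors per task: total errors observed divided by total attempts.
errors_per_task = sum(errors_per_attempt) / total_attempts

# Defect rate: percentage of attempts containing at least one error.
attempts_with_errors = sum(1 for count in errors_per_attempt if count > 0)
defect_rate = attempts_with_errors / total_attempts * 100

print(f"errors per task: {errors_per_task:.2f}")
print(f"defect rate: {defect_rate:.1f}%")
```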

Not all errors are equal. Classify them by severity (does the error block completion, or merely slow the user down?) and by type (slips, where the user knows the goal but executes the wrong action, vs. mistakes, where the user has the wrong mental model). The pattern in where errors cluster is usually more actionable than the raw count.
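One way to see where errors cluster is to log each error with its step, type, and severity, then count. This is only a sketch: the `ErrorEvent` structure, its field names, and the sample events are all hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorEvent:
    step: str       # where in the flow the error happened
    kind: str       # "slip" or "mistake"
    blocking: bool  # True if the error prevented completion


# Invented observations from a hypothetical checkout study.
events = [
    ErrorEvent("shipping form", "slip", False),
    ErrorEvent("shipping form", "slip", False),
    ErrorEvent("promo code", "mistake", True),
    ErrorEvent("shipping form", "mistake", False),
    ErrorEvent("payment", "slip", False),
]

by_step = Counter(event.step for event in events)
by_kind = Counter(event.kind for event in events)
blocking_events = [event for event in events if event.blocking]

worst_step, worst_count = by_step.most_common(1)[0]
print(f"most error-prone step: {worst_step} ({worst_count} errors)")
print(f"by type: {dict(by_kind)}")
print(f"blocking errors: {len(blocking_events)}")
```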

Putting the three together

Effectiveness, efficiency, and accuracy form a triangle. A mature usability scorecard reads all three at once:

Metric            | What it measures                | Typical benchmark
Task success rate | Effectiveness: can they finish? | ~78% average (Sauro/MeasuringU)
Time on task      | Efficiency: how fast?           | Comparative; no universal target
Error rate        | Accuracy: how clean?            | ~0.7 errors/task; ~2 of 3 users err

Layer a task-level satisfaction question on top — a single scale question such as "How easy or difficult was that task?" — and you capture the user's attitude alongside their behavior. The Nielsen Norman Group repeatedly finds that performance and satisfaction metrics correlate only moderately, so measuring both protects you from shipping something users can use but hate using.

How many participants do you need?

For qualitative, formative usability testing (finding problems to fix), testing five users per round uncovers roughly 85% of issues. This is the classic Nielsen-Landauer finding, which assumes each participant exposes about 31% of the problems in a design. But the moment you want reliable quantitative metrics like a stable success rate or time on task, five is far too few: the confidence interval on a metric from five users is enormous.
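The five-user figure comes from a simple discovery model: if each participant independently exposes a fixed share of the problems, the share found after n participants is 1 − (1 − p)^n. The sketch below assumes p = 0.31, the average usually quoted for the Nielsen-Landauer model; your own product's per-user detection rate may differ, and the function name is illustrative.

```python
def share_of_problems_found(n_users: int, p_detect: float = 0.31) -> float:
    """Expected share of usability problems found by n_users participants.

    p_detect is the assumed chance that one participant exposes a given
    problem (0.31 is the commonly cited average, not a universal constant).
    """
    if n_users < 0:
        raise ValueError("n_users must be non-negative")
    if not 0 <= p_detect <= 1:
        raise ValueError("p_detect must be between 0 and 1")
    return 1 - (1 - p_detect) ** n_users


for n in (1, 3, 5, 8, 15):
    print(f"{n:>2} users: {share_of_problems_found(n):.0%}")
# With p_detect = 0.31, five users come out at about 84%.
```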

A practical rule of thumb:

  • 5–8 users — formative testing, finding usability problems (not for reporting precise numbers).
  • 15–20 users — a reasonably tight success-rate estimate for a single design.
  • 30–50+ users — benchmark-grade metrics you intend to track over time or quote externally. (See our usability benchmarking guide for the full methodology.)

Use adjusted-Wald binomial confidence intervals for small-sample completion rates rather than naive percentages: a 4-of-5 success rate of "80%" actually carries a 95% confidence interval running from roughly 36% to 98%.
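Here is a minimal sketch of the adjusted-Wald calculation using only the Python standard library. The function name and the sample sizes in the loop are illustrative. The method adds z²/2 to the successes and z² to the attempts before applying the usual Wald formula.

```python
from math import sqrt
from statistics import NormalDist


def adjusted_wald_interval(successes: int, attempts: int,
                           confidence: float = 0.95) -> tuple[float, float]:
    """Adjusted-Wald confidence interval for a completion rate (as proportions)."""
    if attempts <= 0 or not 0 <= successes <= attempts:
        raise ValueError("need 0 <= successes <= attempts and attempts > 0")
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n_adj = attempts + z ** 2
    p_adj = (successes + z ** 2 / 2) / n_adj
    margin = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    # Clamp to the valid range for a proportion.
    return max(0.0, p_adj - margin), min(1.0, p_adj + margin)


# Same observed 80% success rate at three sample sizes.
for successes, attempts in [(4, 5), (16, 20), (40, 50)]:
    low, high = adjusted_wald_interval(successes, attempts)
    print(f"{successes}/{attempts}: {low:.0%} to {high:.0%}")
# The first line reproduces the 4-of-5 case: roughly 36% to 98%.
```

Running the same observed rate at larger sample sizes shows the interval narrowing, which is the practical argument behind the 15–20 and 30–50+ participant tiers above.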

The modern approach: capturing usability metrics with AI

Traditionally, collecting these three metrics meant scheduling moderated sessions, watching every recording, manually timing each task with a stopwatch, tallying errors by hand, and reconciling notes across a research team. A 20-participant benchmark could swallow a week of analyst time — which is why most teams measured usability once a quarter at best, if at all.

This is exactly the bottleneck AI-native research platforms remove. Koji captures all three core metrics automatically:

  • Task success rate is recorded directly through structured questions. Frame each task with a yes_no or single_choice outcome question, and Koji aggregates the completion rate across every respondent in real time — no manual tallying.
  • Time on task is timestamped automatically for every session, with the distribution (median, range, outliers) computed and charted as responses arrive.
  • Error and friction signals surface through Koji's AI moderator, which probes in the moment ("What made that step confusing?") and then clusters the open-ended answers into themed friction findings, so you see where and why users struggle, not just that they did.

Koji supports all six structured question types (open_ended, scale, single_choice, multiple_choice, ranking, and yes_no), which means you can capture a binary success flag, a 1–5 ease rating, and a rich open-ended "what went wrong" in a single automated study. Because Koji runs 24/7, you can recruit 30–50 participants for a true quantitative benchmark in days rather than weeks, and re-run the identical study every release to track the trend line. Teams using AI-assisted research tools consistently report dramatically faster time-to-insight precisely because the counting, timing, and tagging (the slow part) are done the instant the last response lands.

You do not need a PhD in measurement theory to run a rigorous usability study. Define the tasks, attach the right structured questions, and let the platform handle the statistics.

Common mistakes to avoid

  1. Reporting time on task for failed attempts. Always separate successful and unsuccessful times.
  2. Using the arithmetic mean on small samples. Time data is skewed — use the median or geometric mean.
  3. Quoting a success rate without a confidence interval. "80% from five users" is not a precise number.
  4. Measuring success but never satisfaction. A usable-but-frustrating product still loses users.
  5. Changing the task wording between benchmark rounds. Consistency is what makes release-over-release comparison valid.

Related Resources

Related Articles

How to Measure Customer Effort Score (CES) and Reduce Friction

The complete guide to Customer Effort Score surveys. Learn how to measure and reduce friction in customer interactions, and why low-effort experiences drive loyalty more than delight.

Structured Questions in AI Interviews

Mix quantitative data collection — scales, ratings, multiple choice, ranking — with AI-powered conversational follow-up in a single interview.

System Usability Scale (SUS): Complete Guide with Calculator, Benchmarks & Examples

The definitive 2026 guide to the System Usability Scale (SUS): the 10-question formula, scoring calculator, Sauro–Lewis benchmark grades, and how to deploy SUS at scale with AI-moderated interviews on Koji.

Think-Aloud Protocol: How to Run and Analyze Think-Aloud Sessions

A complete guide to the think-aloud protocol — the most widely used usability testing method. Learn how to set up sessions, moderate effectively, analyze verbal data, and run remote think-aloud studies.

Usability Benchmarking: How to Run a Benchmark UX Study and Track Metrics Over Time

The complete guide to UX benchmarking — what it is, the metrics to track, how to run a repeatable benchmark usability study, sample sizes, and how AI-moderated research makes continuous benchmarking practical.

How to Conduct Usability Testing: The Complete Guide

A comprehensive guide to usability testing for UX researchers and product managers. Covers types of testing, participant numbers, step-by-step facilitation, and the most common mistakes to avoid.