System Usability Scale (SUS): Complete Guide with Calculator, Benchmarks & Examples
The definitive 2026 guide to the System Usability Scale (SUS): the 10-question formula, scoring calculator, Sauro–Lewis benchmark grades, and how to deploy SUS at scale with AI-moderated interviews on Koji.
What is the System Usability Scale (SUS)?
The System Usability Scale (SUS) is a 10-item Likert questionnaire that produces a single 0–100 score representing the perceived usability of any product, app, or system. It was developed by John Brooke at Digital Equipment Corporation in 1986 as a "quick and dirty" measure of usability, and has since become the most widely cited usability questionnaire in the world — used in more than 1,300 published research articles and behind tens of thousands of commercial usability evaluations.
If you only have time to capture one number that summarizes how usable your product feels to a real user, it should be a SUS score. It is short (10 questions), validated, comparable across products, and reliable even with small samples (Tullis & Stetson's 2004 comparison study found SUS produces consistent conclusions with roughly 12 participants per condition).
The bottom line: a SUS score above 68 is "above average." Above 80.3 is in the top 10% of products tested. Anything below 51.7 is in the bottom 15%. These cut-points come from Jeff Sauro and James Lewis's analysis of more than 5,000 SUS scores across 500+ studies — the de facto benchmark database.
The 10 SUS Questions (verbatim)
The SUS uses ten statements, alternating between positive and negative phrasing, and a 5-point Likert scale ("Strongly disagree" → "Strongly agree"):
1. I think that I would like to use this system frequently.
2. I found the system unnecessarily complex.
3. I thought the system was easy to use.
4. I think that I would need the support of a technical person to be able to use this system.
5. I found the various functions in this system were well integrated.
6. I thought there was too much inconsistency in this system.
7. I would imagine that most people would learn to use this system very quickly.
8. I found the system very cumbersome to use.
9. I felt very confident using the system.
10. I needed to learn a lot of things before I could get going with this system.
The alternating polarity is intentional. It forces respondents to read each item rather than satisficing down a column of "Strongly agree" answers, which reduces acquiescence and straight-lining bias in the resulting score.
How to Calculate a SUS Score (the Formula)
SUS scoring confuses people the first time they see it because the math is non-obvious. The reason is the alternating polarity above — odd items measure positive sentiment, even items measure negative sentiment, and both have to be normalized in opposite directions before they can be summed.
The full formula:
SUS = 2.5 × ( 20 + (Q1 + Q3 + Q5 + Q7 + Q9) − (Q2 + Q4 + Q6 + Q8 + Q10) )
Step by step, for each respondent:
- Convert each Likert response to a 1–5 number ("Strongly disagree" = 1, "Strongly agree" = 5).
- For odd-numbered items (1, 3, 5, 7, 9): subtract 1 from the score.
- For even-numbered items (2, 4, 6, 8, 10): subtract the score from 5.
- Sum all 10 adjusted scores. You will get a number between 0 and 40.
- Multiply by 2.5 to convert to the final 0–100 scale.
That gives you a per-respondent SUS score. To produce the SUS score for the system, average the per-respondent scores.
A worked example: a respondent rates Q1 = 4, Q2 = 2, Q3 = 5, Q4 = 1, Q5 = 4, Q6 = 1, Q7 = 5, Q8 = 2, Q9 = 4, Q10 = 1. Adjusted scores: 3, 3, 4, 4, 3, 4, 4, 3, 3, 4 → sum = 35 → × 2.5 = 87.5, an "A+" grade.
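The arithmetic is easy to get wrong in a spreadsheet, so here is a minimal TypeScript sketch of the whole calculation (the function names are ours, not from any particular library):

```typescript
// Per-respondent SUS score from ten 1–5 Likert responses, ordered Q1..Q10.
function susScore(responses: number[]): number {
  if (responses.length !== 10) throw new Error("SUS requires exactly 10 responses");
  let raw = 0;
  responses.forEach((r, i) => {
    if (r < 1 || r > 5) throw new Error(`Q${i + 1} response out of the 1–5 range`);
    // Index 0, 2, 4... = odd-numbered items (score − 1);
    // the rest are even-numbered items (5 − score).
    raw += i % 2 === 0 ? r - 1 : 5 - r;
  });
  return raw * 2.5; // 0–40 raw sum → 0–100 SUS scale
}

// The system's SUS score is the mean of the per-respondent scores.
function systemSus(allResponses: number[][]): number {
  const scores = allResponses.map(susScore);
  return scores.reduce((a, b) => a + b, 0) / scores.length;
}

// The worked example above:
console.log(susScore([4, 2, 5, 1, 4, 1, 5, 2, 4, 1])); // 87.5
```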
A common mistake: SUS is not a percentage. A score of 75 does not mean "75% of users found it usable." It is a normalized index, and the only meaningful comparison is to the benchmark distribution.
SUS Benchmarks: What Counts as "Good"?
Jeff Sauro's 2011 analysis of 5,000+ SUS scores produced a now-canonical benchmark distribution. Sauro and Lewis later refined it into a curved letter-grade scale (Sauro & Lewis, 2016).
| SUS Score | Grade | Percentile | Interpretation |
|---|---|---|---|
| ≥ 84.1 | A+ | 96–100 | Best imaginable |
| 80.8 – 84.0 | A | 90–95 | Excellent |
| 78.9 – 80.7 | A− | 85–89 | Excellent |
| 77.2 – 78.8 | B+ | 80–84 | Good |
| 74.1 – 77.1 | B | 70–79 | Good |
| 72.6 – 74.0 | B− | 65–69 | Good |
| 71.1 – 72.5 | C+ | 60–64 | OK |
| 65.0 – 71.0 | C | 41–59 | Average (the 68 mid-point) |
| 62.7 – 64.9 | C− | 35–40 | OK / borderline |
| 51.7 – 62.6 | D | 15–34 | Poor — fix it |
| < 51.7 | F | 0–14 | Unusable |
The single number to remember is 68 — the population mean across all SUS studies. If your score is below 68, your product is below average. If it is above 80, you are in the top 10–15%.
"A SUS score above a 68 would be considered above average, and anything below 68 is below average. The best way to interpret your results involves normalizing the scores to produce a percentile ranking." — Jeff Sauro, MeasuringU
When to Use SUS (and When Not To)
Use SUS when you want to:
- Track the perceived usability of a product over time (release-over-release).
- Compare two design alternatives in an A/B usability test.
- Benchmark your product against competitors using a shared yardstick.
- Compare across very different product categories — SUS is technology-agnostic and works for desktop apps, mobile apps, websites, voice interfaces, hardware, and even physical products.
- Provide a single, defensible number for stakeholders who want a usability KPI.
SUS is the wrong tool when:
- You need diagnostic insight into why something is hard. SUS gives a score, not reasons. Pair it with open-ended interview questions to find the root cause.
- You are testing a prototype with major missing functionality — respondents will rate the gaps, not the design.
- You are measuring task-level success or efficiency. Use task completion rates, time-on-task, and error counts for those.
- You only have one or two responses. SUS is reliable at small samples, but a sample of 1 is still a sample of 1.
A widely cited Nielsen Norman Group rule of thumb: complement quantitative usability metrics like SUS with at least 5 qualitative usability sessions — quant gives you the score, qual gives you the why.
Sample Size: How Many Respondents Do You Need?
SUS is unusually robust at small samples — in Tullis & Stetson's (2004) comparison of five usability questionnaires, SUS reached the same conclusions as much larger samples with 12–14 respondents per condition, and produced usable (if noisier) estimates at as few as 8.
Practical guidance:
- Formative usability test (one design): 8–12 respondents is sufficient.
- Comparative test (A vs B): 14–20 per condition for adequate statistical power.
- Benchmark / tracking study: 30–50 respondents per release for tighter confidence intervals.
- Public benchmark claims ("our SUS is 82"): 50+ respondents, ideally weighted to your real user mix.
The width of the 95% confidence interval shrinks rapidly between n=5 and n=20, then much more slowly. There is rarely a payoff to going beyond 100 unless you need sub-segment estimates.
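To see how your own interval narrows as n grows, here is a rough TypeScript sketch. It assumes approximately normal per-respondent scores; the small t-table covers a few common sample sizes and falls back to the z value 1.96 for larger n:

```typescript
// Two-sided 95% t-critical values for selected degrees of freedom (n − 1).
const T_975: Record<number, number> = {
  4: 2.776, 7: 2.365, 9: 2.262, 11: 2.201, 13: 2.160,
  19: 2.093, 29: 2.045, 49: 2.010,
};

// 95% confidence interval for the mean SUS score of a sample.
function susConfidenceInterval(scores: number[]): [number, number] {
  const n = scores.length;
  const mean = scores.reduce((a, b) => a + b, 0) / n;
  const variance = scores.reduce((a, s) => a + (s - mean) ** 2, 0) / (n - 1);
  const se = Math.sqrt(variance / n);
  const t = T_975[n - 1] ?? 1.96; // fall back to the normal approximation
  return [mean - t * se, mean + t * se];
}
```

For example, with a sample standard deviation of 18 points, n = 10 gives an interval of roughly ±13 points, while n = 50 narrows it to about ±5 — which is why the guidance above escalates sample size with the stakes of the claim.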
How to Run a SUS Study with Koji (the Modern Approach)
Traditional SUS studies require a survey tool, an email invite list, manual scoring in a spreadsheet, and a separate qualitative session to understand the why behind the score. That is the workflow MeasuringU and Sauro have refined for two decades — and it works, but it takes days of wall-clock time.
With Koji, a SUS study runs end-to-end in a single AI-moderated study, and the score plus the diagnostic insights come back together.
Step 1 — Create the study. In Koji, start a new project and pick the Discovery or Exploratory methodology. Tell the AI consultant your goal: "I want to measure the System Usability Scale for our checkout flow and understand the top usability frictions." The AI consultant drafts a research brief in seconds.
Step 2 — Add the 10 SUS items as scale questions. Koji supports six structured question types, one of which is scale. Configure each SUS item with scaleMin: 1, scaleMax: 5, and scaleLabels: ["Strongly disagree", "Strongly agree"]. Because SUS items have deterministic numeric responses, Koji's ground-truth override locks in the click-based answer at high confidence — the LLM never re-interprets a deterministic widget click.
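Configured as data, the SUS battery looks something like the sketch below — scaleMin, scaleMax, and scaleLabels are the settings named above, but the surrounding question shape is illustrative, not Koji's documented schema:

```typescript
// Illustrative only: the question-object shape is an assumption, not
// Koji's actual API. scaleMin/scaleMax/scaleLabels are the settings
// described in the text.
const SUS_ITEMS = [
  "I think that I would like to use this system frequently.",
  "I found the system unnecessarily complex.",
  // ...the remaining eight items, verbatim from the list above.
];

const susQuestions = SUS_ITEMS.map((text) => ({
  type: "scale",
  text,
  scaleMin: 1,
  scaleMax: 5,
  scaleLabels: ["Strongly disagree", "Strongly agree"],
}));
```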
Step 3 — Add 2–3 open-ended probes. Tell the AI moderator to follow each "Strongly disagree" or "Strongly agree" response with a probe: "What made it feel that way?" Koji's adaptive interviewer (configured via maxFollowUps) handles the probing dynamically — you do not script every branch.
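As a sketch, the probing setup might look like this — only maxFollowUps is named in the text; the other fields are hypothetical placeholders for whatever your study configuration exposes:

```typescript
// Hypothetical moderator config — maxFollowUps comes from the text;
// the remaining fields are illustrative placeholders.
const moderatorConfig = {
  maxFollowUps: 2,
  probeInstruction:
    'After any "Strongly disagree" or "Strongly agree" answer, ' +
    'ask: "What made it feel that way?"',
};
```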
Step 4 — Recruit and launch. Share a personalized link, embed the interview widget, or import a CSV of contacts. Koji supports voice (3 credits per interview) or text (1 credit per interview) modalities — voice produces 3.4× longer responses on average, but text often yields higher completion rates for longer SUS surveys.
Step 5 — Read the report. Koji's aggregateScaleResponses function rolls up every respondent's 1–5 answers into per-question distributions, and the aggregateThemes function clusters the open-ended probes into top friction themes. The aggregate SUS score appears at the top of the report; the why sits underneath as themed quotes with citations back to the source interview.
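One convenient property worth knowing: because the SUS transform is linear, the aggregate score can be computed either as the mean of per-respondent scores or directly from the per-question mean responses — both yield the same number, which makes a handy cross-check against any report. A sketch (the per-question-means input shape is our assumption, not aggregateScaleResponses' actual output format):

```typescript
// Aggregate SUS from the ten per-question mean responses (each 1–5).
// Linearity of the SUS transform guarantees this equals the mean of
// the per-respondent scores.
function susFromQuestionMeans(means: number[]): number {
  if (means.length !== 10) throw new Error("Expected 10 per-question means");
  const adjusted = means.map((m, i) => (i % 2 === 0 ? m - 1 : 5 - m));
  return adjusted.reduce((a, b) => a + b, 0) * 2.5;
}
```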
What used to take 2 weeks (recruit → field → score in Excel → write up) collapses to a single afternoon. Teams using AI-moderated research report 60–80% reductions in time-to-insight versus traditional manual workflows.
Common SUS Pitfalls and How to Avoid Them
- Modifying the question wording. Researchers love to "improve" SUS by swapping "system" for "app" or "website." Brooke's original validation work was done on the 1986 wording — modifying it invalidates comparison to the benchmark database. Use the items verbatim, even if the language sounds dated.
- Reporting SUS as a percentage. Stakeholders see "75" and assume "75% of users approve." Always present the score next to the benchmark grade so the meaning is clear.
- Comparing SUS to NPS or CSAT directly. They measure different constructs. SUS measures perceived usability; NPS measures loyalty/recommendation; CSAT measures task-level satisfaction. They are complementary, not interchangeable.
- Running SUS on a broken prototype. If 40% of respondents rate "I found the various functions well integrated" with "Strongly disagree" because the demo had a bug, you have measured the bug, not the design.
- Ignoring the open-ended why. A SUS score with no qualitative follow-up is just a vanity metric. Always pair it with open-ended probes.
Beyond SUS: Adjacent Usability Instruments
If SUS does not quite fit your context, consider:
- UMUX-Lite — 2 items, designed to correlate with SUS; useful for ultra-short pulse surveys.
- SUPR-Q — 8 items, optimized for websites (includes loyalty, trust, appearance, usability sub-scales).
- CSUQ (Computer System Usability Questionnaire) — 19 items, more diagnostic.
- QUIS — longer (50+ items), heavyweight diagnostic instrument for academic studies.
For most product teams, SUS plus open-ended probes is the right starting point. You can always layer in a more diagnostic instrument once you know which screens or flows to investigate.
Related Resources
- Usability Testing Guide — the broader methodology that SUS fits inside
- Structured Questions Guide — how Koji's 6 question types (scale, single_choice, multiple_choice, ranking, yes_no, open_ended) work
- Scale Questions in AI Interviews — how to deploy SUS items in Koji
- Likert Scale Research Guide — the underlying scale type SUS relies on
- Customer Effort Score (CES) Guide — a complementary task-level metric
- How to Analyze Qualitative Data — for the open-ended probes that pair with SUS
Sources & further reading: Brooke, J. (1986). SUS: A Quick and Dirty Usability Scale; Sauro, J. & Lewis, J. R. (2016). Quantifying the User Experience; Tullis, T. & Stetson, J. (2004). A Comparison of Questionnaires for Assessing Website Usability; MeasuringU SUS benchmark database (5,000+ scores across 500+ studies).
Related Articles
How to Analyze Qualitative Data: From Raw Interviews to Actionable Insights
A step-by-step guide to qualitative data analysis — from reviewing raw transcripts to synthesizing themes, generating insights, and presenting findings that teams act on.
Scale Questions in AI Interviews: Measure NPS, CSAT, and Ratings Automatically
Learn how to configure and use scale questions in Koji AI interviews to capture NPS, CSAT, and satisfaction ratings — with automatic probing and aggregated distribution charts in your research report.
Structured Questions in AI Interviews
Mix quantitative data collection — scales, ratings, multiple choice, ranking — with AI-powered conversational follow-up in a single interview.
Likert Scale Questions: How to Use Rating Scales in User Research
A complete guide to Likert scale questions in user research — what they are, when to use them, how to write them correctly, and how Koji's AI interviews take rating scales further by pairing quantitative scores with qualitative follow-up.
How to Conduct Usability Testing: The Complete Guide
A comprehensive guide to usability testing for UX researchers and product managers. Covers types of testing, participant numbers, step-by-step facilitation, and the most common mistakes to avoid.
How to Measure Customer Effort Score (CES) and Reduce Friction
The complete guide to Customer Effort Score surveys. Learn how to measure and reduce friction in customer interactions, and why low-effort experiences drive loyalty more than delight.