Tutorial · 15 min read

User Research for AI Products: The 2026 Playbook for Building With LLMs

A complete 2026 guide to running user research on AI and LLM-powered products — covering trust, hallucination tolerance, perceived value, and the 5-step research framework Koji recommends for teams shipping generative-AI features.

Koji Research Team

May 13, 2026


Most product teams now ship AI features faster than they can research them. Generative AI reached 54.6% adoption within three years, outpacing the personal computer and the internet at the same point in their timelines. Yet most engineering teams shipping LLM features in 2026 test them less rigorously than traditional features: research practice simply hasn't kept pace with adoption.

That gap is where AI products quietly fail. Not in benchmarks. In trust. In comprehension. In the moment a user sees a confident-sounding answer that's subtly wrong and silently decides the feature isn't for them.

This is the 2026 playbook for user research on AI products — what to test, how to test it, and the 5-step framework Koji recommends for any team shipping generative-AI features.

Why AI Products Need a Different Research Approach

Traditional UX research evaluates deterministic systems: same input, same output, predictable failure modes. AI products break that frame. The same prompt can produce different completions. The system can be technically correct but contextually wrong. Hallucinations can be confident-sounding and fluent — which makes them harder to detect than crashes.

Three research dimensions become uniquely critical for AI products in 2026:

1. Trust calibration. Stanford's 2026 AI Index reports that 37% of non-users avoid AI products specifically because they do not trust them, with 29% citing broader societal impact concerns and 26% concerned about privacy. Trust is a feature, not a side effect.

2. Hallucination tolerance. McKinsey's 2026 enterprise AI survey shows 51% of organizations report negative consequences from AI use, with inaccuracy and hallucinations (56%) the top concern preventing faster deployment. How much wrongness is your audience willing to absorb before they churn?

3. Embedded value. 64% of Americans now use AI tools monthly, but standalone AI features rarely deliver the promised lift. The 2026 consensus is that "embedded AI" — AI seamlessly integrated into existing workflows — drives adoption, while bolted-on AI features get ignored.

A research plan that doesn't explicitly measure trust, hallucination tolerance, and embedded value is going to ship features the user politely abandons.

The Trust–Value Matrix: Where AI Features Actually Live

Before designing research, locate the feature on the trust-value matrix:

| | Low Stakes | High Stakes |
|---|---|---|
| Low Trust Needed | Tone suggestions, emoji recs — research focuses on perceived usefulness | Autocomplete in medical notes — research focuses on error-recovery affordances |
| High Trust Needed | AI-written tweet drafts — research focuses on voice fidelity and edit cost | Diagnosis suggestions, financial advice — research focuses on calibrated confidence and human-in-the-loop boundaries |

Research question, sample size, and methodology should all change quadrant-by-quadrant. A 5-person usability test is plenty for emoji recommendations and dangerous for diagnostic AI.

The 5-Step Research Framework for AI Products

Step 1: Generative Research — Map the "Why Would I Even Use This?" Story

Before any AI feature ships, run generative research to understand what job the user currently does manually, how confident they currently feel, and what tolerance they have for AI mistakes. Open-ended voice interviews are the gold standard.

What to ask:

  • Walk me through the last time you did [task]. What did you do, in order?
  • What's the worst part of that workflow?
  • When you imagine an AI doing this, what worries you?
  • Where do you not want help — even if the AI were 100% accurate?

How Koji helps: AI-moderated voice interviews probe adaptively for those answers, then synthesize themes across 30–100 respondents in under 24 hours. Compare that to a 6-week interview-and-tag cycle. See the generative research guide for a fuller walkthrough.


Step 2: Concept Testing — Validate Value Before You Ship the Model

For AI features in particular, concept testing matters more than usual because the perceived value is highly dependent on framing. The same feature described as "AI suggestions" vs "your AI co-author" vs "smart autocomplete" gets dramatically different responses.

Test three things:

  1. Comprehension — Does the user understand what the feature does without you explaining?
  2. Desirability — Would they use it? In what situations?
  3. Trust threshold — What accuracy rate would they need? What happens if it's wrong?

A scale question ("How much would you trust this feature on a 1–5 scale?") plus an open-ended follow-up ("Why?") is more revealing than any benchmark. Run it on 50–200 respondents from your target segment.
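To make the framing effect concrete, here is a minimal Python sketch that aggregates the 1–5 trust scores by framing variant so the variants can be ranked. The data is toy data for illustration, not real results, and the field layout is an assumption about how your survey export might look:

```python
from collections import defaultdict
from statistics import mean

# Each response: (framing variant shown, trust score 1-5, open-ended "why")
# Toy data — in practice this comes from your 50-200 respondent export.
responses = [
    ("AI suggestions",     4, "feels low-stakes, I can ignore it"),
    ("your AI co-author",  2, "don't want it writing for me"),
    ("smart autocomplete", 5, "I already trust autocomplete"),
    ("AI suggestions",     3, "depends how often it's wrong"),
    ("your AI co-author",  3, "maybe for first drafts"),
    ("smart autocomplete", 4, "fine if I can undo it"),
]

# Group scale scores by the framing the respondent saw
by_framing = defaultdict(list)
for framing, score, _why in responses:
    by_framing[framing].append(score)

# Rank framings by mean trust; the open-ended "why" column is what you
# read qualitatively to explain the gap between variants.
for framing, scores in sorted(by_framing.items(), key=lambda kv: -mean(kv[1])):
    print(f"{framing:18s} mean trust {mean(scores):.1f} (n={len(scores)})")
```

The same handful of lines works whether the variants differ in copy, placement, or default state; the point is to never average trust scores across framings, because the framing is the variable under test.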

Reference: concept testing guide.


Step 3: Wizard-of-Oz / Prototype Testing — Test the Experience Before the Model

For AI products, the experience is the model's output plus the surrounding interface plus the user's trust state. You can — and should — test all three before the model is production-ready, using a Wizard-of-Oz prototype: a researcher (or scripted responses) plays the AI behind the scenes.

What to observe:

  • Does the user notice when the "AI" is wrong?
  • How do they recover?
  • Do they trust subsequent answers more or less after a failure?
  • Do they verify against another source?

This surfaces UX issues — like missing "verify" affordances or confidence indicators — that pure model evaluation misses entirely. Human intuition, empathy, and methodological rigor remain irreplaceable even as AI augments the research process.


Step 4: Beta Hallucination Tolerance Testing

Once a real model is wired in, run a structured hallucination-tolerance study. Recruit 100+ real target users, give them the live AI feature, and instrument three signals:

  1. Discovery rate. When the AI hallucinated, did the user notice?
  2. Recovery cost. How long did it take them to recover? Did they abandon?
  3. Forgiveness curve. After how many errors does trust collapse permanently?
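As a sketch of how the first two signals (plus abandonment, a proxy for the forgiveness curve) can be computed from an instrumented event log: the `AIInteraction` schema and all field names below are hypothetical, meant to be adapted to whatever your analytics pipeline actually emits.

```python
from dataclasses import dataclass

@dataclass
class AIInteraction:
    """One logged AI interaction in the beta (hypothetical schema)."""
    user_id: str
    hallucinated: bool       # ground-truth label from model eval / review
    user_flagged: bool       # did the user flag or correct the output?
    recovery_seconds: float  # time from error to next successful action
    abandoned: bool          # did the user leave the flow after this turn?

def hallucination_metrics(events: list[AIInteraction]) -> dict:
    errors = [e for e in events if e.hallucinated]
    if not errors:
        return {"discovery_rate": None, "median_recovery_s": None,
                "abandon_rate": None}
    noticed = [e for e in errors if e.user_flagged]
    recoveries = sorted(e.recovery_seconds for e in noticed)
    return {
        # 1. Discovery rate: share of hallucinations the user actually caught
        "discovery_rate": len(noticed) / len(errors),
        # 2. Recovery cost: median time back on track after a caught error
        "median_recovery_s": recoveries[len(recoveries) // 2] if recoveries else None,
        # 3. Proxy for forgiveness: share of hallucinations that ended the session
        "abandon_rate": sum(e.abandoned for e in errors) / len(errors),
    }
```

A true forgiveness curve additionally needs a longitudinal view per user (how many errors each user saw before their last session), which the same event log supports once you group by `user_id`.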

Follow up with AI-moderated interviews (Koji's voice interview platform handles this at scale) to capture the emotional response — the part log analytics can't see. AI cannot assess trust or emotional reactions on its own; humans (or AI moderators trained on emotional cues) still own this layer.

This is where most AI products silently fail. Users don't complain. They just stop using it.


Step 5: Continuous Embedded Research

AI products don't have a "launch." Model versions, prompts, and tool integrations change weekly. Your research can't be a quarterly project — it has to be continuous.

The 2026 standard: AI products embed a lightweight research surface into the product itself. Examples:

  • An in-product feedback prompt after every AI interaction (1-question scale + open-ended why).
  • A monthly cohort invited to an AI-moderated Koji interview, where the moderator probes the specific AI behaviors that user has experienced.
  • An automatic alert when sentiment shifts on a particular feature.
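A sentiment-shift alert like the last bullet can start very simply. This sketch compares a recent window of 1–5 in-product ratings against a baseline window and fires only when the drop clears both an absolute floor and a noise guard; the function name, thresholds, and window sizes are all illustrative assumptions, not a Koji API:

```python
from statistics import mean, stdev

def sentiment_shift_alert(baseline: list[int], recent: list[int],
                          min_drop: float = 0.5) -> bool:
    """Flag a feature when the recent mean rating falls meaningfully
    below the baseline window (1-5 in-product scale responses)."""
    if len(baseline) < 30 or len(recent) < 30:
        return False  # too little data to trust the comparison
    drop = mean(baseline) - mean(recent)
    # Crude guard against noise: require the drop to exceed both an
    # absolute floor and two standard errors of the baseline mean.
    stderr = stdev(baseline) / len(baseline) ** 0.5
    return drop > max(min_drop, 2 * stderr)
```

When the alert fires, the scale data has done its job; the follow-up is qualitative, routing the affected cohort into interviews to learn what changed.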

Read the continuous discovery handbook for the broader methodology — it applies double for AI products.

Why AI-Moderated Research Is Uniquely Suited to AI Products

There's a useful symmetry here: AI-native research platforms are the most natural fit for researching AI-native products, for three reasons:

  1. Speed matches the model's release cadence. Your AI feature changes weekly. Research that takes 6 weeks is permanently stale. AI-moderated platforms compress 4–6 week qualitative cycles to under 24 hours.
  2. Scale matches the user base. AI products reach broad audiences fast. AI-moderated interviews scale 10x–1000x previous human-led paradigms. You can field 500 voice interviews in a weekend.
  3. Conversational depth captures emotional nuance. The right question for AI features is rarely a 5-point scale — it's "tell me about the moment you stopped trusting it." AI moderators trained on emotional cues are now multimodal and probe naturally.

That doesn't mean AI replaces the researcher. It means it removes the bottleneck — recruitment, fielding, synthesis — so the researcher focuses on study design and strategic interpretation. AI's value lies in augmentation, not replacement.

What to Avoid: Common AI-Product Research Mistakes in 2026

Mistake 1: Testing model accuracy in isolation. A 92% accurate model can have 0% adoption if users can't tell when it's in the 8% wrong tail. Always test the experience around the model, not the model alone.

Mistake 2: Relying solely on synthetic users. Simulated agents can catch low-hanging issues but cannot assess trust or emotional reactions. Synthetic users are useful for early prototype iteration; they're a research crutch for anything past that.

Mistake 3: Surveying instead of interviewing. A scale question gets you the score. It misses why — and "why" is where AI products live or die. Use scales for tracking; use AI-moderated interviews (or human-moderated, if budget allows) for understanding.

Mistake 4: Testing only happy-path interactions. Plan deliberate stress tests: edge-case prompts, multilingual users, accessibility scenarios, contradictory follow-ups. 51% of organizations report negative consequences from AI — most of which surface only off the happy path.

Mistake 5: Skipping the trust-calibration check. If your users trust the AI too much, you have a future incident. If they trust it too little, you have low adoption. Either failure mode is fatal; research it explicitly.

The Toolkit for AI Product Research in 2026

| Research Need | Best Tool Category | Koji Fit |
|---|---|---|
| Generative discovery | AI-moderated voice interviews | ✅ Native |
| Concept testing | Scale + open-ended hybrid surveys | ✅ Six question types in one study |
| Wizard-of-Oz prototype testing | Live moderated session tools | Pair with Lookback or human moderator |
| Hallucination tolerance | Beta with instrumentation + qual follow-up | ✅ Async AI-moderated follow-ups |
| Continuous embedded research | In-product prompts + AI synthesis | ✅ MCP integration to query insights from any LLM client |
| Sentiment monitoring | Auto thematic analysis | ✅ Native — sentiment + themes in minutes |

A Concrete 2-Week Research Sprint for an AI Feature

If you have a feature shipping in two weeks, here's the minimum-viable research plan:

Week 1

  • Day 1: Field a 5-question Koji study to 30 target users — open-ended interview + 3 scale questions on perceived value and trust threshold.
  • Day 2–3: Review the auto-generated thematic report. Identify the 2–3 things users worry about most.
  • Day 4: Build a clickable Wizard-of-Oz prototype. Run 5 moderated sessions with target users.
  • Day 5: Synthesize. Adjust feature framing, defaults, and error-recovery affordances.

Week 2

  • Day 6–9: Ship to a 5% beta with instrumentation on discovery, recovery, and abandonment.
  • Day 10: Send beta users a 10-minute AI-moderated voice interview via Koji. Probe specifically on errors, trust shifts, and unexpected uses.
  • Day 11: Review the synthesized insights. Identify the one fix that would unblock adoption.
  • Day 12: Ship the fix. Roll forward.

Total cost on Koji: well under €100. Total time: 2 weeks. Quality of insight: comparable to a $30K agency engagement that would have taken 8 weeks.

Ship AI Products Your Users Actually Trust

The single biggest predictor of AI product success in 2026 isn't model accuracy — it's how well the team understood the user's trust state, hallucination tolerance, and embedded-workflow context before shipping. That understanding doesn't come from benchmarks. It comes from talking to users, fast and at scale.

Koji is the AI-native research platform built for this exact moment: AI-moderated voice and text interviews, six structured question types, automatic thematic analysis, customizable AI consultants, and MCP integration to pull insights directly into your IDE or LLM client. You can field your first AI-feature research study tonight and have a publishable report by morning.

Try Koji free →

