Wizard of Oz Testing: How to Validate Product Ideas Without Building Them

The complete guide to Wizard of Oz testing — a UX research method where humans simulate AI or system functionality to test concepts before any code is written. Includes when to use it, how to design a study, ethical guardrails, and how AI interview platforms like Koji extend the method.

Wizard of Oz testing is a research method where users believe they are interacting with a finished product, but a human (the "wizard") secretly performs the system's responses behind the curtain. It lets product teams validate concepts — especially AI-powered features — in days instead of months, without writing a line of backend code. When you pair Wizard of Oz prototypes with a research platform like Koji, you can simulate the experience, run structured interviews afterwards, and quantify the result on the same day.

Most failed products do not fail because the team built poorly. They fail because the team built the wrong thing. Wizard of Oz testing is one of the highest-leverage methods for catching that mistake before it costs three engineering quarters. It is especially valuable in 2026, when every team is being asked to ship "AI-powered" features and most teams cannot afford to find out post-launch that the AI is not what users actually wanted.

This guide explains what Wizard of Oz testing is, when to use it, how to run a study, and how to combine it with conversational research to convert raw user reactions into shippable product decisions.


What Is Wizard of Oz Testing?

The name comes from the 1939 film: behind the impressive Wizard is just a person pulling levers. In product research, the Wizard is a researcher (or operator) manually generating responses that the user believes are produced by software.

The user thinks they are using a fully built feature. In reality:

  • A "voice assistant" is a researcher typing replies through a speech synthesizer.
  • A "personalized recommendation engine" is a human curating results in real time from a back office.
  • An "AI summary" of meeting notes is being written by hand while the user waits 30 seconds.

The output a user sees is identical to the finished product. The mechanics behind it are entirely human. The goal is to test the experience — the value, the workflow, the desirability — before committing to engineering.
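To make the first setup concrete, here is a minimal browser-side sketch of the voice-assistant wizard, assuming the operator's typed reply reaches the participant's page over some channel (a WebSocket, polling). It uses the standard Web Speech API; everything else here is illustrative, not a prescribed implementation.

```typescript
// Speak the operator's typed reply aloud so the participant hears an
// "assistant" rather than reading a chat message. SpeechSynthesisUtterance
// and speechSynthesis are standard browser APIs; the transport that delivers
// the reply is left out and is an assumption of this sketch.
function speakAsAssistant(reply: string): void {
  const utterance = new SpeechSynthesisUtterance(reply);
  utterance.rate = 1.0;  // keep delivery consistent across sessions
  utterance.pitch = 1.0;
  window.speechSynthesis.speak(utterance);
}

// e.g. called whenever the wizard's reply arrives from the operator page
speakAsAssistant("I found three meetings that mention the Q3 budget.");
```

The participant hears a synthesized voice; the wizard stays behind the curtain.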

Origin and modern relevance

The technique was formalized by IBM researcher John F. Kelley in the early 1980s while testing natural-language interfaces. With LLMs and AI agents now embedded in nearly every product roadmap, Wizard of Oz testing has had a strong revival: simulating an AI feature with a human is dramatically faster than building it, and the user reactions you collect are real.


When to Use Wizard of Oz Testing

Wizard of Oz fits best in three situations:

1. Validating AI-powered features before training a model

Building a fine-tuned model or RAG pipeline takes weeks. Wizard of Oz lets you validate that the AI experience is useful at all before spending those cycles. Run 10-15 sessions with a human playing the AI; if users do not pick up the workflow or do not trust the output, no training run will save you.

2. Testing complex interactions cheaply

When the workflow involves multiple back-and-forth steps — filing a support ticket, booking travel, drafting a proposal — building a real prototype is expensive. A wizard can simulate the entire flow in a Figma file plus a chat tool.

3. Probing trust and desirability

Sometimes the question is not "can we build this" but "will users trust this enough to use it?" Wizard of Oz isolates the experience from the implementation, so the only thing being measured is the user's reaction to the idea.


When NOT to Use Wizard of Oz

Skip the method when:

  • The system's value is its speed (a wizard is slower than software, which can confound results)
  • The technical risk is high but the desirability is obvious (build a thin slice instead)
  • Long-term behavior matters more than first impressions (use a diary study or beta program)
  • You cannot reasonably simulate the experience (e.g., real-time personalization across millions of items)

For these cases, smoke tests, fake-door tests, or prototype testing are usually a better fit.


How to Design a Wizard of Oz Study

Step 1 — Define the decision

Wizard of Oz is expensive to run (a researcher is moderating live), so the decision the study is feeding must be worth the effort. Typical decisions:

  • Should we build this AI feature at all?
  • Which of two interaction models should we invest in?
  • Where does the experience break down for the user?

Write the decision down in one sentence before designing anything else.

Step 2 — Choose what the wizard simulates

Pick the smallest, most decision-relevant slice of the experience. If you are testing an AI meeting summarizer, the wizard does not need to also simulate the calendar integration — they only need to produce the summary.

Step 3 — Build a believable surface

The user-facing surface needs to feel real. In practice that means:

  • A clickable Figma prototype, a Notion mock-up, or a stripped-down web app
  • Plausible loading states, so a 30-second wizard response feels like AI processing time rather than a hang (see the sketch after this list)
  • Real-looking output formatting (do not let the wizard send raw text where the product would render markdown)
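For the loading-state point, here is a sketch of how a chat-style surface can enforce a believable response delay. The UI hooks (showTypingIndicator, hideTypingIndicator, renderMessage) stand in for whatever the prototype actually provides, and the 4-second floor is an assumption to tune during the pilot.

```typescript
// Placeholders for the prototype's own UI hooks (assumptions, not a real API).
declare function showTypingIndicator(): void;
declare function hideTypingIndicator(): void;
declare function renderMessage(msg: { role: "assistant"; text: string }): void;

const MIN_RESPONSE_MS = 4000; // believability floor; tune during the pilot

// Deliver a wizard reply no sooner than MIN_RESPONSE_MS after the user's
// action, so fast typing never breaks the "AI is processing" illusion.
async function deliverWizardReply(reply: string, userActionAt: number): Promise<void> {
  showTypingIndicator();
  const elapsed = Date.now() - userActionAt;
  await new Promise((resolve) => setTimeout(resolve, Math.max(0, MIN_RESPONSE_MS - elapsed)));
  hideTypingIndicator();
  renderMessage({ role: "assistant", text: reply }); // render as the product would
}
```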

Step 4 — Script the wizard

The wizard is not improvising. Write a one-page operations doc that defines:

  • What the wizard does for each user action
  • What the wizard does NOT do (out-of-scope requests get a polite stub response)
  • How long the wizard waits before responding (consistency matters)
  • How the wizard logs each decision for post-hoc analysis (a sketch follows this list)
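Here is a sketch of what that log can look like. The field names are illustrative; the point is that every operator action is captured with enough context to analyse after the session.

```typescript
// One entry per wizard action. All field names are illustrative assumptions.
interface WizardLogEntry {
  sessionId: string;
  timestamp: string;        // ISO 8601
  userAction: string;       // what the participant did
  wizardResponse: string;   // what the operator sent back
  scripted: boolean;        // false = improvised; flag these for review
  notes?: string;
}

const sessionLog: WizardLogEntry[] = [];

function recordDecision(entry: Omit<WizardLogEntry, "timestamp">): void {
  sessionLog.push({ ...entry, timestamp: new Date().toISOString() });
}

recordDecision({
  sessionId: "s-07",
  userAction: "asked for a summary of last week's standups",
  wizardResponse: "sent prewritten summary template B",
  scripted: true,
});
```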

Step 5 — Plan the post-experience interview

This is where most Wizard of Oz studies fail. Teams capture the session and stop. The reaction is the evidence; the insight lives in the post-experience interview.

Run a structured interview immediately after the wizard session, ideally in the same tool. Cover:

  • What the user expected to happen at each step
  • Where the experience matched or violated expectations
  • Whether they would use it again, and for what task
  • What they would change, prioritised

This is where Koji shines. After a wizard session, you can route the participant directly to a Koji AI interview that captures their full reasoning — without needing a researcher in the room. Koji's structured questions handle the quantitative pieces (1-5 desirability, multiple-choice friction tagging, ranking of feature ideas) while open-ended questions plus AI follow-up probing surface the qualitative depth.
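Mechanically, the handoff can be as simple as redirecting the participant to the interview link with the session ID attached, so interview answers can be joined to the wizard log during analysis. The URL and query parameter below are hypothetical placeholders, not Koji's actual linking API.

```typescript
// Build the post-session interview link with the session ID as a join key.
// The base URL and the "session" parameter are hypothetical placeholders.
function interviewUrl(baseUrl: string, sessionId: string): string {
  const url = new URL(baseUrl);
  url.searchParams.set("session", sessionId);
  return url.toString();
}

// e.g. redirect the participant the moment the wizard session ends
window.location.href = interviewUrl("https://example.com/interview/abc123", "s-07");
```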

Step 6 — Pilot the wizard

Always pilot with two participants before going live. The most common failures:

  • Wizard responses too fast (users notice)
  • Wizard responses too slow (users abandon)
  • Edge cases the script did not anticipate
  • Surface that breaks immersion (a typo, a placeholder image)

Fix and run again.


Sample Sizes and Recruiting

Wizard of Oz is qualitative. Sample sizes are small:

Goal                                      Recommended sample
Detect major usability issues             5-7 sessions
Compare two interaction models            8-12 per arm
Validate desirability across segments     5 per segment

Above 15-20 sessions, the moderator overhead becomes prohibitive. If you need a larger sample, use Wizard of Oz for the first wave of insight and follow up with a Koji conversational survey to validate findings at scale. See data saturation for how to know when you have enough.

For recruiting, prioritise users who match the target persona for the feature. If the feature is for power users, do not test it on first-time users. Use a research screener to filter, and consider research participant incentives of $50-150 per session given the time commitment.


Ethical Guardrails

Wizard of Oz raises an obvious question: are you deceiving the user?

The professional answer is "yes, with informed consent." The ethical contract is:

  • Participants are told upfront that they are testing an early prototype that may not work as advertised
  • Participants are NOT told mid-session that a human is operating the system (this would change behaviour)
  • Participants ARE debriefed at the end that the simulation was human-operated and given a chance to ask questions

Get the consent in writing using a research consent form and document the debrief. If you are testing health, finance, or safety-critical features, run the study past your legal or research ethics team first.


Wizard of Oz vs. Other Validation Methods

Method                     Best for                   Cost                            Realism
Wizard of Oz               Concept and AI features    Medium (1 wizard per session)   Very high
Prototype testing          UI flows                   Low                             Medium
Smoke / fake-door tests    Demand validation          Very low                        Low
Beta programs              Real long-term behaviour   High                            Very high
Concept testing            Idea sorting               Low                             Low

A common pattern: smoke test to validate demand → Wizard of Oz to validate the experience → engineering build → beta to validate retention.


Where Koji Fits

Koji is the AI-native customer research platform purpose-built to run the interview half of a Wizard of Oz study at scale.

Before the wizard session: send a Koji screener to filter for the right participants, automatically scheduling those who qualify.

After the wizard session: route every participant to a Koji conversation that captures the full reasoning behind their reactions. Voice or text. AI moderator. Six structured question types so the team gets both the desirability score (scale) and the friction quotes (open-ended) in the same study.

During analysis: Koji's automatic analysis aggregates findings across all participants in minutes, generates a research report, and lets the team query in plain English ("What did first-time users say about the AI accuracy?") via Insights Chat.

The combination — human wizard on the experience side, AI moderator on the research side — is the fastest known way to convert a Friday afternoon of user reactions into a Monday morning go/no-go decision.


Common Pitfalls

  1. Optimising for the demo instead of the decision. A great Wizard of Oz session is one where you learned, not one where the wizard performed flawlessly.
  2. Skipping the post-session interview. Reactions are evidence; reasoning is insight.
  3. Letting the wizard improvise. Inconsistent wizard behaviour means inconsistent data.
  4. Not piloting. Two participants, always.
  5. Over-scaling. Above 15-20 sessions, the moderator overhead exceeds the marginal insight. Switch to a conversational survey for larger samples.
  6. No debrief. Failing to disclose the simulation at the end damages trust and creates future recruiting problems.

Wizard of Oz in 2026: The AI-Era Comeback

Three forces have made Wizard of Oz the highest-leverage validation method on most teams' roadmaps:

  1. Every product is being asked to ship AI features fast, and AI is uniquely suited to wizard-style simulation.
  2. Conversational research platforms like Koji collapse the post-session analysis from "two weeks" to "the same afternoon," tightening the feedback loop dramatically.
  3. The cost of building real AI is high enough that even one wrong build is brutal. Wizard of Oz routinely saves entire engineering quarters.

If your team is about to commit a quarter of engineering capacity to a feature, run a Wizard of Oz study first. The asymmetry of the bet — one researcher week versus one engineering quarter — is rarely beaten.


Related Articles

Structured Questions in AI Interviews

Mix quantitative data collection — scales, ratings, multiple choice, ranking — with AI-powered conversational follow-up in a single interview.

Research Consent Form Templates: GDPR-Compliant Forms for Every Study

Ready-to-use consent form templates for user research, UX studies, and AI interviews. Covers GDPR compliance, informed consent best practices, and how to collect consent automatically with Koji.

Smoke Tests and Fake Door Tests: How to Validate Demand Before You Build

Smoke tests and fake door tests measure real user demand for an idea before any code is written. Learn the playbook used by Buffer, Dropbox, and modern product teams — and how to pair it with AI interviews.

Prototype Testing and Concept Validation: A Researcher's Complete Guide

Learn how to validate product concepts and prototypes through research interviews before committing to build. Covers when to use each approach, question frameworks, and how AI interviews scale concept validation 10x faster.

How to Conduct User Interviews: The Complete Step-by-Step Guide

A complete step-by-step guide to planning, conducting, and analyzing user interviews—covering discussion guide writing, participant recruitment, facilitation techniques, sample size, and modern AI-powered approaches.

Concept Testing: The Complete Methodology Guide

How to evaluate product and marketing ideas with target audiences before development — covering methods, metrics, sample sizes, and AI-powered approaches.

Data Saturation in Qualitative Research: How to Know When You Have Enough

Data saturation is the point at which additional interviews stop producing new information. This guide covers the four types of saturation (theoretical, data, code, meaning), how to recognize and document them, the empirical sample sizes from Hennink and Guest, and how AI-moderated interviews let you reach saturation in days instead of months.

Generative vs. Evaluative Research: When to Use Each Method

Understand the difference between generative and evaluative research, when to use each, and how combining both leads to better product decisions. Includes a comparison table and decision framework.

User Research for AI Products: A Practical Guide for 2026

AI products break the assumptions traditional UX research is built on — outputs are non-deterministic, trust is the central UX problem, and prompts replace navigation. This guide covers the methods, question types, and study designs that actually work for teams shipping AI features.