
Wednesday, May 6, 2026

Your Oral Exam Won't Save You

We just ran an experiment where we gave the US Army War College’s oral comprehensive exam to four commercial AI systems.

They all passed. One got an A.

This was the MILBENCH experiment, conducted here at the US Army War College in early 2026. It was the same exam we give our students (senior military officers preparing for senior staff, command, and general officer responsibilities). Same rubrics. Same faculty panels. 

We've been giving this exam for years; the faculty who administered it are experienced examiners with deep content expertise, and the rubric has been regularly updated and refined. The AI systems had no access to any course materials. They came in cold, and they performed at a level that, for a human student, would be considered competent to excellent.

I mention this because I keep hearing people say that oral exams are the answer to AI in education. The logic sounds right: “AI can write essays, but students can't fake their way through a live conversation the way they can paste an AI's output into a paper.”

Except they can. Or more precisely, the AI can. We watched it happen.

What Passing Looked Like

Since 2024, every major AI platform has shipped voice-enabled modes that hold real-time spoken conversations with natural pauses, varied pitch, and sub-three-second response times. You can interrupt mid-sentence if you like, and the AI adjusts. We experienced all of this in these sessions.

Fluent expression is not the same as intelligent expression, of course, but the AI systems didn't stumble here, either. They opened with structured, articulate responses that demonstrated command of relevant frameworks. They cited appropriate theorists. They identified alternatives. They addressed counterarguments. They maintained a professional tone (for the most part) throughout. The AI responses weren't just passable. They were polished.

Coverage Verification vs. Boundary Finding

After the experiment, I went back and analyzed the faculty questioning using my Ecology of Questions (EOQ) framework, a system I've been developing on sabbatical that evaluates questioning architectures across 42 factors drawn from 55 distinct questioning traditions, everything from Socratic dialogue to intelligence analysis to FBI crisis negotiation. EOQ gave me a structured way to look at what the examiners were asking, not just what the AI was saying.

And we found something that I think applies well beyond our walls.

Most of the questions fell into what I'd call coverage verification: Did the student hit the checkpoints? Did they reference the right frameworks? Did they mention the relevant actors? Once the checkpoints were confirmed, the examiners often moved on. One faculty member said it explicitly: "We're not here to play stump the chump."

This is a perfectly rational approach to oral examination if the purpose is to confirm that the student absorbed the curriculum. Unfortunately, that is also exactly what AI is best at. AI can "hit checkpoints" across virtually any domain without having "learned" anything. It has the form of understanding without the substance.

The alternative is boundary finding: probing until you discover where understanding actually breaks down. Pushing past the prepared answers into territory the student didn't anticipate. Challenging positions. Introducing new information and watching how they respond. Finding the edge of what they know and examining how they behave at that edge.

Boundary finding is harder to do. It requires more than content expertise from the examiner. It requires the examiner to create productive discomfort and stick with it for a bit, rather than accepting a smooth performance at face value.

It's also the only version of an oral examination that AI likely can't currently handle.

What We Learned About What AI Can't Do

When the examination was strong, when the faculty pushed hard and created genuine diagnostic pressure, the differences between AI and human performance became visible. Here are some of the things that jumped out.

  • AI retrieves. It doesn't construct. We examined the same AI system on the same question with three different faculty teams. It gave essentially the same answer every time. Same structure, same examples, same alternatives, same order. In one regard, this is comforting. AI is often criticized for being inconsistent in its answers, i.e., if you don't like the answer an AI gives you, ask the question again. Modern AIs may have solved this issue. But from our perspective as examiners, it wasn't building analysis; it was deploying a template. A human student, asked the same question twice, would adjust because they'd notice the repetition, read the audience, and adapt.
  • AI can't hold a position under pressure. When our faculty challenged something an AI system had said correctly, the AI almost always capitulated. It agreed with the faculty member's (incorrect) challenge rather than defending its own reasoning. One system displayed the opposite problem: it refused to have positions at all, declining to offer judgment because "I don't have opinions." Both are failures of the same underlying ability: taking a position, holding it when the evidence supports it, and updating it when it doesn't.
  • AI can't manage a conversation. One AI system consumed an entire 10-minute thread with an initial response so comprehensive, so well-structured, and so thorough that the faculty couldn't find space to intervene. It wasn't an answer; it was a filibuster.
  • AI doesn't know what it doesn't know. When we asked AI systems to grade themselves against the rubric, they consistently rated themselves higher than the faculty did. One gave itself straight A's when the faculty gave B/B+. Calibrated self-awareness, knowing the quality of your own performance, is a hallmark of expertise. AI doesn't have it.

How to Build a Boundary-Finding Oral Exam

If the goal of your oral exam is to find the edges of what your students actually understand, and not just confirm they absorbed the curriculum, then it needs to be designed for that purpose. Coverage verification happens naturally along the way; you don't lose it by aiming higher, but the reverse isn't true. An exam designed for coverage won't accidentally find boundaries. Here's what my research suggests.

  • Scenarios, not questions. Instead of asking students to explain a concept or analyze a case, give them a situation that is underspecified, multi-actor, and genuinely ambiguous. Make sure there's no single right answer. Their preparation for the scenario is developmental. The exam tests what happens when preparation meets live, unpredictable conditions.
  • Reserve most of the time for follow-up. The opening response is the least diagnostic part of any oral exam. It tells you the student prepared. What happens after that, when you introduce new information, challenge their reasoning, or take the conversation somewhere they didn't expect, is where you actually learn what they can do. If the student's opening runs long, interrupt. Protect your diagnostic time.
  • Challenge something they got right. This is the single most discriminating move available. Push back on a correct position with a plausible counter-argument, delivered with confidence. If the student folds, they were managing you, not defending a position. If they push back and explain why they're right, they're demonstrating exactly the kind of calibrated judgment that matters. 
  • Watch for formulaic structure. If every answer follows the same template ("Here's the problem, here are the actors, here are two alternatives, here's my recommendation"), that may be a sign of weak analytical thinking. Or it may simply be a sign of strong prompt engineering, where the student used AI to prepare the template. Vary your approach: ask them to argue the other side, ask them what they'd do if their recommendation failed, ask them to explain the problem to someone outside their field. Break the template and see what's underneath.
  • Test their ability to think in front of you. The hardest thing for AI to fake, and the clearest sign of genuine understanding in a human, is visible thinking-in-progress. The pause. The partial sentence that gets revised. The "wait, actually, that contradicts what I said earlier." Productive struggle is not a sign of weakness. It's a sign that the student is actually engaging with the problem rather than performing a prepared answer. If the delivery is too smooth, that's a signal to push harder, not to relax.

Where This Leaves Us

I've been giving oral exams for more than twenty years. Early on, I started telling students to think in terms of three levels of questions.

  • Level 1 is knowledge-based and usually so straightforward that getting it wrong is itself a signal. If you're supposed to be an expert on Nigeria and I ask you the population, you should know. If you don't, something is wrong, and I'm going to dig in. 

  • Level 2 is appropriate to whatever the course or program expects. If you're in a master's program, I'm asking mastery-level questions and expecting mastery-level answers. Most of my questions are here.

  • Level 3 is where I push past what I think you can answer, or even where there may not be a clean answer at all. I'm testing the boundary of your knowledge, and because I don't think you can answer it (students can see these as "unfair" questions, which is why I warn them in advance), I'm also testing how you hold up when your knowledge runs out. A student who can handle Level 3 questions well (and there are many ways to handle them well) is likely in A territory.

Every good oral exam has all three levels. The proportion shifts depending on the context. An introductory course might be mostly Level 1 with some Level 2 and one Level 3 probe. A senior seminar should be mostly Level 2 with deliberate Level 3 probes. Even in a PhD dissertation defense, a committee member might ask a basic Level 1 question just to double-check that the candidate is grounded in the fundamentals, and they'd push a lot harder if the candidate got it wrong. It's always a spectrum. Coverage and boundary finding aren't opposites. They're different points on the same continuum, and every oral exam should know where it sits.

But here's what the MILBENCH data made visible: If your oral exam is weighted toward coverage, it isn't testing what humans can do and AI can't. That may be a deliberate design choice given the nature of your course. But it shouldn't be the default. Coverage verification is easier to do, easier to defend, and more comfortable for everyone in the room. Boundary finding requires the examiner to create discomfort, to push back on good-sounding answers, to challenge positions, to ask the question that might not have an answer. That's harder.

But there's a reason why doctoral programs and rigorous master's programs require oral defenses. It isn't just because the committee wants to probe boundaries, though they do. It's because the defense is where the candidate has to talk about their work. Explain it. Apply it to settings the written document didn't cover. Defend it against a serious challenge. Not just have a conversation about it, but actually hold their ground. 

Most of what our graduates will do after they leave the War College (and maybe yours, too) involves exactly this: speaking, not writing. Briefing a commander. Defending a recommendation to a skeptical colleague. The oral exam is where those abilities either show up or don't.

AI just raised the stakes on all of this. A coverage-verification oral exam can likely be passed by any voice-enabled AI system available today, at any level of the curriculum. A well-designed boundary-finding exam, one that probes how students think when their preparation runs out, whether they can defend what they believe, and whether they know the limits of their own understanding, tests exactly the things that matter most, regardless of whether AI exists.

Deeper thinking about oral exams is more important now than ever. But I don't think most educators realize how urgent the redesign is, because most of them have never heard a voice-enabled AI take an oral exam.

We have. It's impressive. And it should change how you think about yours.