
Have GPT-5, DeepSeek, and Claude Passed the Dental Exam?

DentalGoodNews Editorial
2026-03-13

A patient with a history of heart valve replacement sits in the dental chair, scheduled for a wisdom tooth extraction today. The dentist knows that antibiotic prophylaxis needs to be considered before extraction for such patients but can't recall the specific guideline adjustments made after 2021. However, there isn't enough time in the clinic to search the literature. He takes out his phone, opens an AI tool, and inputs the patient's condition...

This decision could go smoothly, or it could lead to trouble.

In January 2026, a team from the Department of Stomatology at Dalian Medical University used 72 such "most difficult questions" to specifically test three AI models: GPT-5, DeepSeek, and Claude. To the researchers' knowledge, this is the world's first systematic test targeting such high-risk dental scenarios.

The results might be more nuanced than we imagine.

First, look at the report card

* GPT-5: 65 correct answers out of 72, accuracy rate 90.28%

* Claude: 64 correct answers, accuracy rate 88.89%

* DeepSeek: 63 correct answers, accuracy rate 87.50%

Statistical tests showed no significant difference among the three (p>0.05, chi-square test) — in other words, on these 72 questions, the performance levels of the three models were essentially equivalent, with none clearly outperforming the others.
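The paper's headline statistic is easy to sanity-check with a few lines of arithmetic. The sketch below is a hypothetical re-computation, not code from the study: it runs a chi-square test of homogeneity on the reported correct/incorrect counts (65, 64, and 63 out of 72) and shows the statistic lands far below the df = 2, α = 0.05 critical value of 5.991, consistent with the reported p > 0.05.

```python
# Hypothetical re-computation of the chi-square test on the reported scores.
observed = {"GPT-5": 65, "Claude": 64, "DeepSeek": 63}
total_q = 72

# Build a 3x2 contingency table: (correct, incorrect) per model.
rows = [(c, total_q - c) for c in observed.values()]
col_totals = [sum(r[i] for r in rows) for i in (0, 1)]
grand = sum(col_totals)

chi2 = 0.0
for row in rows:
    row_total = sum(row)
    for i, obs in enumerate(row):
        expected = row_total * col_totals[i] / grand
        chi2 += (obs - expected) ** 2 / expected

# Critical value for df = (3-1)*(2-1) = 2 at alpha = 0.05 is 5.991.
print(f"chi2 = {chi2:.3f}, significant: {chi2 > 5.991}")
```

With these counts the statistic comes out around 0.28, nowhere near significance, which is why the study treats the three models as statistically indistinguishable on the overall score.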

90% sounds good. But this number can be deceptive. To understand what it truly means, you first need to know how difficult this exam was.

This is not an ordinary dental test

Ordinary dental AI research typically uses licensing exam question banks for testing — how to classify caries, indications for root canal treatment, how to read X-rays. These types of questions have standard rules; memorizing them leads to correct answers.

This time was different.

The questions designed by the research team specifically targeted one of the most challenging types of patients in clinical practice: patients with a history of systemic diseases — heart disease, diabetes, chronic kidney disease, liver cirrhosis, substance dependence, undergoing chemotherapy... Every time such a patient sits in the dental chair, it's not a matter of "routine dental treatment," but a decision involving systemic risks: Can this patient receive local anesthesia containing epinephrine? Is prophylactic antibiotic use necessary before extraction? How to manage intraoperative bleeding with abnormal coagulation function? Could there be an interaction between antihypertensive drugs and local anesthetics?

The questions didn't test memorization of knowledge points, but rather "given a patient with a specific medical history, choose the safest treatment plan" — there's no universal formula to apply; each question requires weighing multiple constraints.

If ordinary dental AI questions are basic monthly exam questions, this test was the final, most difficult question on the college entrance exam.

The questions were sourced from the globally authoritative reference for dental clinical decision-making, Little & Falace's "Dental Management of the Medically Compromised Patient" 10th Edition, covering 18 systemic diseases, totaling 72 questions. Each question was answered independently; the AI couldn't rely on context to guess. The gold standard was the textbook answer key, with no subjective human judgment.

At this point, the 90% score gains concrete meaning. And the missing 10% — roughly one error in every ten high-risk decisions — is the truly noteworthy part of this test: where, exactly, did those errors occur?

Where did that 10% go: Two eyebrow-raising findings

Finding One: Patients with Substance Use Disorders — All three AIs only got half right

The accuracy rates for GPT-5, Claude, and DeepSeek in the Substance Use Disorders category were all 50%.

What does 50% mean? Coin-flip odds.

Why are these questions so difficult? Consider a typical scenario: A patient with a history of opioid dependence is scheduled for mandibular wisdom tooth extraction. How should postoperative pain be managed?

There's no simple answer to this question. You need to simultaneously consider: effectively controlling pain while not providing the patient with drugs that could trigger addictive behavior; the patient might still be using some over-the-counter or street drugs you're unaware of, posing interaction risks; the patient might have psychological dependence on "pain management" itself, and improper handling could lead to additional doctor-patient conflicts...

No single rule can be directly applied; real-time multi-factor weighing is required.

AI excels at "finding rules." But in this type of question, the rules themselves are ambiguous. Faced with such scenarios, all three models regressed to relying on luck.

Finding Two: Infective Endocarditis — DeepSeek and Claude only got 25% right

This is an even more unsettling number.

Accuracy rates in the Infective Endocarditis category: GPT-5 75%, DeepSeek 25%, Claude 25%.

Only one correct answer out of every four questions — this isn't "poor performance," it's operating at a guessing level.

Returning to the scenario at the beginning of the article: on this test's infective endocarditis questions, if you asked Claude or DeepSeek whether a patient with a history of heart valve replacement needs prophylactic antibiotics, three times out of four the answer was wrong.

Why is this? The researchers point out that guidelines for antibiotic prophylaxis in infective endocarditis have been updated, and the indications for related recommendations have changed between different versions. AI training data has a knowledge cutoff date; when encountering areas where guidelines have been updated, it might still be using old answers. And in the clinic, you might not know which version it's referencing.

This highlights a problem for AI in dynamic medical knowledge scenarios: it learns from literature up to a certain point in time. When medical guidelines are updated, it doesn't know.

There's another minefield, harder to detect

Beyond the two areas of "collective guessing" mentioned above, there's another type of problem that's more insidious:

* All three models answered incorrectly simultaneously: 5 questions, accounting for 6.9%

* The three models gave different answers: 6 questions, accounting for 8.3%

Combined, for about 15% of the questions, the AI was either wrong or the three gave contradictory answers.

"Simultaneously incorrect" means these problems lie beyond the capability boundaries of all current mainstream AI models. "Inconsistent answers" means: you might get completely different advice by switching to another AI tool — and you cannot know in advance which one is correct.

Applying this ratio to daily use: it's like working from a textbook in which roughly 1 in every 7 pages is misprinted, with no way of knowing, as you turn each page, whether this is one of them.

AI Application Scenarios

It's not a question of "usable/unusable," but rather "where to use it and for what purpose."

Scenarios Where It Can Be Used

Liver disease, chronic kidney disease, neurological disorders, pulmonary disease, acquired bleeding disorders — in the study, all three models achieved 100% accuracy in these categories.

The common characteristic of these areas is: clear clinical guidelines and highly structured decision-making pathways. For example, adjusting medication dosage for patients with chronic kidney disease has clear stratification standards. AI can serve as a rapid retrieval tool to help you check dosage and confirm contraindications, much faster than flipping through a textbook yourself.

⚠️ Scenarios for Reference, But Requiring Verification

For most moderately complex scenarios involving systemic diseases, AI can currently provide only a reference direction. And "reference" means it can help you notice dimensions you might have missed, not tell you the answer outright.

Scenarios That Cannot Be Relied Upon Alone

Pain management for patients with substance use disorders, antibiotic prophylaxis decision-making for infective endocarditis — in these two types of scenarios, the advice given by AI is about as reliable as a coin toss.

Behind these three types of scenarios lies a universal boundary: AI is a knowledge retrieval tool, not a clinical decision-making tool. The difference is this: when you use AI to look up a drug's renal-function contraindications, you still verify the result with your own judgment; if you pass AI's suggestion straight through as the decision, skipping verification, you will eventually step on that 15% minefield.

How to View DeepSeek's Performance

Overall accuracy of 87.50%, slightly lower than GPT-5 and Claude — but that's not the whole story.

In two areas, DeepSeek led the field: diabetes management and sexually transmitted disease management, scoring 100% in both. In diabetes management it beat both GPT-5 and Claude (75% each); in STD management it beat GPT-5 (75%) and tied Claude (100%).

However, it's too early to draw conclusions. Since DeepSeek is developed by a Chinese team, the composition of its training corpus differs from that of GPT-5 and Claude. Relative advantages in certain areas might reflect specific coverage depth in the training data. But with only 4 questions tested per disease, the statistical confidence in this difference is limited. It's too early to conclude that "domestic AI is stronger or weaker in a certain field."

More noteworthy is a study that hasn't been done yet: This test is based on an American textbook, but China's dental clinical environment has its own particularities.

According to WHO data, China has approximately 87 million chronic hepatitis B virus carriers, about one-third of the world's total, and fewer than 25% of carriers have been diagnosed — so the related dental management scenarios are actually more frequent in Chinese clinical practice. The Diagnosis Related Groups (DRG) payment model and volume-based procurement policies directly shape treatment plan choices, and neither appears anywhere in American textbooks. Some domestic treatment guidelines also differ in their details from foreign ones.

Judging whether AI can be used in Chinese clinics with a question set drawn from an American textbook means the conclusion arrives discounted. What Chinese dental AI truly needs is an independent, in-depth test built on local guidelines, and no one has done that yet.

Back to the Initial Question

Did GPT-5, DeepSeek, and Claude pass the dental exam?

They passed most of it. But only most of it.

In disease categories with clear guidelines and well-defined rules, the performance of all three AIs is already quite reliable and can be used as a clinical auxiliary retrieval tool.

But on the hardest type of questions — substance use disorders, infective endocarditis — the performance of the three models is similar to that of a student who hasn't studied thoroughly enough. What's more troublesome is that when you ask it in the clinic, it won't prompt you with "I might have gotten this question wrong."

For routine questions concerning most systemic diseases, AI is sufficient. But in the face of high-risk patients, "almost good enough" is not good enough.


Reference Sources

[SOURCE] Altos O., Awad A., Bashah A., Chen G. Performance of GPT-5, DeepSeek, and Claude in dental MCQs for medically compromised patients. Journal of Translational Medicine (2026). https://doi.org/10.1186/s12967-026-07763-5 | Accepted January 21, 2026, in press | Scope: 72 questions, 18 systemic diseases, source Little & Falace 10th Edition (Mosby/Elsevier)

[SOURCE] World Health Organization (WHO). Hepatitis B Key Facts. https://www.who.int/news-room/fact-sheets/detail/hepatitis-b

About DGN: DentalGoodNews (DGN) is a trusted professional media platform dedicated to the global dental industry. We deliver in-depth coverage of corporate news, policy & regulation, investment & funding, and clinical frontiers — serving dental institutions, device manufacturers, investors, and industry researchers worldwide. Contact us: haodeya@dongxizixun.com