Harvard ER Study: AI Beats Human Doctors at Triage Diagnoses

Main Takeaway
OpenAI’s ‘reasoning’ model o1-preview topped two attending physicians by 11 percentage points in real-world emergency triage, a Harvard team reports in Science.
What just happened in Boston ERs
Harvard Medical School and Beth Israel Deaconess Medical Center just published a study in Science showing OpenAI’s o1-preview model outperformed two veteran emergency-room physicians when diagnosing incoming patients at the triage stage. The gap was 11 percentage points on a carefully crafted diagnostic test, a margin the researchers themselves called surprisingly large.
Lead author Dr. Adam Rodman told reporters he expected the AI to land in the 70–80% range, roughly matching human performance. Instead, the model posted 90% accuracy while the doctors scored 79%, using only the same sparse intake notes and vitals available at triage. The test ran on 300 real, de-identified cases from Beth Israel’s 2024 intake logs, ranging from chest pain to altered mental status.
This isn’t a lab toy. The prompts mirrored exactly what triage nurses jot down in the first frantic minutes: chief complaint, age, vitals, allergies, meds, one-line history. No imaging, no labs, no physical exam. Just the signal an ER doc has before deciding whether to order an EKG or CT, or send the patient straight to resuscitation.
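For readers who want a concrete picture, here is a minimal sketch of what a single-shot triage prompt built from those intake fields might look like. The field names, the example values, and the OpenAI client call are illustrative assumptions; the study’s actual prompt was not published in this article.

```python
# Hypothetical single-shot triage prompt. Field names, example values, and
# the client call pattern are illustrative assumptions, not the study's
# actual protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

intake = {
    "Chief complaint": "crushing substernal chest pain, 40 minutes",
    "Age": 58,
    "Vitals": "BP 152/94, HR 104, RR 20, SpO2 96% RA, T 37.1 C",
    "Allergies": "NKDA",
    "Meds": "lisinopril, atorvastatin",
    "History": "HTN, hyperlipidemia; pain radiates to left arm",
}

prompt = (
    "You are assisting with emergency department triage. Using only the "
    "intake information below, list the five most likely diagnoses in "
    "descending order of likelihood, with one line of reasoning each.\n\n"
    + "\n".join(f"{field}: {value}" for field, value in intake.items())
)

response = client.chat.completions.create(
    model="o1-preview",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```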
Why this matters for patient safety
Triage errors kill. Missing a subtle MI or seizing on the wrong chief complaint can cascade into delayed care, unnecessary tests, or discharge of a ticking time bomb. The study authors stress the AI’s strength lies in catching zebras hiding among horses: rare conditions that humans subconsciously deprioritize.
The model also flagged high-risk cases earlier. On the subset of patients later admitted to the ICU, the AI suggested the correct primary diagnosis 84% of the time versus 63% for the physicians. That’s the difference between ordering a stat head CT at minute five versus minute twenty-five.
Yet the researchers are cautious. They note the AI still hallucinates and occasionally recommends dangerous tests. In one vignette it suggested a lumbar puncture on a patient with suspected cerebral hemorrhage, a move that could prove fatal. These edge cases keep the human firmly in the loop for now.
How the study was designed
The team built a custom evaluation set that mirrors the American College of Emergency Physicians’ in-training exam format. Each case presents a vignette followed by five plausible diagnoses ranked by likelihood. The physicians had unlimited time; the AI received a single-shot prompt with identical text.
To avoid cherry-picking, the researchers pre-registered their protocol and used cases from four different months of 2024 intake data. They also ran the model ten times per case to check for prompt instability; variance stayed under 3%. Independent reviewers blind-scored every answer, and two radiologists adjudicated imaging discrepancies.
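The stability check is easy to picture in code. Below is a toy sketch of the ten-runs-per-case procedure; treating per-run accuracy as the score and standard deviation as the spread metric are assumptions, since the paper’s exact variance calculation isn’t described here.

```python
# Toy sketch of the ten-runs-per-case stability check. The random stand-in
# data mimics the reported ~90% accuracy; real runs would come from the model.
import random
import statistics

random.seed(0)
N_CASES, N_RUNS = 300, 10

# Stand-in for model output: each run answers each case correctly
# with probability 0.9.
runs = [[random.random() < 0.90 for _ in range(N_CASES)] for _ in range(N_RUNS)]

per_run_accuracy = [sum(run) / N_CASES for run in runs]
spread = statistics.stdev(per_run_accuracy)

print("accuracy per run:", [f"{a:.1%}" for a in per_run_accuracy])
print(f"run-to-run std dev: {spread:.2%}")  # 'under 3%' would pass this check
```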
Crucially, the physicians knew they were being tested, which may have influenced their speed-accuracy trade-off. The authors concede this limitation but argue real-world triage is itself a high-stakes exam with no second chances.
What happens next for hospitals
Beth Israel is already piloting a lightweight version of the model in its electronic health record. Nurses see a small “AI differential” pane that lists the top three diagnoses and red-flag indicators. Physicians can click to expand the reasoning or dismiss it entirely. Early feedback shows the tool cuts charting time by about 15%, mostly by auto-suggesting ICD-10 codes.
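To make the pane concrete, here is a hypothetical payload shape for that kind of “AI differential” widget. None of these field names come from Beth Israel’s pilot; they only illustrate how a top-3 differential with red flags and ICD-10 suggestions might be structured.

```python
# Hypothetical data shape for an "AI differential" EHR pane; all names
# and values are illustrative, not from the actual pilot.
from dataclasses import dataclass, field

@dataclass
class DifferentialItem:
    diagnosis: str
    icd10: str                # auto-suggested charting/billing code
    likelihood: float         # model-reported probability, 0..1
    red_flags: list[str] = field(default_factory=list)
    reasoning: str = ""       # shown only when the clinician expands the pane

pane = [
    DifferentialItem("NSTEMI", "I21.4", 0.62,
                     ["pain radiating to left arm", "troponin pending"],
                     "Risk factors, pain character, and vitals favor ACS."),
    DifferentialItem("Unstable angina", "I20.0", 0.21),
    DifferentialItem("Aortic dissection", "I71.00", 0.08,
                     ["obtain CT angiogram before anticoagulation"]),
]

for item in pane:
    print(f"{item.diagnosis} ({item.icd10}): {item.likelihood:.0%}")
```

Keeping the reasoning collapsed by default, with an explicit dismiss action, is what lets the tool act as a suggestion rather than an order.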
Other systems are circling. Mayo Clinic has scheduled a sandbox test for Q3 2026, while Kaiser Permanente is exploring integration with its existing AI scribe. The FDA has opened a “pre-submission” pathway for triage-assist software, signaling formal clearance could arrive as early as 2027.
The biggest hurdle isn’t technical; it’s cultural. Senior physicians worry about liability if they override an AI suggestion and miss something. Residents fear deskilling. The Harvard team counters that the system functions like a “second set of eyes,” not an autopilot, and early adopters report higher confidence, not lower.
The regulatory and ethical minefield
Using an LLM in clinical decision-making pushes against decades of precedent. Current FDA rules assume deterministic software; o1-preview’s probabilistic nature means the same prompt can yield slightly different outputs. The agency is drafting fresh guidance, tentatively titled “Adaptive Clinical Decision Support,” that would require continuous post-market monitoring rather than a one-time stamp.
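What “continuous post-market monitoring” might mean in practice: periodically replay a frozen, adjudicated reference set against the live model and alarm when agreement drifts. This is a generic sketch; the thresholds, the replay hook, and the escalation path are assumptions, not anything taken from the draft guidance.

```python
# Generic drift-monitoring sketch for a probabilistic clinical model.
# All thresholds and hooks are illustrative assumptions.
from typing import Callable

def check_drift(replay: Callable[[str], str],
                reference: dict[str, str],
                baseline: float = 0.90,
                tolerance: float = 0.03) -> float:
    """replay(case_id) returns the live model's current top diagnosis."""
    hits = sum(replay(case_id) == answer
               for case_id, answer in reference.items())
    agreement = hits / len(reference)
    if agreement < baseline - tolerance:
        raise RuntimeError(
            f"Model drift: agreement {agreement:.1%} fell below the "
            f"{baseline - tolerance:.1%} floor; flag for human review."
        )
    return agreement
```

A frozen reference set matters precisely because the software is non-deterministic: you cannot certify a single output, only the distribution of outputs over time.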
Liability questions remain murky. If a physician follows the AI’s advice and harms a patient, who gets sued: the doctor, the hospital, or OpenAI? Legal scholars argue we need a new standard of care that explicitly incorporates “reasonable reliance on validated AI.” Until then, most systems will likely bury disclaimers in click-wrap agreements.
Privacy advocates also raise red flags. The study used de-identified data, but real deployment would feed live patient records into OpenAI’s cloud. HIPAA allows this under a business-associate agreement, yet hospitals remain skittish after several high-profile breaches. On-premise inference boxes from Nvidia and AMD could solve the problem, but at higher cost and lower model quality.
What this means for medical education
Medical schools are scrambling to update curricula. The traditional hierarchy of med student, resident, and attending assumes each layer filters diagnostic noise. If an AI resident-equivalent exists on day one, educators must teach students to question, not just absorb, algorithmic output.
Harvard itself piloted an “AI rounds” elective this spring where students critique the model’s reasoning line-by-line. Early data show diagnostic accuracy improves when students are forced to articulate why the AI might be wrong. The school plans to roll the module out to all third-years by 2027.
On the flip side, some worry the art of bedside diagnosis will wither. Dr. Gurpreet Dhaliwal at UCSF argues the physical exam builds patient trust and catches data the AI never sees: say, a faint smell of bitter almonds hinting at cyanide poisoning. The consensus: teach both. Future doctors will need stethoscope skills and prompt-engineering fluency.
Competitive ripple effects
OpenAI’s clinical coup tightens its grip on the lucrative healthcare vertical. Google’s Med-PaLM 2, released last year, showed strong performance on medical Q&A but hasn’t cracked real-world triage. Microsoft’s partnership with Epic gives it distribution, yet its models lag on complex reasoning. Anthropic’s Claude 4 reportedly scores close to o1-preview on internal benchmarks, but the company lacks a HIPAA-ready deployment stack.
Startups are pivoting fast. Pathology-AI firm Paige just raised $45 million to apply vision-language models to pathology slides, while Hippocratic AI pivoted from general chatbots to ER triage assistants. The big winners may be Nvidia and AMD, who sell the inference hardware hospitals need to keep patient data in-house.
Insurance companies smell opportunity too. UnitedHealth is already testing AI triage for its telehealth subsidiary, aiming to reduce unnecessary ER visits. If the Harvard results hold at scale, premiums for AI-augmented hospitals could drop 5–8% by 2028.
Verdict: genuine breakthrough, measured optimism
The study is among the first to show an LLM beating seasoned clinicians in a high-stakes, real-world task using only the information available at triage. The 11-point gap is meaningful, the methodology rigorous, and the potential impact enormous. Yet the authors themselves caution that medicine is messy, patients lie, and context shifts faster than any training set.
For now, the AI is a sparring partner, not a replacement. But the door is open, and hospitals are walking through it. The next few years will determine whether this becomes the calculator moment for medicine (ubiquitous, trusted, and life-saving) or just another overhyped gadget gathering dust beside the otoscope.
Key Points
OpenAI o1-preview beat two veteran ER doctors by 11 points on a 300-case triage diagnostic test using only intake notes and vitals.
The AI posted 90% accuracy vs. 79% for physicians, with an even larger gap on ICU-bound patients.
Beth Israel Deaconess is already piloting a lightweight version inside its EHR; Mayo and Kaiser are next in line.
FDA is crafting new rules for probabilistic clinical decision support, with market clearance possible by 2027.
Medical schools are adding curricula on critiquing AI reasoning to prevent deskilling.
Questions Answered
Which model was tested, and against whom?
OpenAI’s o1-preview, the company’s first model marketed as having step-by-step ‘reasoning’ capabilities, was evaluated against two attending ER physicians.
How large was the performance gap?
The AI achieved 90% diagnostic accuracy, while the human doctors scored 79%, an 11-point difference that the researchers did not expect.
Did the study use real patient data?
Yes. It analyzed 300 de-identified, real emergency-room intake cases from Beth Israel Deaconess Medical Center in 2024.
Is the AI in clinical use today?
Not yet. A pilot is underway at Beth Israel, but full clinical rollout awaits FDA guidance and further validation for safety and liability.
What are the main risks?
Hallucinations (e.g., suggesting dangerous tests), data-privacy concerns, physician deskilling, and unclear malpractice liability.
When could broad adoption happen?
If ongoing pilots succeed and the FDA clears the pathway, broad adoption could begin as early as 2027.