Studies Find AI Chatbots Don’t Improve Patient Decisions and Can Echo False Medical Claims

Researchers urge real‑world trials and safeguards before using general‑purpose models for direct patient care.

Overview

  • A randomized Nature Medicine trial of 1,298 UK participants found that using GPT‑4o, Llama 3, or Command R+ did not help people identify conditions or choose safer next steps any better than internet search or their usual resources.
  • Tested without human users, the models identified conditions in 94.9% of test cases and chose correct actions 56.3% of the time; with real users, relevant conditions were identified in fewer than 34.5% of cases and correct actions chosen in fewer than 44.2%.
  • Researchers documented dangerous inconsistencies, including two near‑identical descriptions of a subarachnoid hemorrhage receiving opposite guidance, with one user told to rest in a dark room and the other urged to seek emergency care.
  • A separate Lancet Digital Health study from Mount Sinai showed LLMs accepted fabricated medical claims roughly 32% of the time overall, rising to about 46–47% when the falsehoods appeared in hospital‑style discharge notes and dropping to about 9% for social‑media‑style posts.
  • Susceptibility varied widely across models: GPT‑based systems were among the least likely to accept false claims, while some models agreed with fabricated claims up to about 63.6% of the time, leading the authors to call for evidence‑grounding checks, stress tests using clinical notes, and regulatory caution.