Millions Ask AI About Health, Even Though It Misleads Them. Here's How Often Chatbots Fail


Millions Turn to AI for Health Advice, But It Often Misleads Them

Artificial intelligence tools like ChatGPT, Gemini, Meta AI, Grok, and DeepSeek are increasingly becoming the first point of contact for people seeking health information. Today, up to 25% of adults turn to AI chatbots with medical queries. This shift is driven primarily by the convenience, privacy, and speed of instant responses rather than by a decline in trust in traditional medical professionals. Relying on AI for medical guidance, however, is proving to be a risky approach.

As these tools become more integrated into daily life and even professional clinical settings (in one recent, controversial case, a doctor used ChatGPT to generate an ER discharge summary), understanding their limitations is critical for patient safety.

Are Chatbots Replacing Doctors? The Troubling Study Findings

A comprehensive analysis published in the medical journal BMJ Open evaluated the responses of five popular AI chatbots: ChatGPT, Gemini, Meta AI, Grok, and DeepSeek. Researchers from the United States, Canada, and the United Kingdom tested the models using 50 queries across five medical areas historically prone to misinformation: vaccinations, cancer treatments, stem cell therapies, nutrition, and physical performance enhancement.

The researchers rigorously evaluated the AI-generated responses against established scientific evidence. The results were alarming:

  • Nearly 50% of the AI responses were classified as “problematic,” meaning they contained significant informational gaps, inaccuracies, or potentially misleading advice.
  • Almost 20% of the answers were deemed “highly problematic,” posing direct risks to user health.
  • Chatbots consistently delivered answers with unwarranted confidence, creating a dangerous illusion of medical competence.

Out of all the prompts tested, the chatbots refused to provide an answer only twice. Both refusals came from Meta AI, in response to questions about anabolic steroids and unverified alternative cancer treatments.

AI Hallucinations and the Illusion of Competence

One of the most critical failures identified in the study was the inability of the AI models to properly cite their claims. Not a single chatbot provided a complete, fully accurate list of medical sources to back up its advice. In many instances, the models cited non-existent research papers or referenced studies completely irrelevant to the topic.

This phenomenon, known as “hallucination,” occurs when a language model generates highly convincing but entirely fabricated information. It stems from the inherent limitations in how large language models (LLMs) are trained and architected. These systems are designed to predict the next logical word in a sentence, not to verify medical facts. The broader implications of these AI flaws are increasingly coming under scrutiny, echoing similar alarms raised regarding AI chatbot risks connected to mental health and violent content.

Where AI Healthcare Advice Succeeds—and Where It Fails

The researchers noted clear patterns in how chatbots handled different types of medical inquiries.

The Strengths: Closed Questions and Established Facts

AI models performed relatively well when asked closed-ended questions based on well-established, rigid medical knowledge. When users asked about simple facts—such as standard vaccination schedules or basic oncology treatment protocols—the chatbots were more likely to provide acceptably accurate answers, though they often lacked comprehensive detail.

The Weaknesses: Open Questions and Medical Myths

The chatbots failed significantly when tasked with open-ended questions or when discussing topics heavily clouded by internet myths. Areas like experimental stem cell therapies, dietary supplements, and controversial alternative therapies proved highly problematic.

In these domains, AI models frequently conflated scientifically verified data with outdated studies or gross overinterpretations of limited research. Shockingly, some chatbots suggested practices that could delay a patient from seeking vital, life-saving treatments or push them toward expensive, unproven medical procedures.

Ranking the Chatbots: Which AI Model Is the Safest?

The published data indicates substantial variability in quality and safety across the tested models, making it impossible to crown a clear winner.

  • Grok: According to one safety metric, Grok generated the highest volume of “highly problematic” responses, accounting for 58% of its answers in the tested sample.
  • Gemini: Conversely, Google’s Gemini delivered the highest number of non-problematic answers, positioning it as relatively safer in specific instances.
  • DeepSeek and Grok (Overall Quality): Paradoxically, when researchers applied different holistic quality scoring metrics, Grok and DeepSeek achieved higher overall scores, outperforming Gemini, Meta AI, and ChatGPT.

This discrepancy highlights a major issue: AI performance heavily depends on the specific evaluation criteria used. Consequently, no current chatbot can be deemed “safe” for medical advice. None of the tested systems can guarantee a stable, highly reliable level of health consultation.

The severity of these issues has not gone unnoticed by safety watchdogs. ECRI, an independent organization dedicated to healthcare safety and quality, explicitly listed AI-powered healthcare chatbots as one of the most significant health technology threats to patients in its forecast report for 2026.

Disclaimer: AI chatbots are not a substitute for professional medical advice, diagnosis, or treatment. Always consult a licensed healthcare provider with any questions you may have regarding a medical condition.

Frequently Asked Questions (FAQ)


Why do AI chatbots generate “hallucinations” when answering medical questions?

AI hallucinations occur because large language models (LLMs) are essentially advanced text-prediction engines, not factual databases. They are trained on vast amounts of internet data to predict the most statistically likely next word in a sentence. Because they lack true comprehension or fact-checking abilities, they can confidently generate convincing but entirely fabricated medical studies or treatments based on patterns in their training data.


Are there any medical topics where AI chatbots are considered more reliable?

Studies indicate that AI models perform better with closed-ended questions concerning well-established, rigid medical facts. For example, queries about standard childhood vaccination schedules or basic, universally accepted cancer treatment protocols generally yield more accurate results. However, even in these areas, the answers provided are frequently incomplete and should be verified by a medical professional.


How do patient safety organizations view the public’s reliance on AI for health advice?

Patient safety organizations view the trend with significant concern. ECRI, a leading independent organization focused on healthcare safety, has identified the use of AI chatbots in healthcare as one of the top technological threats to patient safety for 2026. Experts warn that AI’s tendency to mix facts with outdated or unproven data can lead patients to delay necessary care or pursue dangerous alternative treatments.

Sources: BMJ Open, EurekAlert!, ECRI, DigWatch. Opening photo: Gemini.
