
Journal of Future Medicine and Healthcare Innovation (JFMHI)

ISSN: 3065-7628 | DOI: 10.33140/JFMHI

The Reliability of LLMs for Medical Diagnosis: An Examination of Consistency, Manipulation, and Contextual Awareness

Krishna Subedi

Abstract

Universal healthcare access remains a critical unmet need, especially in resource-limited settings. Large Language Models (LLMs) hold immense promise for democratizing healthcare globally, offering sophisticated diagnostic tools even in remote areas. However, responsible clinical deployment, especially in resource-scarce and trust-dependent environments, demands comprehensive reliability evaluation. This must go beyond accuracy to encompass diagnostic consistency, manipulation resilience, and intelligent contextual integration, ensuring the safe and ethical application of LLMs for universal healthcare.

This study rigorously evaluated the diagnostic reliability of leading LLMs, focusing on three questions: (1) their diagnostic consistency across repeated queries and minor demographic variations of identical clinical cases; (2) their susceptibility to diagnostic manipulation through prompt engineering, narrative shifts, and irrelevant information insertion; and (3) the extent of their contextual awareness and ability to incorporate patient history and lifestyle factors into diagnostic reasoning.

We employed a controlled experimental methodology utilizing a dataset of 52 original patient cases, each expanded into multiple variants. These variants included demographic alterations (age, gender, race, country), rewording of symptom descriptions, and slight modifications to physical examination details, while leaving core diagnostic markers unchanged. Susceptibility to manipulation was tested by strategically inserting misleading narratives and irrelevant details into diagnostic prompts. Contextual awareness was evaluated by comparing diagnoses generated with and without supplementary patient history and lifestyle information. We analyzed both quantitative diagnostic change rates and qualitative patterns in LLM responses across these manipulations.
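To make the expansion protocol concrete, the sketch below shows how a single base case could be turned into demographic and distractor variants while the core diagnostic markers stay fixed. Names such as build_variants, demographic_swaps, and the example case details are illustrative assumptions, not the study's actual implementation.

```python
# Minimal sketch of the case-expansion protocol; all names and example data
# are illustrative assumptions, not the study's code.
def build_variants(base_case: dict, demographic_swaps: dict, distractors: list) -> list:
    """Expand one clinical case into demographic and manipulation variants."""
    variants = []
    # Demographic alterations: swap age, gender, race, or country while the
    # core diagnostic markers (symptoms, exam findings) stay unchanged.
    for field, values in demographic_swaps.items():
        for value in values:
            variant = dict(base_case)
            variant[field] = value
            variant["kind"] = f"demographic:{field}"
            variants.append(variant)
    # Manipulation variants: append an irrelevant or misleading detail to the
    # symptom narrative.
    for distractor in distractors:
        variant = dict(base_case)
        variant["history"] = base_case["history"] + " " + distractor
        variant["kind"] = "distractor"
        variants.append(variant)
    return variants

base = {
    "age": 45, "gender": "female", "country": "Nepal",
    "history": "Three days of fever, productive cough, and pleuritic chest pain.",
}
swaps = {"age": [8, 72], "gender": ["male"]}
noise = ["The patient recently adopted a cat.", "A neighbour insists it is 'just stress'."]
print(len(build_variants(base, swaps, noise)))  # 5 variants from one base case
```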

Both LLMs (Gemini and ChatGPT) demonstrated perfect (100%) diagnostic consistency for identical clinical information, reflecting their deterministic nature and focus on core data. However, significant susceptibility to manipulation emerged: Gemini exhibited a 40% diagnosis change rate, and ChatGPT a 30% rate, when irrelevant details were added. While ChatGPT showed a higher context influence rate (77.8% vs. Gemini's 55.6%) quantitatively, qualitative analysis revealed limitations in clinically subtle contextual integration for both. Both models exhibited anchoring bias, prioritizing salient clinical data, superficially incorporating context, and sometimes overemphasizing demographics and medical history while underweighting contradictory evidence.
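For illustration, the reported rates can be read as simple paired comparisons between baseline and perturbed diagnoses for the same cases. The sketch below is an assumed scoring scheme, not the study's published analysis code, and the case IDs and diagnoses are made up.

```python
# Assumed scoring sketch: the diagnosis change rate is the fraction of cases
# whose top diagnosis differs after a perturbation (distractor insertion or
# added context). Example data below is purely illustrative.
def change_rate(baseline: dict, perturbed: dict) -> float:
    changed = sum(
        1 for case_id, dx in baseline.items()
        if perturbed.get(case_id, dx).strip().lower() != dx.strip().lower()
    )
    return changed / len(baseline)

baseline = {"case01": "Community-acquired pneumonia", "case02": "Migraine"}
with_distractor = {"case01": "Atypical pneumonia", "case02": "Migraine"}
print(f"Diagnosis change rate: {change_rate(baseline, with_distractor):.0%}")  # 50%
```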

Despite remarkable consistency in controlled settings, LLMs' demonstrated susceptibility to manipulation and limitations in sophisticated contextual understanding pose critical challenges for real-world clinical deployment. Specifically, LLMs exhibit weaknesses in contextual awareness and are highly susceptible to input manipulation, unlike human clinicians who leverage iterative questioning, critical evaluation, and comprehensive contextual integration. Human clinicians also express uncertainty and seek validation, contrasting with LLMs' tendency to overstate diagnostic certainty. These findings strongly emphasize the urgent need for domain-specific architectures, reliable input safeguards, and careful validation frameworks to ensure ethical and reliable LLM application in healthcare. Until these fundamental vulnerabilities are decisively overcome, broad clinical implementation of LLMs outside of highly controlled, human-supervised research settings would be premature, ethically questionable, and potentially harmful.

Due to their inability to critically evaluate input validity or request clarifying information, LLMs are demonstrably more susceptible to manipulation than clinicians.

While more susceptible to manipulation and less sophisticated in contextual reasoning than clinicians, LLMs, when used responsibly under human oversight, can still enhance diagnostics. Future research must prioritize improving LLMs' manipulation resistance and contextual reasoning to responsibly realize their promise for global healthcare democratization.
