
Artificial intelligence continues to reshape healthcare, and a new study spotlights GPT-5's striking medical reasoning abilities in radiology and radiation oncology. Research led by Mingzhe Hu and colleagues at Emory University benchmarked OpenAI's latest model, revealing strong zero-shot performance on multimodal medical tasks. GPT-5 scored 90.7 percent on a board-level medical physics exam with no task-level instruction, surpassing its predecessor, GPT-4o, by 12.7 percentage points. These results suggest the model can integrate textual, visual, and numerical information, a capability that, if confirmed, could transform diagnosis and clinical decision-making worldwide.
GPT-5’s Zero-Shot Capabilities in Medical Physics
The standout finding of this research is GPT-5's success on zero-shot multimodal tasks. Unlike traditional AI systems that require fine-tuning on domain-specific datasets, GPT-5 addresses complex medical challenges by drawing on its extensive pre-trained knowledge. Given a series of complex medical physics questions modeled on real-world board exams, it reached 90.7 percent accuracy, comfortably ahead of GPT-4o's 78.0 percent. Notably, the model handled both closed-ended multiple-choice questions and open-ended descriptive answers, adapting its responses to the nature of each query.
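To make the zero-shot setup concrete, the sketch below shows how a single board-style, image-plus-text question might be posed to a multimodal model with no task-level instruction beyond the question itself. It assumes the OpenAI Python SDK and a hypothetical "gpt-5" model identifier; it illustrates the prompting pattern only and is not the study's actual evaluation harness.

```python
# Minimal sketch of a zero-shot multimodal query, assuming the OpenAI Python SDK
# and a hypothetical "gpt-5" model identifier (for illustration only).
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def ask_board_question(image_path: str, question: str, choices: list[str]) -> str:
    """Send one radiology image plus a board-style question, with no task-level
    instruction or fine-tuning (i.e., zero-shot)."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")

    prompt = question + "\nChoices: " + "; ".join(choices) + "\nAnswer with one choice."
    response = client.chat.completions.create(
        model="gpt-5",  # hypothetical identifier, not confirmed by the article
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content.strip()
```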
In radiology and radiation oncology, where workflows involve aligning imaging data with clinical reports and numerical planning systems, GPT-5 shows particular strength in cross-modal reasoning. The study emphasized that its accurate responses reflect deeper interpretive ability rather than simple pattern recognition. Whether distinguishing abnormal tissue, understanding radiation dose distributions, or recalling specific anatomical details, GPT-5 exhibited reasoning consistent with board-level competency. This matters most in high-stakes medical environments, where errors can have life-altering consequences. By succeeding without tailored datasets, GPT-5 demonstrates that a generalized yet robust AI system can feasibly support clinicians across multiple diagnostic and treatment domains.
Benchmark Datasets and Imaging Performance
The researchers tested GPT-5 across three rigorous benchmarks: VQA-RAD, SLAKE, and a medical physics exam clone designed to mimic board certification standards. VQA-RAD evaluated the model's ability to answer diagnostic questions about radiology images, while SLAKE, a semantically labeled, knowledge-enhanced dataset comprising more than 14,000 annotated questions, assessed multimodal interpretive skills. Together, these ensured GPT-5 faced both image-heavy and knowledge-driven reasoning challenges. Notably, SLAKE's bilingual structure allowed the model to demonstrate versatility across languages, an advantage for adoption in diverse healthcare ecosystems worldwide.
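For illustration, the sketch below shows one common way closed-ended accuracy is scored on visual question answering benchmarks of this kind; the field names and exact-match criterion are assumptions about the data format, not the study's published protocol.

```python
# A minimal sketch of closed-ended VQA scoring, as might apply to VQA-RAD or
# SLAKE; "question", "answer", and "answer_type" are assumed field names.
from typing import Callable

def closed_ended_accuracy(samples: list[dict], predict: Callable[[dict], str]) -> float:
    """Exact-match accuracy over yes/no and multiple-choice items."""
    closed = [s for s in samples if s.get("answer_type") == "CLOSED"]
    correct = sum(
        predict(s).strip().lower() == s["answer"].strip().lower()
        for s in closed
    )
    return correct / len(closed) if closed else 0.0

# Example usage with a stubbed predictor that always answers "yes":
demo = [
    {"question": "Is there a pleural effusion?", "answer": "yes", "answer_type": "CLOSED"},
    {"question": "Is the heart enlarged?", "answer": "no", "answer_type": "CLOSED"},
]
print(closed_ended_accuracy(demo, lambda s: "yes"))  # prints 0.5
```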
The results highlight clear advantages over prior generations. GPT-5 showed noticeable accuracy improvements in chest, lung, and brain imaging interpretation, yielding up to 20% performance gains compared to GPT-4o. Complex regions like lung tissue, which demand precise localization of abnormalities, were handled with improved consistency. Beyond the largest version, smaller variants such as GPT-5-mini and GPT-5-nano were also tested. While these lighter models performed reliably on simpler classification tasks, the full-scale GPT-5 demonstrated superior reasoning depth, indicating that enhancements are qualitative rather than purely scaling-based. Together, these benchmarks confirm that GPT-5 is capable not only of answering medical queries but of generalizing knowledge robustly across diverse imaging modalities and formats.
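As a quick arithmetic check on the headline figures, the snippet below distinguishes the absolute gain in percentage points from the relative improvement implied by the 90.7 percent and 78.0 percent exam scores quoted above; the per-region gains reported by the study are not recomputed here.

```python
# Worked check of the exam-score gap quoted in the article.
gpt5_acc, gpt4o_acc = 0.907, 0.780

absolute_gain = gpt5_acc - gpt4o_acc        # 12.7 percentage points
relative_gain = absolute_gain / gpt4o_acc   # ~16.3% relative improvement

print(f"Absolute gain: {absolute_gain * 100:.1f} percentage points")
print(f"Relative gain: {relative_gain * 100:.1f}% over GPT-4o")
```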
Implications and Future Directions
This research marks a potential turning point in the use of AI in healthcare. GPT-5 could prove invaluable in streamlining diagnostic processes, minimizing human error, and extending specialist medical knowledge to settings that lack access to advanced expertise. Its strong handling of multimodal data, spanning text, images, and numbers, hints at applications as diverse as radiology reporting and oncology treatment planning. Yet the findings still await peer-reviewed validation, and real-world clinical testing and regulatory evaluation remain the crucial next steps before the model can be applied at scale.