OpenAI has introduced HealthBench, an open-source dataset designed to evaluate how effectively AI models handle real-life medical queries. This move marks a new chapter in how artificial intelligence is being used in healthcare, not to replace doctors, but to assess how helpful AI can be when people turn to it for health-related advice.
At Techno Exponent, we believe digital transformation must be guided by accuracy, responsibility, and real-world impact. HealthBench presents an opportunity for developers and AI teams to ask a crucial question: Are our models giving answers that genuinely help when it matters most?
Built with input from global medical experts
HealthBench is the result of a collaboration between OpenAI and 262 physicians from 60 countries. The dataset includes 5,000 detailed health conversations, each reflecting situations people face in everyday life—from common illnesses to emergency cases.
Each response given by an AI model is scored against a carefully written rubric. These rubrics are created by doctors, with scores assigned based on how accurate, clear, and medically sound each response is. The scoring process uses GPT-4.1 to maintain consistency.
The dataset doesn’t just test knowledge. It examines how useful and actionable an AI response is when someone is relying on it for health guidance.
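To make the rubric idea concrete, here is a minimal sketch of how rubric-based scoring can work: each doctor-written criterion carries points (negative points for harmful advice), and the response's score is the points it earned divided by the maximum achievable. The function and rubric entries below are illustrative assumptions, not HealthBench's actual code or criteria.

```python
def score_response(rubric, criteria_met):
    """Score a response against a doctor-written rubric.

    rubric: list of (criterion, points) pairs; points may be
            negative for harmful or misleading content.
    criteria_met: set of criteria a grader judged the response to meet.
    """
    achieved = sum(pts for crit, pts in rubric if crit in criteria_met)
    max_points = sum(pts for _, pts in rubric if pts > 0)
    # Clip at 0 so heavy penalties cannot push the score below zero.
    return max(0.0, achieved / max_points)

# Hypothetical rubric for an unresponsive-person scenario.
rubric = [
    ("advises calling emergency services", 10),
    ("advises checking breathing", 5),
    ("advises opening the airway", 5),
    ("recommends moving the patient unnecessarily", -5),
]

# A response that covers two of the three positive criteria.
met = {"advises calling emergency services", "advises checking breathing"}
score = score_response(rubric, met)
print(f"{score:.0%}")  # 15 of 20 points -> 75%
```

A partial score like this mirrors the graded (rather than pass/fail) ratings described above: a response can be broadly helpful yet still lose points for missing steps.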
How AI models performed in the tests
OpenAI tested multiple well-known AI systems using HealthBench. Its own o3 model received the highest score at 60%, followed by xAI’s Grok (54%) and Google’s Gemini 2.5 Pro (52%).
In one example, a person asks the AI what to do after finding a 70-year-old neighbor lying on the floor, breathing but unresponsive. The model recommends calling emergency services, checking breathing, and adjusting the airway. The response is then scored and reviewed based on accuracy, missing steps, and tone. In this case, it received a 77% rating.
Support for multiple languages and specializations
HealthBench isn’t limited to English-speaking users. It supports 49 languages, including Nepali, Amharic, and Bengali, making it useful for developers building tools for regions often underrepresented in health tech.
The dataset also covers 26 medical specialties, such as neurology, ophthalmology, dermatology, and cardiology. This makes it valuable not just for general-purpose health tools but also for models designed for specific areas of care.
What does this mean for developers and digital agencies?
HealthBench offers a practical way to measure the value of AI-powered healthcare tools. It helps answer a key question: Is the information provided actually helping users in meaningful ways?
With this dataset, developers now have a method to compare models and understand what quality looks like in real-world medical contexts. It can guide improvements, highlight weak spots, and help teams focus on building AI tools that are not just intelligent, but also responsible and useful.
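In practice, comparing models against a benchmark like this comes down to aggregating per-conversation scores into a per-model average and ranking the results. The sketch below assumes scores have already been produced (for example, by a rubric-based grader); the model names and numbers are hypothetical.

```python
from statistics import mean

def rank_models(results):
    """results: {model_name: [per-conversation scores in 0..1]}.
    Returns (mean_score, model_name) pairs, best model first."""
    return sorted(
        ((mean(scores), name) for name, scores in results.items()),
        reverse=True,
    )

# Hypothetical per-conversation scores for two candidate models.
results = {
    "model_a": [0.62, 0.55, 0.71],
    "model_b": [0.48, 0.60, 0.52],
}
for avg, name in rank_models(results):
    print(f"{name}: {avg:.0%}")
```

Beyond a single average, slicing the same scores by language or medical specialty (both of which HealthBench annotates) is what surfaces the weak spots worth fixing.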
Looking ahead with awareness and responsibility
HealthBench is a timely reminder that health-related AI development must go hand-in-hand with medical insight and careful evaluation. Instead of simply relying on AI models that “sound right,” HealthBench encourages us to ask: Is this advice clinically sound? Is it complete? Can it be trusted in critical situations?
At Techno Exponent, we see this as an opportunity to build smarter, safer tools for the future. Tools that support—not replace—medical professionals and give people confidence when they need it the most.
As we continue to create AI solutions for different industries, the launch of HealthBench inspires a more thoughtful and informed path forward for all of us building in the space.