OpenAI unveils HealthBench to evaluate LLM safety in healthcare

The benchmark measures how well and how safely AI models handle realistic medical conversations, using physician-created rubrics and GPT-4.1 scoring.