A new systematic review reveals that only 5% of health care evaluations for large language models use real patient data, with significant gaps in assessing bias, fairness, and a wide range of tasks, underscoring the need for more comprehensive evaluation methods. Study: Testing and Evaluation of Health Care Applications of Large Language Models.
In a recent study published in JAMA, researchers from the United States (U.S.) conducted a systematic review of existing large language models (LLMs) used for healthcare applications, examining aspects such as the healthcare tasks addressed and the types of data assessed, to identify the areas of healthcare where LLMs are most useful.
Background
The use of artificial intelligence (AI) in healthcare has advanced rapidly, especially with the development of LLMs. Unlike predictive AI, which is used to forecast the outcomes of processes, generative AI using LLMs can create a wide range of new content, such as images, sounds, and text. Based on user inputs, LLMs can generate structured and largely coherent text responses, which makes them valuable in the healthcare field.
In some U.S. health systems, LLMs are already being used for notetaking and are being explored more broadly in medicine as a way to improve efficiency and patient care.
However, the sudden interest in LLMs has also resulted in unstructured testing of LLMs across various fields, and the performance of LLMs in clinical settings has not been systematically evaluated.