Can ChatGPT be used for medical treatment? Scientists explain its risks

ChatGPT shows high accuracy when generating medical terms, but researchers from the U.S. and Poland concluded that it cannot yet be used as an independent tool for diagnosis or treatment. They tested the model on several key tasks, a Kazinform News Agency correspondent reports, citing Cell.


How accurate are the medical terms generated by ChatGPT?

In the first part of their study, the researchers evaluated how accurately ChatGPT identifies medical terms such as disease names, medications, symptoms, and genes. They verified its outputs against specialized medical databases.
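The article does not reproduce the study's validation code, but the basic check it describes, comparing each generated term against a reference vocabulary, can be sketched in a few lines. In the Python sketch below, the file name, the example terms, and the exact-match criterion are illustrative assumptions rather than details from the study:

```python
# Minimal sketch (not the authors' code): checking model-generated terms
# against a reference vocabulary and computing an accuracy rate.
# "reference_terms.txt" and the example terms are hypothetical placeholders.

def load_reference_terms(path: str) -> set[str]:
    """Load one normalized term per line from a reference vocabulary file."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def term_accuracy(generated: list[str], reference: set[str]) -> float:
    """Fraction of generated terms that appear in the reference vocabulary."""
    if not generated:
        return 0.0
    hits = sum(term.strip().lower() in reference for term in generated)
    return hits / len(generated)

if __name__ == "__main__":
    diseases = load_reference_terms("reference_terms.txt")  # e.g. disease names
    generated = ["type 2 diabetes", "hypertension", "sad mood"]  # model output
    print(f"accuracy: {term_accuracy(generated, diseases):.0%}")
```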

The results showed that disease names were correct in 88 to 99 percent of cases, depending on the type of association. Drug names were accurate in about 91 percent of cases, and genetic information, including genes and genetic processes, matched in 88 to 98 percent.

However, the model performed worse with symptoms, achieving only 61 percent accuracy. This was mainly because ChatGPT often uses everyday descriptions instead of professional medical terminology. For example, it might say "sad mood" instead of "depression" or "facial redness" instead of "inflammation."
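This gap is easier to see with a toy example. The sketch below is purely illustrative (only the two term pairs quoted above come from the article; the dictionary and function are not from the study); it shows how lay phrasings would have to be remapped to clinical terms before database matching:

```python
# Hypothetical illustration: mapping everyday phrasings to clinical terms
# before matching against a medical database. Only the two entries come
# from the article; a real system would draw on a synonym resource.

LAY_TO_CLINICAL = {
    "sad mood": "depression",
    "facial redness": "inflammation",
}

def normalize_symptom(term: str) -> str:
    """Replace a lay description with its clinical equivalent, if known."""
    cleaned = term.strip().lower()
    return LAY_TO_CLINICAL.get(cleaned, cleaned)

print(normalize_symptom("Sad mood"))  # -> "depression"
```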

Are these associations supported by scientific literature?

For the second task, the researchers tested how well the pairs of terms generated by ChatGPT matched findings in scientific publications in the PubMed database. They divided the articles into three time periods – 2009 to 2014, 2015 to 2019, and 2020 to 2024 – to see how many of the model’s suggested associations were confirmed.
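The study's exact matching pipeline is not published in the article, but a similar lookup can be approximated with NCBI's public E-utilities API. In the Python sketch below, the disease-drug pair and the simple count-above-zero confirmation rule are assumptions for illustration only:

```python
# Sketch only: counting PubMed articles that mention both terms of a pair
# within a publication-date window, via NCBI E-utilities. The example pair
# and the "confirmed if count > 0" rule are illustrative assumptions.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def pubmed_cooccurrence_count(term_a: str, term_b: str,
                              mindate: str, maxdate: str) -> int:
    """Number of PubMed records matching both terms, by publication date."""
    params = {
        "db": "pubmed",
        "term": f'"{term_a}" AND "{term_b}"',
        "datetype": "pdat",
        "mindate": mindate,   # e.g. "2009"
        "maxdate": maxdate,   # e.g. "2014"
        "retmode": "json",
    }
    resp = requests.get(EUTILS, params=params, timeout=30)
    resp.raise_for_status()
    return int(resp.json()["esearchresult"]["count"])

for period in [("2009", "2014"), ("2015", "2019"), ("2020", "2024")]:
    n = pubmed_cooccurrence_count("diabetes", "metformin", *period)
    print(period, "confirmed" if n > 0 else "not found", f"({n} articles)")
```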

The results showed that matches increased over time. For disease-drug pairs, the confirmation rate rose from 86 percent in 2009–2014 to 90 percent in 2020–2024. For disease-symptom pairs, it grew from 49 percent to 62 percent, and for disease-gene pairs, from 83 percent to 89 percent.

This suggests that ChatGPT’s outputs reflect the way medical knowledge has developed over the years and indicates the types of knowledge the model can reproduce based on its training data.

How stable are ChatGPT’s responses?

To test how stable or random ChatGPT’s associations are, the researchers used four different versions of the model and generated about 5,000 medical abstracts with each; a rough sketch of this kind of overlap check appears after the results below. They found that matches with real data were:

· 1 to 15 percent for disease-drug pairs

· 1 to 4 percent for disease-gene pairs

· 2 to 29 percent for disease-symptom pairs

Associations between diseases and symptoms were repeated more often than other types. This suggests that while ChatGPT does not generate entirely random data, its outputs are still not stable enough to be used without further verification.
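The article does not describe the comparison code, but the underlying idea, measuring how much the sets of association pairs produced by different model versions overlap, is straightforward to sketch. In the minimal Python example below, the version names and pairs are hypothetical placeholders; a real run would extract pairs from the roughly 5,000 generated abstracts per version:

```python
# Illustrative sketch of a stability check: how much do the association
# pairs extracted from different model versions' outputs overlap?
# The version names and pairs here are hypothetical placeholders.
from itertools import combinations

def overlap(a: set[tuple[str, str]], b: set[tuple[str, str]]) -> float:
    """Jaccard overlap between two sets of (disease, term) pairs."""
    return len(a & b) / len(a | b) if a | b else 0.0

pairs_by_version = {
    "model_v1": {("diabetes", "metformin"), ("asthma", "salbutamol")},
    "model_v2": {("diabetes", "metformin"), ("migraine", "sumatriptan")},
}

for (name_a, set_a), (name_b, set_b) in combinations(
        pairs_by_version.items(), 2):
    print(name_a, "vs", name_b, f"overlap: {overlap(set_a, set_b):.0%}")
```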

Earlier, Kazinform News Agency reported that a former engineer revealed how OpenAI operates.