The patient was a 39-year-old woman who came to the emergency department of Beth Israel Deaconess Medical Center in Boston. Her left knee had been aching for several days. The day before, she had a fever of 102 degrees. It was gone now, but she still had chills. And her knee was red and swollen.
What was the diagnosis?
On a recent tense Friday, Dr. Megan Landon, a medical resident, laid out this real case in front of a room full of medical students and residents. They gathered to learn a skill that can be notoriously difficult to teach – how to think like a doctor.
“Doctors are very bad at teaching other doctors how we think,” said Dr. Adam Rodman, an internist, medical historian and organizer of the event at Beth Israel Deaconess.
But this time, they could call on an expert to help them reach a diagnosis: GPT-4, the latest version of the chatbot released by the company OpenAI.
Artificial intelligence is changing many aspects of medical practice, and some medical professionals are using these tools to aid in diagnosis. Doctors at Beth Israel Deaconess, a teaching hospital affiliated with Harvard Medical School, decided to explore how chatbots could be used – and misused – in training future doctors.
Instructors like Dr. Rodman hope that medical students may turn to GPT-4 and other chatbots for something similar to what doctors call curbside consultations — when they pull a colleague aside and ask for their opinion about a difficult matter. The idea is to use the chatbot the same way doctors approach each other for suggestions and insights.
For more than a century, doctors have been portrayed as detectives who gather clues and use them to find the culprit. But experienced doctors use a different method — pattern recognition — to figure out exactly what’s wrong. In medicine, this is called a disease script: the signs, symptoms, and test results that doctors put together to tell a coherent story based on similar cases they know of or have seen themselves.
If the disease script doesn’t help, Dr. Rodman said, doctors turn to other strategies, such as assigning probabilities to the different diagnoses that might be appropriate.
Researchers have tried for more than half a century to design computer programs to make medical diagnoses, but nothing has really been successful.
Doctors say that GPT-4 is different. “It would create something that would be remarkably similar to a disease script,” Dr. Rodman said. As such, he added, “it is fundamentally different from a search engine.”
Dr. Rodman and other doctors at Beth Israel Deaconess have asked GPT-4 for possible diagnoses in difficult cases. In one study, published last month in the medical journal JAMA, they found that it performed better than most doctors on weekly clinical challenges published in the New England Journal of Medicine.
But, they learned, there is an art to using the program, and it has its pitfalls.
Dr. Christopher Smith, director of the internal medicine residency program at the medical center, said medical students and residents “are definitely using it.” But, he added, “whether they are learning anything is an open question.”
Worryingly, they may rely on AI to make diagnoses in the same way they rely on the calculator on their phone to solve math problems. That, Dr. Smith said, is dangerous.
Learning, he said, involves trying to understand things: “That’s how we retain things. Part of learning is the struggle. If you outsource the learning to GPT, that struggle is gone.”
At the meeting, the students and residents split into groups and tried to figure out what was troubling the patient with the swollen knee. Then they turned to GPT-4.
The groups tried different methods.
One group used GPT-4 to search the internet, much as one would use Google. The chatbot spat out a list of possible diagnoses, including trauma. But when the group members asked it to explain its reasoning, the bot was disappointing, justifying its choice only by stating, “Trauma is a common cause of knee injuries.”
Another group thought of possible hypotheses and asked GPT-4 to check them. The chatbot’s list lined up with the group’s: infections, including Lyme disease; arthritis, including gout, a type of arthritis that involves crystals in the joints; and trauma.
GPT-4 added rheumatoid arthritis to the top possibilities, although it didn’t rank high on the group’s list. The trainers later told the group that gout was unlikely for this patient because she was young and female, and that rheumatoid arthritis could probably be ruled out because only one joint was inflamed, and for only a few days.
As a curbside consultation, GPT-4 seemed to pass the test, or at least the students and residents agreed that it did. But in this exercise, it offered no insight and no disease script.
One reason may be that the students and residents used the bot more like a search engine than like a curbside consultation.
To use the bot correctly, the trainers said, they would need to start by telling GPT-4 something like, “You are a doctor seeing a 39-year-old woman with knee pain.” Then they would need to list her symptoms, the way they would with a medical colleague, before asking for a diagnosis and following up with questions about the bot’s reasoning.
This is one way to harness the power of GPT-4, the trainers said. But it is also important to recognize that chatbots can make mistakes and “hallucinate,” providing answers with no basis in fact. Using them requires knowing when they are wrong.
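For readers who want to try the same pattern outside the classroom, a minimal sketch follows. It assumes programmatic access to GPT-4 through OpenAI’s Python SDK rather than the chat interface the trainees most likely used, and the model name, prompt wording, and follow-up question are illustrative, not taken from the exercise.

```python
# A minimal sketch of the "curbside consult" prompting pattern described above,
# assuming the OpenAI Python SDK (pip install openai) and an OPENAI_API_KEY set
# in the environment. The exact prompts are hypothetical illustrations.
from openai import OpenAI

client = OpenAI()

messages = [
    # Set the role first, as the trainers suggested.
    {"role": "system",
     "content": "You are a doctor seeing a 39-year-old woman with knee pain."},
    # Then list her symptoms, the way one would for a medical colleague.
    {"role": "user",
     "content": ("Her left knee has been aching for several days. Yesterday she had "
                 "a fever of 102 degrees; it has resolved, but she still has chills, "
                 "and the knee is red and swollen. What diagnoses would you consider, "
                 "and why?")},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
answer = response.choices[0].message.content
print(answer)

# Follow up on the bot's reasoning rather than accepting its first list.
messages.append({"role": "assistant", "content": answer})
messages.append({"role": "user",
                 "content": "Explain your reasoning for the top diagnosis."})
follow_up = client.chat.completions.create(model="gpt-4", messages=messages)
print(follow_up.choices[0].message.content)
```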
“There’s nothing wrong with using these tools,” said Dr. Byron Crowe, an internal medicine physician at the hospital. “You just have to use them properly.”
He gave an analogy to the group.
Dr. Crowe said, “Pilots use GPS.” But, he added, the airlines have “very high standards for reliability.” In medicine, he said, using chatbots is “very tempting,” but the same high standards must apply.
“It’s a great thought partner, but it’s no substitute for deep mental expertise,” he said.
As soon as the session was over, the trainers explained the real reason behind the swelling in the patient’s knee.
It turned out to be a possibility that each group had considered and that GPT-4 had proposed.
She had Lyme disease.
Olivia Allison contributed reporting.