Health Care Implications of Large Language Models ...
WEB03-2024
Video Transcription
Hello, everyone. My name is Linda Moy. I am from NYU and I am the Editor of Radiology. I am pleased to be the moderator for this session on large language models. It is one of the many educational items that the RSNA continues to make available to make sure that our membership stays on top of this rapidly expanding field of AI and now generative AI. Today, I'm going to introduce our esteemed panelists. First, there's Tessa Cook, who is from the University of Pennsylvania, one of our leaders in AI. Then there's Jonathan Elias. He is one of our incomparable referring physicians who guides us as we are looking at new use cases. He's from Cornell. As well, there is Keith Hentel from Cornell. He's one of the Executive Vice Chairs there. Then I'm pleased to introduce Merel Huisman, who is from Radboud in the Netherlands. As well, we have George Shih, who is very involved with our RIC as well as SIIM. We're going to begin this session with George showing some slides that highlight a summary of LLMs and any new developments. Thank you, George.

Thank you, Linda. Today, I just wanted to give a brief intro to large language models, and I'll try to go pretty quickly here in my disclosures. I usually start with this slide, but what I'm really excited about is what the RSNA AI showcase might look like at the next RSNA with the advent of large language models; we might actually see quite a bit of these in the AI showcase. Because ChatGPT is everywhere, and in just a year and a half it has started to help us make important decisions in lots of areas. It took about two months to reach the 100 million user landmark, which made it the fastest app to get there. That was before the newer model GPT-4 came out. I like to show this slide because it's my claim to fame. My CS friends told me that this is going to be my biggest lifetime achievement, to be on this slide. If you'll note, I've also highlighted that Sam Altman is also here, which is cool. Let's try it out. RSNA has an AI certificate. I don't know if you've heard about it, but in case you haven't, it's a great way to learn about AI. It's definitely the first, and maybe the only, dedicated AI certificate program for radiologists. Linda and I were the course directors for the first certificate, and they've updated it a bunch of times. I think we're on module 5, and I was invited to give a talk in that module a couple of weeks ago. I think they're launching this module, which is module 5, this month. My talk was on reporting workflows with large language models. Basically, it covered the normalization of the metadata for reporting, so it covered everything except the reporting itself. I just wanted to show you an example from that. This example involves common data elements, which are standardized sets of questions and allowable answers that can be used to enhance radiology reporting and things like clinical decision support. It's not always easy to use them while you're doing the reporting, but with large language models you can actually start to use them in your reporting. This is an example of a CDE for acute appendicitis. In my example, what I've done is I have a radiology report that has a positive appendicitis finding. Then I've included the CDE for appendicitis, which is quite extensive, and it has lots of things about appendicitis that you can see here. Then what I'll do is I'll essentially ask the large language model, in this case GPT-4, to extract or produce the CDEs based on my report.
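To make that workflow concrete, here is a minimal sketch of what such a request might look like with the OpenAI Python SDK. The report text, the CDE fields, and the model name are illustrative assumptions rather than the exact material shown in the talk, and the real appendicitis CDE set is considerably more extensive.

```python
# Minimal sketch: asking an LLM to fill a CDE set from a finished report.
# The CDE fields below are an illustrative subset, not the full appendicitis CDE set.
from openai import OpenAI  # pip install openai; expects OPENAI_API_KEY in the environment

client = OpenAI()

report = """CT abdomen/pelvis: The appendix measures 11 mm with wall thickening,
periappendiceal fat stranding, and a 6 mm appendicolith. No perforation or abscess."""

cde_schema = """Return JSON with these fields (illustrative subset of an appendicitis CDE set):
- appendicitis_present: true/false
- maximal_appendix_diameter_mm: number or null
- appendicolith: true/false
- perforation: true/false
- periappendiceal_abscess: true/false"""

response = client.chat.completions.create(
    model="gpt-4o",  # any capable chat model; the talk used GPT-4
    messages=[
        {"role": "system", "content": "You extract common data elements (CDEs) from radiology reports. Answer only from the report text."},
        {"role": "user", "content": f"{cde_schema}\n\nReport:\n{report}"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)  # JSON-like CDE values derived from the report
```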
On the slide, you can see it basically go through and compute all the CDEs. It does a really good job, and it describes the relevant CDEs based on my report. This is really exciting for us because now we can essentially take our radiology reports and make them very useful for other systems and other workflows. One of the things that we did last year when ChatGPT first came out, and a lot of the co-authors are on this article, is that we published an editorial on the potential use cases for ChatGPT. I just wanted to highlight this partly because I think, and Linda can confirm this, it ended up being the most downloaded and most cited paper of last year in radiology, which is really cool. What that means is that everyone's excited about it, people are doing research on it, and so this is a big area for us in radiology. In that article, we talked about the use cases that we thought were relevant, and I've highlighted generating content for reports. I think that we're going to start to see a lot of reporting solutions with large language models. There are lots of other things on here, and you guys can have a look at the article. We quickly followed that up with an article on impression generation using GPT-4 for radiology reports. The summary is that it wasn't quite as good yet compared with radiologists, but GPT-4 was already pretty good, and I think it's going to get a lot better. I really just want to cover one other topic, which I think is fresh in everyone's mind, because one of the big problems with large language models is hallucinations. I want to talk about how we might be able to prevent hallucinations. By the way, if you don't recognize the background, this is Rio de Janeiro. I just got back from the Brazilian meeting JPR down there. Thanks to Felipe. It turns out that ChatGPT is a general model, and it's not really specialized for radiology. The fact that it knows so much about medicine and radiology is almost incidental. I think what we're going to see in the next year or a few years is these large language models evolve into medicine-specific large language models. That will likely be much more useful for us. What's really important is the content in radiology. Information from our websites and our journals, like Linda's journal Radiology, is going to be really important. But probably the most important is going to be the context. I've said context is king here, because what you really want to do is deliver the appropriate context during chat so that large language models can use that context for their response. I have found that this virtually eliminates hallucinations. I want to show you a quick example. This is a LI-RADS example. From this article, I grabbed an example of hepatocellular carcinoma on MRI, and it was categorized as LI-RADS 5. What I did was I then pasted the findings into ChatGPT and I asked it for the LI-RADS category. You can see it gives a very nice analysis of the findings. Then at the end, it tells you that this is LI-RADS 5. That's great. Out of the box, the newer version of ChatGPT knows LI-RADS. But what if, in my fantasy world, I decided that I was going to take over the LI-RADS committee and I wanted to rename it George RADS? How would I do that? How would I get everyone to use George RADS? One way is, you probably guessed it, provide George RADS as the context. Now you can see George RADS on the left here. I've just renamed the LI-RADS categories as George RADS categories, and I've added a fruit name here.
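The general pattern of supplying your own criteria as context is easy to sketch. The George RADS wording, the category definitions, and the findings below are invented stand-ins, not real LI-RADS text or the slide content, but they show how a custom scheme can be pasted into the prompt so the model grades against it rather than against whatever it memorized.

```python
# Minimal sketch of "context is king": paste your own grading criteria into the prompt
# so the model categorizes against *your* scheme. Criteria and findings are invented stand-ins.
from openai import OpenAI

client = OpenAI()

custom_criteria = """George RADS (illustrative renaming of a LI-RADS-style scheme):
- George RADS 3 (banana): intermediate probability of malignancy
- George RADS 4 (orange): probably hepatocellular carcinoma
- George RADS 5 (apple): definitely hepatocellular carcinoma
  (large observation with arterial hyperenhancement plus washout and capsule)"""

findings = """4.3 cm segment VIII observation with nonrim arterial phase hyperenhancement,
nonperipheral washout, and an enhancing capsule."""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "Use ONLY the categorization scheme provided in the context. "
            "If the context does not cover the case, say so instead of guessing.")},
        {"role": "user", "content": f"Context:\n{custom_criteria}\n\nFindings:\n{findings}\n\nWhich category applies?"},
    ],
    temperature=0,
)
print(response.choices[0].message.content)  # expected: a George RADS 5 (apple)-style answer
```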
Based on George RADS, I'm going to ask ChatGPT to give me the George RADS category for the exact same findings. Let's see how it performs. Based on the provided information, it gives, again, a nice analysis. There it is. It gives me a George RADS category. You can see that you can adapt these large language models to really help you provide the best recommendations for your local practices, which is really powerful. I think I'm going to stop there, because some of the other slides, like the multimodal stuff, we're going to talk about during this discussion. I don't think I have time to show you guys all these examples. I do want to point out that we recently published an article evaluating GPT-4V, the vision model, on chest radiographs. You guys can hopefully read the article, but just to spoil it a little bit, it's not ready for prime time. At this point, we're still not at risk of these large language models taking over our jobs. Then I just want to point out that, I think it was last week, OpenAI announced the release of their new model called GPT-4o. The "o" stands for omni. It combines text, speech, vision, and I think image generation into one model. The focus was really on speed and reduced latency, so the responses are much faster. I also noticed that some of the answers in general are pretty good, certainly not worse, and in some cases even better, although this is not very scientific. The one thing it's much better at, I've noticed, is math, which turns out to be really useful even for radiology reports because we have a lot of measurements. Hopefully, we'll talk about that in our discussion. One last slide in case you guys are interested in GPT-4o. At the SIIM meeting, which is an imaging informatics meeting June 26th to 29th, we are going to showcase GPT-4o in the AI playground there. Great. I'll stop right there, Linda.

Thank you for that very nice overview, George. At this point, we're going to begin the panel discussion on the healthcare implications of large language models like ChatGPT. I figure this is a nice opportunity for all of us to really learn more about whether we're planning to use LLMs in our workflow or whether we're currently using them. I'm going to ask all of the panelists to talk about the current scenario at their institution. If you are using it, please share a specific use case. I'm going to ask all the panelists to turn on their cameras, please, and then join our discussion. Tessa, you're first.

Thank you, Linda. Hi, everybody. To answer the question of how we're using it or thinking about using it, we have been working with an internal HIPAA-compliant set of large language models, as well as with some open-source models. Nothing is currently in our workflow, but we're trialing a lot of different use cases. Some of them are specific to radiology, some of them incorporate our colleagues in cardiology and pathology as well. Some of them are actually looking across the enterprise to solve problems like closing the follow-up loop and things like that. But others are really also very simple. One of the things that LLMs provide us that we could really use is that they are really effective information retrieval and information synthesis tools.
We are using them for some really very simple problems like, go find me all the patients who have an abdominal aortic aneurysm of four-and-a-half centimeters or something like that, or patients who should be seeing a vascular surgeon to be followed for their AAAs. We're really just very much in the dabbling stage, partly because from a policy standpoint we're not currently allowed to put live clinical data into these large language models, even though we have a HIPAA-compliant environment. Once that changes, we'll hopefully be ready to go with a bunch of things. But for right now, we're mostly experimenting.

Great. Jonathan?

Sure. Hi, everyone. At our institution, we're starting to utilize a HIPAA-compliant large language model within our electronic health record, Epic. As an ordering provider, I have a little bit of a different spin on the use cases for these large language models. The primary areas of focus for an ordering provider are documentation burden and how to minimize clinician burnout through decreasing documentation burden. When we think about our use cases for these large language models in our electronic health record, the first is, well, can we write a note in a quicker and more accurate format, all while focusing the clinician's attention on the patient who's sitting in front of them? That's where ambient AI comes in. Putting your phone on the table and having the large language model populate the contents of a progress note while you're talking with the patient, even potentially inputting things like exam findings and orders. Those are in testing phases right now at our institution, getting close to piloting. In terms of further documentation, a big lift for us is clinical summarization. For all those who have worked on the inpatient side, summarizing a hospital course for a patient can be challenging, especially for patients who have been in the hospital for months on end. Ways to summarize vast swaths of notes at a time become very important. Obviously, it's important to this group to provide clinical context for the imaging orders. Epic is currently working on a clinical summarization tool, again in the testing phase. Something that is more at the forefront is actually a documentation burden that has seen an uptick since COVID, which is in-basket messaging. We've noticed that with increased portal usage since the pandemic, there has been an increase in the number of messages from patients. We have an integration with GPT through Epic to draft in-basket messages that are responses to patient messages. Now, this is the closest thing to prime time in production at this point, and it has been rolled out to multiple institutions throughout the United States already. We're currently in the pilot phase. We've undergone multiple rounds of iterative prompt engineering and testing with our clinician informaticists and regular clinicians as well to ensure that we are receiving the best drafted responses we can get from the large language models. Even then, we're proceeding with caution and implementing with a small pilot group at this point in time. But there are other institutions that have been at the forefront of this. There are papers written by Stanford, UCSF, and Mayo, who actually have this out of the pilot phase and in broader acceptance. That's what we're doing on the AI front from an ordering provider's perspective at this point.

Thank you, Jonathan. Keith?
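As a rough illustration of the drafting pattern Jonathan describes, a sketch like the following drafts a reply that a clinician would still review and edit before sending. This is not Epic's actual GPT integration; the patient message, the prompt wording, and the model choice are assumptions for illustration only.

```python
# Minimal sketch of drafting (not sending) a reply to a patient portal message.
# This mirrors the general pattern only; the real integration lives inside the EHR,
# and any draft is reviewed and edited by the clinician before it reaches the patient.
from openai import OpenAI

client = OpenAI()

patient_message = ("My daughter's ultrasound report says 'mild pelviectasis'. "
                   "Should I be worried? Do we need more tests?")

draft = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": (
            "Draft a brief, warm reply from a pediatric clinic to a parent's portal message. "
            "Use plain language, do not add new clinical facts, do not give a diagnosis, "
            "and end by offering a call or visit for further questions. "
            "Label the output DRAFT FOR CLINICIAN REVIEW.")},
        {"role": "user", "content": patient_message},
    ],
    temperature=0.3,
)
print(draft.choices[0].message.content)
```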
Well, John is one of our referring providers, and I have to say one of our interests, and I don't mean to insult John, but I'll go ahead and say it anyway, is to get better histories and better information for the studies that we're reading. Again, it goes back to summarization of information, but more from the radiologist's view. Quite frankly, I have the privilege of working with John and with George on a lot of these projects, and they know that I've been a little bit frustrated with the speed at which we've been able to implement these. I think, as Tessa mentioned, there are several barriers to getting this enacted and embedded directly within clinical workflow, not the least of which for us was getting access to a HIPAA-compliant version of these large language models. We've taken on multiple strategies to address that. George, I'll let you speak a little bit more about some of the open-source approaches that we've taken. But I can tell you, as someone who's responsible for a large imaging practice, there's no doubt in my mind that this will become embedded into our workflow within the next year or the next several years. For me, I'm most interested in some of the things that we have going on to really improve the quality and value of the services that we provide and the ease with which we do that. For example, we're now working on looking for errors in our reports before they get released to patients and to providers. We just signed up to do this yesterday, but we're going to be piloting technology to identify important findings and critical results and flag them automatically within the electronic health record to make those findings more visible not only to our providers but also to our patients. I very strongly believe that a patient's best advocate is the patient themselves, and I'm very excited about using these large language models to translate reports into a format that is easily digestible by patients, and about embedding these into multimedia reporting. With the newer versions of GPT that can do image generation and language at the same time, I'm very excited about the potential to use that in those areas. I probably could go on for most of this webinar with some of the exciting ideas that we're working on and working towards here, but I think I'll stop now.

Thanks so much. Before I introduce Merel, I'll say we do have an audience Q&A, so if you have other questions or specific use cases, please enter those questions. We're going to try to get to all of them. Merel?

Yeah. Thanks, Linda. I'm Merel Huisman. I'm from Radboud University Medical Center in the Netherlands. That means we don't have HIPAA, but we have the MDR and actually the newly accepted AI Act, which is, as I'm sure you know, far more stringent than other regulations. So actually, for us it's quite difficult to use large language models in the clinic, especially with patient data, of course. What we do see, as Jonathan also said, is that we are also an Epic hospital. Epic has these pilot hospitals where they use the large language models mainly for clinical notes and for summaries, but we are unfortunately not such a hospital. However, probably next year, we will also get the Epic plus Microsoft integration where they have the large language model built in. That being said, I do a lot of cardiac reporting, and what I mainly do is supervise the report pre-dictated by the cardiology fellow. And actually, I have to reformat those data.
As you know, cardiac is a lot of numbers, so I have to reformat it into a format that the radiologists know and like, which is unfortunately different from the format reported by the cardiology fellow. I know this sounds silly, but this is daily practice. I think I see a lot of nodding faces. You find yourself reformatting this manually. So I made a GPT myself and I let the GPT do it for me. That has worked really well. It has been a little bit inconsistent sometimes, so you really have to check the numbers. What I sometimes do is make a screenshot of the report. You always have to be super careful that you do not include any patient names, so I double-check that. But you can make a screenshot of the actual report from the technician, upload it as an image, and then it comes back in your format. However, I found it's easier if you have text input; it seems far more accurate than with image input. So that's one thing I use it for, reformatting. I'm pretty sure you could also use this for, you know, finding inconsistencies or auto-summary, et cetera, but I haven't done that. I am still a bit afraid I might accidentally paste something that I shouldn't have into there. And I have been using GPT-4 and now GPT-4o a lot for academic tasks. For example, if you're reviewing, you can nicely summarize the previous literature. You can just upload a PDF in there, and it does a pretty good job at summarizing, GPT-4 that is. And there are other tasks as well. If you're preparing a presentation, it can, you know, give you a start with the outline, et cetera. Yeah, that's about it, I guess.

Wonderful. Thank you. Felipe?

Hello. Thank you for inviting me to this great session. There's one project I worked on in the past months, which is identifying patients with actionable findings in their radiology reports, so we can try to guarantee the follow-up for diseases where we know it's pertinent to have that follow-up. This project actually started before GPT was launched. We were using classic, simple, rules-based NLP models to identify some diseases. We know that in some cases in radiology we already have pervasive use of structured reporting, for example in breast imaging, so you can imagine you would not need a large language model to find out if the impression mentions a BI-RADS 2 or 4, for example. But in other cases, it might be a lot more difficult. For example, if you are describing a focal liver lesion, there are multiple ways of doing that for the same kind of disease, right? Since that description might be ambiguous and different radiologists might describe it in different ways, what we found was that we had developed simple NLP models to detect around 50 diseases, but for some of those we couldn't get them right; there were a lot of false positives and false negatives because these rules can't correctly identify some of these ambiguous cases. And that's where large language models helped us a lot. So we're not using LLMs for all 50 diseases, but for those that are more difficult to identify. And as you can see, this is a workflow that is already happening in clinical practice, but it also has the judgment of a radiologist after a case is identified. It's not automatic follow-up. A radiologist checks whether that's correct and whether it's pertinent to call the referring physician in that specific case to communicate that finding. So this is one thing we did.
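A minimal sketch of the kind of screening Felipe describes might look like the following. The tracked-disease list, the prompt wording, and the example impression are assumptions for illustration, not his actual system, and every flagged case would still go to a radiologist for review.

```python
# Minimal sketch of screening report impressions for findings that need follow-up,
# aimed at the ambiguous wording that simple rules miss. Disease list and wording
# are illustrative; a radiologist still reviews every flagged case.
from openai import OpenAI

client = OpenAI()

TRACKED = ["focal liver lesion needing follow-up", "pulmonary nodule needing follow-up",
           "abdominal aortic aneurysm of 4.5 cm or larger"]

def flag_actionable(impression: str) -> str:
    """Return the model's judgment on whether the impression contains a tracked actionable finding."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": (
                "You screen radiology report impressions. Answer with the single matching item "
                f"from this list, or 'none': {TRACKED}. Base the answer only on the impression text.")},
            {"role": "user", "content": impression},
        ],
        temperature=0,
    )
    return response.choices[0].message.content

print(flag_actionable("1. Indeterminate 14 mm arterially enhancing lesion in segment VI; "
                      "MRI liver recommended in 3 months."))
```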
The other thing, which I'm not using but I see some radiologist colleagues using, is that there are some products on the market where you can just mention the positive findings and it will generate the entire report in the same style that you use, including the impression. If that works, let's say, 90% of the time, it might help you save a lot of time instead of having to dictate everything. Another thing I just wanted to mention briefly is that after this GPT thing started to take off, I was curious to try something myself and fine-tune an LLM. One experiment I did a couple of months ago was to train GPT-2, and the reason for GPT-2 is that GPT-3 is a lot bigger than my GPU could fit. I trained it on the text-message conversations between my wife and me. It's interesting to see how that model is able to resemble the way we talk to each other. It's as if I can talk to a chatbot that behaves just like my wife, and the other way around, too. But this is just for fun.

Wonderful. Well, I hope that this introductory question gives us all a flavor of the potential use cases for large language models. I'll jump to our second question, and I'll have George answer first. We are, at least in the US, trying to be as patient-centric as possible. So far, it seems as though our patients like ChatGPT better than Dr. Google. But my question is, what's our responsibility when our patients are using public large language models and getting inaccurate information about their care in radiology?

Yeah, this is a hard one, and maybe Tessa should follow my response because I think she's done some research on this. In the current state, we haven't really validated any of these commercial large language models, like ChatGPT and those from other companies, to be safe enough for medicine. The bottom line is that you don't know if the answer is going to be accurate until you see it. So at this point, it's hard to recommend that patients interact with these, but they're going to do it anyway, right? What I'd like to do, and maybe we'll discuss this later, is to really come up with appropriate medicine- and radiology-specific benchmarks that we can apply and test on all these large language models as new versions come out. At that point, we might be able to have a better recommendation on whether or not they're safe for clinical use. And RSNA is actually just recently kicking off such an effort as part of the RSNA AI Committee that I'm on. So there's going to be more on that later in case any of the participants want to help. But Tessa, do you want to... Oh, sorry, Keith.

No, I was just going to say, after your talk today, I'm a little bit nervous that our patients are going around telling people their George RADS score, so...

Yeah, I mean, I agree with George. We don't have good benchmarks for any of these solutions, and especially not for the public ones. But we did a little experimentation in the early days of ChatGPT, even pre-GPT-4, and found that when it makes mistakes, it makes subtle mistakes. So even for us experienced radiologists, as we were reading really quickly through the responses, for some of them the first pass through didn't even register for us: wait, that's wrong. Then we went back, took a second pass, and realized that the response to the question wasn't actually factually correct. And so it's hard for us to say to our patients, don't use it. That's not something we can say.
We can't tell them not to Google any more than we can tell them not to use ChatGPT or GPT-4o now. But I think it's important for us to get the message out there that these are not tools that have really been vetted for healthcare as of yet. There might be some instances where they actually give very good and very medically accurate information back. But the part that concerns me is the part where the errors are subtle, because patients might actually act on that; they're not necessarily going to be able to look at a response and realize which parts of it are correct and which parts are incorrect. So I think we really, collectively as a community, have a responsibility to try to educate our patients on this. There was a very nice, pretty significant study from the Pew Research Center, I think it's been almost a year now, where they surveyed 11,000 adults in the US about AI in healthcare, and there's a lot of skepticism. But despite that, there are still going to be plenty of people who use these models to get information regardless of that skepticism. So I think we have a responsibility to try to educate our patients and our colleagues about what these tools can and cannot do. And as George said, that's going to change over time, so it's not a one-time responsibility; it's an ongoing responsibility.

Great, thank you. I'm going to take a new question that has shown up in the chat, which we've also discussed privately, which is: do use cases need FDA or CE approval? Merel, would you like to answer from the European side?

Yeah, so let me just rephrase. You mean use cases in general, or?

Yeah, for large language models, do we need to have the CE stamp of approval saying yes, you can use large language models for a particular use case?

Yeah, okay. For me, the answer was so obvious, I didn't even realize. Unfortunately, yes, for every use case; especially under the AI Act, every healthcare application or use case is considered high risk. Actually, at ESR we're currently in a working group looking at whether there are maybe use cases that do not directly impact medical decision-making or diagnosis. Then maybe those can be considered medium risk, but even then, I think in the end it's all going to need approval, at least in Europe, as far as I can see now.

Okay. I'd love to hear what Felipe thinks, because there was this very nice editorial in JAMA saying, you know what, some of these everyday mundane use cases are actually going to be implemented much more quickly into clinical practice, and that may end up bypassing FDA approval. Felipe, what do you think?

That's a very good point. As physicians, I think we always act in the best interest of our patients. I know regulation is very important so we don't get into dangerous situations, but it also might hold back a little bit the benefit that patients could get, right? So I think it's a balance between those two things. I can say that in Brazil, there's a similar law that was proposed, an AI law, where they classified any AI use case in healthcare as high risk. And I don't think that's the case. It went to public comments, and most of the healthcare institutions said the same. There are use cases in healthcare that are not actually autonomous use, and they're not related to defining a specific treatment.
These are use cases where a physician will always oversee the result from AI. We are arguing that for those cases, we should have a different kind of bar for approval, and if it's not patient-related, it might not even need clearance. That's what people in Brazil are thinking about.

Can I chime in here, Linda? I know, for example, that the FDA is probably looking at this very carefully. I was invited to go give them a talk a couple of months ago, and they were really interested in large language models. Specifically, they were very interested in, and nervous about, the vision large language models. I have no information as to what they'll end up doing. But historically, they don't regulate a lot of things like the EHR or the creation of reports. So I think it would depend on the task. For example, something like summarization, which is commonly done with large language models: my guess is that it may not be regulated. But if you're talking about looking at medical images and even giving a differential, I would say that may be treated very similarly to the other AI applications and devices they regulate. They're very, very interested in benchmarking, and so I think that's going to be a big area, if we can provide clear benchmarks on how these large language models perform.

I agree. Merel, I just have a question. Could you briefly describe for the audience what the AI Act is? Most of us may not be familiar with it.

Yeah, so the AI Act is a regulatory framework in which any and every AI, in whatever field, so not only medicine but also personal identification, education, all that, is heavily regulated. It's based on the WHO principles that were released, I think, at the end of 2023, or 2022 maybe; it doesn't matter. But it's a regulatory framework, the AI Act, that's extremely stringent. Every medical application is considered high risk. So, to what George was saying, the summary can also be considered medical because you're going to act on that, right? I'm just trying to say that the AI Act is a lot stricter than anything we have now in the world regarding AI regulation. It's new; it hasn't come into full effect yet, so we're still debating how it's going to pan out. It's both vague and very strict, so let's see. It has the people's and the patients' best interests at heart, but it could also actually slow down innovation. We'll see.

Okay, thank you. Now I'm going to start taking some more of the questions from the audience Q&A. The first one is addressed to Keith. I'm wondering what you think large language models might do to augment decision support for high-end imaging, which really hasn't worked well with conventional means.

Yeah. I mean, I think it goes back to having the right information to make decisions upon. One of the reasons, in my own personal experience, why clinical decision support hasn't worked is the interruptive nature of most clinical decision support interactions. And I see John shaking his head, because we interrupt him all the time when he's trying to order studies.
But, you know, having the ability for a large language model to really look at the totality of the information that we know about a patient and, similar to what George showed in his slides, take that and extract the important points relevant to the type of imaging that we're considering doing, or maybe even taking it a step further, recommending imaging that would be helpful for the diagnosis of this patient in a non-interruptive manner, is really where I see this field going. And I actually see this moving relatively quickly. I know that we're already working on some use cases for taking some of our own appropriateness criteria and utilizing large language models with them as well. So, summarizing what I was saying, it's the ability to have the right information at the right time in a non-interruptive manner that will allow decision support to be more effective.

Thank you. Does anyone else have a comment? Okay. If not, I'm just going to respond to one or two things that are in the audience Q&A. The first was just a point of clarification, which Merel had raised, about using large language models for academic tasks like peer review. We want to clarify that it's okay if you want to get some background information, but most of the journals do not allow reviewers to put the manuscript into these large language models and ask them to generate a review. That is actually forbidden, and the same goes for grants. I really wanted to emphasize that; I don't want there to be any kind of miscommunication about it. As well, a question for you, Merel, from one of our audience members who also does a lot of cardiac work and wanted to learn a little bit more about your model. Were you able to build this on your own? Just let us know, for the audience members who may want to play around in this space.

Well, let me start off by saying, about the papers: no, you should never upload an unpublished manuscript. My comment was about published work. If it's open access, then it's available to the model anyway, right? So then you can upload it. That being said, for the second part, you're free to reach out to me and I can explain it to you in detail as well. But OpenAI's GPT has custom GPTs, and there you can instruct a specific model just with free text in your own words, basically fine-tune it with words to do what you would like it to do. For example, you can feed it an example of your desired format, and then you say, okay, every time I paste in this string of cardiac volume values I was talking about, I want you to output it in this and this format. Be mindful, though, because it does make mistakes; I just wanted to clarify that. But it's really easy, and you can just reach out to me and I can show you, not a problem.

Okay, great information. So I'm wondering if Tessa could lead off on this next question, which is: how can the accuracy of LLMs be monitored over time?

Yeah, that's a great question, and I don't think it's even limited to LLMs. We're all sort of exploring how to do post-deployment monitoring for all of the AI in our workflow, but specific to LLMs, I think it really depends on the use case. Take something like, to take Merel's example one step further, where you're basically giving it a workflow task, right? You're saying, take this unstructured data and structure it for me in some way.
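For a structuring task like that, one way to picture monitoring over time is a simple run-chart check over clinician-audited samples. The sketch below is only an illustration with invented numbers and an arbitrary control limit, not a production monitoring system or anything described by the panelists.

```python
# Minimal sketch of post-deployment monitoring for a structuring task:
# compare a daily audited sample of LLM outputs against clinician answers and
# flag days that fall below an acceptable agreement rate. Data and limit are invented.
from dataclasses import dataclass

@dataclass
class DailySample:
    date: str
    agreements: int   # audited cases where the LLM output matched the clinician
    audited: int      # audited cases that day

ACCEPTABLE_LOWER_BOUND = 0.95  # illustrative control limit, set per use case

def review(samples: list[DailySample]) -> None:
    for s in samples:
        rate = s.agreements / s.audited
        status = "OK" if rate >= ACCEPTABLE_LOWER_BOUND else "ALERT: review prompts/model"
        print(f"{s.date}: agreement {rate:.1%} ({s.agreements}/{s.audited}) -> {status}")

review([
    DailySample("2024-06-01", 97, 100),
    DailySample("2024-06-02", 99, 100),
    DailySample("2024-06-03", 91, 100),  # would trigger an alert for human review
])
```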
A task like that is one where it's probably fairly easy to monitor performance over time and get a sense of how many times it got it right and how many times it got it wrong. With Jonathan's example earlier of crafting responses to in-basket messages, or organizing an encounter note or something like that, it perhaps gets a little trickier. But again, I think it all comes down to what George said earlier: you have to have a certain set of benchmarks, you have to have an expected range of performance. You can almost envision dozens, hundreds, maybe even more run charts within a radiology practice for all the different LLM use cases and how they're performing over time, and the ability to then find where performance exceeds acceptable bounds so you can go and intervene. But I think these all have to be systems, and this is going to be sort of the next phase. We're still in the phase of trying to figure out how we can use this technology. The next phase is going to be, once we've matured some of these use cases and actually put them into the workflow, to start to evaluate how to keep track of their performance, because it's going to have to be something that's done in an automated way. You can't rely on the human clinician, radiologist or otherwise; we already have enough to do, which is why we need these tools to help us. So it can't be exclusively up to the human clinician to raise a flag when something goes wrong. We're going to have to figure out how to automate the monitoring of these solutions, and that's going to require acceptable benchmarks. It's going to require clever prompting so that we can try to minimize some of the deviations up front, but also continuous monitoring so that we can identify, especially as we put more and more of these tools into the workflow, when things are not performing as expected so they can be addressed.

Okay, that's a really thoughtful answer. Do any other panelists have any comments on this question? Okay, if not, I'd say the next burning question we have is: what are the important things to consider to ensure that LLMs are going to be accepted by our patients, referring physicians, and radiologists?

I can answer that. I think it's partially what Tessa just mentioned, which is, at least from a referring physician's perspective, and I would say a scientific perspective, that these LLMs have been put through their paces, that you're checking benchmarks, that you're feeling confident with what was done to test the LLM before it got to you and before you're using it for patient care, and then the continuous operational monitoring. Every time there is something like an update, a new GPT version, et cetera, what is the workflow for that update in terms of monitoring it? I think that will make all of us feel a little more accepting of large language models in general. Taking a step back, I think the first thing is about exposure, right? If we were talking about this a year ago, which we were, then it was about utilizing the technology to feel comfortable with the technology itself. And I think that's less and less of an issue as we use these large language models and as our patients begin to use these large language models, not necessarily in medicine, but in our daily tasks. From a patient's perspective, there have been studies showing that FYI-ing patients about the use of artificial intelligence within documentation and workflows is often, I would say, preferred by patients.
There is also, and it's more anecdotal at this point, the question of the utility of certain workflows like clinical summarization or in-basket drafting, and whether to disclose that upfront to patients, or just to the clinicians, or to both. I think there is this idea that everything touched by an LLM needs to be reviewed by a clinician before it reaches a patient, no matter what, at this point in time, and I don't think that's going to change anytime soon. But putting that information upfront to patients may help our patients accept the information that much more.

Can I ask a follow-up to that for Jonathan, since he's closer to patients than most of us? But anyone can answer. Do you feel like patients will today, or will eventually, have positive or negative connotations if you're using large language models? In other words, today, if you misspell a word, they're like, well, why didn't you spell check? Everyone has spell check. And maybe soon, if you didn't summarize it in patient-friendly language, they're like, well, why aren't you using the latest large language model? Why are you making me read through this medical jargon? Some people might want the benefit of this. They might feel that you're more advanced in your medical practice if you're using this technology. What is your perception of that?

I mean, I think different people have different opinions about it. I think that some patients would prefer the use of the most high-tech technology in discussing the clinical documentation that you have, but some people just like people doing it the old-school way. As a pediatrician, I think I see a lot of the first type of patient, which is basically parents who are really interested in their kids getting the newest, best care. And that has to do with parenting blogs, et cetera, the newest, hottest thing in pediatric care. Every two years, there's a new fad, so everyone wants to jump on that bandwagon. It's different in other areas of medicine and with different patient populations. So it's tough to say.

I'm going to put Keith in the hot seat here. What about the medicolegal risk if we're using LLMs routinely with our patients and referring physicians?

Yeah. I mean, I think, like anything, to me, at least at this point in time, LLMs are a tool similar to any other tool that we're using, whether it be speech recognition or advanced post-processing. In my mind, at this point in time, as has been said by many of my co-panelists, it's still incumbent upon the radiologist or the physician using the tool to make sure that what we're putting out to our patients, to our referring community, and to the world is accurate and reflects what we intend to communicate. I do think that something John mentioned briefly is very important from a medicolegal perspective too, and that's transparency. I don't think that we should use these tools without people being aware that we're using them, and I think at least declaring that at the end of the day will help us quite a bit. Where I start to worry is about something that George brought up from a medicolegal perspective. What if these large language models evolve to the point where we've proven their beneficial use? Then, for the people who haven't implemented them, are they at medicolegal risk for not using them? Are they not providing the standard of care?
So I think it's going to be interesting to see how the medicolegal landscape around this progresses, but I think for now we're relatively protected by the fact that we have to be looking at the output of whatever tool we're using.

Okay, Keith, can I play devil's advocate for a second? I'll fully disclose that I may not agree with the question that I'm about to ask, but I'm going to ask you anyway. Okay. We use all sorts of software in our workflow now, right? I'm a cardiovascular imager like Merel; we do automated or semi-automated quantification. We use all kinds of third-party 3D modeling and sophisticated analysis software. We don't tell our patients that we use that. We use it because it's considered state of the art or is accepted, it's FDA cleared for whatever use case, and we're using it and delivering care. So if we get to a point where use cases are FDA cleared or CE marked, or these tools are bundled into some product, do we need to disclose that to patients?

I think if we get to the point that they're FDA cleared and they become truly approved ways of doing business, then probably not. I think right now, whether we like it or not, we have to consider these experimental, and just as you wouldn't enroll a patient in any clinical experiment without their knowledge, I feel strongly right now that if you're making decisions or providing care using one of these experimental tools, you should be transparent with the patients.

Yeah, and I agree with you. I just wanted to put that thought out there for everybody to think about as well.

Let me just throw the question back at Jonathan in a little more nuanced way. Would you ever tell your patients, hey, this is a complicated decision about your child, but I'm pretty sure I made the right decision because I'm using large language models to assist me?

I don't think I'd phrase it in that exact way. I think that patients' parents want you to use all the resources available to you to make the right decision. It's just another resource that we have. It doesn't mean that it is the best resource, and it doesn't mean that you aren't weighing other resources more highly. I would say, and we can talk about medical education, that could be a whole other hour, because this is an educational tool in the end, in some way, shape, or form. So are parenting blogs, right? But they aren't always right either; actually, not the majority of the time. So I would say that you're going to use every source available to you to make the right decision, and you're going to use every source with caution, but I'm not going to exclude a source because it isn't 100% vetted all the time.

Okay, great. We're down to three minutes, but since you talked about medical education, I wanted to close with this question from the audience. Given the rapid advancements in AI technology and its growing impact, I'm curious to hear from experienced radiologists: would you still recommend radiology as a career path for med students? And how do you see AI influencing the future of radiology, and what opportunities or challenges might arise from it? George, you want to take a stab at it first, and maybe Felipe, and then I'll hand it to you.

I mean, I think I showed in one of my slides the evaluation of GPT-4V for chest x-rays, so medical students don't have anything to worry about for a while in terms of these large language models interpreting x-rays.
I do think that in the next 5, 10, maybe even up to 20 or 30 years, we're going to see a golden age of radiology where we're going to have all this awesome AI. It's going to make our lives so much better as radiologists. So actually, now is probably the time to go into radiology. I can't predict 30 years out, and so I'm going to keep it at that. Felipe?

I totally agree with George. I think radiology is quite interesting, and it's going to become even more interesting with these technological advances that we're seeing. Some friends tell me that I'm too optimistic, but I would go for radiology again if I had to choose again. The only thing I would say is that I think our practice will look a lot different than it does today, but it's still going to be quite interesting.

What will that future practice look like, since you mentioned that?

I have no idea. I'm just optimistic that it's going to be more helpful for patients and less daunting.

Okay, great. Well, that is the last question, so I want to wrap up this webinar. We had lots of interesting questions, and all of us seem to be very excited about these changes happening in our field. I thank all of the attendees for listening to this session. It was recorded, and it will be available to share with everyone who signed up for this webinar. If you have other questions, you can email me directly at elmo at RSNA.org and I'll make sure to get them to the right panel speakers if I cannot answer the question myself. I have to close by thanking all the panelists for their time and their thoughtful responses, and I'm most excited to see what will happen in this rapidly changing field. Thank you all very much.
Video Summary
In this webinar, Linda Moy from NYU moderates a panel discussion on the use of large language models (LLMs) in radiology, highlighting efforts by the Radiological Society of North America (RSNA) to educate its members on artificial intelligence (AI) and generative AI. Panelists, including Tessa Cook, Jonathan Elias, Keith Hentel, Merel Huisman, Felipe, and George Shih, share insights into current and potential applications of LLMs in healthcare. They discuss their experiments with internal and open-source LLMs for tasks such as clinical documentation, summarization, and patient follow-ups, while emphasizing the need for HIPAA compliance and accuracy monitoring. They also explore the role of LLMs in patient interaction, noting both potential benefits and challenges, including issues of accuracy, transparency, and regulatory approval. There is a consensus that LLMs could significantly enhance radiology practice despite current limitations and that ongoing education and transparent communication with patients are crucial. The webinar concludes with optimism for the future of AI in radiology, suggesting a promising time for new entrants to the field given the technological advancements expected to transform practice.
Keywords
large language models
radiology
artificial intelligence
RSNA
healthcare
HIPAA compliance
patient interaction
clinical documentation
technological advancements