Breast AI in Clinical Practice…Are We There Yet?
T1-CBR01-2024
Video Transcription
Good morning, everyone, and welcome to this morning's first session on breast AI in clinical practice: are we there yet? My name is Fredrik Strand. I will start the presentation, and then Linda Moy and Etta Pisano will follow. My talk will be about what we know so far from clinical trials. First, I will discuss AI for cancer detection, or CAD, then AI for supplemental imaging, and finally give a brief overview of ongoing AI trials.

So first, AI for cancer detection. I think it's important to recognize that there are different settings that we have as a starting point, depending on which country and which setting you're working in. For us in Sweden and Scandinavia, we have a double read followed by consensus discussions. There are two radiologists involved first, and any flagged case will go to the consensus discussion, where it's decided who will get recalled. There is also double read with arbitration, where only the discrepant cases are discussed, for example in the UK. And then we have the single read setting, which I think is most common in the US. And this affects how you would implement AI and what role it can take on.

When it comes to the initial read in a two-reader setting, you can implement AI as an independent blinded reader, so that it makes a binary flagging of whether the exam should be forwarded to the consensus discussion or not. The radiologist does the same thing, blinded to the AI. This is the design from the Screen Trust CAD clinical trial, which had a paired-reader design. You can also implement it in a triage way, so that depending on the AI score, the exam is forwarded to be read by either two radiologists or one radiologist: if it's a low score, only one radiologist; if it's a high score, two. And you can also choose whether or not to inform, and thereby bias, the radiologist by letting them know the AI score. This is the implementation in the MASAI randomized clinical trial and also in the Copenhagen implementation.

The results from these studies were that with AI plus one radiologist there were 22 percent more flaggings. That is because AI and radiologists are less correlated when it comes to false positives than two radiologists are, so there will be more false positives if you combine them. The possible advantage of blinded reading is that you keep radiologists more independent from AI, so they keep practicing their own skills, so to say. In the MASAI and Copenhagen implementations, we didn't see that much of an increase in flagged cases, but more flaggings of cancers and fewer of healthy women. The potential issue is that radiologists could become a bit more similar to AI, or subject to some kind of automation bias, when they are always informed of the AI assessment while doing their own read. And in all of these implementations, the initial read was always followed by a consensus discussion between two radiologists, consensus readers one and two, to decide who would get recalled. The main purpose of that is to reduce the false positives that would otherwise lead to unnecessary recalls. This was the same in both trials and in the Copenhagen implementation.
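As a rough illustration of these two integration patterns, here is a minimal sketch; the score scale and thresholds are hypothetical placeholders, not values from the Screen Trust CAD or MASAI protocols.

```python
# Minimal sketch of the two integration patterns described above.
# The score scale (0-1) and thresholds are hypothetical, not trial values.

def blinded_independent_flag(ai_score: float, flag_threshold: float = 0.5) -> bool:
    """AI as an independent, blinded second reader: it returns only a binary flag.
    An exam flagged by either the AI or the human reader goes to consensus discussion."""
    return ai_score >= flag_threshold


def triage_reading_path(ai_score: float, high_threshold: float = 0.9) -> str:
    """Score-based triage: high-scoring exams get two readers, the rest get a
    single reader; in both cases the reader is informed of the AI score."""
    return "double_read_informed" if ai_score >= high_threshold else "single_read_informed"


# Example: routing a single screening exam with an AI score of 0.93
exam_score = 0.93
to_consensus = blinded_independent_flag(exam_score)   # True -> consensus discussion
reading_path = triage_reading_path(exam_score)        # "double_read_informed"
```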
And the bottom line from these studies was that in the Screen Trust CAD trial, if you had AI alone as the only flagger, without any radiologist involved, you would get 2 percent fewer cancers detected, 50 percent fewer recalled women and, of course, no workload for the radiologists. With AI plus one reader, blinded to each other, there were 4 percent more cancers detected, 4 percent fewer women recalled, and 50 percent less workload, because one of the two radiologists wouldn't be necessary. In the MASAI trial, there was a triage to one or two radiologists, and they were informed, or biased, by AI. There was quite a large increase in cancer detection, 21 to 29 percent. A few more women were recalled, but I think that's warranted, so to say, to achieve this much higher cancer detection. The workload was 44 percent less for the initial read. So there are different ways of doing it.

Looking a little deeper into the triage approach, which seems quite effective, we can see that there are differences between the implementation in Copenhagen and the clinical trial in southern Sweden. First of all, in Copenhagen they did double reading for the top 30 to 50 percent of the exams, the highest scores, but in the MASAI trial they limited it to only 10 percent, which of course affects the reading volume. Also, in the MASAI trial they had at least the same or probably more cancer detection than in Copenhagen, despite a smaller proportion of double reading, which you would think would help you pick up additional cancers. But then you need to take into consideration that MASAI started at a 5.1 per thousand cancer detection rate, and Copenhagen started at a higher point. So the baseline before AI will probably make a difference: if you start at a low point, you're more likely to see an increase. A similar thing happened with the recalls, which increased in this trial and decreased in Copenhagen; but again, MASAI started at a low point for recalls, which makes an increase with AI more likely, and Copenhagen started at a higher one. I think you sometimes see that for other types of trials as well, like tomosynthesis trials: what happens when you implement a new technology depends on your starting point.

If we look at AI only, followed by consensus discussion, compared with the standard of care, this reddish bar, you can see that the AI-only read leads to fewer flaggings, meaning fewer consensus discussions, many fewer recalled individuals, so fewer women are worried unnecessarily, and there is no significant difference in cancer detection. That means just using AI seems to be quite a good approach if you combine it with something else to avoid the reduced cancer detection. And we also tried, in a simulation study that is not published, to see if we could bypass the consensus discussion. But even if you use two different AI algorithms to try to make the score more robust, skipping the consensus discussion, which acts as a filter to reduce false positives, means you would get an increase in false positives of about 40% with one AI, or 17% if you combine two different AIs. And 40% more recalled women is not something we could handle, for sure. And it's not good for the participants either.
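The talk doesn't say how the two algorithms were combined in that unpublished simulation, so the following is only a sketch of one plausible combination rule, requiring both hypothetical algorithms to agree before flagging, which is one simple way such a combination can reduce false positives.

```python
# Hypothetical combination rule for two AI algorithms. The actual rule used in
# the unpublished simulation mentioned in the talk is not described there.

def combined_ai_flag(score_a: float, score_b: float,
                     threshold_a: float = 0.8, threshold_b: float = 0.8) -> bool:
    """Flag an exam only if both algorithms exceed their own (made-up) thresholds.
    Requiring agreement tends to lower false positives relative to a single AI,
    at the cost of some sensitivity."""
    return score_a >= threshold_a and score_b >= threshold_b
```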
Instead, if you use the deferral approach, we could see that if we let the AI recall, without any radiologist, the exams with the absolutely highest scores, and then defer the next 5.5% to the consensus readers, we would have an unchanged false positive rate, the same number of consensus discussions, and quite a large increase in cancer detections.

So, to try to sum up what I think you can learn from the clinical trials: it seems to be a good idea to let the AI recall the exams with the absolutely highest scores, and also not recall the ones with the lowest scores. This has not been tested in a randomized clinical trial yet; it has been tested within the Screen Trust CAD trial, but not in a randomized design. Then, for the second-highest tier, you would do the triage approach where you inform, and bias, the radiologist so they know they need to look carefully: you have one radiologist, and if that one makes a negative read, another radiologist might do a second read to see if they catch something. For the next tier, you would do the same but with just one radiologist, similar to the implementation in the MASAI trial. If you're starting in a single-read setting where you have only one radiologist, you can do this in a similar fashion, except that there would be only the one radiologist, informed and biased by the AI, if you do it that way. But you could also decide to implement it the other way, because it will be very rare that you need to engage the arbitration reader.

An important thing, I think, is that you shouldn't always go with the categories that the manufacturer gives you, like the score categories 1 to 10, for example, or whatever cutoff points they use. This needs to be calibrated so that you have a desired positive predictive value or a desired cancer detection rate in each stratum. So you can decide that if the cancer detection rate is expected to be more than 20 per thousand, maybe those women should be recalled, or something like that. And for the exams that are not recalled and never seen by a radiologist, the AI-independent decision, you want the expected cancer rate to be very low, maybe less than 0.025 percent or some very low number. On an overall level, you want this to lead to a higher cancer detection rate and fewer false positives than the alternative standard of care. These are some evaluation metrics I would suggest from the trials.

One thing that is important is the stage shift. In the initial trials, you will get this increase in cancer detection rates, but that's just the first time; it's not likely to be sustainable over time. The number of cancers that arise every year will probably not be affected by our way of screening, so hopefully it will lead to a stage shift instead. In the steady state you will have more or less the same number of cancers detected, but at an earlier stage.
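Going back to the calibration point, one way a program might derive its own cutoffs from a local validation set, rather than reusing the vendor's categories, is sketched below. The target values and the data are made up for illustration; this is not a validated procedure from any of the trials.

```python
import numpy as np

def calibrate_cutoff_for_cdr(scores, labels, target_cdr_per_1000=20.0):
    """Lowest AI-score cutoff at which the exams above it still have an expected
    cancer detection rate of at least the target, estimated on local data.
    scores: AI scores from a local validation set; labels: 1 = cancer, 0 = not."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    order = np.argsort(scores)[::-1]                      # highest scores first
    sorted_scores, sorted_labels = scores[order], labels[order]
    cdr_per_1000 = 1000.0 * np.cumsum(sorted_labels) / np.arange(1, len(scores) + 1)
    ok = np.where(cdr_per_1000 >= target_cdr_per_1000)[0]
    if len(ok) == 0:
        return None                                        # no cutoff reaches the target
    return float(sorted_scores[ok[-1]])                    # score at the largest qualifying prefix

# Synthetic example: 10,000 exams with roughly 6 cancers per 1,000
rng = np.random.default_rng(0)
labels = (rng.random(10_000) < 0.006).astype(float)
scores = rng.random(10_000) + 0.5 * labels                 # cancers tend to score higher
recall_cutoff = calibrate_cutoff_for_cdr(scores, labels, target_cdr_per_1000=20.0)
```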
And now, AI for supplemental imaging. We have been working with risk models for a long time. Which one you would choose depends on the purpose of the risk model. If you do primary prevention, you might want a lifetime risk model; but if you do secondary prevention, as most of us do with screening, we need a shorter-term risk model.

The traditional risk models are not very good at short-term risk prediction, and they use very few image biomarkers; some have used density, but that's about it. We had this great paper in Radiology last year by Vignesh Arasu at Kaiser, where he compared different models for risk prediction: the BCSC model, the Mirai model from MIT, and different CAD models, because you can also use the score from a CAD model as a risk predictor, to show how well each predicts risk at different time horizons. This was a retrospective study. There were some differences, and the Mirai model had the highest AUCs in this comparison.

And we had this prospective trial, Screen Trust MRI, in Stockholm at Karolinska, where we used an AI model that we have developed to select the women, instead of using density. The screening workflow looked like this: they had to have a negative screening mammography, and they had already received a letter at home saying that we didn't see any cancer signs. Then, if they had a very high score, the top 7%, they were invited to the trial, and 1,300 accepted. They were randomized to have MRI or no MRI, and 559 underwent MRI in the end. At the moment we don't have follow-up of the women that didn't have MRI, but for the ones that had it, these are the numbers. We found 36 cancers in those 559 persons, and that's 6.4%, or 64 per 1,000 MRIs, which is very high. You can compare it to regular mammography screening, where we have half a percent, and the DENSE trial, with 1.65% using density as the selection mechanism. This means that the number of MRIs you would need to find one cancer is now very low, in the setting where we ran the trial. From this, I think it looks like the cancers are worth finding, but we will know for sure when we compare with the women randomized to not have MRI. The lesions were 13 millimeters in median size, some were multifocal, most were invasive, quite few were in situ cancers, and there was a mix of mostly ductal cancers, as you would expect, and so on.

If we look at the different components of the risk model, there are three components, so you can see what gives the signal to select a person: masking, risk, and cancer signs. Mostly, I think, it was the cancer signs component, which is similar to a CAD model, that gave a signal for most of the cases, but the risk component also helped quite a lot in identifying the women. The masking component, I think, contributed the least. That doesn't mean masking is not important; it could also be that our model is not as good as it could have been.

Here are a few examples. This is one case; all the images are from the same patient, the mammogram and the MRI: the CC and MLO mammographic views and the axial and sagittal MRI. The mammogram was assessed as negative by us, and this is the MRI image where you can clearly see the lesions. Another one here: you can see some densities in the mammogram, but we didn't catch that at screening, and it was pretty apparent when we did the MRI. Same here. And this was a really large area of mainly non-mass enhancement, covering a large part of the breast, that we caught in this trial.
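To put those yield figures in rough perspective, the number of MRIs needed to find one cancer is approximately the reciprocal of the supplemental cancer detection rate. A back-of-the-envelope sketch using the rates quoted in the talk:

```python
# Approximate number of supplemental MRIs needed to detect one cancer,
# using the detection rates quoted in the talk.
def mris_per_cancer(cancers_per_1000_mris: float) -> float:
    return 1000.0 / cancers_per_1000_mris

ai_selected = mris_per_cancer(64.0)        # ~16 MRIs per cancer (AI-selected, Screen Trust MRI)
density_selected = mris_per_cancer(16.5)   # ~61 MRIs per cancer (density-selected, DENSE, 1.65%)
```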
A limitation: this was based in a screening system at Karolinska where we don't use AI CAD in the initial screening, so there was no CAD to catch these lesions before we applied our model. We also work with a two-year screening interval; if you have a one-year screening interval, there are probably fewer undetected cancers left to find, so to say. And we had relatively low recall rates, which might leave a few more false negatives as well. So if you were to implement this in a normal U.S. setting, I would expect the numbers to be not 64 cancers per 1,000 MRIs, but a bit lower.

The next step is probably to think about how you would integrate this risk model, or any risk model, into existing paradigms and guidelines. One way would be to help deal with the women that get the density notification letter and try to be more nuanced in the way we approach that. What would you advise? Maybe just a reread by the radiologist even before that, or a contrast-enhanced MRI or mammogram, short-term follow-up, or some way of handling it, rather than just giving them the letter and a tomosynthesis exam, for example. And it could also be connected to high-risk MRI, if it's possible to demonstrate that the lifetime risk is more than 20 percent. I will just show this slide at the end now with the planned or ongoing AI trials. You can see that there is another risk trial starting in Stockholm, not by my group, but by Per Hall's group at KI, Karolinska. And there is also another risk trial here in the U.S., at UMass, using MRI. So I think we have a lot of exciting results to look forward to for the next RSNA. Thank you.

And I'm going to be talking today about the development, testing, and regulatory concerns in breast imaging AI. My talk is about the U.S. environment for regulation; I couldn't cover the whole world, of course. The important points about AI are that we have to prove both safety and efficacy to get FDA approval for sale in the U.S., and there are over 20 AI products in breast imaging that have been authorized for sale to date. An important point right at the beginning is the question of whether autonomous AI is allowed in the United States. And the answer is a resounding no, because of MQSA. It requires the involvement of a qualified person to read the mammograms, mostly radiologists. And so we are not permitted in the U.S. to use autonomous AI.

The ways that software is approved for use in mammography fall into these five categories. Fredrik has already talked about several of them, so I'm not going to spend a huge amount of time on them myself, but I'll just say those are the categories: triage, detection and/or localization, diagnosis, a combination of detection and diagnosis, and acquisition optimization such as positioning. There's also density, but I'm talking about interpretation. So triage is a very obvious one: which things should I read first? If you're in a practice that's very busy and gets behind on screening, you might consider investing in this, because the machine tells you which mammograms are more likely to be positive, and you can read those first. This is a big problem in many countries, not so much in the U.S., but as we get fewer and fewer radiologists reading breast imaging, it may become a bigger problem. Obviously, finding things on the mammograms is another way to use AI, and figuring out whether a finding is malignant or not is another software application.

And so now I'm going to switch over and talk about what the FDA takes as evidence that a technology is effective. There are three basic methods. First, lab tests: cases that you've collected, with cancers and non-cancers, where you run the software on them independent of any readers.
Then there's reader testing, with a mixture of benign, malignant, and negative cases, and having readers read those cases. And then, of course, clinical trials.

So, standalone performance testing: tons of cases, and then you assess diagnostic accuracy, sensitivity, specificity, positive predictive value, et cetera, from the AI alone. That's a baseline that is required before any other FDA tests. Of course, standalone performance testing has its limitations. Were the cases properly classified? Were the training and testing sets independent and representative of the population that you're intending to use the software on? Will the results generalize, and what sorts of data were missing? Were there people, or types of breasts, or risk factors that were not represented in the test set?

Reader studies are a cut above, a little bit more information. You do them with and without AI; Fredrik has already gone into the various ways you can do interpretation with AI. You can measure the same metrics, and you can also look at group performance data. And, of course, the same kinds of limitations apply to these tests. Are the radiologists representative? Since no patient is being impacted by the results of the test, are they performing the way they might perform in real life? And these studies often require the radiologists to use a scale. This is a screening exam, and we don't use anything but BI-RADS 0 and 1 or 2 in a screening exam, so in order to really show the value of AI, we have to use a bigger scale than two or three points. That's a little unusual for radiologists; they have to be trained in the scale, so that can also harm the results. And in addition, the case sets in these reader studies are always enriched for cancer. In a screening population, it's somewhere between 4 and 10 cancers per 1,000 women; we can't do reader studies of 10,000 mammograms to get enough cancers to tell whether readers can find them or not, so we enrich the sets with cancers. And the question is, how does that affect radiologists' performance?

So clinical trials are the gold standard. Fredrik has already described several of them, but I will just say they are the most expensive way to evaluate a technology. Lots of patients are required, especially for a screening tool. Data entry requires a lot of work, and then there's a lot of oversight, central quality control, management, and analysis. It costs a lot of money, they take a very long time, and the sample sizes are gigantic in screening studies. And are they representative of the U.S. population? Are the sites representative? Are the physicians representative? All these things are things you have to think about if you're going to run a clinical trial. And are the readers behaving the same way? They know they're being watched; are they performing the same way they do in clinical practice?

So, you know, we have been using these tests; that's what the FDA has been doing to authorize sale in the U.S. There are other methods. You could do real-world evidence studies, and that has its drawbacks as well.
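Whichever evidence pathway is used, the accuracy metrics mentioned above come down to the same confusion-matrix arithmetic. A minimal sketch, assuming binary AI outputs and ground-truth labels are available for a test set:

```python
def standalone_metrics(ai_positive, cancer):
    """Sensitivity, specificity, and PPV for a binary AI on a labeled case set.
    ai_positive, cancer: parallel lists of booleans, one entry per case."""
    tp = sum(a and c for a, c in zip(ai_positive, cancer))
    fp = sum(a and not c for a, c in zip(ai_positive, cancer))
    fn = sum(not a and c for a, c in zip(ai_positive, cancer))
    tn = sum(not a and not c for a, c in zip(ai_positive, cancer))
    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else float("nan"),
        "specificity": tn / (tn + fp) if (tn + fp) else float("nan"),
        "ppv":         tp / (tp + fp) if (tp + fp) else float("nan"),
    }
```

Note that sensitivity and specificity do not depend on prevalence, but PPV does, which is one reason results on cancer-enriched reader-study sets do not translate directly to a screening population.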
I'll give you one example where an AI product did not perform as expected. It was not in breast imaging; it was in the field of neuroimaging, looking for stroke, looking for large vessel occlusion. This slide was provided to me by the FDA, because after the product was released into the wild, there was some thought that it was misperforming. Well, it turns out people were using it off-label. When it was evaluated, it was seen that non-radiologists were using the interpretation provided by the AI system to decide whether or not to treat patients for stroke, which is not what it was cleared for. It was cleared just to provide triage: you should read these first. And there was a real problem with finding posterior vessel occlusions with this software. Patients were getting mistreated or missed; the cases were being missed if they had a posterior vessel occlusion. A lot of times, the software doesn't necessarily perform the way it was predicted to perform once it's out in the world, partially because of human error, but for a lot of other reasons as well.

We know this from mammography, from the way CAD performed. This is Connie Lehman's paper in JAMA Internal Medicine from 2015, which showed that even though CAD did great in lab tests and great in reader studies, it didn't change any outcomes for patients. And so we spent a lot of money and a lot of time for many years using CAD, and we really didn't change diagnostic accuracy or patient outcomes. So we have to be very careful what we do with AI once it's out in the wild.

So perhaps we should be using real-world evidence at the beginning to figure out if it works or not. One can imagine an experiment where AI is installed in clinics, and you can set up multiple readers, one reader, or just AI running autonomously in the background, to see how it performs with or without readers in a clinical setting. The patients would be protected because they get standard of care. Perhaps a second radiologist could read with AI, and that person's read would impact the patient. That's the main difference between a reader study and a real-world evidence study: you actually have an impact on patients. All of us know that we take that very seriously; we don't want to harm the patient, either with excess callbacks or a missed cancer. So we're going to behave differently in that kind of setting than we would in a reader study.

So I've already talked about this. How is it different exactly? Let me give you some examples of how you might do the study. You know how clinical trials work: very expensive, very time-consuming. In a real-world evidence study, patients would check in at your clinic, and you would ask them if they want to opt out of AI; most patients are going to opt in, from the evidence we've seen from people doing these kinds of studies. You would do your usual work; you'd read your mammograms the way you normally do. Maybe you'd get a second person to help, which would be easier in Europe than it is here, because we only have one reader. And then you'd have to decide if you wanted to triage the patient or send the patient for additional workup if either the reader or the AI called it positive. Or you could set it up the way that Fredrik just discussed. There are lots of different ways you could set it up, but you could just do your clinical work without much attention to randomization or data entry; you'd have to extract the data in an automated fashion, something like this.

Given what you heard today, are you interested in using autonomous AI in your practice? Remember, it's not allowed. But let's say it actually was as good as it has performed in Europe; are you interested in using it? I'd like everyone to vote. I gave you some choices here, because I've talked to people who've said, no, I'm not going to use it, because the vendors need to take the liability if they're going to be reading the cases.
So I put that one in. I'm going to show the results; go ahead and vote. I think you're all very wise to say maybe, I must say, because I think we don't have enough evidence yet that it's helpful. The study in Europe was very encouraging, so that was interesting, but I think that's exactly where I fall as well. We'll see how things play out. So again, it's not permitted under U.S. federal law at present.

Now I'm going to briefly talk about what's going on in the UK. They are ahead of us in testing this, as you just heard; a lot of studies are going on in Europe. And there is a study that's going to be launched soon in the UK. Fiona Gilbert told me about it; she's a professor at Cambridge University. They're going to cluster-randomize women, by site, to AI for triage, AI as the second reader, or a control group with two readers, which is the standard of care in the UK. So that's a very exciting study, and it's essentially real-world evidence: a randomized study by cluster. The practices will continue the way they always do at those sites, except with AI included for those two purposes, in a way randomized by time, I believe, and by site.

So now I'm going to switch gears briefly to tell you about a program that has just been announced. I was recruited by ARPA-H to lead an initiative to help improve clinical trials, which, as I just told you, are too long, too expensive, and not representative of the U.S. population. So ARPA-H has launched an initiative called ACTOR. What if we could launch trials that would be accessible to 90% of all Americans within 30 minutes of their home? I'm a believer that the reason technologies take so long to disseminate is that people don't have access to them. We radiologists are very sophisticated about technology, but everyday doctors, primary care doctors, internists, et cetera, aren't as open to technology changes and really feel uncomfortable. So if we could put trials closer to where people get care, not just the patients would benefit by having access to technologies earlier, but the doctors and the care providers would as well.

So we're launching an initiative to help improve clinical trials with technologies. There are going to be three phases to the program, and the first phase is developing tools. That means extracting data in an automated fashion; as I said when I talked about clinical trials, that would really speed things up. You don't have to stop and enter data, you don't have to have an RA there entering data, you just pull the data out of the record. You can curate the data, and you can improve auditing of the data in an automated way. In addition, getting women or men, any patient, into a clinical trial is very labor-intensive: identifying the eligible subjects, developing culturally appropriate consenting, and engaging the patients to keep them in the study. So that's the first phase, which was announced yesterday. Then we're moving on to a clinical trial, and that's where radiologists come in. This is all in phase two, which has not been announced yet but which we're intending to announce in the spring. It will involve clinical imaging centers, and it will almost certainly involve the FDA and AI. We're going to be testing AI products in a real-world-evidence, pragmatic trial design. I put this map up to show you how disseminated radiologists and imaging centers are in the U.S.: we are within 30 minutes of 91 percent of all Americans. Pharmacies are always mentioned as well, 30 minutes for pharmacies.
I'm sorry, 90 percent within 30 minutes for pharmacies. So we actually are more disseminated, more accessible, than pharmacies. And we also are great beta testers for technology. So probably the first trial will be AI in breast imaging clinics for breast cancer screening, and the second trial will likely be in pharmacies and likely be for diabetes testing.

Okay. So I was going to talk about clinical implementation, and I just realized it's been a decade that we've been presenting AI research on breast imaging here at RSNA, but the clinical penetrance is still low. So I'm going to do a brief review, after Fredrik, of AI for screening mammography, specifically standalone AI and triage, and end by talking about the top five hurdles to clinical implementation of AI in general.

So let's look at the AI market. You can note the exponential increase in AI devices approved by the FDA. Last year it was 171 devices; now, with this year, we're close to 1,000 products. So the issue is, which of these products will be widely adopted into our workflow? To answer that question, let's talk about crossing the chasm, a concept for visualizing the adoption of a new technology over time. In 1991, Geoffrey Moore wrote Crossing the Chasm and highlighted how disruptive innovations are adopted. The process follows a bell curve. You have the early market, comprised of consumers that are prepared to take risks with new products: the innovators, tech enthusiasts who are the first to buy the product, and then the early adopters, individuals or businesses who use a new technology before others. Then you have the early majority, customers who purchase new technology after the innovators and early adopters have proven the benefit. Next, the late majority, who are typically older, less affluent, and less educated than the earlier segments of the technology adoption lifecycle. And finally the laggards, consumers who avoid change and do not adopt new technologies until the traditional alternatives are no longer available.

So the crucial junction is this gap, or chasm, between use by a few visionaries and acceptance by an early majority of pragmatists. And that's where we are now; we're still in this chasm. Now I've added the percent market penetrance in green, and the chasm is the inflection point. Of course, vendors want their product to have a 100% share of the marketplace, but often the expectations of the performance of the AI models and the actual product that's delivered are misaligned between what the radiologists need and other market forces. So I'd say part of the issue is really identifying the use case; it has to be specific to your practice. The critical question is: who is the product being designed for, or is it a solution in search of a problem?

So now I'll discuss AI and screening mammography, really focusing on the use case. Of course, we all want to improve our accuracy, finding more cancers with a lower recall or false positive rate, ideally getting the win-win of also having a decreased interpretation time. Our European colleagues not only want that, they would also love to eliminate the second reader. So I think the first hurdle is, well, does AI find more cancers? And the unequivocal answer is yes. These are four large studies from four different vendors, and they show that radiologists' diagnostic accuracy is higher when they use AI compared to reading by themselves, with AUCs of 0.80 to 0.89.
So the take-home point here is what Curt Langlotz keeps saying: radiologists will not be replaced by AI, but radiologists who do not use AI will likely be replaced by those who do. This is a great study that Fredrik mentioned earlier, from the UK PERFORMS scheme, where they assessed human and AI performance: over 500 readers and one AI reading enriched case sets with 70 malignancies. What you'll see is that the AUC of the AI is 0.93, higher than the average reader at an AUC of 0.88. But here is the important take-home point: the study found that radiologists need to adjust the AI threshold and not automatically use the default threshold set by the developers. The developer threshold is at that yellow arrow, and it's set to have a low recall rate and a high specificity of 91%, with the trade-off that sensitivity drops to 84%. The red and green arrows are set at the readers' sensitivity and specificity, respectively. So the take-home point is that the recall threshold can significantly affect the reader's performance and should be set at the screening program where the AI will be deployed.

This is a great study from Fredrik Strand's group; he already touched upon it. 55,000 women, 269 cancers found, comparing conventional double reading with three scenarios: double reading with a radiologist plus AI; single reading, what I call standalone AI; and triple reading, two radiologists plus AI. They found that all three of these strategies were non-inferior to conventional double reading, and more than that, one radiologist plus AI and two radiologists plus AI were superior to conventional double reading.

Now let me touch upon AI for workload triage. The idea here is that if there is a very low level of suspicion, the mammogram may be read only by the AI system, not by the radiologists, and everything else would be read with the new convention of AI plus radiologists. The study we published in Radiology showed that, in simulation, this kind of triage can lead to a significant workload reduction of about 47% to 72%. So now there are lots of, unfortunately only small, retrospective studies and some simulations showing non-inferior sensitivity for cancer detection, and, what's nice, we're also seeing improved specificity, so maybe we can have our cake and eat it too. But the consensus is what Etta said: we're not quite ready for this yet. We're also not quite ready for any AI product that uses generative AI or large language models; none of those have been approved by the FDA.

The evidence for standalone AI was in this large meta-analysis that we published; I see many co-authors here. Again, the basic point: for AI on digital mammography, we're ready for the clinical trials that Fredrik so beautifully illustrated, and for standalone AI on digital mammography we found it performed as well as or better than individual radiologists, but there were really insufficient numbers for AI on tomosynthesis. Let me break down the numbers for you. The AUCs were significantly higher for standalone AI than for radiologists in six reader studies, 0.87 versus 0.81, but not for the historic reads. There are only four DBT studies, but those showed higher sensitivity and lower specificity for standalone AI. And that turned out to be true for all radiologists, regardless of the type of mammogram that we use.
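Both the PERFORMS point and the meta-analysis results above hinge on where the operating threshold is set. Below is a rough sketch of how a program might derive its own threshold from local data rather than using the vendor default; the target values are made up for illustration, and this is not the procedure used in any of the studies cited.

```python
import numpy as np

def threshold_for_target_recall_rate(scores, target_recall_fraction=0.03):
    """Operating threshold such that roughly the target fraction of screening
    exams would be flagged, estimated from a local sample of AI scores."""
    return float(np.quantile(scores, 1.0 - target_recall_fraction))

def threshold_for_target_sensitivity(scores, labels, target_sensitivity=0.90):
    """Operating threshold at which the AI reaches the target sensitivity on a
    local validation set (labels: 1 = cancer). The (1 - target) quantile of the
    cancer scores is the highest threshold that still catches that fraction."""
    cancer_scores = np.asarray(scores)[np.asarray(labels) == 1]
    return float(np.quantile(cancer_scores, 1.0 - target_sensitivity))
```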
So this is the MASAI trial, a large randomized clinical trial. The control is standard double reading without AI. In the intervention group, using their system, which gives a score of 1 to 10, the exams with scores of 1 to 9, the lower probability-of-malignancy scores, underwent single reading plus AI, and the most suspicious exams, with an AI score of 10, underwent standard double reading plus AI. Here are their results, comparing the control arm without AI and the intervention arm. You can see a similar cancer detection rate, almost statistically significant, and really no difference in the recall rate either, but a potential reduction in workload of about 44%. The authors concluded that AI-supported mammographic screening had a similar cancer detection rate compared with standard double reading, with the potential to substantially lower the workload for us.

Now, you've all heard about this, but I just wanted to think outside the box, because I do think clinical utilization will depend on the use case. Here's one that we have been doing a lot of preliminary work on: what if you could decrease the variability among the radiologists in your practice? We searched the ACR NMD, the National Mammography Database, 35 million mammograms. I'm just showing you two metrics, recall rates and cancer detection rates, and you can see there are actually lots of radiologists who are not practicing within the expected range based upon multiple outcome studies. Here's something we've been playing with at NYU. We said, well, what if I read a mammogram and recall it, but the AI probability is really low? If we know that, and a second reader comes in, can they override my recall? We talked about that; there are a lot of issues about medicolegal risk, obviously, and the other issue is that we really felt uncomfortable double reading another colleague. So this is what we've done instead. For the screening mammograms that are read as normal, the first reader has the AI but doesn't know where to look; they just know some basic information. Then we have a second reader who knows there is a high probability of malignancy and knows where to look. And doing that, we've doubled our cancer detection rate with both single reading and double reading. So it's interesting, I think, where we can go.

Now I'm going to close this talk by talking about the top five hurdles to clinical implementation of AI, from number five to number one. The number five issue is really the lack of well-annotated, publicly available data sets. What we're worried about here is various shortcuts: the AI getting the right answer, but for the wrong reason. For instance, what if the mammography units at your cancer center are all Hologic, and at the cancer center you see lots of high-risk or newly diagnosed patients? Well, guess what: your AI can be trained to recognize subtle features of your Hologic mammograms and be more suspicious of those than of, let's say, the ones from your community practice where you have GE tomo units. So the expression we use is garbage in, garbage out. You really need good, clean data, as well as extensive external validation using data that your AI product hasn't seen before.

The other issue is this concept of continuous learning. Let's say you are in that innovator group and you're already using AI in your practice. All AI systems, including those that are commercially available, are what we call brittle. That means that even though the system is learning over time, there can be drift in the model's predictions, and it may become unreliable. You're thinking, well, that can't happen to me. But I'll tell you, how many of us have been in practices that have consolidated and merged with many more centers and new patients? Or you can have a new disease, like COVID, right? So what you really need to make sure of is: are the predictions comparable with the prior versions? At NYU, we have an AI governance team that's divided into two committees. The first is what we call the AI value committee: using this new AI product, is it worth our time, is it slowing us down, is it easy to implement? The second is our AI clinical implementation committee. Here we are looking at model drift, making sure that the AI is safe and effective over time. So I think of this as post-market surveillance, just like what we do for drugs.
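As one concrete way such a drift check might be run, the sketch below compares the distribution of AI scores from a current period (or a new model version) against a reference period using a population stability index. The index and its rule-of-thumb cutoffs are a common convention in model monitoring, not NYU's actual procedure.

```python
import numpy as np

def population_stability_index(reference_scores, current_scores, n_bins=10):
    """PSI between two AI score distributions; larger values mean more drift.
    Common rule of thumb: < 0.1 stable, 0.1-0.25 worth watching, > 0.25 investigate."""
    ref = np.asarray(reference_scores, dtype=float)
    cur = np.asarray(current_scores, dtype=float)
    # Bin edges from reference quantiles (assumes reasonably continuous scores).
    edges = np.unique(np.quantile(ref, np.linspace(0.0, 1.0, n_bins + 1)))
    cur = np.clip(cur, edges[0], edges[-1])           # keep current scores in range
    ref_frac = np.histogram(ref, bins=edges)[0] / len(ref)
    cur_frac = np.histogram(cur, bins=edges)[0] / len(cur)
    ref_frac = np.clip(ref_frac, 1e-6, None)          # avoid log(0) / division by zero
    cur_frac = np.clip(cur_frac, 1e-6, None)
    return float(np.sum((cur_frac - ref_frac) * np.log(cur_frac / ref_frac)))

# Example: compare this quarter's scores against last year's baseline
rng = np.random.default_rng(1)
baseline = rng.beta(2, 8, size=5000)                  # synthetic reference scores
this_quarter = rng.beta(2, 6, size=2000)              # synthetic, slightly shifted scores
psi = population_stability_index(baseline, this_quarter)
```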
The number four hurdle is addressing AI and bias. Most training data is pretty homogeneous, but in real life our patients are quite different. The bottom line is that models may not attain the performance seen on the training data, because they fail when operating outside the training data range. So you need a way to make sure that the training data really captures real-world heterogeneity; otherwise you'll have unintended biases in your AI system. That brings me back to: okay, what's publicly available? This was from Curt Langlotz at Stanford a few years ago. They looked at the geographic distribution of these data sets beyond radiology; pathology was included in what they examined. What they found was a lack of geographic diversity in the data used to train these AI models, mainly coming from three states, California, Massachusetts, and New York, while 34 of 50 states do not have any representative data. And it's the same issue with international data sets: the top five countries are the U.S., China, the U.K., Germany, and Canada.

Number three is really the importance of having clinical trials, ideally randomized clinical trials. I'm going to skip this slide. The reason we want this is, as I showed you, that most of the FDA-cleared algorithms rely on preliminary evidence. They need to be validated, and they need to be generalizable, working across multiple patient populations. We want to make sure that the initially accurate predictions are going to be maintained, we want to identify unsuspected biases, and we want to make sure that AI models are improving clinical outcomes. Is the cancer detection rate maintained? In subsequent rounds, you should have a lower false positive rate. So that's what we're looking for.

Number two, which is rarely discussed, is your ROI: what's your return on investment for buying this new AI system? For this, we can look at what's called the Gartner hype cycle, and I'm going to show it using traditional computer-aided detection. CAD was introduced early on with lots of hype, and you'll see a steep drop in radiologist satisfaction using CAD; really, there were lots of false positives. Now I'm superimposing the model of reimbursement, and you can see that the steep drop in reimbursement parallels the dissatisfaction with CAD. A few years ago we moved to bundled payment, so we don't even get paid for using CAD anymore. So this raises the issue: what's going to be the reimbursement for AI tools?
Out of the thousand products that are FDA-approved, I can count on one hand the ones we actually get reimbursed for. So now I have the Gartner hype cycle running alongside the diffusion-of-innovation curve, and you can see the chasm reflects how much disillusionment the early adopters have with a new technology.

And now, talking about education: we all have to be informed, our trainees, radiologists, clinicians, and patients. You need to get to know the AI. This is something we haven't talked a lot about: looking at how the radiologist will use the system. I want to touch on two basic principles. One is, how do you know whether your AI prediction is going to be incorrect? That's automation bias. And then, quickly, uncertainty quantification.

So, automation bias. We've had many talks about this behind the scenes: deskilling of radiologists because they become over-reliant on a technology. In a study published in Radiology, 27 radiologists were trained on an AI that worked, and then did a reader study where they were given incorrect predictions on purpose, a normal mammogram but with a heat map showing something suspicious in red. The results showed that inexperienced radiologists were more likely to follow the incorrect suggestions than the more experienced readers. Then the other issue is what I call uncertainty quantification: can you really trust this prediction? It turns out that AI systems can sometimes be really confident that there is, let's say, a stroke on the CAT scan, but at other times the system isn't so sure and has a low level of confidence. I think we really need this information in order to know how much to trust an AI prediction. So imagine we are the radiologist in the cockpit. For any AI model to work, I believe you need it evaluated on real-world prediction problems, so that you can understand the limits of the product you're using. So, bottom line: without clear methods for understanding how an AI model makes its predictions, it's really unlikely that AI decision support will provide optimal benefits to the radiologist.

Just to summarize, here are what I think are the top five hurdles to implementation. On that note, thank you for your kind attention. Thank you.
Video Summary
The session on breast AI in clinical practice explored how AI is being integrated into cancer detection and supplemental imaging, alongside ongoing AI trials. Fredrik Strand discussed differences in AI application based on geographic settings, highlighting approaches like AI as an independent reader or as part of a triage to optimize radiologist involvement. In studies, AI alone sometimes detected fewer cancers but required fewer recalls and less workload, while combining AI with radiologists improved detection rates and reduced workload. Results varied based on initial cancer detection rates and recall levels.

Etta Pisano emphasized regulatory concerns, notably in the U.S., where autonomous AI isn't allowed under MQSA regulations mandating radiologist involvement. Proof of efficacy and safety is crucial for FDA approval, and AI's real-world performance sometimes deviates from initial expectations, as seen in past implementations like CAD (computer-aided detection).

Linda Moy discussed hurdles in AI adoption, noting the exponential increase in FDA-approved AI devices but highlighting market penetration challenges. She suggested identifying specific use cases tailored to practices and potential AI benefits in improving accuracy and reducing workload. However, hurdles such as biases in training data, clinical trial needs, reimbursement concerns, and educational requirements were also noted as barriers to broad AI implementation in breast imaging.
Keywords
breast AI
cancer detection
AI trials
radiologist involvement
regulatory concerns
FDA approval
AI adoption
training data biases