AI in MSK - What You Need to Know (2021)
T2-CMK05-2021
Video Transcription
Good morning. My talk is on accelerating MSK MR using deep learning reconstruction. So first the question is, why is it important to accelerate MR? I probably don't need to tell you that, but there are a number of reasons. The first is increased patient comfort. The second is pediatric imaging. Obviously we have to sedate or anesthetize a lot of our pediatric patients; if we could go faster, hopefully at some point at CT speed, we might not have to do that. We would have improved image quality, because if you can go faster for each sequence, you're going to have decreased motion artifact. This one is maybe not so important for the United States or some of the countries in Europe where there are lots of MR scanners, but there are a number of countries where there aren't a lot of MRs and there's not a lot of access to MR imaging, especially in low- and middle-income countries. So if we were able to go faster, it would be great, because we would enable many more patients to have the opportunity to get MR imaging.

So why is it a problem? Traditionally, image quality versus speed has been an equation that we haven't been able to solve: image quality is inversely proportional to imaging speed, so as you increase imaging speed, you decrease image quality. Now, there are a number of ways that people have tried to overcome that. The first was parallel imaging, developed in the late 90s and early 2000s. My guess is almost all of you use parallel imaging routinely. Here's an example of parallel imaging with an acceleration factor of two; it looks great. The problem is that you're generally limited to an acceleration factor of two. If you go faster, you get something like this: you decrease your signal-to-noise ratio, you start getting banding artifacts because the coils aren't truly independent of each other, and we're locked at an acceleration factor that isn't fast enough to really do what we want to do. More recently, we've been using compressed sensing. Compressed sensing takes advantage of the sparsity of data within images. We can get images, and we've done some studies showing they're interchangeable, but as you see on these images, they don't look like normal MR images; they have this regularized or normalized appearance, and we just haven't accepted them enough to introduce them into routine clinical practice.

Now with AI, the question is, can AI change that equation? Can AI allow us to go faster and maintain at least the same, if not better, image quality? There are a number of techniques that have been used in AI. I don't have time to go through all of them, so I'm going to concentrate on the one that we've worked on most, which is undersampling, but there are also super-resolution and denoising, and I'll try, if I have time, to show examples of what those look like as well. Just to remind everybody, what we usually hear about in AI is pattern recognition: you acquire images in the usual way, you transform them into the image domain, and then you put them through a neural network. What we're trying to do is actually acquire the images differently. In our case, we undersample, and as you know, if you undersample, you acquire fewer phase-encoding steps, so you go faster. Then what we do is we put that k-space through a neural network and we try to get images that look as good as our fully sampled images. This is what it looks like. This is an example where we undersampled; you can see that with conventional reconstruction you have a lot of artifact. We know it's a knee, but nobody would read this image.
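To make the undersampling idea just described a little more concrete, here is a minimal, hedged sketch in Python. It is not the group's actual reconstruction pipeline (which works on multi-coil raw data); it only illustrates, on synthetic data, the loop described in the talk: drop phase-encoding lines in k-space, reconstruct, compare with the fully sampled reference, and reduce the reconstruction error. PyTorch and all sizes and layer choices here are assumptions for illustration.

```python
# Minimal sketch (not the speakers' pipeline): retrospectively undersample k-space,
# reconstruct naively to show aliasing, then train a small CNN to push the aliased
# image toward the fully sampled reference. Synthetic data throughout.
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in for a fully sampled 2D image (e.g., one knee slice).
reference = torch.rand(1, 1, 128, 128)

# Go to k-space and keep only every 4th phase-encoding line (4x nominal acceleration),
# always retaining a fully sampled center that carries most of the contrast.
kspace = torch.fft.fft2(reference)
mask = torch.zeros(128)
mask[::4] = 1.0                  # regular undersampling
mask[60:68] = 1.0                # fully sampled center lines
undersampled_k = kspace * mask.view(1, 1, 128, 1)

# Conventional (zero-filled) reconstruction: aliased, "nobody would read this image".
aliased = torch.fft.ifft2(undersampled_k).abs()

# A toy convolutional network that tries to remove the aliasing in the image domain.
net = nn.Sequential(
    nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 16, 3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 1, 3, padding=1),
)
optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)

# The loop described in the talk: reconstruct, compare with the fully sampled
# reference, measure the reconstruction error, update, and repeat.
for step in range(200):
    optimizer.zero_grad()
    recon = net(aliased)
    loss = nn.functional.mse_loss(recon, reference)   # reconstruction error
    loss.backward()
    optimizer.step()
print(f"final reconstruction error (MSE): {loss.item():.5f}")
```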
We put that undersampled k-space through a neural network and reconstruct the images. We then compare them to the reference image, the fully sampled image, we look at the reconstruction error, and we keep doing that until we get an image that we like. Here's an example. Again, we start here, we put it through the neural network the first time and get something that looks like that, we put it through again, and we keep doing it until we get an image that looks equivalent to our fully sampled image.

When we started, this is an example of a parallel imaging fluid-sensitive image. You can see that we know it's a knee, and you can see the bone marrow edema, but you can also see the signal-to-noise, which is really not very good. None of us would want to read that image. What we did is we took the same data and put it through a neural network, and we got an image that looked like this. This made us very excited when we first saw it. You can clearly see the edema, and you can see the anatomy much better.

The question was, this is easy, right? Not really. Here's an example, again, of a fully sampled image. You can see the small radial tear involving the medial meniscus. We went ahead and accelerated by a factor of four; it looked really good. We said, let's go even faster, we want an acceleration factor of eight. The problem is that when we did that, the images don't look bad, but you can see that we no longer see the meniscal tear. We have to figure out the optimal settings, how fast we should accelerate and all the other parameters, to really allow the AI to show us the pathology we need to see.

We've now done a number of studies. We did a retrospective study looking at 108 consecutive knee MRs that we published in AJR. We've recently completed a prospective study, with prospective undersampling of 170 knees for internal derangement. All were done with four times nominal acceleration, with a total acquisition time for all of our sequences between four and six minutes depending on the size of the patient. In all cases, the deep learning accelerated images were judged not only to have equal image quality compared to the standard images, but actually to have better image quality, most likely because of decreased noise as well as the removal of artifacts. And importantly, they were judged interchangeable for all 19 features that we examined: menisci, ligaments, cartilage, and bone in the knee.

Here are some examples. Here's a proton density sagittal image. On the left is our standard clinical image; on the right is our four times accelerated image, and I would argue that these images are interchangeable. It's hard to differentiate them. Another example, this is a fluid-sensitive case. As you all know, this is a Segond fracture with an ACL tear. Again, the image on the left is our conventionally reconstructed image; the image on the right is four times accelerated and undersampled. And again, I would argue that this image is completely interchangeable and the image quality is at least as good as our standard images. Another example showing we can do this prospectively, not just retrospectively: again, the image on the left is standard, the image on the right is four times accelerated. The images look excellent and you can see the changes within the meniscus.

Can we do this in other joints, or is this only the knee? So this is an example, and this is a study that's being led by Jan Fritz, our head of MSK imaging now at NYU.
And you can see, again on the left, the conventional image; you can see the rotator cuff tear. The image on the right is four times accelerated. Again, interchangeable. So we're really excited that this technique has wide applicability. We're currently doing a multi-site, multi-vendor trial with GE, Philips, and Siemens at multiple sites to see how generalizable our technique really is.

I do want to mention some other ways that AI is being used to increase acceleration and improve our MR imaging. This is a technique that Siemens has, called Deep Resolve Sharp. This is an SMS 2, PAT 2 accelerated turbo spin echo image. On the left is the conventional reconstruction; on the right is the deep learning reconstruction using their technique. Again, this slide was lent to me by Dr. Fritz. You can see the improved resolution on these proton density images, both in the menisci as well as in the bony trabeculae, on the AI reconstructed image. Another example, an axial image looking at the patellar cartilage; again, you can see the improvement using the AI technique. This is another type of AI reconstruction, actually a GE technique called AIR Recon DL. This is a k-space-based reconstruction. It also decreases noise, removes artifact, particularly Gibbs ringing artifact, and increases sharpness. These are sagittal T2 fat-saturated images with ARC, or iPAT 2. Again, you can see on the left the conventional reconstruction, with all of the noise; on the right is the same data, but now put through the deep learning reconstruction, and again you can see the marked improvement in the signal-to-noise and the image quality using the AI reconstruction.

So what I've tried to show in the last eight minutes is that AI and deep learning appear to be a very promising way to finally break the equation that we've been struggling with for the past 20 or 30 years, and that is to allow us to go faster in MR imaging. Right now we can do it in four to five minutes. The hope is that with even more techniques we can get it down to maybe two minutes or even less for an entire knee examination, making it perhaps much more available to patients and maybe even replacing plain films as the imaging modality of choice for certain types of musculoskeletal imaging. Thank you very much.

Thank you, Michael, for that great presentation. Let's see if I'm coordinated enough to switch the... All right. Ready now for something completely different. What I'm going to try to do is to talk about how AI has changed my career and changed my life by imaging sarcopenia. So what I'm going to try to do is show how sarcopenia is kind of a use case of how AI can improve patient care. I'm going to make some assumptions about you: that you care about AI and that you've at least heard of sarcopenia. And if you haven't, we have plenty of review articles for you. So how can we change patient care for the better? We can lower the cost of imaging exams. We can improve access to them. We can come up with new exams or better techniques, as Michael just showed. Or we can promote better use of existing exams, and that's what I'm going to talk about. The way we can do that is really three ways, potentially. One is with opportunistic imaging. Second, with AI, and that's the subject of this session. And thirdly, perhaps, by coming up with new phenotypes that are not yet appreciated on imaging at all, and I'm just going to throw that in the bucket of radiomics.
So as you know, opportunistic CT has been widely published about, if not widely used. The idea there is that you take CTs that are obtained during conventional patient care and you try to squeeze some additional information from them, without additional cost or radiation dose to the patient. Why we started with CTs is a long story, but to summarize it here: CTs are very common, and in older adults, whom I study, they're even more common. So it's logical that CT-derived measures of sarcopenia, the age-related loss in muscle mass, could be used to improve population health, and that's what we're hoping to do.

On the opportunistic research side, we started about five years ago with this study, where we took basically a Medicare population at one center, which is ours, and the only entry criterion for the study was that you had to have an abdominal CT without contrast. What we were able to show is that you can predict one-year mortality in this group based on measuring muscle. Back then, there were really two ways to measure muscle; this was pre-AI, so this is kind of the old days, I guess. You could use segmentation software, which was at that point called Mimics and still exists, and it takes about 10 minutes for an expert reader to segment out the muscle. And then, since we have access to very eager fellows, we decided to put them to work to measure just the psoas muscle, and it takes about 10 seconds to draw a red circle around it. It turns out that the short, cheap way and the long way ended up predicting mortality about equally: those who had smaller skeletal muscles were more likely to die over the course of the year, and those who had lower muscle density, as evidenced by fat infiltration, also were more likely to die. And it didn't really matter whether you measured the entire muscle bulk or just one muscle.

So we then started to think, well, even though we have a lot of very eager fellows, we are still interested in larger and larger data sets. So is there a role for AI? It turns out that there was. We accessed the National Lung Screening Trial database, which is publicly available and contains about 16,000 CT scans of the chest. And with the help of, what should I say, fairly novice programmers, we were able to use neural networks to determine where the muscle is at T12 and measure the muscle size and muscle density, and get results that are pretty good compared to ground truth. So that was our technical validation. We immediately put it to work for what we would call clinical validation, and we measured the entire cohort of about 11,000 CT scans in the NLST. We were able to show that muscle size and muscle density predict mortality over the years of observation in that particular cohort. So this would not have been possible without AI.

Now, as interesting as that sounds, one of our interests in academics is to get some money for our research, and it was impossible even at that stage to do it just with these conventional measures of sarcopenia, muscle size and density. So we had to innovate, and the way we innovated was in terms of new phenotypes, or radiomics. Those of you who follow the radiomics field know that you can have the same region of interest in terms of size and in terms of density, but if the pixels are distributed differently, then the texture may be different.
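As a rough illustration of the measures just described, here is a minimal sketch of computing muscle cross-sectional area, mean attenuation, and a crude heterogeneity proxy from one CT slice and a binary muscle mask. This is not the speakers' software; the inputs are synthetic, the pixel spacing is an assumed value, and a real pipeline would read DICOM, select the L3 or T12 level, and segment the muscle automatically.

```python
# Minimal generic sketch (not the speakers' pipeline): conventional sarcopenia
# measures plus a toy texture proxy from an axial CT slice in Hounsfield units.
import numpy as np

rng = np.random.default_rng(0)
pixel_spacing_mm = (0.8, 0.8)        # would come from DICOM PixelSpacing in practice

# Synthetic stand-ins: a 512x512 HU image and an elliptical "muscle" mask.
hu = rng.normal(loc=-50, scale=60, size=(512, 512))
yy, xx = np.mgrid[:512, :512]
mask = ((yy - 300) / 60) ** 2 + ((xx - 256) / 120) ** 2 <= 1.0
hu[mask] = rng.normal(loc=35, scale=15, size=mask.sum())   # muscle-like attenuation

muscle_hu = hu[mask]
area_cm2 = mask.sum() * pixel_spacing_mm[0] * pixel_spacing_mm[1] / 100.0
mean_density_hu = muscle_hu.mean()   # lower values suggest fat infiltration
heterogeneity = muscle_hu.std()      # toy stand-in for radiomic texture features

print(f"muscle area: {area_cm2:.1f} cm^2")
print(f"mean attenuation: {mean_density_hu:.1f} HU")
print(f"heterogeneity (HU std): {heterogeneity:.1f}")
```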
So we applied this, again, to the NLST. I think we were the first group to apply it to skeletal muscle, and we showed that, independently of muscle size and density, more heterogeneous muscle texture was associated with mortality. Fairly excited by these results, we immediately wrote some grants, and one, an R21 that I just had funded, is going to look at radiomic features versus function in older adults in two very large epidemiologic studies, MrOS and Health ABC, and you can again see that without AI we would not be able to measure so many scans manually. So this AI really was a gigantic advance, for both the traditional phenotypes in sarcopenia and these more novel phenotypes, which I'm calling radiomics.

So that's all very interesting for us research nerds, but how do we apply this to clinical practice? One way I think we can do it is we can take this, and this is my play on words, imaging ROI and convert it to healthcare ROI, which stands for return on investment, and we can use AI technology to help us get there. The current practice is that we have the patient in the scanner, the images go to the PACS, we generate a report, and the report is mostly words. So how can we change that? We can intervene either at the scanner stage, at the PACS stage, or at a later stage, and segment the tissues automatically and create some numerical information that can then be used for prognosis. In the bone realm it can be used to prognosticate, for example, fracture risk, and in the sarcopenia field it could potentially prognosticate frailty, but there are many other applications that I don't have time to go into. So I think we get from research to clinical use really by leveraging AI for tissue segmentation, which is a very different application from what Michael was talking about, but for my money equally important.

And I think there's room for everybody in this field, right? If I can do it with a bunch of amateurs, certainly the CT vendors can automate tissue segmentation, and some of them have already done that; PACS vendors can do the same; third-party software can do the same; and we know of at least 15 or so homegrown algorithms that allow us to do this as well. So here's our homegrown pipeline that takes a chest CT, finds the T12 image location, and measures the left paraspinous muscle size and density, and, if you like, radiomic features as well. And here's our abdominal CT pipeline, which takes an abdominal CT, finds the L3 image, segments all four tissue types, VAT, SAT, IMAT, and muscle, determines their area and density, and can also determine their radiomic features.

The purpose of all this, of course, is to improve care. Most of my research is in older adults, although I'm getting a little bit more into cancer, and in older adults you can potentially envision a scenario where, if these phenotypes are automatically collected, they can be used for prehabilitation, rehabilitation, nutritional support, or pharmacology to help improve care. We have a small but mighty research team. As I said already, we have a little bit of money from the National Institute on Aging, and I thank you for listening.

Thank you for the introduction. Again, we switch the topic; this time it's about MRI-based disease detection using AI algorithms. And this talk is without cartilage or spine, since those are covered in a later talk.
So far there is a relatively small number of published studies on AI-based MRI disease detection in MSK radiology, and to the best of my knowledge there is no AI algorithm or software which has received FDA approval or a CE mark so far. There are about 25 studies published to date, starting back in 2018 and continuing until today, which evaluated algorithms for MRI disease detection. As you can see here, the largest proportion was published on the knee, and for the other joints there are really just a few studies overall. And when you look at the knee, there were two main topics: about 10 studies on the ACL and seven on the meniscus, so this was really the focus of research interest.

So the ACL was the main topic for most studies. You can see here a large table with some of the major studies. It started in 2018, and then a lot of studies were published in 2020. And as you can see here, there is a lot of heterogeneity in imaging parameters. For instance, the reference standard differed between radiological assessment and surgical correlation, and the analyzed pulse sequences were also very different: there were PD- and T2-weighted images, even some T1-weighted images, with or without fat suppression, in the sagittal or coronal plane. So there is a lot of heterogeneity there. However, when you look at the labels, so what the algorithm detected, it was for the most part tear, yes or no, so intact or torn ACL. And here in the last two columns we see the sensitivity and specificity, which are probably the most important outcome measures for these algorithms. Sensitivity ranged from about 76 to 100 percent, while specificity ranged from about 90 to 100 percent.

So let's have a closer look at the sensitivities and specificities, which you can see here on this plot. This plot has lots of different colored areas. What we want to see is that our algorithms sit here in the left upper corner, which tells us the algorithm probably has good clinical value. Here are four selected studies: one is from a group in Wisconsin, one is from Zurich, and the others are from UCSF and Stanford. And as you can see, when we take a closer look, all of these results are in the left upper corner, which would mean close to 100 percent sensitivity and specificity. So they look pretty good, in my opinion. These four studies also published their 95 percent confidence intervals, and if we look at the confidence intervals, they are all located in this left upper, white-colored area. So we can have a lot of confidence that this really is the performance we can expect on a daily clinical basis.

But how do these numbers compare to human readers? I think that's one of the major questions we are always wondering about. Three of these studies included human readers, which you can see here; it's the yellow, black, and greenish studies. You see that these very specialized MSK radiologists were almost perfect; they had an accuracy around 99 percent. And the readers from this study were significantly better than the AI algorithms. So there is still a gap between AI and specialized MSK radiologists. However, if you look at published meta-analyses, it looks totally different. These are meta-analyses which were pooled from multiple studies, and I assume this was a mix of radiologists, not only specialized MSK radiologists but possibly also general radiologists. These numbers compare pretty well to our AI algorithms.
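Since these operating points and confidence intervals carry much of the argument here, a small illustration of how they are computed may help. This is a generic sketch with made-up counts, not data from any of the cited studies; the Wilson score interval is used as one common choice for the 95% confidence interval.

```python
# Generic sketch with hypothetical counts (not data from the cited studies): compute
# sensitivity, specificity, and Wilson 95% confidence intervals from a 2x2 table.
from math import sqrt

def wilson_ci(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n == 0:
        return (0.0, 0.0)
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = (z / denom) * sqrt(p * (1 - p) / n + z**2 / (4 * n**2))
    return (center - half, center + half)

# Hypothetical confusion-matrix counts for an ACL-tear classifier.
tp, fn = 92, 8     # torn ACLs called torn / missed
tn, fp = 95, 5     # intact ACLs called intact / over-called

sens, sens_ci = tp / (tp + fn), wilson_ci(tp, tp + fn)
spec, spec_ci = tn / (tn + fp), wilson_ci(tn, tn + fp)
print(f"sensitivity {sens:.1%}  (95% CI {sens_ci[0]:.1%}-{sens_ci[1]:.1%})")
print(f"specificity {spec:.1%}  (95% CI {spec_ci[0]:.1%}-{spec_ci[1]:.1%})")
```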
So I guess it's fair to say that they are about at the same level right now. Here is an example from the AI algorithm we used back in Zurich at our place. We have true positive and true negative findings here: this mid-substance tear, and here an intact ACL, which were correctly classified by our algorithm as true positive and true negative. But you also find false classifications, obviously. Here was a mucoid degenerated ACL, which our algorithm classified as torn. And here, on the other hand, was a bucket-handle tear flipped to the central portion, and this was graded by the AI algorithm as intact. So you often find some morphological hint as to why the algorithm didn't work that well.

So again, we come to the next table; this time it's kind of similar. This one reviews the major studies for meniscus tear detection, also starting from 2018, and also here there is a lot of heterogeneity. One study, ours from Zurich, used surgical correlation, but the other studies used radiological assessment as the standard of reference, and this makes it quite difficult to compare these numbers; we have to look carefully at the performance metrics. And only these two studies, ours from Zurich and that of another Swiss research group, were able to differentiate whether the tear was located in the medial or the lateral meniscus. The others were more a general assessment of whether there is a meniscus tear somewhere in the knee. And you see, for the medial meniscus the numbers are not that bad, in my opinion: they range from 84 to 89% in both studies for sensitivity and specificity. However, if we look at the lateral meniscus, the sensitivity decreases significantly; here we have 58% and 67%, which is really low. On the other hand, the specificity was quite high, around 90%.

Again, let's have a look at these zombie plots for graphical visualization. We see here the two studies which can discriminate between a medial and a lateral meniscus tear, and these light gray boxes are the meta-analyses. You see that for the medial meniscus they are quite close together. Still, the meta-analyses are a little closer to the left upper corner, so I guess they are still a little better, but there is a large overlap of the confidence intervals, suggesting that this difference might not be that big. That changes completely for the lateral meniscus. Here we see that these two algorithms are much further away from the left upper corner, and the meta-analyses performed much better, being basically located in this white area, while the algorithms go down into this light gray area.

Again, here is the practical implementation, how we use this in Switzerland. We have a label which tells us, is there a meniscus tear, yes or no, and gives us the probability; here it was 84.3% for the lateral meniscus. And it gives us this kind of heat map, which tells us where the tear is, in case we are not able to see it ourselves. But this is still work in progress; this is not working too well right now, to be honest.

To conclude, for MRI disease detection, most research was performed on the knee joint, and for ACL tear detection, I believe, the achieved performances are about on the same level as published meta-analyses. This is different for meniscus tear detection, where the performance of AI algorithms is currently lower in comparison to meta-analyses of human readers, and there is also a big limitation right now for meniscus tear detection.
There is no exact localization of the tear describing its extent and orientation, so there is still work to do. And to the best of my knowledge, there is no FDA-approved or CE-marked algorithm out there so far. Thank you for your attention.

I'm going to talk about teaching people deep learning. Why do we need to do this? Well, I don't know if any of you were here at RSNA 2019, back in the golden days before the pandemic. This is what the AI floor looked like: lots of stuff out there that people wanted to sell you, stuff that is not cheap, and you'd like to make good buying decisions about this stuff. More so, if you use this for making diagnoses, you are responsible for the end product, even if the AI nudges you in the right direction. If you're an academic like me, your residents need to learn how to do this stuff, and maybe your faculty does too. One more thing: when my residents ask what they need to learn about AI, dude, this is going to be on the boards one of these days. So I'm going to tell you about our experience in constructing a deep learning curriculum for residents.

Now you might wonder, are we trying to turn our residents into data scientists? No, we're not. Not at all. But we have four learning objectives in this course. The first is that they should not be intimidated by deep learning. Second, we want to give them a lot of experience in probably one of the most difficult and important jobs in deep learning, and that's getting a good data set and getting it well labeled. We also want to give them lots of hands-on practice training these systems, and lots of hands-on experience troubleshooting, because every AI system falls apart initially and hopefully gets better over time.

How do we do this? I'm going to employ a metaphor here. Probably most of you have learned how to use this complex system right here, and the way you probably learned to do it was some type of driver's ed course, which had a lot of classroom material on theory and rules of the road, but also lots and lots and lots of driving practice. So how about AI courses? How do they stack up? The good news is there's a bunch of them. Some pals and I did a review of what's out there for radiologists who want to learn how to do deep learning. There's a lot of great stuff, but it's a little bit lean on actual driving practice. Many of these courses give you little or no actual time at the wheel. But they spend a lot of time on this: first, you have to build your car out of parts, and you put the parts together, maybe not these parts, but some of these parts, and these parts, and lots of these parts, lots of programming. How many people here have ever done Python coding, or any computer coding? Okay. You guys know that a single punctuation mark in the wrong place causes the whole thing to blow up. So to extend this car metaphor a little bit, we do not end up with a reliable, well-built, robust touring car. Instead, we have a clown car, where when you roll the windows down, the doors fall off and the engine explodes. Really, that's what it's like. So is that the optimal driving experience when you want to teach someone how to drive? Maybe not. And if driver's ed classes were like most AI courses, then the roads would be a lot emptier. So we want a system that's easy to use, that's robust, that's hard for them to break, and that gives them lots and lots of driving practice. We thought about this auto metaphor a little more and said, ah, we're actually talking about bumper cars. Anyone here ever drive a bumper car? Yes.
Or your kids have, and you just won't admit it. To drive a bumper car, you need a fairly minimal set of driving skills: you step on this, you go faster; you step on this, you go slower; in between, you turn this a lot. That's it, the whole experience. We consider these systems safe enough to drop our kids into, and also robust enough that our kids will not tear them apart. So is there some sort of AI equivalent to this, where crashing is okay and even encouraged? It turns out there are a number of no-code deep learning platforms out there. We looked at all of them and chose this one called Lobe.AI, which is currently run by Microsoft. It's a fairly simple system, and I had been using it for a few months when our program director came and said, hey, COVID has left several openings in our residency curriculum; could you put on a six-hour AI course next month? I'm up for a challenge, so I said, well, sure. Our awesome chief resident at the time, Patty Ojeda, and I put together what we call our bumper car curriculum for deep learning, which we ran back in June on our residents. We've since written it up, and it was accepted early this month by Academic Radiology, so if you want to read the full gory details, they will be out soon.

For every class, we had pre-class reading where they could learn some of the terms and read up on their own about AI. We ran this in small groups across this big room here, two or three residents spaced however made them feel comfortable, and they all had their own laptops with them. Over the whole six-hour course, maybe a total of one hour was devoted to didactics, just little just-in-time nuggets of here's what you need to know about balancing a data set, here's what you need to know about metrics. But five hours were at the wheel driving the AI unit, and among all the different mini-tasks in AI, we wanted to focus on image classification.

Here's what class one looked like. They used the webcams on their laptops to create their own data sets, so very shortly they had some medium-sized data sets of people with and without masks, people with and without glasses, people making different American Sign Language gestures. Here are our residents training an AI to recognize the peace sign versus the Vulcan live-long-and-prosper sign. One resident in a pair had a lot of kids, the other had a lot of corgis, and they successfully trained their AI to differentiate those. Yay. Another pair tackled a difficult ongoing problem, humans versus cats. They had a lot of fun with that and got a lot of experience gathering data, labeling it, and driving the AI.

In class two, we turned to medical images and brought in a lot of images from our local PACS and off the web. I threw in 2,000 images from an ACL study I did a few years ago, and here's how it did: training on these 2,000 images, it was accurate about 86% of the time at telling torn from normal. It had absolutely no trouble telling that this was a normal ACL and that this was torn. We pulled in a subset of the massive RSNA pneumonia challenge, so we got to train it on pneumonia versus normal chest. Stanford has a huge database of MSK radiographs; we pulled in a subset of those to train it to recognize right versus left humerus. And that was a lot of fun.

For the third class, we decided to give them a little more experience with that important thing, gathering a well-labeled data set. So we crowdsourced a collection of images.
We thought about a lot of things and came up with enteric tubes, which are numerous in our system; we had about 500 cases. And the nice thing about enteric tubes is that you can analyze them endlessly: for example, tube versus no tube, weighted versus non-weighted, single versus multiple, location. Our residents ran through all these different iterations. Here's just an example of what Lobe did. It did pretty well: about 80% of the time it could tell tube versus no tube. This is a system that they did not have to tweak at all; they never had to look under the hood. They dumped in the images and it trained itself.

We surveyed the residents: so, what did you think of the course? Most of them thought, you know, AI is an important thing, we should learn this, although relatively few of them were that personally interested in it. But you know, I think once the items start appearing on the American Board of Radiology exam, they're going to get a lot more interested. Many of them thought they had learned helpful stuff in the class; that's good. The thing they liked the most: hey, being together for the first time in about 18 months. They loved the hands-on work, they loved the active learning, and from my point of view, they gained an intuitive feel for what goes on in a deep learning process.

So, how about our original learning objectives? They were definitely not intimidated, not at all. I thought we'd have to devote a whole class period to just getting the machine up and running, but they were ready to go in five minutes. They got lots and lots of experience gathering and labeling data sets, which I'm really delighted about. They did a lot of training, and they had a lot of crashes. Their systems would come up with preposterous diagnoses, and they would think, okay, maybe I need more data, or maybe I need to cut out all that clutter in the back of the image. And so, after troubleshooting their systems, they refined them. I was pleased to see that.

So, in conclusion, if you look out in the science fiction literature, there are an endless number of stories about AIs dominating the human race. Personally, I'm not so concerned that AI doctors or any of these other clinical systems are going to kill us or enslave us. But they're going to make goofy diagnoses; they're going to have false positives and false negatives from time to time. And I'm pretty confident our residents have a pretty good intuitive feel for, okay, here's what may be going wrong, here's how we can test that, and here's what we might be able to do to fix it. Thank you.

Good morning. For the next 10 minutes or so, I would like to review the literature on deep learning applications in osteoarthritis imaging. Two previous studies have described deep learning approaches for detecting cartilage lesions of the knee joint on MRI. Both studies used similar deep learning methodology consisting of coupled convolutional neural networks. The first, a cartilage segmentation convolutional neural network, segments articular cartilage and bone on the MR images, and these segmented MR images are then used as input into a cartilage classification CNN in order to determine the presence or absence of cartilage lesions.
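As a rough, hedged illustration of that coupled design, the sketch below wires a toy segmentation network into a toy classification network. It is not the architecture used in the cited studies; PyTorch, the layer sizes, and the input shapes are arbitrary assumptions chosen only to show how the two stages connect.

```python
# Toy sketch of a coupled segmentation + classification design (not the cited
# studies' actual architecture): a small segmentation CNN produces cartilage/bone
# maps, and the image concatenated with those maps feeds a classification CNN that
# outputs the probability of a cartilage lesion.
import torch
import torch.nn as nn

class ToySegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 2, 1),             # 2 channels: cartilage, bone
        )
    def forward(self, x):
        return torch.sigmoid(self.net(x))    # soft segmentation maps

class ToyLesionClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),   # image + 2 segmentation maps
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(16, 1)
    def forward(self, x):
        f = self.features(x).flatten(1)
        return torch.sigmoid(self.head(f))    # P(lesion present)

segmenter, classifier = ToySegmenter(), ToyLesionClassifier()
mr_slice = torch.rand(1, 1, 128, 128)         # stand-in for one MR image patch
seg_maps = segmenter(mr_slice)
lesion_prob = classifier(torch.cat([mr_slice, seg_maps], dim=1))
print(f"predicted lesion probability: {lesion_prob.item():.2f}")
```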
In a study performed by Liu and associates in Radiology in 2018, the reference standard for the deep learning cartilage lesion detection system was the interpretation of a musculoskeletal radiologist for the presence or absence of cartilage lesions on a total of 100 image patches in different regions of the femur and tibia on all MR image slices, while in the study performed by Pedoia and associates in the Journal of Magnetic Resonance Imaging in 2019, the reference standard was the interpretation of a musculoskeletal radiologist using the WORMS score for patellar cartilage. For the tibiofemoral joint, the cartilage lesion detection system had an area under the curve of 0.92 and a sensitivity and specificity of 84% and 85%, respectively. For the patellofemoral joint, the cartilage lesion detection system had an area under the curve of 0.88 and a sensitivity and specificity of 80%.

Multiple studies have described deep learning approaches for segmenting articular cartilage and bone on MRI. These studies have used a wide variety of two-dimensional and three-dimensional convolutional neural networks and three-dimensional generative adversarial networks to segment the articular cartilage of the knee joint on a wide variety of MRI pulse sequences, and have reported Dice coefficients for cartilage segmentation accuracy, compared to the reference standard of manual segmentation, ranging between 80.1% and 88.4%. Deep learning approaches have reported the highest cartilage segmentation accuracies in the literature. In addition, in comparison to other approaches for fully automated segmentation of articular cartilage on MRI, such as atlas-based and model-based approaches, deep learning approaches provide time-efficient cartilage segmentation, on the order of minutes for segmenting the articular cartilage of the entire knee joint on all MR image slices, with minimal computational resources.

Multiple studies have also described deep learning approaches for osteoarthritis detection on x-rays, and all of these studies have used the Kellgren-Lawrence system to grade the presence and severity of knee and hip osteoarthritis. These studies have shown that deep learning methods have classification accuracies ranging between 59.6% and 74.8% for assigning Kellgren-Lawrence grades for knee osteoarthritis on radiographs, and accuracies ranging between 65.7% and 72.4% for assigning Kellgren-Lawrence grades for hip osteoarthritis on radiographs. Now, these accuracies of the deep learning approaches for assigning Kellgren-Lawrence grades are almost identical to the inter-reader agreement of experienced musculoskeletal radiologists for assigning Kellgren-Lawrence grades using the same knee and hip image data sets. Studies have also described deep learning approaches for assigning grades for individual radiographic features of knee osteoarthritis, such as osteophytes, joint space narrowing, subchondral sclerosis, and subchondral cysts. There is now commercially available deep learning software from ImageBiopsy Lab that can provide not only a Kellgren-Lawrence grade, but a grade for each individual radiographic feature of knee osteoarthritis on radiographs, and studies have shown that use of this deep learning software can increase reader consistency for grading radiographic features of knee osteoarthritis.
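The Dice coefficient quoted above as the segmentation accuracy metric is straightforward to compute; here is a minimal, generic sketch on synthetic masks, not code from any of the cited studies.

```python
# Minimal generic sketch (not from the cited studies): Dice similarity coefficient
# between an automated segmentation mask and a manual reference mask.
import numpy as np

def dice(pred: np.ndarray, ref: np.ndarray) -> float:
    """Dice = 2|A ∩ B| / (|A| + |B|) for binary masks."""
    pred, ref = pred.astype(bool), ref.astype(bool)
    denom = pred.sum() + ref.sum()
    return 1.0 if denom == 0 else 2.0 * np.logical_and(pred, ref).sum() / denom

# Synthetic example: two overlapping "cartilage" masks on a 2D slice.
ref = np.zeros((64, 64), dtype=bool)
ref[20:40, 20:40] = True
pred = np.zeros((64, 64), dtype=bool)
pred[24:44, 22:42] = True
print(f"Dice coefficient: {dice(pred, ref):.3f}")   # ~0.7 for this toy overlap
```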
Multiple studies have also described deep learning approaches for osteoarthritis risk assessment. Deep learning is ideal for creating osteoarthritis risk assessment models, as it provides a rapid and fully automated method to extract useful prognostic information from imaging studies. Models can be created using deep learning feature analysis of baseline imaging studies, such as x-rays and MRI. Traditional clinical risk factors, such as age, gender, body mass index, ethnicity, and history of knee trauma and surgery, can be added to these deep learning models to create combined models providing a final probability score for osteoarthritis progression. The creation of deep learning-based osteoarthritis risk assessment models has been made possible by data from large longitudinal osteoarthritis research studies, such as the Osteoarthritis Initiative and the Multicenter Osteoarthritis Study, which provide clinical information, bilateral knee x-rays, and unilateral knee MRI on large numbers of subjects with, or at risk for, knee osteoarthritis at baseline and various follow-up time points. Multiple deep learning models for osteoarthritis risk assessment, using various outcome measures of knee osteoarthritis progression, have been created using the data from these longitudinal osteoarthritis research studies, including models for predicting structural progression on radiographs, pain progression, and progression to total knee replacement.

This slide summarizes the diagnostic performance of deep learning models for predicting knee osteoarthritis structural progression. Model performance ranged from an area under the curve of 0.70, using baseline intermediate-weighted two-dimensional fast spin-echo MRI, to an area under the curve of 0.86, analyzing a combination of knee x-rays and clinical risk factors. This slide summarizes the diagnostic performance of deep learning osteoarthritis risk assessment models for predicting knee osteoarthritis pain progression. Model performance ranged between an area under the curve of 0.628, analyzing baseline three-dimensional DESS MRI, and an area under the curve of 0.809, analyzing baseline combinations of x-rays and clinical risk factors. This slide summarizes the diagnostic performance of deep learning-based osteoarthritis risk assessment models used to predict progression to total knee replacement. Model performance ranged between an area under the curve of 0.834, analyzing a combination of baseline three-dimensional DESS MRI and clinical risk factors, and an area under the curve of 0.89, analyzing a combination of baseline x-rays and clinical risk factors.

This slide summarizes the diagnostic performance of these top-performing osteoarthritis risk assessment models. First of all, note that the outcome best predicted by deep learning is progression to total knee replacement, as opposed to structural progression and pain progression. However, for all these top-performing models, notice that the diagnostic performance, as assessed using area under the curve, sensitivity, and specificity, is very promising, but by no means high enough to incorporate these models currently into clinical practice, osteoarthritis research studies, or clinical drug trials. Thus, additional future efforts are needed to improve the diagnostic performance of deep learning-based osteoarthritis risk assessment models. I thank you for your attention.

So I'll be closing out this session talking about some considerations on AI implementation.
So the learning objectives for the few minutes I have are to review a few tips for judging AI solutions, to understand factors that promote algorithm longevity in your practice, and to understand the limitations of the available options for AI model monitoring. Of course, implementing AI relies on many different factors, but probably most heavily on the data and the people engaged in the process, as well as the development and continual improvement of infrastructure and strategy. But after AI implementation, there is a need to regularly reevaluate and optimize these factors to allow that AI solution to continually learn and evolve with the natural changes or variations in your local patient environment.

Now, you've heard about some currently available MSK AI solutions that have earned FDA clearance; some of them were mentioned earlier. There is a current list of 11 that are related to MSK shown on the ACR Data Science Institute website today, but several more are under the FDA review process, and of course there are also some open-access algorithms available. So how do you, as a radiologist, know which MSK AI solution to rally for and implement in your local practice? Fortunately, some guidelines have been published over the last couple of years that provide tips on how to keenly evaluate current AI offerings and help you determine which ones might work best for you and your practice. In the guidelines published by Omoumi et al., there is a list of the top 10 questions to consider. Of these 10 very important questions, probably the answers to these four are the most relevant for us as everyday MSK radiologists trying to get through our worklists and take care of our patients in the safest way possible. But of course, in regard to the other six questions, it's important for us to also judge what value any particular solution will bring to our practice and what the cost will be; we need to understand whether there will be an adequate return on investment. So, for example, if you're at an extremely busy practice that's receiving hundreds, if not thousands, of ER radiographs every single day that you have to interpret, and your turnaround time to the ER is not as quick as you would like, maybe a fracture detection algorithm makes a lot of sense for your institution or your practice. However, if you're at an institution where your turnaround time to the ER is efficient and you and your colleagues aren't missing many clinically relevant fractures, maybe that algorithm doesn't make as much sense for you.

The next thing to consider is the source of the data used for training, validation, and testing of the algorithm. Did it come from a single institution, which of course has a higher risk for demographic bias, or was the sample drawn more broadly, from, say, international populations? It's also important to consider the method of validation and whether it was robust, and how the ground truth was defined. It's very important for us to know that AI algorithms can be brittle, and that they have a higher likelihood of failing when they're implemented in environments that are different from the ones they were trained on. The next thing to assess is the real-world performance of these models. Maybe you have some colleagues at other institutions who have already incorporated a certain model into their practice.
It's important to reach out to them to get their feedback on how that model has been performing, and also to understand how those AI companies are maintaining the model for them and how they're handling any malfunctions or erroneous results. And of course, you also need to evaluate the model on your own patient population before you purchase it, to understand what the false positive and false negative results may be for your patients. Now, if you're at a smaller practice without a lot of AI data resources, you can reach out to the ACR AI-LAB; they have some applications you can use to test certain algorithms on your local institutional data. In addition, the RSNA will be offering an on-demand course in the very near future that allows radiologists to test and use AI algorithms.

So let's suppose that a certain MSK solution has passed your very well-informed and rigorous judgment and you're ready to implement it. How do you ensure or promote the longevity of that algorithm in your practice? There are, again, many factors, but some of the most important ones are listed here on this climb up the mountain. Certainly the ultimate goal should always be kept in mind, and that is that the AI algorithm needs to be able to continuously learn and adapt to the natural drifts or variations that occur in local patient environments. In fact, the FDA has recently provided a flow chart that emphasizes the need to continuously monitor the real-world performance of these MSK, or really all, AI models, and also to ensure that they're incorporating new live data, and that the impact of that new live data is being logged, tracked, and used to retrain the model for constant fine-tuning. In essence, to promote the longevity of an AI algorithm, it needs to continuously learn. There needs to be transparency, generalizability, and constant monitoring, as well as strong vigilance in the ethical domain with patient stewardship. Lack of any of these basic elements will increase the likelihood of a model degrading over time.

Unfortunately, at least at the current time, to my knowledge, the commercially available AI algorithms out there do not allow modification by the end user. Furthermore, there are limitations on our ability to even monitor model performance over time at our local institutions, largely because the FDA does not yet have regulations in place that allow for post-market surveillance. However, the FDA and the ACR are working on mitigating these limitations. In fact, the ACR is currently developing the Assess-AI data registry, with the aim of collecting metadata from algorithms at multiple institutions, capturing this information in the background for each individual case, along with whether the radiologist agreed or disagreed. The goal is that they can provide an output to both the algorithm developer and the individual institutions on that model's performance over time, and pinpoint fairly early whether there is any degradation in the model, to maybe stop it from degrading further. Now, if you go to the Assess-AI data registry website today, you can contact them, if you are starting to use an MSK algorithm in your practice, to get a performance report. In addition, the Data Science Institute of the ACR is currently working closely with the FDA to align Assess-AI with the FDA review process.
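As a toy illustration of that kind of longitudinal monitoring, the sketch below tracks the radiologist-AI agreement rate in a rolling window and flags a drop. It is a generic, hypothetical example with simulated data, not how the Assess-AI registry is actually implemented, and the window size and threshold are arbitrary assumptions.

```python
# Toy, hypothetical illustration of longitudinal model monitoring (not Assess-AI):
# track radiologist-AI agreement per case in a rolling window and flag when
# agreement drops below a threshold, suggesting possible model drift.
import numpy as np

rng = np.random.default_rng(1)

# Simulated case-by-case agreement (1 = radiologist agreed with the AI output).
# The model "drifts" after case 600, e.g., after a scanner or protocol change.
agreement = np.concatenate([
    rng.random(600) < 0.95,     # baseline: ~95% agreement
    rng.random(400) < 0.80,     # after drift: ~80% agreement
]).astype(float)

window, threshold = 100, 0.90
for end in range(window, len(agreement) + 1, 50):
    rate = agreement[end - window:end].mean()
    flag = "  <-- review model performance" if rate < threshold else ""
    print(f"cases {end - window:4d}-{end:4d}: agreement {rate:.1%}{flag}")
```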
Now, today this is still very much in the demonstration phase, because the results of longitudinal monitoring have been provided by only one institution so far and pushed to the FDA. Now, in the broader AI industry, there are companies out there marketing that they can explain the AI, that they can make the black box transparent so that you understand the why of an AI model's decisions. They are also marketing bias detection software and model risk management software. In my brief perusal of these websites, I couldn't find anywhere where they talk specifically about the radiology space; however, that may change in the very near future.

So in summary, today there are at least 11 FDA-cleared MSK solutions. I checked yesterday, still just 11, but there are obviously many more in the works undergoing the review process, and there are open-access AI solutions continually being offered. It's important for these solutions to undergo rigorous judgment before purchase, and the MSK radiologist really needs to be integral in that decision-making process. If you're at a smaller practice with limited resources, the ACR AI-LAB could be helpful for you. To help ensure adequate longevity of what you purchase, the AI model needs to be able to continually learn and adapt to changes or variations in the local patient environment. Unfortunately, our ability to monitor model performance at this time is limited, but the FDA and the ACR are working together to help mitigate this limitation, and maybe in the near future commercial companies will step into this space in radiology as well. I'd like to sincerely thank the RSNA for inviting me to speak to you all today, and to thank you for your kind attention.
Video Summary
The presentation discusses the importance of, and advances in, accelerating musculoskeletal MRI (MSK MRI) using AI-driven deep learning techniques. Accelerating MRI is crucial for enhancing patient comfort, particularly in pediatrics, improving image quality by reducing motion artifacts, and increasing MR imaging accessibility in low- and middle-income countries (LMICs). Traditional methods like parallel imaging and compressed sensing face limitations such as reduced image quality at higher speeds. Deep learning can potentially resolve the speed-versus-quality trade-off, using techniques like k-space undersampling to allow faster acquisition while maintaining or enhancing image quality. Studies have demonstrated that deep learning can significantly boost image quality in MSK imaging, making accelerated images interchangeable with conventionally sampled images, and prospective studies have shown its effectiveness in clinical settings with preserved diagnostic accuracy and reduced noise. The session also covers other deep learning applications in radiology, including opportunistic CT assessment of sarcopenia, MRI-based disease detection, osteoarthritis imaging, and teaching AI to radiology residents. Although AI in MRI disease detection, especially for the knee, shows promise, the need for continuous development and monitoring to ensure adaptability and effectiveness in diverse clinical settings is emphasized.
Keywords
musculoskeletal MRI
AI-driven deep learning
accelerating MRI
image quality
motion artifacts
LMICs
undersampling
diagnostic accuracy
radiology applications