Data Curation for AI with Proper Medical Imaging Physics Context
R7-CPH11-2023
Video Transcription
Welcome to this session titled Data Curation for AI with Proper Medical Imaging Physics Context. If you have stayed through the entire meeting to this point, you have probably listened to a lot of AI presentations and seen a lot of AI products. This session is focused specifically on medical physics and what medical physics can do for imaging AI development. It is our view that medical physics makes unique contributions to AI development and deployment in clinical settings for numerous reasons, including but not limited to those shown here. The scientific and technical foundations underlying any AI product involve a great deal of math and physics, which plays to the strengths of anyone with a medical physics background. Quality and safety are natural responsibilities of medical physicists, especially those working in clinical healthcare settings. And the field of AI is filled with innovation; innovation is also an area where medical physics has made substantial contributions to radiology from day one. With this session, we hope to share some of the experience we have gained developing imaging AI at our respective institutions, cover some fundamentals of the process, and give you ideas to bring back to your institution so you can develop your own applications.
This session has three presentations, because imaging AI development can be broadly divided into three phases or modules: data preparation and curation, AI model development, and AI model deployment. We have three speakers, each focusing on one of these modules. I will lead off with a presentation on preparing high-quality data for AI, focusing on the data preparation phase of a project. My name is Jiwa Qi, from Henry Ford Health. The second speaker is Dr. Ran Zhang from the University of Wisconsin. His presentation, titled Data Quality and Generalizability in AI, addresses some important questions in the model development phase. Last but not least is Dr. John Garrett, also from the University of Wisconsin. His presentation, Clinical Integration of AI Models, gives a comprehensive overview of how to deploy a well-developed model in a clinical setting with the goal of making clinical impact and integrating into routine use. We hope you will enjoy them.
My presentation is on preparing high-quality clinical data for AI: the essentials for medical physicists. I will follow a simple and straightforward outline of what, where, and how: first, define what data you want to acquire; then, where you can find those data in the clinical setting; and finally, how to develop a robust approach so that you can indeed prepare high-quality data for subsequent AI model development. To illustrate the process and demonstrate some of the concepts and approaches involved, I will use an example project we undertook in recent years. The objective of the project is to develop a chest X-ray-based COVID-19 diagnosis tool using AI: a model that can accurately and quickly diagnose COVID-19 from chest X-ray images.
This was a highly relevant topic during the COVID pandemic. In the thick of the pandemic, although the gold standard for diagnosis is reverse transcription PCR (RT-PCR) testing, medical resources were heavily stretched to cover the population. The idea here is that we already have a large existing infrastructure of chest X-ray imaging hardware and personnel within our healthcare systems. If we can use chest X-rays to make the diagnosis with accuracy and speed, we provide another weapon in the hands of healthcare workers in the fight against the COVID-19 pandemic, and in preparedness for future occurrences. Obviously, we needed to collect a large dataset for model development. In planning the project, we were also collaborating with outside institutions, so considerations had to be made at the preparation stage to allow that to happen. And at the end of the project, we were interested in contributing our data to larger data commons for the greater good of society and the community, for example initiatives you have probably heard about such as MIDRC, which curates large datasets for research activities like ours.
Let's start with the what question: what kind of data do we want to collect? There are several considerations as you plan out the project. There will certainly be image data to acquire, but at the same time there will be non-image data you need to find and acquire too. Do you need a control group, and how do you curate the control group data? What labels do you need for your research project? We all want our project, in the end, to be generalizable, so at the planning stage you probably want to collect ancillary data or variables that you can monitor as the project develops, to make sure you have a heterogeneous collection of data; that goes a long way toward the generalizability of the model later on.
For the example project, we certainly needed to collect a large number of chest X-ray images, and we wanted a gold standard to indicate the diagnosis of the patient, which is the PCR test result. Depending on your research project, you may need more: for ours, we were interested in correlating the findings with the diagnostic findings in the radiologist reports, so we included the radiologist reports as part of the data we set out to acquire. We did need a control group; for our project we decided to collect chest X-ray data from the pre-pandemic period as a clean, normal dataset on which we could later train our model. Labels came not only from the PCR test results; depending on our specific research aims, we also considered key diagnostic findings from the radiologist reports as a source of labels. As for the variables, that really depends on the goals of your specific research project. For us, one example variable we looked at is how the model performance varies with the chest X-ray device vendor and model, because we know there are a large number of device vendors and models out there; they use different techniques, they may use different post-processing algorithms, and they may generate their final images in different ways. We certainly would like our models to be independent of those variables.
So it is important to acquire those data early on, monitor the performance, and make sure we can steer the model toward something more robust against such variables. Obviously, this is highly dependent on the specific goals of your research project.
Next, now that we are clear about what data we want to acquire, the question is where to find them. Here I would start with basic radiology informatics 101. Within radiology and the larger healthcare system, there are various entities that hold the data of interest. The hospital information system holds much of the patient-related data outside of imaging; this is commonly known as the electronic medical record (EMR) system. Radiology has its own information system, often known as the RIS, or radiology information system. The modality devices that acquire patient images are the source of the imaging data. Modern radiology departments run PACS servers, which are the central repository of patient image data and everything related to it. And there are distributed workstations for post-processing and various other clinical tasks that can also be a source of data for your research.
For the data to flow across these various entities, they follow standards. Broadly speaking, imaging data follows the well-known DICOM standard, whereas non-image data largely follows the HL7 standard. All of these data-holding entities store their data in some form of database technology, so it helps to know at least a little about databases; most of them are relational databases. That may not matter much if you are only collecting a small amount of data, but if you are collecting a large collection, it helps to know the behavior of those databases so you can collect your data more efficiently and more automatically. In terms of data storage, there are broadly two categories: physical media and cloud storage. You trade off among size, speed, redundancy, data security, and cost; those are the factors to weigh in deciding how to store your data.
For the example project, here is where we got each kind of data. The chest X-ray images, intuitively, come from the PACS. The PCR test results came from the EMR system; during the COVID pandemic it was common practice for EMR systems to have a COVID dashboard or report that you could tap into to extract patient data. The diagnostic findings from the radiologist reports typically reside within the radiology information system. For storage, while collecting within the local health system we used physical storage media, and later on, when we collaborated with external institutions, we used secure cloud storage for remote sharing.
So how do we prepare the data? Remember that we are performing research on human subjects, so a prerequisite is to pass ethical review and be approved by your local IRB. And if data will flow outside your institution, you should plan early for a data sharing agreement if it is applicable to your project.
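As a concrete illustration of the DICOM side of this data gathering, for example locating the chest X-ray studies held in the PACS as described above, a study-level C-FIND query could look roughly like the sketch below, built on the pydicom and pynetdicom libraries. The PACS hostname, port, AE titles, and matching criteria are placeholders, not details from the project.

```python
# Minimal sketch of a study-level DICOM C-FIND query against a PACS.
from pydicom.dataset import Dataset
from pynetdicom import AE
from pynetdicom.sop_class import StudyRootQueryRetrieveInformationModelFind

ae = AE(ae_title="RESEARCH_SCU")
ae.add_requested_context(StudyRootQueryRetrieveInformationModelFind)

# Query identifier: chest radiograph studies in a date range.
query = Dataset()
query.QueryRetrieveLevel = "STUDY"
query.ModalitiesInStudy = "CR"             # or "DX", depending on the site
query.StudyDate = "20200301-20200630"      # DICOM date-range matching
query.StudyDescription = "*CHEST*"         # wildcard matching
query.AccessionNumber = ""                 # empty keys: ask the PACS to return them
query.StudyInstanceUID = ""

assoc = ae.associate("pacs.example.org", 104, ae_title="PACS_SCP")
if assoc.is_established:
    for status, identifier in assoc.send_c_find(
        query, StudyRootQueryRetrieveInformationModelFind
    ):
        # 0xFF00 / 0xFF01 are "pending" responses that carry a match
        if status and status.Status in (0xFF00, 0xFF01):
            print(identifier.get("AccessionNumber", ""),
                  identifier.get("StudyInstanceUID", ""))
    assoc.release()
```

The matched study UIDs can then feed a retrieve step (C-MOVE or C-GET) or a CTP pipeline like the one described below.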
AI model development requires a large amount of data, and that certainly applies to our case. When you are dealing with a large amount of data, efficiency matters, so ideally you want to develop an automated process that requires as little manual intervention as possible. A typical workflow for collecting and preparing the data is shown here: you start by querying your database; once you have filtered out the datasets of interest, you retrieve them to a local computer, run your algorithm to de-identify the data, and do further cleaning before storing them securely; later on, you can share them using cloud storage.
There can never be too many tools to help with this process, so here are some of the common tools we have found helpful. You certainly need some good DICOM toolkits and libraries. If you are dealing with reports or other textual, non-imaging data, report-mining tools are very valuable. I also want to point out that RSNA has built some very useful research tools in the past; one we find particularly valuable is RSNA CTP, which I will explain shortly. And scripting is where you put together scripts to hook up the different components of your workflow and make things run smoothly and seamlessly. The DICOM standard is well known, and there are a large number of open-source resources and tools out there to help with the process, so I will not spend much time on it. When it comes to report mining, we can take advantage of commercially available, clinically used software to help extract useful data from, for example, diagnostic radiology reports. These packages typically implement natural language processing, in basic or more advanced ways, and once you define your search criteria they can help extract the studies of interest. In our case we used one such product, but I am by no means endorsing any particular product here.
RSNA CTP, or by its full name the Clinical Trial Processor, is something RSNA developed a while ago to help with research, especially in the form of clinical trials. It has several nice features. It has a very flexible configuration: you can configure it to share data across the Internet between different institutions, or to run within your local network and collect data from, for example, modality devices to whatever destination computer you want the data on. It features a simple pipeline-based design: as shown in the middle here, you define your individual steps and the system runs them sequentially. For example, you start by ingesting the data from your modality devices; the pixel anonymizer cleans the images of burnt-in annotations or PHI; you can map the patient MRN to your own research ID if you find it necessary; and you can do further anonymization before exporting the data to your destination. It is quite easy to use, it is managed through a simple web interface, and it is lightweight and platform-independent, among other benefits.
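The de-identification and ID-mapping stages just described can also be prototyped in a few lines of Python with pydicom. The sketch below is a simplified illustration only, not a substitute for CTP or for a full DICOM de-identification profile, and the folder paths, tag list, and MRN-to-research-ID scheme are hypothetical.

```python
# Simplified de-identification step: strip common PHI tags and map MRN -> research ID.
from pathlib import Path
import hashlib
import pydicom

def research_id(mrn: str, salt: str = "project-specific-salt") -> str:
    """Map an MRN to a stable, non-reversible research ID."""
    return "RID" + hashlib.sha256((salt + mrn).encode()).hexdigest()[:10]

PHI_TAGS = [
    "PatientName", "PatientBirthDate", "PatientAddress",
    "OtherPatientIDs", "ReferringPhysicianName", "InstitutionName",
]

def deidentify(src: Path, dst: Path) -> None:
    ds = pydicom.dcmread(src)
    rid = research_id(str(ds.PatientID))
    ds.PatientID = rid
    ds.PatientName = rid
    for tag in PHI_TAGS:
        if tag in ds:
            delattr(ds, tag)            # or blank the value, per your protocol
    ds.remove_private_tags()            # private tags often carry identifiers
    ds.save_as(dst)

out_dir = Path("deidentified")
out_dir.mkdir(exist_ok=True)
for f in Path("retrieved_dicom").glob("*.dcm"):
    deidentify(f, out_dir / f.name)
```

Note that burnt-in annotations in the pixel data still need a pixel-level step such as the CTP pixel anonymizer mentioned above; header cleaning alone is not sufficient.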
There are infinite possibilities for developing your own scripts; there are many choices out there, and they help you connect your functional modules and streamline your workflow. For our example project, we certainly went through IRB approval. Due to the retrospective nature of the project, approval can come quickly through an expedited review process, and informed consent is typically waived for such retrospective work. Because our project involved sharing between institutions, a data sharing agreement was also entered. For the individual steps of our workflow, here are the tools we used. For EMR searching, we used the reporting tools integrated into our EMR software (we use Epic at our institution). For the PACS query, we used a small amount of Python scripting on top of standard DICOM libraries. We queried the report database using the report-mining software mentioned earlier. For retrieving images from the PACS, we used a combination of RSNA CTP and a small amount of scripting. And for contributing data to, in our case, MIDRC, there are various supported pathways, including CTP and other publicly available tools such as those from the ACR.
In summary, high-quality data is vitally important for AI development. Medical physicists bring unique value to this process and help ensure the quality of the data being collected and curated, and the data preparation process can be guided by the what, where, how rationale discussed in this presentation. With that, thank you for your attention, and I welcome Dr. Ran Zhang to the podium for his presentation titled Data Quality and Generalizability of AI Models.
Thank you for the introduction. I will continue the discussion of why high-quality data is important for AI, specifically how it relates to the generalizability of the AI model. I will use the COVID classification problem as the example, which is a standard binary classification task. Each input chest X-ray is assigned a score between zero and one, where zero indicates non-COVID and one indicates COVID. We can set a threshold: if the score is above the threshold, the chest X-ray is interpreted as COVID, otherwise as non-COVID. To train such a network, we need to prepare a dataset of chest X-rays labeled as COVID positive and COVID negative, and the training process optimizes the weights of the network to minimize the cross-entropy loss, a common objective function for classification problems. Conceptually, there is no difference from classifying cats and dogs, one of the most popular deep learning projects for beginners: you search the Internet, download a lot of images of cats and dogs, and from there the rest of the process is quite straightforward. You can certainly curate data this way for COVID classification too. Especially in the early pandemic, in early 2020, to obtain COVID-positive cases people looked to all kinds of sources: online publications, educational websites, and datasets released by different institutions. For COVID-negative cases, a common approach back then was to use pre-pandemic datasets, such as those from Grand Challenges.
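To make the classification setup described above concrete (a network producing a score between zero and one, a threshold, and cross-entropy training), here is a minimal, generic PyTorch sketch. The random tensors stand in for a labeled chest X-ray dataset, and this is a schematic illustration rather than the actual model from the talk.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, TensorDataset
from torchvision import models

# Backbone with a single-logit head; sigmoid(logit) is the 0-1 COVID score.
model = models.densenet121(weights=None)            # pretrained weights optional
model.classifier = nn.Linear(model.classifier.in_features, 1)

criterion = nn.BCEWithLogitsLoss()                  # binary cross-entropy on the logit
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

# Random stand-in data; in practice this is the curated, labeled CXR dataset.
images = torch.randn(64, 3, 224, 224)
labels = torch.randint(0, 2, (64,)).float()         # 1 = COVID+, 0 = COVID-
loader = DataLoader(TensorDataset(images, labels), batch_size=16, shuffle=True)

model.train()
for epoch in range(2):
    for x, y in loader:
        optimizer.zero_grad()
        loss = criterion(model(x).squeeze(1), y)
        loss.backward()
        optimizer.step()

# Inference: threshold the sigmoid score to get the binary COVID call.
model.eval()
with torch.no_grad():
    scores = torch.sigmoid(model(images[:8]).squeeze(1))
print((scores > 0.5).tolist())
```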
However, a crucial point: if you curate data from such mixed public sources, your model may perform exceptionally well on your holdout test set, but when tested on real-world clinical or external datasets, performance may drop significantly, potentially to no better than random guessing. This discrepancy often arises from shortcuts, or biases, in the training dataset. In a paper published in Nature Machine Intelligence, the authors trained a model using public COVID X-ray datasets and found that its performance on the internal test set was almost perfect, but on the external test set its performance dropped significantly, almost to the level of random guessing. Analysis of the model using saliency maps revealed that it was actually using markers on the shoulder to make its COVID predictions. Recently, we found that it is not just markers: intrinsic image attributes such as image contrast or image sharpness can also become shortcuts.
We know there are many factors that affect the quality of a chest X-ray. The focal spot of the X-ray tube can affect sharpness. The kVp selected by the technologist can affect contrast. The use of different detectors and films affects both contrast and sharpness. Vendor-specific post-processing methods can also adjust image contrast and sharpness. And finally, even at the data storage stage, the compression algorithm can have an impact on these image quality attributes. So if, during data curation, most of the COVID-positive cases are collected from one source while the negative cases come from another, these image quality attributes may become associated with the COVID labels, and that can cause shortcut learning.
We developed a framework to identify these shortcuts. For example, we have a high-quality dataset curated at Henry Ford, and a model trained on this dataset generalizes quite well to external COVID test sets. We then perturb this dataset: we perturb the image contrast or sharpness of all the positive cases, or of all the negative cases. By doing so, we create two pathways for the model to learn: it can learn the COVID features, but it can also learn to differentiate the classes by their contrast or sharpness difference. Here are some examples of the perturbed images: on the left is the original, and the middle and right images are the sharpness- and contrast-perturbed versions. They are almost imperceptible to the human eye; the difference is really negligible. But when we train a model on this dataset, we see that with no shortcut, the COVID classification AUC on the external dataset is 0.82, whereas when a sharpness shortcut is added to the training dataset, the external performance drops to 0.53, and similarly for the other cases. That means the model will choose to learn the shortcuts instead of the true COVID features. We also developed models, so-called shortcut detectives, to detect shortcuts in a dataset without even training your own model. To do that, we take normal chest X-rays, apply the sharpness or contrast perturbation to half of them, assign those a label of one, and train a model on this dataset.
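The perturbations themselves can be as simple as a slight global contrast or sharpness adjustment applied to one class only. The sketch below, using Pillow's ImageEnhance module, is illustrative; the enhancement factors and folder names are assumptions, not the values used in the study.

```python
# Plant a deliberate "shortcut": subtly perturb only one class of images.
from pathlib import Path
from PIL import Image, ImageEnhance

def perturb(img: Image.Image, contrast: float = 1.05, sharpness: float = 1.10) -> Image.Image:
    img = ImageEnhance.Contrast(img).enhance(contrast)
    img = ImageEnhance.Sharpness(img).enhance(sharpness)
    return img

out_dir = Path("perturbed_positives")
out_dir.mkdir(exist_ok=True)

# Perturbing only the COVID-positive images lets a model separate the classes
# by image quality alone, without learning any COVID-related features.
for path in Path("covid_positive_png").glob("*.png"):
    perturb(Image.open(path).convert("L")).save(out_dir / path.name)
```

Factors close to 1.0 keep the change nearly imperceptible to the eye, which mirrors the point made above about how subtle these shortcuts can be.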
By design, if this trained detective is then tested on some other dataset and the AUC is close to 0.5, there is no shortcut in that dataset; otherwise, it indicates there may be shortcuts related to sharpness or contrast. Here are the results from the trained shortcut detectives: we indeed found severe shortcuts in the public COVID datasets released in the early stage of the pandemic. On the contrary, when we looked at other datasets that were more carefully curated and released later in the pandemic, we found no obvious shortcuts. That is why it is important to apply quality assurance during data curation, so that these shortcuts can be avoided.
Here are a few considerations we summarized for this COVID classification task. First, make sure you collect the native DICOM images and do the processing yourself, so that you do not accidentally introduce shortcuts related to sharpness or contrast. Second, make sure the labels of the dataset are accurate. By label, I mean whether a chest X-ray is COVID positive or negative; to define that, you have to look for an RT-PCR test close to the imaging date. We used a fairly narrow window, from minus three days to plus three days between the imaging study and the RT-PCR test, to ensure the labels are accurate. Finally, make sure to collect both the positive and negative cohorts from the same hospitals within the same time range, to mitigate shortcut learning as much as possible.
With this, we curated our own training dataset, and we also collected over 25,000 chest X-rays from different sources for external validation; this test set has over 15,000 patients in total. With these data, we can now answer some questions regarding the model's generalizability. The first is: can a model trained on a high-quality dataset from a single institution, a single site, generalize to other sites? The second is: how do the model's performance and generalizability change with the amount of data used for training?
We used the DenseNet model architecture and applied a three-stage transfer learning approach: the model was first trained on the ImageNet dataset, then on the pre-pandemic NIH chest X-ray dataset, and finally fine-tuned on the COVID chest X-ray dataset. For each training set, we trained five different models with different training/validation splits and then ensembled them to get the final prediction score. Here are the results. The performance in terms of AUC is very consistent across all the test sets: the internal test set is 0.82, and these are the three external test performances. We know that AUC measures how well a model can separate two classes, but it is not directly relevant to the final decision-making until we pick a threshold. Quite often the AUC may be similar for two different test sets, but the thresholds needed to achieve a similar level of sensitivity or specificity can be very different. For this model, when we choose the same operating threshold, the sensitivity and specificity are all quite consistent across the test sets.
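A sketch of what the ensembling and fixed-operating-point evaluation just described could look like in code is shown below. The score arrays, labels, and threshold are placeholders; the only ideas carried over from the talk are averaging the five models' scores and applying one threshold unchanged across test sets.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(scores_per_model: np.ndarray, labels: np.ndarray, threshold: float):
    """scores_per_model: (5, n_cases) sigmoid scores; labels: (n_cases,) with 0/1."""
    ensemble = scores_per_model.mean(axis=0)            # average the five models
    auc = roc_auc_score(labels, ensemble)
    pred = ensemble >= threshold                        # same operating point everywhere
    sensitivity = (pred & (labels == 1)).sum() / (labels == 1).sum()
    specificity = (~pred & (labels == 0)).sum() / (labels == 0).sum()
    return auc, sensitivity, specificity

# Placeholder arrays standing in for one internal or external test set.
rng = np.random.default_rng(0)
labels = rng.integers(0, 2, 500)
scores = rng.random((5, 500)) * 0.6 + 0.3 * labels      # crude, overlapping score-like values
auc, sens, spec = evaluate(scores, labels, threshold=0.5)
print(f"AUC={auc:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```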
We can look further at the score distributions, and they are all very similar: the majority of COVID-positive cases have a score close to one and the negative cases have a score close to zero, but both distributions have long tails. That means the AI model cannot perfectly differentiate COVID-positive and COVID-negative cases purely from a chest X-ray, which is also true for radiologists; here we show it is the case for the AI model as well.
So we have shown that if the quality of the dataset is assured, we can train a model from a single clinical site and have it generalize to external sites. But how much data do we need to make the model generalizable? This question is important, especially in a pandemic setting, where we want to have a model ready as soon as possible instead of waiting years to gather enough training data. To study this training-data-size dependence, we sampled training datasets of different sizes, from a very small set of only 100 patients up to a large set of 6,000 patients, all drawn from the full training dataset. For each size, we sampled 10 times so that the uncertainty of the sampling process is accounted for; that means for each data size we have 10 different models and 10 AUC values, from which we can evaluate the mean and standard deviation of the AUC. Here are the results: as the training dataset size increases, the performance on all of the test sets, both the internal set and the three external sets, increases. We can fit the so-called learning curve of this model using the functional form shown here. Looking at these learning curves, the performance may always increase as we add more training data, but the benefit of adding more data is diminishing. For example, on this curve, to improve the performance from 0.81 to 0.83 we would need to increase the dataset from 6,000 patients to almost 100,000 patients. However, we also notice that even with a small dataset of just around 100 to 200 patients, the model already has a pretty good baseline performance that may already be useful in the early stage of a pandemic. Another thing to note is that even with the small datasets, the performance is consistent across all four test sets. In some sense, then, the generalizability of the model does not depend on the data size, but rather on the data quality.
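The functional form used on the slide is not reproduced in the transcript; as an assumed stand-in, a commonly used inverse power-law learning curve, AUC(n) = a - b * n^(-c), can be fitted with scipy as in the sketch below. The (size, AUC) points are placeholders, not the measured values.

```python
import numpy as np
from scipy.optimize import curve_fit

def learning_curve(n, a, b, c):
    # Assumed inverse power-law form; 'a' is the asymptotic AUC.
    return a - b * n ** (-c)

# Placeholder (training-set size, mean AUC) pairs.
n = np.array([100, 200, 500, 1000, 2000, 4000, 6000], dtype=float)
auc = np.array([0.74, 0.76, 0.78, 0.79, 0.80, 0.81, 0.81])

params, _ = curve_fit(learning_curve, n, auc, p0=[0.85, 0.5, 0.3], maxfev=10000)
a, b, c = params
print(f"fitted asymptote a = {a:.3f}")
print(f"predicted AUC at n = 100000: {learning_curve(1e5, a, b, c):.3f}")
```

Extrapolating a curve like this is how one can estimate, for instance, how many patients would be needed to move from 0.81 to 0.83, as discussed above.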
To summarize, data quality is much more important than data size. Models trained on well-curated data, even from a single clinical site, can be generalizable, and even with a small training dataset we can already obtain a decent baseline model with good generalization performance. One limitation I want to mention is that all of these tests are still retrospective. To evaluate AI model performance on real clinical cohorts, we need prospective testing, which requires the model to be implemented and deployed in the actual clinical environment. That is why Dr. Garrett's talk will introduce these important topics next. With that, thank you for your attention. Now we welcome Dr. John Garrett to the podium for his talk titled The Clinical Integration of AI Models: Input Data Selection, Implementation Methods, and Ongoing Validation.
Thanks very much for the introduction, and thank you to the audience for sticking around; I know it's late on a Thursday, so I'll try to keep it interesting. We have already seen two great talks, plus a whole week of really exciting work showing all of the things we can do with AI. The field is definitely growing; I always like to show this curve at the beginning of a talk to emphasize that we are still very much on the upswing of AI growth in medical imaging. But the gap between the research and technical developments and what is actually getting used in practice is pretty dramatic. We jokingly call this the valley of death, and it is true that there is a huge gap between what is being developed and what is actually practical to implement. There are many reasons for this, but one challenge is that there is not a single best way to do it: even if an institution has taken one model and deployed it, that does not mean the next model that comes along will make any sense to deploy the same way. In the spirit of keeping this talk grounded in physics, it is very important, both for training and for implementation, to think carefully about the data being fed into the model. It is highly unlikely that the clinical data being collected and processed in real time will ever be as tidy as the beautiful datasets we can curate for research, but doing careful work up front to funnel the right data into a model goes a long way toward helping it perform well over time.
Over the course of this talk I will try to offer some suggestions to help bridge that so-called valley of death. I will do that by talking about selecting the right input data, then covering a few of the different pathways to taking a tool into clinical production; this is by no means a comprehensive discussion, but hopefully I will cover the main categories. I want to end by talking a little about how you might monitor these tools once they are live.
To start, I want to talk about why choosing the correct data as an input to a model is difficult. As I mentioned, and as we saw in the prior two talks, a great deal of work goes into curating datasets for research and development, but in general our clinical data are quite a bit messier. This has to do with things like variation in protocols from site to site: a biphasic liver exam at UW may look totally different than it does at Henry Ford or another site, in terms of the order of the series, the way they are arranged in PACS, and whether the reconstructions are presented as thicks or thins. Scanners from different manufacturers may vary quite a bit, and there may also be patient-specific issues that cause protocols and images to look different, such as truncation for larger patients. So an important piece of translating any of these tools into practice is thinking very carefully about the logic used to select the correct images. This brings us back to the same task we have already discussed today: taking a chest radiograph and diagnosing COVID-19 from it.
In this case, we may have some very good hypothetical rules: we built a model that runs on chest radiographs; it needs an AP or PA view; we want adult patients only; and of course the chest needs to be included in the image. Those may be the rules as we state them, and implied in them is that we are usually not talking about post-processed or other derived images, but primary images. We need to identify the correct study types, so these should be chest X-rays, but identifying the series itself can take a fair bit of logic. Even in this very simple case, a snippet of Python code, which nobody needs to read in detail, highlights that it can take many lines of code to make even the simple distinction between lateral and AP-type views. Radiography is relatively straightforward in most cases, although there is certainly some variability; as you get to more advanced modalities such as CT, this process can become even more challenging.
So consider a hypothetical example in which we have built a body composition package that runs on abdominal CT studies. We can define the abstract rules: we are looking for axial images through the abdomen; we would like non-contrast ideally, but at least non-arterial images; in a perfect world we would like them thin; and we want the standard kernel reconstructions. At first glance this may seem straightforward, but as we can see in the study on the right, there are actually four different series that, if you looked only at the DICOM header fields, would theoretically be eligible for this algorithm. Coming back to the physics grounding of this talk, it is important not to look only at things like the series descriptions, but to start with your protocols, because in most cases the most robust way to identify which series is most appropriate for your application is to look at the beginning of the process, at what is being prescribed. This often involves working with the entire protocol team, the physics group as well as the technologists and the clinical stakeholders, to understand which characteristics of the image you care about and how that translates into the series you need to use. This can be iterative: as more and more tools are integrated clinically, taking protocols off the shelf and using them is great, but we have actually modified some protocols to accommodate applications being deployed, for example adding reconstructions that are only used for certain purposes. Once you have worked with that group and identified the right data, I strongly encourage you to do some of this work offline before you open the floodgates: pull live cohorts, real data as they are acquired, and confirm that they match what you are looking for. And one capitalized "don't", because we have run into it many times: don't count on things like series numbers to identify data.
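Returning for a moment to the radiograph example, the kind of header-based Python snippet mentioned there, distinguishing frontal from lateral views and screening basic eligibility, might look roughly like the sketch below. The specific tags, fallbacks, file name, and age cutoff are illustrative assumptions; production logic usually needs many more special cases.

```python
import pydicom

FRONTAL_VIEWS = {"AP", "PA"}

def is_eligible_cxr(ds) -> bool:
    """Header-only check: adult, frontal, primary chest radiograph."""
    if ds.get("Modality") not in ("CR", "DX"):
        return False
    image_type = [str(v).upper() for v in ds.get("ImageType", [])]
    if "DERIVED" in image_type or "SECONDARY" in image_type:
        return False
    view = str(ds.get("ViewPosition", "")).upper()
    desc = str(ds.get("SeriesDescription", "")).upper()
    if view and view not in FRONTAL_VIEWS:           # e.g. "LL" laterals
        return False
    if not view and "LAT" in desc:                   # fall back to the description
        return False
    age = str(ds.get("PatientAge", ""))              # DICOM age strings like "045Y"
    if age.endswith("Y") and age[:-1].isdigit() and int(age[:-1]) < 18:
        return False
    return True

ds = pydicom.dcmread("example_cxr.dcm")
print(is_eligible_cxr(ds))
```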
Coming back to the body composition example from two slides ago: had we gone to the protocol first, instead of just looking at the list of series in PACS, we would have seen that the second set of recons, with a series description of Thin STWC, is the most eligible series for us to use in this case. We can find that easily enough in the PACS acquisition and build rules to route those series to our algorithm.
Once you have selected the right data for your tools, the next step is to think about how the model itself is going to be implemented. This definitely relates to the data being used: in many cases we want images, almost always although not always, but there are certainly other data types that may be important to think about. Before diving into the methods themselves, I want to highlight some of the use cases for AI in medical imaging today. This is certainly not a comprehensive list, but I think it helps explain why we might care about so many different types of data. To start, there are things like scheduling or no-show predictions and other imaging-adjacent tasks; these are unlikely to need image data, although who knows how useful that may be in the future. Image processing is perhaps closest to some of the other discussions today: taking the images themselves and either processing them for review or extracting data from them. A very popular suite of tools today is used for triage of studies: creating a finding based on the imaging content itself and using it not necessarily to make a diagnosis, but to help get eyes on the study more quickly. Then there is the broad field of the various types of CAD. Triage falls into that, but these tools really make an actual diagnosis from the image and potentially trigger workflows, or at least make something available to the radiologist reading the study. I like to split these into two categories: one is the timely diagnosis, something like a brain bleed that needs an immediate response, and the other is incidental or opportunistic findings that may be important to record and know about clinically but do not require immediate intervention. The final use case is to help speed up all sorts of laborious tasks that the clinical teams do not want to do manually; pediatric bone age is a great example of something that is quite tedious to do in practice, as is automating measurements. As you can see, these applications have very different audiences and may need different types of data to work well, so the right deployment strategy for each of them may differ by context.
A few other things to think about as we discuss deploying these tools concern where the tools originate. Academic institutions, and now many other practices as well, are developing their own tools. This is very exciting; these are often truly cutting-edge tools that do not exist in many places, and typically they are built under IRB approval or treated as a non-significant-risk device.
Increasingly, of course, there are also plenty of commercial offerings, which can be freestanding applications as well as offerings embedded in other tools. An example of the latter might be something in PACS that helps, for example, define a hanging protocol automatically. Each of these may require different types of data, and, recalling the informatics 101 discussion earlier, it is worth keeping in mind the different systems these data come from: image data most of the time comes from the PACS or from the modalities, clinical data from the EMR or the RIS, and report data from your dictation software and possibly the RIS.
The first implementation method I want to walk through is tools that are embedded in the PACS itself or in the modality. These could be the hanging protocol tool I mentioned, and there are new algorithms coming out that are deployable on devices to help with things like tube placement. In this case we are almost always talking about image-only models, or at least models limited to the data that accompany the images, such as what is in the DICOM header. The audience is usually the radiologist if the tool is in the PACS, or potentially the technologist if it is on the device, and these tools are almost always related to image interpretation or, potentially, to helping with the study at hand, such as prescribing the scans. This method has some nice characteristics: it takes place very close to the images, so processing can be very quick, and the PACS or the modalities are usually quite capable of handling logic around images or triggering when certain images arrive. But, as I alluded to, access to other data may be a challenge, and compute is worth thinking about, particularly for models deployed on devices. It is not common practice today to deploy a huge cluster of GPUs on your portable X-ray units, so models that run here probably need to be lighter weight than things that might run in the cloud.
Another method is embedding a model directly in the electronic medical record. This may seem a little counterintuitive to discuss at a radiology conference, but it has the benefit of tremendous access to other clinical data, and also the ability to trigger other workflows in the clinical environment that go beyond the radiologist. This is not tremendously common today; I will walk through one example in a bit that shows why it may be useful. The choice here is mostly driven by who is consuming the results from the model: if the result may trigger a clinical workflow where, say, a nurse needs to do something, this may be a good place for it.
The next methods are the more commonplace ones; if you walked the vendor floor this week, you likely saw each of them. The first is dedicated tools. These are the most traditional applications that a radiology department would be looking at acquiring.
Typically, this is standalone software, either deployed on a local VM or possibly in the cloud, occasionally on a dedicated workstation, although that grows less and less common every year. These tools are very tailored in most cases: a specific application to process a specific type of image, and they can be algorithm-specific or workflow-specific, covering things like 3D workflows. They do have some benefits. Compute is usually very good, because they do one specific thing and the provisioned hardware is optimized for that task. But this is where we start to run into the informatics challenges that come along with them: every tool deployed this way is treated as a standalone software package, which means the interconnects, the DICOM and HL7 traffic, are basically a manual effort to coordinate for that one system, and security reviews are handled as one-offs. That can be quite a lot of work; I cannot speak for every institution, but at least at UW this can be a significant bottleneck to getting these tools implemented.
The more popular way to think about this today is the use of an orchestration engine or an AI platform. These are available from many different vendors and are certainly also being built in-house at different sites. The concept is a single platform that goes through the integration process and security review once, receives all of your imaging and whatever clinical data are needed, and then passes that off to the appropriate algorithm it is orchestrating. These can be on-premise, but increasingly they are hosted in the cloud as well.
I do want to walk through a few of these practically and show what they may look like. The first is an orchestration engine we developed at UW that we call our Intelligent PACS Router. Intelligent may be a bit of a strong word, but it was developed partly with AI in mind and partly for some of the other imaging workflows we have. It is a way for users to automatically or semi-automatically trigger things like image de-identification, routing studies to a research PACS, or running a particular algorithm. We have a growing group of researchers developing models that we are trying to deploy locally, so we needed some framework to do that. The system is integrated and in use at UW today. The way it works, and this is representative of many of the orchestration platforms out there, is either fully automated behind the scenes, which makes for a very bad demonstration, or semi-automated, where there is a right-click or a main toolbar option in PACS to trigger the router. Typically that launches some type of web interface; this is what ours looks like. The top shows some of the metadata associated with the study and which user is making the request, and below that is a pruned list of eligible algorithms or workflows based on the study type and related attributes.
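As a toy illustration of how such a pruned list of eligible algorithms might be derived from study attributes, consider the sketch below. The rule definitions, algorithm names, and matching keywords are hypothetical, not those of the UW router.

```python
from dataclasses import dataclass

@dataclass
class Rule:
    algorithm: str
    modalities: set
    description_keywords: tuple

RULES = [
    Rule("pediatric_bone_age", {"CR", "DX"}, ("BONE AGE", "HAND", "WRIST")),
    Rule("cxr_covid_score",    {"CR", "DX"}, ("CHEST",)),
    Rule("body_composition",   {"CT"},       ("ABDOMEN", "ABD")),
]

def eligible_algorithms(modality: str, study_description: str) -> list:
    """Return the algorithms whose rules match this study's modality and description."""
    desc = study_description.upper()
    return [
        r.algorithm
        for r in RULES
        if modality in r.modalities and any(k in desc for k in r.description_keywords)
    ]

print(eligible_algorithms("DX", "XR Hand and Wrist Bone Age Left"))
# ['pediatric_bone_age']
```

In practice, an orchestration platform layers this kind of matching onto live DICOM and HL7 feeds, handles the routing, and returns results to the PACS, as described in the example that follows.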
In the example shown, the study was a radiograph of a pediatric patient's hand and wrist, so we select the bone age algorithm. When that gets routed, it triggers in the background the image being moved to the compute, getting processed, and the results being pushed back into PACS; here we can see the DICOM report that lands there and is available for interpretation.
The other workflow I want to walk through, even though I said it may seem a little out of place at a radiology conference, is integration directly in the EMR. This is a useful one to talk through because the way we think about triggering and passing results back is quite different. In this case, instead of triggering with a manual request, the process actually starts at the time of ordering. We use Epic, and in our case, when that X-ray is ordered, it triggers a process that starts to look in the PACS for the X-ray that belongs to that order. Once the image is available, a script runs in the background to move it from the PACS, because Epic of course does not have direct access to it, into the cloud environment where the compute takes place. The image is then scored, assigned a COVID risk score, and that score is passed back, not into the PACS, but directly into Chronicles, the live database that Epic uses. The score is then available for a variety of clinical workflows, and one of the reasons this approach has potential is the ability to build things like dashboards. This is an ED census dashboard where you can see patients color-coded by the COVID risk score from their X-ray. At the time this report ran, one of the things we got really excited about was that out of 17 patients on the list, only a single patient had a PCR test result back, but 14 of the 17 had X-rays that were already scored, so the color coding lets us risk-stratify even while we are waiting for labs to come back. Another potential benefit is the ability to look at trending over time in a clinician-friendly way: there is a hover bubble, and you can hover your mouse over a patient and see how the score has evolved over, say, the last two days and six X-rays. This may be somebody who warrants another PCR test, or maybe something else.
I do want to quickly mention governance. Depending on where these models are deployed and who consumes the outputs, governance can mean very different things, but it is important to have a formal structure for it, whether internal to the department or at the health system level. To reiterate the point: for us, that Epic integration took maybe four to six months to build, implement, and test, and then ran for more than a year in the background before we were able to go live with it clinically, because the governance was pretty complicated given the non-radiologist consumers. In terms of learning governance and defining policies, it is quite a bit easier to start smaller. The last topic I will cover, fairly quickly, is model monitoring and ongoing validation.
Once a model is deployed through whatever path you have decided on, and you have done your initial research validation, hopefully using some of the techniques we saw earlier to make the model as generalizable and portable as possible, it is still very important to monitor it in an ongoing way. This can include very simple metrics: how many studies are we processing, how many tools like this are we using, and how many times are users actually working with them. It also includes somewhat less obvious questions: if a tool is expected to save time reading a certain type of study, is it doing that? Is it performing accurately, and is it fair? At our institution, the framework we have for defining these metrics essentially requires that we answer questions in each of three areas. We think about quality, such as performance measured objectively by looking at the area under the curve, but we also focus on fairness: is the tool treating all of our populations equally, and is it being used everywhere it should be? And then performance, not just in terms of clinical throughput but also compute, because cost is a real factor as we scale more of these things out. Some more specific versions of these metrics are shown here, and we have covered most of the quality ones in the earlier talks. For fairness, I will just say that we feel pretty strongly it is helpful, before deployment, to define the populations you would like to look at, so that you can evaluate performance across them in an unbiased way. And in terms of performance, keep track of resource limits, whether those are constrained by physical hardware or by budget, and think ahead about things like that.
To summarize the monitoring piece: defining metrics that answer those questions is really important; keep an eye on them over time; and find ways to make sure the monitoring is not just running in the background with nobody paying attention, but that there is actually a path to intervention. If you are not meeting certain metrics, there should be a chance to say we need to update the model, turn it off, or expand the audience. Model shadowing is a really nice way to support this: you can deploy multiple models running in the background and keep track of their performance over time, so that you know in advance, before you turn something on, whether it is going to be useful.
To wrap up: building AI models is certainly a real challenge. It takes good data and it takes skill, but in a lot of ways it is also only a starting point, and there is a lot of work to be done once a model is built to actually implement it in practice. Keeping an eye on what data need to be fed in is not trivial and should be a team effort. Thinking carefully about the target audience and the data a model needs will help you find the best way to deploy a given tool, and ongoing monitoring is really important to make sure it is doing what you expect. With that, I will say thanks for your attention.
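As a concrete illustration of the subgroup-level monitoring described in this last talk, a minimal sketch might compute the AUC separately for predefined patient and equipment subgroups and flag any group that falls below an agreed floor. The dataframe columns, subgroup names, and floor value below are hypothetical.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

AUC_FLOOR = 0.75
SUBGROUPS = ["sex", "age_band", "scanner_vendor", "site"]

def monitor(results: pd.DataFrame) -> None:
    """results needs columns: label (0/1), score (0-1), plus the subgroup columns."""
    for col in SUBGROUPS:
        for value, group in results.groupby(col):
            if group["label"].nunique() < 2:
                continue                      # AUC undefined for one-class groups
            auc = roc_auc_score(group["label"], group["score"])
            status = "OK" if auc >= AUC_FLOOR else "REVIEW"
            print(f"{col}={value}: n={len(group)} AUC={auc:.3f} [{status}]")

# Example with placeholder monitoring data.
results = pd.DataFrame({
    "label": [0, 1, 0, 1, 1, 0, 1, 0],
    "score": [0.2, 0.8, 0.4, 0.7, 0.9, 0.1, 0.6, 0.3],
    "sex": ["F", "F", "M", "M", "F", "M", "F", "M"],
    "age_band": ["<40", "<40", "40-65", "40-65", "65+", "65+", "<40", "40-65"],
    "scanner_vendor": ["A"] * 4 + ["B"] * 4,
    "site": ["main"] * 8,
})
monitor(results)
```

A "REVIEW" flag would feed the intervention path described above: updating the model, turning it off, or revisiting the intended audience.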
Video Summary
The session titled "Data Curation for AI with Proper Medical Imaging Physics Context" emphasizes the critical role of medical physics in AI development for medical imaging. The importance of scientific and technical foundations, including physics and math, quality assurance, and safety, are highlighted as crucial to AI development and deployment in clinical settings. The session begins with a presentation from Jiwa Qi, focusing on data preparation for AI, exemplified by a COVID-19 chest X-ray diagnosis project. The talk stresses the significance of collecting high-quality, generalizable data and navigating the complexities of data storage and sharing under ethical and legal constraints. <br /><br />Dr. Ran Zhang continues by discussing data quality and the importance of avoiding shortcuts in model development. Using COVID-19 classification as an example, he emphasizes the necessity of a well-curated, high-quality data set even from a single clinical site for achieving generalizable models. Shortcut learning can yield misleadingly strong results on internal datasets but falter on external ones. Continuous monitoring of AI models for performance consistency based on external datasets is advocated.<br /><br />Dr. John Garrett focuses on implementing AI models in clinical settings, showing the substantial gap between research and practical application. He discusses various methods for AI integration, ranging from embedding models in PACS systems or EMRs to using dedicated AI platforms. The importance of choosing correct data inputs, considering use-case specific needs, and ongoing validation are emphasized to ensure AI tools are both effective and efficient in practice.
Keywords
Data Curation
Medical Imaging
AI Development
Medical Physics
Quality Assurance
COVID-19 Diagnosis
Data Sharing
Model Validation
Clinical Integration
AI Platforms