Data Curation for AI with Proper Medical Imaging P ...
M3-CHP09-2022
Video Transcription
So, the first talk is titled "Preparing High-Quality Clinical Data for AI: The Essentials for Medical Physicists." It is easy to understand that data quality is of vital importance for AI model development and its clinical adoption; in clinical applications, it ultimately affects the quality and safety of our patient care. Medical physics plays a crucial role in AI development and its clinical adoption. First, quality and safety fall naturally into the domain of medical physics. Second, there is a lot of physics behind data and AI, and it is not until we give proper consideration to all the physics involved that we can fully harness the power of AI and big data. So it is both a natural fit and an area of strength for medical physics to be actively involved in AI development.

The outline of this talk is as follows. How do we go about preparing high-quality data for AI model development? We will follow a simple rationale in planning out the project, built around three simple questions: what, where, and how. What data are we going to acquire? Where are we going to find them? And how do we build our workflow? To put the discussion in a more practical context and help you understand the various elements better, I will use an example project to guide the discussion: chest X-ray-based COVID-19 diagnosis using AI. This has been a heavily researched topic over the past few years of the COVID pandemic. The goal is simple and clear: we want to develop an AI model that is highly accurate and that, in a clinical setting, can do its job with minimal turnaround time. We need to curate a large dataset for the training, validation, and testing of the models. We are also very interested in collaborating with experts in domains such as machine learning, so we are prepared to share the data with external collaborators, and we feel compelled to contribute the data to some of the large data commons out there. One great example is MIDRC, and you can learn a lot here in the Learning Center about MIDRC, a grand initiative co-sponsored by RSNA and the AAPM to facilitate COVID-imaging-related research.

So, let's start with the question of what. Although a large focus will be on image data, there will be plenty of non-image data we would like to acquire, so you want to define clearly what kind of image versus non-image data you need. Do we need a control group? How do we build it? How do we build the labels? In the model development process, we often need to assess model performance and put in effort to optimize the model, so you also want to collect variables that could potentially affect that performance and build them into your project planning phase. For the example project we discussed, the image data we want to acquire are simply chest X-ray images, lots of them. For the non-image data, there are mainly two categories. One is PCR testing results, widely considered the gold standard for the patient's infection status. In addition, we also need to acquire data from radiology reports; specifically, we want certain key findings. The reason this is needed is that testing results and chest X-ray images do not necessarily happen at the same time; they could happen on different days.
So, in order to confirm that the images were indeed acquired while the patient was infected, it helps to limit the selection to images whose report findings indicate something like pneumonia or similar findings, because it is possible the patient had only just caught the infection or had just recovered; those are complicating factors. This is where you build in that consideration and help ensure the quality of your data. Certainly, we need a control group, built mainly from COVID-negative patient images. The labels in this project will be created by combining the testing results with key findings from the radiology reports. We also understand that when it comes to X-ray imaging, there is a large variety of device vendors and models out there, and they all employ mostly proprietary methods to process the data before the images are finally presented for diagnostic interpretation. So we hypothesize that this could be a factor affecting model performance, and we would like to develop a model that is robust against such variables. This is certainly highly dependent on your project design, but it is an example of how, in the planning phase, you want to decide which variables to collect. Device vendor and model are easy to collect because they are embedded in the DICOM metadata, but for other variables that are part of your research, you may need to find their specific sources.

Next, let's move on to where. Where do we find all those data? Here is a quick radiology informatics 101 to explain the entities and standards in routine use in modern clinical care. As shown in this illustration, clockwise from the top, we have the hospital information system, also known as the electronic medical record; the radiology information system (RIS); the modality devices that mostly produce the data; the PACS servers that mostly host the data; and the various workstations that also contribute to the clinical imaging data. They all follow certain standards, and it helps to know the standards they use to produce and communicate the data, as well as the database engines working in the background that power all those different servers. The imaging data largely follow the international DICOM standard, whereas electronic health data mostly follow the HL7 standard. They call it big data for a reason: it is common for imaging-related AI projects to handle large amounts of data, easily ranging from a few hundred gigabytes up to a few tens of terabytes. So it is wise to plan out from the beginning how you will store your data, since the data are important for your research. The storage media choices include hard disk drives, solid-state drives, cloud storage options, and so on. To choose the best one for your particular project, you may want to trade off among various factors, including storage size, speed, redundancy, security, and cost.

Back to the example project: the table lists where we get the data for the project. The chest X-ray images come from the PACS server. The list of patients with PCR testing results comes from the electronic medical record system. It was common during the COVID pandemic for EMR systems to put up COVID dashboards or reports showing real-time updates of COVID cases in the healthcare system.
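Since device vendor and model are embedded in the DICOM metadata, a small script can harvest them alongside the images during curation. Below is a minimal sketch using the open-source pydicom library; the folder name, output file, and tag selection are illustrative assumptions rather than anything shown in the talk.

```python
import csv
from pathlib import Path

import pydicom

FIELDS = ["sop_instance_uid", "manufacturer", "model_name", "study_date"]
rows = []
for dcm_path in Path("raw_cxr_dicom").rglob("*.dcm"):        # hypothetical local folder
    ds = pydicom.dcmread(dcm_path, stop_before_pixels=True)  # read the header only, much faster
    rows.append({
        "sop_instance_uid": str(ds.get("SOPInstanceUID", "")),
        "manufacturer": str(ds.get("Manufacturer", "")),          # tag (0008,0070)
        "model_name": str(ds.get("ManufacturerModelName", "")),   # tag (0008,1090)
        "study_date": str(ds.get("StudyDate", "")),
    })

with open("acquisition_metadata.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
```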
Those EMR dashboards and reports are where you obtain the information regarding COVID-positive or COVID-negative cases. The diagnostic key findings from the radiology reports are found in your local department's radiology information system. For storing the data, we use hard disk drives for local storage and cloud storage for easy sharing with remote collaborators.

The next step is how. But before we roll up our sleeves and get into the data collection part of the work, please be reminded that this is human subject research: we still need to go through the process of IRB review, and if data sharing is involved, data sharing agreements should be entered into whenever applicable. Because of the large amount of data we are dealing with, it is certainly wise to build a workflow with automated processes requiring minimal human intervention. Your exact workflow will depend heavily on your local informatics infrastructure and setup, but a typical workflow looks something like this: you start by querying the data, then you retrieve them from the remote server, and you run your de-identification and additional cleaning before storing them for local use or, if needed, sharing them with your external collaborators or public data commons.

It certainly helps to have access to good tools and resources for carrying out research projects, so we will go through some of them here based on our experience. First and foremost, you will need DICOM toolkits and libraries one way or another. They may be embedded in software you already have, or you may need to run them independently. They mainly help you handle DICOM files in terms of management and network communication, and there are plenty of open-source resources you can find online. When it comes to mining the radiology reports, you basically need software capable of natural language processing that can go through the reports, analyze them, and search intelligently for what you are interested in. There are open-source options and proprietary software packages to choose from. Here is an example of a proprietary package where you enter your search criteria: in this example, you can see we specify that we want chest X-ray or digital X-ray images in a specific date range, the procedure type we are interested in, and a keyword in the impression section of the radiology report. You can certainly put together more complicated queries using regular expressions or other mechanisms, but these tools are incredibly helpful for searching radiology reports to get useful data.

We do not necessarily need to look far to find useful tools and resources. RSNA has developed some quite useful software tools in the past. The example we are showing here is the so-called RSNA-CTP, or in its full name, the Clinical Trial Processor. If you talk to the MIDRC staff, this is a tool they use extensively in building out their data commons. This is an imaging research tool RSNA built in the past. It has numerous research-friendly features and can be flexibly configured. It was originally designed for running multi-site clinical trials, so that sites from all over could share clinical data across the Internet and merge them at the PI's site. However, it can easily be reconfigured and deployed in your local network so that you can collect data from multiple data sources effectively.
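Before turning to CTP's pipeline design, here is a rough sketch of the query step in the workflow above, using the open-source pydicom and pynetdicom libraries (Python scripting plus DICOM libraries is how the PACS database was queried in this project, per the tools table that follows). The AE titles, host, port, and matching keys are hypothetical and would need to match your local PACS configuration.

```python
from pydicom.dataset import Dataset
from pynetdicom import AE
from pynetdicom.sop_class import StudyRootQueryRetrieveInformationModelFind

ae = AE(ae_title="RESEARCH_AE")                      # hypothetical calling AE title
ae.add_requested_context(StudyRootQueryRetrieveInformationModelFind)

query = Dataset()                                    # study-level C-FIND identifier
query.QueryRetrieveLevel = "STUDY"
query.ModalitiesInStudy = ["CR", "DX"]               # conventional and digital radiography
query.StudyDate = "20200301-20201231"                # date-range matching
query.StudyDescription = "*CHEST*"
query.StudyInstanceUID = ""                          # ask the PACS to return the UID

assoc = ae.associate("pacs.example.org", 104, ae_title="PACS_AE")  # hypothetical PACS
if assoc.is_established:
    for status, identifier in assoc.send_c_find(query, StudyRootQueryRetrieveInformationModelFind):
        if status and status.Status in (0xFF00, 0xFF01) and identifier:
            print(identifier.StudyInstanceUID)       # candidate study for later retrieval
    assoc.release()
```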
CTP features a pipeline-based design, so it is kind of like connecting Lego pieces. As shown here, a typical pipeline involves modules like these: you start by importing your DICOM images; you can run filtering functions for your exact interest; you can decompress the data, anonymize the pixel data, map each study, and assign research IDs, among other steps, before you eventually store the images on your destination devices. It is also known for easy deployment and use: it features a web interface like this one that is relatively easy to navigate, and it is very lightweight and platform-independent. Quite often, we also need scripting tools to connect the various functional modules and streamline the workflow. This typically does not involve too much work, but it is necessary to build your workflow, and there are nearly unlimited choices you can pick from; here we just list a few, but you are not limited to them.

For the example project, IRB approval was easily obtained via expedited review, due to the retrospective nature of such AI research projects. The tools and resources we used are shown here in this table. For searching EMR records, we used the reporting tools that come as part of the proprietary EMR software. We used Python scripts plus DICOM libraries to query the PACS database. We used proprietary software, Illuminate Insight, to query radiology reports. We used RSNA-CTP plus some Python scripting to manage the data transfer and the sharing with external collaborators. So in summary, high-quality data are of vital importance for AI development; medical physicists bring unique value to the process of AI model development as well as its clinical adoption; and the data preparation process can easily be guided by the relatively simple rationale presented here, following the simple questions of what, where, and how. That will end this presentation.

Okay. Good morning, everyone. My name is Ruan Zhang, and I am from the University of Wisconsin-Madison. In the first talk of this session, Dr. Key introduced some methods and tools that can be used to curate data for AI. In this talk, I will discuss why high-quality data are important for AI and how they are connected to the model's generalizability. I will use COVID classification from chest X-rays as an example. This is a standard binary classification problem in which each input X-ray is assigned a score between zero and one, and we choose a threshold, or cutoff value: if the score is above the threshold, the chest X-ray is predicted as COVID, otherwise as non-COVID. To train such a neural network model, we just need to prepare COVID-positive and COVID-negative chest X-rays and assign their labels as one or zero. Then we can optimize the weights of the network to minimize the cross-entropy loss. Conceptually, this is no different from the classification of cats and dogs, which is one of the most popular deep learning projects for beginners, so everybody can actually do it. To start, you would search the Internet and download a lot of images of cats and dogs, and from there, everything is straightforward. So we could certainly do the same and curate data in a similar fashion for the COVID classification task. In fact, people in the community have been doing this over the past two years.
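As a concrete sketch of the binary scoring setup just described (a sigmoid score between zero and one, binary cross-entropy training, and a cutoff that calls COVID versus non-COVID), here is a minimal PyTorch example. The DenseNet backbone anticipates the architecture mentioned later in the talk; the 0.5 default threshold and the overall structure are illustrative, not the speakers' actual training code.

```python
import torch
import torch.nn as nn
import torchvision

# Backbone with a single-logit head (the specific backbone here is an assumption).
model = torchvision.models.densenet121(weights=None)
model.classifier = nn.Linear(model.classifier.in_features, 1)

criterion = nn.BCEWithLogitsLoss()  # binary cross-entropy computed on the raw logit

def training_step(images: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """images: (N, 3, H, W); labels: (N,) with 1 = COVID-positive, 0 = COVID-negative."""
    logits = model(images).squeeze(1)
    return criterion(logits, labels.float())

def predict(images: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Return 1 (COVID) where the sigmoid score exceeds the cutoff, else 0 (non-COVID)."""
    with torch.no_grad():
        scores = torch.sigmoid(model(images).squeeze(1))
    return (scores > threshold).long()
```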
To get COVID-positive chest X-rays, people look for online preprints, educational websites, and public data sets released by different institutions. To get COVID-negative chest X-rays, one of the safe choices is to look at pre-COVID data sets, such as those used in grand challenges. However, if you do this, you may find that your model performs extremely well on your holdout data set, but if you test it on real-world clinical data sets, it may perform no better than random guessing. This is because of shortcuts, or biases, in the training data. Here is an excellent example from a review article. This model learned to perfectly classify COVID versus non-COVID in the training data, but if we have access to the meta-information, we may find that in the curated data most COVID-positive cases are from older patients, while most COVID-negative cases are from young patients. The model actually learned to classify based on the patient's age, so when we apply it to a real clinical cohort where, as in this example, all patients are from the same age group, the performance will, of course, drop significantly. This is just one example in which age becomes a shortcut and we can see that the data are biased. Many different factors, some related to patient demographics and some related to the imaging systems, can become shortcuts or introduce bias.

The problem is further complicated by current machine learning practice. In current practice, we evaluate a model's generalizability on an independent holdout test set that has the same distribution as the training data, the so-called IID test set. We consider the model generalizable if it performs well on this IID test set, regardless of any potential shortcuts or bias. But in the context of medical AI, what we really care about is whether the model can be used in the prospective and external clinical cohorts where it may be deployed. The IID assumption quite often does not hold true in those circumstances, so we want the model to learn the desired solution that can generalize to different patient cohorts and different clinical sites.

So I have shown the wrong approach to data curation. What is the correct approach? How do we curate high-quality data for this project? Here I list a few quality assurance considerations that we summarized for this COVID classification task. The first is to make sure you collect data in their native DICOM format, because any pre-processing step may introduce bias or shortcuts. The second is to collect as much metadata as possible, including patient demographic information and imaging system information, so that we can check for any potential shortcuts or spurious correlations between the meta-information and the label. Then, for supervised learning, the accuracy of the label is crucial. To make sure each chest X-ray is labeled accurately, we use a short time window between the imaging study and the RT-PCR test, and we only label chest X-rays as positive or negative if there is an RT-PCR test result within that short time window. Finally, we want to make sure that both COVID-positive and COVID-negative cases are collected from the same hospitals and within the same time range; this is to mitigate shortcut learning. With these quality assurance procedures applied, here is an overview of the curated training data set.
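Here is a minimal pandas sketch of the time-window labeling rule just described: a chest X-ray is labeled positive or negative only if an RT-PCR result falls within a short window of the imaging study. The file names, column names, and the two-day window are assumptions; the talk specifies only a "short time window."

```python
import pandas as pd

WINDOW_DAYS = 2  # assumed value; the talk only says "short time window"

# Hypothetical inputs: one row per imaging study and one row per PCR test.
studies = pd.read_csv("cxr_studies.csv", parse_dates=["study_date"])  # columns: patient_id, study_date, ...
pcr = pd.read_csv("pcr_results.csv", parse_dates=["test_date"])       # columns: patient_id, test_date, result

def label_study(row: pd.Series) -> str:
    tests = pcr[pcr.patient_id == row.patient_id]
    in_window = tests[(tests.test_date - row.study_date).abs() <= pd.Timedelta(days=WINDOW_DAYS)]
    if in_window.empty:
        return "exclude"          # no nearby PCR result: leave this study unlabeled
    return "positive" if (in_window.result == "positive").any() else "negative"

studies["label"] = studies.apply(label_study, axis=1)
labeled = studies[studies.label != "exclude"]
```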
So we collected and curated data from Henry Ford Health in 2020. This training data set has roughly 17,000 chest X-rays from over 8,000 patients. To evaluate the model's generalizability, we need more than just the internal test set from the same institution, so we curated another data set from UW Health, and we also used two large public COVID chest X-ray data sets: the BIMCV data set, collected and curated in Spain, and the MIDRC data set, curated by the Medical Imaging and Data Resource Center. Overall, the model was evaluated using more than 25,000 chest X-rays from over 15,000 patients in different regions and different countries. Also note that the distribution of imaging vendors for these external test sets is quite different from that of the training data set, so these are not IID test sets.

With these data, we can answer some important questions. First, can a model trained using a high-quality data set from a single clinical site generalize to other sites and patient cohorts? And second, how do the model's performance and generalizability change with the data size? We used a standard DenseNet model architecture and applied three-stage transfer learning: the model was first trained using ImageNet images, then using a pre-COVID chest X-ray data set with 14 common disease labels (but no COVID label), and finally on the COVID data set. For each training data set, we trained five different models using five different train-validation splits, and the final prediction score is the ensemble average of the five scores, to reduce the variability of the prediction.

Here are the results. We can see that the performance metric, in terms of AUC, is very consistent across all the test sets, for the internal test set and for the three different external test sets. We know that AUC is a metric showing how well a model can distinguish between two classes in general, but it is not directly related to the final decision making until you choose a threshold, or cutoff value. Quite often, two test sets may have similar AUCs, but their optimal thresholds for reaching the same level of sensitivity or specificity may be very different. For this model, we applied a fixed threshold of 0.7 to all these test sets, and here are the resulting sensitivities and specificities; we can see that they are also pretty consistent. This can be further demonstrated by looking directly at the prediction score distributions of these data sets. All four data sets show similar score distributions: the majority of the COVID-positive cases have a higher score and the majority of the COVID-negative cases have a lower score, but both distributions have a long tail, which means the AI model cannot perfectly differentiate COVID-positive from COVID-negative using chest X-rays alone. This is because there is no defining feature of COVID pneumonia on chest X-rays that is 100 percent specific, which is true for both radiologists and AI models; it is inherent to this clinical problem.

So now we have shown that if the quality of the training data set is assured, we can train a model from a single clinical site and make it generalizable to other sites. But how much data do we need for the model to be generalizable? This is an important question, especially in a pandemic setting, where we want to provide a solution quickly instead of waiting years to collect enough data.
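As a rough illustration of the evaluation just described, here is a minimal sketch that ensemble-averages the scores from the five train/validation splits, computes the AUC, and reports sensitivity and specificity at the fixed cutoff of 0.7. The array shapes and function interface are assumptions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate(split_scores: np.ndarray, labels: np.ndarray, threshold: float = 0.7):
    """split_scores: (5, N) scores from the five models; labels: (N,) 0/1 ground truth."""
    scores = split_scores.mean(axis=0)               # ensemble average to reduce variability
    auc = roc_auc_score(labels, scores)
    pred = scores > threshold                        # same fixed cutoff applied to every test set
    sensitivity = np.sum(pred & (labels == 1)) / np.sum(labels == 1)
    specificity = np.sum(~pred & (labels == 0)) / np.sum(labels == 0)
    return auc, sensitivity, specificity
```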
So the question is, how do the model's performance and generalization change with the training data size? To study this problem, we sampled training data sets from the total training data set, ranging from a small set of only 100 patients up to 6,000 patients. For each data size, we randomly sampled 10 times to make sure we sampled different subsets of the data, and then we trained a model and evaluated its performance in terms of AUC. So for each size, we have 10 different AUC numbers, and we evaluate the mean and the standard deviation of the AUC. Here are the results. As the training data size increases, the performance on all four of these test sets also increases. We can actually fit the so-called learning curve of the model using this power function, and here are the fitting results. Of course, the goal of fitting a function is to make predictions, so how about we predict the case of 6,000 patients? As you can see here, the predicted AUC is almost identical to the AUC that was actually measured, which shows the accuracy of the learning curve.

Now let's look at these learning curves. We can see two things. One is that the performance keeps increasing as more data become available for model training; that is true for all of these test data sets. But we can also see that the benefit of adding more training data diminishes. For example, based on the curve from the UW Health site, boosting the performance from an AUC of 0.81 to 0.83 requires increasing the training data from 6,000 patients to almost 100,000 patients. However, we can also see from these results that even with a very small but good-quality data set of roughly 100 patients, we can make the model generalizable and obtain a pretty good baseline performance. The most important part is that this performance is consistent across the different test sets, which means the generalizability is good. In this sense, we can say that the generalization performance of the model does not really depend on the data size, but rather on the data quality. Finally, we actually entered this model in a major COVID-19 grand challenge, and we just received the news from the organizers that the model won the challenge without any further fine-tuning. This again shows that the key to a generalizable model is the quality of the data set.

To summarize: data quality is much more important than data size; a model trained using well-curated data, even from a single clinical site, can generalize to other sites; and even with a small training data set, we can already obtain a decent baseline model with good generalization performance. I want to mention that one limitation is that all these tests are retrospective. To evaluate an AI model's performance on real clinical cohorts, it is better to use prospective tests, which require the model to be implemented and deployed in the actual clinical environment. Dr. John Garrett will discuss these important topics next. Thank you for your attention.

All right, good morning, everybody. So this is sort of the last stage of the process that we're trying to walk through this morning. You've seen mechanisms to collect and generate really nicely curated data sets to train and develop AI models.
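Before moving on to the deployment talk, here is a rough sketch of the learning-curve fitting just described: AUC as a function of training-set size is fit with a power function and then extrapolated to a larger cohort. The specific functional form and the example data points (other than the quoted AUC of 0.81 at 6,000 patients) are assumptions, since the actual curve is shown only on the slide.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    """Learning curve: AUC approaches a as the patient count n grows (assumed form)."""
    return a - b * np.power(n, -c)

n_patients = np.array([100, 300, 1000, 3000, 6000], dtype=float)  # illustrative sizes
mean_auc = np.array([0.74, 0.77, 0.79, 0.80, 0.81])               # illustrative mean AUCs

params, _ = curve_fit(power_law, n_patients, mean_auc, p0=[0.85, 1.0, 0.5], maxfev=10000)
print("Predicted AUC at 100,000 patients:", power_law(100_000, *params))
```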
You've seen sort of what the process looks like to then develop and test those models. And once you have a good model, the questions that I'm hoping to help address are, how do you integrate that clinically? And that's sort of a technical question. How do you do that from the implementation side? And that's both what tools do you use, but also who's the audience and where does it need to live? But I think also, you know, we talked a lot about curating data for training these models. How do you select data on the fly in a clinical setting to sort of correctly find and identify the input data for the models? And then how do you monitor those in an ongoing way and keep track of whether they're doing what you expect? So just a couple of brief disclosures. I don't think I need to tell anybody at this meeting that AI is very exciting. It's really sort of just exploded in the last couple of years in radiology. This plot in the top right I always like to show is the number of citations in radiology journals referencing AI or deep learning over the years. And, you know, if you walk down into the commercial floor, you can see dozens and dozens of vendors as well. But I think that even today, even though this has been now a couple of years of this sort of really lively activity, there's a big gap, and we'll call it, you know, generously the valley of death here, between sort of the technical developments that are happening, models that are being developed and built, and the actual adoption of those clinically. And I think, at least to me, there's sort of three main reasons for this today. First and foremost, building models is hard, but implementing them in a clinical practice is also very difficult. And I think one of the challenges there is that there isn't a clearly defined single best way to do it. I think that in real time, finding sort of the relevant clinical data that you need and making sure that it's normalized and matches the data that are used to train these models can be very difficult. We've seen all of the effort that goes into developing these beautiful data sets for training. How do you do that on the fly? And then finally, you know, once these models are deployed, how well are they sort of carrying on and doing what you expect? So just a brief outline for the rest of the talk. I'm going to try and address some ways to sort of mitigate those different issues. So how do you, on the fly, pick the right data? What are some of the different mechanisms? And I'll spend probably the most time there, but what are some of the mechanisms you can use to actually deploy these tools in practice? And then finally, how do you sort of decide what to monitor and how to do that? So as we've seen, again, in the previous talks, you know, building these models typically depends on data that are highly curated. They've been refined and really sort of tuned. We've got, you know, lab values that we've filtered down to very specific dates and things like that. And in addition, typically when you're training a model, you have already whittled the data sets down. So when you're looking at chest CT, you're not getting a whole study. You're getting one series or one image or something like that. But in practice, of course, that's not the case. If you're a radiologist at a PACS workstation looking at a study, you have all of the images to look at. If you open up the patient's chart in the EMR, for the most part, you're going to just sort of see everything. 
And so deciding which specific elements are eligible can be difficult. In addition, data actually do vary in the real world. That can be deliberate, with protocol variations at different sites, different scanner manufacturers, and different hardware. And patients, of course, are highly variable as well; some have artifacts from motion or hardware and things like that. So this is a relatively simple example, and it's a carry-on from some of the earlier work that we've seen. If we have a task, hypothetically or concretely, to do COVID-19 detection on chest radiography, we can define a set of rules for that where we say: okay, we've done this training, we've built a model, and to run this model in practice we need chest X-rays. So that's radiography exams, which have different DICOM header fields and different things that we may care about. The exam should include the chest. The view position needs to be anterior-posterior or posterior-anterior. And we only trained this on adult patients, so we only want to include them. Even for us, it's sometimes difficult to remember to write all of this out, and often implied in these rules are things like: they should be primary images. Most of the time, we don't want to work on images that have already been post-processed or generated, although there are exceptions to that. And then the study type itself, which maps to CPT codes, or to multiple codes, is something to think about, because any chest X-ray study with multiple views may include an anterior-posterior view, but it may not.

Once you've got your study-level criteria picked out, you need to actually dive down into the series and pick those. And sorry, this text is a little bit small, but this is just meant to indicate a snippet of Python code that you might use. Once you have an X-ray exam, it may only have four or five images, but you still need about this much code just to implement the logic to, say, find the AP or PA views. And one of the things that hopefully is visible to the audience is that you can't necessarily depend on some of these DICOM fields to always be populated, even things that are supposed to be fairly standard. In our experience, counting on them to be there was not successful; they were sometimes absent. So you might look for things like the DICOM view position, but if that's not there, you have to have a fail-safe so you don't miss those studies. This logic is basically just saying: look for primary images, then look for AP or PA in the view position, and if it's not there, try the series description; and you may build additional cascades beyond that.

I'm going to give just another example, and this one is hypothetical, but it would apply just as easily to a real problem. We'll pretend that we've invented a tool that we built and trained to detect cardiomegaly on non-contrast chest CT studies, and we'll assume that what we used for training were non-contrast series, thin slices, standard reconstruction kernels, things like that. Our first attempt to deploy this in practice might be to say: great, we'll find a non-contrast chest CT exam, we'll grab the axial images, we'll run the tool, and we'll immediately impress all of our colleagues in the cardiovascular and chest sections.
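Here is a sketch of the kind of fallback selection logic described a moment ago for chest X-rays: keep primary images, check ViewPosition for AP or PA, and fall back to the series description when ViewPosition is missing. It is written against pydicom datasets and is illustrative rather than the exact code shown on the slide.

```python
from typing import List, Optional

import pydicom

def select_frontal_image(datasets: List[pydicom.Dataset]) -> Optional[pydicom.Dataset]:
    """Pick a primary AP/PA image, falling back to SeriesDescription when ViewPosition is absent."""
    for ds in datasets:
        image_type = [str(v).upper() for v in ds.get("ImageType", [])]
        if "PRIMARY" not in image_type:
            continue                                   # skip derived / post-processed images
        view = str(ds.get("ViewPosition", "")).upper()
        if view in ("AP", "PA"):
            return ds
        # Fail-safe: ViewPosition is often missing, so try the series description instead
        description = str(ds.get("SeriesDescription", "")).upper()
        if "AP" in description or "PA" in description:
            return ds                                  # a real workflow would add further cascades
    return None                                        # no eligible frontal view found
```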
But unfortunately for that chest CT plan, if we take a look at one of our standard, most routine protocols for non-contrast chest CT, we end up with five different series that meet those criteria at first glance. So it's not quite as straightforward as just saying, pick the axials and carry on. My experience, and my advice for this, is to think carefully up front and to start with your protocols. In general, these are going to range from very meticulous and well defined to a little bit more amorphous, but you're going to have a protocol team, which includes physicists, technologists, and of course clinical stakeholders. Start with them and say: okay, we've built this model, here are the data that we think we need to run it, where do we go in the protocols to find studies that are going to be eligible? That includes acquisition parameters and voxel dimensions, and I've left off all sorts of other things like contrast phase. Once you've done that, I strongly recommend not implementing it right away, but looking for studies in PACS, using some of the tools that Dr. Key mentioned, pulling them offline, and then testing your logic offline: if I take a study with all of the images that are in it, can I apply this logic and get the images that I want? Does it look like what I expect? One last little tidbit that we've also run into, and this is definitely learning from experience: don't count on series numbers in basically any circumstance, because even if you have the world's most perfect protocols, things are going to happen, patients are going to move, they're going to have a reaction, images are going to be retaken, and you may not be able to count on those numbers. So focus on things that are going to be reliable, like series descriptions.

Coming back to the chest CT example that I gave a minute ago: if we had, instead of jumping to the conclusion that we could grab the first axial series, gone to our protocol, we would have seen that our second reconstruction is the one that has our thin slices and uses the standard kernel. We can grab that one, we can look at our studies in PACS or offline and see that this is the series that is going to be named Thin ST, and then we can build rules to always find that series in practice and know that we're getting the right images for it.

So assuming you've found the right images, you still have to figure out where you want to deploy these tools. I'm certainly going to be leaving out some methods, but I'm going to try to highlight a couple of different ways to do this. Before even considering the mechanisms for implementing a tool, it's really important to think about a couple of things. First of all, what's the purpose of that tool? Imaging AI means a lot of different things to a lot of different people. There are things that I would call imaging-adjacent tasks that we might be thinking of; these could be predicting whether or not a patient will show up for their appointment, which is imaging-adjacent in that it impacts imaging but isn't necessarily using images as an input. There's image processing: things like reconstructions or bone subtraction.
There's study triage: this is something that typically behaves a little bit like a CAD tool, but the goal is not to make a diagnosis; it is to help prioritize studies and move them up a worklist so they can be interpreted more quickly. There are CAD applications, which are looking for timely diagnoses; maybe these are bleeds or a stroke finding or something like that, which triggers an urgent clinical response. And then there are also CAD-type tools doing things like opportunistic screening or flagging incidental findings, which may not have an emergent clinical follow-up but do need to be recorded and available. And then finally, one of the really exciting areas for AI applications, I think, is to help speed up laborious tasks, automate measurements, do things like pediatric bone age and so on. But one of the things I want to emphasize is that each of these different purposes may be using different input data and may also be targeting a very different audience on the outbound side.

The other two things to think about before you pick a mechanism to deploy these tools are the origin of the tool, because this, of course, is going to impact how you can deploy it, and the data it needs. At academic centers, certainly, there's a lot of development of exciting new models. Those typically are cutting-edge tools doing really novel stuff, but they're probably not built using standard APIs or toolkits from commercial vendors. They often will be implemented under an IRB, as a research application, or as local standard of care if classified as a non-standard or non-significant-risk device. There are also what I'm going to call freestanding commercial offerings: all of the various independent tools that run to do different imaging AI tasks. These could be dedicated to chest X-ray diagnosis or things like that. And then there are embedded offerings, which could be embedded within the EMR or within PACS. On the data side, we talked earlier quite a bit about how to collect different data, and I'm not going to be able to provide the same overview of the different systems, but we may be including or thinking about data that ranges from the actual images themselves, which typically are going to live in PACS or on the modalities, to other clinical data in the electronic medical record, report text, and, again, a whole slew of other things. You could have financial data or other information that might be useful in different types of models. So thinking about what types of data need to go into a model will help influence where it can be deployed.

For the next couple of slides, I'm going to walk through some specific mechanisms you can use to deploy these tools. The text slides are not going to be as fun, but I'll move on afterward to show some live examples of what these look like when you do implement them, so please bear with me. The first one is maybe the most straightforward if you think, okay, I'm a radiologist who's reading a study, where would I deploy AI? You might think first to try to embed that directly in PACS. In general, when deploying in PACS, or on a modality, which would be the other imaging-centric option, you're only going to have imaging available.
Some of the medical record data makes it there or is in the DICOM header, but it's going to be a subset of what's available, and typically this environment makes the most sense if you are doing something that's directly related to image interpretation. This has some positives, certainly. It's very close to the images, and these tools, whether PACS or modalities, already know how to work with complicated image data sets; there's no time or effort spent moving the data around, so that's definitely on the plus side. The negatives, as I alluded to already, are that there is going to be limited other data available, and compute may actually be a problem. You're not going to implement an incredibly computationally intensive algorithm on a portable X-ray unit today, because it doesn't have GPUs or anything like that available, so these environments may not be well suited to everything for those reasons.

Another option, which might not come intuitively to radiologists but would be very logical to a lot of other people in our health system, would be to take an imaging-based AI model and embed it directly in the electronic medical record. This obviously has the very big advantage of access to all of the patient data that's recorded, including lab values, patient histories, things like that. Whether those look as nice as the curated training data sets is a story for maybe another day, but all of the data are going to be present, which is particularly useful if you have non-radiologists who are going to act on the results. So if you have an emergency department that maybe gets a certain activity triggered from this, the EMR is a nice venue for that. Radiologists do have access to the EMR, of course; it's just typically not at the center of their view when they're interpreting studies. The last downside to the EMR that I'll mention, and this has certainly been our experience, is that electronic medical record companies generally don't work as much with imaging, so they may not be as familiar with complicated imaging studies, and they may not have the infrastructure to move big data sets around the same way, because they're really focused on text data and all of the EMR-centric stuff. Dr. Zhang mentioned earlier making sure that you're feeding your models DICOM data and doing the preprocessing at the inference level; that's something that wouldn't necessarily come naturally to an EMR company, which would say, well, why can't we just hand you a PNG and you can do your AI and we'll move on?

The next mechanism would be a so-called dedicated tool. This, by the way, is still hands-down the most common way these tools are deployed in practice today. It could be an on-premise or a cloud-based server, with a very specific data stream sending data to it, so chest X-rays or CTs or something like that. In general, these tools offer either a single task or a very small set of tasks that they do.
Historically, this would have been similar to a 3D workstation-type workflow. The pros of this are that it's typically pretty efficient compute-wise, because the vendor deploys a server that does exactly what it needs to do for the one or two algorithms that are running, and support is theoretically pretty easy because it's doing a single task: you know if your algorithm is broken, the server is down, something like that. That's also a con, because each of those servers requires you to stand it up, go through security reviews, build the server, maintain it, patch it, and all of those things, and it doesn't take very long for a health system or a radiology department to deploy 10, 20, 50, 100 of these tools, and 100 different servers is really not sustainable or supportable over time. Both on the IS side and on the PACS side, someone is going to have to set up senders and allow each of these devices to receive images or query-retrieve them, so the ongoing maintenance for this mechanism is fine if you have one or two, but if you have dozens or hundreds, it's just not feasible.

The last mechanism I'm going to talk about goes by a couple of different names, and there are certainly examples of it on the commercial floor you can go and see, but this is a so-called orchestration engine. This is typically a single system, or set of systems, that receives more or less all data that are eligible, so all radiology studies, streams of data from the EMR, things like that, and then the engine itself handles routing studies to different algorithms; chest X-rays, for example, would go to CV19-Net and be inferenced for COVID-19. There are commercially available versions of this, and some sites are also implementing these themselves. Maybe my bias here is clear, but I think this is the way of the future: support models and things like that are eventually going to be easier with platforms like this, because compute becomes something that can scale very easily, particularly in the cloud. They natively support fully or semi-automated workflows, where you're not manually saying, okay, for this study I need my technologist to send it to this algorithm; the engine is orchestrating that for you. And deployments, theoretically, can be much easier: you go through your security reviews and your deployment and you set up a single server or set of servers, and as algorithms or tools get added, you're not doing that every time. One of the main cons to this, at least as I see it today, is that these platforms, whether offered commercially or built in-house, typically can receive only a subset of data. They can't, for example, most of the time, do image reconstruction with proprietary data, because a third-party orchestrator is not going to be allowed to access that, and they will have specific APIs that have to be followed. And then, this has just been our experience, but while the theory of this being a really simple deployment is good, in practice we have found that legal and compliance entities still typically do want to look at the tools that are getting rolled out through this. So you can't say, well, we've got this orchestration platform, every tool we want to add on is just a freebie, we pay the licensing cost and it goes live.
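As a rough sketch of the routing idea behind such an orchestration engine, here is a minimal example in which simple rules decide which algorithm, if any, an inbound study should be sent to. The rule table, endpoint URLs, and matching criteria are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

import pydicom

@dataclass
class Route:
    name: str
    endpoint: str                                   # hypothetical inference endpoint
    matches: Callable[[pydicom.Dataset], bool]      # rule applied to the study header

ROUTES: List[Route] = [
    Route(
        name="cxr-covid",                           # e.g. a COVID-19 chest X-ray model
        endpoint="https://inference.example.org/cxr-covid",
        matches=lambda ds: ds.get("Modality", "") in ("CR", "DX")
        and "CHEST" in str(ds.get("StudyDescription", "")).upper(),
    ),
    Route(
        name="bone-age",
        endpoint="https://inference.example.org/bone-age",
        matches=lambda ds: ds.get("Modality", "") in ("CR", "DX")
        and "BONE AGE" in str(ds.get("StudyDescription", "")).upper(),
    ),
]

def route_study(ds: pydicom.Dataset) -> Optional[Route]:
    """Return the first route whose rule matches the study header, else None."""
    return next((route for route in ROUTES if route.matches(ds)), None)
```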
You do still have to think about security and compliance for each of those tools. All right. So just to give a couple of quick examples of what these might look like in the real world: at UW, we have tried over the last couple of years to build an orchestration engine like this. We call it our Intelligent PACS Router. Our purpose here is certainly not to commercialize this or anything like that, but we have a ton of really exciting work going on in the department developing AI algorithms, and it's very difficult for us to take those, implement them clinically, and see how worthwhile they are. So we built this to try to make it easy to do things like build a model, test it at the benchtop, and then bring it into clinical workflows really quickly. We have some rules on what is allowed to be implemented this way, but the idea is to make this as easy as possible, and this platform is live in our clinical PACS today. Just to give an example of what this can look like: for us, at least, we've got our PACS open, and there's an icon in the toolbar. This can actually be done fully automatically in the background, but that makes for a really bad demonstration, so the way this is set up, for demonstration purposes at least, is that you click that icon, which launches a web interface. This web interface shows icons for the algorithms that the orchestration engine thinks are eligible for that study type. In this case, we could do bone age inference, or send the study to a research PACS or de-identify it, and the engine then handles all the orchestration to get those images processed and sent back into PACS on the back end.

The other workflow I want to quickly touch on is EMR integration. The COVID-19 model that was presented earlier has actually been implemented in our electronic medical record at UW. One of the reasons this was exciting for us is that, even if it is something we may not use today, it gave us a path to do this, and the workflow is similar to what it would be if it were implemented in PACS. We have an order placed for a chest X-ray, and most of those images make it to PACS. They're grabbed and sent over; in our case, we use Epic, so they go to the Epic cloud, where the inference is performed. That generates a score that gets sent back, not to PACS, but directly into Epic, where it's available for different types of reporting. This is a screenshot of what that looks like. We have two different views in which we can present this. This is the radiologist-centric view, where each row is an exam, a single chest X-ray. One thing to point out is that this is incredibly fast. In this case, these are 17 studies that were performed immediately before this report ran. Fourteen of them have COVID scores, which is this column over here, and only one of them has a PCR test result, and this was at a time when we were still testing pretty much everybody who came to the ED, so this is easily beating out our lab values. The other way this type of data can be very useful is that you can look at a ward level or a patient level and track these scores over time. So you can see, for example, here is a patient whose COVID risk score is growing over two days, and even though they may have been admitted with a negative exam, you can track and see automatically whether this is somebody who is becoming positive. So one last quick comment on this.
I think governance is also really important. It's maybe not the most fun part of deploying these AI tools, but it's really important to think about, because that EMR example I just showed is something we were able to accomplish about a year ago, and we were really excited about it, but we're still not using it clinically, in large part because support and clinical governance for the EMR and the broader enterprise are complicated. On the other hand, our PACS router is something we built in-house; it touches a very small audience, and we can easily restrict access and things like that, so we've been able to spin algorithms up in it much more quickly.

In the last couple of minutes, I just want to quickly touch on monitoring and ongoing validation of models. This is going to be a little bit more abstract, but I think it's really important not only to work hard to get these models up and running in practice, but also to be really critical and see whether they are doing what you expect them to do over time, and to define up front the metrics that constitute success. There are a couple of simple metrics you can start to think about as you roll these out: how many studies are we processing with our AI tools? How many tools do we have in use? How often are the results getting engagement? But then there are also the more esoteric things, and these are ones that every organization is going to have to define for themselves: what's the impact of these tools? If they're supposed to save time, are they saving time? How accurate are they? How fair are the results? And things like that. These can be grouped in a couple of different ways; we like to think of them internally as metrics of quality, fairness, and performance. These are some example questions you might ask to address each of these points, but some of the things you can do to actually monitor them over time are to maintain measures of accuracy and performance, not just at snapshots in time but continuously, and to look at things like drift. We really like the idea, conceptually, of what we call model shadowing: if you have, let's say, a chest X-ray model that detects COVID, you can run a second one in the background as new technologies become available and compare them over time to see how well they're performing. I think fairness is a really important one, but a pretty hard thing to define. It helps to try to define up front the different cohorts that you think might end up being treated differently, maybe your emergency department or different patient backgrounds, but it's also helpful to take advantage of tools that can automatically identify, say, a group of patients who reliably have different results than everybody else, and then look at those critically and ask: is this a result of bias, or is it because there's a real finding in the model? And then performance monitoring is nontrivial too, especially as things move to the cloud, where you need to keep an eye on the compute cost and burden of these tools.
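As a rough sketch of this kind of ongoing monitoring, here is a minimal example that tracks a monthly AUC over time to watch for drift and compares the deployed model against a shadow model scoring the same studies. The log file, column names, and monthly grouping are assumptions.

```python
import pandas as pd
from sklearn.metrics import roc_auc_score

def monthly_auc(log: pd.DataFrame, score_col: str) -> pd.Series:
    """log columns: study_date, label (0/1 ground truth), plus one score column per model."""
    grouped = log.set_index("study_date").sort_index().groupby(pd.Grouper(freq="M"))
    return grouped.apply(
        lambda g: roc_auc_score(g["label"], g[score_col]) if g["label"].nunique() == 2 else float("nan")
    )

log = pd.read_csv("inference_log.csv", parse_dates=["study_date"])  # hypothetical inference log
drift = pd.DataFrame({
    "deployed": monthly_auc(log, "deployed_score"),   # the model in clinical use
    "shadow": monthly_auc(log, "shadow_score"),       # newer model run silently on the same studies
})
print(drift)
```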
So again, just to summarize sort of the monitoring sort of strategies, I think defining metrics that drive your outcomes is really important, capturing them over time, and then coming up with ways to sort of present those if you need to intervene, and then doing things like model shadowing can allow you to sort of make sure that you're always using the latest and greatest. So just to wrap up here really quickly, I think building models is really important and exciting, and it's difficult, but it's not the only challenge that we face today in this space. Identifying data is nontrivial and should be treated as a team effort. There's a lot of different ways to deploy these AI tools in practice, and thinking about it up front is going to be really useful. And then finally, monitoring these tools as they are in use will help you make sure that they're continuing to be worth the cost of support and licensing and things like that over time. So thank you for your attention.
Video Summary
This video explores the crucial role of high-quality data in developing AI models for clinical use, emphasizing its impact on patient care quality and safety. Starting with a discussion on preparing high-quality data, the talk covers the steps of planning, acquiring, and utilizing data, particularly in the context of chest X-ray-based COVID-19 diagnosis using AI. It stresses the importance of collaboration, data sharing, and using data tools such as RSNA-CTP for effective data management.

The second part highlights why high-quality data is pivotal for AI and its generalizability. It covers data quality assurance considerations, including maintaining native DICOM format, accurate labeling using time-window-restricted RT-PCR results, and ensuring equal sourcing of COVID-positive and COVID-negative data.

Finally, the discussion shifts to AI model deployment in clinical settings. It touches on technical aspects of deployment in PACS and EMR systems, considering factors like imaging data, EMR data integration, and orchestration engines. The importance of monitoring, defining performance metrics, ensuring fairness, and making continuous improvements to ensure AI models' clinical success is underscored, highlighting AI's potential to enhance imaging processes when carefully and thoughtfully integrated.
Keywords
high-quality data
AI models
clinical use
patient care
COVID-19 diagnosis
data management
DICOM format
AI deployment
EMR integration